detect_outliers() evaluates a numeric indicator by administrative unit and
time using three methods (mean + SD, median + MAD, Tukey upper fence) to
identify unusually HIGH values only (potential outbreaks or data anomalies).
Low values are NOT flagged as outliers. It enforces reporting and sample-size
guardrails and supports strictness presets. Optional outbreak classification
distinguishes between isolated outliers and sustained outbreak patterns using
gap-aware clustering that can bridge short interruptions in the outbreak
signal. It returns per-method flags (optional), a consensus classification,
scale statistics, and metadata.
Usage
detect_outliers(
data,
column,
record_id = "record_id",
admin_level = c("adm1", "adm2"),
spatial_level = "hf_uid",
date = "date",
time_mode = c("across_time", "within_year", "seasonal"),
value_type = c("count", "rate"),
strictness = c("balanced", "lenient", "strict", "advanced"),
methods = c("iqr", "median", "mean", "consensus"),
sd_multiplier = 3,
mad_constant = 1.4826,
mad_multiplier = 9,
iqr_multiplier = 2,
consensus_rule = 3,
output_profile = c("standard", "lean", "audit"),
min_n = 8,
reporting_rate_col = NULL,
reporting_rate_min = 0.5,
key_indicators_hf = NULL,
classify_outbreaks = FALSE,
outbreak_min_run = 2,
outbreak_prop_tolerance = 0.9,
outbreak_max_gap = 12,
verbose = TRUE
)
Arguments
- data
Data frame containing the indicator to analyse.
- column
Name of the numeric column to evaluate.
- record_id
Unique record identifier column.
- admin_level
Character vector of administrative level columns for parallel grouping, ordered from the highest administrative level to the lowest (coarsest to finest). Defaults to c("adm1", "adm2").
- spatial_level
Character string specifying the finest spatial unit for analysis (e.g., "hf_uid" for facility-level). When specified, admin_level defines grouping boundaries while spatial_level defines the unit of analysis. This prevents excessive grouping while maintaining spatial granularity. Default is "hf_uid".
- date
Date column (Date, POSIXt, or parseable character string). Year, month, and yearmon are automatically derived from this column.
- time_mode
Pooling strategy: "across_time", "within_year", or "seasonal". Seasonal mode groups by month across all years (e.g., all Januaries together), useful for detecting values that are unusual for a specific month regardless of year.
- value_type
Indicator type: "count" or "rate". Counts floor lower bounds at 0.
- strictness
Strictness preset: "lenient", "balanced", "strict", or "advanced". Presets map to method multipliers. If not "advanced", any manual multipliers are ignored.
- methods
Character vector specifying which outlier detection methods to use: "iqr" (Interquartile Range), "median" (Median Absolute Deviation), "mean" (Mean +/- SD), and/or "consensus". Default is c("iqr", "median", "mean", "consensus"). For consensus, at least two other methods must be selected.
- sd_multiplier
Width (in SD units) for the mean method (used only when strictness = "advanced").
- mad_constant
Constant passed to stats::mad() in advanced mode (default 1.4826).
- mad_multiplier
Width multiplier for the MAD method (advanced mode).
- iqr_multiplier
Tukey fence multiplier for the IQR method (advanced mode).
- consensus_rule
Number of methods that must agree (1, 2, or 3) for the consensus flag to call an outlier. Default 3.
- output_profile
Controls the amount of detail returned: "lean" (minimal columns: id, admin, date, value, consensus flag, reason), "standard" (lean + per-method flags + bounds + seasonality mode), "audit" (all columns for full reproducibility). Default "standard".
- min_n
Minimum observations required in the active comparison bucket before flagging is attempted (applies to any seasonal bucket or fallback).
- reporting_rate_col
Optional column with reporting completeness in [0, 1].
- reporting_rate_min
Minimum acceptable reporting rate. Rows below the threshold receive reason = "low_reporting" and are not flagged.
- key_indicators_hf
Optional character vector of indicator names used to determine facility activeness. If supplied, the function uses a fast path to filter out inactive facility-months. A facility-month is considered active if ANY of the specified key indicators has a non-NA value. Inactive facility-months are excluded from outlier detection. If NULL (default), activeness filtering is skipped. Typical indicators include "allout", "test", or "conf". This adjustment prevents false positives caused by facilities that start or stop reporting mid-period.
- classify_outbreaks
Logical. When TRUE, applies outbreak classification to distinguish between isolated outliers and sustained outbreak patterns. Consecutive outliers meeting the outbreak criteria are reclassified from "outlier" to "outbreak". This is particularly useful for epidemiological surveillance to identify disease outbreak patterns. Default is FALSE (no outbreak classification).
- outbreak_min_run
Integer. Minimum number of consecutive outliers required to classify as an outbreak (default 2). Must be >= 2.
- outbreak_prop_tolerance
Numeric. Proportional tolerance for outbreak consistency (default 0.9). Values within this tolerance of the run median are considered consistent. Range: (0, 1).
- outbreak_max_gap
Integer. Maximum allowed gap (non-outlier months) between outliers that can still be considered part of the same outbreak (default 12). For example, with outbreak_max_gap = 1, the pattern "outlier-normal-outlier-outlier" is classified as one outbreak of length 3 rather than as separate incidents. Set to 0 for strict consecutive-only outbreaks. Useful for real-world data with reporting gaps.
- verbose
Logical. When TRUE (default), prints an informative summary showing which methods are being applied, the pooling strategy, strictness settings, guardrails, and consensus rule.
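The gap-aware clustering described for outbreak_min_run and outbreak_max_gap can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper cluster_outbreaks() is hypothetical, the input is a made-up logical vector of monthly outlier flags for one unit, and the outbreak_prop_tolerance check is left out.

```r
# Hypothetical helper (not part of the package): relabel outliers as
# "outbreak" when they form a cluster of at least `min_run` outliers,
# allowing up to `max_gap` non-outlier months between neighbours.
cluster_outbreaks <- function(is_outlier, min_run = 2, max_gap = 1) {
  label <- ifelse(is_outlier, "outlier", "normal")
  idx <- which(is_outlier)
  if (length(idx) == 0) return(label)

  # Number of non-outlier months between consecutive outliers
  gaps <- diff(idx) - 1
  # A new cluster starts whenever the gap exceeds max_gap
  cluster <- cumsum(c(TRUE, gaps > max_gap))
  # Cluster size = number of outliers in the cluster
  sizes <- tabulate(cluster)
  in_outbreak <- sizes[cluster] >= min_run

  label[idx[in_outbreak]] <- "outbreak"
  label
}

# Pattern from the text followed by an isolated spike:
# outlier, normal, outlier, outlier, normal, normal, normal, outlier
x <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

bridged <- cluster_outbreaks(x, max_gap = 1)
strict  <- cluster_outbreaks(x, max_gap = 0)
```

With max_gap = 1 the first three outliers form one outbreak and the last spike stays an isolated outlier; with max_gap = 0 only the two adjacent outliers are reclassified.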
Value
Tibble with outlier classifications and metadata. Columns include:
identifiers (record_id, admin levels, yearmon, year, derived
month), column_name, value, value_type, scale stats (mean,
sd, median, mad, q1, q3, iqr), method bounds, n_in_group,
guardrail reason, method flags (optional), outlier_flag_consensus,
strictness multipliers, and (if activeness filtering was applied)
activeness_applied and key_indicators_used.
Outlier flag categories are simplified to three intuitive groups:
- "normal": Values within expected statistical bounds
- "outlier": Values exceeding thresholds (potential anomalies/outbreaks)
- "insufficient_data": Various data quality issues preventing determination (consolidates insufficient_n, insufficient_evidence, unstable_scale, etc.)
Details
Workflow
- Validation & prep: confirm required columns, coerce target to numeric safely, derive month from yearmon.
- Activeness filtering (if key_indicators_hf is supplied): call classify_facility_activity() to tag inactive facilities. Inactive facility-months are excluded from detection and assigned reason = "inactive_facility".
- Strictness: presets map to (SD, MAD, IQR) multipliers; advanced mode honours manual multipliers. On across-time fallback the strictness shifts one step toward lenient.
- Guardrails: rows failing reporting_rate_min or min_n bypass flagging and record reason (low_reporting, insufficient_n, insufficient_data).
- Flagging: each method checks if values exceed the UPPER threshold only (e.g., mean + multiplier x SD). Low values are never flagged. Methods are suppressed when scales are unstable (sd, mad, or iqr equals zero).
- Consensus: final outlier_flag_consensus requires at least consensus_rule methods to agree over available (non-suppressed) methods.
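The consensus step can be sketched in base R as follows. This is a minimal illustration rather than the package's code: the flag matrix is made up, NA stands for a suppressed method, and the treatment of rows with fewer available methods than consensus_rule is an assumption.

```r
# Hypothetical per-method flags for four rows; NA marks a suppressed method
# (for example, a method whose scale statistic is zero)
flags <- cbind(
  mean   = c(TRUE,  FALSE, NA,   NA),
  median = c(TRUE,  TRUE,  TRUE, NA),
  iqr    = c(FALSE, FALSE, TRUE, NA)
)
consensus_rule <- 2

votes     <- rowSums(flags, na.rm = TRUE)  # methods that flagged the row
available <- rowSums(!is.na(flags))        # methods that were not suppressed

# Assumption: rows with fewer available methods than the rule are undecided
consensus <- ifelse(
  available < consensus_rule,
  "insufficient_data",
  ifelse(votes >= consensus_rule, "outlier", "normal")
)
```

Row 3 shows agreement counted over the available methods only: two of the two non-suppressed methods flag the value, so it meets a rule of 2 even though the mean method is missing.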
Facility activeness adjustment
When inactive or newly activated health facilities are included in aggregated
totals, apparent spikes or dips can occur that do not represent real
epidemiological changes. For example, if ten new facilities start reporting
in Matoto in mid-2022, the total number of confirmed cases rises even if
incidence per facility remains constant. To prevent such artefacts,
detect_outliers() uses a fast, optimized approach for activeness filtering.
When key_indicators_hf is specified, the function checks if each
facility-month has ANY non-NA values in the specified key indicators.
Only facility-months with at least one reported key indicator contribute to
the comparison pool for outlier detection. This lightweight approach avoids
the computational overhead of full activity classification while still
preventing false positives from inactive facilities.
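The "any non-NA key indicator" rule can be illustrated with a few lines of base R. The data frame below is invented for the example and the column names simply mirror the indicator names mentioned on this page; the sketch shows the rule itself, not how the package implements its fast path.

```r
# Hypothetical facility-month records (made-up values)
hf <- data.frame(
  hf_uid = c("A", "A", "B", "B"),
  date   = as.Date(c("2022-01-01", "2022-02-01", "2022-01-01", "2022-02-01")),
  allout = c(120, NA, NA, 80),
  test   = c(60,  NA, NA, NA),
  conf   = c(25,  NA, NA, 10)
)
key_indicators_hf <- c("allout", "test", "conf")

# A facility-month is active if ANY key indicator is non-NA
n_reported <- rowSums(!is.na(hf[, key_indicators_hf, drop = FALSE]))
hf$active <- unname(n_reported > 0)

# Only active facility-months enter the comparison pool
pool <- hf[hf$active, ]
```

Facility A in February and facility B in January report none of the key indicators, so they are left out of the pool.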
Presets (for high outliers only)
lenient: values > mean + 4.0 x SD, median + 12 x MAD, or Q3 + 3.0 x IQR
balanced: values > mean + 3.0 x SD, median + 9 x MAD, or Q3 + 2.0 x IQR
strict: values > mean + 2.5 x SD, median + 6 x MAD, or Q3 + 1.5 x IQR
advanced: use user-supplied multipliers
Note: Only values ABOVE the upper thresholds are flagged. Low values are always classified as "normal" regardless of how far below the mean/median.
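As a rough illustration of how a preset translates into upper bounds, the base R sketch below computes the three thresholds for a single comparison group using the balanced multipliers listed above. The vector cases and the helper upper_bounds() are hypothetical and not part of the package; quartiles use the default stats::quantile() definition, which may differ from what the function uses internally.

```r
# Hypothetical monthly counts for one comparison group (illustration only)
cases <- c(12, 15, 11, 14, 13, 16, 12, 15, 14, 13, 12, 95)

# Hypothetical helper: upper bounds for the three methods
# (defaults are the "balanced" multipliers: 3 x SD, 9 x MAD, 2 x IQR)
upper_bounds <- function(x, sd_mult = 3, mad_mult = 9, iqr_mult = 2) {
  q <- stats::quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  c(
    mean   = mean(x, na.rm = TRUE) + sd_mult * stats::sd(x, na.rm = TRUE),
    median = stats::median(x, na.rm = TRUE) +
      mad_mult * stats::mad(x, constant = 1.4826, na.rm = TRUE),
    iqr    = q[2] + iqr_mult * (q[2] - q[1])
  )
}

bounds <- upper_bounds(cases)

# One column per method; TRUE where the value exceeds that method's bound
flags <- sapply(bounds, function(b) cases > b)
```

In this made-up series only the final value (95) exceeds all three bounds. The mean + SD bound (about 91) sits far above the Tukey fence (21) because the spike itself inflates the mean and SD, which is one reason to combine methods in a consensus.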
The returned tibble always contains identifiers, scale statistics, bounds,
strictness metadata, and the guardrail reason. When output_profile = "standard" or "audit", method-specific flags are included alongside
the consensus.
Examples
if (FALSE) { # \dontrun{
# NOTE: Only HIGH values are flagged as outliers (e.g., disease outbreaks).
# Low values are always considered "normal".
# 1) Minimal consensus output at adm1-only level
detect_outliers(
data = malaria_data,
column = "confirmed_cases",
date = "date",
record_id = "facility_id",
admin_level = c("adm1"), # ignore adm2 if not present
time_mode = "across_time",
output_profile = "lean"
)
# 2) Within-year comparison
detect_outliers(
data = malaria_data,
column = "confirmed_cases",
date = "date",
admin_level = c("adm1"),
time_mode = "within_year",
consensus_rule = 2,
output_profile = "audit"
)
# 3) Advanced overrides for rates
detect_outliers(
data = malaria_data,
column = "positivity_rate",
date = "date",
value_type = "rate",
strictness = "advanced",
sd_multiplier = 2.5,
mad_multiplier = 7,
iqr_multiplier = 1.8,
min_n = 10,
output_profile = "audit"
)
# 4) With facility activeness filtering
detect_outliers(
data = malaria_data,
column = "conf",
date = "date",
admin_level = c("adm1", "adm2"),
time_mode = "across_time",
key_indicators_hf = c("allout", "test", "conf")
)
# 5) Seasonal comparison (same month across years)
# Compares all Januaries together, all Februaries together, etc.
# Useful for detecting values unusual for a specific month
detect_outliers(
data = malaria_data,
column = "conf",
date = "date",
admin_level = c("adm1", "adm2"),
time_mode = "seasonal"
)
} # }
