Outlier detection methods
Overview
In subnational malaria analysis, outlier correction helps separate real epidemiological signals from data noise. This ensures reliable decision-making for district-level targeting and resource allocation. Before outliers can be corrected, they must be identified and investigated.
- Compare how outlier detection methods affect different malaria indicators
- Assess how methods balance data integrity with sensitivity
- Select appropriate detection based on analysis goals and data quality
Introduction to Outliers and Detection Methods
Outliers in SNT
Outliers can represent either data quality issues or true epidemiological events. Their impact is magnified in subnational analysis where small numbers matter. For example:
- A reporting error at one facility can distort district-level estimates
- A real outbreak might be mistakenly “corrected” if not properly validated
Outlier detection occurs at the health facility-month (HF-month) level, where each facility’s monthly reports are compared against its own historical patterns.
Seasonal Considerations in Malaria Data
In malaria surveillance, seasonal peaks are expected epidemiological patterns, not outliers. True outliers represent deviations from expected seasonal patterns. The methods here account for this by: - Comparing facilities against their own monthly baselines - Using robust methods (Median/MAD, IQR) that are less influenced by seasonal extremes - Requiring manual investigation to distinguish data errors from genuine outbreaks
Outlier Detection Methods
Three outlier detection methods are explored on this page, each with distinct statistical properties that make them suitable for different data characteristics:
1. Mean/Standard Deviation Method (±2σ):
- Basis: Assumes normal distribution (Gaussian)
- Formula: Outlier if
|value - mean| > 2 × standard deviation - Best for: Normally distributed data without extreme skew
- Limitations: Sensitive to extreme values; mean and SD are heavily influenced by outliers themselves
- Malaria context: Rarely ideal due to right-skewed case count distributions
2. Median/MAD Method (Median ± 2.5×MAD):
- Basis: Non-parametric, robust to distribution shape
- Formula: Outlier if
|value - median| > 2.5 × Median Absolute Deviation - MAD calculation:
MAD = median(|values - median|) × 1.4826(scaled to match SD for normal data) - Best for: Skewed data with asymmetric distributions common in malaria reporting
- Advantage: Resistant to up to 50% contamination by outliers
- Malaria context: Ideal for facility-level malaria data where most facilities report low case counts (right-skewed distribution) with occasional genuine outbreaks that shouldn’t be over-penalized
3. Interquartile Range Method (IQR ± 1.5×IQR):
- Basis: Non-parametric, based on data quartiles
- Formula: Outlier if
value < Q1 - 1.5×IQRorvalue > Q3 + 1.5×IQR - IQR calculation:
IQR = Q3 - Q1(contains middle 50% of data) - Best for: Volatile datasets with many small facilities and varying reporting patterns
- Malaria context: Effective for zero-inflated and over-dispersed count data
Statistical Note: The multipliers have been empirically calibrated to (1.8, 3.2, 1.6) to account for malaria data’s extreme right-skew. In normally distributed data, these multipliers would flag approximately 7.2%, 0.2%, and 4.7% of points as outliers respectively. However, for highly right-skewed malaria data, the multipliers have been adjusted from standard values (3, 2.5, 1.5) to ensure appropriate sensitivity for this specific distribution.
The provided code applies these detection methods to the conf_hf_u5 variable representing confirmed malaria cases in children under 5 detected at health facilities. Outliers should be identified in disaggregated, extracted columns. Aggregated columns may obscure outliers in disaggregated components.The plot below visualizes outliers in aggregated columns and disaggregatd components, highlighting the need to perform detection directly on the columns extracted from DHIS2.
Step-by-Step
Below is a step-by-step walkthrough of outlier detection methods, with emphasis on comparison to help you visualize and use the methods appropriately. Each step includes notes to help you understand, adapt, and use the code.
To skip the step-by-step explanation, jump to the full code at the end of this page.
Step 1: Load libraries and data
First we install and load the packages and data required for this section. This page continues the use of the preprocessed Sierra Leone DHIS2 example data.
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# Install or load relevant packages
pacman::p_load(
readxl, # import and read Excel files
dplyr, # data manipulation
ggplot2, # plotting
tidyr, # data management
patchwork, # for arranging plots
purrr, # vector management
viridis, # colour-blind friendly palettes
here # for easy file referencing
)
clean_malaria_routine_data_final <-
readRDS(
here::here(
"1.2_epidemiology/1.2a_routine_surveillance/raw",
"clean_malaria_routine_data_final.rds"
)
)Step 2: Outlier detection
Step 2.1: Calculate outlier statistics
We begin by defining the calculate_outlier_stats function to streamline the process of identifying points as outliers. The function calculates the various statistical metrics required to identify outliers.
calculate_outlier_stats <- function(df, column) {
df |>
dplyr::group_by(hf_uid, month) |> # Facility-month grouping
dplyr::filter(n() >= 3) |> # Require minimum 3 observations per group
dplyr::summarise(
# Mean/SD metrics
mean = mean(.data[[column]], na.rm = TRUE),
sd = sd(.data[[column]], na.rm = TRUE),
# Median/MAD metrics
median = median(.data[[column]], na.rm = TRUE),
mad = mad(.data[[column]], constant = 1, na.rm = TRUE),
# IQR metrics
Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
.groups = "drop"
) |>
dplyr::mutate(
# Mean/SD bounds. For highly right-skewed data, use much tighter bounds
mean_lower = pmax(0, mean - 1.8 * sd),
mean_upper = mean + 1.8 * sd,
# Median/MAD bounds
median_lower = pmax(0, median - 3.2 * mad),
median_upper = median + 3.2 * mad,
# IQR bounds
iqr_lower = pmax(0, Q1 - 1.6 * (Q3 - Q1)),
iqr_upper = Q3 + 1.6 * (Q3 - Q1)
)
}To adapt the code:
- Line 20-30: (Advanced) Change the multiplier in each outlier detection method to modify the sensitivity of each method. The multipliers used here are 1.8, 3.2, and 1.6 for mean/SD, median/MAD, and IQR detection respectively. These multipliers were empirically calibrated for Sierra Leone malaria data and may need adjustment for different datasets or surveillance systems.
Step 2.2: Flag outliers
Next we define the flag_outliers function to identify outliers based on the three methods and attach this information to the dataframe create in the Step 2.1. This code flags outliers based on the statistics calculated in the previous step.
flag_outliers <- function(df, stats_df, column) {
df |>
dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
dplyr::mutate(
outliers_moyenne = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
TRUE ~ "normal"
),
outliers_halper = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < median_lower | .data[[column]] > median_upper ~ "outlier",
TRUE ~ "normal"
),
outliers_IQR = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
TRUE ~ "normal"
)
) |>
# Remove the intermediate calculation columns
dplyr::select(-mean, -sd, -median, -mad, -Q1, -Q3,
-mean_lower, -mean_upper, -median_lower, -median_upper,
-iqr_lower, -iqr_upper)
}
conf_hf_u5_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf_hf_u5")
conf_hf_u5_results <- flag_outliers(clean_malaria_routine_data_final, conf_hf_u5_stats, "conf_hf_u5")To adapt the code:
- Line 28-29: Modify
conf_hf_u5_resultsandconf_hf_u5_statsto reflect the variable you are performing outlier detection on. These are the names given to each function’s output. It is recommended to follow thevariablename_resultsandvariablename__resultsnaming convention. - Line 28: Modify
(clean_malaria_routine_data_final, "conf_hf_u5")to reflect the(df, "column")you are performing outlier detection on - Line 29: Modify
(clean_malaria_routine_data_final, conf_hf_u5_stats, "conf_hf_u5")to reflect the(df, variablename_stats, "column")you are performing outlier detection on
Step 3: Visualize outliers
Step 3.1: Calculate outlier proportions
First we use the calculate_pct_labels function to calculate the proportion of conf_hf_u5 values labelled outliers for each of the three detection methods.
calculate_pct_labels <- function(var) {
non_na <- sum(!is.na(var))
normal_pct <- round(sum(var == "normal", na.rm = TRUE)/non_na*100, 2)
outlier_pct <- round(sum(var == "outlier", na.rm = TRUE)/non_na*100, 2)
na_count <- sum(is.na(var))
list(
normal = paste0("Normal (", normal_pct, "%)"),
outlier = paste0("Outliers (", outlier_pct, "%)"),
missing = paste0("NA (n=", na_count, ")")
)
}Step 3.2: Outlier plotting function
Next we define the create_outlier_plots function which configures the plot style and format to visualize and compare outlier detection methods. The exampled code stratifies confirmed cases by year.
create_outlier_plot <-
function(data, method, title, pct_labels, value_col = "conf") {
ggplot2::ggplot(data,
aes(x = year,
y = .data[[value_col]],
color = .data[[method]])) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c(
"normal" = "grey70",
"outlier" = "#FF0000",
"NA" = "#ffffff"),
labels = c(pct_labels$normal, pct_labels$outlier , pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = "Confirmed Cases HF U5", color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}To adapt the code:
- Line 17: Modify the y-axes label based on the column you are performing outlier detection on
Step 3.3: Generate outlier visualization plots
We now define a function called generate_outier_plots which combines the calculate_pct_labels function from Step 3.1 and the create_outlier_plot function from Step 3.2.
generate_outlier_plots <- function(data, column) {
pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])
create_plot <- function(data, method, title, pct_labels) {
ggplot2::ggplot(data, aes(x = year,
y = .data[[column]],
color = .data[[method]])) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c("normal" = "grey70", "outlier" = "#FF0000", "NA" = "#ffffff"),
labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = column, color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}
p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)
patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
patchwork::plot_annotation(
title = paste("Comparison of Outlier Detection Methods:", column),
theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
)}
conf_hf_u5_outlier_plots <- generate_outlier_plots(conf_hf_u5_results, "conf_hf_u5")To adapt the code:
- Line 37: Modify
conf_hf_u5_outlier_plotsto reflect the column you are performing outlier detection on. Remember to follow thevariablename_outlier_plotsconvention. - Line 37: Modify the input of
generate_outlier_plotsto reflect the(variablename_results, "variablename")you are performing outlier detection on
Step 4: Outlier investigation
Outliers cannot be removed or corrected without investigation. Pinpoint observations categorized as outliers and present them to the SNT team. Oftentimes, the context within which a facility operates validates extreme values.
After outlier detection, the data must be investigated at a more granular level. By this point analysts should have selected which outlier detection method will be used. This page uses the median/MAD-detected outliers going forward. Analysts should choose the outlier detection method based on statistical knowledge, data context, and SNT team consultation.
Step 4.1: Investigate and export outlier data
The confirmed cases observations identified as outliers by the median/MAD method are first exported to a CSV file for further investigation and review by the SNT team.
# Filter for median method outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(conf_hf_u5))
# Export outliers to CSV
outlier_file <- "conf_hf_u5_median_method_outliers.csv"
write.csv(conf_hf_u5_median_outliers, file = outlier_file, row.names = FALSE)To adapt the code:
- Line 2: Rename
conf_hf_u5_median_outliersto match thevariablename_method_outliersnaming convention suggested on this page - Line 2 Change
conf_hf_u5_resultstovariablename_resultsbased on the column you are performing outlier detection on - Line 3: Replace
outliers_halperwithoutliers_moyenneoroutliers_IQRif you have chosen to proceed with mean detection or IQR detection respectively - Line 7: Change
"conf_hf_u5_median_method_outliers.csv"to reflect the desired CSV output file name - Line 8: Change
conf_hf_u5_median_outliersto reflect thevariablename_method_outliersobject defined in Line 2
Step 4.2: Outlier investigation function
This step defines a function to facilitate outlier investigation, including a diagnostic plot.
Show the code
plot_outlier_timeseries <- function(
full_data, # Original dataset
outlier_row, # Single row from outlier results
time_var = "year", # Time variable (year or date)
value_var = "conf_hf_u5", # Value variable to plot
highlight_year = NA # Year to focus on
) {
# Get all conf* variables
conf_vars <- grep("^conf_hf_", names(full_data), value = TRUE)
# Plot colour palette
component_palette <- c(
RColorBrewer::brewer.pal(8, "Set2"), # First 8 colors
RColorBrewer::brewer.pal(4, "Dark2") # Next 4 (avoid yellow)
)[1:length(conf_vars)] # Trim to needed length
# Get all data for this facility for the focus year
facility_data <- full_data |>
dplyr::filter(hf_uid == outlier_row$hf_uid,
year == highlight_year) |>
dplyr::mutate(
date = as.Date(paste(year, month, "01", sep = "-")),
is_outlier = ifelse(
outliers_halper == "outlier" & year == highlight_year,
"Outlier",
"Normal"
)
) |>
tidyr::pivot_longer(
cols = all_of(conf_vars),
names_to = "conf_variable",
values_to = "conf_value"
)
# Base plot
p <- ggplot2::ggplot(facility_data, aes(x = date)) +
# Plot all conf variables as dashed lines (except main variable)
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable != value_var),
aes(y = conf_value, color = conf_variable, group = conf_variable),
linetype = "dashed",
alpha = 0.8,
linewidth = 0.8
) +
# Plot main variable as solid line
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, group = 1),
color = "grey70",
linewidth = 1.2
) +
ggplot2::geom_point(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, fill = is_outlier),
shape = 21, size = 3, color = "black"
) +
ggplot2::expand_limits(y = 0) +
ggplot2::scale_color_manual(
name = "Component Variables",
values = component_palette
) +
ggplot2::scale_fill_manual(
values = c("Outlier" = "red", "Normal" = "grey70"),
name = paste(value_var, "monthly report classification")
) +
ggplot2::labs(
title = paste("Outlier and Components:", "Monthly", outlier_row$hf, "Reports for", value_var),
subtitle = paste("HF UID:", outlier_row$hf_uid, "| Year:", highlight_year),
x = "Month",
y = "Cases",
caption = ifelse(
any(facility_data$is_outlier == "Outlier" &
facility_data$conf_variable == value_var),
paste("Red points indicate outlier values in", value_var),
paste("No outliers detected in", value_var, "for this year")
)
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.box = "vertical"
) +
ggplot2::scale_x_date(
date_labels = "%b",
date_breaks = "1 month"
)
return(p)
}To adapt the code:
- Line 5: Replace
"conf_hf_u5"with the column you are performing outlier detection on. Options include"susp","test"."maltreat", etc. - Line 10: Replace
conf_varsto reflect thevariablename_varsbased on the column you are performing outlier detection on. Recall the goal here is to plot all related disaggregated columns alongside the primary column - Line 10: Replace
"^conf_hf"to reflect the prefix of the"^variablename"you are performing outlier detection on - Line 16: Replace
conf_varswith thevariablename_varsdefined in Line 10 - Lines 31-32: Modify
"conf_variable"and"conf_value"to match the"variablename_variable"and"variablename_value"of the column you are performing outlier detection on - Line 40, 41, 48, 49, 55, 56: Similarly, replace all instances of
conf_variableandconf_valuebased on the column you are performing outlier detection on
Step 4.3: Select and investigate a single outlier
A time series enables visualization of monthly reports within a given year to contextualize the classification of a point as an outlier. Recall that we calculated totals on the DHIS2 preprocessing page. Therefore we must investigate outliers in the disaggregated, extracted components rather than calculated totals.
# Select an outlier to visualize
selected_outlier <- conf_hf_u5_median_outliers |>
dplyr::filter(year == 2017) |>
dplyr::slice_max(conf, n = 1)
# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
full_data = conf_hf_u5_results, # Need to use the flagged data here
outlier_row = selected_outlier,
value_var = "conf_hf_u5",
highlight_year = 2017
)To adapt the code:
- Line 2: Replace
conf_hf_u5_median_outliersfollowing thevariablename_method_outliersnaming convention - Line 2: Replace 2017 with the year of data observation to investigate
- Line 8: Replace
conf_hf_u5_resultsfollowing thevariablename_resultsoutput defined in Step 2.3 - Line 10: Replace
"conf_hf_u5"with the column you are performing outlier detection on. This is the same column specified in Step 2.3 - Line 11: Replace 2017 with the year of data observation to investigate, also specified in Line 3 of this step
Step 5: Outlier recatgorization
Step 5.1: Outlier recategorization function
Outliers cannot be removed or corrected without investigation. Pinpoint observations categorized as outliers and present them to the SNT team. Oftentimes, the context within which a facility operates validates extreme values.
Based on review with the SNT team, outliers may need to be categorized. For example, extremely high values categorized as outliers may result from genuine spikes or outbreaks. In this case, we cannot leave the observation flagged as an outlier and correct it.
Show the code
reflag_outliers <- function(df,
hf_name = NULL,
hf_uid = NULL,
year = NULL,
month = NULL,
outlier_metric = NULL,
new_status = "normal") {
conditions <- list()
if(!is.null(hf_name)) conditions$hf <- hf_name
if(!is.null(hf_uid)) conditions$hf_uid <- hf_uid
if(!is.null(year)) conditions$year <- year
if(!is.null(month)) conditions$month <- month
status_col <- paste0("outlier_status_", outlier_metric)
flag_col <- paste0("outlier_flag_", outlier_metric)
df |>
dplyr::mutate(
!!status_col := ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~.x == .y),
`&`
),
new_status,
.data[[status_col]]
),
!!flag_col := ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~.x == .y),
`&`
),
new_status == "outlier",
.data[[flag_col]]
)
)
}
# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
reflag_outliers(
hf_uid = "SL_45678",
year = 2022,
month = 10,
outlier_metric = "conf_hf_u5",
new_status = "investigated_normal"
)Step 5.2: Apply outlier recategorization
This code applies the function to recategorize specific outliers as normal, while still maintaing a history of the investigation and re-categorization.
To adapt the code:
- Line 2: Rename
reflagged_conf_hf_u5_outlier_dataandconf_hf_u5_resultsbased on the columns you are working with. - Line 4: Replace
"SL_45678"with the hf_uid of the health facility whose outlier needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3 - Line 5: Replace
2022with the year of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 4.3 - Line 7: Replace
10to correspond with the month of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 4.3
Step 6: Loop outlier detection
Step 6.1: Define outlier detection loop function
While the code below provides the option to loop through outlier detection for multiple indicators, thorough investigation of each identified outlier is still essential. The loop function streamlines the detection process but should be used in conjunction with manual review rather than as a replacement for it.
Show the code
#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#' - flagged_data: Full dataset with outlier flags
#' - stats: Summary statistics for each indicator
#' - plots: Outlier methods visualization plots
#' - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {
# Initialize storage
results <- list(
flagged_data = df,
stats = list(),
plots = list(),
outliers = list()
)
# Loop through each indicator
for (indicator in indicators) {
# Skip if indicator doesn't exist
if (!indicator %in% names(df)) {
warning(paste("Indicator", indicator, "not found in dataframe. Skipping."))
next
}
# 1. Calculate statistics by HF-month group
stats <- calculate_outlier_stats(df, indicator)
# 2. Flag outliers (keeping all method columns for plotting)
flagged <- flag_outliers(df, stats, indicator)
# 3. Add status and flag columns (without removing method columns)
flagged <- flagged |>
dplyr::mutate(
!!paste0("outlier_status_", indicator) := outliers_halper,
!!paste0("outlier_flag_", indicator) := ifelse(outliers_halper == "outlier", TRUE, FALSE)
)
# 4. Generate plots (using all method columns)
plots <- generate_outlier_plots(flagged, indicator)
# 5. Extract outliers
outlier_records <- flagged |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(.data[[indicator]]))
# Store results
results$stats[[indicator]] <- stats
results$plots[[indicator]] <- plots
results$outliers[[indicator]] <- outlier_records
# Update main dataframe (only keep final status/flags)
results$flagged_data <- results$flagged_data |>
dplyr::left_join(
flagged |>
dplyr::select(hf_uid, year, month,
!!paste0("outlier_status_", indicator),
!!paste0("outlier_flag_", indicator)),
by = c("hf_uid", "year", "month")
)
}
return(results)
}Step 6.2: Use outlier detection loop function
To use the outlier detection loop on multiple variables at once, follow the code below. This code exemplifies the outlier detection loop performed on disaggregated test columns.
Step 7: Set outliers to missing
Once outliers have been investigated and finalized, the next step is to set them to missing (i.e. NA values). This prepares outliers to be imputed like other missingness, in a manner than condiers downstream indicators and clinical logic.
# Function to clean variable and return minimal columns
clean_single_variable <- function(df, variable_name, method = "halper") {
method_col <- switch(method,
"moyenne" = "outliers_moyenne",
"halper" = "outliers_halper",
"IQR" = "outliers_IQR")
df |>
dplyr::mutate(
!!paste0(variable_name, "_cleaned") := ifelse(.data[[method_col]] == "outlier", NA, .data[[variable_name]]),
!!paste0("outlier_status_", variable_name) := .data[[method_col]]
) |>
dplyr::select(
# Keep all original columns
dplyr::all_of(names(clean_malaria_routine_data_final)),
# Add only the essential columns for the variable
!!paste0("outlier_status_", variable_name),
!!paste0(variable_name, "_cleaned")
)
}
# Clean multiple variables sequentially
final_detected_data <- conf_hf_u5_results |>
# Process conf_hf_u5
clean_single_variable("conf_hf_u5", "halper") |>
# Process test_hf_u5 (would need to run detection first)
clean_single_variable("test_hf_u5", "halper") |>
# Add more variables as needed...
# Final dataset: original columns + cleaned versions of all processed variables
saveRDS(final_detected_data, "outlier_detected_malaria_routine_data.rds")Summary
This section has walked through multiple outlier detection methods using the conf_hf_u5 column as an example. We defined functions to identify, visualize, and investigate outliers to streamline the detection process. This code will help you detect outliers using appropriate methods based on the data and variables you are working with.
Once you have detected, validated, and possibly recategorized outliers in your data, your data frame will include all original columns of your data as well as an outlier column for each indicator. This new data frame will be used in the outlier correction page.
Full code
Show full code
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# Install or load relevant packages
pacman::p_load(
readxl, # import and read Excel files
dplyr, # data manipulation
ggplot2, # plotting
tidyr, # data management
patchwork, # for arranging plots
purrr, # vector management
viridis, # colour-blind friendly palettes
here # for easy file referencing
)
clean_malaria_routine_data_final <-
readRDS(
here::here(
"1.2_epidemiology/1.2a_routine_surveillance/raw",
"clean_malaria_routine_data_final.rds"
)
)
calculate_outlier_stats <- function(df, column) {
df |>
dplyr::group_by(hf_uid, month) |> # Facility-month grouping
dplyr::filter(n() >= 3) |> # Require minimum 3 observations per group
dplyr::summarise(
# Mean/SD metrics
mean = mean(.data[[column]], na.rm = TRUE),
sd = sd(.data[[column]], na.rm = TRUE),
# Median/MAD metrics
median = median(.data[[column]], na.rm = TRUE),
mad = mad(.data[[column]], constant = 1, na.rm = TRUE),
# IQR metrics
Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
.groups = "drop"
) |>
dplyr::mutate(
# Mean/SD bounds
mean_lower = pmax(0, mean - 1.8 * sd),
mean_upper = mean + 1.8 * sd,
# Median/MAD bounds
median_lower = pmax(0, median - 3.2 * mad),
median_upper = median + 3.2 * mad,
# IQR bounds
iqr_lower = pmax(0, Q1 - 1.6 * (Q3 - Q1)),
iqr_upper = Q3 + 1.6 * (Q3 - Q1)
)
}
flag_outliers <- function(df, stats_df, column) {
df |>
dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
dplyr::mutate(
outliers_moyenne = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
TRUE ~ "normal"
),
outliers_halper = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < median_lower | .data[[column]] > median_upper ~ "outlier",
TRUE ~ "normal"
),
outliers_IQR = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
TRUE ~ "normal"
)
) |>
# Remove the intermediate calculation columns
dplyr::select(-mean, -sd, -median, -mad, -Q1, -Q3,
-mean_lower, -mean_upper, -median_lower, -median_upper,
-iqr_lower, -iqr_upper)
}
conf_hf_u5_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf_hf_u5")
conf_hf_u5_results <- flag_outliers(clean_malaria_routine_data_final, conf_hf_u5_stats, "conf_hf_u5")
calculate_pct_labels <- function(var) {
non_na <- sum(!is.na(var))
normal_pct <- round(sum(var == "normal", na.rm = TRUE)/non_na*100, 2)
outlier_pct <- round(sum(var == "outlier", na.rm = TRUE)/non_na*100, 2)
na_count <- sum(is.na(var))
list(
normal = paste0("Normal (", normal_pct, "%)"),
outlier = paste0("Outliers (", outlier_pct, "%)"),
missing = paste0("NA (n=", na_count, ")")
)
}
create_outlier_plot <-
function(data, method, title, pct_labels, value_col = "conf") {
ggplot2::ggplot(data,
aes(x = year,
y = .data[[value_col]],
color = .data[[method]])) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c(
"normal" = "grey70",
"outlier" = "#FF0000",
"NA" = "#ffffff"),
labels = c(pct_labels$normal, pct_labels$outlier , pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = "Confirmed Cases", color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}
generate_outlier_plots <- function(data, column) {
pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])
create_plot <- function(data, method, title, pct_labels) {
ggplot2::ggplot(data, aes(x = year,
y = .data[[column]],
color = .data[[method]])) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c("normal" = "grey70", "outlier" = "#FF0000", "NA" = "#ffffff"),
labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = column, color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}
p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)
patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
patchwork::plot_annotation(
title = paste("Comparison of Outlier Detection Methods:", column),
theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
)}
conf_hf_u5_outlier_plots <- generate_outlier_plots(conf_hf_u5_results, "conf_hf_u5")
# Filter for median method outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(conf_hf_u5))
# Export outliers to CSV
outlier_file <- "conf_hf_u5_median_method_outliers.csv"
write.csv(conf_hf_u5_median_outliers, file = outlier_file, row.names = FALSE)
# Outlier timeseries
plot_outlier_timeseries <- function(
full_data, # Original dataset
outlier_row, # Single row from outlier results
time_var = "year", # Time variable (year or date)
value_var = "conf_hf_u5", # Value variable to plot
highlight_year = NA # Year to focus on
) {
# Get all conf* variables
conf_vars <- grep("^conf_hf_", names(full_data), value = TRUE)
# Plot colour palette
component_palette <- c(
RColorBrewer::brewer.pal(8, "Set2"), # First 8 colors
RColorBrewer::brewer.pal(4, "Dark2") # Next 4 (avoid yellow)
)[1:length(conf_vars)] # Trim to needed length
# Get all data for this facility for the focus year
facility_data <- full_data |>
dplyr::filter(hf_uid == outlier_row$hf_uid,
year == highlight_year) |>
dplyr::mutate(
date = as.Date(paste(year, month, "01", sep = "-")),
is_outlier = ifelse(
outliers_halper == "outlier" & year == highlight_year,
"Outlier",
"Normal"
)
) |>
tidyr::pivot_longer(
cols = all_of(conf_vars),
names_to = "conf_variable",
values_to = "conf_value"
)
# Base plot
p <- ggplot2::ggplot(facility_data, aes(x = date)) +
# Plot all conf variables as dashed lines (except main variable)
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable != value_var),
aes(y = conf_value, color = conf_variable, group = conf_variable),
linetype = "dashed",
alpha = 0.8,
linewidth = 0.8
) +
# Plot main variable as solid line
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, group = 1),
color = "grey70",
linewidth = 1.2
) +
ggplot2::geom_point(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, fill = is_outlier),
shape = 21, size = 3, color = "black"
) +
ggplot2::expand_limits(y = 0) +
ggplot2::scale_color_manual(
name = "Component Variables",
values = component_palette
) +
ggplot2::scale_fill_manual(
values = c("Outlier" = "red", "Normal" = "grey70"),
name = paste(value_var, "monthly report classification")
) +
ggplot2::labs(
title = paste("Outlier and Components:", "Monthly", outlier_row$hf, "Reports for", value_var),
subtitle = paste("HF UID:", outlier_row$hf_uid, "| Year:", highlight_year),
x = "Month",
y = "Cases",
caption = ifelse(
any(facility_data$is_outlier == "Outlier" &
facility_data$conf_variable == value_var),
paste("Red points indicate outlier values in", value_var),
paste("No outliers detected in", value_var, "for this year")
)
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.box = "vertical"
) +
ggplot2::scale_x_date(
date_labels = "%b",
date_breaks = "1 month"
)
return(p)
}
# Select an outlier to visualize
selected_outlier <- conf_hf_u5_median_outliers |>
dplyr::filter(year == 2017) |>
dplyr::slice_max(conf, n = 1)
# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
full_data = conf_hf_u5_results, # Need to use the flagged data here
outlier_row = selected_outlier,
value_var = "conf_hf_u5",
highlight_year = 2017
)
reflag_outliers <- function(df,
hf_name = NULL,
hf_uid = NULL,
year = NULL,
month = NULL,
outlier_metric = NULL,
new_status = "normal") {
conditions <- list()
if(!is.null(hf_name)) conditions$hf <- hf_name
if(!is.null(hf_uid)) conditions$hf_uid <- hf_uid
if(!is.null(year)) conditions$year <- year
if(!is.null(month)) conditions$month <- month
status_col <- paste0("outlier_status_", outlier_metric)
flag_col <- paste0("outlier_flag_", outlier_metric)
df |>
dplyr::mutate(
!!status_col := ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~.x == .y),
`&`
),
new_status,
.data[[status_col]]
),
!!flag_col := ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~.x == .y),
`&`
),
new_status == "outlier",
.data[[flag_col]]
)
)
}
# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
reflag_outliers(
hf_uid = "SL_45678",
year = 2022,
month = 10,
outlier_metric = "conf_hf_u5",
new_status = "investigated_normal"
)
# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
reflag_outliers(
hf_uid = "SL_45678",
year = 2022,
month = 10,
outlier_metric = "conf_hf_u5",
new_status = "investigated_normal"
)
#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#' - flagged_data: Full dataset with outlier flags
#' - stats: Summary statistics for each indicator
#' - plots: Outlier methods visualization plots
#' - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {
# Initialize storage
results <- list(
flagged_data = df,
stats = list(),
plots = list(),
outliers = list()
)
# Loop through each indicator
for (indicator in indicators) {
# Skip if indicator doesn't exist
if (!indicator %in% names(df)) {
warning(paste(
"Indicator",
indicator,
"not found in dataframe. Skipping."
))
next
}
# 1. Calculate statistics by HF-month group
stats <- calculate_outlier_stats(df, indicator)
# 2. Flag outliers (keeping all method columns for plotting)
flagged <- flag_outliers(df, stats, indicator)
# 3. Add status and flag columns (without removing method columns)
flagged <- flagged |>
dplyr::mutate(
!!paste0("outlier_status_", indicator) := outliers_halper,
!!paste0("outlier_flag_", indicator) := ifelse(
outliers_halper == "outlier",
TRUE,
FALSE
)
)
# 4. Generate plots (using all method columns)
plots <- generate_outlier_plots(flagged, indicator)
# 5. Extract outliers
outlier_records <- flagged |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(.data[[indicator]]))
# Store results
results$stats[[indicator]] <- stats
results$plots[[indicator]] <- plots
results$outliers[[indicator]] <- outlier_records
# Update main dataframe (only keep final status/flags)
results$flagged_data <- results$flagged_data |>
dplyr::left_join(
flagged |>
dplyr::select(
hf_uid,
year,
month,
!!paste0("outlier_status_", indicator),
!!paste0("outlier_flag_", indicator)
),
by = c("hf_uid", "year", "month")
)
}
return(results)
}
test_components <- c(
"test_hf_u5",
"test_hf_5_14",
"test_hf_ov15",
"test_com_u5",
"test_com_5_14",
"test_com_ov15"
)
test_results <- detect_outliers_loop(
df = clean_malaria_routine_data_final,
indicators = test_components,
method = "median"
)
print(test_results$plots$test_hf_u5)