Outlier detection methods
Overview
In subnational malaria analysis, outlier correction helps separate real epidemiological signals from data noise. This ensures reliable decision-making for district-level targeting and resource allocation. Before outliers can be corrected, they must be identified and investigated.
- Compare how outlier detection methods affect different malaria indicators
- Assess how methods balance data integrity with sensitivity
- Select appropriate detection based on analysis goals and data quality
Introduction to Outliers and Detection Methods
Outliers in SNT
Outliers can represent either data quality issues or true epidemiological events. Their impact is magnified in subnational analysis where small numbers matter. For example:
- A reporting error at one facility can distort district-level estimates
- A real outbreak might be mistakenly “corrected” if not properly validated
Outlier Detection Methods
Three outlier detection methods are explored on this page, each of which varies in robustness and sensitivity:
- Mean/standard deviation method: Best for normally distributed data without extreme values
- Median/MAD method: More robust for skewed data common in malaria reporting
- Interquartile range (IQR) method: Effective for volatile datasets with many small facilities
The provided code first applies detection methods to the conf variable representing confirmed malaria cases. Outliers in an aggregated total may indicate outliers in one or more of its disaggregated components. The plot below exemplifies this case. Investigating outliers in aggregated columns helps analysts identify which disaggregated columns require attention.
Step-by-Step
Below is a step-by-step walkthrough of outlier detection methods, with emphasis on comparison to help you visualize and use the methods appropriately. Each step includes notes to help you understand, adapt, and use the code.
To skip the step-by-step explanation, jump to the full code at the end of this page.
Step 1: Load libraries and data
First we install and load the packages and data required for this section. This page continues the use of the preprocessed Sierra Leone DHIS2 example data.
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# Install or load relevant packages
pacman::p_load(
readxl, # import and read Excel files
dplyr, # data manipulation
ggplot2, # plotting
tidyr, # data management
patchwork, # for arranging plots
purrr, # vector management
viridis, # colour-blind friendly palettes
here # for easy file referencing
)
clean_malaria_routine_data_final <-
readRDS(
here::here(
"1.2_epidemiology/1.2a_routine_surveillance/raw",
"clean_malaria_routine_data_final.rds"
)
)Step 2: Outlier detection
Step 2.1: Calculate outlier statistics
We begin by defining the calculate_outlier_stats function to streamline the process of identifying points as outliers. The function calculates the various statistical metrics required to identify outliers.
calculate_outlier_stats <- function(df, column) {
df |>
dplyr::group_by(hf_uid, month) |> # Facility-month grouping
dplyr::filter(n() >= 3) |> # Require minimum 3 observations per group
dplyr::summarise(
# Mean/SD metrics
mean = mean(.data[[column]], na.rm = TRUE),
sd = sd(.data[[column]], na.rm = TRUE),
# Median/MAD metrics
median = median(.data[[column]], na.rm = TRUE),
mad = mad(.data[[column]], constant = 1, na.rm = TRUE),
# IQR metrics
Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
.groups = "drop"
) |>
dplyr::mutate(
# Mean/SD bounds
mean_lower = pmax(0, mean - 3 * sd),
mean_upper = mean + 3 * sd,
# Median/MAD bounds
median_lower = pmax(0, median - 2.5 * mad),
median_upper = median + 2.5 * mad,
# IQR bounds
iqr_lower = pmax(0, Q1 - 1.5 * (Q3 - Q1)),
iqr_upper = Q3 + 1.5 * (Q3 - Q1)
)
}To adapt the code:
- Line 20-30: (Advanced) Change the multiplier in each outlier detection method to modify the sensitivity of each method. The multipliers used here are 3, 2.5, and 1.5 for mean/SD, mean/median, and IQR detection respectively. These multipliers were selected based on statistical conventions.
Step 2.2: Flag outliers
Next we define the flag_outliers function to identify outliers based on the three methods and attach this information to the dataframe create in the Step 2.1. This code flags outliers based on the statistics calculated in the previous step.
flag_outliers <- function(df, stats_df, column) {
df |>
dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
dplyr::mutate(
outliers_moyenne = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
TRUE ~ "normal"
),
outliers_halper = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < median_lower | .data[[column]] > median_upper ~ "outlier",
TRUE ~ "normal"
),
outliers_IQR = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
TRUE ~ "normal"
)
) |>
# Remove the intermediate calculation columns
dplyr::select(-mean, -sd, -median, -mad, -Q1, -Q3,
-mean_lower, -mean_upper, -median_lower, -median_upper,
-iqr_lower, -iqr_upper)
}
conf_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf")
conf_results <- flag_outliers(clean_malaria_routine_data_final, conf_stats, "conf")To adapt the code:
- Line 28-29: Modify
conf_resultsandconf_statsto reflect the variable you are performing outlier detection on. These are the names given to each function’s output. It is recommended to follow thevariablename_resultsandvariablename_resultsnaming convention. - Line 28: Modify
(clean_malaria_routine_data_final, "conf")to reflect the(df, "column")you are performing outlier detection on - Line 29: Modify
(clean_malaria_routine_data_final, conf_stats, "conf")to reflect the(df, variablename_stats, "column")you are performing outlier detection on
Step 3: Visualize outliers
Step 3.1: Calculate outlier proportions
First we use the calculate_pct_labels function to calculate the proportion of conf values labelled outliers for each of the three detection methods.
calculate_pct_labels <- function(var) {
non_na <- sum(!is.na(var))
normal_pct <- round(sum(var == "normal", na.rm = TRUE)/non_na*100, 2)
outlier_pct <- round(sum(var == "outlier", na.rm = TRUE)/non_na*100, 2)
na_count <- sum(is.na(var))
list(
normal = paste0("Normal (", normal_pct, "%)"),
outlier = paste0("Outliers (", outlier_pct, "%)"),
missing = paste0("NA (n=", na_count, ")")
)
}Step 3.2: Outlier plotting function
Next we define the create_outlier_plots function which configures the plot style and format to visualize and compare outlier detection methods. The exampled code stratifies confirmed cases by year.
create_outlier_plot <-
function(data, method, title, pct_labels, value_col = "conf") {
ggplot2::ggplot(data,
aes(x = year,
y = .data[[value_col]],
color = .data[[method]])) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c(
"normal" = "grey70",
"outlier" = "#FF0000",
"NA" = "#ffffff"),
labels = c(pct_labels$normal, pct_labels$outlier , pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = "Confirmed Cases", color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}To adapt the code:
- Line 17: Modify the y-axes label based on the column you are performing outlier detection on
Step 3.3: Generate outlier visualization plots
We now define a function called generate_outier_plots which combines the calculate_pct_labels function from Step 3.1 and the create_outlier_plot function from Step 3.2.
generate_outlier_plots <- function(data, column) {
pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])
create_plot <- function(data, method, title, pct_labels) {
ggplot2::ggplot(data, aes(x = year,
y = .data[[column]],
color = .data[[method]])) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c("normal" = "grey70", "outlier" = "#FF0000", "NA" = "#ffffff"),
labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = column, color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}
p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)
patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
patchwork::plot_annotation(
title = paste("Comparison of Outlier Detection Methods:", column),
theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
)}
conf_outlier_plots <- generate_outlier_plots(conf_results, "conf")To adapt the code:
- Line 37: Modify
conf_outlier_plotsto reflect the column you are performing outlier detection on. Remember to follow thevariablename_outlier_plotsconvention. - Line 37: Modify the input of
generate_outlier_plotsto reflect the(variablename_results, "variablename")you are performing outlier detection on
Step 4: Outlier investigation
Outliers cannot be removed or corrected without investigation. Pinpoint observations categorized as outliers and present them to the SNT team. Oftentimes, the context within which a facility operates validates extreme values.
After outlier detection, the data must be investigated at a more granular level. By this point analysts should have selected which outlier detection method will be used. This page uses the median/MAD-detected outliers going forward. Analysts should choose the outlier detection method based on statistical knowledge, data context, and SNT team consultation.
Step 4.1: Investigate and export outlier data
The confirmed cases observations identified as outliers by the median/MAD method are first exported to a CSV file for further investigation and review by the SNT team.
To adapt the code:
- Line 2: Rename
conf_median_outliersto match thevariablename_method_outliersnaming convention suggested on this page - Line 2 Change
conf_resultstovariablename_resultsbased on the column you are performing outlier detection on - Line 3: Replace
outliers_halperwithoutliers_moyenneoroutliers_IQRif you have chosen to proceed with mean detection or IQR detection respectively - Line 7: Change
"conf_median_method_outliers.csv"to reflect the desired CSV output file name - Line 8: Change
conf_median_outliersto reflect thevariablename_method_outliersobject defined in Line 2
Step 4.2: Outlier investigation function
This step defines a function to facilitate outlier investigation, including a diagnostic plot.
Show the code
plot_outlier_timeseries <- function(
full_data, # Original dataset
outlier_row, # Single row from outlier results
time_var = "year", # Time variable (year or date)
value_var = "conf", # Value variable to plot
highlight_year = NA # Year to focus on
) {
# Get all conf* variables
conf_vars <- grep("^conf", names(full_data), value = TRUE)
# Plot colour palette
component_palette <- c(
RColorBrewer::brewer.pal(8, "Set2"), # First 8 colors
RColorBrewer::brewer.pal(4, "Dark2") # Next 4 (avoid yellow)
)[1:length(conf_vars)] # Trim to needed length
# Get all data for this facility for the focus year
facility_data <- full_data |>
dplyr::filter(hf_uid == outlier_row$hf_uid,
year == highlight_year) |>
dplyr::mutate(
date = as.Date(paste(year, month, "01", sep = "-")),
is_outlier = ifelse(
outliers_halper == "outlier" & year == highlight_year,
"Outlier",
"Normal"
)
) |>
tidyr::pivot_longer(
cols = all_of(conf_vars),
names_to = "conf_variable",
values_to = "conf_value"
)
# Base plot
p <- ggplot2::ggplot(facility_data, aes(x = date)) +
# Plot all conf variables as dashed lines (except main variable)
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable != value_var),
aes(y = conf_value, color = conf_variable, group = conf_variable),
linetype = "dashed",
alpha = 0.8,
linewidth = 0.8
) +
# Plot main variable as solid line
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, group = 1),
color = "grey70",
linewidth = 1.2
) +
ggplot2::geom_point(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, fill = is_outlier),
shape = 21, size = 3, color = "black"
) +
ggplot2::expand_limits(y = 0) +
ggplot2::scale_color_manual(
name = "Component Variables",
values = component_palette
) +
ggplot2::scale_fill_manual(
values = c("Outlier" = "red", "Normal" = "grey70"),
name = paste(value_var, "monthly report classification")
) +
ggplot2::labs(
title = paste("Outlier and Components:", "Monthly", outlier_row$hf, "Reports for", value_var),
subtitle = paste("HF UID:", outlier_row$hf_uid, "| Year:", highlight_year),
x = "Month",
y = "Cases",
caption = ifelse(
any(facility_data$is_outlier == "Outlier" &
facility_data$conf_variable == value_var),
paste("Red points indicate outlier values in", value_var),
paste("No outliers detected in", value_var, "for this year")
)
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.box = "vertical"
) +
ggplot2::scale_x_date(
date_labels = "%b",
date_breaks = "1 month"
)
return(p)
}To adapt the code:
- Line 5: Replace
"conf"with the column you are performing outlier detection on. Options include"susp","test"."maltreat", etc. - Line 10: Replace
conf_varsto reflect thevariablename_varsbased on the column you are performing outlier detection on - Line 10: Replace
"^conf"to reflect the"^variablename"of the column you are performing outlier detection on - Line 16: Replace
conf_varswith thevariablename_varsdefined in Line 10 - Lines 31-32: Modify
"conf_variable"and"conf_value"to match the"variablename_variable"and"variablename_value"of the column you are performing outlier detection on - Line 40, 41, 48, 49, 55, 56: Similarly, replace all instances of
conf_variableandconf_valuebased on the column you are performing outlier detection on
Step 4.3: Select and investigate a single outlier
A time series enables visualization of monthly reports within a given year to contextualize the classification of a point as an outlier. Recall that we calculated totals on the DHIS2 preprocessing page. This means that an outlier identified in the aggregated confirmed cases column conf may be indiciative of an outlier in one of it’s disaggregated components. Here we plot both the aggregated column and its components to determine the driving force of outliers in the aggregated column.
# Select an outlier to visualize
selected_outlier <- conf_median_outliers |>
dplyr::filter(year == 2017) |>
dplyr::slice_max(conf, n = 1)
# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
full_data = conf_results, # Need to use the flagged data here
outlier_row = selected_outlier,
value_var = "conf",
highlight_year = 2017
)To adapt the code:
- Line 2: Replace
conf_median_outliersfollowing thevariablename_method_outliersnaming convention - Line 2: Replace 2017 with the year of data observation to investigate
- Line 8: Replace
conf_resultsfollowing thevariablename_resultsoutput defined in Step 2.3 - Line 10: Replace
"conf"with the column you are performing outlier detection on. This is the same column specified in Step 2.3 - Line 11: Replace 2017 with the year of data observation to investigate, also specified in Line 3 of this step
Step 5: Component outlier detection
Step 5.1: Detect component outliers
Given that the aggregated column outlier investigation indicates possible outliers in the conf_hf_u5 column, the next step is to perform outlier detection in this column. We can reuse the outlier detection function defined in Step 2. These outliers can be visualized in the same way as Step 3.
To adapt the code:
- Line 1, 3: Replace
clean_malaria_routine_data_finalwith the name of your data frame - Line 1, 3, 5, 7: Replace all instances of
conf_hf_u5_stats,conf_hf_u5_results, andconf_hf_u5_outlier_plotsbased on the disaggregated column you are performing outlier detection on. Remember that thevariablename_stats,variablename_results, andvariablename_outlier_plotsis strongly recommended here
Step 5.2: Filter and export median component outliers
Since the outlier detection function includes outliers detected by all three methods, here we filter to only median/MAD-detected outliers. This file is then exported so it can be shared for review by the SNT team.
# Filter for median outlier detection column
conf_hf_u5_results <- conf_hf_u5_results |>
dplyr::mutate(
outlier_status_conf_u5 = outliers_halper,
outlier_flag_conf_u5 = ifelse(outliers_halper == "outlier", TRUE, FALSE)
) |>
dplyr::select(-outliers_moyenne, -outliers_IQR) # Remove unused method columns
# Filter for points detected as outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(conf_hf_u5))
# Export outliers to CSV
outlier_file <- here::here(
"english/data_r/routine_cases/conf_hf_u5_median_method_outliers.csv"
)
write.csv(conf_hf_u5_median_outliers, file = outlier_file, row.names = FALSE)To adapt the code:
- Line 1, 4, 5, 9, 11, 14, 15: Replace all instances of
conf_hf_u5_resultsbased on the disaggregated column you are performing outlier detection on. Remember that thevariablename_resultsis strongly recommended here - Line 4, 5, 10: Replace
outliers_halperwithoutliers_moyenneoroutliers_IQRif you have chosen to proceed with mean detection or IQR detection respectively - Line 6: Replace
(-outliers_moyenne, -outliers_IQR)with the two outlier methods you are not using for detection - Line 9, 15: Rename
conf_hf_u5_median_outliersto match thevariablename_method_outliersnaming convention suggested on this page - Line 7: Change
"conf_hf_u5_median_method_outliers.csv"to reflect the desired CSV output file name
Step 5.3: Component outlier investigation
From the aggregated column outlier investigation, we found outlier in monthly reports for facility hf_1756 on 2017. We can use this criteria to guide our investigation of the component column outliers.
# Select an outlier to visualize
selected_outlier_conf_hf_u5 <- conf_hf_u5_results |>
dplyr::filter(
hf_uid == "hf_0046", # Replace with known UID from your conf outlier
year == 2017,
)
# Generate plot
outlier_timeseries_conf_hf_u5_2017 <- plot_outlier_timeseries(
full_data = conf_hf_u5_results,
outlier_row = selected_outlier_conf_hf_u5,
value_var = "conf_hf_u5",
highlight_year = 2017
)To adapt the code:
- Line 2: Replace
selected_outlier_conf_hf_u5following theselected_outlier_variablenamenaming convention - Line 2: Replace
conf_hf_u5_resultswithvariablename_resultsbased on the disaggregated column you are performing outlier detection on - Line 4: Replace the alphanumeric hf_uid with that of the facility of interest. The examples take the hf_uid directly from the Step 4.3 output
- Line 5: Replace 2017 with the year of data observation to investigate
- Line 9: Replace
outlier_timeseries_conf_hf_u5_2017following theoutlier_timeseries_variablename_yearconvention used here - Line 10: Replace
conf_hf_u5_resultswith thevariablename_resultsbased on the disaggregated column you are performing outlier detection on - Line 11: Replace
selected_outlier_conf_hf_u5with theselected_outlier_variablenamedefined in Line 2 of this step - Line 12: Replace
"conf_hf_u5"with the disaggregated column you are performing outlier detection on. - Line 13: Replace 2017 with the year of data observation to investigate, also specified in Line 5 of this step
Outliers cannot be removed or corrected without investigation. Pinpoint observations categorized as outliers and present them to the SNT team. Oftentimes, the context within which a facility operates validates extreme values.
Step 5.5: Recategorize component outliers
Based on review with the SNT team, outliers may need to categorized. For example, extremely high values categorized as outliers may result from genuine spikes or outbreaks. In this case, we cannot leave the observation flagged as an outlier and correct it. This code recategorizes such points as normal, while still maintaing a history of the investigation and re-categorization.
Show the code
reflag_outliers <- function(df,
hf_name = NULL,
hf_uid = NULL,
year = NULL,
month = NULL,
outlier_metric = NULL,
new_status = "normal") {
conditions <- list()
if(!is.null(hf_name)) conditions$hf <- hf_name
if(!is.null(hf_uid)) conditions$hf_uid <- hf_uid
if(!is.null(year)) conditions$year <- year
if(!is.null(month)) conditions$month <- month
status_col <- paste0("outlier_status_", outlier_metric)
flag_col <- paste0("outlier_flag_", outlier_metric)
df |>
dplyr::mutate(
!!status_col := ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~.x == .y),
`&`
),
new_status,
.data[[status_col]]
),
!!flag_col := ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~.x == .y),
`&`
),
new_status == "outlier",
.data[[flag_col]]
)
)
}
# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
reflag_outliers(
hf_uid = "SL_45678",
year = 2022,
month = 10,
outlier_metric = "conf_hf_u5",
new_status = "investigated_normal"
)To adapt the code:
- Line 43: Rename
reflagged_conf_hf_u5_outlier_dataandconf_hf_u5_resultsbased on the columns you are working with. - Line 45: Replace
"SL_45678"with the hf_uid of the health facility whose outlier needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3 - Line 46: Replace
2022with the year of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3 - Line 47: Replace
10to correspond with the month of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3
Step 6: Loop outlier detection
Step 6.1: Define outlier detection loop function
Thorough investigation of outliers is strongly recommended. However, the code below provides the option to loop through the steps conducted on this page. This loop may result in overlooking details, errors, and other granular aspects of outlier detection. Use the loop function with caution.
Show the code
#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#' - flagged_data: Full dataset with outlier flags
#' - stats: Summary statistics for each indicator
#' - plots: Outlier methods visualization plots
#' - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {
# Initialize storage
results <- list(
flagged_data = df,
stats = list(),
plots = list(),
outliers = list()
)
# Loop through each indicator
for (indicator in indicators) {
# Skip if indicator doesn't exist
if (!indicator %in% names(df)) {
warning(paste("Indicator", indicator, "not found in dataframe. Skipping."))
next
}
# 1. Calculate statistics by HF-month group
stats <- calculate_outlier_stats(df, indicator)
# 2. Flag outliers (keeping all method columns for plotting)
flagged <- flag_outliers(df, stats, indicator)
# 3. Add status and flag columns (without removing method columns)
flagged <- flagged |>
dplyr::mutate(
!!paste0("outlier_status_", indicator) := outliers_halper,
!!paste0("outlier_flag_", indicator) := ifelse(outliers_halper == "outlier", TRUE, FALSE)
)
# 4. Generate plots (using all method columns)
plots <- generate_outlier_plots(flagged, indicator)
# 5. Extract outliers
outlier_records <- flagged |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(.data[[indicator]]))
# Store results
results$stats[[indicator]] <- stats
results$plots[[indicator]] <- plots
results$outliers[[indicator]] <- outlier_records
# Update main dataframe (only keep final status/flags)
results$flagged_data <- results$flagged_data |>
dplyr::left_join(
flagged |>
dplyr::select(hf_uid, year, month,
!!paste0("outlier_status_", indicator),
!!paste0("outlier_flag_", indicator)),
by = c("hf_uid", "year", "month")
)
}
return(results)
}Step 6.2: Use outlier detection loop function
To use the outlier detection loop on multiple variables at once, follow the code below. This code exemplifies outlier detection performed on disaggregated test columns.
Summary
This section has walked through multiple outlier detection methods using the confirmed malaria cases column “conf” as an example. We defined functions to identify, visualize, and investigate outliers to streamline the detection process. This code will help you detect outliers using appropriate methods based data and variables you are working with.
Once you have detected, validated, and possibly recategorized outliers in your data, your data frame will include all original columns of your data as well as an outlier column for each indicator. This new data frame will be used in the outlier correction page.
Full code
Show full code
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# Install or load relevant packages
pacman::p_load(
readxl, # import and read Excel files
dplyr, # data manipulation
ggplot2, # plotting
tidyr, # data management
patchwork, # for arranging plots
purrr, # vector management
viridis, # colour-blind friendly palettes
here # for easy file referencing
)
clean_malaria_routine_data_final <-
readRDS(
here::here(
"1.2_epidemiology/1.2a_routine_surveillance/raw",
"clean_malaria_routine_data_final.rds"
)
)
calculate_outlier_stats <- function(df, column) {
df |>
dplyr::group_by(hf_uid, month) |> # Facility-month grouping
dplyr::filter(n() >= 3) |> # Require minimum 3 observations per group
dplyr::summarise(
# Mean/SD metrics
mean = mean(.data[[column]], na.rm = TRUE),
sd = sd(.data[[column]], na.rm = TRUE),
# Median/MAD metrics
median = median(.data[[column]], na.rm = TRUE),
mad = mad(.data[[column]], constant = 1, na.rm = TRUE),
# IQR metrics
Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
.groups = "drop"
) |>
dplyr::mutate(
# Mean/SD bounds
mean_lower = pmax(0, mean - 3 * sd),
mean_upper = mean + 3 * sd,
# Median/MAD bounds
median_lower = pmax(0, median - 2.5 * mad),
median_upper = median + 2.5 * mad,
# IQR bounds
iqr_lower = pmax(0, Q1 - 1.5 * (Q3 - Q1)),
iqr_upper = Q3 + 1.5 * (Q3 - Q1)
)
}
flag_outliers <- function(df, stats_df, column) {
df |>
dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
dplyr::mutate(
outliers_moyenne = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
TRUE ~ "normal"
),
outliers_halper = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < median_lower | .data[[column]] > median_upper ~
"outlier",
TRUE ~ "normal"
),
outliers_IQR = dplyr::case_when(
is.na(.data[[column]]) ~ NA_character_,
.data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
TRUE ~ "normal"
)
) |>
# Remove the intermediate calculation columns
dplyr::select(
-mean,
-sd,
-median,
-mad,
-Q1,
-Q3,
-mean_lower,
-mean_upper,
-median_lower,
-median_upper,
-iqr_lower,
-iqr_upper
)
}
conf_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf")
conf_results <- flag_outliers(
clean_malaria_routine_data_final,
conf_stats,
"conf"
)
calculate_pct_labels <- function(var) {
non_na <- sum(!is.na(var))
normal_pct <- round(sum(var == "normal", na.rm = TRUE) / non_na * 100, 2)
outlier_pct <- round(sum(var == "outlier", na.rm = TRUE) / non_na * 100, 2)
na_count <- sum(is.na(var))
list(
normal = paste0("Normal (", normal_pct, "%)"),
outlier = paste0("Outliers (", outlier_pct, "%)"),
missing = paste0("NA (n=", na_count, ")")
)
}
create_outlier_plot <-
function(data, method, title, pct_labels, value_col = "conf") {
ggplot2::ggplot(
data,
aes(x = year, y = .data[[value_col]], color = .data[[method]])
) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c(
"normal" = "grey70",
"outlier" = "#FF0000",
"NA" = "#ffffff"
),
labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = "Confirmed Cases", color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}
generate_outlier_plots <- function(data, column) {
pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])
create_plot <- function(data, method, title, pct_labels) {
ggplot2::ggplot(
data,
aes(x = year, y = .data[[column]], color = .data[[method]])
) +
geom_point(alpha = 0.7) +
scale_color_manual(
values = c(
"normal" = "grey70",
"outlier" = "#FF0000",
"NA" = "#ffffff"
),
labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
na.value = "#ffffff",
drop = FALSE
) +
labs(title = title, y = column, color = NULL) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
)
}
p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)
patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
patchwork::plot_annotation(
title = paste("Comparison of Outlier Detection Methods:", column),
theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
)
}
conf_outlier_plots <- generate_outlier_plots(conf_results, "conf")
# Filter for median method outliers
conf_median_outliers <- conf_results |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(conf))
# Export outliers to CSV
outlier_file <- "conf_median_method_outliers.csv"
write.csv(conf_median_outliers, file = outlier_file, row.names = FALSE)
plot_outlier_timeseries <- function(
full_data, # Original dataset
outlier_row, # Single row from outlier results
time_var = "year", # Time variable (year or date)
value_var = "conf", # Value variable to plot
highlight_year = NA # Year to focus on
) {
# Get all conf* variables
conf_vars <- grep("^conf", names(full_data), value = TRUE)
# Plot colour palette
component_palette <- c(
RColorBrewer::brewer.pal(8, "Set2"), # First 8 colors
RColorBrewer::brewer.pal(4, "Dark2") # Next 4 (avoid yellow)
)[1:length(conf_vars)] # Trim to needed length
# Get all data for this facility for the focus year
facility_data <- full_data |>
dplyr::filter(hf_uid == outlier_row$hf_uid, year == highlight_year) |>
dplyr::mutate(
date = as.Date(paste(year, month, "01", sep = "-")),
is_outlier = ifelse(
outliers_halper == "outlier" & year == highlight_year,
"Outlier",
"Normal"
)
) |>
tidyr::pivot_longer(
cols = all_of(conf_vars),
names_to = "conf_variable",
values_to = "conf_value"
)
# Base plot
p <- ggplot2::ggplot(facility_data, aes(x = date)) +
# Plot all conf variables as dashed lines (except main variable)
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable != value_var),
aes(y = conf_value, color = conf_variable, group = conf_variable),
linetype = "dashed",
alpha = 0.8,
linewidth = 0.8
) +
# Plot main variable as solid line
ggplot2::geom_line(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, group = 1),
color = "grey70",
linewidth = 1.2
) +
ggplot2::geom_point(
data = facility_data |> dplyr::filter(conf_variable == value_var),
aes(y = conf_value, fill = is_outlier),
shape = 21,
size = 3,
color = "black"
) +
ggplot2::expand_limits(y = 0) +
ggplot2::scale_color_manual(
name = "Component Variables",
values = component_palette
) +
ggplot2::scale_fill_manual(
values = c("Outlier" = "red", "Normal" = "grey70"),
name = paste(value_var, "monthly report classification")
) +
ggplot2::labs(
title = paste(
"Outlier and Components:",
"Monthly",
outlier_row$hf,
"Reports for",
value_var
),
subtitle = paste(
"HF UID:",
outlier_row$hf_uid,
"| Year:",
highlight_year
),
x = "Month",
y = "Cases",
caption = ifelse(
any(
facility_data$is_outlier == "Outlier" &
facility_data$conf_variable == value_var
),
paste("Red points indicate outlier values in", value_var),
paste("No outliers detected in", value_var, "for this year")
)
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.box = "vertical"
) +
ggplot2::scale_x_date(
date_labels = "%b",
date_breaks = "1 month"
)
return(p)
}
# Select an outlier to visualize
selected_outlier <- conf_median_outliers |>
dplyr::filter(year == 2017) |>
dplyr::slice_max(conf, n = 1)
# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
full_data = conf_results, # Need to use the flagged data here
outlier_row = selected_outlier,
value_var = "conf",
highlight_year = 2017
)
conf_hf_u5_stats <- calculate_outlier_stats(
clean_malaria_routine_data_final,
"conf_hf_u5"
)
conf_hf_u5_results <- flag_outliers(
clean_malaria_routine_data_final,
conf_stats,
"conf_hf_u5"
)
conf_hf_u5_outlier_plots <- generate_outlier_plots(
conf_hf_u5_results,
"conf_hf_u5"
)
print(conf_hf_u5_outlier_plots)
# Filter for median outlier detection column
conf_hf_u5_results <- conf_hf_u5_results |>
dplyr::mutate(
outlier_status_conf_u5 = outliers_halper,
outlier_flag_conf_u5 = ifelse(outliers_halper == "outlier", TRUE, FALSE)
) |>
dplyr::select(-outliers_moyenne, -outliers_IQR) # Remove unused method columns
# Filter for points detected as outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(conf_hf_u5))
# Export outliers to CSV
outlier_file <- here::here(
"english/data_r/routine_cases/conf_hf_u5_median_method_outliers.csv"
)
write.csv(conf_hf_u5_median_outliers, file = outlier_file, row.names = FALSE)
# Select an outlier to visualize
selected_outlier_conf_hf_u5 <- conf_hf_u5_results |>
dplyr::filter(
hf_uid == "hf_1756", # Replace with known UID from your conf outlier
year == 2017,
)
# Generate plot
outlier_timeseries_conf_hf_u5_2023 <- plot_outlier_timeseries(
full_data = conf_hf_u5_results, # Need to use the flagged data here
outlier_row = selected_outlier_conf_hf_u5,
value_var = "conf_hf_u5",
highlight_year = 2017
)
reflag_outliers <- function(
df,
hf_name = NULL,
hf_uid = NULL,
year = NULL,
month = NULL,
new_status = "normal"
) {
# Build filter conditions
conditions <- list()
if (!is.null(hf_name)) {
conditions$hf <- hf_name
}
if (!is.null(hf_uid)) {
conditions$hf_uid <- hf_uid
}
if (!is.null(year)) {
conditions$year <- year
}
if (!is.null(month)) {
conditions$month <- month
}
# Apply recategorization to ALL outlier flags
df |>
dplyr::mutate(
dplyr::across(
dplyr::matches("^outlier_status_"),
~ ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~ .x == .y),
`&`
),
new_status,
.
)
),
dplyr::across(
dplyr::matches("^outlier_flag_"),
~ ifelse(
purrr::reduce(
purrr::map2(names(conditions), conditions, ~ .x == .y),
`&`
),
new_status == "outlier",
.
)
)
)
}
cleaned_data <- outlier_flagged_data |>
reflag_outliers(
hf_uid = "SL_45678",
year = 2022,
month = 10,
new_status = "investigated_normal" # Custom status
)
#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#' - flagged_data: Full dataset with outlier flags
#' - stats: Summary statistics for each indicator
#' - plots: Outlier methods visualization plots
#' - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {
# Initialize storage
results <- list(
flagged_data = df,
stats = list(),
plots = list(),
outliers = list()
)
# Loop through each indicator
for (indicator in indicators) {
# Skip if indicator doesn't exist
if (!indicator %in% names(df)) {
warning(paste(
"Indicator",
indicator,
"not found in dataframe. Skipping."
))
next
}
# 1. Calculate statistics by HF-month group
stats <- calculate_outlier_stats(df, indicator)
# 2. Flag outliers (keeping all method columns for plotting)
flagged <- flag_outliers(df, stats, indicator)
# 3. Add status and flag columns (without removing method columns)
flagged <- flagged |>
dplyr::mutate(
!!paste0("outlier_status_", indicator) := outliers_halper,
!!paste0("outlier_flag_", indicator) := ifelse(
outliers_halper == "outlier",
TRUE,
FALSE
)
)
# 4. Generate plots (using all method columns)
plots <- generate_outlier_plots(flagged, indicator)
# 5. Extract outliers
outlier_records <- flagged |>
dplyr::filter(outliers_halper == "outlier") |>
dplyr::arrange(desc(.data[[indicator]]))
# Store results
results$stats[[indicator]] <- stats
results$plots[[indicator]] <- plots
results$outliers[[indicator]] <- outlier_records
# Update main dataframe (only keep final status/flags)
results$flagged_data <- results$flagged_data |>
dplyr::left_join(
flagged |>
dplyr::select(
hf_uid,
year,
month,
!!paste0("outlier_status_", indicator),
!!paste0("outlier_flag_", indicator)
),
by = c("hf_uid", "year", "month")
)
}
return(results)
}
test_components <- c(
"test_hf_u5",
"test_hf_5_14",
"test_hf_ov15",
"test_com_u5",
"test_com_5_14",
"test_com_ov15"
)
test_results <- detect_outliers_loop(
df = clean_malaria_routine_data_final,
indicators = test_components,
method = "median"
)
print(test_results$plots$test_hf_u5)