Outlier detection methods

Overview

In subnational malaria analysis, outlier correction helps separate real epidemiological signals from data noise. This ensures reliable decision-making for district-level targeting and resource allocation. Before outliers can be corrected, they must be identified and investigated.

Objectives

Compare how outlier detection methods affect different malaria indicators
Assess how methods balance data integrity with sensitivity
Select appropriate detection based on analysis goals and data quality

Introduction to Outliers and Detection Methods

Outliers in SNT

Outliers can represent either data quality issues or true epidemiological events. Their impact is magnified in subnational analysis where small numbers matter. For example:

A reporting error at one facility can distort district-level estimates
A real outbreak might be mistakenly “corrected” if not properly validated

Outlier detection occurs at the health facility-month (HF-month) level, where each facility’s monthly reports are compared against its own historical patterns.

Seasonal Considerations in Malaria Data

In malaria surveillance, seasonal peaks are expected epidemiological patterns, not outliers. True outliers represent deviations from expected seasonal patterns. The methods here account for this by: - Comparing facilities against their own monthly baselines - Using robust methods (Median/MAD, IQR) that are less influenced by seasonal extremes - Requiring manual investigation to distinguish data errors from genuine outbreaks

Outlier Detection Methods

Three outlier detection methods are explored on this page, each with distinct statistical properties that make them suitable for different data characteristics:

1. Mean/Standard Deviation Method (±2σ):

Basis: Assumes normal distribution (Gaussian)
Formula: Outlier if |value - mean| > 2 × standard deviation
Best for: Normally distributed data without extreme skew
Limitations: Sensitive to extreme values; mean and SD are heavily influenced by outliers themselves
Malaria context: Rarely ideal due to right-skewed case count distributions

2. Median/MAD Method (Median ± 2.5×MAD):

Basis: Non-parametric, robust to distribution shape
Formula: Outlier if |value - median| > 2.5 × Median Absolute Deviation
MAD calculation: MAD = median(|values - median|) × 1.4826 (scaled to match SD for normal data)
Best for: Skewed data with asymmetric distributions common in malaria reporting
Advantage: Resistant to up to 50% contamination by outliers
Malaria context: Ideal for facility-level malaria data where most facilities report low case counts (right-skewed distribution) with occasional genuine outbreaks that shouldn’t be over-penalized

3. Interquartile Range Method (IQR ± 1.5×IQR):

Basis: Non-parametric, based on data quartiles
Formula: Outlier if value < Q1 - 1.5×IQR or value > Q3 + 1.5×IQR
IQR calculation: IQR = Q3 - Q1 (contains middle 50% of data)
Best for: Volatile datasets with many small facilities and varying reporting patterns
Malaria context: Effective for zero-inflated and over-dispersed count data

Statistical Note: The multipliers have been empirically calibrated to (1.8, 3.2, 1.6) to account for malaria data’s extreme right-skew. In normally distributed data, these multipliers would flag approximately 7.2%, 0.2%, and 4.7% of points as outliers respectively. However, for highly right-skewed malaria data, the multipliers have been adjusted from standard values (3, 2.5, 1.5) to ensure appropriate sensitivity for this specific distribution.

The provided code applies these detection methods to the conf_hf_u5 variable representing confirmed malaria cases in children under 5 detected at health facilities. Outliers should be identified in disaggregated, extracted columns. Aggregated columns may obscure outliers in disaggregated components.The plot below visualizes outliers in aggregated columns and disaggregatd components, highlighting the need to perform detection directly on the columns extracted from DHIS2.

Step-by-Step

Below is a step-by-step walkthrough of outlier detection methods, with emphasis on comparison to help you visualize and use the methods appropriately. Each step includes notes to help you understand, adapt, and use the code.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Load libraries and data

First we install and load the packages and data required for this section. This page continues the use of the preprocessed Sierra Leone DHIS2 example data.

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install or load relevant packages
pacman::p_load(
  readxl,     # import and read Excel files
  dplyr,      # data manipulation
  ggplot2,    # plotting
  tidyr,      # data management
  patchwork,  # for arranging plots
  purrr,      # vector management
  viridis,    # colour-blind friendly palettes
  here        # for easy file referencing
)

clean_malaria_routine_data_final <-
 readRDS(
   here::here(
     "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed",
     "clean_malaria_routine_data_final.rds"
   )
 )

Step 2: Outlier detection

Step 2.1: Calculate outlier statistics

We begin by defining the calculate_outlier_stats function to streamline the process of identifying points as outliers. The function calculates the various statistical metrics required to identify outliers.

calculate_outlier_stats <- function(df, column) {
  df |>
    dplyr::group_by(hf_uid, month) |>  # Facility-month grouping
    dplyr::filter(n() >= 3) |>  # Require minimum 3 observations per group
    dplyr::summarise(
      # Mean/SD metrics
      mean = mean(.data[[column]], na.rm = TRUE),
      sd = sd(.data[[column]], na.rm = TRUE),

      # Median/MAD metrics
      median = median(.data[[column]], na.rm = TRUE),
      mad = mad(.data[[column]], constant = 1, na.rm = TRUE),

      # IQR metrics
      Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
      Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
      .groups = "drop"
    ) |>
    dplyr::mutate(
      # Mean/SD bounds. For highly right-skewed data, use much tighter bounds
      mean_lower = pmax(0, mean - 1.8 * sd),
      mean_upper = mean + 1.8 * sd,

      # Median/MAD bounds
      median_lower = pmax(0, median - 3.2 * mad),
      median_upper = median + 3.2 * mad,

      # IQR bounds
      iqr_lower = pmax(0, Q1 - 1.6 * (Q3 - Q1)),
      iqr_upper = Q3 + 1.6 * (Q3 - Q1)
    )
}

To adapt the code:

Line 20-30: (Advanced) Change the multiplier in each outlier detection method to modify the sensitivity of each method. The multipliers used here are 1.8, 3.2, and 1.6 for mean/SD, median/MAD, and IQR detection respectively. These multipliers were empirically calibrated for Sierra Leone malaria data and may need adjustment for different datasets or surveillance systems.

Step 2.2: Flag outliers

Next we define the flag_outliers function to identify outliers based on the three methods and attach this information to the dataframe create in the Step 2.1. This code flags outliers based on the statistics calculated in the previous step.

flag_outliers <- function(df, stats_df, column) {
  df |>
    dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
    dplyr::mutate(
      outliers_moyenne = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
        TRUE ~ "normal"
      ),
      outliers_halper = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < median_lower | .data[[column]] > median_upper ~ "outlier",
        TRUE ~ "normal"
      ),
      outliers_IQR = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
        TRUE ~ "normal"
      )
    ) |>
    # Remove the intermediate calculation columns
    dplyr::select(-mean, -sd, -median, -mad, -Q1, -Q3,
                 -mean_lower, -mean_upper, -median_lower, -median_upper,
                 -iqr_lower, -iqr_upper)
}

conf_hf_u5_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf_hf_u5")

conf_hf_u5_results <- flag_outliers(clean_malaria_routine_data_final, conf_hf_u5_stats, "conf_hf_u5")

To adapt the code:

Line 28-29: Modify conf_hf_u5_resultsand conf_hf_u5_stats to reflect the variable you are performing outlier detection on. These are the names given to each function’s output. It is recommended to follow the variablename_results and variablename__results naming convention.
Line 28: Modify (clean_malaria_routine_data_final, "conf_hf_u5") to reflect the (df, "column") you are performing outlier detection on
Line 29: Modify (clean_malaria_routine_data_final, conf_hf_u5_stats, "conf_hf_u5") to reflect the (df, variablename_stats, "column") you are performing outlier detection on

Step 3: Visualize outliers

Step 3.1: Calculate outlier proportions

First we use the calculate_pct_labels function to calculate the proportion of conf_hf_u5 values labelled outliers for each of the three detection methods.

calculate_pct_labels <- function(var) {
  non_na <- sum(!is.na(var))

  normal_pct <- round(sum(var == "normal", na.rm = TRUE)/non_na*100, 2)
  outlier_pct <- round(sum(var == "outlier", na.rm = TRUE)/non_na*100, 2)
  na_count <- sum(is.na(var))

  list(
    normal = paste0("Normal (", normal_pct, "%)"),
    outlier = paste0("Outliers (", outlier_pct, "%)"),
    missing = paste0("NA (n=", na_count, ")")
  )
}

Step 3.2: Outlier plotting function

Next we define the create_outlier_plots function which configures the plot style and format to visualize and compare outlier detection methods. The exampled code stratifies confirmed cases by year.

create_outlier_plot <-
  function(data, method, title, pct_labels, value_col = "conf") {
  ggplot2::ggplot(data,
                  aes(x = year,
                      y = .data[[value_col]],
                      color = .data[[method]])) +
    geom_point(alpha = 0.7) +
    scale_color_manual(
      values = c(
        "normal" = "grey70",
        "outlier" = "#FF0000",
        "NA" = "#ffffff"),
      labels = c(pct_labels$normal, pct_labels$outlier , pct_labels$missing),
      na.value = "#ffffff",
      drop = FALSE
    ) +
    labs(title = title, y = "Confirmed Cases HF U5", color = NULL) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      legend.position = "bottom",
      legend.direction = "vertical",
      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
    )
}

To adapt the code:

Line 17: Modify the y-axes label based on the column you are performing outlier detection on

Step 3.3: Generate outlier visualization plots

We now define a function called generate_outier_plots which combines the calculate_pct_labels function from Step 3.1 and the create_outlier_plot function from Step 3.2.

generate_outlier_plots <- function(data, column) {
  pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
  pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
  pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])

  create_plot <- function(data, method, title, pct_labels) {
    ggplot2::ggplot(data, aes(x = year,
                              y = .data[[column]],
                              color = .data[[method]])) +
      geom_point(alpha = 0.7) +
      scale_color_manual(
        values = c("normal" = "grey70", "outlier" = "#FF0000", "NA" = "#ffffff"),
        labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
        na.value = "#ffffff",
        drop = FALSE
      ) +
      labs(title = title, y = column, color = NULL) +
      theme_minimal() +
      theme(
        plot.title = element_text(hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "vertical",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
      )
  }

  p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
  p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
  p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)

  patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
    patchwork::plot_annotation(
      title = paste("Comparison of Outlier Detection Methods:", column),
      theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
)}

conf_hf_u5_outlier_plots <- generate_outlier_plots(conf_hf_u5_results, "conf_hf_u5")

Output

To adapt the code:

Line 37: Modify conf_hf_u5_outlier_plots to reflect the column you are performing outlier detection on. Remember to follow the variablename_outlier_plots convention.
Line 37: Modify the input of generate_outlier_plots to reflect the (variablename_results, "variablename") you are performing outlier detection on

Step 4: Outlier investigation

Consult SNT team

Outliers cannot be removed or corrected without investigation. Pinpoint observations categorized as outliers and present them to the SNT team. Oftentimes, the context within which a facility operates validates extreme values.

After outlier detection, the data must be investigated at a more granular level. By this point analysts should have selected which outlier detection method will be used. This page uses the median/MAD-detected outliers going forward. Analysts should choose the outlier detection method based on statistical knowledge, data context, and SNT team consultation.

Step 4.1: Investigate and export outlier data

The confirmed cases observations identified as outliers by the median/MAD method are first exported to a CSV file for further investigation and review by the SNT team.

# Filter for median method outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
  dplyr::filter(outliers_halper == "outlier") |>
  dplyr::arrange(desc(conf_hf_u5))

# save outliers as RDS for R users
saveRDS(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.rds"
  )
)

# save same dataset as parquet for python interop
arrow::write_parquet(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.parquet"
  )
)

# save same dataset as csv for human inspection in excel
write.csv(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.csv"
  ),
  row.names = FALSE
)

To adapt the code:

Line 2: Rename conf_hf_u5_median_outliers to match the variablename_method_outliers naming convention suggested on this page
Line 2 Change conf_hf_u5_results to variablename_results based on the column you are performing outlier detection on
Line 3: Replace outliers_halper with outliers_moyenne or outliers_IQR if you have chosen to proceed with mean detection or IQR detection respectively
Output basenames: Change "conf_hf_u5_median_method_outliers" (the shared basename across the .rds, .parquet, and .csv writes) to reflect the desired output file name
Object name: Change conf_hf_u5_median_outliers to reflect the variablename_method_outliers object defined above

Step 4.2: Outlier investigation function

This step defines a function to facilitate outlier investigation, including a diagnostic plot.

Show the code

plot_outlier_timeseries <- function(
    full_data,          # Original dataset
    outlier_row,        # Single row from outlier results
    time_var = "year",  # Time variable (year or date)
    value_var = "conf_hf_u5", # Value variable to plot
    highlight_year = NA # Year to focus on
) {

  # Get all conf* variables
  conf_vars <- grep("^conf_hf_", names(full_data), value = TRUE)

  # Plot colour palette
  component_palette <- c(
    RColorBrewer::brewer.pal(8, "Set2"),  # First 8 colors
    RColorBrewer::brewer.pal(4, "Dark2")    # Next 4 (avoid yellow)
  )[1:length(conf_vars)]  # Trim to needed length

  # Get all data for this facility for the focus year
  facility_data <- full_data |>
    dplyr::filter(hf_uid == outlier_row$hf_uid,
                  year == highlight_year) |>
    dplyr::mutate(
      date = as.Date(paste(year, month, "01", sep = "-")),
      is_outlier = ifelse(
        outliers_halper == "outlier" & year == highlight_year,
        "Outlier",
        "Normal"
      )
    ) |>
    tidyr::pivot_longer(
      cols = all_of(conf_vars),
      names_to = "conf_variable",
      values_to = "conf_value"
    )

  # Base plot
  p <- ggplot2::ggplot(facility_data, aes(x = date)) +
    # Plot all conf variables as dashed lines (except main variable)
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable != value_var),
      aes(y = conf_value, color = conf_variable, group = conf_variable),
      linetype = "dashed",
      alpha = 0.8,
      linewidth = 0.8
    ) +
    # Plot main variable as solid line
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, group = 1),
      color = "grey70",
      linewidth = 1.2
    ) +
    ggplot2::geom_point(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, fill = is_outlier),
      shape = 21, size = 3, color = "black"
    ) +
    ggplot2::expand_limits(y = 0) +
    ggplot2::scale_color_manual(
      name = "Component Variables",
      values = component_palette
    ) +
    ggplot2::scale_fill_manual(
      values = c("Outlier" = "red", "Normal" = "grey70"),
      name = paste(value_var, "monthly report classification")
    ) +
    ggplot2::labs(
      title = paste("Outlier and Components:", "Monthly", outlier_row$hf, "Reports for", value_var),
      subtitle = paste("HF UID:", outlier_row$hf_uid, "| Year:", highlight_year),
      x = "Month",
      y = "Cases",
      caption = ifelse(
        any(facility_data$is_outlier == "Outlier" &
            facility_data$conf_variable == value_var),
        paste("Red points indicate outlier values in", value_var),
        paste("No outliers detected in", value_var, "for this year")
      )
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "bottom",
      legend.box = "vertical"
    ) +
    ggplot2::scale_x_date(
      date_labels = "%b",
      date_breaks = "1 month"
    )

  return(p)
}

To adapt the code:

Line 5: Replace "conf_hf_u5" with the column you are performing outlier detection on. Options include "susp", "test". "maltreat", etc.
Line 10: Replace conf_vars to reflect the variablename_vars based on the column you are performing outlier detection on. Recall the goal here is to plot all related disaggregated columns alongside the primary column
Line 10: Replace "^conf_hf" to reflect the prefix of the"^variablename" you are performing outlier detection on
Line 16: Replace conf_vars with the variablename_vars defined in Line 10
Lines 31-32: Modify "conf_variable" and "conf_value" to match the "variablename_variable" and "variablename_value" of the column you are performing outlier detection on
Line 40, 41, 48, 49, 55, 56: Similarly, replace all instances of conf_variable and conf_value based on the column you are performing outlier detection on

Step 4.3: Select and investigate a single outlier

A time series enables visualization of monthly reports within a given year to contextualize the classification of a point as an outlier. Recall that we calculated totals on the DHIS2 preprocessing page. Therefore we must investigate outliers in the disaggregated, extracted components rather than calculated totals.

# Select an outlier to visualize
selected_outlier <- conf_hf_u5_median_outliers |>
  dplyr::filter(year == 2017) |>
  dplyr::slice_max(conf, n = 1)

# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
  full_data = conf_hf_u5_results,  # Need to use the flagged data here
  outlier_row = selected_outlier,
  value_var = "conf_hf_u5",
  highlight_year = 2017
)

Output

To adapt the code:

Line 2: Replace conf_hf_u5_median_outliers following the variablename_method_outliers naming convention
Line 2: Replace 2017 with the year of data observation to investigate
Line 8: Replace conf_hf_u5_results following the variablename_results output defined in Step 2.3
Line 10: Replace "conf_hf_u5" with the column you are performing outlier detection on. This is the same column specified in Step 2.3
Line 11: Replace 2017 with the year of data observation to investigate, also specified in Line 3 of this step

Context matters!

When investigating outliers in conf_hf_u5, it’s essential to examine all related component columns (conf_com_u5, conf_hf_5_14, conf_hf_ov15, etc.) to understand the complete epidemiological picture. Seasonality is also of improtance in this context. In Sierra Leone’s malaria transmission:

High transmission typically occurs during the rainy season (May-October)
Low transmission occurs during the dry season (November-April)
True outliers represent deviations from these expected seasonal patterns, not the seasonal peaks themselves

Sierra Leone health facilities also report cases for community health workers alongside their monthly reports. Examining all component variables within the reporting context helps distinguish between data quality issues, genuine outbreaks, and expected seasonal patterns.

Step 5: Outlier recatgorization

Step 5.1: Outlier recategorization function

Consult SNT team

Based on review with the SNT team, outliers may need to be categorized. For example, extremely high values categorized as outliers may result from genuine spikes or outbreaks. In this case, we cannot leave the observation flagged as an outlier and correct it.

Show the code

reflag_outliers <- function(df,
                           hf_name = NULL,
                           hf_uid = NULL,
                           year = NULL,
                           month = NULL,
                           outlier_metric = NULL,
                           new_status = "normal") {

  conditions <- list()
  if(!is.null(hf_name)) conditions$hf <- hf_name
  if(!is.null(hf_uid)) conditions$hf_uid <- hf_uid
  if(!is.null(year)) conditions$year <- year
  if(!is.null(month)) conditions$month <- month

  status_col <- paste0("outlier_status_", outlier_metric)
  flag_col <- paste0("outlier_flag_", outlier_metric)

  df |>
    dplyr::mutate(
      !!status_col := ifelse(
        purrr::reduce(
          purrr::map2(names(conditions), conditions, ~.x == .y),
          `&`
        ),
        new_status,
        .data[[status_col]]
      ),
      !!flag_col := ifelse(
        purrr::reduce(
          purrr::map2(names(conditions), conditions, ~.x == .y),
          `&`
        ),
        new_status == "outlier",
        .data[[flag_col]]
      )
    )
}

# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
  reflag_outliers(
    hf_uid = "SL_45678",
    year = 2022,
    month = 10,
    outlier_metric = "conf_hf_u5",
    new_status = "investigated_normal"
  )

Step 5.2: Apply outlier recategorization

This code applies the function to recategorize specific outliers as normal, while still maintaing a history of the investigation and re-categorization.

# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
  reflag_outliers(
    hf_uid = "hf_0682",
    year = 2017,
    month = 2,
    outlier_metric = "conf_hf_u5",
    new_status = "investigated_normal"
  )

To adapt the code:

Line 2: Rename reflagged_conf_hf_u5_outlier_data and conf_hf_u5_results based on the columns you are working with.
Line 4: Replace "SL_45678" with the hf_uid of the health facility whose outlier needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3
Line 5: Replace 2022 with the year of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 4.3
Line 7: Replace 10 to correspond with the month of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 4.3

Step 6: Loop outlier detection

Step 6.1: Define outlier detection loop function

While the code below provides the option to loop through outlier detection for multiple indicators, thorough investigation of each identified outlier is still essential. The loop function streamlines the detection process but should be used in conjunction with manual review rather than as a replacement for it.

Show the code

#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#'   - flagged_data: Full dataset with outlier flags
#'   - stats: Summary statistics for each indicator
#'   - plots: Outlier methods visualization plots
#'   - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {

  # Initialize storage
  results <- list(
    flagged_data = df,
    stats = list(),
    plots = list(),
    outliers = list()
  )

  # Loop through each indicator
  for (indicator in indicators) {

    # Skip if indicator doesn't exist
    if (!indicator %in% names(df)) {
      warning(paste("Indicator", indicator, "not found in dataframe. Skipping."))
      next
    }

    # 1. Calculate statistics by HF-month group
    stats <- calculate_outlier_stats(df, indicator)

    # 2. Flag outliers (keeping all method columns for plotting)
    flagged <- flag_outliers(df, stats, indicator)

    # 3. Add status and flag columns (without removing method columns)
    flagged <- flagged |>
      dplyr::mutate(
        !!paste0("outlier_status_", indicator) := outliers_halper,
        !!paste0("outlier_flag_", indicator) := ifelse(outliers_halper == "outlier", TRUE, FALSE)
      )

    # 4. Generate plots (using all method columns)
    plots <- generate_outlier_plots(flagged, indicator)

    # 5. Extract outliers
    outlier_records <- flagged |>
      dplyr::filter(outliers_halper == "outlier") |>
      dplyr::arrange(desc(.data[[indicator]]))

    # Store results
    results$stats[[indicator]] <- stats
    results$plots[[indicator]] <- plots
    results$outliers[[indicator]] <- outlier_records

    # Update main dataframe (only keep final status/flags)
    results$flagged_data <- results$flagged_data |>
      dplyr::left_join(
        flagged |>
          dplyr::select(hf_uid, year, month,
                       !!paste0("outlier_status_", indicator),
                       !!paste0("outlier_flag_", indicator)),
        by = c("hf_uid", "year", "month")
      )
  }

  return(results)
}

Step 6.2: Use outlier detection loop function

To use the outlier detection loop on multiple variables at once, follow the code below. This code exemplifies the outlier detection loop performed on disaggregated test columns.

test_components <- c("test_hf_u5", "test_hf_5_14", "test_hf_ov15",
                    "test_com_u5", "test_com_5_14", "test_com_ov15")

test_results <- detect_outliers_loop(
  df = clean_malaria_routine_data_final,
  indicators = test_components,
  method = "median"
)

print(test_results$plots$test_hf_u5)

Output

Step 7: Set outliers to missing

Once outliers have been investigated and finalized, the next step is to set them to missing (i.e. NA values). This prepares outliers to be imputed like other missingness, in a manner than condiers downstream indicators and clinical logic.

# Function to clean variable and return minimal columns
clean_single_variable <- function(df, variable_name, method = "halper") {
  method_col <- switch(method,
                       "moyenne" = "outliers_moyenne",
                       "halper" = "outliers_halper", 
                       "IQR" = "outliers_IQR")
  
  df |>
    dplyr::mutate(
      !!paste0(variable_name, "_cleaned") := ifelse(.data[[method_col]] == "outlier", NA, .data[[variable_name]]),
      !!paste0("outlier_status_", variable_name) := .data[[method_col]]
    ) |>
    dplyr::select(
      # Keep all original columns
      dplyr::all_of(names(clean_malaria_routine_data_final)),
      # Add only the essential columns for the variable
      !!paste0("outlier_status_", variable_name),
      !!paste0(variable_name, "_cleaned")
    )
}

# Clean multiple variables sequentially
final_detected_data <- conf_hf_u5_results |>
  # Process conf_hf_u5
  clean_single_variable("conf_hf_u5", "halper") |>
  # Process test_hf_u5 (would need to run detection first)
  clean_single_variable("test_hf_u5", "halper") |>
  # Add more variables as needed...
  
# Final dataset: original columns + cleaned versions of all processed variables
saveRDS(
  final_detected_data,
  here::here(
    "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed",
    "outlier_detected_malaria_routine_data.rds"
  )
)

# save same dataset as parquet for python interop
arrow::write_parquet(
  final_detected_data,
  here::here(
    "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed",
    "outlier_detected_malaria_routine_data.parquet"
  )
)

Summary

This section has walked through multiple outlier detection methods using the conf_hf_u5 column as an example. We defined functions to identify, visualize, and investigate outliers to streamline the detection process. This code will help you detect outliers using appropriate methods based on the data and variables you are working with.

Once you have detected, validated, and possibly recategorized outliers in your data, your data frame will include all original columns of your data as well as an outlier column for each indicator. This new data frame will be used in the outlier correction page.

Full code

Show full code

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install or load relevant packages
pacman::p_load(
  readxl,     # import and read Excel files
  dplyr,      # data manipulation
  ggplot2,    # plotting
  tidyr,      # data management
  patchwork,  # for arranging plots
  purrr,      # vector management
  viridis,    # colour-blind friendly palettes
  here        # for easy file referencing
)

clean_malaria_routine_data_final <-
 readRDS(
   here::here(
     "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed",
     "clean_malaria_routine_data_final.rds"
   )
 )

calculate_outlier_stats <- function(df, column) {
  df |>
    dplyr::group_by(hf_uid, month) |>  # Facility-month grouping
    dplyr::filter(n() >= 3) |>  # Require minimum 3 observations per group
    dplyr::summarise(
      # Mean/SD metrics
      mean = mean(.data[[column]], na.rm = TRUE),
      sd = sd(.data[[column]], na.rm = TRUE),

      # Median/MAD metrics
      median = median(.data[[column]], na.rm = TRUE),
      mad = mad(.data[[column]], constant = 1, na.rm = TRUE),

      # IQR metrics
      Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
      Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
      .groups = "drop"
    ) |>
    dplyr::mutate(
      # Mean/SD bounds
      mean_lower = pmax(0, mean - 1.8 * sd),
      mean_upper = mean + 1.8 * sd,

      # Median/MAD bounds
      median_lower = pmax(0, median - 3.2 * mad),
      median_upper = median + 3.2 * mad,

      # IQR bounds
      iqr_lower = pmax(0, Q1 - 1.6 * (Q3 - Q1)),
      iqr_upper = Q3 + 1.6 * (Q3 - Q1)
    )
}

flag_outliers <- function(df, stats_df, column) {
  df |>
    dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
    dplyr::mutate(
      outliers_moyenne = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
        TRUE ~ "normal"
      ),
      outliers_halper = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < median_lower | .data[[column]] > median_upper ~ "outlier",
        TRUE ~ "normal"
      ),
      outliers_IQR = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
        TRUE ~ "normal"
      )
    ) |>
    # Remove the intermediate calculation columns
    dplyr::select(-mean, -sd, -median, -mad, -Q1, -Q3,
                 -mean_lower, -mean_upper, -median_lower, -median_upper,
                 -iqr_lower, -iqr_upper)
}

conf_hf_u5_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf_hf_u5")

conf_hf_u5_results <- flag_outliers(clean_malaria_routine_data_final, conf_hf_u5_stats, "conf_hf_u5")


calculate_pct_labels <- function(var) {
  non_na <- sum(!is.na(var))

  normal_pct <- round(sum(var == "normal", na.rm = TRUE)/non_na*100, 2)
  outlier_pct <- round(sum(var == "outlier", na.rm = TRUE)/non_na*100, 2)
  na_count <- sum(is.na(var))

  list(
    normal = paste0("Normal (", normal_pct, "%)"),
    outlier = paste0("Outliers (", outlier_pct, "%)"),
    missing = paste0("NA (n=", na_count, ")")
  )
}

create_outlier_plot <-
  function(data, method, title, pct_labels, value_col = "conf") {
  ggplot2::ggplot(data,
                  aes(x = year,
                      y = .data[[value_col]],
                      color = .data[[method]])) +
    geom_point(alpha = 0.7) +
    scale_color_manual(
      values = c(
        "normal" = "grey70",
        "outlier" = "#FF0000",
        "NA" = "#ffffff"),
      labels = c(pct_labels$normal, pct_labels$outlier , pct_labels$missing),
      na.value = "#ffffff",
      drop = FALSE
    ) +
    labs(title = title, y = "Confirmed Cases", color = NULL) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      legend.position = "bottom",
      legend.direction = "vertical",
      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
    )
}


generate_outlier_plots <- function(data, column) {
  pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
  pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
  pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])

  create_plot <- function(data, method, title, pct_labels) {
    ggplot2::ggplot(data, aes(x = year,
                              y = .data[[column]],
                              color = .data[[method]])) +
      geom_point(alpha = 0.7) +
      scale_color_manual(
        values = c("normal" = "grey70", "outlier" = "#FF0000", "NA" = "#ffffff"),
        labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
        na.value = "#ffffff",
        drop = FALSE
      ) +
      labs(title = title, y = column, color = NULL) +
      theme_minimal() +
      theme(
        plot.title = element_text(hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "vertical",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
      )
  }

  p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
  p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
  p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)

  patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
    patchwork::plot_annotation(
      title = paste("Comparison of Outlier Detection Methods:", column),
      theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
)}

conf_hf_u5_outlier_plots <- generate_outlier_plots(conf_hf_u5_results, "conf_hf_u5")

# Filter for median method outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
  dplyr::filter(outliers_halper == "outlier") |>
  dplyr::arrange(desc(conf_hf_u5))

# save outliers as RDS for R users
saveRDS(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.rds"
  )
)

# save same dataset as parquet for python interop
arrow::write_parquet(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.parquet"
  )
)

# save same dataset as csv for human inspection in excel
write.csv(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.csv"
  ),
  row.names = FALSE
)

# Outlier timeseries
plot_outlier_timeseries <- function(
    full_data,          # Original dataset
    outlier_row,        # Single row from outlier results
    time_var = "year",  # Time variable (year or date)
    value_var = "conf_hf_u5", # Value variable to plot
    highlight_year = NA # Year to focus on
) {

  # Get all conf* variables
  conf_vars <- grep("^conf_hf_", names(full_data), value = TRUE)

  # Plot colour palette
  component_palette <- c(
    RColorBrewer::brewer.pal(8, "Set2"),  # First 8 colors
    RColorBrewer::brewer.pal(4, "Dark2")    # Next 4 (avoid yellow)
  )[1:length(conf_vars)]  # Trim to needed length

  # Get all data for this facility for the focus year
  facility_data <- full_data |>
    dplyr::filter(hf_uid == outlier_row$hf_uid,
                  year == highlight_year) |>
    dplyr::mutate(
      date = as.Date(paste(year, month, "01", sep = "-")),
      is_outlier = ifelse(
        outliers_halper == "outlier" & year == highlight_year,
        "Outlier",
        "Normal"
      )
    ) |>
    tidyr::pivot_longer(
      cols = all_of(conf_vars),
      names_to = "conf_variable",
      values_to = "conf_value"
    )

  # Base plot
  p <- ggplot2::ggplot(facility_data, aes(x = date)) +
    # Plot all conf variables as dashed lines (except main variable)
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable != value_var),
      aes(y = conf_value, color = conf_variable, group = conf_variable),
      linetype = "dashed",
      alpha = 0.8,
      linewidth = 0.8
    ) +
    # Plot main variable as solid line
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, group = 1),
      color = "grey70",
      linewidth = 1.2
    ) +
    ggplot2::geom_point(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, fill = is_outlier),
      shape = 21, size = 3, color = "black"
    ) +
    ggplot2::expand_limits(y = 0) +
    ggplot2::scale_color_manual(
      name = "Component Variables",
      values = component_palette
    ) +
    ggplot2::scale_fill_manual(
      values = c("Outlier" = "red", "Normal" = "grey70"),
      name = paste(value_var, "monthly report classification")
    ) +
    ggplot2::labs(
      title = paste("Outlier and Components:", "Monthly", outlier_row$hf, "Reports for", value_var),
      subtitle = paste("HF UID:", outlier_row$hf_uid, "| Year:", highlight_year),
      x = "Month",
      y = "Cases",
      caption = ifelse(
        any(facility_data$is_outlier == "Outlier" &
            facility_data$conf_variable == value_var),
        paste("Red points indicate outlier values in", value_var),
        paste("No outliers detected in", value_var, "for this year")
      )
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "bottom",
      legend.box = "vertical"
    ) +
    ggplot2::scale_x_date(
      date_labels = "%b",
      date_breaks = "1 month"
    )

  return(p)
}

# Select an outlier to visualize
selected_outlier <- conf_hf_u5_median_outliers |>
  dplyr::filter(year == 2017) |>
  dplyr::slice_max(conf, n = 1)

# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
  full_data = conf_hf_u5_results,  # Need to use the flagged data here
  outlier_row = selected_outlier,
  value_var = "conf_hf_u5",
  highlight_year = 2017
)

reflag_outliers <- function(df,
                           hf_name = NULL,
                           hf_uid = NULL,
                           year = NULL,
                           month = NULL,
                           outlier_metric = NULL,
                           new_status = "normal") {

  conditions <- list()
  if(!is.null(hf_name)) conditions$hf <- hf_name
  if(!is.null(hf_uid)) conditions$hf_uid <- hf_uid
  if(!is.null(year)) conditions$year <- year
  if(!is.null(month)) conditions$month <- month

  status_col <- paste0("outlier_status_", outlier_metric)
  flag_col <- paste0("outlier_flag_", outlier_metric)

  df |>
    dplyr::mutate(
      !!status_col := ifelse(
        purrr::reduce(
          purrr::map2(names(conditions), conditions, ~.x == .y),
          `&`
        ),
        new_status,
        .data[[status_col]]
      ),
      !!flag_col := ifelse(
        purrr::reduce(
          purrr::map2(names(conditions), conditions, ~.x == .y),
          `&`
        ),
        new_status == "outlier",
        .data[[flag_col]]
      )
    )
}

# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
  reflag_outliers(
    hf_uid = "SL_45678",
    year = 2022,
    month = 10,
    outlier_metric = "conf_hf_u5",
    new_status = "investigated_normal"
  )

# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
  reflag_outliers(
    hf_uid = "SL_45678",
    year = 2022,
    month = 10,
    outlier_metric = "conf_hf_u5",
    new_status = "investigated_normal"
  )


#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#'   - flagged_data: Full dataset with outlier flags
#'   - stats: Summary statistics for each indicator
#'   - plots: Outlier methods visualization plots
#'   - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {
  # Initialize storage
  results <- list(
    flagged_data = df,
    stats = list(),
    plots = list(),
    outliers = list()
  )

  # Loop through each indicator
  for (indicator in indicators) {
    # Skip if indicator doesn't exist
    if (!indicator %in% names(df)) {
      warning(paste(
        "Indicator",
        indicator,
        "not found in dataframe. Skipping."
      ))
      next
    }

    # 1. Calculate statistics by HF-month group
    stats <- calculate_outlier_stats(df, indicator)

    # 2. Flag outliers (keeping all method columns for plotting)
    flagged <- flag_outliers(df, stats, indicator)

    # 3. Add status and flag columns (without removing method columns)
    flagged <- flagged |>
      dplyr::mutate(
        !!paste0("outlier_status_", indicator) := outliers_halper,
        !!paste0("outlier_flag_", indicator) := ifelse(
          outliers_halper == "outlier",
          TRUE,
          FALSE
        )
      )

    # 4. Generate plots (using all method columns)
    plots <- generate_outlier_plots(flagged, indicator)

    # 5. Extract outliers
    outlier_records <- flagged |>
      dplyr::filter(outliers_halper == "outlier") |>
      dplyr::arrange(desc(.data[[indicator]]))

    # Store results
    results$stats[[indicator]] <- stats
    results$plots[[indicator]] <- plots
    results$outliers[[indicator]] <- outlier_records

    # Update main dataframe (only keep final status/flags)
    results$flagged_data <- results$flagged_data |>
      dplyr::left_join(
        flagged |>
          dplyr::select(
            hf_uid,
            year,
            month,
            !!paste0("outlier_status_", indicator),
            !!paste0("outlier_flag_", indicator)
          ),
        by = c("hf_uid", "year", "month")
      )
  }

  return(results)
}

test_components <- c(
  "test_hf_u5",
  "test_hf_5_14",
  "test_hf_ov15",
  "test_com_u5",
  "test_com_5_14",
  "test_com_ov15"
)

test_results <- detect_outliers_loop(
  df = clean_malaria_routine_data_final,
  indicators = test_components,
  method = "median"
)

print(test_results$plots$test_hf_u5)