Outlier detection methods

Overview

In subnational malaria analysis, outlier correction helps separate real epidemiological signals from data noise. This ensures reliable decision-making for district-level targeting and resource allocation. Before outliers can be corrected, they must be identified and investigated.

Objectives

Compare how outlier detection methods affect different malaria indicators
Assess how methods balance data integrity with sensitivity
Select appropriate detection based on analysis goals and data quality

Introduction to Outliers and Detection Methods

Outliers in SNT

Outliers can represent either data quality issues or true epidemiological events. Their impact is magnified in subnational analysis where small numbers matter. For example:

A reporting error at one facility can distort district-level estimates
A real outbreak might be mistakenly “corrected” if not properly validated

Outlier Detection Methods

Three outlier detection methods are explored on this page, each of which varies in robustness and sensitivity:

Mean/standard deviation method: Best for normally distributed data without extreme values
Median/MAD method: More robust for skewed data common in malaria reporting
Interquartile range (IQR) method: Effective for volatile datasets with many small facilities

The provided code first applies detection methods to the conf variable representing confirmed malaria cases. Outliers in an aggregated total may indicate outliers in one or more of its disaggregated components. The plot below exemplifies this case. Investigating outliers in aggregated columns helps analysts identify which disaggregated columns require attention.

Step-by-Step

Below is a step-by-step walkthrough of outlier detection methods, with emphasis on comparison to help you visualize and use the methods appropriately. Each step includes notes to help you understand, adapt, and use the code.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Load libraries and data

First we install and load the packages and data required for this section. This page continues the use of the preprocessed Sierra Leone DHIS2 example data.

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install or load relevant packages
pacman::p_load(
  readxl,     # import and read Excel files
  dplyr,      # data manipulation
  ggplot2,    # plotting
  tidyr,      # data management
  patchwork,  # for arranging plots
  purrr,      # vector management
  viridis,    # colour-blind friendly palettes
  here        # for easy file referencing
)

clean_malaria_routine_data_final <-
 readRDS(
   here::here(
     "1.2_epidemiology/1.2a_routine_surveillance/raw",
     "clean_malaria_routine_data_final.rds"
   )
 )

Step 2: Outlier detection

Step 2.1: Calculate outlier statistics

We begin by defining the calculate_outlier_stats function to streamline the process of identifying points as outliers. The function calculates the various statistical metrics required to identify outliers.

calculate_outlier_stats <- function(df, column) {
  df |>
    dplyr::group_by(hf_uid, month) |>  # Facility-month grouping
    dplyr::filter(n() >= 3) |>  # Require minimum 3 observations per group
    dplyr::summarise(
      # Mean/SD metrics
      mean = mean(.data[[column]], na.rm = TRUE),
      sd = sd(.data[[column]], na.rm = TRUE),

      # Median/MAD metrics
      median = median(.data[[column]], na.rm = TRUE),
      mad = mad(.data[[column]], constant = 1, na.rm = TRUE),

      # IQR metrics
      Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
      Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
      .groups = "drop"
    ) |>
    dplyr::mutate(
      # Mean/SD bounds
      mean_lower = pmax(0, mean - 3 * sd),
      mean_upper = mean + 3 * sd,

      # Median/MAD bounds
      median_lower = pmax(0, median - 2.5 * mad),
      median_upper = median + 2.5 * mad,

      # IQR bounds
      iqr_lower = pmax(0, Q1 - 1.5 * (Q3 - Q1)),
      iqr_upper = Q3 + 1.5 * (Q3 - Q1)
    )
}

To adapt the code:

Line 20-30: (Advanced) Change the multiplier in each outlier detection method to modify the sensitivity of each method. The multipliers used here are 3, 2.5, and 1.5 for mean/SD, mean/median, and IQR detection respectively. These multipliers were selected based on statistical conventions.

Step 2.2: Flag outliers

Next we define the flag_outliers function to identify outliers based on the three methods and attach this information to the dataframe create in the Step 2.1. This code flags outliers based on the statistics calculated in the previous step.

flag_outliers <- function(df, stats_df, column) {
  df |>
    dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
    dplyr::mutate(
      outliers_moyenne = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
        TRUE ~ "normal"
      ),
      outliers_halper = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < median_lower | .data[[column]] > median_upper ~ "outlier",
        TRUE ~ "normal"
      ),
      outliers_IQR = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
        TRUE ~ "normal"
      )
    ) |>
    # Remove the intermediate calculation columns
    dplyr::select(-mean, -sd, -median, -mad, -Q1, -Q3,
                 -mean_lower, -mean_upper, -median_lower, -median_upper,
                 -iqr_lower, -iqr_upper)
}

conf_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf")

conf_results <- flag_outliers(clean_malaria_routine_data_final, conf_stats, "conf")

To adapt the code:

Line 28-29: Modify conf_resultsand conf_stats to reflect the variable you are performing outlier detection on. These are the names given to each function’s output. It is recommended to follow the variablename_results and variablename_results naming convention.
Line 28: Modify (clean_malaria_routine_data_final, "conf") to reflect the (df, "column") you are performing outlier detection on
Line 29: Modify (clean_malaria_routine_data_final, conf_stats, "conf") to reflect the (df, variablename_stats, "column") you are performing outlier detection on

Step 3: Visualize outliers

Step 3.1: Calculate outlier proportions

First we use the calculate_pct_labels function to calculate the proportion of conf values labelled outliers for each of the three detection methods.

calculate_pct_labels <- function(var) {
  non_na <- sum(!is.na(var))

  normal_pct <- round(sum(var == "normal", na.rm = TRUE)/non_na*100, 2)
  outlier_pct <- round(sum(var == "outlier", na.rm = TRUE)/non_na*100, 2)
  na_count <- sum(is.na(var))

  list(
    normal = paste0("Normal (", normal_pct, "%)"),
    outlier = paste0("Outliers (", outlier_pct, "%)"),
    missing = paste0("NA (n=", na_count, ")")
  )
}

Step 3.2: Outlier plotting function

Next we define the create_outlier_plots function which configures the plot style and format to visualize and compare outlier detection methods. The exampled code stratifies confirmed cases by year.

create_outlier_plot <-
  function(data, method, title, pct_labels, value_col = "conf") {
  ggplot2::ggplot(data,
                  aes(x = year,
                      y = .data[[value_col]],
                      color = .data[[method]])) +
    geom_point(alpha = 0.7) +
    scale_color_manual(
      values = c(
        "normal" = "grey70",
        "outlier" = "#FF0000",
        "NA" = "#ffffff"),
      labels = c(pct_labels$normal, pct_labels$outlier , pct_labels$missing),
      na.value = "#ffffff",
      drop = FALSE
    ) +
    labs(title = title, y = "Confirmed Cases", color = NULL) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      legend.position = "bottom",
      legend.direction = "vertical",
      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
    )
}

To adapt the code:

Line 17: Modify the y-axes label based on the column you are performing outlier detection on

Step 3.3: Generate outlier visualization plots

We now define a function called generate_outier_plots which combines the calculate_pct_labels function from Step 3.1 and the create_outlier_plot function from Step 3.2.

generate_outlier_plots <- function(data, column) {
  pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
  pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
  pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])

  create_plot <- function(data, method, title, pct_labels) {
    ggplot2::ggplot(data, aes(x = year,
                              y = .data[[column]],
                              color = .data[[method]])) +
      geom_point(alpha = 0.7) +
      scale_color_manual(
        values = c("normal" = "grey70", "outlier" = "#FF0000", "NA" = "#ffffff"),
        labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
        na.value = "#ffffff",
        drop = FALSE
      ) +
      labs(title = title, y = column, color = NULL) +
      theme_minimal() +
      theme(
        plot.title = element_text(hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "vertical",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
      )
  }

  p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
  p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
  p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)

  patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
    patchwork::plot_annotation(
      title = paste("Comparison of Outlier Detection Methods:", column),
      theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
)}

conf_outlier_plots <- generate_outlier_plots(conf_results, "conf")

Output

To adapt the code:

Line 37: Modify conf_outlier_plots to reflect the column you are performing outlier detection on. Remember to follow the variablename_outlier_plots convention.
Line 37: Modify the input of generate_outlier_plots to reflect the (variablename_results, "variablename") you are performing outlier detection on

Step 4: Outlier investigation

Consult SNT team

Outliers cannot be removed or corrected without investigation. Pinpoint observations categorized as outliers and present them to the SNT team. Oftentimes, the context within which a facility operates validates extreme values.

After outlier detection, the data must be investigated at a more granular level. By this point analysts should have selected which outlier detection method will be used. This page uses the median/MAD-detected outliers going forward. Analysts should choose the outlier detection method based on statistical knowledge, data context, and SNT team consultation.

Step 4.1: Investigate and export outlier data

The confirmed cases observations identified as outliers by the median/MAD method are first exported to a CSV file for further investigation and review by the SNT team.

# Filter for median method outliers
conf_median_outliers <- conf_results |>
  dplyr::filter(outliers_halper == "outlier") |>
  dplyr::arrange(desc(conf))

# Export outliers to CSV
outlier_file <- "conf_median_method_outliers.csv"
write.csv(conf_median_outliers, file = outlier_file, row.names = FALSE)

To adapt the code:

Line 2: Rename conf_median_outliers to match the variablename_method_outliers naming convention suggested on this page
Line 2 Change conf_results to variablename_results based on the column you are performing outlier detection on
Line 3: Replace outliers_halper with outliers_moyenne or outliers_IQR if you have chosen to proceed with mean detection or IQR detection respectively
Line 7: Change "conf_median_method_outliers.csv" to reflect the desired CSV output file name
Line 8: Change conf_median_outliers to reflect the variablename_method_outliers object defined in Line 2

Step 4.2: Outlier investigation function

This step defines a function to facilitate outlier investigation, including a diagnostic plot.

Show the code

plot_outlier_timeseries <- function(
    full_data,          # Original dataset
    outlier_row,        # Single row from outlier results
    time_var = "year",  # Time variable (year or date)
    value_var = "conf", # Value variable to plot
    highlight_year = NA # Year to focus on
) {

  # Get all conf* variables
  conf_vars <- grep("^conf", names(full_data), value = TRUE)

  # Plot colour palette
  component_palette <- c(
    RColorBrewer::brewer.pal(8, "Set2"),  # First 8 colors
    RColorBrewer::brewer.pal(4, "Dark2")    # Next 4 (avoid yellow)
  )[1:length(conf_vars)]  # Trim to needed length

  # Get all data for this facility for the focus year
  facility_data <- full_data |>
    dplyr::filter(hf_uid == outlier_row$hf_uid,
                  year == highlight_year) |>
    dplyr::mutate(
      date = as.Date(paste(year, month, "01", sep = "-")),
      is_outlier = ifelse(
        outliers_halper == "outlier" & year == highlight_year,
        "Outlier",
        "Normal"
      )
    ) |>
    tidyr::pivot_longer(
      cols = all_of(conf_vars),
      names_to = "conf_variable",
      values_to = "conf_value"
    )

  # Base plot
  p <- ggplot2::ggplot(facility_data, aes(x = date)) +
    # Plot all conf variables as dashed lines (except main variable)
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable != value_var),
      aes(y = conf_value, color = conf_variable, group = conf_variable),
      linetype = "dashed",
      alpha = 0.8,
      linewidth = 0.8
    ) +
    # Plot main variable as solid line
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, group = 1),
      color = "grey70",
      linewidth = 1.2
    ) +
    ggplot2::geom_point(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, fill = is_outlier),
      shape = 21, size = 3, color = "black"
    ) +
    ggplot2::expand_limits(y = 0) +
    ggplot2::scale_color_manual(
      name = "Component Variables",
      values = component_palette
    ) +
    ggplot2::scale_fill_manual(
      values = c("Outlier" = "red", "Normal" = "grey70"),
      name = paste(value_var, "monthly report classification")
    ) +
    ggplot2::labs(
      title = paste("Outlier and Components:", "Monthly", outlier_row$hf, "Reports for", value_var),
      subtitle = paste("HF UID:", outlier_row$hf_uid, "| Year:", highlight_year),
      x = "Month",
      y = "Cases",
      caption = ifelse(
        any(facility_data$is_outlier == "Outlier" &
            facility_data$conf_variable == value_var),
        paste("Red points indicate outlier values in", value_var),
        paste("No outliers detected in", value_var, "for this year")
      )
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "bottom",
      legend.box = "vertical"
    ) +
    ggplot2::scale_x_date(
      date_labels = "%b",
      date_breaks = "1 month"
    )

  return(p)
}

To adapt the code:

Line 5: Replace "conf" with the column you are performing outlier detection on. Options include "susp", "test". "maltreat", etc.
Line 10: Replace conf_vars to reflect the variablename_vars based on the column you are performing outlier detection on
Line 10: Replace "^conf" to reflect the "^variablename" of the column you are performing outlier detection on
Line 16: Replace conf_vars with the variablename_vars defined in Line 10
Lines 31-32: Modify "conf_variable" and "conf_value" to match the "variablename_variable" and "variablename_value" of the column you are performing outlier detection on
Line 40, 41, 48, 49, 55, 56: Similarly, replace all instances of conf_variable and conf_value based on the column you are performing outlier detection on

Step 4.3: Select and investigate a single outlier

A time series enables visualization of monthly reports within a given year to contextualize the classification of a point as an outlier. Recall that we calculated totals on the DHIS2 preprocessing page. This means that an outlier identified in the aggregated confirmed cases column conf may be indiciative of an outlier in one of it’s disaggregated components. Here we plot both the aggregated column and its components to determine the driving force of outliers in the aggregated column.

# Select an outlier to visualize
selected_outlier <- conf_median_outliers |>
  dplyr::filter(year == 2017) |>
  dplyr::slice_max(conf, n = 1)

# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
  full_data = conf_results,  # Need to use the flagged data here
  outlier_row = selected_outlier,
  value_var = "conf",
  highlight_year = 2017
)

Output

To adapt the code:

Line 2: Replace conf_median_outliers following the variablename_method_outliers naming convention
Line 2: Replace 2017 with the year of data observation to investigate
Line 8: Replace conf_results following the variablename_results output defined in Step 2.3
Line 10: Replace "conf" with the column you are performing outlier detection on. This is the same column specified in Step 2.3
Line 11: Replace 2017 with the year of data observation to investigate, also specified in Line 3 of this step

Context matters!

Outlier detection on the aggregated confirmed cases column conf reveals that it is conf_hf driving 2017 outliers at this particular facility. Remember that in the case of Sierra Leone, conf_hf too is an imputed column, calculated by summing the confirmed cases from each age group at health facilities. We also need to remember that in Sierra Leone, health facilities report confirmed cases for community health workers alongside their monthly reports. In Sierra Leone, there are far more health facilities reporting testing than community health workers. The time series shows that the highest disaggregated components in this reporting period are for conf_hf_u5, which is is almost equivalent to the conf_u5 line.

This means that, although the time series indicates outliers in conf_u5, it is the sub-component column conf_hf_u5 driving this outlier. Therefore outlier detection should be performed on conf_hf_u5. Without our analysis of the aggregated column, we would not be able to identify this column.

Step 5: Component outlier detection

Step 5.1: Detect component outliers

Given that the aggregated column outlier investigation indicates possible outliers in the conf_hf_u5 column, the next step is to perform outlier detection in this column. We can reuse the outlier detection function defined in Step 2. These outliers can be visualized in the same way as Step 3.

conf_hf_u5_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf_hf_u5")

conf_hf_u5_results <- flag_outliers(clean_malaria_routine_data_final, conf_hf_u5_stats, "conf_hf_u5")

conf_hf_u5_outlier_plots <- generate_outlier_plots(conf_hf_u5_results, "conf_hf_u5")

Output

To adapt the code:

Line 1, 3: Replace clean_malaria_routine_data_final with the name of your data frame
Line 1, 3, 5, 7: Replace all instances of conf_hf_u5_stats, conf_hf_u5_results, and conf_hf_u5_outlier_plots based on the disaggregated column you are performing outlier detection on. Remember that the variablename_stats, variablename_results, and variablename_outlier_plots is strongly recommended here

Step 5.2: Filter and export median component outliers

Since the outlier detection function includes outliers detected by all three methods, here we filter to only median/MAD-detected outliers. This file is then exported so it can be shared for review by the SNT team.

# Filter for median outlier detection column
conf_hf_u5_results <- conf_hf_u5_results |>
  dplyr::mutate(
    outlier_status_conf_u5 = outliers_halper,
    outlier_flag_conf_u5 = ifelse(outliers_halper == "outlier", TRUE, FALSE)
  ) |>
  dplyr::select(-outliers_moyenne, -outliers_IQR) # Remove unused method columns

# Filter for points detected as outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
  dplyr::filter(outliers_halper == "outlier") |>
  dplyr::arrange(desc(conf_hf_u5))

# save outliers as RDS for R users
saveRDS(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.rds"
  )
)

# save same dataset as parquet for python interop
arrow::write_parquet(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.parquet"
  )
)

# save same dataset as csv for human inspection in excel
write.csv(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.csv"
  ),
  row.names = FALSE
)

To adapt the code:

Line 1, 4, 5, 9, 11, 14, 15: Replace all instances of conf_hf_u5_results based on the disaggregated column you are performing outlier detection on. Remember that the variablename_results is strongly recommended here
Line 4, 5, 10: Replace outliers_halper with outliers_moyenne or outliers_IQR if you have chosen to proceed with mean detection or IQR detection respectively
Line 6: Replace (-outliers_moyenne, -outliers_IQR) with the two outlier methods you are not using for detection
Line 9, 15: Rename conf_hf_u5_median_outliers to match the variablename_method_outliers naming convention suggested on this page
Line 7: Change "conf_hf_u5_median_method_outliers.csv" to reflect the desired CSV output file name

Step 5.3: Component outlier investigation

From the aggregated column outlier investigation, we found outlier in monthly reports for facility hf_1756 on 2017. We can use this criteria to guide our investigation of the component column outliers.

# Select an outlier to visualize: pick the highest-conf outlier in 2017
# (data-driven so the example survives any hf_uid format)
selected_outlier_conf_hf_u5 <- conf_hf_u5_median_outliers |>
  dplyr::filter(year == 2017) |>
  dplyr::slice_max(conf_hf_u5, n = 1)

# Generate plot
outlier_timeseries_conf_hf_u5_2017 <- plot_outlier_timeseries(
  full_data = conf_hf_u5_results,
  outlier_row = selected_outlier_conf_hf_u5,
  value_var = "conf_hf_u5",
  highlight_year = 2017
)

Output

To adapt the code:

Line 2: Replace selected_outlier_conf_hf_u5 following the selected_outlier_variablename naming convention
Line 2: Replace conf_hf_u5_results with variablename_results based on the disaggregated column you are performing outlier detection on
Line 4: Replace the alphanumeric hf_uid with that of the facility of interest. The examples take the hf_uid directly from the Step 4.3 output
Line 5: Replace 2017 with the year of data observation to investigate
Line 9: Replace outlier_timeseries_conf_hf_u5_2017 following the outlier_timeseries_variablename_year convention used here
Line 10: Replace conf_hf_u5_results with the variablename_results based on the disaggregated column you are performing outlier detection on
Line 11: Replace selected_outlier_conf_hf_u5 with the selected_outlier_variablename defined in Line 2 of this step
Line 12: Replace "conf_hf_u5" with the disaggregated column you are performing outlier detection on.
Line 13: Replace 2017 with the year of data observation to investigate, also specified in Line 5 of this step

Consult SNT team

Step 5.5: Recategorize component outliers

Based on review with the SNT team, outliers may need to categorized. For example, extremely high values categorized as outliers may result from genuine spikes or outbreaks. In this case, we cannot leave the observation flagged as an outlier and correct it. This code recategorizes such points as normal, while still maintaing a history of the investigation and re-categorization.

Show the code

reflag_outliers <- function(df,
                           hf_name = NULL,
                           hf_uid = NULL,
                           year = NULL,
                           month = NULL,
                           outlier_metric = NULL,
                           new_status = "normal") {

  conditions <- list()
  if(!is.null(hf_name)) conditions$hf <- hf_name
  if(!is.null(hf_uid)) conditions$hf_uid <- hf_uid
  if(!is.null(year)) conditions$year <- year
  if(!is.null(month)) conditions$month <- month

  status_col <- paste0("outlier_status_", outlier_metric)
  flag_col <- paste0("outlier_flag_", outlier_metric)

  df |>
    dplyr::mutate(
      !!status_col := ifelse(
        purrr::reduce(
          purrr::map2(names(conditions), conditions, ~.x == .y),
          `&`
        ),
        new_status,
        .data[[status_col]]
      ),
      !!flag_col := ifelse(
        purrr::reduce(
          purrr::map2(names(conditions), conditions, ~.x == .y),
          `&`
        ),
        new_status == "outlier",
        .data[[flag_col]]
      )
    )
}

# Example usage
reflagged_conf_hf_u5_outlier_data <- conf_hf_u5_results |>
  reflag_outliers(
    hf_uid = "SL_45678",
    year = 2022,
    month = 10,
    outlier_metric = "conf_hf_u5",
    new_status = "investigated_normal"
  )

To adapt the code:

Line 43: Rename reflagged_conf_hf_u5_outlier_data and conf_hf_u5_results based on the columns you are working with.
Line 45: Replace "SL_45678" with the hf_uid of the health facility whose outlier needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3
Line 46: Replace 2022 with the year of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3
Line 47: Replace 10 to correspond with the month of the outlier that needs to be reflagged. This information can be found in the timeseries plot output from Step 5.3

Step 6: Loop outlier detection

Step 6.1: Define outlier detection loop function

Thorough investigation of outliers is strongly recommended. However, the code below provides the option to loop through the steps conducted on this page. This loop may result in overlooking details, errors, and other granular aspects of outlier detection. Use the loop function with caution.

Show the code

#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#'   - flagged_data: Full dataset with outlier flags
#'   - stats: Summary statistics for each indicator
#'   - plots: Outlier methods visualization plots
#'   - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {

  # Initialize storage
  results <- list(
    flagged_data = df,
    stats = list(),
    plots = list(),
    outliers = list()
  )

  # Loop through each indicator
  for (indicator in indicators) {

    # Skip if indicator doesn't exist
    if (!indicator %in% names(df)) {
      warning(paste("Indicator", indicator, "not found in dataframe. Skipping."))
      next
    }

    # 1. Calculate statistics by HF-month group
    stats <- calculate_outlier_stats(df, indicator)

    # 2. Flag outliers (keeping all method columns for plotting)
    flagged <- flag_outliers(df, stats, indicator)

    # 3. Add status and flag columns (without removing method columns)
    flagged <- flagged |>
      dplyr::mutate(
        !!paste0("outlier_status_", indicator) := outliers_halper,
        !!paste0("outlier_flag_", indicator) := ifelse(outliers_halper == "outlier", TRUE, FALSE)
      )

    # 4. Generate plots (using all method columns)
    plots <- generate_outlier_plots(flagged, indicator)

    # 5. Extract outliers
    outlier_records <- flagged |>
      dplyr::filter(outliers_halper == "outlier") |>
      dplyr::arrange(desc(.data[[indicator]]))

    # Store results
    results$stats[[indicator]] <- stats
    results$plots[[indicator]] <- plots
    results$outliers[[indicator]] <- outlier_records

    # Update main dataframe (only keep final status/flags)
    results$flagged_data <- results$flagged_data |>
      dplyr::left_join(
        flagged |>
          dplyr::select(hf_uid, year, month,
                       !!paste0("outlier_status_", indicator),
                       !!paste0("outlier_flag_", indicator)),
        by = c("hf_uid", "year", "month")
      )
  }

  return(results)
}

Step 6.2: Use outlier detection loop function

To use the outlier detection loop on multiple variables at once, follow the code below. This code exemplifies outlier detection performed on disaggregated test columns.

test_components <- c("test_hf_u5", "test_hf_5_14", "test_hf_ov15",
                    "test_com_u5", "test_com_5_14", "test_com_ov15")

test_results <- detect_outliers_loop(
  df = clean_malaria_routine_data_final,
  indicators = test_components,
  method = "median"
)

print(test_results$plots$test_hf_u5)

Output

Summary

This section has walked through multiple outlier detection methods using the confirmed malaria cases column “conf” as an example. We defined functions to identify, visualize, and investigate outliers to streamline the detection process. This code will help you detect outliers using appropriate methods based data and variables you are working with.

Once you have detected, validated, and possibly recategorized outliers in your data, your data frame will include all original columns of your data as well as an outlier column for each indicator. This new data frame will be used in the outlier correction page.

Full code

Show full code

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install or load relevant packages
pacman::p_load(
  readxl, # import and read Excel files
  dplyr, # data manipulation
  ggplot2, # plotting
  tidyr, # data management
  patchwork, # for arranging plots
  purrr, # vector management
  viridis, # colour-blind friendly palettes
  here # for easy file referencing
)

clean_malaria_routine_data_final <-
  readRDS(
    here::here(
      "1.2_epidemiology/1.2a_routine_surveillance/raw",
      "clean_malaria_routine_data_final.rds"
    )
  )

calculate_outlier_stats <- function(df, column) {
  df |>
    dplyr::group_by(hf_uid, month) |> # Facility-month grouping
    dplyr::filter(n() >= 3) |> # Require minimum 3 observations per group
    dplyr::summarise(
      # Mean/SD metrics
      mean = mean(.data[[column]], na.rm = TRUE),
      sd = sd(.data[[column]], na.rm = TRUE),

      # Median/MAD metrics
      median = median(.data[[column]], na.rm = TRUE),
      mad = mad(.data[[column]], constant = 1, na.rm = TRUE),

      # IQR metrics
      Q1 = quantile(.data[[column]], 0.25, na.rm = TRUE),
      Q3 = quantile(.data[[column]], 0.75, na.rm = TRUE),
      .groups = "drop"
    ) |>
    dplyr::mutate(
      # Mean/SD bounds
      mean_lower = pmax(0, mean - 3 * sd),
      mean_upper = mean + 3 * sd,

      # Median/MAD bounds
      median_lower = pmax(0, median - 2.5 * mad),
      median_upper = median + 2.5 * mad,

      # IQR bounds
      iqr_lower = pmax(0, Q1 - 1.5 * (Q3 - Q1)),
      iqr_upper = Q3 + 1.5 * (Q3 - Q1)
    )
}

flag_outliers <- function(df, stats_df, column) {
  df |>
    dplyr::left_join(stats_df, by = c("hf_uid", "month")) |>
    dplyr::mutate(
      outliers_moyenne = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < mean_lower | .data[[column]] > mean_upper ~ "outlier",
        TRUE ~ "normal"
      ),
      outliers_halper = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < median_lower | .data[[column]] > median_upper ~
          "outlier",
        TRUE ~ "normal"
      ),
      outliers_IQR = dplyr::case_when(
        is.na(.data[[column]]) ~ NA_character_,
        .data[[column]] < iqr_lower | .data[[column]] > iqr_upper ~ "outlier",
        TRUE ~ "normal"
      )
    ) |>
    # Remove the intermediate calculation columns
    dplyr::select(
      -mean,
      -sd,
      -median,
      -mad,
      -Q1,
      -Q3,
      -mean_lower,
      -mean_upper,
      -median_lower,
      -median_upper,
      -iqr_lower,
      -iqr_upper
    )
}

conf_stats <- calculate_outlier_stats(clean_malaria_routine_data_final, "conf")

conf_results <- flag_outliers(
  clean_malaria_routine_data_final,
  conf_stats,
  "conf"
)


calculate_pct_labels <- function(var) {
  non_na <- sum(!is.na(var))

  normal_pct <- round(sum(var == "normal", na.rm = TRUE) / non_na * 100, 2)
  outlier_pct <- round(sum(var == "outlier", na.rm = TRUE) / non_na * 100, 2)
  na_count <- sum(is.na(var))

  list(
    normal = paste0("Normal (", normal_pct, "%)"),
    outlier = paste0("Outliers (", outlier_pct, "%)"),
    missing = paste0("NA (n=", na_count, ")")
  )
}

create_outlier_plot <-
  function(data, method, title, pct_labels, value_col = "conf") {
    ggplot2::ggplot(
      data,
      aes(x = year, y = .data[[value_col]], color = .data[[method]])
    ) +
      geom_point(alpha = 0.7) +
      scale_color_manual(
        values = c(
          "normal" = "grey70",
          "outlier" = "#FF0000",
          "NA" = "#ffffff"
        ),
        labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
        na.value = "#ffffff",
        drop = FALSE
      ) +
      labs(title = title, y = "Confirmed Cases", color = NULL) +
      theme_minimal() +
      theme(
        plot.title = element_text(hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "vertical",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
      )
  }


generate_outlier_plots <- function(data, column) {
  pct_moyenne <- calculate_pct_labels(data[[paste0("outliers_moyenne")]])
  pct_halper <- calculate_pct_labels(data[[paste0("outliers_halper")]])
  pct_iqr <- calculate_pct_labels(data[[paste0("outliers_IQR")]])

  create_plot <- function(data, method, title, pct_labels) {
    ggplot2::ggplot(
      data,
      aes(x = year, y = .data[[column]], color = .data[[method]])
    ) +
      geom_point(alpha = 0.7) +
      scale_color_manual(
        values = c(
          "normal" = "grey70",
          "outlier" = "#FF0000",
          "NA" = "#ffffff"
        ),
        labels = c(pct_labels$normal, pct_labels$outlier, pct_labels$missing),
        na.value = "#ffffff",
        drop = FALSE
      ) +
      labs(title = title, y = column, color = NULL) +
      theme_minimal() +
      theme(
        plot.title = element_text(hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "vertical",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8)
      )
  }

  p1 <- create_plot(data, "outliers_moyenne", "Mean/SD Method", pct_moyenne)
  p2 <- create_plot(data, "outliers_halper", "Median/MAD Method", pct_halper)
  p3 <- create_plot(data, "outliers_IQR", "IQR Method", pct_iqr)

  patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
    patchwork::plot_annotation(
      title = paste("Comparison of Outlier Detection Methods:", column),
      theme = ggplot2::theme(plot.title = element_text(hjust = 0.5))
    )
}

conf_outlier_plots <- generate_outlier_plots(conf_results, "conf")


# Filter for median method outliers
conf_median_outliers <- conf_results |>
  dplyr::filter(outliers_halper == "outlier") |>
  dplyr::arrange(desc(conf))

# Export outliers to CSV
outlier_file <- "conf_median_method_outliers.csv"
write.csv(conf_median_outliers, file = outlier_file, row.names = FALSE)

plot_outlier_timeseries <- function(
  full_data, # Original dataset
  outlier_row, # Single row from outlier results
  time_var = "year", # Time variable (year or date)
  value_var = "conf", # Value variable to plot
  highlight_year = NA # Year to focus on
) {
  # Get all conf* variables
  conf_vars <- grep("^conf", names(full_data), value = TRUE)

  # Plot colour palette
  component_palette <- c(
    RColorBrewer::brewer.pal(8, "Set2"), # First 8 colors
    RColorBrewer::brewer.pal(4, "Dark2") # Next 4 (avoid yellow)
  )[1:length(conf_vars)] # Trim to needed length

  # Get all data for this facility for the focus year
  facility_data <- full_data |>
    dplyr::filter(hf_uid == outlier_row$hf_uid, year == highlight_year) |>
    dplyr::mutate(
      date = as.Date(paste(year, month, "01", sep = "-")),
      is_outlier = ifelse(
        outliers_halper == "outlier" & year == highlight_year,
        "Outlier",
        "Normal"
      )
    ) |>
    tidyr::pivot_longer(
      cols = all_of(conf_vars),
      names_to = "conf_variable",
      values_to = "conf_value"
    )

  # Base plot
  p <- ggplot2::ggplot(facility_data, aes(x = date)) +
    # Plot all conf variables as dashed lines (except main variable)
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable != value_var),
      aes(y = conf_value, color = conf_variable, group = conf_variable),
      linetype = "dashed",
      alpha = 0.8,
      linewidth = 0.8
    ) +
    # Plot main variable as solid line
    ggplot2::geom_line(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, group = 1),
      color = "grey70",
      linewidth = 1.2
    ) +
    ggplot2::geom_point(
      data = facility_data |> dplyr::filter(conf_variable == value_var),
      aes(y = conf_value, fill = is_outlier),
      shape = 21,
      size = 3,
      color = "black"
    ) +
    ggplot2::expand_limits(y = 0) +
    ggplot2::scale_color_manual(
      name = "Component Variables",
      values = component_palette
    ) +
    ggplot2::scale_fill_manual(
      values = c("Outlier" = "red", "Normal" = "grey70"),
      name = paste(value_var, "monthly report classification")
    ) +
    ggplot2::labs(
      title = paste(
        "Outlier and Components:",
        "Monthly",
        outlier_row$hf,
        "Reports for",
        value_var
      ),
      subtitle = paste(
        "HF UID:",
        outlier_row$hf_uid,
        "| Year:",
        highlight_year
      ),
      x = "Month",
      y = "Cases",
      caption = ifelse(
        any(
          facility_data$is_outlier == "Outlier" &
            facility_data$conf_variable == value_var
        ),
        paste("Red points indicate outlier values in", value_var),
        paste("No outliers detected in", value_var, "for this year")
      )
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "bottom",
      legend.box = "vertical"
    ) +
    ggplot2::scale_x_date(
      date_labels = "%b",
      date_breaks = "1 month"
    )

  return(p)
}

# Select an outlier to visualize
selected_outlier <- conf_median_outliers |>
  dplyr::filter(year == 2017) |>
  dplyr::slice_max(conf, n = 1)

# Generate plot
outlier_timeseries <- plot_outlier_timeseries(
  full_data = conf_results, # Need to use the flagged data here
  outlier_row = selected_outlier,
  value_var = "conf",
  highlight_year = 2017
)

conf_hf_u5_stats <- calculate_outlier_stats(
  clean_malaria_routine_data_final,
  "conf_hf_u5"
)

conf_hf_u5_results <- flag_outliers(
  clean_malaria_routine_data_final,
  conf_stats,
  "conf_hf_u5"
)

conf_hf_u5_outlier_plots <- generate_outlier_plots(
  conf_hf_u5_results,
  "conf_hf_u5"
)

print(conf_hf_u5_outlier_plots)

# Filter for median outlier detection column
conf_hf_u5_results <- conf_hf_u5_results |>
  dplyr::mutate(
    outlier_status_conf_u5 = outliers_halper,
    outlier_flag_conf_u5 = ifelse(outliers_halper == "outlier", TRUE, FALSE)
  ) |>
  dplyr::select(-outliers_moyenne, -outliers_IQR) # Remove unused method columns

# Filter for points detected as outliers
conf_hf_u5_median_outliers <- conf_hf_u5_results |>
  dplyr::filter(outliers_halper == "outlier") |>
  dplyr::arrange(desc(conf_hf_u5))

# save outliers as RDS for R users
saveRDS(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.rds"
  )
)

# save same dataset as parquet for python interop
arrow::write_parquet(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.parquet"
  )
)

# save same dataset as csv for human inspection in excel
write.csv(
  conf_hf_u5_median_outliers,
  here::here(
    "03_outputs/3.1_validation/tables",
    "conf_hf_u5_median_method_outliers.csv"
  ),
  row.names = FALSE
)

# Select an outlier to visualize: pick the highest-conf outlier in 2017
# (data-driven so the example survives any hf_uid format)
selected_outlier_conf_hf_u5 <- conf_hf_u5_median_outliers |>
  dplyr::filter(year == 2017) |>
  dplyr::slice_max(conf_hf_u5, n = 1)

# Generate plot
outlier_timeseries_conf_hf_u5_2017 <- plot_outlier_timeseries(
  full_data = conf_hf_u5_results, # Need to use the flagged data here
  outlier_row = selected_outlier_conf_hf_u5,
  value_var = "conf_hf_u5",
  highlight_year = 2017
)


reflag_outliers <- function(
  df,
  hf_name = NULL,
  hf_uid = NULL,
  year = NULL,
  month = NULL,
  new_status = "normal"
) {
  # Build filter conditions
  conditions <- list()
  if (!is.null(hf_name)) {
    conditions$hf <- hf_name
  }
  if (!is.null(hf_uid)) {
    conditions$hf_uid <- hf_uid
  }
  if (!is.null(year)) {
    conditions$year <- year
  }
  if (!is.null(month)) {
    conditions$month <- month
  }

  # Apply recategorization to ALL outlier flags
  df |>
    dplyr::mutate(
      dplyr::across(
        dplyr::matches("^outlier_status_"),
        ~ ifelse(
          purrr::reduce(
            purrr::map2(names(conditions), conditions, ~ .x == .y),
            `&`
          ),
          new_status,
          .
        )
      ),
      dplyr::across(
        dplyr::matches("^outlier_flag_"),
        ~ ifelse(
          purrr::reduce(
            purrr::map2(names(conditions), conditions, ~ .x == .y),
            `&`
          ),
          new_status == "outlier",
          .
        )
      )
    )
}

cleaned_data <- outlier_flagged_data |>
  reflag_outliers(
    hf_uid = "SL_45678",
    year = 2022,
    month = 10,
    new_status = "investigated_normal" # Custom status
  )


#' Detect outliers for multiple indicators using HF-month context
#'
#' @param df Dataframe containing malaria data
#' @param indicators Character vector of column names to analyze
#' @param method Outlier detection method ("median", "mean", or "iqr")
#'
#' @return List containing:
#'   - flagged_data: Full dataset with outlier flags
#'   - stats: Summary statistics for each indicator
#'   - plots: Outlier methods visualization plots
#'   - outliers: Dataframe of outlier records
# Revised loop function that preserves needed columns for plotting
detect_outliers_loop <- function(df, indicators, method = "median") {
  # Initialize storage
  results <- list(
    flagged_data = df,
    stats = list(),
    plots = list(),
    outliers = list()
  )

  # Loop through each indicator
  for (indicator in indicators) {
    # Skip if indicator doesn't exist
    if (!indicator %in% names(df)) {
      warning(paste(
        "Indicator",
        indicator,
        "not found in dataframe. Skipping."
      ))
      next
    }

    # 1. Calculate statistics by HF-month group
    stats <- calculate_outlier_stats(df, indicator)

    # 2. Flag outliers (keeping all method columns for plotting)
    flagged <- flag_outliers(df, stats, indicator)

    # 3. Add status and flag columns (without removing method columns)
    flagged <- flagged |>
      dplyr::mutate(
        !!paste0("outlier_status_", indicator) := outliers_halper,
        !!paste0("outlier_flag_", indicator) := ifelse(
          outliers_halper == "outlier",
          TRUE,
          FALSE
        )
      )

    # 4. Generate plots (using all method columns)
    plots <- generate_outlier_plots(flagged, indicator)

    # 5. Extract outliers
    outlier_records <- flagged |>
      dplyr::filter(outliers_halper == "outlier") |>
      dplyr::arrange(desc(.data[[indicator]]))

    # Store results
    results$stats[[indicator]] <- stats
    results$plots[[indicator]] <- plots
    results$outliers[[indicator]] <- outlier_records

    # Update main dataframe (only keep final status/flags)
    results$flagged_data <- results$flagged_data |>
      dplyr::left_join(
        flagged |>
          dplyr::select(
            hf_uid,
            year,
            month,
            !!paste0("outlier_status_", indicator),
            !!paste0("outlier_flag_", indicator)
          ),
        by = c("hf_uid", "year", "month")
      )
  }

  return(results)
}

test_components <- c(
  "test_hf_u5",
  "test_hf_5_14",
  "test_hf_ov15",
  "test_com_u5",
  "test_com_5_14",
  "test_com_ov15"
)

test_results <- detect_outliers_loop(
  df = clean_malaria_routine_data_final,
  indicators = test_components,
  method = "median"
)

print(test_results$plots$test_hf_u5)