Dev Site — You are viewing the development build. Go to Main Site

  • English
  • Français
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. Missing data detection methods
  • Code library for subnational tailoring
    English version
  • 1. Getting Started
    • 1.1 About and Contact Information
    • 1.2 For Everyone
    • 1.3 For the SNT Team
    • 1.4 For Analysts
    • 1.5 Producing High-Quality Outputs
  • 2. Data Assembly and Management
    • 2.1 Working with Shapefiles
      • Spatial data overview
      • Basic shapefile use and visualization
      • Shapefile management and customization
      • Merging shapefiles with tabular data
    • 2.2 Health Facilities Data
      • Fuzzy matching of names across datasets
      • Health facility coordinates and point data
    • 2.3 Routine Surveillance Data
      • Routine data extraction
      • DHIS2 data preprocessing
      • Determining active and inactive status
      • Contextual considerations
      • Missing data detection methods
      • Health facility reporting rate
      • Data coherency checks
      • Outlier detection methods
      • Imputation methods
      • Final database
    • 2.4 Stock Data
      • LMIS
    • 2.5 Population Data
      • National population data
      • WorldPop population raster
    • 2.6 National Household Survey Data
      • DHS data overview and preparation
      • Prevalence of malaria infection
      • All-cause child mortality
      • Treatment-seeking rates
      • ITN ownership, access, and usage
      • Wealth quintiles analysis
    • 2.7 Entomological Data
      • Entomological data
    • 2.8 Climate and Environmental Data
      • Climate and environment data extraction from raster
    • 2.9 Modeled Data
      • Generating spatial modeled estimates
      • Working with geospatial model estimates
      • Modeled estimates of malaria mortality and proxies
      • Modeled estimates of entomological indicators
  • 3. Stratification
    • 3.1 Epidemiological Stratification
      • Incidence overview and crude incidence
      • Incidence adjustment 1: incomplete testing
      • Incidence adjustment 2: incomplete reporting
      • Incidence adjustment 3: treatment-seeking
      • Incidence stratification
      • Prevalence and mortality stratification
      • Combined risk categorization
      • Risk categorization REMOVE?
      • Risk categorization REMOVE?
    • 3.2 Stratification of Determinants of Malaria Transmission
      • Seasonality
      • Access to Care
  • 4. Review of Past Interventions
    • 4.1 Case Management
    • 4.2 Routine Interventions
    • 4.3 Campaign Interventions
    • 4.4 Other Interventions
  • 5. Targeting of Interventions
  • 6. Retrospective Analysis
    • 6.1: Trend analysis
  • 7. Urban Microstratification

On this page

  • Overview
  • Understanding Missing Data
    • Understanding Types of Missing Data
  • Step-by-Step
    • Step 1: Install and Load Libraries
    • Step 1.2: Load Data
    • Step 2: Check Missingness Over Time
    • Step 3: Check Missingness Over Time and Administrative Level
    • Step 4: Check Missingness by Health Facility Type and Age Groups
    • Step 5: Visualizing Structural Zeros and Logical Relationships
  • Summary
  • Full code
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. Missing data detection methods

Missing data detection methods

Overview

Missing data is a persistent challenge in public health datasets and is particularly relevant to the SNT process, where multiple data sources must be integrated across time, administrative levels, and reporting platforms. The completeness and consistency of these data underpin risk stratification, resource allocation, and the evaluation of intervention impact.

This challenge is most acute in sub-Saharan Africa, where 95% of global malaria deaths occur and health information systems face the greatest structural pressures. In routine systems such as DHIS2, which serve as the backbone for malaria reporting across the region, missing data may follow recognizable patterns linked to health system functioning, though some degree of randomness can also be present. Not all missingness is due to error, and not all of it should or can be imputed.

Gaps in DHIS2 data are shaped by underlying structural constraints. These include internet outages, delayed uploads, limited supervision, register stockouts, staff shortages, and unclear handling of zeros versus missing values. These issues contribute to uneven reporting patterns across facility types and seasons. Hospitals tend to report less consistently than health posts, as seen in Senegal, where health posts maintained higher reporting completeness than hospitals over multiple years. Reporting is also more complete during the rainy season, with some months showing a 7.5% increase in submissions (Senegal). In Ghana, outpatient totals were frequently submitted while key malaria treatment indicators were left blank, creating selective gaps. Similarly, in Nigeria’s Gombe State, nearly 40% of expected monthly records were incomplete, and intervention-related indicators like IPTp showed disproportionately high rates of missingness. During campaigns or periods of high workload, parallel reporting systems and competing priorities further increase the risk of missing or conflicting data, as shown in Ethiopia and Tanzania, where fragmented systems and unclear use of blanks weakened data reliability.

These systematic gaps create analytical blind spots that compromise SNT outputs. Areas with weak surveillance may appear to have lower burden due to underreporting, while high-burden areas with poor connectivity can be underrepresented in resource allocation. If not addressed, missing data bias estimates, distort risk stratification, misrepresent coverage, and weaken the link between inputs and outcomes. They also reduce the stability of risk maps and the reliability of impact projections.

At the same time, not all missing data should be imputed. Some gaps are due to non-active HF status, such as a facility not yet operational. Others reflect values missing for reasons that cannot be inferred from the available data. In these cases, imputation is not appropriate. Analysts must first understand why data are missing. This involves assessing whether the missingness is completely at random, related to other observed variables, or related to the missing values themselves.

Given the complexity of these decisions, dropping incomplete records is rarely justified. Understanding the cause and patterns of missingness is essential for determining appropriate analytical approaches, whether that involves excluding certain records, applying logical inference, or implementing statistical methods that preserve the spatial and temporal structure of the data.

Objectives
  • Learn systematic approaches to detect and visualize missing data patterns in surveillance systems
  • Examine missingness across temporal, spatial, and demographic dimensions
  • Identify structural zeros, legitimate zeros, and logical inconsistencies in related indicators
  • Distinguish between MCAR, MAR, and MNAR missing data mechanisms
  • Apply diagnostic visualizations to assess data quality and reporting completeness
  • Generate comprehensive missing data assessments to inform analytical decisions

Understanding Missing Data

Before proceeding with any analytical approach, it is important to understand the types of missingness present in the dataset, and to identify the most appropriate methods for addressing each type, whether through exclusion, clinical logic, or statistical techniques.

Understanding Types of Missing Data

When working with incomplete data, it’s important to understand why values are missing. This determines whether imputation is appropriate.

Key Patterns to Examine to Help Determine the Type of Missingness
  • Temporal gaps: Are some months or seasons consistently underreported?
  • Spatial gaps: Are there consistent patterns across districts or regions? When facility data are sparse, higher-level patterns can inform analytical approaches using neighbouring facilities of the same type.
  • Facility-level gaps: Do reporting rates vary by facility type (e.g. hospitals vs health posts), ownership, or service volume?
  • Demographic gaps: Are data more frequently missing for specific age groups (e.g. under-fives, adolescents, pregnant women)?
  • Variable-level gaps: Are certain indicators (e.g. IPTp3, ACT availability) more frequently left blank?

These patterns help identify not just where missingness occurs but also why. Understanding these patterns is the first step in diagnosing whether missingness is Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR).

Once patterns are identified, use them to assess the mechanism behind the missing values. There are three main types of missingness, and understanding which type you’re dealing with helps you make informed analytical decisions about how to handle the missing data.

Not all gaps in the data reflect missingness. Some values are absent because they were never expected to be present in the first place. These reflect non-active HF status, not missing data. For example, a facility may not have been operational during a particular period, or may never have reported on a specific indicator. In such cases, the absence of data should be marked as not applicable and excluded from any imputation procedure.

  1. Missing Completely At Random (MCAR)

What it means: The data are missing for no particular reason. The missingness is unrelated to anything in the dataset.

Example: Suppose some monthly reports are missing because a stack of paper forms was lost in a flood. The flood didn’t target specific facilities or months, it was random. In this case, the missing data is MCAR.

Implication: Imputation is generally acceptable for MCAR data using simple methods. However, you can safely analyze the non-missing data. It’s still representative of the whole.

  1. Missing At Random (MAR)

What it means: The missingness is related to something else in the dataset, but not to the missing value itself.

Example: Imagine that reports of confirmed malaria cases are more likely to be missing from hospitals than health posts. But you know which facilities are hospitals and which are health posts. So the missingness is not random, but it is explainable based on another variable facility type.

Implication: You can use information you do have (like facility type, month, or region) to predict and impute the missing values. Most imputation methods assume MAR.

  1. Missing Not At Random (MNAR)

What it means: The missingness is related to the value that’s missing, and you can’t explain it using other information in your data.

Example: Let’s say some facilities don’t report malaria deaths because they fear it reflects badly on their performance. The missingness is not random and not explained by any variable in the dataset. You don’t know which deaths are missing, and you can’t reliably guess them.

Implication: Imputation is risky. The data are biased in a way you can’t correct without more information.

Type What Causes It Can Be Imputed? Example
MCAR Random external factors unrelated to any variable in the dataset Yes Monthly reports lost in a flood, no systematic pattern by location, time, or indicator
MAR Missingness depends on observed variables Yes Hospitals report less consistently than health posts, but facility type is known and recorded
MNAR Missingness depends on the unobserved value itself Not recommended Facilities underreport malaria deaths to avoid scrutiny; missingness cannot be explained using other variables
Consult with SNT Team

If you’re unsure whether missingness in your dataset is MCAR, MAR, or MNAR, consult the SNT team. Patterns that seem technical may reflect known operational issues (e.g., stockouts, parallel reporting during campaigns). Getting their input now can prevent incorrect imputation choices later.

Step-by-Step

This step-by-step section shows how to systematically detect missingness patterns using our Sierra Leone DHIS2 data (previously assembled in the DHIS2 data preprocessing page). It builds on the principles outlined above, emphasizing the importance of detecting and understanding missing data patterns.

The examples focus on identifying temporal, spatial, and systematic patterns in missingness that inform our classification decisions. Understanding these patterns is essential before proceeding to the Imputation Methods page, where we implement structural and statistical imputation techniques.

Step 1: Install and Load Libraries

First, install and/or load the necessary packages for data manipulation, visualization, and missing data detection.

  • R
  • Python
# Install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Load required packages using pacman
pacman::p_load(
  dplyr,        # data manipulation
  ggplot2,      # plotting
  ggtext,       # markdown text in ggplot2
  here,         # file path management
  lubridate,    # date handling
  naniar,       # missing data visualization and analysis
  tidyr,        # data reshaping
  cli,          # clean logging and CLI-style messages
  scales,       # number formatting
  wesanderson   # color palettes
)

To adapt the code:

  • Do not modify anything in the code above

Step 1.2: Load Data

Load the DHIS2 surveillance data and prepare it for missing data detection.

In this example we focus on the past five years (the data stops to December 2023), so filter out anything before 2019.

  • R
  • Python
# set up data path
data_path <- here::here("1.1.2_epidemiology", "1.1.2a_routine_surveillance")

# read the processed surveillance data
df_routine <- readRDS(
  here::here(data_path, "processed", "sle_routine_cases_processed.rds")
) |>
  # filter to the past five years
  dplyr::filter(year >= 2019) |>
  dplyr::mutate(
    ym = date,
    date = lubridate::ym(date)
   )

# preview the data
head(df_routine)
Output
# A tibble: 6 × 127
  adm0    adm1  adm2  adm3  orgunitlevel5 hf    date       month year  allout_u5
  <chr>   <chr> <chr> <chr> <chr>         <chr> <date>     <chr> <chr>     <dbl>
1 Sierra… Bo D… Bo C… Bo C… Aethel CHP    Aeth… 2019-01-01 01    2019        101
2 Sierra… Bo D… Bo C… Bo C… Aethel CHP    Aeth… 2019-02-01 02    2019         86
3 Sierra… Bo D… Bo C… Bo C… Aethel CHP    Aeth… 2019-03-01 03    2019        149
4 Sierra… Bo D… Bo C… Bo C… Aethel CHP    Aeth… 2019-04-01 04    2019         99
5 Sierra… Bo D… Bo C… Bo C… Aethel CHP    Aeth… 2019-05-01 05    2019         96
6 Sierra… Bo D… Bo C… Bo C… Aethel CHP    Aeth… 2019-06-01 06    2019         80
# ℹ 117 more variables: allout_ov5 <dbl>, maladm_hf_u5 <dbl>,
#   maladm_5_14 <dbl>, maladm_ov15 <dbl>, maldth_u5 <dbl>, maldth_5_14 <dbl>,
#   maldth_ov15 <dbl>, susp_u5_hf <dbl>, susp_5_14_hf <dbl>,
#   susp_ov15_hf <dbl>, susp_u5_com <dbl>, susp_5_14_com <dbl>,
#   susp_ov15_com <dbl>, maldth_fem_ov15 <dbl>, maldth_mal_ov15 <dbl>,
#   maldth_1_59m <dbl>, maldth_10_14 <dbl>, maldth_5_9 <dbl>,
#   tes_neg_rdt_u5_com <dbl>, tes_pos_rdt_u5_com <dbl>, …

To adapt the code:

  • Line 1: Update the data_path to match your folder structure.
  • Line 8: Filter your data to your period of interest.

Step 2: Check Missingness Over Time

In this step we check the missingness rate of all of our variables of interest across time. This analysis helps us understand temporal patterns in data availability and identify periods or variables with systematic missing data issues. The heatmap visualization provides an intuitive overview of data completeness across all malaria indicators for different age groups throughout the study period. The color intensity represents the percentage of missing values, allowing us to quickly identify problematic time periods or variables that may require special handling in subsequent analyses.

Initially, we want to check the following set of variables: malaria tests, suspected, presumed and confirmed cases and malaria treatments aggregated across all age groups, as well as the age-specific breakdowns.

  • R
  • Python
Show the code
# get the variables of interest (aggregated and age-specific)
vars <- c(
  "test",        # aggregated malaria tests
  "test_u5",
  "test_5_14",
  "test_ov15",
  "susp",        # aggregated suspected cases
  "susp_u5",
  "susp_5_14",
  "susp_ov15",
  "pres",        # aggregated presumed cases
  "pres_u5",
  "pres_5_14",
  "pres_ov15",
  "conf",        # aggregated confirmed cases
  "conf_u5",
  "conf_5_14",
  "conf_ov15",
  "maltreat",    # aggregated treatments
  "maltreat_u5",
  "maltreat_5_14",
  "maltreat_ov15"
)

# calculate missing rates by date for each variable
missing_rate_date <- df_routine |>
  dplyr::group_by(date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(vars),
      ~ mean(is.na(.x)) * 100
    ),
    .groups = "drop"
  ) |>
  dplyr::arrange(date) |>
  tidyr::pivot_longer(
    cols = -ym,
    names_to = "variables",
    values_to = "missing_rate"
  )

# Option to control the color scale range for the heatmap
# If TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# If FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only covers that range)
full_range <- FALSE

# Set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # Use full 0-100% range for color scale
} else {
  # Use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_date$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across variables
missing_plot_date <- ggplot2::ggplot(
  missing_rate_date,
  ggplot2::aes(
    y = variables,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "Variable",
    x = "",
    title = "The proportion of missing data for selected variables by year-month"
  )

# display the plot
missing_plot_date

# save plot
ggplot2::ggsave(
  plot = missing_plot_date,
  here::here("03_output/3a_figures/sle_many_vars_missing_plot_date.png"),
  width = 12,
  height = 7,
  dpi = 300
)
Output

To adapt the code:

  • Lines 1-22: Update the vars vector to match your dataset’s variable names. For a different country/dataset, modify variable names to reflect your indicators (e.g., for Ghana you might have cases_u5, tests_5_14, etc.).
  • Line 25: Replace df_routine with your dataset name if different.
  • Line 47: Set full_range <- TRUE if you want the color scale to always span 0-100% for consistent comparison across datasets/time periods. Set full_range <- FALSE to optimize the color scale to your actual data range for better visual contrast within the observed missingness patterns.
Using the sntutils Package

You can also reproduce this exact plot using the sntutils package, which provides additional features and customization options:

sntutils::reporting_rate_plot(
 data = df_routine,          # the dataset to use
 vars_of_interest = vars,    # a vector of vars to check
 x_var = "ym",               # the year-month var for the x axis
 use_reprate = FALSE,        # FALSE for missing rate, TRUE for reporting rate
 full_range = FALSE,         # use actual data range for fill
 target_language = "en"      # the language for plot labels
)

The sntutils::reporting_rate_plot() function includes additional parameters, such as specifying the language for plot labels and saving the output. See the sntutils documentation for more details.

Several indicators show consistent missingness across time. Between mid-2022 and early 2023, missingness sharply increased for several indicators, with test, maltreat, and conf reaching rates above 45%.

Step 3: Check Missingness Over Time and Administrative Level

In this step, we focus on a single indicator, malaria testing for under 5’s at the health facility level (test_hf_u5), to examine how missingness varies over time across districts (adm2 level). This allows us to identify subnational areas with consistently poor reporting and spot time periods where data availability dropped sharply.

Although the data is available at the health facility level—the most granular reporting unit—it is still important to assess missingness at higher administrative levels. This broader view becomes especially useful when facility-level data are sparse. In such cases, we may use fallback analysis approaches that draw on patterns from neighbouring facilities within the same district.

  • R
  • Python
Show the code
# calculate missing rates by year-month and adm2 for each variable
missing_rate_adm2 <- df_routine |>
  dplyr::group_by(ym, adm2) |>
  dplyr::summarise(
    missing_rate = mean(is.na(test_hf_u5)) * 100,
    .groups = "drop"
  )

# Option to control the color scale range for the heatmap
# If TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# If FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only covers that range)
full_range <- FALSE

# Set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # Use full 0-100% range for color scale
} else {
  # Use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_adm2$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across location
missing_plot_adm2 <- ggplot2::ggplot(
  missing_rate_adm2,
  ggplot2::aes(
    y = adm2,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "District",
    x = "",
    title = "The proportion of missing data for test_hf_u5 by year-month and adm2"
  )

# display the plot
missing_plot_adm2

# save plot
ggplot2::ggsave(
  plot = missing_plot_adm2,
  here::here("03_output/3a_figures/sle_test_missing_plot_adm2.png"),
  width = 12,
  height = 7,
  dpi = 300
)
Output

To adapt the code:

  • Line 2: Replace df_routine with your dataset name if different.
  • Line 5: Update the variable name (test_hf_u5) to match your dataset’s variable name of interest.
  • Line 14: Set full_range <- TRUE if you want the color scale to always span 0-100% for consistent comparison across datasets/time periods. Set full_range <- FALSE to optimize the color scale to your actual data range for better visual contrast within the observed missingness patterns.
Using the sntutils Package

You can also reproduce this exact plot using the sntutils package.

sntutils::reporting_rate_plot(
 data = df_routine,              # the dataset to use
 vars_of_interest = "test_hf_u5",      # the var to check
 x_var = "ym",                   # the year-month var for the x axis
 y_var = "adm2",                 # the location var for the y axis
 y_axis_label = "District",      # Specify y axis label (more readable)
 use_reprate = FALSE,            # FALSE for missing rate, TRUE for reporting rate
 full_range = FALSE,             # use actual data range for fill
 target_language = "en"          # the language for plot labels
)

Missing data for test_hf_u5 varies by district and over time. Freetown City Council, Port Loko City Council, Koidu New Sembehun Council, and Makeni City show consistently high missingness, especially from 2019 to mid-2021. Many districts also show increased missingness around mid to late 2022, suggesting a broader reporting issue. In contrast, districts like Falaba and Bonthe Municipal have lower and more stable missing rates. These patterns confirm that test_hf_u5 varies by both time and location.

This suggests that missingness in test_hf_u5 is likely Missing At Random (MAR), as it depends on observable factors such as district and reporting period. In such cases, stratified analysis approaches are appropriate. The district (adm2) level can also serve as a fallback unit for analysis, particularly when individual facility data are sparse, by drawing insights from patterns in neighbouring facilities within the same district.

Step 4: Check Missingness by Health Facility Type and Age Groups

Beyond temporal and geographic patterns, missingness may also vary systematically by additional factors, such as health facility type and age groups. Many routine surveillance systems collect data that is already disaggregated by these factors (e.g., test_hf_u5 representing malaria tests conducted at health facilities for children under 5).

The first step in determining these patterns is to disaggregate such pre-aggregated data into a long format, creating separate columns for indicators, facility types, and age groups. This long format allows us to examine how missingness varies across these important stratification variables and identify systematic reporting gaps that may be related to facility capacity, service delivery models, or demographic-specific data collection challenges.

  • R
  • Python
Show the code
# define core indicators to retain during disaggregation
core_indicators <- c(
  "adm0",
  "adm1",
  "adm2",
  "adm3",
  "hf",
  "hf_uid",
  "year",
  "month",
  "ym",
  "date"
)

# select core indicators and disaggregated variables
df_routine_disagg <- df_routine |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    dplyr::matches("_(com|hf)_(u5|5_14|ov15)$")
  )

# transform to long format with separate facility type and age group columns
df_routine_disagg_long <- df_routine_disagg |>
  dplyr::select(-maladm_hf_u5) |> # drop maladm
  tidyr::pivot_longer(
    cols = dplyr::matches("_(com|hf)_(u5|5_14|ov15)$"),
    names_to = c("indicator", "facility_type", "age_group"),
    names_pattern = "(.+)_(com|hf)_(u5|5_14|ov15)$",
    values_to = "value"
  ) |>
  # relable factors
  dplyr::mutate(
    age_group = factor(
      age_group,
      levels = c("u5", "5_14", "ov15"),
      labels = c("Under 5", "5 to 15", "Over 15")
    ),
    facility_type = factor(
      facility_type,
      levels = c("hf", "com"),
      labels = c("Health Facility", "Community")
    )
  )

# preview the disaggregated data structure
head(df_routine_disagg_long)
Output
# A tibble: 6 × 14
  adm0     adm1  adm2  adm3  hf    hf_uid year  month ym    date       indicator
  <chr>    <chr> <chr> <chr> <chr> <chr>  <chr> <chr> <chr> <date>     <chr>    
1 Sierra … Bo D… Bo C… Bo C… Aeth… hf_00… 2019  01    2019… 2019-01-01 test     
2 Sierra … Bo D… Bo C… Bo C… Aeth… hf_00… 2019  01    2019… 2019-01-01 test     
3 Sierra … Bo D… Bo C… Bo C… Aeth… hf_00… 2019  01    2019… 2019-01-01 test     
4 Sierra … Bo D… Bo C… Bo C… Aeth… hf_00… 2019  01    2019… 2019-01-01 test     
5 Sierra … Bo D… Bo C… Bo C… Aeth… hf_00… 2019  01    2019… 2019-01-01 test     
6 Sierra … Bo D… Bo C… Bo C… Aeth… hf_00… 2019  01    2019… 2019-01-01 test     
# ℹ 3 more variables: facility_type <fct>, age_group <fct>, value <dbl>

To adapt the code:

  • Lines 2-12: Update the core_indicators vector to match your dataset’s administrative and temporal variable names.
  • Line 16: Replace df_routine with your dataset name if different.
  • Lines 19, 26: Modify the regex pattern "_(com|hf)_(u5|5_14|ov15)$" to match your facility type and age group naming conventions (e.g., if your data uses different facility codes or age categories).

Now that we have disaggregated the data, we can examine how missingness varies across all indicators by facility type and age group. This check helps identify systematic patterns in data availability that may be related to service delivery models, facility capacity, or demographic-specific reporting challenges.

  • R
  • Python
Show the code
# prepare data for stacked bar chart with both age group and facility type
missing_present_data_facility <- df_routine_disagg_long |>
  dplyr::group_by(indicator, age_group, facility_type) |>
  dplyr::summarise(
    total_records = dplyr::n(),
    missing_count = sum(is.na(value)),
    present_count = sum(!is.na(value)),
    missing_pct = round(missing_count / total_records * 100, 2),
    present_pct = round(present_count / total_records * 100, 2),
    .groups = "drop"
  ) |>
  tidyr::pivot_longer(
    cols = c(missing_pct, present_pct),
    names_to = "status",
    values_to = "percentage",
    names_pattern = "(.+)_pct"
  ) |>
  dplyr::mutate(
    status = factor(
      status,
      levels = c("missing", "present"),
      labels = c("Missing", "Present")
    )
  )

# create stacked bar plot faceted by age group and facility type
age_facility_miss_plot <- missing_present_data_facility |>
  ggplot2::ggplot(ggplot2::aes(
    x = percentage,
    y = factor(age_group, levels = rev(levels(factor(age_group)))),
    fill = status
  )) +
  ggplot2::geom_col(alpha = 0.8) +
  ggplot2::facet_grid(indicator ~ facility_type) +
  ggplot2::scale_fill_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Present", "Missing")
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    x = "\nPercentage (%)",
    y = "Age Group (years)\n",
    fill = "Data Status"
  ) +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10)
  )

# display the plot
age_facility_miss_plot

# save plot
ggplot2::ggsave(
  plot = age_facility_miss_plot,
  here::here("03_output/3a_figures/sle_test_facility_age_missing_plot.png"),
  width = 12,
  height = 7,
  dpi = 300
)
Output

To adapt the code:

  • Line 2: Replace df_routine_disagg_long with your disaggregated dataset name if different.
  • Lines 18-22: Update the status factor levels and labels if you want different display names for missing/present data categories.
  • Lines 48-52: Modify the plot title and axis labels to match your analysis focus.

The above visualization shows substantial differences in missingness patterns between facility types, with community-level facilities showing considerably higher missing rates, around 70% on average. This pattern is consistent across indicators and age groups. Notably, within community facilities, under-five data is more frequently reported than data for older age groups, a pattern observed across key indicators. These trends point to systematic differences in data collection and reporting capacity between facility types.

Step 5: Visualizing Structural Zeros and Logical Relationships

Before classifying missing values, we visualize the relationships between related indicators. This plot highlights likely structural zeros (e.g. no testing with missing confirmation), possible structural zeros (e.g. confirmed cases equal to zero with missing test values), and logical inconsistencies (e.g. confirmed cases recorded but testing data missing). These patterns reflect the clinical care pathway and help identify where values can be logically inferred, structurally imputed as zero, or require statistical methods.

Specifically:

  • test = 0, conf missing: Likely legitimate zero (no testing = confirmation not possible).
  • conf = 0, test missing: Possibly legitimate zero (no confirmed cases suggests no positives, but testing may have occurred).
  • conf > 0, test missing: Logical inconsistency (confirmed cases imply testing occurred, value should be imputed).
  • test > 0, conf missing: Possibly legitimate zero (testing occurred but no positive cases; confirmation may be legitimately zero or missing).
  • R
  • Python

Note: Below, we use the naniar package to explore missingness. Unlike geom_point(), which silently drops missing values, geom_miss_point() shows where one or both variables are missing by shifting them to plot margins. This makes it easier to detect structural zeros, reporting gaps, and logical inconsistencies.

Show the code
# transform data to wide format to examine relationships between indicators
df_wide <- df_routine_disagg_long |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    facility_type,
    age_group,
    indicator,
    value
  ) |>
  tidyr::pivot_wider(
    names_from = indicator,
    values_from = value
  )

# calculate comprehensive missing counts for subtitle
n_test_missing_conf_zero <- df_wide |>
  dplyr::filter(conf == 0, is.na(test)) |>
  nrow()

n_conf_missing_test_zero <- df_wide |>
  dplyr::filter(test == 0, is.na(conf)) |>
  nrow()

n_test_na_conf_positive <- df_wide |>
  dplyr::filter(is.na(test), conf > 0) |>
  nrow()

n_conf_na_test_positive <- df_wide |>
  dplyr::filter(is.na(conf), test > 0) |>
  nrow()

# create comprehensive subtitle with all missing count scenarios
subtitle_text <- paste0(
  "N = ",
  scales::comma(n_test_missing_conf_zero),
  " where conf = 0 but test is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_conf_missing_test_zero),
  " where test = 0 but conf is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_test_na_conf_positive),
  " where test is missing but conf > 0 (logical inconsistency); \n",
  "N = ",
  scales::comma(n_conf_na_test_positive),
  " where conf is missing but test > 0 (possible legitimate zero)"
)

# create scatter plot showing relationship between test and confirmed counts
structural_zeros_plot <- ggplot2::ggplot(
  data = df_wide,
  mapping = ggplot2::aes(x = test, y = conf)
) +
  naniar::geom_miss_point() +
  ggplot2::geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::geom_hline(
    yintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_y_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_color_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Not Missing", "Missing"),
    labels = c("Present", "Missing")
  ) +
  ggplot2::guides(
    color = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      override.aes = list(
        size = 5
      ),
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Structural Zeros: Testing vs Confirmation",
    subtitle = subtitle_text,
    x = "\nTests Conducted",
    y = "Cases Confirmed\n",
    color = "Data Status"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    plot.subtitle = ggplot2::element_text(size = 9, color = "grey30")
  )


# display the plot
structural_zeros_plot

# save plot
ggplot2::ggsave(
  plot = structural_zeros_plot,
  here::here("03_output/3a_figures/sle_test_conf_check_zeros_plot.png"),
  width = 12,
  height = 7,
  dpi = 300
)
Output

To adapt the code:

  • Lines 16, 20, 24, 28: Update the variable names (test, conf) in the filter conditions to match your dataset’s variable names of interest.
  • Lines 3-6: Update the core_indicators and column selection to match your dataset structure.
  • Lines 9-12: Modify the pivot_wider() transformation to include the specific indicators you want to examine for structural relationships.
  • Line 52: Change the x and y aesthetic mappings to examine different variable pairs (e.g., x = maltreat, y = conf to examine treatment vs confirmation relationships).

The plot shows key patterns in missingness between malaria testing (test) and confirmation (conf) counts. The vertical dashed line marks zero tests; the horizontal dashed line marks zero confirmations. This visualization shows there were 1,769 instances where conf is missing but test > 0. These are likely structural zeros, where testing was done but no confirmed cases were recorded. In such cases, missing confirmation counts can reasonably be set to zero.

Summary

This section introduced approaches for visualising and diagnosing missing data in routine surveillance systems. By examining patterns across time, geography, facility type, and indicator relationships, we showed how missingness can be systematically assessed rather than treated as random gaps. The section demonstrated how simple plots and logical checks help distinguish between true missing values and structural reporting issues, providing a clearer understanding of the data’s reliability. These methods establish a foundation for classifying the nature of missingness and for selecting appropriate strategies to addressing them.

Full code

Find the full code script for missing data detection methods below.

  • R
  • Python
Show full code
################################################################################
################# ~ Missing data detection methods full code ~ #################
################################################################################

### Step 1: Install and Load Libraries -----------------------------------------

# Install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Load required packages using pacman
pacman::p_load(
  dplyr,        # data manipulation
  ggplot2,      # plotting
  ggtext,       # markdown text in ggplot2
  here,         # file path management
  lubridate,    # date handling
  naniar,       # missing data visualization and analysis
  tidyr,        # data reshaping
  cli,          # clean logging and CLI-style messages
  scales,       # number formatting
  wesanderson   # color palettes
)

### Step 1.2: Load Data --------------------------------------------------------

# set up data path
data_path <- here::here("1.1.2_epidemiology", "1.1.2a_routine_surveillance")

# read the processed surveillance data
df_routine <- readRDS(
  here::here(data_path, "processed", "sle_routine_cases_processed.rds")
) |>
  # filter to the past five years
  dplyr::filter(year >= 2019) |>
  dplyr::mutate(
    ym = date,
    date = lubridate::ym(date)
   )

# preview the data
head(df_routine)

### Step 2: Check Missingness Over Time ----------------------------------------

# get the variables of interest (aggregated and age-specific)
vars <- c(
  "test",        # aggregated malaria tests
  "test_u5",
  "test_5_14",
  "test_ov15",
  "susp",        # aggregated suspected cases
  "susp_u5",
  "susp_5_14",
  "susp_ov15",
  "pres",        # aggregated presumed cases
  "pres_u5",
  "pres_5_14",
  "pres_ov15",
  "conf",        # aggregated confirmed cases
  "conf_u5",
  "conf_5_14",
  "conf_ov15",
  "maltreat",    # aggregated treatments
  "maltreat_u5",
  "maltreat_5_14",
  "maltreat_ov15"
)

# calculate missing rates by date for each variable
missing_rate_date <- df_routine |>
  dplyr::group_by(date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(vars),
      ~ mean(is.na(.x)) * 100
    ),
    .groups = "drop"
  ) |>
  dplyr::arrange(date) |>
  tidyr::pivot_longer(
    cols = -ym,
    names_to = "variables",
    values_to = "missing_rate"
  )

# Option to control the color scale range for the heatmap
# If TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# If FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only covers that range)
full_range <- FALSE

# Set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # Use full 0-100% range for color scale
} else {
  # Use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_date$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across variables
missing_plot_date <- ggplot2::ggplot(
  missing_rate_date,
  ggplot2::aes(
    y = variables,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "Variable",
    x = "",
    title = "The proportion of missing data for selected variables by year-month"
  )

# display the plot
missing_plot_date

# save plot
ggplot2::ggsave(
  plot = missing_plot_date,
  here::here("03_output/3a_figures/sle_many_vars_missing_plot_date.png"),
  width = 12,
  height = 7,
  dpi = 300
)

### Step 2: Check Missingness Over Time ----------------------------------------

sntutils::reporting_rate_plot(
 data = df_routine,          # the dataset to use
 vars_of_interest = vars,    # a vector of vars to check
 x_var = "ym",               # the year-month var for the x axis
 use_reprate = FALSE,        # FALSE for missing rate, TRUE for reporting rate
 full_range = FALSE,         # use actual data range for fill
 target_language = "en"      # the language for plot labels
)

### Step 3: Check Missingness Over Time and Administrative Level ---------------

# calculate missing rates by year-month and adm2 for each variable
missing_rate_adm2 <- df_routine |>
  dplyr::group_by(ym, adm2) |>
  dplyr::summarise(
    missing_rate = mean(is.na(test_hf_u5)) * 100,
    .groups = "drop"
  )

# Option to control the color scale range for the heatmap
# If TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# If FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only covers that range)
full_range <- FALSE

# Set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # Use full 0-100% range for color scale
} else {
  # Use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_adm2$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across location
missing_plot_adm2 <- ggplot2::ggplot(
  missing_rate_adm2,
  ggplot2::aes(
    y = adm2,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "District",
    x = "",
    title = "The proportion of missing data for test_hf_u5 by year-month and adm2"
  )

# display the plot
missing_plot_adm2

# save plot
ggplot2::ggsave(
  plot = missing_plot_adm2,
  here::here("03_output/3a_figures/sle_test_missing_plot_adm2.png"),
  width = 12,
  height = 7,
  dpi = 300
)

### Step 3: Check Missingness Over Time and Administrative Level ---------------

sntutils::reporting_rate_plot(
 data = df_routine,              # the dataset to use
 vars_of_interest = "test_hf_u5",      # the var to check
 x_var = "ym",                   # the year-month var for the x axis
 y_var = "adm2",                 # the location var for the y axis
 y_axis_label = "District",      # Specify y axis label (more readable)
 use_reprate = FALSE,            # FALSE for missing rate, TRUE for reporting rate
 full_range = FALSE,             # use actual data range for fill
 target_language = "en"          # the language for plot labels
)

### Step 4: Check Missingness by Health Facility Type and Age Groups -----------

# define core indicators to retain during disaggregation
core_indicators <- c(
  "adm0",
  "adm1",
  "adm2",
  "adm3",
  "hf",
  "hf_uid",
  "year",
  "month",
  "ym",
  "date"
)

# select core indicators and disaggregated variables
df_routine_disagg <- df_routine |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    dplyr::matches("_(com|hf)_(u5|5_14|ov15)$")
  )

# transform to long format with separate facility type and age group columns
df_routine_disagg_long <- df_routine_disagg |>
  dplyr::select(-maladm_hf_u5) |> # drop maladm
  tidyr::pivot_longer(
    cols = dplyr::matches("_(com|hf)_(u5|5_14|ov15)$"),
    names_to = c("indicator", "facility_type", "age_group"),
    names_pattern = "(.+)_(com|hf)_(u5|5_14|ov15)$",
    values_to = "value"
  ) |>
  # relable factors
  dplyr::mutate(
    age_group = factor(
      age_group,
      levels = c("u5", "5_14", "ov15"),
      labels = c("Under 5", "5 to 15", "Over 15")
    ),
    facility_type = factor(
      facility_type,
      levels = c("hf", "com"),
      labels = c("Health Facility", "Community")
    )
  )

# preview the disaggregated data structure
head(df_routine_disagg_long)

### Step 4: Check Missingness by Health Facility Type and Age Groups -----------

# prepare data for stacked bar chart with both age group and facility type
missing_present_data_facility <- df_routine_disagg_long |>
  dplyr::group_by(indicator, age_group, facility_type) |>
  dplyr::summarise(
    total_records = dplyr::n(),
    missing_count = sum(is.na(value)),
    present_count = sum(!is.na(value)),
    missing_pct = round(missing_count / total_records * 100, 2),
    present_pct = round(present_count / total_records * 100, 2),
    .groups = "drop"
  ) |>
  tidyr::pivot_longer(
    cols = c(missing_pct, present_pct),
    names_to = "status",
    values_to = "percentage",
    names_pattern = "(.+)_pct"
  ) |>
  dplyr::mutate(
    status = factor(
      status,
      levels = c("missing", "present"),
      labels = c("Missing", "Present")
    )
  )

# create stacked bar plot faceted by age group and facility type
age_facility_miss_plot <- missing_present_data_facility |>
  ggplot2::ggplot(ggplot2::aes(
    x = percentage,
    y = factor(age_group, levels = rev(levels(factor(age_group)))),
    fill = status
  )) +
  ggplot2::geom_col(alpha = 0.8) +
  ggplot2::facet_grid(indicator ~ facility_type) +
  ggplot2::scale_fill_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Present", "Missing")
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    x = "\nPercentage (%)",
    y = "Age Group (years)\n",
    fill = "Data Status"
  ) +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10)
  )

# display the plot
age_facility_miss_plot

# save plot
ggplot2::ggsave(
  plot = age_facility_miss_plot,
  here::here("03_output/3a_figures/sle_test_facility_age_missing_plot.png"),
  width = 12,
  height = 7,
  dpi = 300
)

### Step 5: Visualizing Structural Zeros and Logical Relationships -------------

# transform data to wide format to examine relationships between indicators
df_wide <- df_routine_disagg_long |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    facility_type,
    age_group,
    indicator,
    value
  ) |>
  tidyr::pivot_wider(
    names_from = indicator,
    values_from = value
  )

# calculate comprehensive missing counts for subtitle
n_test_missing_conf_zero <- df_wide |>
  dplyr::filter(conf == 0, is.na(test)) |>
  nrow()

n_conf_missing_test_zero <- df_wide |>
  dplyr::filter(test == 0, is.na(conf)) |>
  nrow()

n_test_na_conf_positive <- df_wide |>
  dplyr::filter(is.na(test), conf > 0) |>
  nrow()

n_conf_na_test_positive <- df_wide |>
  dplyr::filter(is.na(conf), test > 0) |>
  nrow()

# create comprehensive subtitle with all missing count scenarios
subtitle_text <- paste0(
  "N = ",
  scales::comma(n_test_missing_conf_zero),
  " where conf = 0 but test is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_conf_missing_test_zero),
  " where test = 0 but conf is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_test_na_conf_positive),
  " where test is missing but conf > 0 (logical inconsistency); \n",
  "N = ",
  scales::comma(n_conf_na_test_positive),
  " where conf is missing but test > 0 (possible legitimate zero)"
)

# create scatter plot showing relationship between test and confirmed counts
structural_zeros_plot <- ggplot2::ggplot(
  data = df_wide,
  mapping = ggplot2::aes(x = test, y = conf)
) +
  naniar::geom_miss_point() +
  ggplot2::geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::geom_hline(
    yintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_y_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_color_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Not Missing", "Missing"),
    labels = c("Present", "Missing")
  ) +
  ggplot2::guides(
    color = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      override.aes = list(
        size = 5
      ),
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Structural Zeros: Testing vs Confirmation",
    subtitle = subtitle_text,
    x = "\nTests Conducted",
    y = "Cases Confirmed\n",
    color = "Data Status"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    plot.subtitle = ggplot2::element_text(size = 9, color = "grey30")
  )

# display the plot
structural_zeros_plot

# save plot
ggplot2::ggsave(
  plot = structural_zeros_plot,
  here::here("03_output/3a_figures/sle_test_conf_check_zeros_plot.png"),
  width = 12,
  height = 7,
  dpi = 300
)
 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved