Dev Site — You are viewing the development build. Go to Main Site

  • English
  • Français
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. Missing data detection methods
  • Code library for subnational tailoring
    English version
  • 1. Getting Started
    • 1.1 About and Contact Information
    • 1.2 For Everyone
    • 1.3 For the SNT Team
    • 1.4 For Analysts
    • 1.5 Acronyms and Resource Library
    • 1.6 Producing High-Quality Outputs
  • 2. Data Assembly and Management
    • 2.1 Working with Shapefiles
      • Spatial data overview
      • Basic shapefile use and visualization
      • Shapefile management and customization
      • Merging shapefiles with tabular data
    • 2.2 Health Facilities Data
      • Fuzzy matching of names across datasets
      • Health facility coordinates and point data
    • 2.3 Routine Surveillance Data
      • Determining active and inactive status
      • Routine data extraction
      • DHIS2 data preprocessing
      • Missing data detection methods
      • Health facility reporting rate
      • Contextual considerations
      • Data coherency checks
      • Outlier detection methods
      • Imputation methods
      • Final database
    • 2.4 Stock Data
      • LMIS
    • 2.5 Population Data
      • National population data
      • WorldPop population raster
    • 2.6 National Household Survey Data
      • DHS data overview and preparation
      • Prevalence of malaria infection
      • All-cause child mortality
      • Treatment-seeking rates
      • ITN ownership, access, and usage
      • Wealth quintiles analysis
    • 2.7 Entomological Data
      • Entomological data
    • 2.8 Climate and Environmental Data
      • Climate and environment data extraction from raster
    • 2.9 Modeled Data
      • Generating spatial modeled estimates
      • Working with geospatial model estimates
      • Modeled estimates of malaria mortality and proxies
      • Modeled estimates of entomological indicators
    • 2.10 Cost Data
  • 3. Situation Analysis
    • 3.1 Review of Past Interventions
      • Case Management
      • Routine Interventions
      • Mass ITN Campaigns
      • Chemoprevention Campaigns
      • Other Vector Control
    • 3.2 Trend Analysis
    • 3.3 Risk Factors
    • 3.4 Impact Evaluation
    • 3.5 Cost Analysis
  • 4. Stratification
    • 4.1 Epidemiological Stratification
      • Incidence overview and crude incidence
      • Incidence adjustment 1: incomplete testing
      • Incidence adjustment 2: incomplete reporting
      • Incidence adjustment 3: treatment-seeking
      • Incidence stratification
      • Prevalence and mortality stratification
      • Combined risk categorization
      • Risk categorization REMOVE?
      • Risk categorization REMOVE?
    • 4.2 Access to Care
    • 4.3 Seasonality
      • Defining Seasonal Areas
      • Durations of Seasonality
    • 4.4 Urban Microstratification
  • 5. Intervention Targeting and Prioritization
    • 5.1 Intervention Targeting
    • 5.2 Prioritization
    • 5.3 Optimization under Limited Resources

On this page

  • Overview
  • Understanding Missing Data
    • Understanding Types of Missing Data
  • Step-by-Step
    • Step 1: Install and Load Libraries
    • Step 2: Load Data
    • Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness
      • Step 3.1: Conditional missingness within submitted reports
      • Step 3.2: Quantitative summary across variables
    • Step 4: Check Missingness over Time
    • Step 5: Check Missingness over Time and Administrative Level
    • Step 6: Check Missingness by Health Facility Type and Age Groups
    • Step 7: Visualizing Structural Zeros and Logical Relationships
  • Summary
  • Further Reading
  • Full Code
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. Missing data detection methods

Missing data detection methods

Overview

Missing data is a persistent challenge in public health datasets and is particularly relevant to the SNT process, where multiple data sources must be integrated across time, administrative levels, and reporting platforms. The completeness and consistency of these data underpin risk stratification, resource allocation, and the evaluation of intervention impact.

This challenge is most acute in sub-Saharan Africa, where 95% of global malaria deaths occur and health information systems face the greatest structural pressures. In routine systems such as DHIS2, which serve as the backbone for malaria reporting across the region, missing data may follow recognizable patterns linked to health system functioning, though some degree of randomness can also be present. Not all missingness is due to error, and not all of it should or can be imputed.

Gaps in DHIS2 data are shaped by underlying structural constraints. These include internet outages, delayed uploads, limited supervision, register stockouts, staff shortages, and unclear handling of zeros versus missing values. These issues contribute to uneven reporting patterns across facility types and seasons. Hospitals tend to report less consistently than health posts, as seen in Senegal, where health posts maintained higher reporting completeness than hospitals over multiple years. Reporting is also more complete during the rainy season, with some months showing a 7.5% increase in submissions (Senegal). In Ghana, outpatient totals were frequently submitted while key malaria treatment indicators were left blank, creating selective gaps. Similarly, in Nigeria’s Gombe State, nearly 40% of expected monthly records were incomplete, and intervention-related indicators like IPTp showed disproportionately high rates of missingness. During campaigns or periods of high workload, parallel reporting systems and competing priorities further increase the risk of missing or conflicting data, as shown in Ethiopia and Tanzania, where fragmented systems and unclear use of blanks weakened data reliability.

These systematic gaps create analytical blind spots that compromise SNT outputs. Areas with weak surveillance may appear to have lower burden due to underreporting, while high-burden areas with poor connectivity can be underrepresented in resource allocation. If not addressed, missing data bias estimates, distort risk stratification, misrepresent coverage, and weaken the link between inputs and outcomes. They also reduce the stability of risk maps and the reliability of impact projections.

At the same time, not all missing data should be imputed. Some gaps are due to non-active HF status, such as a facility not yet operational. Others reflect values missing for reasons that cannot be inferred from the available data. In these cases, imputation is not appropriate. We must first understand why data are missing. This involves assessing whether the missingness is completely at random, related to other observed variables, or related to the missing values themselves.

Given the complexity of these decisions, dropping incomplete records is rarely justified. Understanding the cause and patterns of missingness is necessary for determining appropriate analytical approaches, whether that involves excluding certain records, applying logical inference, or implementing statistical methods that preserve the spatial and temporal structure of the data.

NoteObjectives
  • Learn systematic approaches to detect and visualize missing data patterns in surveillance systems
  • Examine missingness across temporal, spatial, and demographic dimensions
  • Identify structural zeros, legitimate zeros, and logical inconsistencies in related indicators
  • Distinguish between MCAR, MAR, and MNAR missing data mechanisms
  • Apply diagnostic visualizations to assess data quality and reporting completeness
  • Generate comprehensive missing data assessments to inform analytical decisions

Understanding Missing Data

Before proceeding with any analytical approach, it is important to understand the types of missingness present in the dataset, and to identify the most appropriate methods for addressing each type, whether through exclusion, clinical logic, or statistical techniques.

Understanding Types of Missing Data

When working with incomplete data, it’s important to understand why values are missing. This determines whether imputation is appropriate.

TipKey patterns to examine to help determine the type of missingness
  • Temporal gaps: Are some months or seasons consistently underreported?
  • Spatial gaps: Are there consistent patterns across districts or regions? When facility data are sparse, higher-level patterns can inform analytical approaches using neighbouring facilities of the same type.
  • Facility-level gaps: Do reporting rates vary by facility type (e.g., hospitals vs health posts), ownership, or service volume?
  • Demographic gaps: Are data more frequently missing for specific age groups (e.g., under-fives, adolescents, pregnant women)?
  • Variable-level gaps: Are certain indicators (e.g., IPTp3, ACT availability) more frequently left blank?

These patterns help identify where missingness occurs and why it occurs. Understanding these patterns is the first step in diagnosing whether missingness is Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR).

Once patterns are identified, use them to assess the mechanism behind the missing values. There are three main types of missingness, and understanding which type we are dealing with helps us make informed analytical decisions about how to handle the missing data.

Not all gaps in the data reflect missingness. Some values are absent because they were never expected to be present in the first place. These reflect non-active HF status, not missing data. For example, a facility may not have been operational during a particular period, or may never have reported on a specific indicator. In such cases, the absence of data should be marked as not applicable and excluded from any imputation procedure.

  1. Missing Completely At Random (MCAR)

What it means: The data are missing for no particular reason. The missingness is unrelated to anything in the dataset.

Example: Suppose some monthly reports are missing because a stack of paper forms was lost in a flood. The flood didn’t target specific facilities or months, it was random. In this case, the missing data is MCAR.

Implication: Imputation is generally acceptable for MCAR data using simple methods. However, we can safely analyze the non-missing data. It is still representative of the whole.

  1. Missing At Random (MAR)

What it means: The missingness is related to something else in the dataset, but not to the missing value itself.

Example: Imagine that reports of confirmed malaria cases are more likely to be missing from hospitals than health posts. But we know which facilities are hospitals and which are health posts. So the missingness is not random, but it is explainable based on another variable facility type.

Implication: We can use the information we have (like facility type, month, or region) to predict and impute the missing values. Most imputation methods assume MAR.

  1. Missing Not At Random (MNAR)

What it means: The missingness is related to the value that is missing, and it cannot be explained using other information in the data.

Example: Suppose some facilities do not report malaria deaths because they fear it reflects badly on their performance. The missingness is not random and not explained by any variable in the dataset. We do not know which deaths are missing, and we cannot reliably estimate them.

Implication: Imputation is risky. The data are biased in a way that cannot be corrected without more information.

Type What Causes It Can Be Imputed? Example
MCAR Random external factors unrelated to any variable in the dataset Yes Monthly reports lost in a flood, no systematic pattern by location, time, or indicator
MAR Missingness depends on observed variables Yes Hospitals report less consistently than health posts, but facility type is known and recorded
MNAR Missingness depends on the unobserved value itself Not recommended Facilities underreport malaria deaths to avoid scrutiny; missingness cannot be explained using other variables
ImportantConsult with SNT team

If the missingness mechanism is unclear (MCAR, MAR, or MNAR), consult the SNT team. Patterns that seem technical may reflect known operational issues (e.g., stockouts, parallel reporting during campaigns). Getting their input early can prevent incorrect imputation choices later.

Step-by-Step

This step-by-step section shows how to systematically detect missingness patterns using our Sierra Leone DHIS2 data (previously assembled in the DHIS2 data preprocessing page). It builds on the principles outlined above, emphasizing the importance of detecting and understanding missing data patterns.

The examples focus on identifying temporal, spatial, and systematic patterns in missingness that inform our classification decisions. Understanding these patterns is necessary before proceeding to the Imputation Methods page, where we implement structural and statistical imputation techniques.

Step 1: Install and Load Libraries

First, install and/or load the necessary packages for data manipulation, visualization, and missing data detection.

  • R
  • Python
# install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load required packages using pacman
pacman::p_load(
  dplyr,        # data manipulation
  ggplot2,      # plotting
  ggtext,       # markdown text in ggplot2
  here,         # file path management
  lubridate,    # date handling
  naniar,       # missing data visualization and analysis
  tidyr,        # data reshaping
  cli,          # clean logging and CLI-style messages
  scales,       # number formatting
  UpSetR,       # upset plots for co-missingness
  wesanderson   # color palettes
)

To adapt the code:

  • Do not modify anything in the code above
Show the code
from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
from pyprojroot import here

# ── cli helpers ──────────────────────────────────────────────────────────────
def cli_header(message):
    print(f"\n{message}")

def cli_info(message):
    print(f"INFO: {message}")

def cli_success(message):
    print(f"SUCCESS: {message}")

def cli_warning(message):
    print(f"WARNING: {message}")

def cli_danger(message):
    print(f"ERROR: {message}")

def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )

To adapt the code:

  • Do not modify anything in the code above

Step 2: Load Data

Load the DHIS2 surveillance data and prepare it for missing data detection.

In this example, we focus on the past five years (the data stops in December 2023), so filter out anything before 2019.

  • R
  • Python
Show the code
# read the processed surveillance data produced by the import workflow
df_routine <- readRDS(
  here::here(
    "01_data",
    "1.2_epidemiology",
    "1.2a_routine_surveillance",
    "processed",
    "clean_malaria_routine_data_final.rds"
  )
) |>
  # drop the import-source label carried over from import.qmd
  dplyr::select(-dplyr::any_of("sheet_admin")) |>
  # filter to the past five years
  dplyr::filter(year >= 2019) |>
  dplyr::rename(maladm_hf_u5 = maladm_u5) |>
  dplyr::mutate(
    # ym: human-readable year-month label used as a discrete x-axis
    ym = format(date, "%Y-%m"),
    # date stays as Date (already Date in the .rds produced by import.qmd)
    date = as.Date(date)
  )
NoteOutput
record_id adm0 adm1 adm2 adm3 hf hf_uid location_short location_full facility_type date yearmon year month periodname allout_u5 allout_ov5 maladm_hf_u5 maladm_5_14 maladm_ov15 maldth_u5 maldth_5_14 maldth_ov15 susp_u5_hf susp_5_14_hf susp_ov15_hf susp_u5_com susp_5_14_com susp_ov15_com maldth_fem_ov15 maldth_mal_ov15 maldth_1_59m maldth_10_14 maldth_5_9 tes_neg_rdt_u5_com tes_pos_rdt_u5_com tes_neg_rdt_5_14_com tes_pos_rdt_5_14_com tes_neg_rdt_ov15_com tes_pos_rdt_ov15_com test_neg_mic_u5_hf test_pos_mic_u5_hf test_neg_mic_5_14_hf test_pos_mic_5_14_hf test_neg_mic_ov15_hf test_pos_mic_ov15_hf tes_neg_rdt_u5_hf tes_pos_rdt_u5_hf tes_neg_rdt_5_14_hf tes_pos_rdt_5_14_hf tes_neg_rdt_ov15_hf tes_pos_rdt_ov15_hf maltreat_u24_u5_com maltreat_ov24_u5_com maltreat_u24_5_14_com maltreat_ov24_5_14_com maltreat_u24_ov15_com maltreat_ov24_ov15_com maltreat_u24_u5_hf maltreat_ov24_u5_hf maltreat_u24_5_14_hf maltreat_ov24_5_14_hf maltreat_u24_ov15_hf maltreat_ov24_ov15_hf hf_uid_new hf_clean allout susp test_hf test_com test conf_hf conf_com conf maltreat_com maltreat_hf maltreat pres_com pres_hf pres maladm maldth test_hf_u5 test_hf_5_14 test_hf_ov15 test_com_u5 test_com_5_14 test_com_ov15 test_u5 test_5_14 test_ov15 susp_hf_u5 susp_hf_5_14 susp_hf_ov15 susp_com_u5 susp_com_5_14 susp_com_ov15 susp_u5 susp_5_14 susp_ov15 conf_hf_u5 conf_hf_5_14 conf_hf_ov15 conf_com_u5 conf_com_5_14 conf_com_ov15 conf_u5 conf_5_14 conf_ov15 maltreat_hf_u5 maltreat_hf_5_14 maltreat_hf_ov15 maltreat_com_u5 maltreat_com_5_14 maltreat_com_ov15 maltreat_u5 maltreat_5_14 maltreat_ov15 maltreat_u24_hf maltreat_ov24_hf maltreat_u24_com maltreat_ov24_com maltreat_u24_total maltreat_ov24_total pres_com_u5 pres_com_5_14 pres_com_ov15 pres_hf_u5 pres_hf_5_14 pres_hf_ov15 pres_u5 pres_5_14 pres_ov15 ym
138a1a87 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-01-01 Jan 2019 2019 1 January 2019 101 132 NA NA NA NA NA NA 95 24 44 69 24 44 NA NA NA NA NA 31 38 6 18 20 25 NA NA NA NA NA NA 51 44 6 18 20 24 35 9 NA NA 24 NA 35 9 15 3 16 8 hf_uid_new::1012 Althel CHP 233 300 163 138 301 86 81 167 68 86 154 0 0 0 NA NA 95 24 44 69 24 45 164 48 89 95 24 44 69 24 44 164 48 88 44 18 24 38 18 25 82 36 49 44 18 24 44 NA 24 88 18 48 66 20 59 9 125 29 6 18 0 0 0 0 6 18 0 2019-01
ed6d6325 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-02-01 Feb 2019 2019 2 February 2019 86 51 NA NA NA NA NA NA 82 16 35 62 16 35 NA NA NA NA NA 20 42 7 9 7 28 NA NA NA NA NA NA 25 57 7 9 7 28 65 NA NA NA 6 NA 40 17 9 NA 22 6 hf_uid_new::1012 Althel CHP 137 246 133 113 246 94 79 173 71 94 165 0 0 0 NA NA 82 16 35 62 16 35 144 32 70 82 16 35 62 16 35 144 32 70 57 9 28 42 9 28 99 18 56 57 9 28 65 NA 6 122 9 34 71 23 71 NA 142 23 23 9 0 0 0 0 23 9 0 2019-02
410e3ba3 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-03-01 Mar 2019 2019 3 March 2019 149 132 NA NA NA NA NA NA 106 22 38 69 22 38 NA NA NA NA NA 27 42 10 12 15 23 NA NA NA NA NA NA 40 66 10 12 15 23 66 NA 12 NA 23 NA 66 NA 12 NA 23 NA hf_uid_new::1012 Althel CHP 281 295 166 129 295 101 77 178 101 101 202 24 0 24 NA NA 106 22 38 69 22 38 175 44 76 106 22 38 69 22 38 175 44 76 66 12 23 42 12 23 108 24 46 66 12 23 66 12 23 132 24 46 101 NA 101 NA 202 NA 24 0 0 0 0 0 24 0 0 2019-03
0c0fef59 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-04-01 Apr 2019 2019 4 April 2019 99 80 NA NA NA NA NA NA 74 15 32 49 15 32 NA NA NA NA NA 13 37 7 8 16 16 NA NA NA NA NA NA 16 58 3 12 16 16 58 NA 8 NA 16 NA 58 NA 12 NA 16 NA hf_uid_new::1012 Althel CHP 179 217 121 97 218 86 61 147 82 86 168 21 0 21 NA NA 74 15 32 50 15 32 124 30 64 74 15 32 49 15 32 123 30 64 58 12 16 37 8 16 95 20 32 58 12 16 58 8 16 116 20 32 86 NA 82 NA 168 NA 21 0 0 0 0 0 21 0 0 2019-04
c3e6f1fa SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-05-01 May 2019 2019 5 May 2019 96 73 NA NA NA NA NA NA 84 38 16 49 38 16 NA NA NA NA NA 16 65 13 25 6 10 NA NA NA NA NA NA 19 65 13 25 6 10 46 19 19 6 10 4 46 19 19 6 10 NA hf_uid_new::1012 Althel CHP 169 241 138 135 273 100 100 200 104 100 204 4 0 4 NA NA 84 38 16 81 38 16 165 76 32 84 38 16 49 38 16 133 76 32 65 25 10 65 25 10 130 50 20 65 25 10 65 25 14 130 50 24 75 25 75 29 150 54 0 0 4 0 0 0 0 0 4 2019-05
d620d761 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-06-01 Jun 2019 2019 6 June 2019 80 48 NA NA NA NA NA NA 105 22 19 72 13 19 NA NA NA NA NA 17 55 9 13 7 12 NA NA NA NA NA NA 22 83 9 13 7 12 83 NA 13 NA 12 NA 83 NA 13 NA 12 NA hf_uid_new::1012 Althel CHP 128 250 146 113 259 108 80 188 108 108 216 28 0 28 NA NA 105 22 19 72 22 19 177 44 38 105 22 19 72 13 19 177 35 38 83 13 12 55 13 12 138 26 24 83 13 12 83 13 12 166 26 24 108 NA 108 NA 216 NA 28 0 0 0 0 0 28 0 0 2019-06

To adapt the code:

  • Lines 2-8: Update the path to point at the processed routine surveillance file produced by the import workflow (see the DHIS2 data preprocessing page).
  • Line 14: Filter the data to the period of interest.
  • Line 15: Drop the dplyr::rename() line if the dataset already uses the target column name.
Show the code
from pathlib import Path

import pandas as pd
from pyprojroot import here

# read the processed surveillance data produced by the import workflow.
# the .rds (loaded by R) and .parquet (loaded here) are written from the
# same dhis2_df in import.qmd, so both languages see identical data.
data_path = Path(here(
    "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed/"
    "clean_malaria_routine_data_final.parquet"
))

df_routine = (
    pd.read_parquet(data_path)
    # drop the import-source label carried over from import.qmd
    .drop(columns="sheet_admin", errors="ignore")
    # filter to the past five years
    .loc[lambda d: d["year"] >= 2019]
    .rename(columns={"maladm_u5": "maladm_hf_u5"})
    .assign(date=lambda d: pd.to_datetime(d["date"], errors="coerce"))
    .assign(ym=lambda d: d["date"].dt.strftime("%Y-%m"))
    .reset_index(drop=True)
)
NoteOutput
record_id adm0 adm1 adm2 adm3 hf hf_uid location_short location_full facility_type date yearmon year month periodname allout_u5 allout_ov5 maladm_hf_u5 maladm_5_14 maladm_ov15 maldth_u5 maldth_5_14 maldth_ov15 susp_u5_hf susp_5_14_hf susp_ov15_hf susp_u5_com susp_5_14_com susp_ov15_com maldth_fem_ov15 maldth_mal_ov15 maldth_1_59m maldth_10_14 maldth_5_9 tes_neg_rdt_u5_com tes_pos_rdt_u5_com tes_neg_rdt_5_14_com tes_pos_rdt_5_14_com tes_neg_rdt_ov15_com tes_pos_rdt_ov15_com test_neg_mic_u5_hf test_pos_mic_u5_hf test_neg_mic_5_14_hf test_pos_mic_5_14_hf test_neg_mic_ov15_hf test_pos_mic_ov15_hf tes_neg_rdt_u5_hf tes_pos_rdt_u5_hf tes_neg_rdt_5_14_hf tes_pos_rdt_5_14_hf tes_neg_rdt_ov15_hf tes_pos_rdt_ov15_hf maltreat_u24_u5_com maltreat_ov24_u5_com maltreat_u24_5_14_com maltreat_ov24_5_14_com maltreat_u24_ov15_com maltreat_ov24_ov15_com maltreat_u24_u5_hf maltreat_ov24_u5_hf maltreat_u24_5_14_hf maltreat_ov24_5_14_hf maltreat_u24_ov15_hf maltreat_ov24_ov15_hf hf_uid_new hf_clean allout susp test_hf test_com test conf_hf conf_com conf maltreat_com maltreat_hf maltreat pres_com pres_hf pres maladm maldth test_hf_u5 test_hf_5_14 test_hf_ov15 test_com_u5 test_com_5_14 test_com_ov15 test_u5 test_5_14 test_ov15 susp_hf_u5 susp_hf_5_14 susp_hf_ov15 susp_com_u5 susp_com_5_14 susp_com_ov15 susp_u5 susp_5_14 susp_ov15 conf_hf_u5 conf_hf_5_14 conf_hf_ov15 conf_com_u5 conf_com_5_14 conf_com_ov15 conf_u5 conf_5_14 conf_ov15 maltreat_hf_u5 maltreat_hf_5_14 maltreat_hf_ov15 maltreat_com_u5 maltreat_com_5_14 maltreat_com_ov15 maltreat_u5 maltreat_5_14 maltreat_ov15 maltreat_u24_hf maltreat_ov24_hf maltreat_u24_com maltreat_ov24_com maltreat_u24_total maltreat_ov24_total pres_com_u5 pres_com_5_14 pres_com_ov15 pres_hf_u5 pres_hf_5_14 pres_hf_ov15 pres_u5 pres_5_14 pres_ov15 ym
138a1a87 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-01-01 03:00:00 Jan 2019 2019 1 January 2019 101 132 NaN NaN NaN NaN NaN NaN 95 24 44 69 24 44 NaN NaN NaN NaN NaN 31 38 6 18 20 25 NaN NaN NaN NaN NaN NaN 51 44 6 18 20 24 35 9 NaN NaN 24 NaN 35 9 15 3 16 8 hf_uid_new::1012 Althel CHP 233 300 163 138 301 86 81 167 68 86 154 0 0 0 NaN NaN 95 24 44 69 24 45 164 48 89 95 24 44 69 24 44 164 48 88 44 18 24 38 18 25 82 36 49 44 18 24 44 NaN 24 88 18 48 66 20 59 9 125 29 6 18 0 0 0 0 6 18 0 2019-01
ed6d6325 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-02-01 03:00:00 Feb 2019 2019 2 February 2019 86 51 NaN NaN NaN NaN NaN NaN 82 16 35 62 16 35 NaN NaN NaN NaN NaN 20 42 7 9 7 28 NaN NaN NaN NaN NaN NaN 25 57 7 9 7 28 65 NaN NaN NaN 6 NaN 40 17 9 NaN 22 6 hf_uid_new::1012 Althel CHP 137 246 133 113 246 94 79 173 71 94 165 0 0 0 NaN NaN 82 16 35 62 16 35 144 32 70 82 16 35 62 16 35 144 32 70 57 9 28 42 9 28 99 18 56 57 9 28 65 NaN 6 122 9 34 71 23 71 NaN 142 23 23 9 0 0 0 0 23 9 0 2019-02
410e3ba3 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-03-01 03:00:00 Mar 2019 2019 3 March 2019 149 132 NaN NaN NaN NaN NaN NaN 106 22 38 69 22 38 NaN NaN NaN NaN NaN 27 42 10 12 15 23 NaN NaN NaN NaN NaN NaN 40 66 10 12 15 23 66 NaN 12 NaN 23 NaN 66 NaN 12 NaN 23 NaN hf_uid_new::1012 Althel CHP 281 295 166 129 295 101 77 178 101 101 202 24 0 24 NaN NaN 106 22 38 69 22 38 175 44 76 106 22 38 69 22 38 175 44 76 66 12 23 42 12 23 108 24 46 66 12 23 66 12 23 132 24 46 101 NaN 101 NaN 202 NaN 24 0 0 0 0 0 24 0 0 2019-03
0c0fef59 SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-04-01 03:00:00 Apr 2019 2019 4 April 2019 99 80 NaN NaN NaN NaN NaN NaN 74 15 32 49 15 32 NaN NaN NaN NaN NaN 13 37 7 8 16 16 NaN NaN NaN NaN NaN NaN 16 58 3 12 16 16 58 NaN 8 NaN 16 NaN 58 NaN 12 NaN 16 NaN hf_uid_new::1012 Althel CHP 179 217 121 97 218 86 61 147 82 86 168 21 0 21 NaN NaN 74 15 32 50 15 32 124 30 64 74 15 32 49 15 32 123 30 64 58 12 16 37 8 16 95 20 32 58 12 16 58 8 16 116 20 32 86 NaN 82 NaN 168 NaN 21 0 0 0 0 0 21 0 0 2019-04
c3e6f1fa SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 SOUTHERN ~ BO SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP OPD 2019-05-01 03:00:00 May 2019 2019 5 May 2019 96 73 NaN NaN NaN NaN NaN NaN 84 38 16 49 38 16 NaN NaN NaN NaN NaN 16 65 13 25 6 10 NaN NaN NaN NaN NaN NaN 19 65 13 25 6 10 46 19 19 6 10 4 46 19 19 6 10 NaN hf_uid_new::1012 Althel CHP 169 241 138 135 273 100 100 200 104 100 204 4 0 4 NaN NaN 84 38 16 81 38 16 165 76 32 84 38 16 49 38 16 133 76 32 65 25 10 65 25 10 130 50 20 65 25 10 65 25 14 130 50 24 75 25 75 29 150 54 0 0 4 0 0 0 0 0 4 2019-05

To adapt the code:

  • Lines 6-12: Update the path to point at the processed routine surveillance file produced by the import workflow (see the DHIS2 data preprocessing page).
  • Line 19: Filter the data to the period of interest.
  • Line 20: Drop the .rename() line if the dataset already uses the target column name.

Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness

Before drilling into temporal, spatial, and facility-type patterns, it helps to separate three concepts that often get lumped together as “missing data”:

  • Reporting completeness: was a report submitted at all for a given facility-month?
  • Conditional missingness: given that a report was submitted, which indicators were left blank?
  • Non-applicable periods: months before a facility was operational or after it stopped reporting are not missing. They are simply not expected.

A vertical band of “missing” months in a heatmap usually means no report was submitted (the entire row is NA). A single missing cell within an otherwise complete report has a very different meaning and a different mechanism. Untangling these is the first step before any of the visualisations below can be interpreted correctly.

WarningFilter to active facility-months first

Non-applicable periods (months before a facility was operational, after it stopped reporting, or during gaps in reporting) are defined on the Determining active and inactive status page. Run that workflow first; the analyses below assume df_routine has been filtered to active facility-months. Mixing in not-yet-active or already-inactive months will inflate the missing rates we see in every subsequent step.

Step 3.1: Conditional missingness within submitted reports

A facility-month is treated as a submitted report when at least one core indicator is non-missing. Among submitted reports, the percentage of each indicator left blank tells us whether the missingness is form-completion practice (specific indicators selectively skipped) or reporting failure (whole rows missing). Within an active facility-month, a missing indicator should reflect the former.

  • R
  • Python
Show the code
# define the core indicators that signal a report was submitted
core_inds <- c("test", "conf", "maltreat")

# flag each facility-month as reporting or non-reporting
df_routine <- df_routine |>
  dplyr::mutate(
    report_submitted = dplyr::if_any(
      dplyr::all_of(core_inds),
      ~ !is.na(.x)
    )
  )

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing <- df_routine |>
  dplyr::filter(report_submitted) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(core_inds),
      ~ round(mean(is.na(.x)) * 100, 1)
    )
  ) |>
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = "indicator",
    values_to = "pct_missing_given_reported"
  )

conditional_missing |>
  dplyr::rename(
    Indicator = indicator,
    `% missing given reported` = pct_missing_given_reported
  ) |>
  knitr::kable(align = "lr", digits = 1)
NoteOutput
Indicator % missing given reported
test 1.2
conf 1.2
maltreat 1.9

To adapt the code:

  • Line 2: Update core_inds to the indicators that signal a report was submitted in the dataset.
Show the code
import pandas as pd

# define the core indicators that signal a report was submitted
core_inds = ["test", "conf", "maltreat"]

# flag each facility-month as reporting or non-reporting
df_routine = df_routine.assign(
    report_submitted=lambda d: d[core_inds].notna().any(axis=1)
)

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing = (
    df_routine.loc[df_routine["report_submitted"]]
    [core_inds]
    .isna()
    .mean()
    .mul(100)
    .round(1)
    .rename_axis("indicator")
    .reset_index(name="pct_missing_given_reported")
)

conditional_missing
NoteOutput
indicator pct_missing_given_reported
test 1.2
conf 1.2
maltreat 1.9

To adapt the code:

  • Line 4: Update core_inds to the indicators that signal a report was submitted in the dataset.

Step 3.2: Quantitative summary across variables

A tabular summary and bar chart give a per-variable view of overall missingness. The bar chart sorts indicators by missing rate, making it easy to spot indicators that warrant closer inspection.

  • R
  • Python
Show the code
# variables to summarise
summary_vars <- c("test", "susp", "pres", "conf", "maltreat")

# compute % missing for the core aggregated indicators
miss_summary <- naniar::miss_var_summary(
  df_routine |>
    dplyr::select(dplyr::all_of(summary_vars))
) |>
  dplyr::mutate(pct_miss = as.numeric(pct_miss))

# bar chart of % missing per variable
miss_summary |>
  dplyr::mutate(
    variable = factor(variable, levels = rev(variable)),
    label_text = paste0(round(pct_miss, 1), "%"),
    inside = pct_miss > max(pct_miss) * 0.15
  ) |>
  ggplot2::ggplot(ggplot2::aes(x = pct_miss, y = variable)) +
  ggplot2::geom_col(fill = "#3B9AB2", width = 0.75) +
  ggplot2::geom_text(
    ggplot2::aes(
      label = label_text,
      hjust = ifelse(inside, 1.15, -0.15),
      colour = ifelse(inside, "white", "grey20")
    ),
    size = 4,
    fontface = "bold",
    family = "sans"
  ) +
  ggplot2::scale_colour_identity() +
  ggplot2::scale_x_continuous(
    expand = ggplot2::expansion(mult = c(0, 0.05)),
    labels = function(x) paste0(x, "%")
  ) +
  ggplot2::labs(
    title = "Proportion of missing data per indicator",
    subtitle = paste(
      "% missing across submitted reports for each",
      "core malaria indicator"
    ),
    x = NULL,
    y = NULL
  ) +
  ggplot2::theme_minimal(base_family = "sans") +
  ggplot2::theme(
    plot.title = ggplot2::element_text(
      size = 16,
      face = "bold",
      margin = ggplot2::margin(b = 4)
    ),
    plot.subtitle = ggplot2::element_text(
      size = 12,
      colour = "grey30",
      margin = ggplot2::margin(b = 14)
    ),
    axis.text.y = ggplot2::element_text(size = 11, colour = "black"),
    axis.text.x = ggplot2::element_text(size = 10, colour = "grey30"),
    panel.grid.major.y = ggplot2::element_blank(),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.x = ggplot2::element_line(
      colour = "grey90",
      size = 0.3
    ),
    axis.ticks = ggplot2::element_blank(),
    plot.margin = ggplot2::margin(15, 30, 10, 10)
  )
NoteOutput

To adapt the code:

  • Line 2: Update summary_vars to match the dataset’s indicators of interest.
TipCo-missingness and Little’s MCAR test as additional diagnostics

Two extensions to the diagnostics above are useful when the basic summary leaves the mechanism ambiguous:

  • Co-missingness: naniar::gg_miss_upset(df_routine |> dplyr::select(test, susp, pres, conf, maltreat), nsets = 5) shows which indicators are missing together. Requires the UpSetR package. A single large intersection points to report-level reporting failure; many small intersections point to form completion practice.
  • Little’s MCAR test: naniar::mcar_test(df_routine |> dplyr::select(test, conf, maltreat)) returns a p-value for the null hypothesis that data are MCAR. A small p-value supports MAR or MNAR. The test can be slow on large datasets.
Show the code
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

# tabular summary of % missing for the core aggregated indicators
summary_vars = ["test", "susp", "pres", "conf", "maltreat"]

miss_summary = (
    pd.DataFrame({
        "variable": summary_vars,
        "n_miss": [df_routine[v].isna().sum() for v in summary_vars],
        "pct_miss": [
            round(df_routine[v].isna().mean() * 100, 1)
            for v in summary_vars
        ],
    })
    .sort_values("pct_miss", ascending=False)
    .reset_index(drop=True)
)

# bar chart of % missing per variable (horizontal, sorted descending)
vmax = float(miss_summary["pct_miss"].max())
threshold = vmax * 0.15

fig, ax = plt.subplots(figsize=(9, 5.2))
bars = ax.barh(
    miss_summary["variable"],
    miss_summary["pct_miss"],
    color="#3B9AB2",
    height=0.75
)
ax.invert_yaxis()

# value labels: inside the bar (white) when there is room,
# outside (grey) for short bars
for bar, val in zip(bars, miss_summary["pct_miss"]):
    label = f"{round(val, 1)}%"
    if val > threshold:
        ax.text(
            val - vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="right",
            color="white",
            fontsize=11,
            fontweight="bold"
        )
    else:
        ax.text(
            val + vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="left",
            color="#333333",
            fontsize=11,
            fontweight="bold"
        )

# title (large, bold) + subtitle (smaller, grey)
fig.suptitle(
    "Proportion of missing data per indicator",
    fontsize=16,
    fontweight="bold",
    x=0.05,
    ha="left",
    y=0.98
)
ax.set_title(
    "% missing across submitted reports for each core malaria indicator",
    fontsize=12,
    color="#555555",
    loc="left",
    pad=8
)

ax.set_xlabel("")
ax.set_ylabel("")
ax.xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f"{int(x)}%")
)
ax.tick_params(left=False, labelsize=11, colors="#333333")
ax.tick_params(axis="x", colors="#555555", labelsize=10)

# clean look: no border, only faint x gridlines
for spine in ax.spines.values():
    spine.set_visible(False)
ax.grid(axis="x", color="#E5E5E5", linewidth=0.6)
ax.grid(axis="y", visible=False)
ax.set_axisbelow(True)
ax.set_xlim(0, max(vmax * 1.05, 10))

plt.tight_layout(rect=[0, 0, 1, 0.94])
NoteOutput
(0.0, 25.305000000000003)

To adapt the code:

  • Line 6: Update the indicator list in summary_vars to match the dataset.
ImportantConsult with SNT team

Indicators that show very high or 100% missingness are often a signal that the value lives elsewhere, not that it is genuinely absent. SNT data management is iterative: an indicator that appears empty at the facility level may sit in a different mother dataset (an aggregated reporting form, a separate platform module, a parallel reporting system, or a renamed field). Before treating these indicators as analytical gaps, consult the SNT team to confirm whether the indicator is collected elsewhere or whether the dataset under analysis is the right starting point. What looks like a data problem is often a sourcing problem the team has already resolved.

The conditional missingness table and the overall summary together tell us whether the gaps we see downstream are predominantly reporting failures (whole rows missing, all indicators co-missing) or form completion practice (specific indicators missing despite a report being filed). The two require different responses on the Imputation Methods page.

Step 4: Check Missingness over Time

In this step, we check the missingness rate of our variables of interest across time. This analysis helps us understand temporal patterns in data availability and identify periods or variables with systematic missing data issues. The heatmap visualization provides an overview of data completeness across all malaria indicators for different age groups throughout the study period. The color intensity represents the percentage of missing values, allowing us to quickly identify problematic time periods or variables that may require special handling in subsequent analyses.

Initially, we want to check the following set of variables: malaria tests, suspected, presumed and confirmed cases and malaria treatments aggregated across all age groups, as well as the age-specific breakdowns.

  • R
  • Python
Show the code
# get the variables of interest (aggregated and age-specific)
vars <- c(
  "test",        # aggregated malaria tests
  "test_u5",
  "test_5_14",
  "test_ov15",
  "susp",        # aggregated suspected cases
  "susp_u5",
  "susp_5_14",
  "susp_ov15",
  "pres",        # aggregated presumed cases
  "pres_u5",
  "pres_5_14",
  "pres_ov15",
  "conf",        # aggregated confirmed cases
  "conf_u5",
  "conf_5_14",
  "conf_ov15",
  "maltreat",    # aggregated treatments
  "maltreat_u5",
  "maltreat_5_14",
  "maltreat_ov15"
)

# calculate missing rates by date for each variable
missing_rate_date <- df_routine |>
  dplyr::group_by(date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(vars),
      ~ mean(is.na(.x)) * 100
    ),
    .groups = "drop"
  ) |>
  dplyr::arrange(date) |>
  tidyr::pivot_longer(
    cols = -ym,
    names_to = "variables",
    values_to = "missing_rate"
  ) |>
  # enforce the explicit `vars` order on the y-axis so R and Python match
  dplyr::mutate(variables = factor(variables, levels = vars))

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_date$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across variables
missing_plot_date <- ggplot2::ggplot(
  missing_rate_date,
  ggplot2::aes(
    y = variables,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "Variable",
    x = "",
    title = "The proportion of missing data for selected variables by year-month"
  )

# display the plot
missing_plot_date
NoteOutput

To adapt the code:

  • Lines 1-23: Update the vars vector to match the dataset’s variable names. For a different country or dataset, modify variable names to reflect the relevant indicators (e.g., for Ghana this might be cases_u5, tests_5_14, etc.).
  • Line 26: Replace df_routine with the dataset name if different.
  • Line 50: Set full_range <- TRUE to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range <- FALSE to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.
Show the code
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# variables of interest (aggregated and age-specific)
vars_of_interest = [
    "test",           # aggregated malaria tests
    "test_u5",
    "test_5_14",
    "test_ov15",
    "susp",           # aggregated suspected cases
    "susp_u5",
    "susp_5_14",
    "susp_ov15",
    "pres",           # aggregated presumed cases
    "pres_u5",
    "pres_5_14",
    "pres_ov15",
    "conf",           # aggregated confirmed cases
    "conf_u5",
    "conf_5_14",
    "conf_ov15",
    "maltreat",       # aggregated treatments
    "maltreat_u5",
    "maltreat_5_14",
    "maltreat_ov15",
]

# calculate missing rates by year-month for each variable
missing_rate_date = (
    df_routine.groupby("ym", as_index=False)[vars_of_interest]
    .apply(lambda g: g[vars_of_interest].isna().mean() * 100)
    .reset_index()
    .rename(columns={"level_0": "ym"})
)
# rebuild from groupby to preserve ym correctly
_grouped = df_routine.groupby("ym")
missing_rate_date = pd.DataFrame(
    {v: _grouped[v].apply(lambda s: s.isna().mean() * 100)
     for v in vars_of_interest}
).reset_index()

# pivot to long format
missing_rate_long = missing_rate_date.melt(
    id_vars="ym",
    value_vars=vars_of_interest,
    var_name="variables",
    value_name="missing_rate"
).sort_values("ym")

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_long["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap (mirrors wesanderson::wes_palette("Zissou1", 100))
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

# pivot back to matrix for pcolormesh
ym_order = sorted(missing_rate_long["ym"].unique())
var_order = vars_of_interest  # top-to-bottom order on y-axis

pivot = (
    missing_rate_long
    .pivot(index="variables", columns="ym", values="missing_rate")
    .reindex(index=var_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(var_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

# x-axis ticks and labels
ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

# y-axis ticks and labels
ax.set_yticks(np.arange(len(var_order)) + 0.5)
ax.set_yticklabels(var_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("Variable")
ax.set_title(
    "The proportion of missing data for selected variables by year-month",
    fontsize=12, pad=10
)

# horizontal colorbar at the bottom
cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

# remove grid lines
ax.grid(False)
plt.tight_layout()
NoteOutput

To adapt the code:

  • Lines 7-27: Update vars_of_interest to match the dataset’s variable names. For a different country or dataset, modify variable names to reflect the relevant indicators.
  • Line 31: Replace df_routine with the dataset name if different.
  • Line 53: Set full_range = True to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range = False to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.

Several indicators show consistent missingness across time. Between mid-2022 and early 2023, missingness sharply increased for several indicators, with test, maltreat, and conf reaching rates above 45%.

What to look for in heatmaps like this on any dataset:

  • Vertical bands (all indicators missing in the same months) indicate system-wide outages or platform migrations and usually point to MCAR at the time-period level.
  • Horizontal bands (one indicator consistently missing across all months) indicate indicator-specific reporting practice or form changes and usually point to MAR conditional on indicator.
  • An expanding right edge indicates delayed reporting in the most recent months; consider whether those months should be excluded.
  • Co-located bands across age-disaggregated rows (e.g., test_u5, test_5_14, test_ov15 all missing together) usually mean the aggregate was reported but the disaggregation was skipped.

Patterns that look like MAR conditional on time are addressable through time-stratified imputation on the Imputation Methods page.

Step 5: Check Missingness over Time and Administrative Level

In this step, we focus on a single indicator, malaria testing for under 5’s at the health facility level (test_hf_u5), to examine how missingness varies over time across districts (adm2 level). This allows us to identify subnational areas with consistently poor reporting and spot time periods where data availability dropped sharply.

Although the data is available at the health facility level, the most granular reporting unit, it is still important to assess missingness at higher administrative levels. This broader view becomes especially useful when facility-level data are sparse. In such cases, we may use fallback analysis approaches that draw on patterns from neighbouring facilities within the same district.

  • R
  • Python
Show the code
# calculate missing rates by year-month and adm2 for each variable
missing_rate_adm2 <- df_routine |>
  dplyr::group_by(ym, adm2) |>
  dplyr::summarise(
    missing_rate = mean(is.na(test_hf_u5)) * 100,
    .groups = "drop"
  )

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_adm2$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across location
missing_plot_adm2 <- ggplot2::ggplot(
  missing_rate_adm2,
  ggplot2::aes(
    y = adm2,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "District",
    x = "",
    title = "The proportion of missing data for test_hf_u5 by year-month and adm2"
  )

# display the plot
missing_plot_adm2
NoteOutput

To adapt the code:

  • Line 2: Replace df_routine with the dataset name if different.
  • Line 5: Update the variable name (test_hf_u5) to match the dataset’s variable name of interest.
  • Line 15: Set full_range <- TRUE to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range <- FALSE to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.
Show the code
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# calculate missing rates by year-month and adm2 for test_hf_u5
missing_rate_adm2 = (
    df_routine.groupby(["ym", "adm2"], as_index=False)
    .agg(missing_rate=("test_hf_u5", lambda s: s.isna().mean() * 100))
)

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_adm2["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

ym_order = sorted(missing_rate_adm2["ym"].unique())
adm2_order = sorted(missing_rate_adm2["adm2"].unique())

pivot = (
    missing_rate_adm2
    .pivot(index="adm2", columns="ym", values="missing_rate")
    .reindex(index=adm2_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(adm2_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

ax.set_yticks(np.arange(len(adm2_order)) + 0.5)
ax.set_yticklabels(adm2_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("District")
ax.set_title(
    "The proportion of missing data for test_hf_u5 by year-month and adm2",
    fontsize=12, pad=10
)

cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

ax.grid(False)
plt.tight_layout()
NoteOutput

To adapt the code:

  • Line 8: Replace df_routine with the dataset name if different.
  • Line 9: Update the variable name (test_hf_u5) to match the dataset’s variable name of interest.
  • Line 13: Set full_range = True to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range = False to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.

Missing data for test_hf_u5 varies by district and over time. WESTERN URBAN shows the highest missingness rate (about 67% on average), with KOINADUGU, WESTERN RURAL, and KONO also above 25%. Many districts also show increased missingness around mid to late 2022, suggesting a broader reporting issue. In contrast, districts like KARENE, KAMBIA, and MOYAMBA have lower and more stable missing rates. These patterns confirm that test_hf_u5 varies by both time and location.

This suggests that missingness in test_hf_u5 is likely Missing At Random (MAR), as it depends on observable factors such as district and reporting period. In such cases, stratified analysis approaches are appropriate. The district (adm2) level can also serve as a fallback unit for analysis, particularly when individual facility data are sparse, by drawing insights from patterns in neighbouring facilities within the same district.

ImportantConsult with SNT team

District-level patterns of missingness often reflect operational decisions the SNT team already knows about: which districts switched DHIS2 instances or migrated platforms, when health information system reforms rolled out, where parallel reporting systems run alongside DHIS2, and which districts ran campaigns that disrupted routine reporting. Before classifying a district’s pattern as MAR or MNAR, consult the SNT team to rule out a known structural cause. The team may also point to an alternative source that is more complete for the affected period.

What to look for when running this on a different country:

  • Persistent horizontal stripes in specific districts often reflect urban or peri-urban hospitals reporting less consistently than rural health posts. Check the facility-type composition of those districts.
  • Vertical bands shared across districts indicate a national-level reporting disruption such as a DHIS2 migration, a campaign month, or a holiday season.
  • A dark lower-left region means missingness is concentrated early in the time series; the indicator may have been added later, in which case pre-activation months should be masked as not-applicable per the active status page.
  • A smooth gradient with no strong pattern may indicate MCAR; confirm with naniar::mcar_test() on that indicator.

When MAR patterns like those above are present, plan to use a stratified or hierarchical imputation approach on the Imputation Methods page.

Step 6: Check Missingness by Health Facility Type and Age Groups

Beyond temporal and geographic patterns, missingness may also vary systematically by additional factors, such as health facility type and age groups. Many routine surveillance systems collect data that is already disaggregated by these factors (e.g., test_hf_u5 representing malaria tests conducted at health facilities for children under 5).

The first step in determining these patterns is to disaggregate such pre-aggregated data into a long format, creating separate columns for indicators, facility types, and age groups. This long format allows us to examine how missingness varies across these important stratification variables and identify systematic reporting gaps that may be related to facility capacity, service delivery models, or demographic-specific data collection challenges.

  • R
  • Python
Show the code
# define core indicators to retain during disaggregation
core_indicators <- c(
  "adm0",
  "adm1",
  "adm2",
  "adm3",
  "hf",
  "hf_uid",
  "year",
  "month",
  "ym",
  "date"
)

# select core indicators and disaggregated variables
df_routine_disagg <- df_routine |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    dplyr::matches("_(com|hf)_(u5|5_14|ov15)$")
  )

# transform to long format with separate facility type and age group columns
df_routine_disagg_long <- df_routine_disagg |>
  # drop maladm_hf_u5: lacks community-level and other age-group
  # counterparts, so it cannot be pivoted into the long format below
  dplyr::select(-maladm_hf_u5) |>
  tidyr::pivot_longer(
    cols = dplyr::matches("_(com|hf)_(u5|5_14|ov15)$"),
    names_to = c("indicator", "facility_type", "age_group"),
    names_pattern = "(.+)_(com|hf)_(u5|5_14|ov15)$",
    values_to = "value"
  ) |>
  # relabel factors
  dplyr::mutate(
    age_group = factor(
      age_group,
      levels = c("u5", "5_14", "ov15"),
      labels = c("Under 5", "5 to 15", "Over 15")
    ),
    facility_type = factor(
      facility_type,
      levels = c("hf", "com"),
      labels = c("Health Facility", "Community")
    )
  )

# preview the disaggregated data structure
df_routine_disagg_long |>
  utils::head() |>
  knitr::kable()
NoteOutput
adm0 adm1 adm2 adm3 hf hf_uid year month ym date indicator facility_type age_group value
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 1 2019-01 2019-01-01 test Health Facility Under 5 95
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 1 2019-01 2019-01-01 test Health Facility 5 to 15 24
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 1 2019-01 2019-01-01 test Health Facility Over 15 44
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 1 2019-01 2019-01-01 test Community Under 5 69
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 1 2019-01 2019-01-01 test Community 5 to 15 24
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 1 2019-01 2019-01-01 test Community Over 15 45

To adapt the code:

  • Lines 2-13: Update the core_indicators vector to match the dataset’s administrative and temporal variable names.
  • Line 16: Replace df_routine with the dataset name if different.
  • Lines 19, 28: Modify the regex pattern "_(com|hf)_(u5|5_14|ov15)$" to match the facility type and age group naming conventions (e.g., if the data uses different facility codes or age categories).
Show the code
import re

import pandas as pd

# define core columns to retain during disaggregation
core_indicators = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid",
    "year", "month", "ym", "date"
]

# select core indicators and disaggregated columns
disagg_pattern = re.compile(r".+_(com|hf)_(u5|5_14|ov15)$")
disagg_cols = [c for c in df_routine.columns if disagg_pattern.match(c)]
keep_core = [c for c in core_indicators if c in df_routine.columns]

df_routine_disagg = df_routine[keep_core + disagg_cols].copy()

# drop maladm_hf_u5: lacks community-level and other age-group counterparts
if "maladm_hf_u5" in df_routine_disagg.columns:
    df_routine_disagg = df_routine_disagg.drop(columns=["maladm_hf_u5"])

# re-identify disaggregated columns after the drop
disagg_cols_final = [
    c for c in df_routine_disagg.columns
    if c not in keep_core
]

# transform to long format with separate facility_type and age_group columns
def _parse_col(col):
    """Extract (indicator, facility_type, age_group) from column name."""
    m = re.match(r"^(.+)_(com|hf)_(u5|5_14|ov15)$", col)
    if m:
        return m.group(1), m.group(2), m.group(3)
    return None, None, None

long_rows = []
for col in disagg_cols_final:
    ind, ftype, age = _parse_col(col)
    if ind is None:
        continue
    tmp = df_routine_disagg[keep_core + [col]].copy()
    tmp = tmp.rename(columns={col: "value"})
    tmp["indicator"] = ind
    tmp["facility_type"] = ftype
    tmp["age_group"] = age
    long_rows.append(tmp)

df_routine_disagg_long = pd.concat(long_rows, ignore_index=True)

# relabel factors to match R output
age_order = ["u5", "5_14", "ov15"]
age_labels = ["Under 5", "5 to 15", "Over 15"]
ftype_order = ["hf", "com"]
ftype_labels = ["Health Facility", "Community"]

df_routine_disagg_long["age_group"] = pd.Categorical(
    df_routine_disagg_long["age_group"].map(
        dict(zip(age_order, age_labels))
    ),
    categories=age_labels,
    ordered=True
)
df_routine_disagg_long["facility_type"] = pd.Categorical(
    df_routine_disagg_long["facility_type"].map(
        dict(zip(ftype_order, ftype_labels))
    ),
    categories=ftype_labels,
    ordered=True
)

# preview the disaggregated data structure as an HTML table
df_routine_disagg_long.head().style.hide(axis="index")
NoteOutput
adm0 adm1 adm2 adm3 hf hf_uid year month ym date value indicator facility_type age_group
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 1 2019-01 2019-01-01 03:00:00 95 test Health Facility Under 5
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 2 2019-02 2019-02-01 03:00:00 82 test Health Facility Under 5
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 3 2019-03 2019-03-01 03:00:00 106 test Health Facility Under 5
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 4 2019-04 2019-04-01 03:00:00 74 test Health Facility Under 5
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2a696984 2019 5 2019-05 2019-05-01 03:00:00 84 test Health Facility Under 5

To adapt the code:

  • Lines 6-13: Update core_indicators to match the dataset’s administrative and temporal variable names.
  • Line 15: Replace df_routine with the dataset name if different.
  • Lines 13, 32: Modify the regex pattern r".+_(com|hf)_(u5|5_14|ov15)$" to match the facility type and age group naming conventions in the dataset.

Now that we have disaggregated the data, we can examine how missingness varies across all indicators by facility type and age group. This check helps identify systematic patterns in data availability that may be related to service delivery models, facility capacity, or demographic-specific reporting challenges.

  • R
  • Python
Show the code
# prepare data for stacked bar chart with both age group and facility type
missing_present_data_facility <- df_routine_disagg_long |>
  dplyr::group_by(indicator, age_group, facility_type) |>
  dplyr::summarise(
    total_records = dplyr::n(),
    missing_count = sum(is.na(value)),
    present_count = sum(!is.na(value)),
    missing_pct = round(missing_count / total_records * 100, 2),
    present_pct = round(present_count / total_records * 100, 2),
    .groups = "drop"
  ) |>
  tidyr::pivot_longer(
    cols = c(missing_pct, present_pct),
    names_to = "status",
    values_to = "percentage",
    names_pattern = "(.+)_pct"
  ) |>
  dplyr::mutate(
    status = factor(
      status,
      levels = c("missing", "present"),
      labels = c("Missing", "Present")
    )
  )

# create stacked bar plot faceted by age group and facility type
age_facility_miss_plot <- missing_present_data_facility |>
  ggplot2::ggplot(ggplot2::aes(
    x = percentage,
    y = factor(age_group, levels = rev(levels(factor(age_group)))),
    fill = status
  )) +
  ggplot2::geom_col(alpha = 0.8) +
  ggplot2::facet_grid(indicator ~ facility_type) +
  ggplot2::scale_fill_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Present", "Missing")
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    x = "\nPercentage (%)",
    y = "Age Group (years)\n",
    fill = "Data Status"
  ) +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10)
  )

# display the plot
age_facility_miss_plot
NoteOutput

To adapt the code:

  • Line 2: Replace df_routine_disagg_long with the disaggregated dataset name if different.
  • Lines 18-22: Update the status factor levels and labels to use different display names for missing/present data categories.
  • Lines 48-52: Modify the plot title and axis labels to match the analysis focus.
Show the code
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# prepare data for stacked bar chart with age group and facility type
missing_present_data_facility = (
    df_routine_disagg_long
    .groupby(["indicator", "age_group", "facility_type"], observed=True,
             as_index=False)
    .agg(
        total_records=("value", "count"),
        missing_count=("value", lambda s: s.isna().sum()),
        present_count=("value", lambda s: s.notna().sum()),
    )
    .assign(
        missing_pct=lambda d: (d["missing_count"] / d["total_records"] * 100)
                              .round(2),
        present_pct=lambda d: (d["present_count"] / d["total_records"] * 100)
                              .round(2),
    )
)

# pivot to long format for status column
mp_long = missing_present_data_facility.melt(
    id_vars=["indicator", "age_group", "facility_type"],
    value_vars=["missing_pct", "present_pct"],
    var_name="status",
    value_name="percentage"
).assign(
    status=lambda d: pd.Categorical(
        d["status"].map({"missing_pct": "Missing", "present_pct": "Present"}),
        categories=["Missing", "Present"],
        ordered=True
    )
)

# viridis D palette: begin=0.2, end=0.5 for two categories
import matplotlib.cm as cm
viridis = cm.get_cmap("viridis")
color_missing = viridis(0.2)
color_present = viridis(0.5)
status_colors = {"Missing": color_missing, "Present": color_present}

indicators = list(mp_long["indicator"].unique())
facility_types = list(mp_long["facility_type"].cat.categories)
age_labels_ordered = list(mp_long["age_group"].cat.categories)
age_rev = list(reversed(age_labels_ordered))

n_ind = len(indicators)
n_ftype = len(facility_types)

fig, axes = plt.subplots(
    n_ind, n_ftype,
    figsize=(12, 8),
    sharex=True,
    sharey="row"
)

for i, ind in enumerate(indicators):
    for j, ftype in enumerate(facility_types):
        ax = axes[i][j] if n_ind > 1 else axes[j]
        subset = mp_long.loc[
            (mp_long["indicator"] == ind) &
            (mp_long["facility_type"] == ftype)
        ]
        for status in ["Present", "Missing"]:
            s_sub = subset.loc[subset["status"] == status]
            s_sub = s_sub.set_index("age_group").reindex(age_rev)
            ax.barh(
                s_sub.index,
                s_sub["percentage"],
                color=status_colors[status],
                alpha=0.8,
                label=status
            )
        ax.set_xlim(0, 100)
        ax.set_xlabel("")
        ax.set_ylabel("")
        # panel border
        for spine in ax.spines.values():
            spine.set_linewidth(0.7)
        # strip labels
        if i == 0:
            ax.set_title(str(ftype), fontsize=9, fontweight="bold")
        if j == n_ftype - 1:
            ax.yaxis.set_label_position("right")
            ax.set_ylabel(str(ind), rotation=270, labelpad=12,
                          fontsize=9, fontweight="bold")
        # no grid
        ax.grid(False)

# shared axis labels
fig.supxlabel("Percentage (%)", y=0.02, fontsize=10)
fig.supylabel("Age Group (years)", x=0.02, fontsize=10)
fig.suptitle(
    "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    fontsize=12, y=1.01
)

# shared legend
handles = [
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Present"], alpha=0.8),
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Missing"], alpha=0.8),
]
fig.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    bbox_to_anchor=(0.0, -0.04),
    ncol=2,
    frameon=False,
    fontsize=9
)
plt.tight_layout()
NoteOutput

To adapt the code:

  • Line 8: Replace df_routine_disagg_long with the disaggregated dataset name if different.
  • Lines 30-34: Update the status mapping if different display names for missing/present data categories are needed.
  • Lines 94-99: Modify the plot title and axis labels to match the analysis focus.

The above visualization shows substantial differences in missingness patterns between facility types, with community-level facilities showing considerably higher missing rates, around 70% on average. This pattern is consistent across indicators and age groups. Notably, within community facilities, under-five data is more frequently reported than data for older age groups, a pattern observed across key indicators. These trends point to systematic differences in data collection and reporting capacity between facility types.

What to look for when running this on a different dataset:

  • A large gap between facility types (community vs facility) reflects differences in data collection capacity, supervision frequency, or which indicators are mandated at each level. Stratify imputation by facility type.
  • A larger gap in older age groups often reflects that the programme focuses on under-fives; older-age data may be a structural non-applicable rather than missing.
  • An indicator-specific gap (one row of facets noticeably worse than the others) often means the form does not include that indicator at that level. Verify with the country team before imputing.

Mark indicators that are structurally not collected at a given facility type or age band as not-applicable rather than imputing them. Imputation strategies for genuine facility-type MAR patterns are covered on the Imputation Methods page.

Step 7: Visualizing Structural Zeros and Logical Relationships

Before classifying missing values, we visualize the relationships between related indicators. This plot highlights likely structural zeros (e.g., no testing with missing confirmation), possible structural zeros (e.g., confirmed cases equal to zero with missing test values), and logical inconsistencies (e.g., confirmed cases recorded but testing data missing). These patterns reflect the clinical care pathway and help identify where values can be logically inferred, structurally imputed as zero, or require statistical methods.

Specifically:

  • test = 0, conf missing: Likely legitimate zero (no testing = confirmation not possible).
  • conf = 0, test missing: Possibly legitimate zero (no confirmed cases suggests no positives, but testing may have occurred).
  • conf > 0, test missing: Logical inconsistency (confirmed cases imply testing occurred, value should be imputed).
  • test > 0, conf missing: Possibly legitimate zero (testing occurred but no positive cases; confirmation may be legitimately zero or missing).
  • R
  • Python

Note: Below, we use the naniar package to explore missingness. Unlike geom_point(), which silently drops missing values, geom_miss_point() shows where one or both variables are missing by shifting them to plot margins. This makes it easier to detect structural zeros, reporting gaps, and logical inconsistencies.

Show the code
# transform data to wide format to examine relationships between indicators
df_wide <- df_routine_disagg_long |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    facility_type,
    age_group,
    indicator,
    value
  ) |>
  tidyr::pivot_wider(
    names_from = indicator,
    values_from = value
  )

# calculate comprehensive missing counts for subtitle
n_test_missing_conf_zero <- df_wide |>
  dplyr::filter(conf == 0, is.na(test)) |>
  nrow()

n_conf_missing_test_zero <- df_wide |>
  dplyr::filter(test == 0, is.na(conf)) |>
  nrow()

n_test_na_conf_positive <- df_wide |>
  dplyr::filter(is.na(test), conf > 0) |>
  nrow()

n_conf_na_test_positive <- df_wide |>
  dplyr::filter(is.na(conf), test > 0) |>
  nrow()

# create comprehensive subtitle with all missing count scenarios
subtitle_text <- paste0(
  "N = ",
  scales::comma(n_test_missing_conf_zero),
  " where conf = 0 but test is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_conf_missing_test_zero),
  " where test = 0 but conf is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_test_na_conf_positive),
  " where test is missing but conf > 0 (logical inconsistency); \n",
  "N = ",
  scales::comma(n_conf_na_test_positive),
  " where conf is missing but test > 0 (possible legitimate zero)"
)

# create scatter plot showing relationship between test and confirmed counts
structural_zeros_plot <- ggplot2::ggplot(
  data = df_wide,
  mapping = ggplot2::aes(x = test, y = conf)
) +
  naniar::geom_miss_point() +
  ggplot2::geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::geom_hline(
    yintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_y_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_color_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Not Missing", "Missing"),
    labels = c("Present", "Missing")
  ) +
  ggplot2::guides(
    color = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      override.aes = list(
        size = 5
      ),
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Structural Zeros: Testing vs Confirmation",
    subtitle = subtitle_text,
    x = "\nTests Conducted",
    y = "Cases Confirmed\n",
    color = "Data Status"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    plot.subtitle = ggplot2::element_text(size = 9, color = "grey30")
  )


# display the plot
structural_zeros_plot
NoteOutput

To adapt the code:

  • Lines 17, 21, 25, 29: Update the variable names (test, conf) in the filter conditions to match the dataset’s variable names of interest.
  • Lines 3-9: Update the core_indicators and column selection to match the dataset structure.
  • Lines 10-13: Modify the pivot_wider() transformation to include the specific indicators to examine for structural relationships.
  • Line 51: Change the x and y aesthetic mappings to examine different variable pairs (e.g., x = maltreat, y = conf to examine treatment vs confirmation relationships).

Note: Below, we replicate the naniar::geom_miss_point() approach in Python. Missing values on one axis are shifted to the plot margin (10% below the minimum observed value) so the structural-zero patterns remain visible.

Show the code
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd

# transform to wide format to examine relationships between indicators
keep_core = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid", "year", "month", "ym", "date"
]
keep_cols = [c for c in keep_core if c in df_routine_disagg_long.columns]

df_wide = (
    df_routine_disagg_long[keep_cols + ["facility_type", "age_group",
                                         "indicator", "value"]]
    .pivot_table(
        index=keep_cols + ["facility_type", "age_group"],
        columns="indicator",
        values="value",
        aggfunc="first",
        observed=True
    )
    .reset_index()
)
df_wide.columns.name = None

# calculate missing counts for subtitle
n_test_missing_conf_zero = df_wide.loc[
    (df_wide["conf"] == 0) & df_wide["test"].isna()
].shape[0]

n_conf_missing_test_zero = df_wide.loc[
    (df_wide["test"] == 0) & df_wide["conf"].isna()
].shape[0]

n_test_na_conf_positive = df_wide.loc[
    df_wide["test"].isna() & (df_wide["conf"] > 0)
].shape[0]

n_conf_na_test_positive = df_wide.loc[
    df_wide["conf"].isna() & (df_wide["test"] > 0)
].shape[0]

subtitle_text = (
    f"N = {n_test_missing_conf_zero:,} where conf = 0 but test is missing "
    f"(legitimate zero); \n"
    f"N = {n_conf_missing_test_zero:,} where test = 0 but conf is missing "
    f"(legitimate zero); \n"
    f"N = {n_test_na_conf_positive:,} where test is missing but conf > 0 "
    f"(logical inconsistency); \n"
    f"N = {n_conf_na_test_positive:,} where conf is missing but test > 0 "
    f"(possible legitimate zero)"
)

# create demo data matching the R render chunk:
# rows 400-600 where conf is NA get test = 0 (legitimate zero pattern)
# rows 490-900 where test is NA get conf = 0 (legitimate zero pattern)
df_wide_demo = df_wide.copy().reset_index(drop=True)
mask1 = (df_wide_demo.index.isin(range(400, 601))) & df_wide_demo["conf"].isna()
df_wide_demo.loc[mask1, "test"] = 0.0
mask2 = (df_wide_demo.index.isin(range(490, 901))) & df_wide_demo["test"].isna()
df_wide_demo.loc[mask2, "conf"] = 0.0

# compute margin offsets for miss_point behavior
test_obs = df_wide_demo["test"].dropna()
conf_obs = df_wide_demo["conf"].dropna()
test_range = test_obs.max() - test_obs.min() if len(test_obs) > 1 else 1.0
conf_range = conf_obs.max() - conf_obs.min() if len(conf_obs) > 1 else 1.0
test_margin = test_obs.min() - 0.1 * test_range
conf_margin = conf_obs.min() - 0.1 * conf_range

viridis = cm.get_cmap("viridis")
color_present = viridis(0.5)   # "Not Missing"
color_missing = viridis(0.2)   # "Missing"

fig, ax = plt.subplots(figsize=(10, 8))

# --- present-present points ---
mask_pp = df_wide_demo["test"].notna() & df_wide_demo["conf"].notna()
ax.scatter(
    df_wide_demo.loc[mask_pp, "test"],
    df_wide_demo.loc[mask_pp, "conf"],
    color=color_present, alpha=0.4, s=10, label="Present", rasterized=True
)

# --- test missing, conf present: shift test to left margin ---
mask_tm = df_wide_demo["test"].isna() & df_wide_demo["conf"].notna()
ax.scatter(
    [test_margin] * mask_tm.sum(),
    df_wide_demo.loc[mask_tm, "conf"],
    color=color_missing, alpha=0.4, s=10, label="Missing", rasterized=True
)

# --- conf missing, test present: shift conf to bottom margin ---
mask_cm = df_wide_demo["conf"].isna() & df_wide_demo["test"].notna()
ax.scatter(
    df_wide_demo.loc[mask_cm, "test"],
    [conf_margin] * mask_cm.sum(),
    color=color_missing, alpha=0.4, s=10, rasterized=True
)

# reference lines at zero
ax.axvline(0, linestyle="dashed", color="grey", linewidth=0.2)
ax.axhline(0, linestyle="dashed", color="grey", linewidth=0.2)

# comma-formatted tick labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"{x:,.0f}"
))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda y, _: f"{y:,.0f}"
))

ax.set_xlabel("\nTests Conducted")
ax.set_ylabel("Cases Confirmed\n")
ax.set_title("Structural Zeros: Testing vs Confirmation", fontsize=12)
ax.text(
    0.0, -0.12, subtitle_text,
    transform=ax.transAxes, fontsize=9, color="grey",
    verticalalignment="top"
)

handles = [
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_present, markersize=8),
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_missing, markersize=8),
]
ax.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    frameon=True,
    fontsize=9
)

for spine in ax.spines.values():
    spine.set_linewidth(0.7)
plt.tight_layout()
NoteOutput

To adapt the code:

  • Lines 30, 34, 38, 42: Update the variable names (test, conf) in the filter conditions to match the dataset’s variable names of interest.
  • Lines 9-12: Update keep_core and column selection to match the dataset structure.
  • Lines 13-16: Modify the pivot_table() transformation to include the specific indicators to examine for structural relationships.
  • Lines 66, 67: Change the scatter x and y column references to examine different variable pairs (e.g., maltreat and conf).

The plot shows key patterns in missingness between malaria testing (test) and confirmation (conf) counts. The vertical dashed line marks zero tests; the horizontal dashed line marks zero confirmations. This visualization shows there were 1,769 instances where conf is missing but test > 0. These are likely structural zeros, where testing was done but no confirmed cases were recorded. In such cases, missing confirmation counts can reasonably be set to zero.

What to look for in this plot on any dataset:

  • A cluster of “Missing” markers along the y-axis margin with test > 0: tests were recorded but confirmations were left blank. Likely a structural zero; set conf to 0.
  • A cluster of “Missing” markers along the x-axis margin with conf > 0: confirmations were recorded but tests were left blank. Logical inconsistency; flag for review, or impute test from conf and a historical positivity rate.
  • “Missing” markers at the origin: both indicators are missing, which points to reporting failure rather than a structural zero.
  • Asymmetry along one axis: indicator-specific data entry practice. Check whether the form distinguishes 0 from blank.

Summary

This section introduced approaches for visualising and diagnosing missing data in routine surveillance systems. By examining patterns across time, geography, facility type, and indicator relationships, we showed how missingness can be systematically assessed rather than treated as random gaps. The section demonstrated how simple plots and logical checks help distinguish between true missing values and structural reporting issues, providing a clearer understanding of the data’s reliability. These methods establish a foundation for classifying the nature of missingness and for selecting appropriate strategies for addressing them. Structural and statistical imputation techniques are implemented in detail on the Imputation Methods page.

Further Reading

  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. Foundational paper introducing the MCAR / MAR / MNAR framework.
  • van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Chapman & Hall/CRC. Canonical reference for multiple imputation; freely available at https://stefvanbuuren.name/fimd/.
  • WHO. (2017). Data Quality Review (DQR) toolkit. Recommended framework for assessing completeness, timeliness, internal consistency, and external consistency in routine health information system data.
  • Maïga, A. et al. (2019). Generating statistics from health facility data: the state of routine health information systems in Eastern and Southern Africa. BMJ Global Health, 4(5), e001849.
  • naniar package documentation: https://naniar.njtierney.com. Practical guide to missing data visualisation in R, including upset plots and Little’s MCAR test.

Full Code

Find the full code script for missing data detection methods below.

  • R
  • Python
Show full code
################################################################################
################# ~ Missing data detection methods full code ~ #################
################################################################################

### Step 1: Install and Load Libraries -----------------------------------------

# install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load required packages using pacman
pacman::p_load(
  dplyr,        # data manipulation
  ggplot2,      # plotting
  ggtext,       # markdown text in ggplot2
  here,         # file path management
  lubridate,    # date handling
  naniar,       # missing data visualization and analysis
  tidyr,        # data reshaping
  cli,          # clean logging and CLI-style messages
  scales,       # number formatting
  UpSetR,       # upset plots for co-missingness
  wesanderson   # color palettes
)

### Step 2: Load Data ----------------------------------------------------------

# read the processed surveillance data produced by the import workflow
df_routine <- readRDS(
  here::here(
    "01_data",
    "1.2_epidemiology",
    "1.2a_routine_surveillance",
    "processed",
    "clean_malaria_routine_data_final.rds"
  )
) |>
  # drop the import-source label carried over from import.qmd
  dplyr::select(-dplyr::any_of("sheet_admin")) |>
  # filter to the past five years
  dplyr::filter(year >= 2019) |>
  dplyr::rename(maladm_hf_u5 = maladm_u5) |>
  dplyr::mutate(
    # ym: human-readable year-month label used as a discrete x-axis
    ym = format(date, "%Y-%m"),
    # date stays as Date (already Date in the .rds produced by import.qmd)
    date = as.Date(date)
  )

### Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness --------

#### Step 3.1: Conditional missingness within submitted reports ----------------

# define the core indicators that signal a report was submitted
core_inds <- c("test", "conf", "maltreat")

# flag each facility-month as reporting or non-reporting
df_routine <- df_routine |>
  dplyr::mutate(
    report_submitted = dplyr::if_any(
      dplyr::all_of(core_inds),
      ~ !is.na(.x)
    )
  )

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing <- df_routine |>
  dplyr::filter(report_submitted) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(core_inds),
      ~ round(mean(is.na(.x)) * 100, 1)
    )
  ) |>
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = "indicator",
    values_to = "pct_missing_given_reported"
  )

conditional_missing |>
  dplyr::rename(
    Indicator = indicator,
    `% missing given reported` = pct_missing_given_reported
  ) |>
  knitr::kable(align = "lr", digits = 1)

#### Step 3.2: Quantitative summary across variables ---------------------------

# variables to summarise
summary_vars <- c("test", "susp", "pres", "conf", "maltreat")

# compute % missing for the core aggregated indicators
miss_summary <- naniar::miss_var_summary(
  df_routine |>
    dplyr::select(dplyr::all_of(summary_vars))
) |>
  dplyr::mutate(pct_miss = as.numeric(pct_miss))

# bar chart of % missing per variable
miss_summary |>
  dplyr::mutate(
    variable = factor(variable, levels = rev(variable)),
    label_text = paste0(round(pct_miss, 1), "%"),
    inside = pct_miss > max(pct_miss) * 0.15
  ) |>
  ggplot2::ggplot(ggplot2::aes(x = pct_miss, y = variable)) +
  ggplot2::geom_col(fill = "#3B9AB2", width = 0.75) +
  ggplot2::geom_text(
    ggplot2::aes(
      label = label_text,
      hjust = ifelse(inside, 1.15, -0.15),
      colour = ifelse(inside, "white", "grey20")
    ),
    size = 4,
    fontface = "bold",
    family = "sans"
  ) +
  ggplot2::scale_colour_identity() +
  ggplot2::scale_x_continuous(
    expand = ggplot2::expansion(mult = c(0, 0.05)),
    labels = function(x) paste0(x, "%")
  ) +
  ggplot2::labs(
    title = "Proportion of missing data per indicator",
    subtitle = paste(
      "% missing across submitted reports for each",
      "core malaria indicator"
    ),
    x = NULL,
    y = NULL
  ) +
  ggplot2::theme_minimal(base_family = "sans") +
  ggplot2::theme(
    plot.title = ggplot2::element_text(
      size = 16,
      face = "bold",
      margin = ggplot2::margin(b = 4)
    ),
    plot.subtitle = ggplot2::element_text(
      size = 12,
      colour = "grey30",
      margin = ggplot2::margin(b = 14)
    ),
    axis.text.y = ggplot2::element_text(size = 11, colour = "black"),
    axis.text.x = ggplot2::element_text(size = 10, colour = "grey30"),
    panel.grid.major.y = ggplot2::element_blank(),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.x = ggplot2::element_line(
      colour = "grey90",
      size = 0.3
    ),
    axis.ticks = ggplot2::element_blank(),
    plot.margin = ggplot2::margin(15, 30, 10, 10)
  )

### Step 4: Check Missingness over Time ----------------------------------------

# get the variables of interest (aggregated and age-specific)
vars <- c(
  "test",        # aggregated malaria tests
  "test_u5",
  "test_5_14",
  "test_ov15",
  "susp",        # aggregated suspected cases
  "susp_u5",
  "susp_5_14",
  "susp_ov15",
  "pres",        # aggregated presumed cases
  "pres_u5",
  "pres_5_14",
  "pres_ov15",
  "conf",        # aggregated confirmed cases
  "conf_u5",
  "conf_5_14",
  "conf_ov15",
  "maltreat",    # aggregated treatments
  "maltreat_u5",
  "maltreat_5_14",
  "maltreat_ov15"
)

# calculate missing rates by date for each variable
missing_rate_date <- df_routine |>
  dplyr::group_by(date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(vars),
      ~ mean(is.na(.x)) * 100
    ),
    .groups = "drop"
  ) |>
  dplyr::arrange(date) |>
  tidyr::pivot_longer(
    cols = -ym,
    names_to = "variables",
    values_to = "missing_rate"
  ) |>
  # enforce the explicit `vars` order on the y-axis so R and Python match
  dplyr::mutate(variables = factor(variables, levels = vars))

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_date$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across variables
missing_plot_date <- ggplot2::ggplot(
  missing_rate_date,
  ggplot2::aes(
    y = variables,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "Variable",
    x = "",
    title = "The proportion of missing data for selected variables by year-month"
  )

# display the plot
missing_plot_date

### Step 5: Check Missingness over Time and Administrative Level ---------------

# calculate missing rates by year-month and adm2 for each variable
missing_rate_adm2 <- df_routine |>
  dplyr::group_by(ym, adm2) |>
  dplyr::summarise(
    missing_rate = mean(is.na(test_hf_u5)) * 100,
    .groups = "drop"
  )

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_adm2$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across location
missing_plot_adm2 <- ggplot2::ggplot(
  missing_rate_adm2,
  ggplot2::aes(
    y = adm2,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "District",
    x = "",
    title = "The proportion of missing data for test_hf_u5 by year-month and adm2"
  )

# display the plot
missing_plot_adm2

### Step 6: Check Missingness by Health Facility Type and Age Groups -----------

# define core indicators to retain during disaggregation
core_indicators <- c(
  "adm0",
  "adm1",
  "adm2",
  "adm3",
  "hf",
  "hf_uid",
  "year",
  "month",
  "ym",
  "date"
)

# select core indicators and disaggregated variables
df_routine_disagg <- df_routine |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    dplyr::matches("_(com|hf)_(u5|5_14|ov15)$")
  )

# transform to long format with separate facility type and age group columns
df_routine_disagg_long <- df_routine_disagg |>
  # drop maladm_hf_u5: lacks community-level and other age-group
  # counterparts, so it cannot be pivoted into the long format below
  dplyr::select(-maladm_hf_u5) |>
  tidyr::pivot_longer(
    cols = dplyr::matches("_(com|hf)_(u5|5_14|ov15)$"),
    names_to = c("indicator", "facility_type", "age_group"),
    names_pattern = "(.+)_(com|hf)_(u5|5_14|ov15)$",
    values_to = "value"
  ) |>
  # relabel factors
  dplyr::mutate(
    age_group = factor(
      age_group,
      levels = c("u5", "5_14", "ov15"),
      labels = c("Under 5", "5 to 15", "Over 15")
    ),
    facility_type = factor(
      facility_type,
      levels = c("hf", "com"),
      labels = c("Health Facility", "Community")
    )
  )

# preview the disaggregated data structure
df_routine_disagg_long |>
  utils::head() |>
  knitr::kable()

# prepare data for stacked bar chart with both age group and facility type
missing_present_data_facility <- df_routine_disagg_long |>
  dplyr::group_by(indicator, age_group, facility_type) |>
  dplyr::summarise(
    total_records = dplyr::n(),
    missing_count = sum(is.na(value)),
    present_count = sum(!is.na(value)),
    missing_pct = round(missing_count / total_records * 100, 2),
    present_pct = round(present_count / total_records * 100, 2),
    .groups = "drop"
  ) |>
  tidyr::pivot_longer(
    cols = c(missing_pct, present_pct),
    names_to = "status",
    values_to = "percentage",
    names_pattern = "(.+)_pct"
  ) |>
  dplyr::mutate(
    status = factor(
      status,
      levels = c("missing", "present"),
      labels = c("Missing", "Present")
    )
  )

# create stacked bar plot faceted by age group and facility type
age_facility_miss_plot <- missing_present_data_facility |>
  ggplot2::ggplot(ggplot2::aes(
    x = percentage,
    y = factor(age_group, levels = rev(levels(factor(age_group)))),
    fill = status
  )) +
  ggplot2::geom_col(alpha = 0.8) +
  ggplot2::facet_grid(indicator ~ facility_type) +
  ggplot2::scale_fill_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Present", "Missing")
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    x = "\nPercentage (%)",
    y = "Age Group (years)\n",
    fill = "Data Status"
  ) +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10)
  )

# display the plot
age_facility_miss_plot

### Step 7: Visualizing Structural Zeros and Logical Relationships -------------

# transform data to wide format to examine relationships between indicators
df_wide <- df_routine_disagg_long |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    facility_type,
    age_group,
    indicator,
    value
  ) |>
  tidyr::pivot_wider(
    names_from = indicator,
    values_from = value
  )

# calculate comprehensive missing counts for subtitle
n_test_missing_conf_zero <- df_wide |>
  dplyr::filter(conf == 0, is.na(test)) |>
  nrow()

n_conf_missing_test_zero <- df_wide |>
  dplyr::filter(test == 0, is.na(conf)) |>
  nrow()

n_test_na_conf_positive <- df_wide |>
  dplyr::filter(is.na(test), conf > 0) |>
  nrow()

n_conf_na_test_positive <- df_wide |>
  dplyr::filter(is.na(conf), test > 0) |>
  nrow()

# create comprehensive subtitle with all missing count scenarios
subtitle_text <- paste0(
  "N = ",
  scales::comma(n_test_missing_conf_zero),
  " where conf = 0 but test is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_conf_missing_test_zero),
  " where test = 0 but conf is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_test_na_conf_positive),
  " where test is missing but conf > 0 (logical inconsistency); \n",
  "N = ",
  scales::comma(n_conf_na_test_positive),
  " where conf is missing but test > 0 (possible legitimate zero)"
)

# create scatter plot showing relationship between test and confirmed counts
structural_zeros_plot <- ggplot2::ggplot(
  data = df_wide,
  mapping = ggplot2::aes(x = test, y = conf)
) +
  naniar::geom_miss_point() +
  ggplot2::geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::geom_hline(
    yintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_y_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_color_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Not Missing", "Missing"),
    labels = c("Present", "Missing")
  ) +
  ggplot2::guides(
    color = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      override.aes = list(
        size = 5
      ),
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Structural Zeros: Testing vs Confirmation",
    subtitle = subtitle_text,
    x = "\nTests Conducted",
    y = "Cases Confirmed\n",
    color = "Data Status"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    plot.subtitle = ggplot2::element_text(size = 9, color = "grey30")
  )

# display the plot
structural_zeros_plot
Show full code
################################################################################
################# ~ Missing data detection methods full code ~ #################
################################################################################

### Step 1: Install and Load Libraries -----------------------------------------

from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
from pyprojroot import here

# ── cli helpers ──────────────────────────────────────────────────────────────
def cli_header(message):
    print(f"\n{message}")

def cli_info(message):
    print(f"INFO: {message}")

def cli_success(message):
    print(f"SUCCESS: {message}")

def cli_warning(message):
    print(f"WARNING: {message}")

def cli_danger(message):
    print(f"ERROR: {message}")

def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )

### Step 2: Load Data ----------------------------------------------------------

from pathlib import Path

import pandas as pd
from pyprojroot import here

# read the processed surveillance data produced by the import workflow.
# the .rds (loaded by R) and .parquet (loaded here) are written from the
# same dhis2_df in import.qmd, so both languages see identical data.
data_path = Path(here(
    "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed/"
    "clean_malaria_routine_data_final.parquet"
))

df_routine = (
    pd.read_parquet(data_path)
    # drop the import-source label carried over from import.qmd
    .drop(columns="sheet_admin", errors="ignore")
    # filter to the past five years
    .loc[lambda d: d["year"] >= 2019]
    .rename(columns={"maladm_u5": "maladm_hf_u5"})
    .assign(date=lambda d: pd.to_datetime(d["date"], errors="coerce"))
    .assign(ym=lambda d: d["date"].dt.strftime("%Y-%m"))
    .reset_index(drop=True)
)

### Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness --------

#### Step 3.1: Conditional missingness within submitted reports ----------------

import pandas as pd

# define the core indicators that signal a report was submitted
core_inds = ["test", "conf", "maltreat"]

# flag each facility-month as reporting or non-reporting
df_routine = df_routine.assign(
    report_submitted=lambda d: d[core_inds].notna().any(axis=1)
)

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing = (
    df_routine.loc[df_routine["report_submitted"]]
    [core_inds]
    .isna()
    .mean()
    .mul(100)
    .round(1)
    .rename_axis("indicator")
    .reset_index(name="pct_missing_given_reported")
)

conditional_missing

#### Step 3.2: Quantitative summary across variables ---------------------------

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

# tabular summary of % missing for the core aggregated indicators
summary_vars = ["test", "susp", "pres", "conf", "maltreat"]

miss_summary = (
    pd.DataFrame({
        "variable": summary_vars,
        "n_miss": [df_routine[v].isna().sum() for v in summary_vars],
        "pct_miss": [
            round(df_routine[v].isna().mean() * 100, 1)
            for v in summary_vars
        ],
    })
    .sort_values("pct_miss", ascending=False)
    .reset_index(drop=True)
)

# bar chart of % missing per variable (horizontal, sorted descending)
vmax = float(miss_summary["pct_miss"].max())
threshold = vmax * 0.15

fig, ax = plt.subplots(figsize=(9, 5.2))
bars = ax.barh(
    miss_summary["variable"],
    miss_summary["pct_miss"],
    color="#3B9AB2",
    height=0.75
)
ax.invert_yaxis()

# value labels: inside the bar (white) when there is room,
# outside (grey) for short bars
for bar, val in zip(bars, miss_summary["pct_miss"]):
    label = f"{round(val, 1)}%"
    if val > threshold:
        ax.text(
            val - vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="right",
            color="white",
            fontsize=11,
            fontweight="bold"
        )
    else:
        ax.text(
            val + vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="left",
            color="#333333",
            fontsize=11,
            fontweight="bold"
        )

# title (large, bold) + subtitle (smaller, grey)
fig.suptitle(
    "Proportion of missing data per indicator",
    fontsize=16,
    fontweight="bold",
    x=0.05,
    ha="left",
    y=0.98
)
ax.set_title(
    "% missing across submitted reports for each core malaria indicator",
    fontsize=12,
    color="#555555",
    loc="left",
    pad=8
)

ax.set_xlabel("")
ax.set_ylabel("")
ax.xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f"{int(x)}%")
)
ax.tick_params(left=False, labelsize=11, colors="#333333")
ax.tick_params(axis="x", colors="#555555", labelsize=10)

# clean look: no border, only faint x gridlines
for spine in ax.spines.values():
    spine.set_visible(False)
ax.grid(axis="x", color="#E5E5E5", linewidth=0.6)
ax.grid(axis="y", visible=False)
ax.set_axisbelow(True)
ax.set_xlim(0, max(vmax * 1.05, 10))

plt.tight_layout(rect=[0, 0, 1, 0.94])

### Step 4: Check Missingness over Time ----------------------------------------

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# variables of interest (aggregated and age-specific)
vars_of_interest = [
    "test",           # aggregated malaria tests
    "test_u5",
    "test_5_14",
    "test_ov15",
    "susp",           # aggregated suspected cases
    "susp_u5",
    "susp_5_14",
    "susp_ov15",
    "pres",           # aggregated presumed cases
    "pres_u5",
    "pres_5_14",
    "pres_ov15",
    "conf",           # aggregated confirmed cases
    "conf_u5",
    "conf_5_14",
    "conf_ov15",
    "maltreat",       # aggregated treatments
    "maltreat_u5",
    "maltreat_5_14",
    "maltreat_ov15",
]

# calculate missing rates by year-month for each variable
missing_rate_date = (
    df_routine.groupby("ym", as_index=False)[vars_of_interest]
    .apply(lambda g: g[vars_of_interest].isna().mean() * 100)
    .reset_index()
    .rename(columns={"level_0": "ym"})
)
# rebuild from groupby to preserve ym correctly
_grouped = df_routine.groupby("ym")
missing_rate_date = pd.DataFrame(
    {v: _grouped[v].apply(lambda s: s.isna().mean() * 100)
     for v in vars_of_interest}
).reset_index()

# pivot to long format
missing_rate_long = missing_rate_date.melt(
    id_vars="ym",
    value_vars=vars_of_interest,
    var_name="variables",
    value_name="missing_rate"
).sort_values("ym")

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_long["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap (mirrors wesanderson::wes_palette("Zissou1", 100))
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

# pivot back to matrix for pcolormesh
ym_order = sorted(missing_rate_long["ym"].unique())
var_order = vars_of_interest  # top-to-bottom order on y-axis

pivot = (
    missing_rate_long
    .pivot(index="variables", columns="ym", values="missing_rate")
    .reindex(index=var_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(var_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

# x-axis ticks and labels
ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

# y-axis ticks and labels
ax.set_yticks(np.arange(len(var_order)) + 0.5)
ax.set_yticklabels(var_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("Variable")
ax.set_title(
    "The proportion of missing data for selected variables by year-month",
    fontsize=12, pad=10
)

# horizontal colorbar at the bottom
cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

# remove grid lines
ax.grid(False)
plt.tight_layout()

### Step 5: Check Missingness over Time and Administrative Level ---------------

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# calculate missing rates by year-month and adm2 for test_hf_u5
missing_rate_adm2 = (
    df_routine.groupby(["ym", "adm2"], as_index=False)
    .agg(missing_rate=("test_hf_u5", lambda s: s.isna().mean() * 100))
)

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_adm2["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

ym_order = sorted(missing_rate_adm2["ym"].unique())
adm2_order = sorted(missing_rate_adm2["adm2"].unique())

pivot = (
    missing_rate_adm2
    .pivot(index="adm2", columns="ym", values="missing_rate")
    .reindex(index=adm2_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(adm2_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

ax.set_yticks(np.arange(len(adm2_order)) + 0.5)
ax.set_yticklabels(adm2_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("District")
ax.set_title(
    "The proportion of missing data for test_hf_u5 by year-month and adm2",
    fontsize=12, pad=10
)

cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

ax.grid(False)
plt.tight_layout()

### Step 6: Check Missingness by Health Facility Type and Age Groups -----------

import re

import pandas as pd

# define core columns to retain during disaggregation
core_indicators = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid",
    "year", "month", "ym", "date"
]

# select core indicators and disaggregated columns
disagg_pattern = re.compile(r".+_(com|hf)_(u5|5_14|ov15)$")
disagg_cols = [c for c in df_routine.columns if disagg_pattern.match(c)]
keep_core = [c for c in core_indicators if c in df_routine.columns]

df_routine_disagg = df_routine[keep_core + disagg_cols].copy()

# drop maladm_hf_u5: lacks community-level and other age-group counterparts
if "maladm_hf_u5" in df_routine_disagg.columns:
    df_routine_disagg = df_routine_disagg.drop(columns=["maladm_hf_u5"])

# re-identify disaggregated columns after the drop
disagg_cols_final = [
    c for c in df_routine_disagg.columns
    if c not in keep_core
]

# transform to long format with separate facility_type and age_group columns
def _parse_col(col):
    """Extract (indicator, facility_type, age_group) from column name."""
    m = re.match(r"^(.+)_(com|hf)_(u5|5_14|ov15)$", col)
    if m:
        return m.group(1), m.group(2), m.group(3)
    return None, None, None

long_rows = []
for col in disagg_cols_final:
    ind, ftype, age = _parse_col(col)
    if ind is None:
        continue
    tmp = df_routine_disagg[keep_core + [col]].copy()
    tmp = tmp.rename(columns={col: "value"})
    tmp["indicator"] = ind
    tmp["facility_type"] = ftype
    tmp["age_group"] = age
    long_rows.append(tmp)

df_routine_disagg_long = pd.concat(long_rows, ignore_index=True)

# relabel factors to match R output
age_order = ["u5", "5_14", "ov15"]
age_labels = ["Under 5", "5 to 15", "Over 15"]
ftype_order = ["hf", "com"]
ftype_labels = ["Health Facility", "Community"]

df_routine_disagg_long["age_group"] = pd.Categorical(
    df_routine_disagg_long["age_group"].map(
        dict(zip(age_order, age_labels))
    ),
    categories=age_labels,
    ordered=True
)
df_routine_disagg_long["facility_type"] = pd.Categorical(
    df_routine_disagg_long["facility_type"].map(
        dict(zip(ftype_order, ftype_labels))
    ),
    categories=ftype_labels,
    ordered=True
)

# preview the disaggregated data structure as an HTML table
df_routine_disagg_long.head().style.hide(axis="index")

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# prepare data for stacked bar chart with age group and facility type
missing_present_data_facility = (
    df_routine_disagg_long
    .groupby(["indicator", "age_group", "facility_type"], observed=True,
             as_index=False)
    .agg(
        total_records=("value", "count"),
        missing_count=("value", lambda s: s.isna().sum()),
        present_count=("value", lambda s: s.notna().sum()),
    )
    .assign(
        missing_pct=lambda d: (d["missing_count"] / d["total_records"] * 100)
                              .round(2),
        present_pct=lambda d: (d["present_count"] / d["total_records"] * 100)
                              .round(2),
    )
)

# pivot to long format for status column
mp_long = missing_present_data_facility.melt(
    id_vars=["indicator", "age_group", "facility_type"],
    value_vars=["missing_pct", "present_pct"],
    var_name="status",
    value_name="percentage"
).assign(
    status=lambda d: pd.Categorical(
        d["status"].map({"missing_pct": "Missing", "present_pct": "Present"}),
        categories=["Missing", "Present"],
        ordered=True
    )
)

# viridis D palette: begin=0.2, end=0.5 for two categories
import matplotlib.cm as cm
viridis = cm.get_cmap("viridis")
color_missing = viridis(0.2)
color_present = viridis(0.5)
status_colors = {"Missing": color_missing, "Present": color_present}

indicators = list(mp_long["indicator"].unique())
facility_types = list(mp_long["facility_type"].cat.categories)
age_labels_ordered = list(mp_long["age_group"].cat.categories)
age_rev = list(reversed(age_labels_ordered))

n_ind = len(indicators)
n_ftype = len(facility_types)

fig, axes = plt.subplots(
    n_ind, n_ftype,
    figsize=(12, 8),
    sharex=True,
    sharey="row"
)

for i, ind in enumerate(indicators):
    for j, ftype in enumerate(facility_types):
        ax = axes[i][j] if n_ind > 1 else axes[j]
        subset = mp_long.loc[
            (mp_long["indicator"] == ind) &
            (mp_long["facility_type"] == ftype)
        ]
        for status in ["Present", "Missing"]:
            s_sub = subset.loc[subset["status"] == status]
            s_sub = s_sub.set_index("age_group").reindex(age_rev)
            ax.barh(
                s_sub.index,
                s_sub["percentage"],
                color=status_colors[status],
                alpha=0.8,
                label=status
            )
        ax.set_xlim(0, 100)
        ax.set_xlabel("")
        ax.set_ylabel("")
        # panel border
        for spine in ax.spines.values():
            spine.set_linewidth(0.7)
        # strip labels
        if i == 0:
            ax.set_title(str(ftype), fontsize=9, fontweight="bold")
        if j == n_ftype - 1:
            ax.yaxis.set_label_position("right")
            ax.set_ylabel(str(ind), rotation=270, labelpad=12,
                          fontsize=9, fontweight="bold")
        # no grid
        ax.grid(False)

# shared axis labels
fig.supxlabel("Percentage (%)", y=0.02, fontsize=10)
fig.supylabel("Age Group (years)", x=0.02, fontsize=10)
fig.suptitle(
    "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    fontsize=12, y=1.01
)

# shared legend
handles = [
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Present"], alpha=0.8),
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Missing"], alpha=0.8),
]
fig.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    bbox_to_anchor=(0.0, -0.04),
    ncol=2,
    frameon=False,
    fontsize=9
)
plt.tight_layout()

### Step 7: Visualizing Structural Zeros and Logical Relationships -------------

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd

# transform to wide format to examine relationships between indicators
keep_core = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid", "year", "month", "ym", "date"
]
keep_cols = [c for c in keep_core if c in df_routine_disagg_long.columns]

df_wide = (
    df_routine_disagg_long[keep_cols + ["facility_type", "age_group",
                                         "indicator", "value"]]
    .pivot_table(
        index=keep_cols + ["facility_type", "age_group"],
        columns="indicator",
        values="value",
        aggfunc="first",
        observed=True
    )
    .reset_index()
)
df_wide.columns.name = None

# calculate missing counts for subtitle
n_test_missing_conf_zero = df_wide.loc[
    (df_wide["conf"] == 0) & df_wide["test"].isna()
].shape[0]

n_conf_missing_test_zero = df_wide.loc[
    (df_wide["test"] == 0) & df_wide["conf"].isna()
].shape[0]

n_test_na_conf_positive = df_wide.loc[
    df_wide["test"].isna() & (df_wide["conf"] > 0)
].shape[0]

n_conf_na_test_positive = df_wide.loc[
    df_wide["conf"].isna() & (df_wide["test"] > 0)
].shape[0]

subtitle_text = (
    f"N = {n_test_missing_conf_zero:,} where conf = 0 but test is missing "
    f"(legitimate zero); \n"
    f"N = {n_conf_missing_test_zero:,} where test = 0 but conf is missing "
    f"(legitimate zero); \n"
    f"N = {n_test_na_conf_positive:,} where test is missing but conf > 0 "
    f"(logical inconsistency); \n"
    f"N = {n_conf_na_test_positive:,} where conf is missing but test > 0 "
    f"(possible legitimate zero)"
)

# create demo data matching the R render chunk:
# rows 400-600 where conf is NA get test = 0 (legitimate zero pattern)
# rows 490-900 where test is NA get conf = 0 (legitimate zero pattern)
df_wide_demo = df_wide.copy().reset_index(drop=True)
mask1 = (df_wide_demo.index.isin(range(400, 601))) & df_wide_demo["conf"].isna()
df_wide_demo.loc[mask1, "test"] = 0.0
mask2 = (df_wide_demo.index.isin(range(490, 901))) & df_wide_demo["test"].isna()
df_wide_demo.loc[mask2, "conf"] = 0.0

# compute margin offsets for miss_point behavior
test_obs = df_wide_demo["test"].dropna()
conf_obs = df_wide_demo["conf"].dropna()
test_range = test_obs.max() - test_obs.min() if len(test_obs) > 1 else 1.0
conf_range = conf_obs.max() - conf_obs.min() if len(conf_obs) > 1 else 1.0
test_margin = test_obs.min() - 0.1 * test_range
conf_margin = conf_obs.min() - 0.1 * conf_range

viridis = cm.get_cmap("viridis")
color_present = viridis(0.5)   # "Not Missing"
color_missing = viridis(0.2)   # "Missing"

fig, ax = plt.subplots(figsize=(10, 8))

# --- present-present points ---
mask_pp = df_wide_demo["test"].notna() & df_wide_demo["conf"].notna()
ax.scatter(
    df_wide_demo.loc[mask_pp, "test"],
    df_wide_demo.loc[mask_pp, "conf"],
    color=color_present, alpha=0.4, s=10, label="Present", rasterized=True
)

# --- test missing, conf present: shift test to left margin ---
mask_tm = df_wide_demo["test"].isna() & df_wide_demo["conf"].notna()
ax.scatter(
    [test_margin] * mask_tm.sum(),
    df_wide_demo.loc[mask_tm, "conf"],
    color=color_missing, alpha=0.4, s=10, label="Missing", rasterized=True
)

# --- conf missing, test present: shift conf to bottom margin ---
mask_cm = df_wide_demo["conf"].isna() & df_wide_demo["test"].notna()
ax.scatter(
    df_wide_demo.loc[mask_cm, "test"],
    [conf_margin] * mask_cm.sum(),
    color=color_missing, alpha=0.4, s=10, rasterized=True
)

# reference lines at zero
ax.axvline(0, linestyle="dashed", color="grey", linewidth=0.2)
ax.axhline(0, linestyle="dashed", color="grey", linewidth=0.2)

# comma-formatted tick labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"{x:,.0f}"
))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda y, _: f"{y:,.0f}"
))

ax.set_xlabel("\nTests Conducted")
ax.set_ylabel("Cases Confirmed\n")
ax.set_title("Structural Zeros: Testing vs Confirmation", fontsize=12)
ax.text(
    0.0, -0.12, subtitle_text,
    transform=ax.transAxes, fontsize=9, color="grey",
    verticalalignment="top"
)

handles = [
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_present, markersize=8),
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_missing, markersize=8),
]
ax.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    frameon=True,
    fontsize=9
)

for spine in ax.spines.values():
    spine.set_linewidth(0.7)
plt.tight_layout()
 

©2026 Applied Health Analytics for Delivery and Innovation. All rights reserved