Missing data detection methods

Overview

Missing data is a persistent challenge in public health datasets and is particularly relevant to the SNT process, where multiple data sources must be integrated across time, administrative levels, and reporting platforms. The completeness and consistency of these data underpin risk stratification, resource allocation, and the evaluation of intervention impact.

This challenge is most acute in sub-Saharan Africa, where 95% of global malaria deaths occur and health information systems face the greatest structural pressures. In routine systems such as DHIS2, which serve as the backbone for malaria reporting across the region, missing data may follow recognizable patterns linked to health system functioning, though some degree of randomness can also be present. Not all missingness is due to error, and not all of it should or can be imputed.

Gaps in DHIS2 data are shaped by underlying structural constraints. These include internet outages, delayed uploads, limited supervision, register stockouts, staff shortages, and unclear handling of zeros versus missing values. These issues contribute to uneven reporting patterns across facility types and seasons. Hospitals tend to report less consistently than health posts, as seen in Senegal, where health posts maintained higher reporting completeness than hospitals over multiple years. Reporting is also more complete during the rainy season, with some months showing a 7.5% increase in submissions (Senegal). In Ghana, outpatient totals were frequently submitted while key malaria treatment indicators were left blank, creating selective gaps. Similarly, in Nigeria’s Gombe State, nearly 40% of expected monthly records were incomplete, and intervention-related indicators like IPTp showed disproportionately high rates of missingness. During campaigns or periods of high workload, parallel reporting systems and competing priorities further increase the risk of missing or conflicting data, as shown in Ethiopia and Tanzania, where fragmented systems and unclear use of blanks weakened data reliability.

These systematic gaps create analytical blind spots that compromise SNT outputs. Areas with weak surveillance may appear to have lower burden due to underreporting, while high-burden areas with poor connectivity can be underrepresented in resource allocation. If not addressed, missing data bias estimates, distort risk stratification, misrepresent coverage, and weaken the link between inputs and outcomes. They also reduce the stability of risk maps and the reliability of impact projections.

At the same time, not all missing data should be imputed. Some gaps are due to non-active HF status, such as a facility not yet operational. Others reflect values missing for reasons that cannot be inferred from the available data. In these cases, imputation is not appropriate. We must first understand why data are missing. This involves assessing whether the missingness is completely at random, related to other observed variables, or related to the missing values themselves.

Given the complexity of these decisions, dropping incomplete records is rarely justified. Understanding the cause and patterns of missingness is necessary for determining appropriate analytical approaches, whether that involves excluding certain records, applying logical inference, or implementing statistical methods that preserve the spatial and temporal structure of the data.

Objectives

Learn systematic approaches to detect and visualize missing data patterns in surveillance systems
Examine missingness across temporal, spatial, and demographic dimensions
Identify structural zeros, legitimate zeros, and logical inconsistencies in related indicators
Distinguish between MCAR, MAR, and MNAR missing data mechanisms
Apply diagnostic visualizations to assess data quality and reporting completeness
Generate comprehensive missing data assessments to inform analytical decisions

Understanding Missing Data

Before proceeding with any analytical approach, it is important to understand the types of missingness present in the dataset, and to identify the most appropriate methods for addressing each type, whether through exclusion, clinical logic, or statistical techniques.

Understanding Types of Missing Data

When working with incomplete data, it’s important to understand why values are missing. This determines whether imputation is appropriate.

Key patterns to examine to help determine the type of missingness

Temporal gaps: Are some months or seasons consistently underreported?
Spatial gaps: Are there consistent patterns across districts or regions? When facility data are sparse, higher-level patterns can inform analytical approaches using neighbouring facilities of the same type.
Facility-level gaps: Do reporting rates vary by facility type (e.g., hospitals vs health posts), ownership, or service volume?
Demographic gaps: Are data more frequently missing for specific age groups (e.g., under-fives, adolescents, pregnant women)?
Variable-level gaps: Are certain indicators (e.g., IPTp3, ACT availability) more frequently left blank?

These patterns help identify where missingness occurs and why it occurs. Understanding these patterns is the first step in diagnosing whether missingness is Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR).

Once patterns are identified, use them to assess the mechanism behind the missing values. There are three main types of missingness, and understanding which type we are dealing with helps us make informed analytical decisions about how to handle the missing data.

Not all gaps in the data reflect missingness. Some values are absent because they were never expected to be present in the first place. These reflect non-active HF status, not missing data. For example, a facility may not have been operational during a particular period, or may never have reported on a specific indicator. In such cases, the absence of data should be marked as not applicable and excluded from any imputation procedure.

Missing Completely At Random (MCAR)

What it means: The data are missing for no particular reason. The missingness is unrelated to anything in the dataset.

Example: Suppose some monthly reports are missing because a stack of paper forms was lost in a flood. The flood didn’t target specific facilities or months, it was random. In this case, the missing data is MCAR.

Implication: Imputation is generally acceptable for MCAR data using simple methods. However, we can safely analyze the non-missing data. It is still representative of the whole.

Missing At Random (MAR)

What it means: The missingness is related to something else in the dataset, but not to the missing value itself.

Example: Imagine that reports of confirmed malaria cases are more likely to be missing from hospitals than health posts. But we know which facilities are hospitals and which are health posts. So the missingness is not random, but it is explainable based on another variable facility type.

Implication: We can use the information we have (like facility type, month, or region) to predict and impute the missing values. Most imputation methods assume MAR.

Missing Not At Random (MNAR)

What it means: The missingness is related to the value that is missing, and it cannot be explained using other information in the data.

Example: Suppose some facilities do not report malaria deaths because they fear it reflects badly on their performance. The missingness is not random and not explained by any variable in the dataset. We do not know which deaths are missing, and we cannot reliably estimate them.

Implication: Imputation is risky. The data are biased in a way that cannot be corrected without more information.

Type	What Causes It	Can Be Imputed?	Example
MCAR	Random external factors unrelated to any variable in the dataset	Yes	Monthly reports lost in a flood, no systematic pattern by location, time, or indicator
MAR	Missingness depends on observed variables	Yes	Hospitals report less consistently than health posts, but facility type is known and recorded
MNAR	Missingness depends on the unobserved value itself	Not recommended	Facilities underreport malaria deaths to avoid scrutiny; missingness cannot be explained using other variables

Consult with SNT team

If the missingness mechanism is unclear (MCAR, MAR, or MNAR), consult the SNT team. Patterns that seem technical may reflect known operational issues (e.g., stockouts, parallel reporting during campaigns). Getting their input early can prevent incorrect imputation choices later.

Step-by-Step

This step-by-step section shows how to systematically detect missingness patterns using our Sierra Leone DHIS2 data (previously assembled in the DHIS2 data preprocessing page). It builds on the principles outlined above, emphasizing the importance of detecting and understanding missing data patterns.

The examples focus on identifying temporal, spatial, and systematic patterns in missingness that inform our classification decisions. Understanding these patterns is necessary before proceeding to the Imputation Methods page, where we implement structural and statistical imputation techniques.

Step 1: Install and Load Libraries

First, install and/or load the necessary packages for data manipulation, visualization, and missing data detection.

# install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load required packages using pacman
pacman::p_load(
  dplyr,        # data manipulation
  ggplot2,      # plotting
  ggtext,       # markdown text in ggplot2
  here,         # file path management
  lubridate,    # date handling
  naniar,       # missing data visualization and analysis
  tidyr,        # data reshaping
  cli,          # clean logging and CLI-style messages
  scales,       # number formatting
  UpSetR,       # upset plots for co-missingness
  wesanderson   # color palettes
)

To adapt the code:

Do not modify anything in the code above

Show the code

from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
from pyprojroot import here

# ── cli helpers ──────────────────────────────────────────────────────────────
def cli_header(message):
    print(f"\n{message}")

def cli_info(message):
    print(f"INFO: {message}")

def cli_success(message):
    print(f"SUCCESS: {message}")

def cli_warning(message):
    print(f"WARNING: {message}")

def cli_danger(message):
    print(f"ERROR: {message}")

def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )

To adapt the code:

Do not modify anything in the code above

Step 2: Load Data

Load the DHIS2 surveillance data and prepare it for missing data detection.

In this example, we focus on the past five years (the data stops in December 2023), so filter out anything before 2019.

Show the code

# read the processed surveillance data produced by the import workflow
df_routine <- readRDS(
  here::here(
    "01_data",
    "1.2_epidemiology",
    "1.2a_routine_surveillance",
    "processed",
    "clean_malaria_routine_data_final.rds"
  )
) |>
  # drop the import-source label carried over from import.qmd
  dplyr::select(-dplyr::any_of("sheet_admin")) |>
  # filter to the past five years
  dplyr::filter(year >= 2019) |>
  dplyr::rename(maladm_hf_u5 = maladm_u5) |>
  dplyr::mutate(
    # ym: human-readable year-month label used as a discrete x-axis
    ym = format(date, "%Y-%m"),
    # date stays as Date (already Date in the .rds produced by import.qmd)
    date = as.Date(date)
  )

Output

record_id	adm0	adm1	adm2	adm3	hf	hf_uid	location_short	location_full	facility_type	date	yearmon	year	month	periodname	allout_u5	allout_ov5	maladm_hf_u5	maladm_5_14	maladm_ov15	maldth_u5	maldth_5_14	maldth_ov15	susp_u5_hf	susp_5_14_hf	susp_ov15_hf	susp_u5_com	susp_5_14_com	susp_ov15_com	maldth_fem_ov15	maldth_mal_ov15	maldth_1_59m	maldth_10_14	maldth_5_9	tes_neg_rdt_u5_com	tes_pos_rdt_u5_com	tes_neg_rdt_5_14_com	tes_pos_rdt_5_14_com	tes_neg_rdt_ov15_com	tes_pos_rdt_ov15_com	test_neg_mic_u5_hf	test_pos_mic_u5_hf	test_neg_mic_5_14_hf	test_pos_mic_5_14_hf	test_neg_mic_ov15_hf	test_pos_mic_ov15_hf	tes_neg_rdt_u5_hf	tes_pos_rdt_u5_hf	tes_neg_rdt_5_14_hf	tes_pos_rdt_5_14_hf	tes_neg_rdt_ov15_hf	tes_pos_rdt_ov15_hf	maltreat_u24_u5_com	maltreat_ov24_u5_com	maltreat_u24_5_14_com	maltreat_ov24_5_14_com	maltreat_u24_ov15_com	maltreat_ov24_ov15_com	maltreat_u24_u5_hf	maltreat_ov24_u5_hf	maltreat_u24_5_14_hf	maltreat_ov24_5_14_hf	maltreat_u24_ov15_hf	maltreat_ov24_ov15_hf	hf_uid_new	hf_clean	allout	susp	test_hf	test_com	test	conf_hf	conf_com	conf	maltreat_com	maltreat_hf	maltreat	pres_com	pres	maladm	maldth	test_hf_u5	test_hf_5_14	test_hf_ov15	test_com_u5	test_com_5_14	test_com_ov15	test_u5	test_5_14	test_ov15	susp_hf_u5	susp_hf_5_14	susp_hf_ov15	susp_com_u5	susp_com_5_14	susp_com_ov15	susp_u5	susp_5_14	susp_ov15	conf_hf_u5	conf_hf_5_14	conf_hf_ov15	conf_com_u5	conf_com_5_14	conf_com_ov15	conf_u5	conf_5_14	conf_ov15	maltreat_hf_u5	maltreat_hf_5_14	maltreat_hf_ov15	maltreat_com_u5	maltreat_com_5_14	maltreat_com_ov15	maltreat_u5	maltreat_5_14	maltreat_ov15	maltreat_u24_hf	maltreat_ov24_hf	maltreat_u24_com	maltreat_ov24_com	maltreat_u24_total	maltreat_ov24_total	pres_com_u5	pres_com_5_14	pres_com_ov15	pres_u5	pres_5_14	pres_ov15	ym
138a1a87	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-01-01	Jan 2019	2019	1	January 2019	101	132	NA	NA	NA	NA	NA	NA	95	24	44	69	24	44	NA	NA	NA	NA	NA	31	38	6	18	20	25	NA	NA	NA	NA	NA	NA	51	44	6	18	20	24	35	9	NA	NA	24	NA	35	9	15	3	16	8	hf_uid_new::1012	Althel CHP	233	300	163	138	301	86	81	167	68	86	154	0	0	NA	NA	95	24	44	69	24	45	164	48	89	95	24	44	69	24	44	164	48	88	44	18	24	38	18	25	82	36	49	44	18	24	44	NA	24	88	18	48	66	20	59	9	125	29	6	18	0	6	18	0	2019-01
ed6d6325	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-02-01	Feb 2019	2019	2	February 2019	86	51	NA	NA	NA	NA	NA	NA	82	16	35	62	16	35	NA	NA	NA	NA	NA	20	42	7	9	7	28	NA	NA	NA	NA	NA	NA	25	57	7	9	7	28	65	NA	NA	NA	6	NA	40	17	9	NA	22	6	hf_uid_new::1012	Althel CHP	137	246	133	113	246	94	79	173	71	94	165	0	0	NA	NA	82	16	35	62	16	35	144	32	70	82	16	35	62	16	35	144	32	70	57	9	28	42	9	28	99	18	56	57	9	28	65	NA	6	122	9	34	71	23	71	NA	142	23	23	9	0	23	9	0	2019-02
410e3ba3	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-03-01	Mar 2019	2019	3	March 2019	149	132	NA	NA	NA	NA	NA	NA	106	22	38	69	22	38	NA	NA	NA	NA	NA	27	42	10	12	15	23	NA	NA	NA	NA	NA	NA	40	66	10	12	15	23	66	NA	12	NA	23	NA	66	NA	12	NA	23	NA	hf_uid_new::1012	Althel CHP	281	295	166	129	295	101	77	178	101	101	202	24	24	NA	NA	106	22	38	69	22	38	175	44	76	106	22	38	69	22	38	175	44	76	66	12	23	42	12	23	108	24	46	66	12	23	66	12	23	132	24	46	101	NA	101	NA	202	NA	24	0	0	24	0	0	2019-03
0c0fef59	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-04-01	Apr 2019	2019	4	April 2019	99	80	NA	NA	NA	NA	NA	NA	74	15	32	49	15	32	NA	NA	NA	NA	NA	13	37	7	8	16	16	NA	NA	NA	NA	NA	NA	16	58	3	12	16	16	58	NA	8	NA	16	NA	58	NA	12	NA	16	NA	hf_uid_new::1012	Althel CHP	179	217	121	97	218	86	61	147	82	86	168	21	21	NA	NA	74	15	32	50	15	32	124	30	64	74	15	32	49	15	32	123	30	64	58	12	16	37	8	16	95	20	32	58	12	16	58	8	16	116	20	32	86	NA	82	NA	168	NA	21	0	0	21	0	0	2019-04
c3e6f1fa	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-05-01	May 2019	2019	5	May 2019	96	73	NA	NA	NA	NA	NA	NA	84	38	16	49	38	16	NA	NA	NA	NA	NA	16	65	13	25	6	10	NA	NA	NA	NA	NA	NA	19	65	13	25	6	10	46	19	19	6	10	4	46	19	19	6	10	NA	hf_uid_new::1012	Althel CHP	169	241	138	135	273	100	100	200	104	100	204	4	4	NA	NA	84	38	16	81	38	16	165	76	32	84	38	16	49	38	16	133	76	32	65	25	10	65	25	10	130	50	20	65	25	10	65	25	14	130	50	24	75	25	75	29	150	54	0	0	4	0	0	4	2019-05
d620d761	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-06-01	Jun 2019	2019	6	June 2019	80	48	NA	NA	NA	NA	NA	NA	105	22	19	72	13	19	NA	NA	NA	NA	NA	17	55	9	13	7	12	NA	NA	NA	NA	NA	NA	22	83	9	13	7	12	83	NA	13	NA	12	NA	83	NA	13	NA	12	NA	hf_uid_new::1012	Althel CHP	128	250	146	113	259	108	80	188	108	108	216	28	28	NA	NA	105	22	19	72	22	19	177	44	38	105	22	19	72	13	19	177	35	38	83	13	12	55	13	12	138	26	24	83	13	12	83	13	12	166	26	24	108	NA	108	NA	216	NA	28	0	0	28	0	0	2019-06

To adapt the code:

Lines 2-8: Update the path to point at the processed routine surveillance file produced by the import workflow (see the DHIS2 data preprocessing page).
Line 14: Filter the data to the period of interest.
Line 15: Drop the dplyr::rename() line if the dataset already uses the target column name.

Show the code

from pathlib import Path

import pandas as pd
from pyprojroot import here

# read the processed surveillance data produced by the import workflow.
# the .rds (loaded by R) and .parquet (loaded here) are written from the
# same dhis2_df in import.qmd, so both languages see identical data.
data_path = Path(here(
    "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed/"
    "clean_malaria_routine_data_final.parquet"
))

df_routine = (
    pd.read_parquet(data_path)
    # drop the import-source label carried over from import.qmd
    .drop(columns="sheet_admin", errors="ignore")
    # filter to the past five years
    .loc[lambda d: d["year"] >= 2019]
    .rename(columns={"maladm_u5": "maladm_hf_u5"})
    .assign(date=lambda d: pd.to_datetime(d["date"], errors="coerce"))
    .assign(ym=lambda d: d["date"].dt.strftime("%Y-%m"))
    .reset_index(drop=True)
)

Output

record_id	adm0	adm1	adm2	adm3	hf	hf_uid	location_short	location_full	facility_type	date	yearmon	year	month	periodname	allout_u5	allout_ov5	maladm_hf_u5	maladm_5_14	maladm_ov15	maldth_u5	maldth_5_14	maldth_ov15	susp_u5_hf	susp_5_14_hf	susp_ov15_hf	susp_u5_com	susp_5_14_com	susp_ov15_com	maldth_fem_ov15	maldth_mal_ov15	maldth_1_59m	maldth_10_14	maldth_5_9	tes_neg_rdt_u5_com	tes_pos_rdt_u5_com	tes_neg_rdt_5_14_com	tes_pos_rdt_5_14_com	tes_neg_rdt_ov15_com	tes_pos_rdt_ov15_com	test_neg_mic_u5_hf	test_pos_mic_u5_hf	test_neg_mic_5_14_hf	test_pos_mic_5_14_hf	test_neg_mic_ov15_hf	test_pos_mic_ov15_hf	tes_neg_rdt_u5_hf	tes_pos_rdt_u5_hf	tes_neg_rdt_5_14_hf	tes_pos_rdt_5_14_hf	tes_neg_rdt_ov15_hf	tes_pos_rdt_ov15_hf	maltreat_u24_u5_com	maltreat_ov24_u5_com	maltreat_u24_5_14_com	maltreat_ov24_5_14_com	maltreat_u24_ov15_com	maltreat_ov24_ov15_com	maltreat_u24_u5_hf	maltreat_ov24_u5_hf	maltreat_u24_5_14_hf	maltreat_ov24_5_14_hf	maltreat_u24_ov15_hf	maltreat_ov24_ov15_hf	hf_uid_new	hf_clean	allout	susp	test_hf	test_com	test	conf_hf	conf_com	conf	maltreat_com	maltreat_hf	maltreat	pres_com	pres	maladm	maldth	test_hf_u5	test_hf_5_14	test_hf_ov15	test_com_u5	test_com_5_14	test_com_ov15	test_u5	test_5_14	test_ov15	susp_hf_u5	susp_hf_5_14	susp_hf_ov15	susp_com_u5	susp_com_5_14	susp_com_ov15	susp_u5	susp_5_14	susp_ov15	conf_hf_u5	conf_hf_5_14	conf_hf_ov15	conf_com_u5	conf_com_5_14	conf_com_ov15	conf_u5	conf_5_14	conf_ov15	maltreat_hf_u5	maltreat_hf_5_14	maltreat_hf_ov15	maltreat_com_u5	maltreat_com_5_14	maltreat_com_ov15	maltreat_u5	maltreat_5_14	maltreat_ov15	maltreat_u24_hf	maltreat_ov24_hf	maltreat_u24_com	maltreat_ov24_com	maltreat_u24_total	maltreat_ov24_total	pres_com_u5	pres_com_5_14	pres_com_ov15	pres_u5	pres_5_14	pres_ov15	ym
138a1a87	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-01-01 03:00:00	Jan 2019	2019	1	January 2019	101	132	NaN	NaN	NaN	NaN	NaN	NaN	95	24	44	69	24	44	NaN	NaN	NaN	NaN	NaN	31	38	6	18	20	25	NaN	NaN	NaN	NaN	NaN	NaN	51	44	6	18	20	24	35	9	NaN	NaN	24	NaN	35	9	15	3	16	8	hf_uid_new::1012	Althel CHP	233	300	163	138	301	86	81	167	68	86	154	0	0	NaN	NaN	95	24	44	69	24	45	164	48	89	95	24	44	69	24	44	164	48	88	44	18	24	38	18	25	82	36	49	44	18	24	44	NaN	24	88	18	48	66	20	59	9	125	29	6	18	0	6	18	0	2019-01
ed6d6325	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-02-01 03:00:00	Feb 2019	2019	2	February 2019	86	51	NaN	NaN	NaN	NaN	NaN	NaN	82	16	35	62	16	35	NaN	NaN	NaN	NaN	NaN	20	42	7	9	7	28	NaN	NaN	NaN	NaN	NaN	NaN	25	57	7	9	7	28	65	NaN	NaN	NaN	6	NaN	40	17	9	NaN	22	6	hf_uid_new::1012	Althel CHP	137	246	133	113	246	94	79	173	71	94	165	0	0	NaN	NaN	82	16	35	62	16	35	144	32	70	82	16	35	62	16	35	144	32	70	57	9	28	42	9	28	99	18	56	57	9	28	65	NaN	6	122	9	34	71	23	71	NaN	142	23	23	9	0	23	9	0	2019-02
410e3ba3	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-03-01 03:00:00	Mar 2019	2019	3	March 2019	149	132	NaN	NaN	NaN	NaN	NaN	NaN	106	22	38	69	22	38	NaN	NaN	NaN	NaN	NaN	27	42	10	12	15	23	NaN	NaN	NaN	NaN	NaN	NaN	40	66	10	12	15	23	66	NaN	12	NaN	23	NaN	66	NaN	12	NaN	23	NaN	hf_uid_new::1012	Althel CHP	281	295	166	129	295	101	77	178	101	101	202	24	24	NaN	NaN	106	22	38	69	22	38	175	44	76	106	22	38	69	22	38	175	44	76	66	12	23	42	12	23	108	24	46	66	12	23	66	12	23	132	24	46	101	NaN	101	NaN	202	NaN	24	0	0	24	0	0	2019-03
0c0fef59	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-04-01 03:00:00	Apr 2019	2019	4	April 2019	99	80	NaN	NaN	NaN	NaN	NaN	NaN	74	15	32	49	15	32	NaN	NaN	NaN	NaN	NaN	13	37	7	8	16	16	NaN	NaN	NaN	NaN	NaN	NaN	16	58	3	12	16	16	58	NaN	8	NaN	16	NaN	58	NaN	12	NaN	16	NaN	hf_uid_new::1012	Althel CHP	179	217	121	97	218	86	61	147	82	86	168	21	21	NaN	NaN	74	15	32	50	15	32	124	30	64	74	15	32	49	15	32	123	30	64	58	12	16	37	8	16	95	20	32	58	12	16	58	8	16	116	20	32	86	NaN	82	NaN	168	NaN	21	0	0	21	0	0	2019-04
c3e6f1fa	SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	SOUTHERN ~ BO	SOUTHERN ~ BO ~ BO TOWN ~ AETHEL CHP	OPD	2019-05-01 03:00:00	May 2019	2019	5	May 2019	96	73	NaN	NaN	NaN	NaN	NaN	NaN	84	38	16	49	38	16	NaN	NaN	NaN	NaN	NaN	16	65	13	25	6	10	NaN	NaN	NaN	NaN	NaN	NaN	19	65	13	25	6	10	46	19	19	6	10	4	46	19	19	6	10	NaN	hf_uid_new::1012	Althel CHP	169	241	138	135	273	100	100	200	104	100	204	4	4	NaN	NaN	84	38	16	81	38	16	165	76	32	84	38	16	49	38	16	133	76	32	65	25	10	65	25	10	130	50	20	65	25	10	65	25	14	130	50	24	75	25	75	29	150	54	0	0	4	0	0	4	2019-05

To adapt the code:

Lines 6-12: Update the path to point at the processed routine surveillance file produced by the import workflow (see the DHIS2 data preprocessing page).
Line 19: Filter the data to the period of interest.
Line 20: Drop the .rename() line if the dataset already uses the target column name.

Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness

Before drilling into temporal, spatial, and facility-type patterns, it helps to separate three concepts that often get lumped together as “missing data”:

Reporting completeness: was a report submitted at all for a given facility-month?
Conditional missingness: given that a report was submitted, which indicators were left blank?
Non-applicable periods: months before a facility was operational or after it stopped reporting are not missing. They are simply not expected.

A vertical band of “missing” months in a heatmap usually means no report was submitted (the entire row is NA). A single missing cell within an otherwise complete report has a very different meaning and a different mechanism. Untangling these is the first step before any of the visualisations below can be interpreted correctly.

Filter to active facility-months first

Non-applicable periods (months before a facility was operational, after it stopped reporting, or during gaps in reporting) are defined on the Determining active and inactive status page. Run that workflow first; the analyses below assume df_routine has been filtered to active facility-months. Mixing in not-yet-active or already-inactive months will inflate the missing rates we see in every subsequent step.

Step 3.1: Conditional missingness within submitted reports

A facility-month is treated as a submitted report when at least one core indicator is non-missing. Among submitted reports, the percentage of each indicator left blank tells us whether the missingness is form-completion practice (specific indicators selectively skipped) or reporting failure (whole rows missing). Within an active facility-month, a missing indicator should reflect the former.

Show the code

# define the core indicators that signal a report was submitted
core_inds <- c("test", "conf", "maltreat")

# flag each facility-month as reporting or non-reporting
df_routine <- df_routine |>
  dplyr::mutate(
    report_submitted = dplyr::if_any(
      dplyr::all_of(core_inds),
      ~ !is.na(.x)
    )
  )

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing <- df_routine |>
  dplyr::filter(report_submitted) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(core_inds),
      ~ round(mean(is.na(.x)) * 100, 1)
    )
  ) |>
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = "indicator",
    values_to = "pct_missing_given_reported"
  )

conditional_missing |>
  dplyr::rename(
    Indicator = indicator,
    `% missing given reported` = pct_missing_given_reported
  ) |>
  knitr::kable(align = "lr", digits = 1)

Output

Indicator	% missing given reported
test	1.2
conf	1.2
maltreat	1.9

To adapt the code:

Line 2: Update core_inds to the indicators that signal a report was submitted in the dataset.

Show the code

import pandas as pd

# define the core indicators that signal a report was submitted
core_inds = ["test", "conf", "maltreat"]

# flag each facility-month as reporting or non-reporting
df_routine = df_routine.assign(
    report_submitted=lambda d: d[core_inds].notna().any(axis=1)
)

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing = (
    df_routine.loc[df_routine["report_submitted"]]
    [core_inds]
    .isna()
    .mean()
    .mul(100)
    .round(1)
    .rename_axis("indicator")
    .reset_index(name="pct_missing_given_reported")
)

conditional_missing

Output

indicator	pct_missing_given_reported
test	1.2
conf	1.2
maltreat	1.9

To adapt the code:

Line 4: Update core_inds to the indicators that signal a report was submitted in the dataset.

Step 3.2: Quantitative summary across variables

A tabular summary and bar chart give a per-variable view of overall missingness. The bar chart sorts indicators by missing rate, making it easy to spot indicators that warrant closer inspection.

Show the code

# variables to summarise
summary_vars <- c("test", "susp", "pres", "conf", "maltreat")

# compute % missing for the core aggregated indicators
miss_summary <- naniar::miss_var_summary(
  df_routine |>
    dplyr::select(dplyr::all_of(summary_vars))
) |>
  dplyr::mutate(pct_miss = as.numeric(pct_miss))

# bar chart of % missing per variable
miss_summary |>
  dplyr::mutate(
    variable = factor(variable, levels = rev(variable)),
    label_text = paste0(round(pct_miss, 1), "%"),
    inside = pct_miss > max(pct_miss) * 0.15
  ) |>
  ggplot2::ggplot(ggplot2::aes(x = pct_miss, y = variable)) +
  ggplot2::geom_col(fill = "#3B9AB2", width = 0.75) +
  ggplot2::geom_text(
    ggplot2::aes(
      label = label_text,
      hjust = ifelse(inside, 1.15, -0.15),
      colour = ifelse(inside, "white", "grey20")
    ),
    size = 4,
    fontface = "bold",
    family = "sans"
  ) +
  ggplot2::scale_colour_identity() +
  ggplot2::scale_x_continuous(
    expand = ggplot2::expansion(mult = c(0, 0.05)),
    labels = function(x) paste0(x, "%")
  ) +
  ggplot2::labs(
    title = "Proportion of missing data per indicator",
    subtitle = paste(
      "% missing across submitted reports for each",
      "core malaria indicator"
    ),
    x = NULL,
    y = NULL
  ) +
  ggplot2::theme_minimal(base_family = "sans") +
  ggplot2::theme(
    plot.title = ggplot2::element_text(
      size = 16,
      face = "bold",
      margin = ggplot2::margin(b = 4)
    ),
    plot.subtitle = ggplot2::element_text(
      size = 12,
      colour = "grey30",
      margin = ggplot2::margin(b = 14)
    ),
    axis.text.y = ggplot2::element_text(size = 11, colour = "black"),
    axis.text.x = ggplot2::element_text(size = 10, colour = "grey30"),
    panel.grid.major.y = ggplot2::element_blank(),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.x = ggplot2::element_line(
      colour = "grey90",
      size = 0.3
    ),
    axis.ticks = ggplot2::element_blank(),
    plot.margin = ggplot2::margin(15, 30, 10, 10)
  )

Output

To adapt the code:

Line 2: Update summary_vars to match the dataset’s indicators of interest.

Co-missingness and Little’s MCAR test as additional diagnostics

Two extensions to the diagnostics above are useful when the basic summary leaves the mechanism ambiguous:

Co-missingness: naniar::gg_miss_upset(df_routine |> dplyr::select(test, susp, pres, conf, maltreat), nsets = 5) shows which indicators are missing together. Requires the UpSetR package. A single large intersection points to report-level reporting failure; many small intersections point to form completion practice.
Little’s MCAR test: naniar::mcar_test(df_routine |> dplyr::select(test, conf, maltreat)) returns a p-value for the null hypothesis that data are MCAR. A small p-value supports MAR or MNAR. The test can be slow on large datasets.

Show the code

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

# tabular summary of % missing for the core aggregated indicators
summary_vars = ["test", "susp", "pres", "conf", "maltreat"]

miss_summary = (
    pd.DataFrame({
        "variable": summary_vars,
        "n_miss": [df_routine[v].isna().sum() for v in summary_vars],
        "pct_miss": [
            round(df_routine[v].isna().mean() * 100, 1)
            for v in summary_vars
        ],
    })
    .sort_values("pct_miss", ascending=False)
    .reset_index(drop=True)
)

# bar chart of % missing per variable (horizontal, sorted descending)
vmax = float(miss_summary["pct_miss"].max())
threshold = vmax * 0.15

fig, ax = plt.subplots(figsize=(9, 5.2))
bars = ax.barh(
    miss_summary["variable"],
    miss_summary["pct_miss"],
    color="#3B9AB2",
    height=0.75
)
ax.invert_yaxis()

# value labels: inside the bar (white) when there is room,
# outside (grey) for short bars
for bar, val in zip(bars, miss_summary["pct_miss"]):
    label = f"{round(val, 1)}%"
    if val > threshold:
        ax.text(
            val - vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="right",
            color="white",
            fontsize=11,
            fontweight="bold"
        )
    else:
        ax.text(
            val + vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="left",
            color="#333333",
            fontsize=11,
            fontweight="bold"
        )

# title (large, bold) + subtitle (smaller, grey)
fig.suptitle(
    "Proportion of missing data per indicator",
    fontsize=16,
    fontweight="bold",
    x=0.05,
    ha="left",
    y=0.98
)
ax.set_title(
    "% missing across submitted reports for each core malaria indicator",
    fontsize=12,
    color="#555555",
    loc="left",
    pad=8
)

ax.set_xlabel("")
ax.set_ylabel("")
ax.xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f"{int(x)}%")
)
ax.tick_params(left=False, labelsize=11, colors="#333333")
ax.tick_params(axis="x", colors="#555555", labelsize=10)

# clean look: no border, only faint x gridlines
for spine in ax.spines.values():
    spine.set_visible(False)
ax.grid(axis="x", color="#E5E5E5", linewidth=0.6)
ax.grid(axis="y", visible=False)
ax.set_axisbelow(True)
ax.set_xlim(0, max(vmax * 1.05, 10))

plt.tight_layout(rect=[0, 0, 1, 0.94])

Output

(0.0, 25.305000000000003)

To adapt the code:

Line 6: Update the indicator list in summary_vars to match the dataset.

Consult with SNT team

Indicators that show very high or 100% missingness are often a signal that the value lives elsewhere, not that it is genuinely absent. SNT data management is iterative: an indicator that appears empty at the facility level may sit in a different mother dataset (an aggregated reporting form, a separate platform module, a parallel reporting system, or a renamed field). Before treating these indicators as analytical gaps, consult the SNT team to confirm whether the indicator is collected elsewhere or whether the dataset under analysis is the right starting point. What looks like a data problem is often a sourcing problem the team has already resolved.

The conditional missingness table and the overall summary together tell us whether the gaps we see downstream are predominantly reporting failures (whole rows missing, all indicators co-missing) or form completion practice (specific indicators missing despite a report being filed). The two require different responses on the Imputation Methods page.

Step 4: Check Missingness over Time

In this step, we check the missingness rate of our variables of interest across time. This analysis helps us understand temporal patterns in data availability and identify periods or variables with systematic missing data issues. The heatmap visualization provides an overview of data completeness across all malaria indicators for different age groups throughout the study period. The color intensity represents the percentage of missing values, allowing us to quickly identify problematic time periods or variables that may require special handling in subsequent analyses.

Initially, we want to check the following set of variables: malaria tests, suspected, presumed and confirmed cases and malaria treatments aggregated across all age groups, as well as the age-specific breakdowns.

Show the code

# get the variables of interest (aggregated and age-specific)
vars <- c(
  "test",        # aggregated malaria tests
  "test_u5",
  "test_5_14",
  "test_ov15",
  "susp",        # aggregated suspected cases
  "susp_u5",
  "susp_5_14",
  "susp_ov15",
  "pres",        # aggregated presumed cases
  "pres_u5",
  "pres_5_14",
  "pres_ov15",
  "conf",        # aggregated confirmed cases
  "conf_u5",
  "conf_5_14",
  "conf_ov15",
  "maltreat",    # aggregated treatments
  "maltreat_u5",
  "maltreat_5_14",
  "maltreat_ov15"
)

# calculate missing rates by date for each variable
missing_rate_date <- df_routine |>
  dplyr::group_by(date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(vars),
      ~ mean(is.na(.x)) * 100
    ),
    .groups = "drop"
  ) |>
  dplyr::arrange(date) |>
  tidyr::pivot_longer(
    cols = -ym,
    names_to = "variables",
    values_to = "missing_rate"
  ) |>
  # enforce the explicit `vars` order on the y-axis so R and Python match
  dplyr::mutate(variables = factor(variables, levels = vars))

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_date$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across variables
missing_plot_date <- ggplot2::ggplot(
  missing_rate_date,
  ggplot2::aes(
    y = variables,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "Variable",
    x = "",
    title = "The proportion of missing data for selected variables by year-month"
  )

# display the plot
missing_plot_date

Output

To adapt the code:

Lines 1-23: Update the vars vector to match the dataset’s variable names. For a different country or dataset, modify variable names to reflect the relevant indicators (e.g., for Ghana this might be cases_u5, tests_5_14, etc.).
Line 26: Replace df_routine with the dataset name if different.
Line 50: Set full_range <- TRUE to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range <- FALSE to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.

Show the code

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# variables of interest (aggregated and age-specific)
vars_of_interest = [
    "test",           # aggregated malaria tests
    "test_u5",
    "test_5_14",
    "test_ov15",
    "susp",           # aggregated suspected cases
    "susp_u5",
    "susp_5_14",
    "susp_ov15",
    "pres",           # aggregated presumed cases
    "pres_u5",
    "pres_5_14",
    "pres_ov15",
    "conf",           # aggregated confirmed cases
    "conf_u5",
    "conf_5_14",
    "conf_ov15",
    "maltreat",       # aggregated treatments
    "maltreat_u5",
    "maltreat_5_14",
    "maltreat_ov15",
]

# calculate missing rates by year-month for each variable
missing_rate_date = (
    df_routine.groupby("ym", as_index=False)[vars_of_interest]
    .apply(lambda g: g[vars_of_interest].isna().mean() * 100)
    .reset_index()
    .rename(columns={"level_0": "ym"})
)
# rebuild from groupby to preserve ym correctly
_grouped = df_routine.groupby("ym")
missing_rate_date = pd.DataFrame(
    {v: _grouped[v].apply(lambda s: s.isna().mean() * 100)
     for v in vars_of_interest}
).reset_index()

# pivot to long format
missing_rate_long = missing_rate_date.melt(
    id_vars="ym",
    value_vars=vars_of_interest,
    var_name="variables",
    value_name="missing_rate"
).sort_values("ym")

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_long["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap (mirrors wesanderson::wes_palette("Zissou1", 100))
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

# pivot back to matrix for pcolormesh
ym_order = sorted(missing_rate_long["ym"].unique())
var_order = vars_of_interest  # top-to-bottom order on y-axis

pivot = (
    missing_rate_long
    .pivot(index="variables", columns="ym", values="missing_rate")
    .reindex(index=var_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(var_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

# x-axis ticks and labels
ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

# y-axis ticks and labels
ax.set_yticks(np.arange(len(var_order)) + 0.5)
ax.set_yticklabels(var_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("Variable")
ax.set_title(
    "The proportion of missing data for selected variables by year-month",
    fontsize=12, pad=10
)

# horizontal colorbar at the bottom
cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

# remove grid lines
ax.grid(False)
plt.tight_layout()

Output

To adapt the code:

Lines 7-27: Update vars_of_interest to match the dataset’s variable names. For a different country or dataset, modify variable names to reflect the relevant indicators.
Line 31: Replace df_routine with the dataset name if different.
Line 53: Set full_range = True to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range = False to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.

Several indicators show consistent missingness across time. Between mid-2022 and early 2023, missingness sharply increased for several indicators, with test, maltreat, and conf reaching rates above 45%.

What to look for in heatmaps like this on any dataset:

Vertical bands (all indicators missing in the same months) indicate system-wide outages or platform migrations and usually point to MCAR at the time-period level.
Horizontal bands (one indicator consistently missing across all months) indicate indicator-specific reporting practice or form changes and usually point to MAR conditional on indicator.
An expanding right edge indicates delayed reporting in the most recent months; consider whether those months should be excluded.
Co-located bands across age-disaggregated rows (e.g., test_u5, test_5_14, test_ov15 all missing together) usually mean the aggregate was reported but the disaggregation was skipped.

Patterns that look like MAR conditional on time are addressable through time-stratified imputation on the Imputation Methods page.

Step 5: Check Missingness over Time and Administrative Level

In this step, we focus on a single indicator, malaria testing for under 5’s at the health facility level (test_hf_u5), to examine how missingness varies over time across districts (adm2 level). This allows us to identify subnational areas with consistently poor reporting and spot time periods where data availability dropped sharply.

Although the data is available at the health facility level, the most granular reporting unit, it is still important to assess missingness at higher administrative levels. This broader view becomes especially useful when facility-level data are sparse. In such cases, we may use fallback analysis approaches that draw on patterns from neighbouring facilities within the same district.

Show the code

# calculate missing rates by year-month and adm2 for each variable
missing_rate_adm2 <- df_routine |>
  dplyr::group_by(ym, adm2) |>
  dplyr::summarise(
    missing_rate = mean(is.na(test_hf_u5)) * 100,
    .groups = "drop"
  )

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_adm2$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across location
missing_plot_adm2 <- ggplot2::ggplot(
  missing_rate_adm2,
  ggplot2::aes(
    y = adm2,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "District",
    x = "",
    title = "The proportion of missing data for test_hf_u5 by year-month and adm2"
  )

# display the plot
missing_plot_adm2

Output

To adapt the code:

Line 2: Replace df_routine with the dataset name if different.
Line 5: Update the variable name (test_hf_u5) to match the dataset’s variable name of interest.
Line 15: Set full_range <- TRUE to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range <- FALSE to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.

Show the code

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# calculate missing rates by year-month and adm2 for test_hf_u5
missing_rate_adm2 = (
    df_routine.groupby(["ym", "adm2"], as_index=False)
    .agg(missing_rate=("test_hf_u5", lambda s: s.isna().mean() * 100))
)

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_adm2["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

ym_order = sorted(missing_rate_adm2["ym"].unique())
adm2_order = sorted(missing_rate_adm2["adm2"].unique())

pivot = (
    missing_rate_adm2
    .pivot(index="adm2", columns="ym", values="missing_rate")
    .reindex(index=adm2_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(adm2_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

ax.set_yticks(np.arange(len(adm2_order)) + 0.5)
ax.set_yticklabels(adm2_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("District")
ax.set_title(
    "The proportion of missing data for test_hf_u5 by year-month and adm2",
    fontsize=12, pad=10
)

cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

ax.grid(False)
plt.tight_layout()

Output

To adapt the code:

Line 8: Replace df_routine with the dataset name if different.
Line 9: Update the variable name (test_hf_u5) to match the dataset’s variable name of interest.
Line 13: Set full_range = True to span the color scale across 0-100% for consistent comparison across datasets or time periods. Set full_range = False to limit the color scale to the actual data range for better visual contrast within the observed missingness patterns.

Missing data for test_hf_u5 varies by district and over time. WESTERN URBAN shows the highest missingness rate (about 67% on average), with KOINADUGU, WESTERN RURAL, and KONO also above 25%. Many districts also show increased missingness around mid to late 2022, suggesting a broader reporting issue. In contrast, districts like KARENE, KAMBIA, and MOYAMBA have lower and more stable missing rates. These patterns confirm that test_hf_u5 varies by both time and location.

This suggests that missingness in test_hf_u5 is likely Missing At Random (MAR), as it depends on observable factors such as district and reporting period. In such cases, stratified analysis approaches are appropriate. The district (adm2) level can also serve as a fallback unit for analysis, particularly when individual facility data are sparse, by drawing insights from patterns in neighbouring facilities within the same district.

Consult with SNT team

District-level patterns of missingness often reflect operational decisions the SNT team already knows about: which districts switched DHIS2 instances or migrated platforms, when health information system reforms rolled out, where parallel reporting systems run alongside DHIS2, and which districts ran campaigns that disrupted routine reporting. Before classifying a district’s pattern as MAR or MNAR, consult the SNT team to rule out a known structural cause. The team may also point to an alternative source that is more complete for the affected period.

What to look for when running this on a different country:

Persistent horizontal stripes in specific districts often reflect urban or peri-urban hospitals reporting less consistently than rural health posts. Check the facility-type composition of those districts.
Vertical bands shared across districts indicate a national-level reporting disruption such as a DHIS2 migration, a campaign month, or a holiday season.
A dark lower-left region means missingness is concentrated early in the time series; the indicator may have been added later, in which case pre-activation months should be masked as not-applicable per the active status page.
A smooth gradient with no strong pattern may indicate MCAR; confirm with naniar::mcar_test() on that indicator.

When MAR patterns like those above are present, plan to use a stratified or hierarchical imputation approach on the Imputation Methods page.

Step 6: Check Missingness by Health Facility Type and Age Groups

Beyond temporal and geographic patterns, missingness may also vary systematically by additional factors, such as health facility type and age groups. Many routine surveillance systems collect data that is already disaggregated by these factors (e.g., test_hf_u5 representing malaria tests conducted at health facilities for children under 5).

The first step in determining these patterns is to disaggregate such pre-aggregated data into a long format, creating separate columns for indicators, facility types, and age groups. This long format allows us to examine how missingness varies across these important stratification variables and identify systematic reporting gaps that may be related to facility capacity, service delivery models, or demographic-specific data collection challenges.

Show the code

# define core indicators to retain during disaggregation
core_indicators <- c(
  "adm0",
  "adm1",
  "adm2",
  "adm3",
  "hf",
  "hf_uid",
  "year",
  "month",
  "ym",
  "date"
)

# select core indicators and disaggregated variables
df_routine_disagg <- df_routine |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    dplyr::matches("_(com|hf)_(u5|5_14|ov15)$")
  )

# transform to long format with separate facility type and age group columns
df_routine_disagg_long <- df_routine_disagg |>
  # drop maladm_hf_u5: lacks community-level and other age-group
  # counterparts, so it cannot be pivoted into the long format below
  dplyr::select(-maladm_hf_u5) |>
  tidyr::pivot_longer(
    cols = dplyr::matches("_(com|hf)_(u5|5_14|ov15)$"),
    names_to = c("indicator", "facility_type", "age_group"),
    names_pattern = "(.+)_(com|hf)_(u5|5_14|ov15)$",
    values_to = "value"
  ) |>
  # relabel factors
  dplyr::mutate(
    age_group = factor(
      age_group,
      levels = c("u5", "5_14", "ov15"),
      labels = c("Under 5", "5 to 15", "Over 15")
    ),
    facility_type = factor(
      facility_type,
      levels = c("hf", "com"),
      labels = c("Health Facility", "Community")
    )
  )

# preview the disaggregated data structure
df_routine_disagg_long |>
  utils::head() |>
  knitr::kable()

Output

adm0	adm1	adm2	adm3	hf	hf_uid	year	month	ym	date	indicator	facility_type	age_group	value
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	1	2019-01	2019-01-01	test	Health Facility	Under 5	95
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	1	2019-01	2019-01-01	test	Health Facility	5 to 15	24
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	1	2019-01	2019-01-01	test	Health Facility	Over 15	44
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	1	2019-01	2019-01-01	test	Community	Under 5	69
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	1	2019-01	2019-01-01	test	Community	5 to 15	24
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	1	2019-01	2019-01-01	test	Community	Over 15	45

To adapt the code:

Lines 2-13: Update the core_indicators vector to match the dataset’s administrative and temporal variable names.
Line 16: Replace df_routine with the dataset name if different.
Lines 19, 28: Modify the regex pattern "_(com|hf)_(u5|5_14|ov15)$" to match the facility type and age group naming conventions (e.g., if the data uses different facility codes or age categories).

Show the code

import re

import pandas as pd

# define core columns to retain during disaggregation
core_indicators = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid",
    "year", "month", "ym", "date"
]

# select core indicators and disaggregated columns
disagg_pattern = re.compile(r".+_(com|hf)_(u5|5_14|ov15)$")
disagg_cols = [c for c in df_routine.columns if disagg_pattern.match(c)]
keep_core = [c for c in core_indicators if c in df_routine.columns]

df_routine_disagg = df_routine[keep_core + disagg_cols].copy()

# drop maladm_hf_u5: lacks community-level and other age-group counterparts
if "maladm_hf_u5" in df_routine_disagg.columns:
    df_routine_disagg = df_routine_disagg.drop(columns=["maladm_hf_u5"])

# re-identify disaggregated columns after the drop
disagg_cols_final = [
    c for c in df_routine_disagg.columns
    if c not in keep_core
]

# transform to long format with separate facility_type and age_group columns
def _parse_col(col):
    """Extract (indicator, facility_type, age_group) from column name."""
    m = re.match(r"^(.+)_(com|hf)_(u5|5_14|ov15)$", col)
    if m:
        return m.group(1), m.group(2), m.group(3)
    return None, None, None

long_rows = []
for col in disagg_cols_final:
    ind, ftype, age = _parse_col(col)
    if ind is None:
        continue
    tmp = df_routine_disagg[keep_core + [col]].copy()
    tmp = tmp.rename(columns={col: "value"})
    tmp["indicator"] = ind
    tmp["facility_type"] = ftype
    tmp["age_group"] = age
    long_rows.append(tmp)

df_routine_disagg_long = pd.concat(long_rows, ignore_index=True)

# relabel factors to match R output
age_order = ["u5", "5_14", "ov15"]
age_labels = ["Under 5", "5 to 15", "Over 15"]
ftype_order = ["hf", "com"]
ftype_labels = ["Health Facility", "Community"]

df_routine_disagg_long["age_group"] = pd.Categorical(
    df_routine_disagg_long["age_group"].map(
        dict(zip(age_order, age_labels))
    ),
    categories=age_labels,
    ordered=True
)
df_routine_disagg_long["facility_type"] = pd.Categorical(
    df_routine_disagg_long["facility_type"].map(
        dict(zip(ftype_order, ftype_labels))
    ),
    categories=ftype_labels,
    ordered=True
)

# preview the disaggregated data structure as an HTML table
df_routine_disagg_long.head().style.hide(axis="index")

Output

adm0	adm1	adm2	adm3	hf	hf_uid	year	month	ym	date	value	indicator	facility_type	age_group
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	1	2019-01	2019-01-01 03:00:00	95	test	Health Facility	Under 5
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	2	2019-02	2019-02-01 03:00:00	82	test	Health Facility	Under 5
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	3	2019-03	2019-03-01 03:00:00	106	test	Health Facility	Under 5
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	4	2019-04	2019-04-01 03:00:00	74	test	Health Facility	Under 5
SIERRA LEONE	SOUTHERN	BO	BO TOWN	AETHEL CHP	2a696984	2019	5	2019-05	2019-05-01 03:00:00	84	test	Health Facility	Under 5

To adapt the code:

Lines 6-13: Update core_indicators to match the dataset’s administrative and temporal variable names.
Line 15: Replace df_routine with the dataset name if different.
Lines 13, 32: Modify the regex pattern r".+_(com|hf)_(u5|5_14|ov15)$" to match the facility type and age group naming conventions in the dataset.

Now that we have disaggregated the data, we can examine how missingness varies across all indicators by facility type and age group. This check helps identify systematic patterns in data availability that may be related to service delivery models, facility capacity, or demographic-specific reporting challenges.

Show the code

# prepare data for stacked bar chart with both age group and facility type
missing_present_data_facility <- df_routine_disagg_long |>
  dplyr::group_by(indicator, age_group, facility_type) |>
  dplyr::summarise(
    total_records = dplyr::n(),
    missing_count = sum(is.na(value)),
    present_count = sum(!is.na(value)),
    missing_pct = round(missing_count / total_records * 100, 2),
    present_pct = round(present_count / total_records * 100, 2),
    .groups = "drop"
  ) |>
  tidyr::pivot_longer(
    cols = c(missing_pct, present_pct),
    names_to = "status",
    values_to = "percentage",
    names_pattern = "(.+)_pct"
  ) |>
  dplyr::mutate(
    status = factor(
      status,
      levels = c("missing", "present"),
      labels = c("Missing", "Present")
    )
  )

# create stacked bar plot faceted by age group and facility type
age_facility_miss_plot <- missing_present_data_facility |>
  ggplot2::ggplot(ggplot2::aes(
    x = percentage,
    y = factor(age_group, levels = rev(levels(factor(age_group)))),
    fill = status
  )) +
  ggplot2::geom_col(alpha = 0.8) +
  ggplot2::facet_grid(indicator ~ facility_type) +
  ggplot2::scale_fill_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Present", "Missing")
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    x = "\nPercentage (%)",
    y = "Age Group (years)\n",
    fill = "Data Status"
  ) +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10)
  )

# display the plot
age_facility_miss_plot

Output

To adapt the code:

Line 2: Replace df_routine_disagg_long with the disaggregated dataset name if different.
Lines 18-22: Update the status factor levels and labels to use different display names for missing/present data categories.
Lines 48-52: Modify the plot title and axis labels to match the analysis focus.

Show the code

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# prepare data for stacked bar chart with age group and facility type
missing_present_data_facility = (
    df_routine_disagg_long
    .groupby(["indicator", "age_group", "facility_type"], observed=True,
             as_index=False)
    .agg(
        total_records=("value", "count"),
        missing_count=("value", lambda s: s.isna().sum()),
        present_count=("value", lambda s: s.notna().sum()),
    )
    .assign(
        missing_pct=lambda d: (d["missing_count"] / d["total_records"] * 100)
                              .round(2),
        present_pct=lambda d: (d["present_count"] / d["total_records"] * 100)
                              .round(2),
    )
)

# pivot to long format for status column
mp_long = missing_present_data_facility.melt(
    id_vars=["indicator", "age_group", "facility_type"],
    value_vars=["missing_pct", "present_pct"],
    var_name="status",
    value_name="percentage"
).assign(
    status=lambda d: pd.Categorical(
        d["status"].map({"missing_pct": "Missing", "present_pct": "Present"}),
        categories=["Missing", "Present"],
        ordered=True
    )
)

# viridis D palette: begin=0.2, end=0.5 for two categories
import matplotlib.cm as cm
viridis = cm.get_cmap("viridis")
color_missing = viridis(0.2)
color_present = viridis(0.5)
status_colors = {"Missing": color_missing, "Present": color_present}

indicators = list(mp_long["indicator"].unique())
facility_types = list(mp_long["facility_type"].cat.categories)
age_labels_ordered = list(mp_long["age_group"].cat.categories)
age_rev = list(reversed(age_labels_ordered))

n_ind = len(indicators)
n_ftype = len(facility_types)

fig, axes = plt.subplots(
    n_ind, n_ftype,
    figsize=(12, 8),
    sharex=True,
    sharey="row"
)

for i, ind in enumerate(indicators):
    for j, ftype in enumerate(facility_types):
        ax = axes[i][j] if n_ind > 1 else axes[j]
        subset = mp_long.loc[
            (mp_long["indicator"] == ind) &
            (mp_long["facility_type"] == ftype)
        ]
        for status in ["Present", "Missing"]:
            s_sub = subset.loc[subset["status"] == status]
            s_sub = s_sub.set_index("age_group").reindex(age_rev)
            ax.barh(
                s_sub.index,
                s_sub["percentage"],
                color=status_colors[status],
                alpha=0.8,
                label=status
            )
        ax.set_xlim(0, 100)
        ax.set_xlabel("")
        ax.set_ylabel("")
        # panel border
        for spine in ax.spines.values():
            spine.set_linewidth(0.7)
        # strip labels
        if i == 0:
            ax.set_title(str(ftype), fontsize=9, fontweight="bold")
        if j == n_ftype - 1:
            ax.yaxis.set_label_position("right")
            ax.set_ylabel(str(ind), rotation=270, labelpad=12,
                          fontsize=9, fontweight="bold")
        # no grid
        ax.grid(False)

# shared axis labels
fig.supxlabel("Percentage (%)", y=0.02, fontsize=10)
fig.supylabel("Age Group (years)", x=0.02, fontsize=10)
fig.suptitle(
    "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    fontsize=12, y=1.01
)

# shared legend
handles = [
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Present"], alpha=0.8),
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Missing"], alpha=0.8),
]
fig.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    bbox_to_anchor=(0.0, -0.04),
    ncol=2,
    frameon=False,
    fontsize=9
)
plt.tight_layout()

Output

To adapt the code:

Line 8: Replace df_routine_disagg_long with the disaggregated dataset name if different.
Lines 30-34: Update the status mapping if different display names for missing/present data categories are needed.
Lines 94-99: Modify the plot title and axis labels to match the analysis focus.

The above visualization shows substantial differences in missingness patterns between facility types, with community-level facilities showing considerably higher missing rates, around 70% on average. This pattern is consistent across indicators and age groups. Notably, within community facilities, under-five data is more frequently reported than data for older age groups, a pattern observed across key indicators. These trends point to systematic differences in data collection and reporting capacity between facility types.

What to look for when running this on a different dataset:

A large gap between facility types (community vs facility) reflects differences in data collection capacity, supervision frequency, or which indicators are mandated at each level. Stratify imputation by facility type.
A larger gap in older age groups often reflects that the programme focuses on under-fives; older-age data may be a structural non-applicable rather than missing.
An indicator-specific gap (one row of facets noticeably worse than the others) often means the form does not include that indicator at that level. Verify with the country team before imputing.

Mark indicators that are structurally not collected at a given facility type or age band as not-applicable rather than imputing them. Imputation strategies for genuine facility-type MAR patterns are covered on the Imputation Methods page.

Step 7: Visualizing Structural Zeros and Logical Relationships

Before classifying missing values, we visualize the relationships between related indicators. This plot highlights likely structural zeros (e.g., no testing with missing confirmation), possible structural zeros (e.g., confirmed cases equal to zero with missing test values), and logical inconsistencies (e.g., confirmed cases recorded but testing data missing). These patterns reflect the clinical care pathway and help identify where values can be logically inferred, structurally imputed as zero, or require statistical methods.

Specifically:

test = 0, conf missing: Likely legitimate zero (no testing = confirmation not possible).
conf = 0, test missing: Possibly legitimate zero (no confirmed cases suggests no positives, but testing may have occurred).
conf > 0, test missing: Logical inconsistency (confirmed cases imply testing occurred, value should be imputed).
test > 0, conf missing: Possibly legitimate zero (testing occurred but no positive cases; confirmation may be legitimately zero or missing).

Note: Below, we use the naniar package to explore missingness. Unlike geom_point(), which silently drops missing values, geom_miss_point() shows where one or both variables are missing by shifting them to plot margins. This makes it easier to detect structural zeros, reporting gaps, and logical inconsistencies.

Show the code

# transform data to wide format to examine relationships between indicators
df_wide <- df_routine_disagg_long |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    facility_type,
    age_group,
    indicator,
    value
  ) |>
  tidyr::pivot_wider(
    names_from = indicator,
    values_from = value
  )

# calculate comprehensive missing counts for subtitle
n_test_missing_conf_zero <- df_wide |>
  dplyr::filter(conf == 0, is.na(test)) |>
  nrow()

n_conf_missing_test_zero <- df_wide |>
  dplyr::filter(test == 0, is.na(conf)) |>
  nrow()

n_test_na_conf_positive <- df_wide |>
  dplyr::filter(is.na(test), conf > 0) |>
  nrow()

n_conf_na_test_positive <- df_wide |>
  dplyr::filter(is.na(conf), test > 0) |>
  nrow()

# create comprehensive subtitle with all missing count scenarios
subtitle_text <- paste0(
  "N = ",
  scales::comma(n_test_missing_conf_zero),
  " where conf = 0 but test is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_conf_missing_test_zero),
  " where test = 0 but conf is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_test_na_conf_positive),
  " where test is missing but conf > 0 (logical inconsistency); \n",
  "N = ",
  scales::comma(n_conf_na_test_positive),
  " where conf is missing but test > 0 (possible legitimate zero)"
)

# create scatter plot showing relationship between test and confirmed counts
structural_zeros_plot <- ggplot2::ggplot(
  data = df_wide,
  mapping = ggplot2::aes(x = test, y = conf)
) +
  naniar::geom_miss_point() +
  ggplot2::geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::geom_hline(
    yintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_y_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_color_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Not Missing", "Missing"),
    labels = c("Present", "Missing")
  ) +
  ggplot2::guides(
    color = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      override.aes = list(
        size = 5
      ),
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Structural Zeros: Testing vs Confirmation",
    subtitle = subtitle_text,
    x = "\nTests Conducted",
    y = "Cases Confirmed\n",
    color = "Data Status"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    plot.subtitle = ggplot2::element_text(size = 9, color = "grey30")
  )


# display the plot
structural_zeros_plot

Output

To adapt the code:

Lines 17, 21, 25, 29: Update the variable names (test, conf) in the filter conditions to match the dataset’s variable names of interest.
Lines 3-9: Update the core_indicators and column selection to match the dataset structure.
Lines 10-13: Modify the pivot_wider() transformation to include the specific indicators to examine for structural relationships.
Line 51: Change the x and y aesthetic mappings to examine different variable pairs (e.g., x = maltreat, y = conf to examine treatment vs confirmation relationships).

Note: Below, we replicate the naniar::geom_miss_point() approach in Python. Missing values on one axis are shifted to the plot margin (10% below the minimum observed value) so the structural-zero patterns remain visible.

Show the code

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd

# transform to wide format to examine relationships between indicators
keep_core = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid", "year", "month", "ym", "date"
]
keep_cols = [c for c in keep_core if c in df_routine_disagg_long.columns]

df_wide = (
    df_routine_disagg_long[keep_cols + ["facility_type", "age_group",
                                         "indicator", "value"]]
    .pivot_table(
        index=keep_cols + ["facility_type", "age_group"],
        columns="indicator",
        values="value",
        aggfunc="first",
        observed=True
    )
    .reset_index()
)
df_wide.columns.name = None

# calculate missing counts for subtitle
n_test_missing_conf_zero = df_wide.loc[
    (df_wide["conf"] == 0) & df_wide["test"].isna()
].shape[0]

n_conf_missing_test_zero = df_wide.loc[
    (df_wide["test"] == 0) & df_wide["conf"].isna()
].shape[0]

n_test_na_conf_positive = df_wide.loc[
    df_wide["test"].isna() & (df_wide["conf"] > 0)
].shape[0]

n_conf_na_test_positive = df_wide.loc[
    df_wide["conf"].isna() & (df_wide["test"] > 0)
].shape[0]

subtitle_text = (
    f"N = {n_test_missing_conf_zero:,} where conf = 0 but test is missing "
    f"(legitimate zero); \n"
    f"N = {n_conf_missing_test_zero:,} where test = 0 but conf is missing "
    f"(legitimate zero); \n"
    f"N = {n_test_na_conf_positive:,} where test is missing but conf > 0 "
    f"(logical inconsistency); \n"
    f"N = {n_conf_na_test_positive:,} where conf is missing but test > 0 "
    f"(possible legitimate zero)"
)

# create demo data matching the R render chunk:
# rows 400-600 where conf is NA get test = 0 (legitimate zero pattern)
# rows 490-900 where test is NA get conf = 0 (legitimate zero pattern)
df_wide_demo = df_wide.copy().reset_index(drop=True)
mask1 = (df_wide_demo.index.isin(range(400, 601))) & df_wide_demo["conf"].isna()
df_wide_demo.loc[mask1, "test"] = 0.0
mask2 = (df_wide_demo.index.isin(range(490, 901))) & df_wide_demo["test"].isna()
df_wide_demo.loc[mask2, "conf"] = 0.0

# compute margin offsets for miss_point behavior
test_obs = df_wide_demo["test"].dropna()
conf_obs = df_wide_demo["conf"].dropna()
test_range = test_obs.max() - test_obs.min() if len(test_obs) > 1 else 1.0
conf_range = conf_obs.max() - conf_obs.min() if len(conf_obs) > 1 else 1.0
test_margin = test_obs.min() - 0.1 * test_range
conf_margin = conf_obs.min() - 0.1 * conf_range

viridis = cm.get_cmap("viridis")
color_present = viridis(0.5)   # "Not Missing"
color_missing = viridis(0.2)   # "Missing"

fig, ax = plt.subplots(figsize=(10, 8))

# --- present-present points ---
mask_pp = df_wide_demo["test"].notna() & df_wide_demo["conf"].notna()
ax.scatter(
    df_wide_demo.loc[mask_pp, "test"],
    df_wide_demo.loc[mask_pp, "conf"],
    color=color_present, alpha=0.4, s=10, label="Present", rasterized=True
)

# --- test missing, conf present: shift test to left margin ---
mask_tm = df_wide_demo["test"].isna() & df_wide_demo["conf"].notna()
ax.scatter(
    [test_margin] * mask_tm.sum(),
    df_wide_demo.loc[mask_tm, "conf"],
    color=color_missing, alpha=0.4, s=10, label="Missing", rasterized=True
)

# --- conf missing, test present: shift conf to bottom margin ---
mask_cm = df_wide_demo["conf"].isna() & df_wide_demo["test"].notna()
ax.scatter(
    df_wide_demo.loc[mask_cm, "test"],
    [conf_margin] * mask_cm.sum(),
    color=color_missing, alpha=0.4, s=10, rasterized=True
)

# reference lines at zero
ax.axvline(0, linestyle="dashed", color="grey", linewidth=0.2)
ax.axhline(0, linestyle="dashed", color="grey", linewidth=0.2)

# comma-formatted tick labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"{x:,.0f}"
))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda y, _: f"{y:,.0f}"
))

ax.set_xlabel("\nTests Conducted")
ax.set_ylabel("Cases Confirmed\n")
ax.set_title("Structural Zeros: Testing vs Confirmation", fontsize=12)
ax.text(
    0.0, -0.12, subtitle_text,
    transform=ax.transAxes, fontsize=9, color="grey",
    verticalalignment="top"
)

handles = [
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_present, markersize=8),
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_missing, markersize=8),
]
ax.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    frameon=True,
    fontsize=9
)

for spine in ax.spines.values():
    spine.set_linewidth(0.7)
plt.tight_layout()

Output

To adapt the code:

Lines 30, 34, 38, 42: Update the variable names (test, conf) in the filter conditions to match the dataset’s variable names of interest.
Lines 9-12: Update keep_core and column selection to match the dataset structure.
Lines 13-16: Modify the pivot_table() transformation to include the specific indicators to examine for structural relationships.
Lines 66, 67: Change the scatter x and y column references to examine different variable pairs (e.g., maltreat and conf).

The plot shows key patterns in missingness between malaria testing (test) and confirmation (conf) counts. The vertical dashed line marks zero tests; the horizontal dashed line marks zero confirmations. This visualization shows there were 1,769 instances where conf is missing but test > 0. These are likely structural zeros, where testing was done but no confirmed cases were recorded. In such cases, missing confirmation counts can reasonably be set to zero.

What to look for in this plot on any dataset:

A cluster of “Missing” markers along the y-axis margin with test > 0: tests were recorded but confirmations were left blank. Likely a structural zero; set conf to 0.
A cluster of “Missing” markers along the x-axis margin with conf > 0: confirmations were recorded but tests were left blank. Logical inconsistency; flag for review, or impute test from conf and a historical positivity rate.
“Missing” markers at the origin: both indicators are missing, which points to reporting failure rather than a structural zero.
Asymmetry along one axis: indicator-specific data entry practice. Check whether the form distinguishes 0 from blank.

Summary

This section introduced approaches for visualising and diagnosing missing data in routine surveillance systems. By examining patterns across time, geography, facility type, and indicator relationships, we showed how missingness can be systematically assessed rather than treated as random gaps. The section demonstrated how simple plots and logical checks help distinguish between true missing values and structural reporting issues, providing a clearer understanding of the data’s reliability. These methods establish a foundation for classifying the nature of missingness and for selecting appropriate strategies for addressing them. Structural and statistical imputation techniques are implemented in detail on the Imputation Methods page.

Full Code

Find the full code script for missing data detection methods below.

Show full code

################################################################################
################# ~ Missing data detection methods full code ~ #################
################################################################################

### Step 1: Install and Load Libraries -----------------------------------------

# install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load required packages using pacman
pacman::p_load(
  dplyr,        # data manipulation
  ggplot2,      # plotting
  ggtext,       # markdown text in ggplot2
  here,         # file path management
  lubridate,    # date handling
  naniar,       # missing data visualization and analysis
  tidyr,        # data reshaping
  cli,          # clean logging and CLI-style messages
  scales,       # number formatting
  UpSetR,       # upset plots for co-missingness
  wesanderson   # color palettes
)

### Step 2: Load Data ----------------------------------------------------------

# read the processed surveillance data produced by the import workflow
df_routine <- readRDS(
  here::here(
    "01_data",
    "1.2_epidemiology",
    "1.2a_routine_surveillance",
    "processed",
    "clean_malaria_routine_data_final.rds"
  )
) |>
  # drop the import-source label carried over from import.qmd
  dplyr::select(-dplyr::any_of("sheet_admin")) |>
  # filter to the past five years
  dplyr::filter(year >= 2019) |>
  dplyr::rename(maladm_hf_u5 = maladm_u5) |>
  dplyr::mutate(
    # ym: human-readable year-month label used as a discrete x-axis
    ym = format(date, "%Y-%m"),
    # date stays as Date (already Date in the .rds produced by import.qmd)
    date = as.Date(date)
  )

### Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness --------

#### Step 3.1: Conditional missingness within submitted reports ----------------

# define the core indicators that signal a report was submitted
core_inds <- c("test", "conf", "maltreat")

# flag each facility-month as reporting or non-reporting
df_routine <- df_routine |>
  dplyr::mutate(
    report_submitted = dplyr::if_any(
      dplyr::all_of(core_inds),
      ~ !is.na(.x)
    )
  )

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing <- df_routine |>
  dplyr::filter(report_submitted) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(core_inds),
      ~ round(mean(is.na(.x)) * 100, 1)
    )
  ) |>
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = "indicator",
    values_to = "pct_missing_given_reported"
  )

conditional_missing |>
  dplyr::rename(
    Indicator = indicator,
    `% missing given reported` = pct_missing_given_reported
  ) |>
  knitr::kable(align = "lr", digits = 1)

#### Step 3.2: Quantitative summary across variables ---------------------------

# variables to summarise
summary_vars <- c("test", "susp", "pres", "conf", "maltreat")

# compute % missing for the core aggregated indicators
miss_summary <- naniar::miss_var_summary(
  df_routine |>
    dplyr::select(dplyr::all_of(summary_vars))
) |>
  dplyr::mutate(pct_miss = as.numeric(pct_miss))

# bar chart of % missing per variable
miss_summary |>
  dplyr::mutate(
    variable = factor(variable, levels = rev(variable)),
    label_text = paste0(round(pct_miss, 1), "%"),
    inside = pct_miss > max(pct_miss) * 0.15
  ) |>
  ggplot2::ggplot(ggplot2::aes(x = pct_miss, y = variable)) +
  ggplot2::geom_col(fill = "#3B9AB2", width = 0.75) +
  ggplot2::geom_text(
    ggplot2::aes(
      label = label_text,
      hjust = ifelse(inside, 1.15, -0.15),
      colour = ifelse(inside, "white", "grey20")
    ),
    size = 4,
    fontface = "bold",
    family = "sans"
  ) +
  ggplot2::scale_colour_identity() +
  ggplot2::scale_x_continuous(
    expand = ggplot2::expansion(mult = c(0, 0.05)),
    labels = function(x) paste0(x, "%")
  ) +
  ggplot2::labs(
    title = "Proportion of missing data per indicator",
    subtitle = paste(
      "% missing across submitted reports for each",
      "core malaria indicator"
    ),
    x = NULL,
    y = NULL
  ) +
  ggplot2::theme_minimal(base_family = "sans") +
  ggplot2::theme(
    plot.title = ggplot2::element_text(
      size = 16,
      face = "bold",
      margin = ggplot2::margin(b = 4)
    ),
    plot.subtitle = ggplot2::element_text(
      size = 12,
      colour = "grey30",
      margin = ggplot2::margin(b = 14)
    ),
    axis.text.y = ggplot2::element_text(size = 11, colour = "black"),
    axis.text.x = ggplot2::element_text(size = 10, colour = "grey30"),
    panel.grid.major.y = ggplot2::element_blank(),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.x = ggplot2::element_line(
      colour = "grey90",
      size = 0.3
    ),
    axis.ticks = ggplot2::element_blank(),
    plot.margin = ggplot2::margin(15, 30, 10, 10)
  )

### Step 4: Check Missingness over Time ----------------------------------------

# get the variables of interest (aggregated and age-specific)
vars <- c(
  "test",        # aggregated malaria tests
  "test_u5",
  "test_5_14",
  "test_ov15",
  "susp",        # aggregated suspected cases
  "susp_u5",
  "susp_5_14",
  "susp_ov15",
  "pres",        # aggregated presumed cases
  "pres_u5",
  "pres_5_14",
  "pres_ov15",
  "conf",        # aggregated confirmed cases
  "conf_u5",
  "conf_5_14",
  "conf_ov15",
  "maltreat",    # aggregated treatments
  "maltreat_u5",
  "maltreat_5_14",
  "maltreat_ov15"
)

# calculate missing rates by date for each variable
missing_rate_date <- df_routine |>
  dplyr::group_by(date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(vars),
      ~ mean(is.na(.x)) * 100
    ),
    .groups = "drop"
  ) |>
  dplyr::arrange(date) |>
  tidyr::pivot_longer(
    cols = -ym,
    names_to = "variables",
    values_to = "missing_rate"
  ) |>
  # enforce the explicit `vars` order on the y-axis so R and Python match
  dplyr::mutate(variables = factor(variables, levels = vars))

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_date$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across variables
missing_plot_date <- ggplot2::ggplot(
  missing_rate_date,
  ggplot2::aes(
    y = variables,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "Variable",
    x = "",
    title = "The proportion of missing data for selected variables by year-month"
  )

# display the plot
missing_plot_date

### Step 5: Check Missingness over Time and Administrative Level ---------------

# calculate missing rates by year-month and adm2 for each variable
missing_rate_adm2 <- df_routine |>
  dplyr::group_by(ym, adm2) |>
  dplyr::summarise(
    missing_rate = mean(is.na(test_hf_u5)) * 100,
    .groups = "drop"
  )

# option to control the color scale range for the heatmap
# if TRUE: color scale goes from 0-100% (full range of possible missing rates,
# even if actual data doesnt cover this range)
# if FALSE: color scale is limited to actual range of missingness in the data
# (e.g., if missing rates range from 20-70%, color scale only
# covers that range)
full_range <- FALSE

# set fill scale limits based on full_range option
fill_limits <- if (full_range) {
  c(0, 100) # use full 0-100% range for color scale
} else {
  # use actual range of missing rates in the data for color scale
  fill_var_values <- missing_rate_adm2$missing_rate
  c(
    floor(min(fill_var_values, na.rm = TRUE)),
    ceiling(max(fill_var_values, na.rm = TRUE))
  )
}

# plot the missing range across location
missing_plot_adm2 <- ggplot2::ggplot(
  missing_rate_adm2,
  ggplot2::aes(
    y = adm2,
    x = ym,
    fill = missing_rate
  )
) +
  ggplot2::geom_tile(colour = "white", linewidth = .2) +
  ggplot2::scale_fill_gradientn(
    colours = wesanderson::wes_palette(
      "Zissou1",
      100,
      type = "continuous"
    ),
    limits = fill_limits
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_colorbar(
      title.position = "top",
      nrow = 1,
      label.position = "bottom",
      direction = "horizontal",
      barheight = ggplot2::unit(0.3, "cm"),
      barwidth = ggplot2::unit(4, "cm"),
      ticks = TRUE,
      draw.ulim = TRUE,
      draw.llim = TRUE
    )
  ) +

  ggplot2::scale_x_discrete(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    legend.title = ggplot2::element_text(
      size = 12,
      face = "bold",
      family = "sans"
    ),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal",
    legend.box.just = "center",
    legend.margin = ggplot2::margin(t = 0, unit = "cm"),
    legend.text = ggplot2::element_text(
      size = 8,
      family = "sans"
    ),
    axis.title.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 5, unit = "pt")
    ),
    axis.title.y = ggplot2::element_text(
      margin = ggplot2::margin(r = 10, unit = "pt")
    ),
    axis.text.x = ggplot2::element_text(
      angle = 75,
      hjust = 1,
      family = "sans"
    ),
    axis.text = ggplot2::element_text(family = "sans"),
    axis.title = ggplot2::element_text(family = "sans"),
    plot.title = ggtext::element_markdown(
      size = 12,
      family = "sans",
      margin = ggplot2::margin(b = 10)
    ),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "grey90")
  ) +
  ggplot2::labs(
    fill = "Missing rate (%)",
    y = "District",
    x = "",
    title = "The proportion of missing data for test_hf_u5 by year-month and adm2"
  )

# display the plot
missing_plot_adm2

### Step 6: Check Missingness by Health Facility Type and Age Groups -----------

# define core indicators to retain during disaggregation
core_indicators <- c(
  "adm0",
  "adm1",
  "adm2",
  "adm3",
  "hf",
  "hf_uid",
  "year",
  "month",
  "ym",
  "date"
)

# select core indicators and disaggregated variables
df_routine_disagg <- df_routine |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    dplyr::matches("_(com|hf)_(u5|5_14|ov15)$")
  )

# transform to long format with separate facility type and age group columns
df_routine_disagg_long <- df_routine_disagg |>
  # drop maladm_hf_u5: lacks community-level and other age-group
  # counterparts, so it cannot be pivoted into the long format below
  dplyr::select(-maladm_hf_u5) |>
  tidyr::pivot_longer(
    cols = dplyr::matches("_(com|hf)_(u5|5_14|ov15)$"),
    names_to = c("indicator", "facility_type", "age_group"),
    names_pattern = "(.+)_(com|hf)_(u5|5_14|ov15)$",
    values_to = "value"
  ) |>
  # relabel factors
  dplyr::mutate(
    age_group = factor(
      age_group,
      levels = c("u5", "5_14", "ov15"),
      labels = c("Under 5", "5 to 15", "Over 15")
    ),
    facility_type = factor(
      facility_type,
      levels = c("hf", "com"),
      labels = c("Health Facility", "Community")
    )
  )

# preview the disaggregated data structure
df_routine_disagg_long |>
  utils::head() |>
  knitr::kable()

# prepare data for stacked bar chart with both age group and facility type
missing_present_data_facility <- df_routine_disagg_long |>
  dplyr::group_by(indicator, age_group, facility_type) |>
  dplyr::summarise(
    total_records = dplyr::n(),
    missing_count = sum(is.na(value)),
    present_count = sum(!is.na(value)),
    missing_pct = round(missing_count / total_records * 100, 2),
    present_pct = round(present_count / total_records * 100, 2),
    .groups = "drop"
  ) |>
  tidyr::pivot_longer(
    cols = c(missing_pct, present_pct),
    names_to = "status",
    values_to = "percentage",
    names_pattern = "(.+)_pct"
  ) |>
  dplyr::mutate(
    status = factor(
      status,
      levels = c("missing", "present"),
      labels = c("Missing", "Present")
    )
  )

# create stacked bar plot faceted by age group and facility type
age_facility_miss_plot <- missing_present_data_facility |>
  ggplot2::ggplot(ggplot2::aes(
    x = percentage,
    y = factor(age_group, levels = rev(levels(factor(age_group)))),
    fill = status
  )) +
  ggplot2::geom_col(alpha = 0.8) +
  ggplot2::facet_grid(indicator ~ facility_type) +
  ggplot2::scale_fill_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Present", "Missing")
  ) +
  ggplot2::guides(
    fill = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    x = "\nPercentage (%)",
    y = "Age Group (years)\n",
    fill = "Data Status"
  ) +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::scale_y_discrete(expand = c(0, 0)) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major = ggplot2::element_blank(),
    panel.background = ggplot2::element_blank(),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10)
  )

# display the plot
age_facility_miss_plot

### Step 7: Visualizing Structural Zeros and Logical Relationships -------------

# transform data to wide format to examine relationships between indicators
df_wide <- df_routine_disagg_long |>
  dplyr::select(
    dplyr::any_of(core_indicators),
    facility_type,
    age_group,
    indicator,
    value
  ) |>
  tidyr::pivot_wider(
    names_from = indicator,
    values_from = value
  )

# calculate comprehensive missing counts for subtitle
n_test_missing_conf_zero <- df_wide |>
  dplyr::filter(conf == 0, is.na(test)) |>
  nrow()

n_conf_missing_test_zero <- df_wide |>
  dplyr::filter(test == 0, is.na(conf)) |>
  nrow()

n_test_na_conf_positive <- df_wide |>
  dplyr::filter(is.na(test), conf > 0) |>
  nrow()

n_conf_na_test_positive <- df_wide |>
  dplyr::filter(is.na(conf), test > 0) |>
  nrow()

# create comprehensive subtitle with all missing count scenarios
subtitle_text <- paste0(
  "N = ",
  scales::comma(n_test_missing_conf_zero),
  " where conf = 0 but test is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_conf_missing_test_zero),
  " where test = 0 but conf is missing (legitimate zero); \n",
  "N = ",
  scales::comma(n_test_na_conf_positive),
  " where test is missing but conf > 0 (logical inconsistency); \n",
  "N = ",
  scales::comma(n_conf_na_test_positive),
  " where conf is missing but test > 0 (possible legitimate zero)"
)

# create scatter plot showing relationship between test and confirmed counts
structural_zeros_plot <- ggplot2::ggplot(
  data = df_wide,
  mapping = ggplot2::aes(x = test, y = conf)
) +
  naniar::geom_miss_point() +
  ggplot2::geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::geom_hline(
    yintercept = 0,
    linetype = "dashed",
    color = "grey40",
    linewidth = 0.2
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_y_continuous(
    labels = scales::comma_format()
  ) +
  ggplot2::scale_color_viridis_d(
    option = "D",
    begin = 0.2,
    end = 0.5,
    breaks = c("Not Missing", "Missing"),
    labels = c("Present", "Missing")
  ) +
  ggplot2::guides(
    color = ggplot2::guide_legend(
      title.position = "top",
      title.hjust = 0,
      override.aes = list(
        size = 5
      ),
      label.position = "bottom"
    )
  ) +
  ggplot2::labs(
    title = "Structural Zeros: Testing vs Confirmation",
    subtitle = subtitle_text,
    x = "\nTests Conducted",
    y = "Cases Confirmed\n",
    color = "Data Status"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.box.just = "left",
    axis.text.y = ggplot2::element_text(size = 8),
    strip.text = ggplot2::element_text(
      family = "sans",
      face = "bold"
    ),
    panel.border = ggplot2::element_rect(
      colour = "black",
      fill = NA,
      size = .7
    ),
    panel.spacing.x = ggplot2::unit(0.5, "cm"),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    plot.subtitle = ggplot2::element_text(size = 9, color = "grey30")
  )

# display the plot
structural_zeros_plot

Show full code

################################################################################
################# ~ Missing data detection methods full code ~ #################
################################################################################

### Step 1: Install and Load Libraries -----------------------------------------

from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
from pyprojroot import here

# ── cli helpers ──────────────────────────────────────────────────────────────
def cli_header(message):
    print(f"\n{message}")

def cli_info(message):
    print(f"INFO: {message}")

def cli_success(message):
    print(f"SUCCESS: {message}")

def cli_warning(message):
    print(f"WARNING: {message}")

def cli_danger(message):
    print(f"ERROR: {message}")

def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )

### Step 2: Load Data ----------------------------------------------------------

from pathlib import Path

import pandas as pd
from pyprojroot import here

# read the processed surveillance data produced by the import workflow.
# the .rds (loaded by R) and .parquet (loaded here) are written from the
# same dhis2_df in import.qmd, so both languages see identical data.
data_path = Path(here(
    "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed/"
    "clean_malaria_routine_data_final.parquet"
))

df_routine = (
    pd.read_parquet(data_path)
    # drop the import-source label carried over from import.qmd
    .drop(columns="sheet_admin", errors="ignore")
    # filter to the past five years
    .loc[lambda d: d["year"] >= 2019]
    .rename(columns={"maladm_u5": "maladm_hf_u5"})
    .assign(date=lambda d: pd.to_datetime(d["date"], errors="coerce"))
    .assign(ym=lambda d: d["date"].dt.strftime("%Y-%m"))
    .reset_index(drop=True)
)

### Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness --------

#### Step 3.1: Conditional missingness within submitted reports ----------------

import pandas as pd

# define the core indicators that signal a report was submitted
core_inds = ["test", "conf", "maltreat"]

# flag each facility-month as reporting or non-reporting
df_routine = df_routine.assign(
    report_submitted=lambda d: d[core_inds].notna().any(axis=1)
)

# conditional missingness: within submitted reports only,
# what % of each indicator was left blank?
conditional_missing = (
    df_routine.loc[df_routine["report_submitted"]]
    [core_inds]
    .isna()
    .mean()
    .mul(100)
    .round(1)
    .rename_axis("indicator")
    .reset_index(name="pct_missing_given_reported")
)

conditional_missing

#### Step 3.2: Quantitative summary across variables ---------------------------

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

# tabular summary of % missing for the core aggregated indicators
summary_vars = ["test", "susp", "pres", "conf", "maltreat"]

miss_summary = (
    pd.DataFrame({
        "variable": summary_vars,
        "n_miss": [df_routine[v].isna().sum() for v in summary_vars],
        "pct_miss": [
            round(df_routine[v].isna().mean() * 100, 1)
            for v in summary_vars
        ],
    })
    .sort_values("pct_miss", ascending=False)
    .reset_index(drop=True)
)

# bar chart of % missing per variable (horizontal, sorted descending)
vmax = float(miss_summary["pct_miss"].max())
threshold = vmax * 0.15

fig, ax = plt.subplots(figsize=(9, 5.2))
bars = ax.barh(
    miss_summary["variable"],
    miss_summary["pct_miss"],
    color="#3B9AB2",
    height=0.75
)
ax.invert_yaxis()

# value labels: inside the bar (white) when there is room,
# outside (grey) for short bars
for bar, val in zip(bars, miss_summary["pct_miss"]):
    label = f"{round(val, 1)}%"
    if val > threshold:
        ax.text(
            val - vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="right",
            color="white",
            fontsize=11,
            fontweight="bold"
        )
    else:
        ax.text(
            val + vmax * 0.012,
            bar.get_y() + bar.get_height() / 2,
            label,
            va="center",
            ha="left",
            color="#333333",
            fontsize=11,
            fontweight="bold"
        )

# title (large, bold) + subtitle (smaller, grey)
fig.suptitle(
    "Proportion of missing data per indicator",
    fontsize=16,
    fontweight="bold",
    x=0.05,
    ha="left",
    y=0.98
)
ax.set_title(
    "% missing across submitted reports for each core malaria indicator",
    fontsize=12,
    color="#555555",
    loc="left",
    pad=8
)

ax.set_xlabel("")
ax.set_ylabel("")
ax.xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f"{int(x)}%")
)
ax.tick_params(left=False, labelsize=11, colors="#333333")
ax.tick_params(axis="x", colors="#555555", labelsize=10)

# clean look: no border, only faint x gridlines
for spine in ax.spines.values():
    spine.set_visible(False)
ax.grid(axis="x", color="#E5E5E5", linewidth=0.6)
ax.grid(axis="y", visible=False)
ax.set_axisbelow(True)
ax.set_xlim(0, max(vmax * 1.05, 10))

plt.tight_layout(rect=[0, 0, 1, 0.94])

### Step 4: Check Missingness over Time ----------------------------------------

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# variables of interest (aggregated and age-specific)
vars_of_interest = [
    "test",           # aggregated malaria tests
    "test_u5",
    "test_5_14",
    "test_ov15",
    "susp",           # aggregated suspected cases
    "susp_u5",
    "susp_5_14",
    "susp_ov15",
    "pres",           # aggregated presumed cases
    "pres_u5",
    "pres_5_14",
    "pres_ov15",
    "conf",           # aggregated confirmed cases
    "conf_u5",
    "conf_5_14",
    "conf_ov15",
    "maltreat",       # aggregated treatments
    "maltreat_u5",
    "maltreat_5_14",
    "maltreat_ov15",
]

# calculate missing rates by year-month for each variable
missing_rate_date = (
    df_routine.groupby("ym", as_index=False)[vars_of_interest]
    .apply(lambda g: g[vars_of_interest].isna().mean() * 100)
    .reset_index()
    .rename(columns={"level_0": "ym"})
)
# rebuild from groupby to preserve ym correctly
_grouped = df_routine.groupby("ym")
missing_rate_date = pd.DataFrame(
    {v: _grouped[v].apply(lambda s: s.isna().mean() * 100)
     for v in vars_of_interest}
).reset_index()

# pivot to long format
missing_rate_long = missing_rate_date.melt(
    id_vars="ym",
    value_vars=vars_of_interest,
    var_name="variables",
    value_name="missing_rate"
).sort_values("ym")

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_long["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap (mirrors wesanderson::wes_palette("Zissou1", 100))
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

# pivot back to matrix for pcolormesh
ym_order = sorted(missing_rate_long["ym"].unique())
var_order = vars_of_interest  # top-to-bottom order on y-axis

pivot = (
    missing_rate_long
    .pivot(index="variables", columns="ym", values="missing_rate")
    .reindex(index=var_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(var_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

# x-axis ticks and labels
ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

# y-axis ticks and labels
ax.set_yticks(np.arange(len(var_order)) + 0.5)
ax.set_yticklabels(var_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("Variable")
ax.set_title(
    "The proportion of missing data for selected variables by year-month",
    fontsize=12, pad=10
)

# horizontal colorbar at the bottom
cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

# remove grid lines
ax.grid(False)
plt.tight_layout()

### Step 5: Check Missingness over Time and Administrative Level ---------------

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# calculate missing rates by year-month and adm2 for test_hf_u5
missing_rate_adm2 = (
    df_routine.groupby(["ym", "adm2"], as_index=False)
    .agg(missing_rate=("test_hf_u5", lambda s: s.isna().mean() * 100))
)

# option to control the color scale range for the heatmap
full_range = False

if full_range:
    vmin, vmax = 0.0, 100.0
else:
    vals = missing_rate_adm2["missing_rate"].dropna()
    vmin = float(np.floor(vals.min()))
    vmax = float(np.ceil(vals.max()))

# Zissou1 colormap
zissou1_colors = ["#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00"]
zissou1_cmap = mcolors.LinearSegmentedColormap.from_list(
    "Zissou1", zissou1_colors, N=100
)

ym_order = sorted(missing_rate_adm2["ym"].unique())
adm2_order = sorted(missing_rate_adm2["adm2"].unique())

pivot = (
    missing_rate_adm2
    .pivot(index="adm2", columns="ym", values="missing_rate")
    .reindex(index=adm2_order, columns=ym_order)
)

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(
    np.arange(len(ym_order) + 1),
    np.arange(len(adm2_order) + 1),
    pivot.values,
    cmap=zissou1_cmap,
    vmin=vmin,
    vmax=vmax,
    linewidth=0.2,
    edgecolors="white"
)
ax.set_aspect("auto")

ax.set_xticks(np.arange(len(ym_order)) + 0.5)
ax.set_xticklabels(ym_order, rotation=75, ha="right", fontsize=8)

ax.set_yticks(np.arange(len(adm2_order)) + 0.5)
ax.set_yticklabels(adm2_order, fontsize=8)

ax.set_xlabel("")
ax.set_ylabel("District")
ax.set_title(
    "The proportion of missing data for test_hf_u5 by year-month and adm2",
    fontsize=12, pad=10
)

cbar = fig.colorbar(
    mesh, ax=ax, orientation="horizontal",
    fraction=0.03, pad=0.18, aspect=40
)
cbar.set_label("Missing rate (%)", fontsize=12, fontweight="bold")
cbar.ax.tick_params(labelsize=8)

ax.grid(False)
plt.tight_layout()

### Step 6: Check Missingness by Health Facility Type and Age Groups -----------

import re

import pandas as pd

# define core columns to retain during disaggregation
core_indicators = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid",
    "year", "month", "ym", "date"
]

# select core indicators and disaggregated columns
disagg_pattern = re.compile(r".+_(com|hf)_(u5|5_14|ov15)$")
disagg_cols = [c for c in df_routine.columns if disagg_pattern.match(c)]
keep_core = [c for c in core_indicators if c in df_routine.columns]

df_routine_disagg = df_routine[keep_core + disagg_cols].copy()

# drop maladm_hf_u5: lacks community-level and other age-group counterparts
if "maladm_hf_u5" in df_routine_disagg.columns:
    df_routine_disagg = df_routine_disagg.drop(columns=["maladm_hf_u5"])

# re-identify disaggregated columns after the drop
disagg_cols_final = [
    c for c in df_routine_disagg.columns
    if c not in keep_core
]

# transform to long format with separate facility_type and age_group columns
def _parse_col(col):
    """Extract (indicator, facility_type, age_group) from column name."""
    m = re.match(r"^(.+)_(com|hf)_(u5|5_14|ov15)$", col)
    if m:
        return m.group(1), m.group(2), m.group(3)
    return None, None, None

long_rows = []
for col in disagg_cols_final:
    ind, ftype, age = _parse_col(col)
    if ind is None:
        continue
    tmp = df_routine_disagg[keep_core + [col]].copy()
    tmp = tmp.rename(columns={col: "value"})
    tmp["indicator"] = ind
    tmp["facility_type"] = ftype
    tmp["age_group"] = age
    long_rows.append(tmp)

df_routine_disagg_long = pd.concat(long_rows, ignore_index=True)

# relabel factors to match R output
age_order = ["u5", "5_14", "ov15"]
age_labels = ["Under 5", "5 to 15", "Over 15"]
ftype_order = ["hf", "com"]
ftype_labels = ["Health Facility", "Community"]

df_routine_disagg_long["age_group"] = pd.Categorical(
    df_routine_disagg_long["age_group"].map(
        dict(zip(age_order, age_labels))
    ),
    categories=age_labels,
    ordered=True
)
df_routine_disagg_long["facility_type"] = pd.Categorical(
    df_routine_disagg_long["facility_type"].map(
        dict(zip(ftype_order, ftype_labels))
    ),
    categories=ftype_labels,
    ordered=True
)

# preview the disaggregated data structure as an HTML table
df_routine_disagg_long.head().style.hide(axis="index")

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd

# prepare data for stacked bar chart with age group and facility type
missing_present_data_facility = (
    df_routine_disagg_long
    .groupby(["indicator", "age_group", "facility_type"], observed=True,
             as_index=False)
    .agg(
        total_records=("value", "count"),
        missing_count=("value", lambda s: s.isna().sum()),
        present_count=("value", lambda s: s.notna().sum()),
    )
    .assign(
        missing_pct=lambda d: (d["missing_count"] / d["total_records"] * 100)
                              .round(2),
        present_pct=lambda d: (d["present_count"] / d["total_records"] * 100)
                              .round(2),
    )
)

# pivot to long format for status column
mp_long = missing_present_data_facility.melt(
    id_vars=["indicator", "age_group", "facility_type"],
    value_vars=["missing_pct", "present_pct"],
    var_name="status",
    value_name="percentage"
).assign(
    status=lambda d: pd.Categorical(
        d["status"].map({"missing_pct": "Missing", "present_pct": "Present"}),
        categories=["Missing", "Present"],
        ordered=True
    )
)

# viridis D palette: begin=0.2, end=0.5 for two categories
import matplotlib.cm as cm
viridis = cm.get_cmap("viridis")
color_missing = viridis(0.2)
color_present = viridis(0.5)
status_colors = {"Missing": color_missing, "Present": color_present}

indicators = list(mp_long["indicator"].unique())
facility_types = list(mp_long["facility_type"].cat.categories)
age_labels_ordered = list(mp_long["age_group"].cat.categories)
age_rev = list(reversed(age_labels_ordered))

n_ind = len(indicators)
n_ftype = len(facility_types)

fig, axes = plt.subplots(
    n_ind, n_ftype,
    figsize=(12, 8),
    sharex=True,
    sharey="row"
)

for i, ind in enumerate(indicators):
    for j, ftype in enumerate(facility_types):
        ax = axes[i][j] if n_ind > 1 else axes[j]
        subset = mp_long.loc[
            (mp_long["indicator"] == ind) &
            (mp_long["facility_type"] == ftype)
        ]
        for status in ["Present", "Missing"]:
            s_sub = subset.loc[subset["status"] == status]
            s_sub = s_sub.set_index("age_group").reindex(age_rev)
            ax.barh(
                s_sub.index,
                s_sub["percentage"],
                color=status_colors[status],
                alpha=0.8,
                label=status
            )
        ax.set_xlim(0, 100)
        ax.set_xlabel("")
        ax.set_ylabel("")
        # panel border
        for spine in ax.spines.values():
            spine.set_linewidth(0.7)
        # strip labels
        if i == 0:
            ax.set_title(str(ftype), fontsize=9, fontweight="bold")
        if j == n_ftype - 1:
            ax.yaxis.set_label_position("right")
            ax.set_ylabel(str(ind), rotation=270, labelpad=12,
                          fontsize=9, fontweight="bold")
        # no grid
        ax.grid(False)

# shared axis labels
fig.supxlabel("Percentage (%)", y=0.02, fontsize=10)
fig.supylabel("Age Group (years)", x=0.02, fontsize=10)
fig.suptitle(
    "Missing vs Present Data by Indicator, Age Group, and Facility Type",
    fontsize=12, y=1.01
)

# shared legend
handles = [
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Present"], alpha=0.8),
    plt.Rectangle((0, 0), 1, 1, color=status_colors["Missing"], alpha=0.8),
]
fig.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    bbox_to_anchor=(0.0, -0.04),
    ncol=2,
    frameon=False,
    fontsize=9
)
plt.tight_layout()

### Step 7: Visualizing Structural Zeros and Logical Relationships -------------

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd

# transform to wide format to examine relationships between indicators
keep_core = [
    "adm0", "adm1", "adm2", "adm3",
    "hf", "hf_uid", "year", "month", "ym", "date"
]
keep_cols = [c for c in keep_core if c in df_routine_disagg_long.columns]

df_wide = (
    df_routine_disagg_long[keep_cols + ["facility_type", "age_group",
                                         "indicator", "value"]]
    .pivot_table(
        index=keep_cols + ["facility_type", "age_group"],
        columns="indicator",
        values="value",
        aggfunc="first",
        observed=True
    )
    .reset_index()
)
df_wide.columns.name = None

# calculate missing counts for subtitle
n_test_missing_conf_zero = df_wide.loc[
    (df_wide["conf"] == 0) & df_wide["test"].isna()
].shape[0]

n_conf_missing_test_zero = df_wide.loc[
    (df_wide["test"] == 0) & df_wide["conf"].isna()
].shape[0]

n_test_na_conf_positive = df_wide.loc[
    df_wide["test"].isna() & (df_wide["conf"] > 0)
].shape[0]

n_conf_na_test_positive = df_wide.loc[
    df_wide["conf"].isna() & (df_wide["test"] > 0)
].shape[0]

subtitle_text = (
    f"N = {n_test_missing_conf_zero:,} where conf = 0 but test is missing "
    f"(legitimate zero); \n"
    f"N = {n_conf_missing_test_zero:,} where test = 0 but conf is missing "
    f"(legitimate zero); \n"
    f"N = {n_test_na_conf_positive:,} where test is missing but conf > 0 "
    f"(logical inconsistency); \n"
    f"N = {n_conf_na_test_positive:,} where conf is missing but test > 0 "
    f"(possible legitimate zero)"
)

# create demo data matching the R render chunk:
# rows 400-600 where conf is NA get test = 0 (legitimate zero pattern)
# rows 490-900 where test is NA get conf = 0 (legitimate zero pattern)
df_wide_demo = df_wide.copy().reset_index(drop=True)
mask1 = (df_wide_demo.index.isin(range(400, 601))) & df_wide_demo["conf"].isna()
df_wide_demo.loc[mask1, "test"] = 0.0
mask2 = (df_wide_demo.index.isin(range(490, 901))) & df_wide_demo["test"].isna()
df_wide_demo.loc[mask2, "conf"] = 0.0

# compute margin offsets for miss_point behavior
test_obs = df_wide_demo["test"].dropna()
conf_obs = df_wide_demo["conf"].dropna()
test_range = test_obs.max() - test_obs.min() if len(test_obs) > 1 else 1.0
conf_range = conf_obs.max() - conf_obs.min() if len(conf_obs) > 1 else 1.0
test_margin = test_obs.min() - 0.1 * test_range
conf_margin = conf_obs.min() - 0.1 * conf_range

viridis = cm.get_cmap("viridis")
color_present = viridis(0.5)   # "Not Missing"
color_missing = viridis(0.2)   # "Missing"

fig, ax = plt.subplots(figsize=(10, 8))

# --- present-present points ---
mask_pp = df_wide_demo["test"].notna() & df_wide_demo["conf"].notna()
ax.scatter(
    df_wide_demo.loc[mask_pp, "test"],
    df_wide_demo.loc[mask_pp, "conf"],
    color=color_present, alpha=0.4, s=10, label="Present", rasterized=True
)

# --- test missing, conf present: shift test to left margin ---
mask_tm = df_wide_demo["test"].isna() & df_wide_demo["conf"].notna()
ax.scatter(
    [test_margin] * mask_tm.sum(),
    df_wide_demo.loc[mask_tm, "conf"],
    color=color_missing, alpha=0.4, s=10, label="Missing", rasterized=True
)

# --- conf missing, test present: shift conf to bottom margin ---
mask_cm = df_wide_demo["conf"].isna() & df_wide_demo["test"].notna()
ax.scatter(
    df_wide_demo.loc[mask_cm, "test"],
    [conf_margin] * mask_cm.sum(),
    color=color_missing, alpha=0.4, s=10, rasterized=True
)

# reference lines at zero
ax.axvline(0, linestyle="dashed", color="grey", linewidth=0.2)
ax.axhline(0, linestyle="dashed", color="grey", linewidth=0.2)

# comma-formatted tick labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"{x:,.0f}"
))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda y, _: f"{y:,.0f}"
))

ax.set_xlabel("\nTests Conducted")
ax.set_ylabel("Cases Confirmed\n")
ax.set_title("Structural Zeros: Testing vs Confirmation", fontsize=12)
ax.text(
    0.0, -0.12, subtitle_text,
    transform=ax.transAxes, fontsize=9, color="grey",
    verticalalignment="top"
)

handles = [
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_present, markersize=8),
    plt.Line2D([0], [0], marker="o", color="w",
               markerfacecolor=color_missing, markersize=8),
]
ax.legend(
    handles, ["Present", "Missing"],
    title="Data Status",
    loc="lower left",
    frameon=True,
    fontsize=9
)

for spine in ax.spines.values():
    spine.set_linewidth(0.7)
plt.tight_layout()

Overview

Understanding Missing Data

Understanding Types of Missing Data

Step-by-Step

Step 1: Install and Load Libraries

Step 2: Load Data

Step 3: Distinguish Reporting Gaps from Indicator-Level Missingness

Step 3.1: Conditional missingness within submitted reports

Step 3.2: Quantitative summary across variables

Step 4: Check Missingness over Time

Step 5: Check Missingness over Time and Administrative Level

Step 6: Check Missingness by Health Facility Type and Age Groups

Step 7: Visualizing Structural Zeros and Logical Relationships

Summary

Further Reading

Full Code