Dev Site — You are viewing the development build. Go to Main Site

  • English
  • Français
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. DHIS2 data preprocessing
  • Code library for subnational tailoring
    English version
  • 1. Getting Started
    • 1.1 About and Contact Information
    • 1.2 For Everyone
    • 1.3 For the SNT Team
    • 1.4 For Analysts
    • 1.5 Acronyms and Resource Library
    • 1.6 Producing High-Quality Outputs
  • 2. Data Assembly and Management
    • 2.1 Working with Shapefiles
      • Spatial data overview
      • Basic shapefile use and visualization
      • Shapefile management and customization
      • Merging shapefiles with tabular data
    • 2.2 Health Facilities Data
      • Fuzzy matching of names across datasets
      • Health facility coordinates and point data
    • 2.3 Routine Surveillance Data
      • Determining active and inactive status
      • Routine data extraction
      • DHIS2 data preprocessing
      • Missing data detection methods
      • Health facility reporting rate
      • Contextual considerations
      • Data coherency checks
      • Outlier detection methods
      • Imputation methods
      • Final database
    • 2.4 Stock Data
      • LMIS
    • 2.5 Population Data
      • National population data
      • WorldPop population raster
    • 2.6 National Household Survey Data
      • DHS data overview and preparation
      • Prevalence of malaria infection
      • All-cause child mortality
      • Treatment-seeking rates
      • ITN ownership, access, and usage
      • Wealth quintiles analysis
    • 2.7 Entomological Data
      • Entomological data
    • 2.8 Climate and Environmental Data
      • Climate and environment data extraction from raster
    • 2.9 Modeled Data
      • Generating spatial modeled estimates
      • Working with geospatial model estimates
      • Modeled estimates of malaria mortality and proxies
      • Modeled estimates of entomological indicators
    • 2.10 Cost Data
  • 3. Situation Analysis
    • 3.1 Review of Past Interventions
      • Case Management
      • Routine Interventions
      • Mass ITN Campaigns
      • Chemoprevention Campaigns
      • Other Vector Control
    • 3.2 Trend Analysis
    • 3.3 Risk Factors
    • 3.4 Impact Evaluation
    • 3.5 Cost Analysis
  • 4. Stratification
    • 4.1 Epidemiological Stratification
      • Incidence overview and crude incidence
      • Incidence adjustment 1: incomplete testing
      • Incidence adjustment 2: incomplete reporting
      • Incidence adjustment 3: treatment-seeking
      • Incidence stratification
      • Prevalence and mortality stratification
      • Combined risk categorization
      • Risk categorization REMOVE?
      • Risk categorization REMOVE?
    • 4.2 Access to Care
    • 4.3 Seasonality
      • Defining Seasonal Areas
      • Durations of Seasonality
    • 4.4 Urban Microstratification
  • 5. Intervention Targeting and Prioritization
    • 5.1 Intervention Targeting
    • 5.2 Prioritization
    • 5.3 Optimization under Limited Resources

On this page

  • Overview
  • Key Considerations
    • Troubleshooting
    • Best Practices When Working with Routine Data
    • What a Clean and Well-Structured Dataset Should Look Like
  • Step-by-Step
    • Step 1: Import Packages and Data
      • Step 1.1: Import packages
      • Step 1.2: Import data
    • Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts
      • Step 2.1: Verify all expected groups are present after import
      • Step 2.2: Check for completely missing variables
    • Step 3: Rename Columns
    • Step 4: Standardise the Date Columns
    • Step 5: Managing Location Columns
      • Step 5.1: Harmonise administrative names
      • Step 5.2: Create unique identifiers and location labels
    • Step 5.5: Harmonise Health Facility Names with the Master Facility List
    • Step 6: Compute Variables
      • Step 6.1: Compute indicator totals and new variables
      • Step 6.2: Quality control of indicator totals
      • Step 6.3: Export rows with incoherent totals
      • Step 6.4: Add IPD/OPD specification
    • Step 7: Finalize Data
      • Step 7.1: Troubleshoot duplicate HF-month records with different data
      • Step 7.2: Generate final data dictionary
      • Step 7.3: Arrange final column order
    • Step 8: Aggregate and Save Data
      • Step 8.1: Save data at the health facility level
      • Step 8.2: Aggregate and save data at each admin unit level
  • Summary
  • Full Code
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. DHIS2 data preprocessing

DHIS2 data preprocessing

Beginner

Overview

DHIS2 is a core data source for routine malaria surveillance, capturing indicators such as outpatient case counts, diagnostic testing, and treatment provision. However, raw DHIS2 exports may not be ready for SNT analysis if they have structural or quality issues, are messy, include inconsistent or unreadable column names, missing values, mismatched facility names, duplicate records, and formats that vary across months or locations. To make this data usable for SNT, it needs to be cleaned, reshaped, and aligned with a spatial and temporal structure that supports later steps of analysis: identifying active and inactive health facilities, calculating reporting rates, managing outliers, calculating malaria incidence, etc.

In this section, we walk through how to preprocess monthly DHIS2 malaria records to make them SNT-ready, using example data from Sierra Leone. This includes cleaning and standardizing facility-level data, parsing and standardizing date formats, harmonizing administrative names with reference shapefiles for spatial joins, computing derived indicators, and aggregating to administrative levels where needed to produce structured datasets.

The DHIS2 data we are working with may require some or all of the specific steps detailed below, or additional steps may be required to prepare the data for analysis.

NoteObjectives
  • Import and combine DHIS2 data files from multiple sources
  • Verify data completeness after import (check all groups present, identify missing variables)
  • Rename columns using a data dictionary for standardized naming
  • Standardize date columns (parse dates, create year/month/yearmon fields)
  • Harmonize administrative names with reference shapefiles for spatial joins
  • Create unique identifiers (UIDs) for facilities and records
  • Compute indicator totals and derived variables (e.g., presumed cases)
  • Perform quality control checks on computed indicators
  • Identify and resolve duplicate HF-month records
  • Aggregate data at the selected administrative level
  • Export cleaned facility-level and aggregated datasets

Key Considerations

Troubleshooting

Routine surveillance data can be trickier to work with than other data sources that are more standardized. Expect to encounter and resolve data issues throughout the DHIS2 data preprocessing process. We will likely need to:

  • Investigate unexpected patterns in the data
  • Make judgment calls about how to handle inconsistencies
  • Collaborate with the SNT team on country-specific issues
  • Develop custom solutions for problems not covered here

The troubleshooting approaches shown in this guide provide a starting framework, but real-world data cleaning often requires iterative problem-solving and local expertise. While this guide addresses challenges others have encountered when preparing DHIS2 data for analysis, every country is different. The specific context may present unique problems that are not yet covered in the code library. We encourage persistence and creativity if needed, and recommend consulting others for help if a problem cannot be resolved independently.

If we encounter a new challenge when preprocessing DHIS2 data and would like new content to be added to the code library, please fill out our feedback form.

Best Practices When Working with Routine Data

1. Understand the System’s Context

  • Reporting workflows: Know how data moves from clinics to DHIS2 (paper vs. direct entry, aggregation rules).
  • Indicator definitions: Confirm case definitions and inclusions

2. Triangulate Strategically

Cross-check with:

  • External benchmarks (e.g., HMIS reports, survey data)
  • Parallel systems (e.g., stock management data for testing kits)
  • Temporal patterns (compare dry vs. rainy season trends)

3. Document Your Process

Maintain a troubleshooting log tracking:

  • Issues encountered
  • Data sources consulted
  • Decisions made (with rationale)

Timing lags: Facilities may report late, especially after holidays or system outages.

What a Clean and Well-Structured Dataset Should Look Like

Our data might come in a different format to the Sierra Leone example below. Whether we are able to use the chunk of code provided, or parts of it, to import the data, please make sure the data, after import, contains the following information:

  • A column with the date (year and month) of the report
  • A column with the health facility name
  • Various columns with parent admin units - at minimum we will need the name of the admin unit for SNT analysis (for example, chiefdom for Sierra Leone)
  • The data indicator columns

Below is an example of what the data should look like after import:

adm0 adm1 adm2 adm3 hf date month year allout_u5 allout_ov5 conf_u5 conf_5_14
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-01-01 1 2021 44 48 36 22
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-02-01 2 2021 37 27 20 30
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-03-01 3 2021 44 42 18 22
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-04-01 4 2021 28 23 30 30
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-05-01 5 2021 138 141 72 40
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-06-01 6 2021 127 82 59 16
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-07-01 7 2021 130 175 86 24
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-08-01 8 2021 144 203 93 17
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-09-01 9 2021 159 159 78 18
SIERRA LEONE SOUTHERN BO BO TOWN AETHEL CHP 2021-10-01 10 2021 88 128 6 80

Step-by-Step

In this section, we walk through the key preprocessing steps needed to get DHIS2 malaria data ready for SNT analysis, using example DHIS2 data from Sierra Leone. The focus here is on cleaning, reshaping, and aggregating facility-level data to produce structured outputs at both the facility and administrative levels.

Each step is designed to guide us through the process cleanly. Follow the notes in the code, especially where edits are required (e.g., column names or admin levels). The goal is to make the data analysis-ready without breaking linkages to spatial units or reporting structures.

Remember to adapt this process to the specific country context in which we are working.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Import Packages and Data

We will use Sierra Leone DHIS2 malaria data as a working example. Variable names or database structures may differ across countries. Review and update the variable names to match the country’s DHIS2 configuration.

We begin by installing and loading the required packages for our preprocessing, then import the data.

If using Python and help installing packages is needed, please refer to the Getting Started section.

Step 1.1: Import packages

  • R
  • Python
# install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# install or load relevant packages
pacman::p_load(
  tidyverse, # data manipulation, reshaping and plotting
  rio,       # import/export multiple file types
  DT,        # interactive table preview
  here,      # project-relative file paths
  readxl,    # read excel files
  writexl,   # write excel files
  knitr      # render code in quarto/rmarkdown
)
import pandas as pd  # load core data tools
from pathlib import Path  # handle file system paths
from pyprojroot import here  # build project-relative paths
import os  # optional path utilities
import locale  # for setting language locale
import geopandas as gpd  # for importing shapefiles
import xxhash  # for hashing
import pyarrow.parquet as pq  # for exportin parquet files
import pyarrow as pa  # for exportin parquet files

Step 1.2: Import data

The code chunk below defines a helper function to import different types of data, for example Excel or csv data. If the DHIS2 extraction resulted in multiple files, this function will merge them together into a single dataset. Then we use it to read in our data file(s).

When calling the DHIS2 data import function, we read files from a folder using the here package. The here package builds paths from the project root, keeping code portable and reliable.

ImportantInspect Your Data Before Proceeding

Before running any code, take a moment to visually inspect the data. This is especially important when working with DHIS2 data or other routine health sources. As explained in the Data Structures of the Getting Started: For Analysts page, the data should follow tidy data principles: each column represents a variable, each row an observation, and each cell contains a single value.

To check this, open the Excel or CSV file directly and review:

  • Headers: Are the column names in the first row? Are they clear and consistent?
  • Columns and rows:
    • Each column should represent a variable (e.g., health_facility, month, confirmed_cases).
    • Each row should represent a single observation (e.g., one facility-month).
  • Cells: Each cell should contain a single value (not merged or grouped). Avoid cells that contain multiple values (e.g., 10/5 for tested/positive).
  • Extra rows: Remove any top rows used for titles, notes, or merged headers.

If the dataset is already tidy, we are ready to move on. If not, clean it up before working with it. This can be done:

  • In Excel, by editing a copy of the file and removing any extra formatting, then saving the new copy to import and work with in later steps.

  • Or in R, using packages like:

    • rio::import(file, skip = X) to skip X number of extra header rows
    • tidyr::pivot_longer() to reshape wide data into long format
    • dplyr::mutate() and separate() to split combined values into separate columns
  • Or in Python, using:

    • pandas.read_excel(file, skiprows=X) to skip X number of unwanted header rows
    • pandas.melt() to reshape wide data into long format
    • pandas.Series.str.split() and pandas.assign() to split and create new columns

For more complex formatting issues, like multi-row or merged headers, see this excellent step-by-step guide: Tidying Multi-Header Excel Data with R by Paul Campbell (2019).

Getting this right early will save time and reduce confusion later in the workflow, as all the subsequent steps assume that the data is already in tidy format and ready for analysis.

  • R
  • Python

Set the paths to the DHIS2 data initially.

# set the path to the data
core_routine_path <- here::here(
  "01_data",
  "1.2_epidemiology",
  "1.2a_routine_surveillance",
  "raw"
)

# set the full DHIS2 path
path_to_dhis2 <- here::here(core_routine_path, "sle_dhis2_2015_2022.xlsx")

If the dataset is a single Excel spreadsheet, it can be imported directly using rio::import, which reads multiple file formats. This approach assumes all indicators, locations, and time periods are stored in one file

# read the data
dhis2_df <- rio::import(file = path_to_dhis2)

In cases where the data is spread across multiple tabs, for example split into yearly tabs, use the following approach.

dhis2_df <- rio::import_list(
  file = path_to_dhis2,
  rbind = TRUE,        # stack all tabs into one table
  rbind_label = "year" # tab name stored in column 'year'
)

DHIS2 exports can be difficult when data spans many facilities or years. File size and memory use increase when stacking many tabs or files. For this reason, data may be exported across multiple files or sheets. If the data is in a zipped archive, or split into multiple files in the same format, import them together with rio::import_list() to import them together. This assumes all files share the same structure, column names, and data types. It also assumes sheets contain rectangular data only, and the file extension is consistent so rio can detect the format. Examples of this include district or admin named files exported by year ranges, such as Bombali_District_2015-2023.xls or Kambia_District_2015-2023.xls.

dhis2_df <- rio::import_list(
  file = list.files(
    path = core_routine_path,
    pattern = "\\.(xls)$",    # the file extensions
    full.names = TRUE
  ),
  rbind = TRUE,                # stack all tabs/files into one table
  rbind_label = "sheet_admin", # source name stored in column 'sheet_admin'
  rbind_fill = TRUE            # fill missing columns with NA when stacking
)

# check data
dhis2_df |>
  dplyr::select(1:20) |>
  dplyr::glimpse()
NoteOutput
Rows: 184,056
Columns: 10
$ orgunitlevel1                                  <chr> "Sierra Leone", "Sierra…
$ orgunitlevel2                                  <chr> "Bo District", "Bo Dist…
$ orgunitlevel3                                  <chr> "Bo City Council", "Bo …
$ orgunitlevel4                                  <chr> "Bo City", "Bo City", "…
$ orgunitlevel5                                  <chr> "Aethel CHP", "Aethel C…
$ organisationunitname                           <chr> "Aethel CHP", "Aethel C…
$ periodname                                     <chr> "January 2015", "Februa…
$ `OPD (New and follow-up curative) 0-59m_X`     <dbl> NA, NA, NA, NA, NA, NA,…
$ `OPD (New and follow-up curative) 5+y_X`       <dbl> NA, NA, NA, NA, NA, NA,…
$ `Admission - Child with malaria 0-59 months_X` <dbl> NA, NA, NA, NA, NA, NA,…

To adapt the code:

  • Line 3: Set core_routine_path to your target folder.
  • Line 4: Modify the pattern to match your extension (xls, xlsx, csv).
  • Line 8: Set rbind_label = 'sheet_admin' to store the source tab or file, such as admin units, use year for yearly tabs/files.

Set the paths to the DHIS2 data initially.

# set the full dhis2 path
core_routine_path = Path(
    here("01_data/1.2_epidemiology/1.2a_routine_surveillance/raw")
)

# set the full dhis2 path
path_to_dhis2 = core_routine_path / "sle_dhis2_2015_2022.xlsx"

If the dataset is a single Excel file, import it directly using pd.read_excel. This assumes all indicators, locations, and time periods are in one spreadsheet.

# read the data
dhis2_df = pd.read_excel(path_to_dhis2)

In cases where data is stored in multiple tabs, such as yearly splits, load all tabs and stack them into one table. Add a column to keep the tab source, like year.

# import all tabs and stack into a single table
dhis2_df = pd.concat(
    [
        pd.read_excel(path_to_dhis2, sheet_name=s).assign(year=s)
        for s in pd.ExcelFile(path_to_dhis2).sheet_names
    ],
    ignore_index=True,
)
dhis2_df.head(10).style

Cumulative exports across many facilities or years are often split into multiple files. File size and memory use increase when stacking. All files and sheets are assumed to share the same columns and types. Use pandas to detect the sheets, read each file, stack them, and store the file or tab source in a label column, such as sheet_admin. Examples of this include district or admin named files exported by year ranges, such as Bombali_District_2015-2023.xls or Kambia_District_2015-2023.xls.

# set file extension
extension = "xls"

# find all files with specified extension in directory
files = list(core_routine_path.glob(f"*.{extension}"))

# initialise dataframe
dhis2_df = pd.DataFrame()

# iterate over files, concatenate stack as one dataframe
for file in files:
    temp = pd.read_excel(file, sheet_name="Sheet1")
    dhis2_df = pd.concat([dhis2_df, temp])

# Inspect data
dhis2_df.head(10)
orgunitlevel1 orgunitlevel2 orgunitlevel3 orgunitlevel4 orgunitlevel5 organisationunitname periodname OPD (New and follow-up curative) 0-59m_X OPD (New and follow-up curative) 5+y_X Admission - Child with malaria 0-59 months_X Admission - Child with malaria 5-14 years_X Admission - Malaria 15+ years_X Separation - Child with malaria 0-59 months_X Death Separation - Child with malaria 5-14 years_X Death Separation - Malaria 15+ years_X Death Fever case - suspected Malaria 0-59m_X Fever case - suspected Malaria 5-14y_X Fever case - suspected Malaria 15+y_X Fever case in community (Suspected Malaria) 0-59m_X Fever case in community (Suspected Malaria) 5-14y_X Fever case in community (Suspected Malaria) 15+y_X Death malaria 15+ years Female Death malaria 15+ years Male Child death - Malaria 1-59m_X Child death - Malaria 10-14y_X Child death - Malaria 5-9y_X Fever case in community tested for Malaria (RDT) - Negative 0-59m_X Fever case in community tested for Malaria (RDT) - Positive 0-59m_X Fever case in community tested for Malaria (RDT) - Negative 5-14y_X Fever case in community tested for Malaria (RDT) - Positive 5-14y_X Fever case in community tested for Malaria (RDT) - Negative 15+y_X Fever case in community tested for Malaria (RDT) - Positive 15+y_X Fever case tested for Malaria (Microscopy) - Negative 0-59m_X Fever case tested for Malaria (Microscopy) - Positive 0-59m_X Fever case tested for Malaria (Microscopy) - Negative 5-14y_X Fever case tested for Malaria (Microscopy) - Positive 5-14y_X Fever case tested for Malaria (Microscopy) - Negative 15+y_X Fever case tested for Malaria (Microscopy) - Positive 15+y_X Fever case tested for Malaria (RDT) - Negative 0-59m_X Fever case tested for Malaria (RDT) - Positive 0-59m_X Fever case tested for Malaria (RDT) - Negative 5-14y_X Fever case tested for Malaria (RDT) - Positive 5-14y_X Fever case tested for Malaria (RDT) - Negative 15+y_X Fever case tested for Malaria (RDT) - Positive 15+y_X Malaria treated in community with ACT <24 hours 0-59m_X Malaria treated in community with ACT >24 hours 0-59m_X Malaria treated in community with ACT <24 hours 5-14y_X Malaria treated in community with ACT >24 hours 5-14y_X Malaria treated in community with ACT <24 hours 15+y_X Malaria treated in community with ACT >24 hours 15+y_X Malaria treated with ACT <24 hours 0-59m_X Malaria treated with ACT >24 hours 0-59m_X Malaria treated with ACT <24 hours 5-14y_X Malaria treated with ACT >24 hours 5-14y_X Malaria treated with ACT <24 hours 15+y_X Malaria treated with ACT >24 hours 15+y_X
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP January 2015 44 43 NaN NaN NaN NaN NaN NaN 21 4 4 10 7 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 21 NaN 4 NaN 4 15 6 5 2 3 1 18 3 4 NaN 2 NaN
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP February 2015 49 34 NaN NaN NaN NaN NaN NaN 37 3 4 10 7 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10 27 NaN 3 NaN 4 15 6 5 2 3 1 17 4 2 NaN 2 1
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP March 2015 102 83 NaN NaN NaN NaN NaN NaN 46 5 14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 36 NaN 3 NaN 8 NaN NaN NaN NaN NaN NaN 36 NaN 3 NaN 5 NaN
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP April 2015 68 68 NaN NaN NaN NaN NaN NaN 44 10 10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 9 35 4 6 5 5 NaN NaN NaN NaN NaN NaN 27 8 4 2 3 2
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP May 2015 118 76 NaN NaN NaN NaN NaN NaN 105 15 22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20 85 4 11 5 17 NaN NaN NaN NaN NaN NaN 67 18 7 4 10 7
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP June 2015 145 89 NaN NaN NaN NaN NaN NaN 110 10 21 10 11 9 NaN NaN NaN NaN NaN 2 8 1 10 1 8 NaN NaN NaN NaN NaN NaN 24 86 6 4 10 11 13 4 9 1 6 1 70 16 3 1 6 5
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP July 2015 95 94 NaN NaN NaN NaN NaN NaN 67 31 26 10 13 9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 17 50 9 22 8 18 20 NaN 10 NaN 8 NaN 43 6 13 5 10 5
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP August 2015 86 63 NaN NaN NaN NaN NaN NaN 72 6 15 15 17 14 NaN NaN NaN NaN NaN 2 13 3 14 2 12 NaN NaN NaN NaN NaN NaN 8 64 NaN 4 NaN 15 20 4 11 3 6 6 52 12 3 1 11 4
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP September 2015 117 94 NaN NaN NaN NaN NaN NaN 66 15 24 16 19 16 NaN NaN NaN NaN NaN 2 14 3 16 2 14 NaN NaN NaN NaN NaN NaN 16 50 9 6 2 22 22 4 13 3 6 3 42 8 3 3 5 7
Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP October 2015 97 68 NaN NaN NaN NaN NaN NaN 84 5 11 16 5 10 NaN NaN NaN NaN NaN 5 11 4 1 4 6 NaN NaN NaN NaN NaN NaN 11 73 2 3 3 8 23 3 1 NaN 4 NaN 66 7 2 1 5 3

To adapt the code:

  • Line 1: Update extension to xls, xlsx, or csv.
  • Line 4: Set core_routine_path to your data folder.
  • Line 11: Remove sheet_name="Sheet1" if your files do not use a fixed tab name.

Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts

Before any cleaning or indicator construction, confirm that the merged dataset contains all groups and components from the original DHIS2 exports.

Step 2.1: Verify all expected groups are present after import

DHIS2 exports are often split across several files or sheets. The structure varies by country. Files may be separated by admin units, by years, or by other DHIS2-defined groupings such as monthly batches, subnational regions, wards, partner-assigned districts, or custom DHIS2 groups such as outpatient tallies. After importing and appending these extracts into a single dataset, we should run a set of checks to confirm that all components were included correctly and that nothing was missed or duplicated.

The example below uses adm2 (currently orgunitlevel3 in our data) as the grouping variable, but replace adm2 with year or any other grouping used in the export.

  • R
  • Python
# check the adm's' in our data
dhis2_df |>
  dplyr::distinct(orgunitlevel3)
NoteOutput
orgunitlevel3
Bo City Council
Bo District Council
Bombali District Council
Makeni City Council
Bonthe District Council
Bonthe Municipal Council
Falaba District Council
Kailahun District Council
Kambia District Council
Karene District Council
Kenema City Council
Kenema District Council
Koinadugu District Council
Kono District Council
Koidu New Sembehun City Council
Moyamba District Council
Port Loko District Council
Port Loko City Council
Pujehun District Council
Tonkolili District Council
Western Area Rural District Council
Freetown City Council

To adapt the code:

  • Line 3: Replace orgunitlevel3 with the grouping used in your export (for example, adm2, adm1, year, ward, district, or any custom DHIS2 grouping).
# check the adm's' in our data
dhis2_df[["orgunitlevel3"]].drop_duplicates().reset_index(drop=True)
NoteOutput
orgunitlevel3
Moyamba District Council
Bombali District Council
Makeni City Council
Kambia District Council
Tonkolili District Council
Port Loko District Council
Port Loko City Council
Karene District Council
Koinadugu District Council
Kono District Council
Koidu New Sembehun City Council
Western Area Rural District Council
Kailahun District Council
Pujehun District Council
Kenema City Council
Kenema District Council
Bonthe District Council
Bonthe Municipal Council
Bo City Council
Bo District Council
Falaba District Council
Freetown City Council

To adapt the code:

  • Line 2: Replace orgunitlevel3 with the grouping used in your export (for example, adm2, adm1, year, ward, district, or any custom DHIS2 grouping).

This diagnostic checks that all expected groups from the extracts appear in the merged dataset, whether files were split by admin units, years, wards, or any other DHIS2 grouping.

If we imported 15 files or sheets, we should see 15 distinct groups. Fewer groups usually indicate a missing file, inconsistent tab names, or spelling differences. Row counts will vary across groups, so focus on group presence, not size.

If groups are missing, confirm that all source files are present, sheet names follow a consistent pattern, and group names match across extracts. After fixing any issues and re-importing, continue with the preprocessing steps.

Step 2.2: Check for completely missing variables

Some variables, such as health facility name, reporting date, and administrative unit, must not be missing. Rows with missing values in these fields cannot be used directly and should be investigated with the DHIS2 focal person.

Completely missing columns should be reviewed before continuing. They may reflect changes in DHIS2 indicators, missing files, or indicators that were never collected. Check with the SNT team to confirm whether the variable should be kept, renamed, or removed. Resolving this early prevents issues in all downstream SNT steps.

The code below prints the number of missing values for each variable and highlights any missingness in the critical identifiers needed for SNT analysis.

  • R
  • Python
# identify complete missing values per column
dhis2_df |>
  dplyr::summarise(
    dplyr::across(
      dplyr::everything(),
      ~ mean(is.na(.x)) * 100
    )
  ) |>
  tidyr::pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "pct_missing"
  ) |>
  dplyr::filter(pct_missing == 100)
# A tibble: 0 × 2
# ℹ 2 variables: variable <chr>, pct_missing <dbl>
# compute percent missing for each column
pct_missing = dhis2_df.isna().mean() * 100

# convert to tidy/long format
missing_table = (
    pct_missing.reset_index()
    .rename(columns={"index": "variable", 0: "pct_missing"})
)

# filter for completely missing columns
missing_table[missing_table["pct_missing"] == 100]
Empty DataFrame
Columns: [variable, pct_missing]
Index: []

Do not change or adapt the code above.

Step 3: Rename Columns

A practical way to standardise and rename DHIS2 column names is to use a data dictionary. The dictionary contains at least two columns: 1. the original DHIS2 column names as they appear in the raw export 2. the corresponding SNT column names we prefer to use

The SNT names are typically shorter and more consistent, which helps maintain a clear structure across the analytical pipeline.

TipData Dictionary
indicator_label snt_var
orgunitlevel1 adm0
orgunitlevel2 adm1
orgunitlevel3 adm2
orgunitlevel4 adm3
organisationunitname hf
OPD (New and follow-up curative) 0-59m_X allout_u5
OPD (New and follow-up curative) 5+y_X allout_ov5
Admission - Child with malaria 0-59 months_X maladm_u5
Admission - Child with malaria 5-14 years_X maladm_5_14
Admission - Malaria 15+ years_X maladm_ov15
Child death - Malaria 1-59m_X maldth_1_59m
Child death - Malaria 10-14y_X maldth_10_14
Child death - Malaria 5-9y_X maldth_5_9
Death malaria 15+ years Female maldth_fem_ov15
Death malaria 15+ years Male maldth_mal_ov15
Separation - Child with malaria 0-59 months_X Death maldth_u5
Separation - Child with malaria 5-14 years_X Death maldth_5_14
Separation - Malaria 15+ years_X Death maldth_ov15
Fever case - suspected Malaria 0-59m_X susp_u5_hf
Fever case - suspected Malaria 5-14y_X susp_5_14_hf
Fever case - suspected Malaria 15+y_X susp_ov15_hf
Fever case in community (Suspected Malaria) 0-59m_X susp_u5_com
Fever case in community (Suspected Malaria) 5-14y_X susp_5_14_com
Fever case in community (Suspected Malaria) 15+y_X susp_ov15_com
Fever case in community tested for Malaria (RDT) - Negative 0-59m_X tes_neg_rdt_u5_com
Fever case in community tested for Malaria (RDT) - Positive 0-59m_X tes_pos_rdt_u5_com
Fever case in community tested for Malaria (RDT) - Negative 5-14y_X tes_neg_rdt_5_14_com
Fever case in community tested for Malaria (RDT) - Positive 5-14y_X tes_pos_rdt_5_14_com
Fever case in community tested for Malaria (RDT) - Negative 15+y_X tes_neg_rdt_ov15_com
Fever case in community tested for Malaria (RDT) - Positive 15+y_X tes_pos_rdt_ov15_com
Fever case tested for Malaria (Microscopy) - Negative 0-59m_X test_neg_mic_u5_hf
Fever case tested for Malaria (Microscopy) - Positive 0-59m_X test_pos_mic_u5_hf
Fever case tested for Malaria (Microscopy) - Negative 5-14y_X test_neg_mic_5_14_hf
Fever case tested for Malaria (Microscopy) - Positive 5-14y_X test_pos_mic_5_14_hf
Fever case tested for Malaria (Microscopy) - Negative 15+y_X test_neg_mic_ov15_hf
Fever case tested for Malaria (Microscopy) - Positive 15+y_X test_pos_mic_ov15_hf
Fever case tested for Malaria (RDT) - Negative 0-59m_X tes_neg_rdt_u5_hf
Fever case tested for Malaria (RDT) - Positive 0-59m_X tes_pos_rdt_u5_hf
Fever case tested for Malaria (RDT) - Negative 5-14y_X tes_neg_rdt_5_14_hf
Fever case tested for Malaria (RDT) - Positive 5-14y_X tes_pos_rdt_5_14_hf
Fever case tested for Malaria (RDT) - Negative 15+y_X tes_neg_rdt_ov15_hf
Fever case tested for Malaria (RDT) - Positive 15+y_X tes_pos_rdt_ov15_hf
Malaria treated in community with ACT <24 hours 0-59m_X maltreat_u24_u5_com
Malaria treated in community with ACT >24 hours 0-59m_X maltreat_ov24_u5_com
Malaria treated in community with ACT <24 hours 5-14y_X maltreat_u24_5_14_com
Malaria treated in community with ACT >24 hours 5-14y_X maltreat_ov24_5_14_com
Malaria treated in community with ACT <24 hours 15+y_X maltreat_u24_ov15_com
Malaria treated in community with ACT >24 hours 15+y_X maltreat_ov24_ov15_com
Malaria treated with ACT <24 hours 0-59m_X maltreat_u24_u5_hf
Malaria treated with ACT >24 hours 5-14y_X maltreat_ov24_5_14_hf
Malaria treated with ACT <24 hours 5-14y_X maltreat_u24_5_14_hf
Malaria treated with ACT >24 hours 15+y_X maltreat_ov24_ov15_hf
Malaria treated with ACT >24 hours 0-59m_X maltreat_ov24_u5_hf
Malaria treated with ACT <24 hours 15+y_X maltreat_u24_ov15_hf

Keeping this dictionary in an external file (for example, Excel or CSV) allows others to review and update it, and national teams can confirm indicator naming directly. This also reduces the risk of errors from hard-coded renaming rules, which may break when column formats change or when scripts are modified.

Using a dictionary creates a shared reference for DHIS2 renaming and helps maintain a consistent naming approach throughout the SNT workflow.

ImportantConsult with SNT team

Review variable definitions for the country to better understand and appropriately analyze the data. Consider investigating questions like:

  • Are admissions included in outpatients?
  • Are data on pregnant women included in the data for adults?
  • Is there double counting between RDT and microscopy results? If yes, is it appropriate to only use RDT results?
  • Are data from the private sector included here? If so, what percentage of the private sector reports into DHIS2?
  • Are data from community health workers included in their assigned health facility data or are data separate?
  • Have any variables been included or adapted throughout the years? If so, how should they be treated throughout the time series?

Always confirm variable definitions with the SNT team.

  • R
  • Python
# import data dictionary
data_dict <- rio::import(
  here::here(core_routine_path, "sle_dhis2_data_dict.xlsx")
)
# create rename vector: object = old names, names = new names
rename_vector <- stats::setNames(
  object = data_dict$indicator_label,
  nm = data_dict$snt_var
)

# rename DHIS2 data
dhis2_df <- dhis2_df |>
  dplyr::rename(
    !!!rename_vector
  ) |>
  # we drop orgunitlevel5 because it s same as hf
  dplyr::select(-orgunitlevel5)

# check the names
colnames(dhis2_df)
NoteOutput
 [1] "adm0"                   "adm1"                   "adm2"                  
 [4] "adm3"                   "hf"                     "periodname"            
 [7] "allout_u5"              "allout_ov5"             "maladm_u5"             
[10] "maladm_5_14"            "maladm_ov15"            "maldth_u5"             
[13] "maldth_5_14"            "maldth_ov15"            "susp_u5_hf"            
[16] "susp_5_14_hf"           "susp_ov15_hf"           "susp_u5_com"           
[19] "susp_5_14_com"          "susp_ov15_com"          "maldth_fem_ov15"       
[22] "maldth_mal_ov15"        "maldth_1_59m"           "maldth_10_14"          
[25] "maldth_5_9"             "tes_neg_rdt_u5_com"     "tes_pos_rdt_u5_com"    
[28] "tes_neg_rdt_5_14_com"   "tes_pos_rdt_5_14_com"   "tes_neg_rdt_ov15_com"  
[31] "tes_pos_rdt_ov15_com"   "test_neg_mic_u5_hf"     "test_pos_mic_u5_hf"    
[34] "test_neg_mic_5_14_hf"   "test_pos_mic_5_14_hf"   "test_neg_mic_ov15_hf"  
[37] "test_pos_mic_ov15_hf"   "tes_neg_rdt_u5_hf"      "tes_pos_rdt_u5_hf"     
[40] "tes_neg_rdt_5_14_hf"    "tes_pos_rdt_5_14_hf"    "tes_neg_rdt_ov15_hf"   
[43] "tes_pos_rdt_ov15_hf"    "maltreat_u24_u5_com"    "maltreat_ov24_u5_com"  
[46] "maltreat_u24_5_14_com"  "maltreat_ov24_5_14_com" "maltreat_u24_ov15_com" 
[49] "maltreat_ov24_ov15_com" "maltreat_u24_u5_hf"     "maltreat_ov24_u5_hf"   
[52] "maltreat_u24_5_14_hf"   "maltreat_ov24_5_14_hf"  "maltreat_u24_ov15_hf"  
[55] "maltreat_ov24_ov15_hf"  "sheet_admin"           

To adapt the code:

  • Line 3: Point to your own dictionary file and folder.
  • Lines 7–8: Update the dictionary column names for original and SNT variables.
  • Line 12: Replace dhis2_df with your DHIS2 dataset.
  • Line 17: Remove or change orgunitlevel5 if this column is not present.
# import data dictionary
dict_path = os.path.join(core_routine_path, "sle_dhis2_data_dict.xlsx")
data_dict = pd.read_excel(dict_path)

# build rename dictionary: {old_name: new_name}
rename_dict = dict(zip(data_dict["indicator_label"], data_dict["snt_var"]))

# rename dhis2 data
dhis2_df = dhis2_df.rename(columns=rename_dict).drop(
    columns=["orgunitlevel5"], errors="ignore"
)

# view updated column names
dhis2_df.columns.tolist()
NoteOutput
['adm0', 'adm1', 'adm2', 'adm3', 'hf', 'periodname', 'allout_u5', 'allout_ov5', 'maladm_u5', 'maladm_5_14', 'maladm_ov15', 'maldth_u5', 'maldth_5_14', 'maldth_ov15', 'susp_u5_hf', 'susp_5_14_hf', 'susp_ov15_hf', 'susp_u5_com', 'susp_5_14_com', 'susp_ov15_com', 'maldth_fem_ov15', 'maldth_mal_ov15', 'maldth_1_59m', 'maldth_10_14', 'maldth_5_9', 'tes_neg_rdt_u5_com', 'tes_pos_rdt_u5_com', 'tes_neg_rdt_5_14_com', 'tes_pos_rdt_5_14_com', 'tes_neg_rdt_ov15_com', 'tes_pos_rdt_ov15_com', 'test_neg_mic_u5_hf', 'test_pos_mic_u5_hf', 'test_neg_mic_5_14_hf', 'test_pos_mic_5_14_hf', 'test_neg_mic_ov15_hf', 'test_pos_mic_ov15_hf', 'tes_neg_rdt_u5_hf', 'tes_pos_rdt_u5_hf', 'tes_neg_rdt_5_14_hf', 'tes_pos_rdt_5_14_hf', 'tes_neg_rdt_ov15_hf', 'tes_pos_rdt_ov15_hf', 'maltreat_u24_u5_com', 'maltreat_ov24_u5_com', 'maltreat_u24_5_14_com', 'maltreat_ov24_5_14_com', 'maltreat_u24_ov15_com', 'maltreat_ov24_ov15_com', 'maltreat_u24_u5_hf', 'maltreat_ov24_u5_hf', 'maltreat_u24_5_14_hf', 'maltreat_ov24_5_14_hf', 'maltreat_u24_ov15_hf', 'maltreat_ov24_ov15_hf']

To adapt the code:

  • Line 2: Update the file path if your dictionary is stored elsewhere.
  • Line 6: Change indicator_label and snt_var to match your dictionary column names.
  • Line 9: Replace dhis2_df with your DHIS2 dataset.
  • Line 10: Adjust or remove "orgunitlevel5" if this column is not present.

The Excel data dictionary used here is based on the Sierra Leone DHIS2 export. If we are working with another country, update the rows in the Excel file so that the “original DHIS2 name” and “SNT name” columns match the indicators.

For example, if the dataset uses “Admission of child with malaria - 5–14yrs”, enter that exact name in the dictionary and set the preferred SNT name beside it. Once the Excel file is updated, the renaming code applies the mapping automatically.

Warning

The dictionary must reflect the country’s DHIS2 export. Add, edit, or remove rows in the Excel file so that it aligns with the indicator naming and reporting structure.

Step 4: Standardise the Date Columns

Most SNT workflows require a consistent set of date variables: a YYYY-MM-DD date column, separate year and month fields, and a yearmon field in YYYY-MM for ordered time-series work. The goal of this step is to take whatever date format appears in the DHIS2 export and convert it into this standard structure.

The steps we apply depend on the starting format. Some formats parse automatically (for example, YYYY-MM-DD). Others, such as "Jan 2020" or French "Janvier 2020", need small additional steps.

The code block below shows how to parse the main date formats seen in DHIS2 exports and convert them into a consistent structure. Once the date field is standardised, creating the year, month, and yearmon fields becomes straightforward.

TipParsing Dates
  • R
  • Python

Format already in YYYY-MM-DD

# create dummy data
dates_ex1 <- tibble::tibble(
  raw_date = c("2020-01-15", "2021-03-01", "2022-12-20")
)

# parse dates
dates_ex1 |>
  dplyr::mutate(
    date = lubridate::ymd(raw_date)
  )
NoteOutput
raw_date date
2020-01-15 2020-01-15
2021-03-01 2021-03-01
2022-12-20 2022-12-20

Format “Jan 2020” or “January 2020”

# create dummy data
dates_ex2 <- tibble::tibble(
  raw_date = c("Jan 2020", "February 2021", "Mar 2022")
)

# parse dates
dates_ex2 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(
      raw_date,
      orders = c("b Y", "B Y")
    ) |>
      as.Date()
  )
NoteOutput
raw_date date
Jan 2020 2020-01-01
February 2021 2021-02-01
Mar 2022 2022-03-01

French month names (for example, “Janvier 2020”)

# create dummy data
dates_ex3 <- tibble::tibble(
  raw_date = c(
    "Janvier 2020",
    "Février 2021",
    "décembre 2022",
    "Jan 2021",
    "Fe 2022",
    "Dec 2023"
  )
)

# parse dates
dates_ex3 |>
  dplyr::mutate(
    date = paste0("01 ", raw_date) |>
      lubridate::dmy(locale = "fr_FR") |>
      as.Date()
  )
NoteOutput
raw_date date
Janvier 2020 2020-01-01
Février 2021 2021-02-01
décembre 2022 2022-12-01
Jan 2021 2021-01-01
Fe 2022 2022-02-01
Dec 2023 2023-12-01

Portuguese month names (for example, “Janeiro 2020”)

# create dummy data
dates_ex4 <- tibble::tibble(
  raw_date = c("Janeiro 2020", "fevereiro 2021", "Dezembro 2022",
               "Jan 2021", "Fev 2022", "Dez 2023")
)

# parse dates
dates_ex4 |>
  dplyr::mutate(
    date = paste0("01 ", raw_date) |>
      lubridate::dmy(locale = "pt_PT") |>
      as.Date()
  )
NoteOutput
raw_date date
Janeiro 2020 2020-01-01
fevereiro 2021 2021-02-01
Dezembro 2022 2022-12-01
Jan 2021 2021-01-01
Fev 2022 2022-02-01
Dez 2023 2023-12-01

Format “2023-01” or “2023/01”

# create dummy data
dates_ex5 <- tibble::tibble(
  raw_date = c("2020-01", "2021/05", "2023-12")
)

# parse dates
dates_ex5 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(raw_date, orders = c("Y-m", "Y/m")),
    date = lubridate::floor_date(date, unit = "month") |> as.Date()
  )
NoteOutput
raw_date date
2020-01 2020-01-01
2021/05 2021-05-01
2023-12 2023-12-01

Format “01/2020” or “01-2020” (Month–Year)

# create dummy data
dates_ex6 <- tibble::tibble(
  raw_date = c("01/2020", "06-2021", "11/2022")
)

# parse dates
dates_ex6 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(raw_date, orders = c("m/Y", "m-Y")),
    date = lubridate::floor_date(date, unit = "month") |> as.Date()
  )
NoteOutput
raw_date date
01/2020 2020-01-01
06-2021 2021-06-01
11/2022 2022-11-01

Compact format “202301” (YYYYMM)

# create dummy data
dates_ex7 <- tibble::tibble(
  raw_date = c("202301", "202102", "202212")
)

dates_ex7 |>
  dplyr::mutate(
    year = substr(raw_date, 1, 4),
    month = substr(raw_date, 5, 6),
    date = lubridate::ymd(sprintf("%s-%s-01", year, month)) |> as.Date()
  )
NoteOutput
raw_date year month date
202301 2023 01 2023-01-01
202102 2021 02 2021-02-01
202212 2022 12 2022-12-01

Mixed or unclear formats

# create dummy data
dates_ex8 <- tibble::tibble(
  raw_date = c(
    "2020-01-15", # ISO full date
    "Jan 2021",   # English abbreviated month
    "202203",     # compact YYYYMM
    "03/2022",    # month/year with slash
    "2021-07",    # year-month
    "July 2020",  # English full month
    "15-02-2021", # DD-MM-YYYY
    "2020/11/05", # YYYY/MM/DD
    "2020.12.25", # dotted date
    "2022.07",    # YYYY.MM
    "10-2020",    # MM-YYYY
    "20210105",   # YYYYMMDD
    "2020 Jan",   # Year then month (English)
    "2020.01.01"  # dotted YYYY.MM.DD
  )
)

# parse dates
dates_ex8 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(
      raw_date,
      orders = c(
        "Y-m-d", # 2020-01-15
        "Y/m/d", # 2020/11/05
        "Y.m.d", # 2020.12.25
        "b Y",   # Jan 2021
        "B Y",   # January 2021
        "Y b",   # 2020 Jan
        "Y B",   # 2020 January
        "Ym",    # 202203
        "m/Y",   # 03/2022
        "m-Y",   # 10-2020
        "Y-m",   # 2021-07
        "Y.m",   # 2022.07
        "d-m-Y", # 15-02-2021
        "d/m/Y", # 15/02/2021 (if present)
        "Ymd"    #  20210105
      )
    ) |>
      as.Date()
  )
NoteOutput
raw_date date
2020-01-15 2020-01-15
Jan 2021 2021-01-01
202203 2022-03-01
03/2022 2022-03-01
2021-07 2021-07-01
July 2020 2020-07-01
15-02-2021 2021-02-15
2020/11/05 2020-11-05
2020.12.25 2020-12-25
2022.07 2022-07-01

Format already in YYYY-MM-DD

# create dummy data
dates_ex1 = pd.DataFrame({
    "raw_date": ["2020-01-15", "2021-03-01", "2022-12-20"]
})

# parse dates
dates_ex1["date"] = pd.to_datetime(dates_ex1["raw_date"], format="%Y-%m-%d")

dates_ex1
NoteOutput
raw_date date
2020-01-15 2020-01-15 03:00:00
2021-03-01 2021-03-01 03:00:00
2022-12-20 2022-12-20 03:00:00

Format “Jan 2020” or “January 2020”

# create dummy data
dates_ex2 = pd.DataFrame({
    "raw_date": ["Jan 2020", "February 2021", "Mar 2022"]
})

# parse dates
dates_ex2["date"] = pd.to_datetime(
    dates_ex2["raw_date"],
    format=None,  # allow flexible parsing of abbreviated and full month names
)

dates_ex2
NoteOutput
raw_date date
Jan 2020 2020-01-01 03:00:00
February 2021 2021-02-01 03:00:00
Mar 2022 2022-03-01 03:00:00

French month names (for example, “Janvier 2020”)

# set French locale (adjust if needed depending on system)
# common options: 'fr_FR.UTF-8', 'fr_FR'
locale.setlocale(locale.LC_TIME, 'fr_FR.UTF-8')

# create dummy data
dates_ex3 = pd.DataFrame({
    "raw_date": [
        "Janvier 2020",
        "Février 2021",
        "décembre 2022"
    ]
})

# ensure day is present for parsing
dates_ex3["date"] = pd.to_datetime(
    "01 " + dates_ex3["raw_date"],
    format="%d %B %Y",
    errors="coerce"
)

# fallback for abbreviated months
mask_missing = dates_ex3["date"].isna()
dates_ex3.loc[mask_missing, "date"] = pd.to_datetime(
    "01 " + dates_ex3.loc[mask_missing, "raw_date"],
    format="%d %b %Y",
    errors="coerce"
)

dates_ex3
NoteOutput
raw_date date
Janvier 2020 2020-01-01 03:00:00
Février 2021 2021-02-01 03:00:00
décembre 2022 2022-12-01 03:00:00

Portuguese month names (for example, “Janeiro 2020”)

# set Portuguese locale
# common options: 'pt_PT.UTF-8', 'pt_PT'
locale.setlocale(locale.LC_TIME, 'pt_PT.UTF-8')

# create dummy data
dates_ex4 = pd.DataFrame({
    "raw_date": [
        "Janeiro 2020",
        "fevereiro 2021",
        "Dezembro 2022",
        "Jan 2021",
        "Fev 2022",
        "Dez 2023"
    ]
})

# first attempt: full Portuguese month names
dates_ex4["date"] = pd.to_datetime(
    "01 " + dates_ex4["raw_date"],
    format="%d %B %Y",
    errors="coerce"
)

# fallback: abbreviated Portuguese month names (e.g. Jan, Fev, Dez)
mask_missing = dates_ex4["date"].isna()
dates_ex4.loc[mask_missing, "date"] = pd.to_datetime(
    "01 " + dates_ex4.loc[mask_missing, "raw_date"],
    format="%d %b %Y",
    errors="coerce"
)

dates_ex4
NoteOutput
raw_date date
Janeiro 2020 2020-01-01 03:00:00
fevereiro 2021 2021-02-01 03:00:00
Dezembro 2022 2022-12-01 03:00:00
Jan 2021 2021-01-01 03:00:00
Fev 2022 2022-02-01 03:00:00
Dez 2023 2023-12-01 03:00:00

Mixed or unclear formats

# create dummy data
dates_ex5 = pd.DataFrame(
    {
        "raw_date": [
            "2020-01-15",  # ISO full date
            "Jan 2021",    # English abbreviated month
            "202203",      # compact YYYYMM
            "03/2022",     # month/year
            "2021-07",     # year-month
            "July 2020",   # English full month
            "15-02-2021",  # DD-MM-YYYY
            "2020/11/05",  # YYYY/MM/DD
            "2020.12.25",  # YYYY.MM.DD
            "2022.07",     # YYYY.MM
            "10-2020",     # MM-YYYY
            "20210105",    # YYYYMMDD
            "2020 Jan",    # year then month
            "2020.01.01",  # dotted full date
        ]
    }
)

# list of possible formats
formats = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%Y.%m.%d",
    "%d-%m-%Y",
    "%d/%m/%Y",
    "%b %Y",
    "%B %Y",
    "%Y %b",
    "%Y %B",
    "%Y-%m",
    "%Y/%m",
    "%Y.%m",
    "%m-%Y",
    "%m/%Y",
    "%Y%m",
    "%Y%m%d",
]


def try_parse(value):
    # try strict formats first
    for fmt in formats:
        try:
            return pd.to_datetime(value, format=fmt)
        except:
            pass
    # fallback: let pandas guess freely
    return pd.to_datetime(value, errors="coerce")


dates_ex5["date"] = dates_ex5["raw_date"].apply(try_parse)

dates_ex5["date"] = dates_ex5["date"].dt.to_period("M").dt.to_timestamp()
dates_ex5
NoteOutput
raw_date date
2020-01-15 2020-01-01 03:00:00
Jan 2021 2021-01-01 03:00:00
202203 2022-03-01 03:00:00
03/2022 2022-03-01 03:00:00
2021-07 2021-07-01 03:00:00
July 2020 2020-07-01 03:00:00
15-02-2021 2021-02-01 03:00:00
2020/11/05 2020-11-01 03:00:00
2020.12.25 2020-12-01 03:00:00
2022.07 2022-07-01 03:00:00
10-2020 2020-10-01 03:00:00
20210105 2021-01-01 03:00:00
2020 Jan 2020-01-01 03:00:00
2020.01.01 2020-01-01 03:00:00

For the Sierra Leone dataset, the date field is stored in periodname and appears in formats such as “January 2019” or “February 2019”. We first standardise this into a proper YYYY-MM-DD date. Once the date column is cleaned, we can generate the year, month, and yearmon fields used throughout the SNT workflow.

  • R
  • Python
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
Sys.setlocale("LC_TIME", "en_US.UTF-8")

dhis2_df <- dhis2_df |>
  # parse the raw date into a proper Date object
  dplyr::mutate(
    date = lubridate::parse_date_time(
      periodname,
      orders = c("B Y", "b Y")
    ) |>
      as.Date()
  ) |>
  # create year, month, and ordered yearmon label
  dplyr::mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    # human-readable label, e.g. "Jan 2020"
    yearmon = format(date, "%b %Y"),
    # order factor by actual chronological date
    yearmon = factor(yearmon, levels = unique(yearmon[order(date)]))
  )

# check the head
dhis2_df |>
  dplyr::select(date, year, month, yearmon) |>
  head()
NoteOutput
[1] “en_US.UTF-8”
date year month yearmon
2015-01-01 2015 1 Jan 2015
2015-02-01 2015 2 Feb 2015
2015-03-01 2015 3 Mar 2015
2015-04-01 2015 4 Apr 2015
2015-05-01 2015 5 May 2015
2015-06-01 2015 6 Jun 2015

To adapt the code:

  • Line 3: Set the locale to match the language of month names in your export, for example "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8".
  • Line 9: Update the column name if your date field is not called periodname.
  • Line 10: Adjust the orders = c("B Y", "b Y") list only if your dates use a different structure.
  • Line 19 Change the yearmon display (for example, "%b %Y" to "%B %Y") based on how you want month names shown.
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
locale.setlocale(locale.LC_TIME, "en_US.UTF-8")

# parse the raw date into a proper datetime object
dhis2_df["date"] = pd.to_datetime(
    dhis2_df["periodname"],
    format="%B %Y",
    errors="coerce"
)

# create year, month, and ordered yearmon columns
dhis2_df["year"] = dhis2_df["date"].dt.year
dhis2_df["month"] = dhis2_df["date"].dt.month
dhis2_df["yearmon"] = dhis2_df["date"].dt.strftime("%b %Y")

# check the output
dhis2_df[["date", "year", "month", "yearmon"]].head()
NoteOutput
date year month yearmon
2015-01-01 03:00:00 2015 1 Jan 2015
2015-02-01 03:00:00 2015 2 Feb 2015
2015-03-01 03:00:00 2015 3 Mar 2015
2015-04-01 03:00:00 2015 4 Apr 2015
2015-05-01 03:00:00 2015 5 May 2015

To adapt the code:

  • Line 3: Set the locale to match the language of month names in your export, for example "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8".
  • Line 6: Update the column name if your date field is not called periodname.
  • Line 7: Adjust the format="%B %Y" if your dates use a different structure (for example, "%b %Y" for abbreviated months).
  • Line 14: Change the strftime format (for example, "%b %Y" to "%B %Y") based on how you want month names shown.

Step 5: Managing Location Columns

Step 5.1: Harmonise administrative names

Differences in spelling, capitalisation, spacing, or naming conventions between DHIS2 exports and shapefiles are common. These differences will cause joins to fail and lead to missing or duplicated geographic units. Before any spatial join, we should harmonise administrative names used in the DHIS2 dataset with those defined in the shapefile.

A practical way to do this is to use the prep_geonames() function from the sntutils package. The function detects mismatches, suggests likely matches using string distance methods, and allows interactive confirmation or editing. It also saves decisions to a cache file so that harmonisation is reproducible and consistent across analysts and across reruns.

We only show the basic structure here. The full workflow, including interactive matching and scripted reuse, is explained in the Merging shapefiles with tabular data section of the SNT Code Library. For an in-depth blog post based on this method, see Cleaning and standardising geographic names in R with prep_geonames() from the AMMnet Hackathon.

Before any interactive matching, we ensure that the columns are correctly assigned. From observation of the admin names, it appeared that adm1 was actually adm2, so we create a simple lookup to assign districts (adm2) to their respective provinces (adm1).

  • R
  • Python
# install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")

# set up location to save cache
cache_path <- "01_data/1.1_foundational/1.1f_cache_files/processed"

# get adm3 shapefile to use as reference lookup
shp_adm3 <- sntutils::read(
  here::here("01_data/1.1_foundational/1.1a_admin_boundaries/processed/sle_spatial_adm3_2021.rds")
) |>
  sf::st_drop_geometry() # drop geometry, we just need the admin names

# standardise admin names
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    adm0 = toupper(adm0),
    adm2 = toupper(adm1),
    adm3 = toupper(adm3),
    hf = toupper(hf),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2 = stringr::str_remove_all(adm2, " DISTRICT"),
    adm3 = stringr::str_remove_all(adm3, " CHIEFDOM")
  ) |>
  dplyr::mutate(
    # assign provinces based on district groupings
    # (no adm1 as current adm1 is adm2)
    adm1 = dplyr::case_when(
      adm2 %in% c("KAILAHUN", "KENEMA", "KONO") ~ "EASTERN",
      adm2 %in% c("BOMBALI", "FALABA", "KOINADUGU", "TONKOLILI") ~ "NORTH EAST",
      adm2 %in% c("KAMBIA", "KARENE", "PORT LOKO") ~ "NORTH WEST",
      adm2 %in% c("BO", "BONTHE", "MOYAMBA", "PUJEHUN") ~ "SOUTHERN",
      adm2 %in% c("WESTERN AREA RURAL", "WESTERN AREA URBAN") ~ "WESTERN"
    )
  )

# harmonise admin names between DHIS2 data and shapefile
dhis2_df <-
  sntutils::prep_geonames(
    target_df = dhis2_df,
    lookup_df = lookup_keys,
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    interactive = TRUE,
    cache_path = here::here(cache_path, "geoname_decisions_cache.xlsx")
  )
NoteOutput
── ℹ Match Summary ─────────────────────────────────────────────────────────────
! Both sides have unmatched names; see per-level lines below.

Target data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 16 matched                                       
• adm3 (level3): 199 out of 207 matched                                     
Lookup data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 16 matched                                       
• adm3 (level3): 199 out of 208 matched                                     
ℹ Partial match completed. There are still matches to be made.
Would you like to do interactive matching? (yes/no):
ℹ Exiting without interactive matching...

To adapt the code:

  • Line 6: Update cache_path to your preferred location for saving cached files.
  • Lines 9–12: Replace the shapefile path with the path to your reference shapefile.
  • Line 15: Replace dhis2_df with your DHIS2 or target dataset.
  • Lines 15–36: This standardisation block is specific to Sierra Leone. Adapt the toupper(), str_remove_all(), and case_when() logic to match your country’s administrative structure and naming conventions.
  • Line 39: Replace dhis2_df with your target dataframe name.
  • Line 42: Replace lookup_keys with your lookup or reference dataset.
  • Lines 43–46: Adjust level0, level1, level2, and level3 to match your actual admin column names if they differ (e.g., “country”, “region”, “district”, “ward”). Ensure the columns exist in both datasets.
  • Line 48: Update the cache filename if desired.

Once updated, run the code to harmonise the data’s admin names with the shapefile.

# If not already, install the sntutils python package from GitHub
# has a number of useful helper functions
pip install git+https://github.com/ahadi-analytics/sntutils-py.git
from sntutils.geo import prep_geonames # for name matching

# set up location to save cache
cache_path = Path("01_data/1.1_foundational/1.1f_cache_files/processed")

# get adm3 shapefile to use as reference lookup
shp_adm3 = gpd.read_file(
    Path(here("01_data/1.1_foundational/1.1a_admin_boundaries/processed/sle_spatial_adm3_2021.geojson"))
).drop(columns="geometry")

# standardise admin names
dhis2_df = dhis2_df.assign(
    adm0=lambda x: x["adm0"].str.upper(),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2=lambda x: x["adm1"]
    .str.upper()
    .str.replace(" DISTRICT", "", regex=False)
    .str.strip(),
    adm3=lambda x: x["adm3"]
    .str.upper()
    .str.replace(" CHIEFDOM", "", regex=False)
    .str.strip(),
    hf=lambda x: x["hf"]
    .str.upper()
)

# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
district_to_province = {
    "KAILAHUN": "EASTERN", "KENEMA": "EASTERN", "KONO": "EASTERN",
    "BOMBALI": "NORTH EAST", "FALABA": "NORTH EAST",
    "KOINADUGU": "NORTH EAST", "TONKOLILI": "NORTH EAST",
    "KAMBIA": "NORTH WEST", "KARENE": "NORTH WEST", "PORT LOKO": "NORTH WEST",
    "BO": "SOUTHERN", "BONTHE": "SOUTHERN",
    "MOYAMBA": "SOUTHERN", "PUJEHUN": "SOUTHERN",
    "WESTERN AREA RURAL": "WESTERN", "WESTERN AREA URBAN": "WESTERN"
}

dhis2_df["adm1"] = dhis2_df["adm2"].map(district_to_province)

# harmonise admin names between dhis2 data and shapefile
# note: sntutils::prep_geonames() has no direct python equivalent
# manual fuzzy matching or exact merge required
dhis2_df = dhis2_df.merge(
    lookup_keys,
    on=["adm0", "adm1", "adm2", "adm3"],
    how="left"
)

# Harmonise admin names between population data and shapefile
dhis2_df = prep_geonames(
    target_df=dhis2_df,
    lookup_df=shp_adm3,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    cache_path=here(cache_path, "geoname_decisions_cache.xlsx")
)
NoteOutput
Info: Loaded cache from /Users/mohamedyusuf/ahadi-analytics/code/GitHub/snt-code-library/01_data/1.1_foundational/1.1f_cache_files/processed/geoname_decisions_cache.xlsx

ℹ Match Summary

╭───────────────┬─────────┬───────────────┬───────────────╮
│ Level         │ Matched │ Target Data N │ Lookup Data N │
├───────────────┼─────────┼───────────────┼───────────────┤
│ adm0 (level0) │    1    │       1       │       1       │
│ adm1 (level1) │    5    │       5       │       5       │
│ adm2 (level2) │   16    │      16       │      16       │
│ adm3 (level3) │   207   │      207      │      208      │
╰───────────────┴─────────┴───────────────┴───────────────╯
ℹ Lookup has extra names not in data                       

Success: All records matched; process completed. Exiting...

To adapt the code:

  • Line 3: Run once in terminal to install the sntutils-py package from GitHub.
  • Line 4: Import statement for the prep_geonames function.
  • Line 7: Update cache_path to your preferred location for saving cached files.
  • Lines 10–12: Replace the shapefile path with the path to your reference shapefile.
  • Line 15: Replace dhis2_df with your DHIS2 or target dataset.
  • Lines 15–29: This standardisation block is specific to Sierra Leone. Adapt the .str.upper(), .str.replace(), and district_to_province dictionary to match your country’s administrative structure and naming conventions.
  • Lines 48–52: Remove this merge block if using prep_geonames() below, as it duplicates the matching step.
  • Line 55: Replace dhis2_df with your target dataframe name.
  • Line 57: Replace shp_adm3 with your lookup or reference dataset.
  • Lines 58–61: Adjust level0, level1, level2, and level3 to match your actual admin column names if they differ (e.g., “country”, “region”, “district”, “ward”). Ensure the columns exist in both datasets.
  • Line 62: Update the cache filename if desired.

Once updated, run the code to harmonise the data’s admin names with the shapefile.

Once the interactive matching is complete and the results are saved, the output message confirms that all administrative levels have been successfully aligned between the dataset and the reference shapefile. If mismatches remain, the prep_geoname function will prompt us to resolve them interactively.

To double-check the decisions, share the cache file geoname_decisions_cache.xlsx with the country SNT team to ensure all matching decisions are correct.

Step 5.2: Create unique identifiers and location labels

Using our corrected admin names, we can now create key columns for downstream analysis and mapping:

  • hf_uid: A unique health facility identifier generated by hashing the full administrative hierarchy and facility name. This ensures each facility has a consistent, reproducible ID.
  • location_short: A concatenated label of province and district (adm1 ~ adm2), useful for summary tables and aggregated views.
  • location_full: The complete administrative hierarchy including chiefdom and facility name, useful for detailed map tooltips and drill-down labelling.
  • record_id: A unique record identifier combining the facility ID and time period, ensuring each facility-month combination is uniquely identifiable.
  • R
  • Python
# create identifiers
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # create unique health facility identifier from admin hierarchy
    hf_uid = sntutils::vdigest(
      paste0(adm0, adm1, adm2, adm3, hf),
      algo = "xxhash32"
    ),
    # create location labels for mapping
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, hf, sep = " ~ "),
    # generate consistent record id (facility + time period)
    record_id = sntutils::vdigest(
      paste(hf_uid, yearmon),
      algo = "xxhash32"
    )
  )

# check
dhis2_df |>
  dplyr::arrange(location_full) |>
  dplyr::distinct(location_full, hf_uid, record_id) |>
  head()
NoteOutput
location_full hf_uid record_id
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b 20724f8d
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b cc1562d5
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b 2d42fbf9
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b 6958bd08
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b d2cf4c99
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b 0bcfc2a4

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Line 6: Adjust the paste0() arguments to match your administrative column names (e.g., adm0, adm1, adm2, adm3) and facility column (e.g., hf).
  • Line 10: Modify location_short to include the admin levels you want for summary views.
  • Line 11: Modify location_full to include all admin levels and facility name for detailed labelling.
  • Line 14: Ensure yearmon matches your time period column name. If using a different temporal unit (e.g., year, epiweek), update accordingly.

Once updated, run the code to generate unique identifiers for your health facilities and records.

# helper function to create xxhash32 digest (equivalent to sntutils::vdigest)
def vdigest(x, algo="xxhash32"):
    """Create vectorised hash digest."""
    return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())


# create identifiers
dhis2_df = dhis2_df.assign(
    # create unique health facility identifier from admin hierarchy
    # use paste0 semantics (no separator) to match R's paste0(adm0, adm1, adm2, adm3, hf)
    hf_uid=lambda x: vdigest(
        x["adm0"] + x["adm1"] + x["adm2"] + x["adm3"] + x["hf"]
    ),
    # create location labels for mapping
    location_short=lambda x: x["adm1"] + " ~ " + x["adm2"],
    location_full=lambda x: (
        x["adm1"] + " ~ " + x["adm2"] + " ~ " + x["adm3"] + " ~ " + x["hf"]
    ),
)

# generate consistent record id (facility + time period)
# use space to match R's paste(hf_uid, yearmon) default sep=" "
dhis2_df["record_id"] = vdigest(
    dhis2_df["hf_uid"] + " " + dhis2_df["yearmon"].astype(str)
)

# check
dhis2_df[["location_full", "hf_uid", "record_id"]].drop_duplicates().sort_values(
    "location_full"
).head()
NoteOutput
location_full hf_uid record_id
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP 95a9f6ec 8eceb7b6
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP 95a9f6ec af7f4517
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP 95a9f6ec ab2697b8
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP 95a9f6ec a6473869
EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP 95a9f6ec 249ddfd4

To adapt the code:

  • Line 8: Replace dhis2_df with your target dataset.
  • Lines 11–13: Adjust the concatenated columns to match your administrative column names (e.g., adm0, adm1, adm2, adm3) and facility column (e.g., hf).
  • Line 15: Modify location_short to include the admin levels you want for summary views.
  • Lines 16–18: Modify location_full to include all admin levels and facility name for detailed labelling.
  • Lines 23–25: Ensure yearmon matches your time period column name. If using a different temporal unit (e.g., year, date), update accordingly.

Once updated, run the code to generate unique identifiers for your health facilities and records.

Step 5.5: Harmonise Health Facility Names with the Master Facility List

DHIS2 facility names often drift from the canonical names recorded in the country’s master facility list (MFL): typos, abbreviations, casing differences, and diacritics accumulate as data is entered by hand. Before computing indicators in Step 6, attach the canonical MFL name and a stable facility identifier to every row of dhis2_df so that downstream joins, aggregations, and reporting-rate calculations don’t fragment a single facility into multiple records.

The matching itself (string normalisation, fuzzy scoring, manual review of ambiguous candidates, and the structured output schema) is documented on the Fuzzy matching of names across datasets page. Work through that page first to produce final_dhis2_mfl_integrated.{rds,parquet} for your country, then return here to attach the canonical names back into the routine surveillance pipeline using the code below.

  • R
  • Python
Show the code
# load the fuzzy-matched DHIS2 + MFL output produced on the master facility
# lists page
mfl_matched <- readRDS(here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed",
  "final_dhis2_mfl_integrated.rds"
))

# build a distinct composite-keyed lookup; same facility name (hf) can
# appear in several adm2/adm3 areas, so key on (adm1, adm2, adm3, hf) to
# keep the lookup unique per facility-location
mfl_lookup <- mfl_matched |>
  dplyr::distinct(
    adm1, adm2, adm3, hf, hf_uid_new, hf_clean = hf_mfl_raw
  )

# attach canonical MFL name and stable id; relationship = "many-to-one"
# raises an error if mfl_lookup is not unique per join key, catching any
# accidental row multiplication
dhis2_df <- dhis2_df |>
  dplyr::left_join(
    mfl_lookup,
    by = c("adm1", "adm2", "adm3", "hf"),
    relationship = "many-to-one"
  )

# inspect a few rows of raw vs canonical name
dhis2_df |>
  dplyr::distinct(adm1, adm2, adm3, hf, hf_clean, hf_uid_new) |>
  dplyr::arrange(adm1, adm2, adm3, hf) |>
  utils::head(10)
NoteOutput
adm1 adm2 adm3 hf hf_clean hf_uid_new
EASTERN KAILAHUN DEA BAIWALA CHP Baiwala CHC hf_uid_new::1
EASTERN KAILAHUN DEA NAGBENA CHP Nagbena CHP hf_uid_new::2
EASTERN KAILAHUN DEA SIENGA CHP Sienga CHP hf_uid_new::3
EASTERN KAILAHUN JAHN GBEIKA CHP Gbeika CHC hf_uid_new::4
EASTERN KAILAHUN JAWIE BOMBOHUN MCHP Bombohun MCHP hf_uid_new::5
EASTERN KAILAHUN JAWIE DARU CHC Daru CHC hf_uid_new::6
EASTERN KAILAHUN JAWIE KAMBAMA CHP Kambama CHP hf_uid_new::7
EASTERN KAILAHUN JAWIE KORTUMA MCHP Kortuma MCHP hf_uid_new::8
EASTERN KAILAHUN JAWIE MAMABU CHP NA hf_uid_new::9
EASTERN KAILAHUN JAWIE NJALA-GRIMA CHP Njala-Grima CHP hf_uid_new::10

To adapt the code:

  • Lines 3-9: Point to your country’s matched MFL output produced on the fuzzy-matching page.
  • Lines 14-16: If your matched output uses different column names than adm1 / adm2 / adm3 / hf / hf_uid_new / hf_mfl_raw, update the distinct arguments to match.
  • Lines 20-24: If your country uses different admin-level columns, update the by vector. The relationship = "many-to-one" argument raises a clear error if a facility name is duplicated within the same (adm1, adm2, adm3) cell, surfacing data quality issues before they silently multiply rows downstream.
Show the code
# load the fuzzy-matched DHIS2 + MFL output produced on the master facility
# lists page
mfl_matched = pd.read_parquet(
    Path(here(
        "01_data/1.1_foundational/1.1c_health_facilities/processed/"
        "final_dhis2_mfl_integrated.parquet"
    ))
)

# build a distinct composite-keyed lookup; same facility name (hf) can
# appear in several adm2/adm3 areas, so key on (adm1, adm2, adm3, hf) to
# keep the lookup unique per facility-location
mfl_lookup = (
    mfl_matched[
        ["adm1", "adm2", "adm3", "hf", "hf_uid_new", "hf_mfl_raw"]
    ]
    .drop_duplicates(subset=["adm1", "adm2", "adm3", "hf"])
    .rename(columns={"hf_mfl_raw": "hf_clean"})
)

# attach canonical MFL name and stable id; validate="many_to_one" raises
# an error if mfl_lookup is not unique per join key, catching any
# accidental row multiplication
dhis2_df = dhis2_df.merge(
    mfl_lookup,
    on=["adm1", "adm2", "adm3", "hf"],
    how="left",
    validate="many_to_one",
)

# inspect a few rows of raw vs canonical name
(
    dhis2_df[["adm1", "adm2", "adm3", "hf", "hf_clean", "hf_uid_new"]]
    .drop_duplicates()
    .sort_values(["adm1", "adm2", "adm3", "hf"])
    .head(10)
)
NoteOutput
adm1 adm2 adm3 hf hf_clean hf_uid_new
EASTERN KAILAHUN DEA BAIWALA CHP NA hf_uid_new::0
EASTERN KAILAHUN DEA NAGBENA CHP NA hf_uid_new::1
EASTERN KAILAHUN DEA SIENGA CHP NA hf_uid_new::2
EASTERN KAILAHUN JAHN GBEIKA CHP NA hf_uid_new::3
EASTERN KAILAHUN JAWIE BOMBOHUN MCHP NA hf_uid_new::4
EASTERN KAILAHUN JAWIE DARU CHC NA hf_uid_new::5
EASTERN KAILAHUN JAWIE KAMBAMA CHP NA hf_uid_new::6
EASTERN KAILAHUN JAWIE KORTUMA MCHP NA hf_uid_new::7
EASTERN KAILAHUN JAWIE MAMABU CHP NA hf_uid_new::8
EASTERN KAILAHUN JAWIE NJALA-GRIMA CHP NA hf_uid_new::9

To adapt the code:

  • Lines 3-8: Point to your country’s matched MFL output produced on the fuzzy-matching page.
  • Lines 13-17: If your matched output uses different column names than adm1 / adm2 / adm3 / hf / hf_uid_new / hf_mfl_raw, update the column list and drop_duplicates subset.
  • Lines 22-27: If your country uses different admin-level columns, update the on list. The validate="many_to_one" argument raises a clear error if a facility name is duplicated within the same (adm1, adm2, adm3) cell, surfacing data quality issues before they silently multiply rows downstream.

Step 6: Compute Variables

Step 6.1: Compute indicator totals and new variables

Now that we have imported our DHIS2 data and cleaned up the columns, we will compute derived variables to create totals for specific indicators. DHIS2 can provide totals, but these columns must specifically be extracted. In the raw DHIS2 dataset, indicators are typically disaggregated by age group, community, and health facility level. Analysts should strive to obtain the most disaggregated databases possible.

For example, if we want to calculate the total number of outpatient visits, DHIS2 may not provide a direct total, but includes components that can be summed, such as allout_u5 and allout_ov5. Remember to review variable definitions to ensure aggregations do not double count cases.

The same logic applies to other indicators, including total suspected cases (susp), tested cases (test), confirmed cases (conf), treated cases (maltreat), and presumed cases (pres). The code snippet below shows how to compute these totals by combining the relevant components. The code also includes indicator aggregation by age groups, such as test_hf_u5 which captures all children under five tested by either RDT or microscopy at a health facility.

It is best to verify with the SNT team which specific data elements from DHIS2 should be summed to obtain the correct total. While some calculations may seem obvious, others are not. The box in Step 3 lists example questions that we should get verification for before summing across disaggregated data elements. Depending on the country’s DHIS2 and data reporting practice, other questions may also be relevant.

ImportantConsult with SNT team

While some countries have presumed cases in their DHIS2 data, not all do! Here are three options for presumed cases that countries have used:

  • Option 1: The data already has a pres column, no further calculation is needed.
  • Option 2: Calculate presumed cases using the difference between treated cases and confirmed cases: maltreat - conf
  • Option 3: Calculate presumed cases using the difference between suspected cases and tested cases: susp - test

The code below uses Option 2 (lines 52 to 55). Testing is subject to resource availability meaning facilities handle presumed cases differently.

Consult the SNT team to determine how best to approach presumed cases calculations, whether it’s one of the options shown above or a different approach.

  • R
  • Python

Below we create two new functions: fallback_row_sum() and fallback_diff(). The fallback_row_sum() function replaces rowSums(..., na.rm = TRUE), which returns 0 when all values in a row are NA. This default behavior can obscure missingness.

In routine health data, it’s critical to distinguish between:

  • True zeros: e.g., a facility reported 0 cases
  • Missing values: e.g., the facility did not report at all

fallback_row_sum() returns NA when too few values are present, preserving this distinction and preventing overestimation of completeness.

Similarly, fallback_diff() returns the absolute difference if both values are present, the non-missing value if only one is present, and NA if both are missing. It applies pmax() to ensure results meet a minimum threshold.

Both functions guard against misleading outputs when data are incomplete.

# smart row-wise sum with missing data handling
fallback_row_sum <- function(..., min_present = 1, .keep_zero_as_zero = TRUE) {
  vars_matrix <- cbind(...)
  valid_count <- rowSums(!is.na(vars_matrix))
  raw_sum <- rowSums(vars_matrix, na.rm = TRUE)

  ifelse(valid_count >= min_present, raw_sum, NA_real_)
}

# fallback absolute difference between two vectors
fallback_diff <- function(col1, col2, minimum = 0) {
  dplyr::case_when(
    is.na(col1) & is.na(col2) ~ NA_real_,
    is.na(col1) ~ pmax(col2, minimum),
    is.na(col2) ~ pmax(col1, minimum),
    TRUE ~ pmax(col1 - col2, minimum)
  )
}

Now we use these functions to aggregate our columns.

Show the code
# compute indicator totals in DHIS2 data
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # outpatient visits
    allout = fallback_row_sum(allout_u5, allout_ov5),

    # suspected cases
    susp = fallback_row_sum(
      susp_u5_hf,
      susp_5_14_hf,
      susp_ov15_hf,
      susp_u5_com,
      susp_5_14_com,
      susp_ov15_com
    ),

    # tested cases
    test_hf = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    test_com = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com,
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com,
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    test = fallback_row_sum(test_hf, test_com),

    # confirmed cases (hf and com)
    conf_hf = fallback_row_sum(
      test_pos_mic_u5_hf,
      test_pos_mic_5_14_hf,
      test_pos_mic_ov15_hf,
      tes_pos_rdt_u5_hf,
      tes_pos_rdt_5_14_hf,
      tes_pos_rdt_ov15_hf
    ),

    conf_com = fallback_row_sum(
      tes_pos_rdt_u5_com,
      tes_pos_rdt_5_14_com,
      tes_pos_rdt_ov15_com
    ),

    conf = fallback_row_sum(conf_hf, conf_com),

    # treated cases
    maltreat_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com,
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    maltreat_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf,
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    maltreat = fallback_row_sum(maltreat_hf, maltreat_com),

    # presumed cases
    pres_com = fallback_diff(maltreat_com, conf_com),
    pres_hf = fallback_diff(maltreat_hf, conf_hf),
    pres = fallback_row_sum(pres_com, pres_hf),

    # malaria admissions
    maladm = fallback_row_sum(
      maladm_u5,
      maladm_5_14,
      maladm_ov15
    ),

    # malaria deaths
    maldth = fallback_row_sum(
      maldth_u5,
      maldth_1_59m,
      maldth_10_14,
      maldth_5_9,
      maldth_5_14,
      maldth_ov15,
      maldth_fem_ov15,
      maldth_mal_ov15
    ),

    # age-group specific aggregations
    # tested cases by age group (HF only)
    test_hf_u5 = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    test_hf_5_14 = fallback_row_sum(
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    test_hf_ov15 = fallback_row_sum(
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # tested cases by age group (Community only)
    test_com_u5 = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com
    ),

    test_com_5_14 = fallback_row_sum(
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com
    ),

    test_com_ov15 = fallback_row_sum(
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    # total tested by age group (HF + Community)
    test_u5 = fallback_row_sum(test_hf_u5, test_com_u5),
    test_5_14 = fallback_row_sum(test_hf_5_14, test_com_5_14),
    test_ov15 = fallback_row_sum(test_hf_ov15, test_com_ov15),

    # suspected cases by age group (HF only)
    susp_hf_u5 = susp_u5_hf,

    susp_hf_5_14 = susp_5_14_hf,

    susp_hf_ov15 = susp_ov15_hf,

    # suspected cases by age group (Community only)
    susp_com_u5 = susp_u5_com,

    susp_com_5_14 = susp_5_14_com,

    susp_com_ov15 = susp_ov15_com,

    # total suspected by age group (HF + Community)
    susp_u5 = fallback_row_sum(susp_hf_u5, susp_com_u5),
    susp_5_14 = fallback_row_sum(susp_hf_5_14, susp_com_5_14),
    susp_ov15 = fallback_row_sum(susp_hf_ov15, susp_com_ov15),

    # confirmed cases by age group (HF only)
    conf_hf_u5 = fallback_row_sum(
      test_pos_mic_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    conf_hf_5_14 = fallback_row_sum(
      test_pos_mic_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    conf_hf_ov15 = fallback_row_sum(
      test_pos_mic_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # confirmed cases by age group (Community only)
    conf_com_u5 = tes_pos_rdt_u5_com,
    conf_com_5_14 = tes_pos_rdt_5_14_com,
    conf_com_ov15 = tes_pos_rdt_ov15_com,

    # total confirmed cases by age group (HF + Community)
    conf_u5 = fallback_row_sum(conf_hf_u5, conf_com_u5),
    conf_5_14 = fallback_row_sum(conf_hf_5_14, conf_com_5_14),
    conf_ov15 = fallback_row_sum(conf_hf_ov15, conf_com_ov15),

    # treated cases by age group (HF only)
    maltreat_hf_u5 = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf
    ),

    maltreat_hf_5_14 = fallback_row_sum(
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf
    ),

    maltreat_hf_ov15 = fallback_row_sum(
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    # treated cases by age group (Community only)
    maltreat_com_u5 = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com
    ),

    maltreat_com_5_14 = fallback_row_sum(
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com
    ),

    maltreat_com_ov15 = fallback_row_sum(
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    # total treated cases by age group (HF + Community)
    maltreat_u5 = fallback_row_sum(maltreat_hf_u5, maltreat_com_u5),
    maltreat_5_14 = fallback_row_sum(maltreat_hf_5_14, maltreat_com_5_14),
    maltreat_ov15 = fallback_row_sum(maltreat_hf_ov15, maltreat_com_ov15),

    # total treated cases within 24 hours (HF only)
    maltreat_u24_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_u24_ov15_hf
    ),

    # total treated cases after 24 hours (HF only)
    maltreat_ov24_hf = fallback_row_sum(
      maltreat_ov24_u5_hf,
      maltreat_ov24_5_14_hf,
      maltreat_ov24_ov15_hf
    ),

    # total treated cases within 24 hours (Community only)
    maltreat_u24_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_u24_ov15_com
    ),

    # total treated cases after 24 hours (Community only)
    maltreat_ov24_com = fallback_row_sum(
      maltreat_ov24_u5_com,
      maltreat_ov24_5_14_com,
      maltreat_ov24_ov15_com
    ),

    # overall totals (HF + Community)
    maltreat_u24_total = fallback_row_sum(maltreat_u24_hf, maltreat_u24_com),
    maltreat_ov24_total = fallback_row_sum(maltreat_ov24_hf, maltreat_ov24_com),

    # presumed cases by age group
    pres_com_u5 = fallback_diff(maltreat_com_u5, conf_com_u5),
    pres_com_5_14 = fallback_diff(maltreat_com_5_14, conf_com_5_14),
    pres_com_ov15 = fallback_diff(maltreat_com_ov15, conf_com_ov15),

    pres_hf_u5 = fallback_diff(maltreat_hf_u5, conf_hf_u5),
    pres_hf_5_14 = fallback_diff(maltreat_hf_5_14, conf_hf_5_14),
    pres_hf_ov15 = fallback_diff(maltreat_hf_ov15, conf_hf_ov15),

    pres_u5 = fallback_row_sum(pres_com_u5, pres_hf_u5),
    pres_5_14 = fallback_row_sum(pres_com_5_14, pres_hf_5_14),
    pres_ov15 = fallback_row_sum(pres_com_ov15, pres_hf_ov15)
  )

# check to see the aggregation worked
dhis2_df |>
  dplyr::filter(
    record_id %in%
      c("6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842")
  ) |>
  dplyr::arrange(allout) |>
  dplyr::select(
    allout,
    allout_u5,
    allout_ov5,
    pres_hf_u5,
    maltreat_hf_u5,
    conf_hf_u5
  ) |>
  head()
NoteOutput
allout allout_u5 allout_ov5 pres_hf_u5 maltreat_hf_u5 conf_hf_u5
2 2 NA 0 19 19
30 NA 30 0 41 58
123 NA 123 0 118 118
139 86 53 46 NA 46
165 156 9 59 NA 59
378 256 122 50 50 NA

To adapt the code: - Lines 4–64: Adjust the variable names to reflect those relevant to your dataset when calculating new variables. - Once updated, run the code.

Show the code
# helper function: row-wise sum that returns value if only one non-NA
def fallback_row_sum(df, cols):
    return df[cols].sum(axis=1, skipna=True, min_count=1)


# helper function: difference with floor at 0, handling NA the same way R does
# when both values are NA return NA; when only one is NA return the other
# clamped at 0; when both present return max(col1 - col2, 0)
import numpy as np

def fallback_diff(df, col1, col2):
    a = df[col1]
    b = df[col2]
    both_na = a.isna() & b.isna()
    only_b_na = a.notna() & b.isna()
    only_a_na = a.isna() & b.notna()
    result = (a - b).clip(lower=0)
    result = result.where(~only_b_na, a.clip(lower=0))
    result = result.where(~only_a_na, b.clip(lower=0))
    result = result.where(~both_na, np.nan)
    return result


dhis2_df = (
    dhis2_df.assign(
        # outpatient visits
        allout=lambda x: fallback_row_sum(x, ["allout_u5", "allout_ov5"]),
        # suspected cases
        susp=lambda x: fallback_row_sum(
            x,
            [
                "susp_u5_hf",
                "susp_5_14_hf",
                "susp_ov15_hf",
                "susp_u5_com",
                "susp_5_14_com",
                "susp_ov15_com",
            ],
        ),
        # tested cases
        test_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        test_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_neg_rdt_u5_com",
                "tes_pos_rdt_u5_com",
                "tes_neg_rdt_5_14_com",
                "tes_pos_rdt_5_14_com",
                "tes_neg_rdt_ov15_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # confirmed cases (HF and COM)
        conf_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_pos_mic_u5_hf",
                "test_pos_mic_5_14_hf",
                "test_pos_mic_ov15_hf",
                "tes_pos_rdt_u5_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        conf_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_pos_rdt_u5_com",
                "tes_pos_rdt_5_14_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # treated cases
        maltreat_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_ov24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_ov24_5_14_com",
                "maltreat_u24_ov15_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
        maltreat_hf=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_hf",
                "maltreat_ov24_u5_hf",
                "maltreat_u24_5_14_hf",
                "maltreat_ov24_5_14_hf",
                "maltreat_u24_ov15_hf",
                "maltreat_ov24_ov15_hf",
            ],
        ),
        # malaria admissions
        maladm=lambda x: fallback_row_sum(
            x, ["maladm_u5", "maladm_5_14", "maladm_ov15"]
        ),
        # malaria deaths
        maldth=lambda x: fallback_row_sum(
            x,
            [
                "maldth_u5",
                "maldth_1_59m",
                "maldth_10_14",
                "maldth_5_9",
                "maldth_5_14",
                "maldth_ov15",
                "maldth_fem_ov15",
                "maldth_mal_ov15",
            ],
        ),
        # AGE-GROUP SPECIFIC AGGREGATIONS
        # Tested cases by age group (HF only)
        test_hf_u5=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
            ],
        ),
        test_hf_5_14=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
            ],
        ),
        test_hf_ov15=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        # Tested cases by age group (Community only)
        test_com_u5=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_u5_com", "tes_pos_rdt_u5_com"]
        ),
        test_com_5_14=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_5_14_com", "tes_pos_rdt_5_14_com"]
        ),
        test_com_ov15=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_ov15_com", "tes_pos_rdt_ov15_com"]
        ),
        # Suspected cases by age group (rename for consistency)
        susp_hf_u5=lambda x: x["susp_u5_hf"],
        susp_hf_5_14=lambda x: x["susp_5_14_hf"],
        susp_hf_ov15=lambda x: x["susp_ov15_hf"],
        susp_com_u5=lambda x: x["susp_u5_com"],
        susp_com_5_14=lambda x: x["susp_5_14_com"],
        susp_com_ov15=lambda x: x["susp_ov15_com"],
        # Confirmed cases by age group (HF only)
        conf_hf_u5=lambda x: fallback_row_sum(
            x, ["test_pos_mic_u5_hf", "tes_pos_rdt_u5_hf"]
        ),
        conf_hf_5_14=lambda x: fallback_row_sum(
            x, ["test_pos_mic_5_14_hf", "tes_pos_rdt_5_14_hf"]
        ),
        conf_hf_ov15=lambda x: fallback_row_sum(
            x, ["test_pos_mic_ov15_hf", "tes_pos_rdt_ov15_hf"]
        ),
        # Confirmed cases by age group (Community only)
        conf_com_u5=lambda x: x["tes_pos_rdt_u5_com"],
        conf_com_5_14=lambda x: x["tes_pos_rdt_5_14_com"],
        conf_com_ov15=lambda x: x["tes_pos_rdt_ov15_com"],
        # Treated cases by age group (HF only)
        maltreat_hf_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_hf", "maltreat_ov24_u5_hf"]
        ),
        maltreat_hf_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_hf", "maltreat_ov24_5_14_hf"]
        ),
        maltreat_hf_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_hf", "maltreat_ov24_ov15_hf"]
        ),
        # Treated cases by age group (Community only)
        maltreat_com_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_com", "maltreat_ov24_u5_com"]
        ),
        maltreat_com_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_com", "maltreat_ov24_5_14_com"]
        ),
        maltreat_com_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_com", "maltreat_ov24_ov15_com"]
        ),
        # Total treated cases within/after 24 hours
        maltreat_u24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_u24_u5_hf", "maltreat_u24_5_14_hf", "maltreat_u24_ov15_hf"],
        ),
        maltreat_ov24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_ov24_u5_hf", "maltreat_ov24_5_14_hf", "maltreat_ov24_ov15_hf"],
        ),
        maltreat_u24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_u24_ov15_com",
            ],
        ),
        maltreat_ov24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_ov24_u5_com",
                "maltreat_ov24_5_14_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
    )
    .assign(
        # second pass: computed from first pass columns
        test=lambda x: fallback_row_sum(x, ["test_hf", "test_com"]),
        conf=lambda x: fallback_row_sum(x, ["conf_hf", "conf_com"]),
        maltreat=lambda x: fallback_row_sum(x, ["maltreat_hf", "maltreat_com"]),
        # totals by age group
        test_u5=lambda x: fallback_row_sum(x, ["test_hf_u5", "test_com_u5"]),
        test_5_14=lambda x: fallback_row_sum(x, ["test_hf_5_14", "test_com_5_14"]),
        test_ov15=lambda x: fallback_row_sum(x, ["test_hf_ov15", "test_com_ov15"]),
        susp_u5=lambda x: fallback_row_sum(x, ["susp_hf_u5", "susp_com_u5"]),
        susp_5_14=lambda x: fallback_row_sum(x, ["susp_hf_5_14", "susp_com_5_14"]),
        susp_ov15=lambda x: fallback_row_sum(x, ["susp_hf_ov15", "susp_com_ov15"]),
        conf_u5=lambda x: fallback_row_sum(x, ["conf_hf_u5", "conf_com_u5"]),
        conf_5_14=lambda x: fallback_row_sum(x, ["conf_hf_5_14", "conf_com_5_14"]),
        conf_ov15=lambda x: fallback_row_sum(x, ["conf_hf_ov15", "conf_com_ov15"]),
        maltreat_u5=lambda x: fallback_row_sum(
            x, ["maltreat_hf_u5", "maltreat_com_u5"]
        ),
        maltreat_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_hf_5_14", "maltreat_com_5_14"]
        ),
        maltreat_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_hf_ov15", "maltreat_com_ov15"]
        ),
        maltreat_u24_total=lambda x: fallback_row_sum(
            x, ["maltreat_u24_hf", "maltreat_u24_com"]
        ),
        maltreat_ov24_total=lambda x: fallback_row_sum(
            x, ["maltreat_ov24_hf", "maltreat_ov24_com"]
        ),
        # presumed cases
        pres_com=lambda x: fallback_diff(x, "maltreat_com", "conf_com"),
        pres_hf=lambda x: fallback_diff(x, "maltreat_hf", "conf_hf"),
        pres_com_u5=lambda x: fallback_diff(x, "maltreat_com_u5", "conf_com_u5"),
        pres_com_5_14=lambda x: fallback_diff(x, "maltreat_com_5_14", "conf_com_5_14"),
        pres_com_ov15=lambda x: fallback_diff(x, "maltreat_com_ov15", "conf_com_ov15"),
        pres_hf_u5=lambda x: fallback_diff(x, "maltreat_hf_u5", "conf_hf_u5"),
        pres_hf_5_14=lambda x: fallback_diff(x, "maltreat_hf_5_14", "conf_hf_5_14"),
        pres_hf_ov15=lambda x: fallback_diff(x, "maltreat_hf_ov15", "conf_hf_ov15"),
    )
    .assign(
        # third pass: totals from presumed
        pres=lambda x: fallback_row_sum(x, ["pres_com", "pres_hf"]),
        pres_u5=lambda x: fallback_row_sum(x, ["pres_com_u5", "pres_hf_u5"]),
        pres_5_14=lambda x: fallback_row_sum(x, ["pres_com_5_14", "pres_hf_5_14"]),
        pres_ov15=lambda x: fallback_row_sum(x, ["pres_com_ov15", "pres_hf_ov15"]),
    )
)

# inspect results
(
    dhis2_df[
        dhis2_df["record_id"].isin(
            ["6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842"]
        )
    ][
        [
            "allout",
            "allout_u5",
            "allout_ov5",
            "pres_hf_u5",
            "maltreat_hf_u5",
            "conf_hf_u5",
        ]
    ]
    .sort_values("allout")
    .head(6)
)
NoteOutput
allout allout_u5 allout_ov5 pres_hf_u5 maltreat_hf_u5 conf_hf_u5

To adapt the code:

  • Lines 11–270: Adjust the variable names to reflect those relevant to your dataset when calculating new variables.

Once updated, run the code.

We can see that the aggregation worked as expected using our fallback functions. The allout column is the sum of allout_u5 and allout_ov5. When either value is NA, the non-NA value is preserved rather than returning NA. Similarly, pres_hf_u5 is derived from the difference between maltreat_hf_u5 and conf_hf_u5. When either value is NA, the result reflects the available data rather than defaulting to NA.

Step 6.2: Quality control of indicator totals

Now let’s check that indicator totals are equal to the sum of their disaggregated components. The code chunk below counts the number of rows where the indicator total does not equal the sum of its components. For example, rows where allout is not equal to the sum of allout_u5 and allout_ov5.

WarningWhy check indicator totals?

If all totals were manually summed using the previous step, we may choose to skip this step. However, even if totals were manually calculated in the previous step, checking totals here can help identify errors.

In the case that total columns are extracted from DHIS2, it is necessary to perform this quality control check for coherency. This check confirms which components go into the aggregated totals extracted from DHIS2.

  • R
  • Python
Show the code
# create mismatch flags for each indicator group
mismatch_summary <- dhis2_df |>
  dplyr::summarise(
    # outpatient checks
    allout_mismatch = sum(
      allout != (allout_u5 + allout_ov5),
      na.rm = TRUE
    ),

    # malaria admissions check
    maladm_mismatch = sum(
      maladm != (maladm_u5 + maladm_5_14 + maladm_ov15),
      na.rm = TRUE
    ),

    # total tests check
    test_mismatch = sum(
      test != (test_hf + test_com),
      na.rm = TRUE
    ),

    # total confirmed check
    conf_mismatch = sum(
      conf != (conf_hf + conf_com),
      na.rm = TRUE
    ),

    # total treated check
    maltreat_mismatch = sum(
      maltreat != (maltreat_hf + maltreat_com),
      na.rm = TRUE
    ),

    # total presumed check
    pres_mismatch = sum(
      pres != (pres_hf + pres_com),
      na.rm = TRUE
    ),

    # malaria deaths check
    maldth_mismatch = sum(
      maldth !=
        (maldth_1_59m +
          maldth_u5 +
          maldth_5_9 +
          maldth_10_14 +
          maldth_5_14 +
          maldth_fem_ov15 +
          maldth_mal_ov15 +
          maldth_ov15),
      na.rm = TRUE
    ),

    # tested cases by age group
    test_u5_mismatch = sum(
      test_u5 != (test_hf_u5 + test_com_u5),
      na.rm = TRUE
    ),
    test_5_14_mismatch = sum(
      test_5_14 != (test_hf_5_14 + test_com_5_14),
      na.rm = TRUE
    ),
    test_ov15_mismatch = sum(
      test_ov15 != (test_hf_ov15 + test_com_ov15),
      na.rm = TRUE
    ),

    # confirmed cases by age group
    conf_u5_mismatch = sum(
      conf_u5 != (conf_hf_u5 + conf_com_u5),
      na.rm = TRUE
    ),
    conf_5_14_mismatch = sum(
      conf_5_14 != (conf_hf_5_14 + conf_com_5_14),
      na.rm = TRUE
    ),
    conf_ov15_mismatch = sum(
      conf_ov15 != (conf_hf_ov15 + conf_com_ov15),
      na.rm = TRUE
    ),

    # presumed cases by age group
    pres_u5_mismatch = sum(
      pres_u5 != (pres_hf_u5 + pres_com_u5),
      na.rm = TRUE
    ),
    pres_5_14_mismatch = sum(
      pres_5_14 != (pres_hf_5_14 + pres_com_5_14),
      na.rm = TRUE
    ),
    pres_ov15_mismatch = sum(
      pres_ov15 != (pres_hf_ov15 + pres_com_ov15),
      na.rm = TRUE
    ),

    # suspected cases by age group
    susp_u5_mismatch = sum(
      susp_u5 != (susp_u5_hf + susp_u5_com),
      na.rm = TRUE
    ),
    susp_5_14_mismatch = sum(
      susp_5_14 != (susp_5_14_hf + susp_5_14_com),
      na.rm = TRUE
    ),
    susp_ov15_mismatch = sum(
      susp_ov15 != (susp_ov15_hf + susp_ov15_com),
      na.rm = TRUE
    ),

    # treated cases by age group
    maltreat_u5_mismatch = sum(
      maltreat_u5 != (maltreat_hf_u5 + maltreat_com_u5),
      na.rm = TRUE
    ),
    maltreat_5_14_mismatch = sum(
      maltreat_5_14 != (maltreat_hf_5_14 + maltreat_com_5_14),
      na.rm = TRUE
    ),
    maltreat_ov15_mismatch = sum(
      maltreat_ov15 != (maltreat_hf_ov15 + maltreat_com_ov15),
      na.rm = TRUE
    ),

    # treatment timing checks
    maltreat_u24_mismatch = sum(
      maltreat_u24_total != (maltreat_u24_hf + maltreat_u24_com),
      na.rm = TRUE
    ),
    maltreat_ov24_mismatch = sum(
      maltreat_ov24_total != (maltreat_ov24_hf + maltreat_ov24_com),
      na.rm = TRUE
    )
  ) |>
  # pivot to long format for easier viewing
  tidyr::pivot_longer(
    cols = dplyr::everything(),
    names_to = "indicator",
    values_to = "n_mismatches"
  ) |>
  # filter to show only indicators with mismatches
  dplyr::filter(n_mismatches > 0)

mismatch_summary
NoteOutput

Note: because this code displays rows where indicator totals are not equal to the sum of their components, we should expect to see output with <0 rows> if all calculations are correct. If rows do appear in the output, use the hf and periodname to further investigate the discrepancy.

# A tibble: 0 × 2
# ℹ 2 variables: indicator <chr>, n_mismatches <int>

To adapt the code:

  • Line 3: Replace dhis2_df with your target dataset.
  • Lines 6–52: Adjust the indicator checks to match your data. Each check compares a total column against the sum of its components. Add or remove checks based on the indicators available in your dataset.
  • Lines 54–126: These check age-disaggregated indicators. Modify column names to match your age group naming conventions (e.g., _u5, _5_14, _ov15).
  • Line 135: The output filters to show only indicators with mismatches. Remove this filter to see all indicators regardless of mismatch status.
Show the code
# create mismatch flags for each indicator group
# helper function to count mismatches, ignoring NAs
def count_mismatch(total, components):
    """Count mismatches between total and sum of components, ignoring NAs."""
    calculated = components.sum(axis=1)
    # only compare where both total and calculated are not NA
    mask = total.notna() & calculated.notna()
    return (total[mask] != calculated[mask]).sum()

mismatch_summary = pd.DataFrame({
    # outpatient checks
    "allout_mismatch": [
        count_mismatch(
            dhis2_df["allout"],
            dhis2_df[["allout_u5", "allout_ov5"]]
        )
    ],

    # malaria admissions check
    "maladm_mismatch": [
        count_mismatch(
            dhis2_df["maladm"],
            dhis2_df[["maladm_u5", "maladm_5_14", "maladm_ov15"]]
        )
    ],

    # total tests check
    "test_mismatch": [
        count_mismatch(
            dhis2_df["test"],
            dhis2_df[["test_hf", "test_com"]]
        )
    ],

    # total confirmed check
    "conf_mismatch": [
        count_mismatch(
            dhis2_df["conf"],
            dhis2_df[["conf_hf", "conf_com"]]
        )
    ],

    # total treated check
    "maltreat_mismatch": [
        count_mismatch(
            dhis2_df["maltreat"],
            dhis2_df[["maltreat_hf", "maltreat_com"]]
        )
    ],

    # total presumed check
    "pres_mismatch": [
        count_mismatch(
            dhis2_df["pres"],
            dhis2_df[["pres_hf", "pres_com"]]
        )
    ],

    # malaria deaths check
    "maldth_mismatch": [
        count_mismatch(
            dhis2_df["maldth"],
            dhis2_df[[
                "maldth_1_59m", "maldth_u5", "maldth_5_9", "maldth_10_14",
                "maldth_5_14", "maldth_fem_ov15", "maldth_mal_ov15", "maldth_ov15"
            ]]
        )
    ],

    # tested cases by age group
    "test_u5_mismatch": [
        count_mismatch(
            dhis2_df["test_u5"],
            dhis2_df[["test_hf_u5", "test_com_u5"]]
        )
    ],
    "test_5_14_mismatch": [
        count_mismatch(
            dhis2_df["test_5_14"],
            dhis2_df[["test_hf_5_14", "test_com_5_14"]]
        )
    ],
    "test_ov15_mismatch": [
        count_mismatch(
            dhis2_df["test_ov15"],
            dhis2_df[["test_hf_ov15", "test_com_ov15"]]
        )
    ],

    # confirmed cases by age group
    "conf_u5_mismatch": [
        count_mismatch(
            dhis2_df["conf_u5"],
            dhis2_df[["conf_hf_u5", "conf_com_u5"]]
        )
    ],
    "conf_5_14_mismatch": [
        count_mismatch(
            dhis2_df["conf_5_14"],
            dhis2_df[["conf_hf_5_14", "conf_com_5_14"]]
        )
    ],
    "conf_ov15_mismatch": [
        count_mismatch(
            dhis2_df["conf_ov15"],
            dhis2_df[["conf_hf_ov15", "conf_com_ov15"]]
        )
    ],

    # presumed cases by age group
    "pres_u5_mismatch": [
        count_mismatch(
            dhis2_df["pres_u5"],
            dhis2_df[["pres_hf_u5", "pres_com_u5"]]
        )
    ],
    "pres_5_14_mismatch": [
        count_mismatch(
            dhis2_df["pres_5_14"],
            dhis2_df[["pres_hf_5_14", "pres_com_5_14"]]
        )
    ],
    "pres_ov15_mismatch": [
        count_mismatch(
            dhis2_df["pres_ov15"],
            dhis2_df[["pres_hf_ov15", "pres_com_ov15"]]
        )
    ],

    # suspected cases by age group
    "susp_u5_mismatch": [
        count_mismatch(
            dhis2_df["susp_u5"],
            dhis2_df[["susp_hf_u5", "susp_com_u5"]]
        )
    ],
    "susp_5_14_mismatch": [
        count_mismatch(
            dhis2_df["susp_5_14"],
            dhis2_df[["susp_hf_5_14", "susp_com_5_14"]]
        )
    ],
    "susp_ov15_mismatch": [
        count_mismatch(
            dhis2_df["susp_ov15"],
            dhis2_df[["susp_hf_ov15", "susp_com_ov15"]]
        )
    ],

    # treated cases by age group
    "maltreat_u5_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u5"],
            dhis2_df[["maltreat_hf_u5", "maltreat_com_u5"]]
        )
    ],
    "maltreat_5_14_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_5_14"],
            dhis2_df[["maltreat_hf_5_14", "maltreat_com_5_14"]]
        )
    ],
    "maltreat_ov15_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov15"],
            dhis2_df[["maltreat_hf_ov15", "maltreat_com_ov15"]]
        )
    ],

    # treatment timing checks
    "maltreat_u24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u24_total"],
            dhis2_df[["maltreat_u24_hf", "maltreat_u24_com"]]
        )
    ],
    "maltreat_ov24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov24_total"],
            dhis2_df[["maltreat_ov24_hf", "maltreat_ov24_com"]]
        )
    ]
})

# pivot to long format for easier viewing
mismatch_summary = (
    mismatch_summary
    .melt(var_name="indicator", value_name="n_mismatches")
    # filter to show only indicators with mismatches
    .query("n_mismatches > 0")
)

mismatch_summary
NoteOutput

Note: because this code displays rows where indicator totals are not equal to the sum of their components, we should expect to see output with <0 rows> if all calculations are correct. If rows do appear in the output, use the hf and periodname to further investigate the discrepancy.

Empty DataFrame
Columns: [indicator, n_mismatches]
Index: []

To adapt the code:

  • Lines 10–156: Replace dhis2_df with your target dataset.
  • Lines 10–58: Adjust the indicator checks to match your data. Each check compares a total column against the sum of its components. Add or remove checks based on the indicators available in your dataset.
  • Lines 60–156: These check age-disaggregated indicators. Modify column names to match your age group naming conventions (e.g., _u5, _5_14, _ov15).
  • Line 164: The output filters to show only indicators with mismatches. Remove .query("n_mismatches > 0") to see all indicators regardless of mismatch status.

Step 6.3: Export rows with incoherent totals

If there are incoherent totals detected, they can be exported for further assessment using the below code. Exported outputs must be shared with the SNT team for review and guidance.

  • R
  • Python
Show the code
# identify rows with incoherent totals
incoherent_rows <- dhis2_df |>
  dplyr::filter(
    # outpatient checks
    allout != (allout_u5 + allout_ov5) |
    # malaria admissions check
    maladm != (maladm_u5 + maladm_5_14 + maladm_ov15) |
    # total tests check
    test != (test_hf + test_com) |
    # total confirmed check
    conf != (conf_hf + conf_com) |
    # total treated check
    maltreat != (maltreat_hf + maltreat_com) |
    # total presumed check
    pres != (pres_hf + pres_com)
  ) |>
  dplyr::select(
    hf, periodname,
    dplyr::matches("allout|maladm|test|conf|maltreat|pres")
  )

# set path for saving
output_file <- here::here(
  "03_outputs",
  "3.1_validation",
  "tables",
  "sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
rio::export(incoherent_rows, file = output_file)

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Lines 3–16: Adjust the filter conditions to match your indicator totals and components.
  • Lines 23–29: Update the output file path to your preferred location.

Once updated, run the code to export rows with incoherent totals for review with the SNT team.

Show the code
# identify rows with incoherent totals
incoherent_rows = dhis2_df[
    # outpatient checks
    (dhis2_df["allout"] != (dhis2_df["allout_u5"] + dhis2_df["allout_ov5"])) |
    # malaria admissions check
    (dhis2_df["maladm"] != (dhis2_df["maladm_u5"] + dhis2_df["maladm_5_14"] + dhis2_df["maladm_ov15"])) |
    # total tests check
    (dhis2_df["test"] != (dhis2_df["test_hf"] + dhis2_df["test_com"])) |
    # total confirmed check
    (dhis2_df["conf"] != (dhis2_df["conf_hf"] + dhis2_df["conf_com"])) |
    # total treated check
    (dhis2_df["maltreat"] != (dhis2_df["maltreat_hf"] + dhis2_df["maltreat_com"])) |
    # total presumed check
    (dhis2_df["pres"] != (dhis2_df["pres_hf"] + dhis2_df["pres_com"]))
]

# select relevant columns
incoherent_rows = incoherent_rows[
    ["hf", "periodname"] +
    [col for col in incoherent_rows.columns
     if any(x in col for x in ["allout", "maladm", "test", "conf", "maltreat", "pres"])]
]

# set path for saving
output_file = Path(
    here("03_outputs/3.1_validation/tables/sle_incoherent_totals_dhis2.xlsx")
)

# export to xlsx
incoherent_rows.to_excel(output_file, index=False)

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Lines 2–15: Adjust the filter conditions to match your indicator totals and components.
  • Lines 25–30: Update the output file path to your preferred location.

Once updated, run the code to export rows with incoherent totals for review with the SNT team.

Step 6.4: Add IPD/OPD specification

It may be useful in later analyses to subset routine data to only inpatient or only outpatient facilities or departments. For example, we may want to analyse malaria admissions only from facilities with inpatient capacity, or focus on outpatient case management indicators from primary care facilities.

Here, we create a column that specifies whether a facility provides inpatient (IPD) or outpatient (OPD) services. This classification can be based on:

  • Facility type naming conventions (e.g., “Hospital” = IPD, “CHP” = OPD)
  • Presence of inpatient indicators (e.g., facilities reporting admissions or deaths)
  • A reference lookup file from the SNT team or health ministry
WarningConsult with SNT team

Facility classification varies by country. Confirm with the SNT team which facility types should be classified as IPD vs OPD. Some facilities may offer both services and require special handling.

  • R
  • Python
Show the code
# option 1: classify based on facility name patterns
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    facility_type = dplyr::case_when(
      # inpatient facilities (hospitals)
      stringr::str_detect(
        hf,
        regex("hospital|hosp", ignore_case = TRUE)
      ) ~ "IPD",
      stringr::str_detect(
        hf,
        regex("district hospital|regional hospital", ignore_case = TRUE)
      ) ~ "IPD",
      # outpatient facilities (clinics, health posts, community health)
      stringr::str_detect(
        hf,
        regex(
          "CHP|CHC|MCHP|clinic|health post|health centre",
          ignore_case = TRUE
        )
      ) ~ "OPD",
      # default to OPD if no pattern matched
      TRUE ~ "OPD"
    )
  )

# option 2: classify based on presence of inpatient indicators
dhis2_df <- dhis2_df |>
  dplyr::group_by(hf_uid) |>
  dplyr::mutate(
    # facility has ipd if it ever reports admissions or deaths
    has_ipd = any(!is.na(maladm) & maladm > 0, na.rm = TRUE) |
      any(!is.na(maldth) & maldth > 0, na.rm = TRUE),
    facility_type = dplyr::if_else(has_ipd, "IPD", "OPD")
  ) |>
  dplyr::ungroup() |>
  dplyr::select(-has_ipd)

# option 3: use a reference lookup file
hf_path <-
  rio::import(
    here::here(
      "01_data",
      "1.1_foundational",
      "1.1c_health_facilities",
      "processed",
      "health_facility_master_list.xlsx"
    )
  )

dhis2_df <- dhis2_df |>
  dplyr::left_join(
    facility_lookup |> dplyr::select(hf_uid, facility_type),
    by = "hf_uid"
  )

# check classification distribution
dhis2_df |>
  dplyr::distinct(hf_uid, facility_type) |>
  dplyr::count(facility_type)
NoteOutput
facility_type n
IPD 89
OPD 1682

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Lines 6–23 (Option 1): Adjust the str_detect() patterns to match your country’s facility naming conventions (e.g., “dispensary”, “health center”, “referral hospital”).
  • Lines 32–33 (Option 2): Modify the indicator columns (maladm, maldth) if your dataset uses different names for inpatient indicators.
  • Lines 42–55 (Option 3): Update the path to your facility classification lookup file if using a reference list from the SNT team.

Choose the option that best fits your data and context. Option 3 (reference lookup) is recommended when an official facility classification exists.

Show the code
# option 1: classify based on facility name patterns
def classify_facility(hf_name):
    """Classify facility as IPD or OPD based on name patterns."""
    hf_lower = hf_name.lower() if pd.notna(hf_name) else ""

    # inpatient facilities (hospitals)
    if any(x in hf_lower for x in ["hospital", "hosp"]):
        return "IPD"
    # outpatient facilities (clinics, health posts, community health)
    elif any(x in hf_lower for x in ["chp", "chc", "mchp", "clinic", "health post", "health centre"]):
        return "OPD"
    # default to OPD
    else:
        return "OPD"

dhis2_df["facility_type"] = dhis2_df["hf"].apply(classify_facility)

# option 2: classify based on presence of inpatient indicators
has_ipd = (
    dhis2_df
    .groupby("hf_uid")
    .apply(
        lambda x: ((x["maladm"].notna() & (x["maladm"] > 0)).any() |
                   (x["maldth"].notna() & (x["maldth"] > 0)).any())
    )
    .reset_index(name="has_ipd")
)

dhis2_df = dhis2_df.merge(has_ipd, on="hf_uid", how="left")
dhis2_df["facility_type"] = dhis2_df["has_ipd"].map({True: "IPD", False: "OPD"})
dhis2_df = dhis2_df.drop(columns="has_ipd")

# option 3: use a reference lookup file
facility_lookup = pd.read_excel(
    Path("path/to/facility_classification.xlsx")
)

dhis2_df = dhis2_df.merge(
    facility_lookup[["hf_uid", "facility_type"]],
    on="hf_uid",
    how="left"
)

# check classification distribution
dhis2_df[["hf_uid", "facility_type"]].drop_duplicates()["facility_type"].value_counts()
NoteOutput
facility_type n
OPD 1682
IPD 89

To adapt the code:

  • Lines 6–13 (Option 1): Adjust the str_detect() / string patterns to match your country’s facility naming conventions (e.g., “dispensary”, “health center”, “referral hospital”).
  • Lines 17–25 (Option 2): Modify the indicator columns (maladm, maldth) if your dataset uses different names for inpatient indicators.
  • Lines 28–35 (Option 3): Update the path to your facility classification lookup file if using a reference list from the SNT team.

Choose the option that best fits your data and context. Option 3 (reference lookup) is recommended when an official facility classification exists.

Step 7: Finalize Data

Step 7.1: Troubleshoot duplicate HF-month records with different data

Here we search for rows in the dataset that correspond to the same HF-month report but happen to have different data. If any are found, these need to be cleaned at this stage to make sure the dataset contains only one report for each combination of HF and month. Expect to work closely with the SNT team or DHIS2 focal person at this stage to ensure correct data handling decisions are made.

Duplicate HF-month records can arise from multiple data entry submissions for the same reporting period, data imports from different sources being combined, system errors during DHIS2 extraction, or facilities reporting through both paper and electronic systems.

Unresolved duplicates will inflate case counts and distort indicators. Identifying and resolving them is necessary before any analysis.

  • R
  • Python
Show the code
# identify duplicate hf-month combinations
duplicates <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::ungroup()

# count number of duplicate pairs
n_duplicates <- duplicates |>
  dplyr::distinct(record_id) |>
  nrow()

# if duplicates exist, inspect them
if (n_duplicates > 0) {
  # view duplicate records with key indicators
  duplicate_details <- duplicates |>
    dplyr::select(
      record_id,
      adm0, adm1, adm2, adm3, hf, yearmon,
      allout,
      test,
      conf,
      maltreat
    ) |>
    dplyr::arrange(record_id)

  print(duplicate_details)

  # export for review with SNT team
  rio::export(
    duplicate_details,
    here::here(
      "03_outputs",
      "3.1_validation",
      "tables",
      "sle_duplicate_records_dhis2.xlsx"
    )
  )
}

# option 1: keep first record (if duplicates are exact copies)
dhis2_df <- dhis2_df |>
  dplyr::distinct(record_id, .keep_all = TRUE)

# option 2: keep record with most complete data
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::slice_max(
    # count non-NA values across indicator columns
    order_by = rowSums(!is.na(dplyr::across(dplyr::where(is.numeric)))),
    n = 1,
    with_ties = FALSE
  ) |>
  dplyr::ungroup()

# option 3: sum duplicates (if records represent partial reports)
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id, hf, adm0, adm1, adm2, adm3) |>
  dplyr::summarise(
    dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)),
    .groups = "drop"
  )

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Lines 18–21: Adjust the columns selected for duplicate inspection based on your key indicators.
  • Lines 26–33: Update the output file path for exporting duplicates.
  • Lines 36–56: Choose the appropriate resolution option based on your context.

Once updated, run the code to identify and resolve duplicate HF-month records.

Show the code
# identify duplicate hf-month combinations
duplicates = dhis2_df.groupby(["hf_uid", "yearmon"]).filter(lambda x: len(x) > 1)

# count number of duplicate pairs
n_duplicates = duplicates[["hf_uid", "yearmon"]].drop_duplicates().shape[0]

# if duplicates exist, inspect them
if n_duplicates > 0:
    # view duplicate records with key indicators
    duplicate_details = duplicates[
        ["record_id", "allout", "test", "conf", "maltreat"]
    ].sort_values(["hf_uid", "yearmon"])

    print(duplicate_details)

    # export for review with SNT team
    duplicate_details.to_excel(
        Path(
            here(
                "03_outputs/3.1_validation/tables/"
                "sle_duplicate_records_dhis2.xlsx"
            )
        ),
        index=False,
    )

# option 1: keep first record (if duplicates are exact copies)
dhis2_df = dhis2_df.drop_duplicates(subset=["hf_uid", "yearmon"], keep="first")

# option 2: keep record with most complete data
dhis2_df = (
    dhis2_df.assign(
        n_complete=lambda x: x.select_dtypes(include="number").notna().sum(axis=1)
    )
    .sort_values("n_complete", ascending=False)
    .drop_duplicates(subset=["record_id"], keep="first")
    .drop(columns="n_complete")
)

# option 3: sum duplicates (if records represent partial reports)
group_cols = ["hf_uid", "yearmon", "hf", "adm0", "adm1", "adm2", "adm3"]
numeric_cols = dhis2_df.select_dtypes(include="number").columns.tolist()

dhis2_df = dhis2_df.groupby(group_cols, as_index=False)[numeric_cols].sum()

# verify duplicates resolved
n_remaining = dhis2_df.groupby(["record_id"]).filter(lambda x: len(x) > 1).shape[0]

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Lines 10–12: Adjust the columns selected for duplicate inspection based on your key indicators.
  • Lines 17–25: Update the output file path for exporting duplicates.
  • Lines 36–56: Choose the appropriate resolution option based on your context.

Once updated, run the code to identify and resolve duplicate HF-month records.

TipChoosing a resolution strategy

Consult with the SNT team before choosing how to resolve duplicates. The correct approach depends on why duplicates exist. Exact duplicates from system errors can be handled by keeping the first record (Option 1). When records have different completeness from re-submissions, keep the most complete record (Option 2). Partial reports from split submissions may need to be summed together (Option 3). Conflicting data requiring investigation should be exported and resolved manually with the SNT team.

Step 7.2: Generate final data dictionary

A complete data dictionary documents all columns in the preprocessed dataset, including both the original DHIS2 variables and the computed/derived columns created during preprocessing. This dictionary serves as a reference for downstream analyses and facilitates collaboration with the SNT team.

  • R
  • Python
Show the code
# define descriptions for computed/derived columns
computed_vars <- tibble::tribble(
  ~snt_var,              ~indicator_label,
  # time columns
  "date",                "Report date (YYYY-MM-DD)",
  "year",                "Report year",
  "month",               "Report month (1-12)",
  "yearmon",             "Year-month label (e.g., Jan 2020)",
  # identifier columns
  "hf_uid",              "Unique health facility identifier (hash)",
  "record_id",           "Unique record identifier (facility + month hash)",
  "location_short",      "Location label: adm1 ~ adm2",
  "location_full",       "Location label: adm1 ~ adm2 ~ adm3 ~ hf",
  "facility_type",       "Facility type (IPD/OPD)",
  # aggregated totals
  "allout",              "Total outpatient visits (allout_u5 + allout_ov5)",
  "susp",                "Total suspected cases (all ages, HF + COM)",
  "test",                "Total tested (test_hf + test_com)",
  "test_hf",             "Total tested at health facility",
  "test_com",            "Total tested in community",
  "conf",                "Total confirmed cases (conf_hf + conf_com)",
  "conf_hf",             "Total confirmed cases at health facility",
  "conf_com",            "Total confirmed cases in community",
  "maltreat",            "Total treated cases (maltreat_hf + maltreat_com)",
  "maltreat_hf",         "Total treated cases at health facility",
  "maltreat_com",        "Total treated cases in community",
  "pres",                "Total presumed cases (pres_hf + pres_com)",
  "pres_hf",             "Total presumed cases at health facility",
  "pres_com",            "Total presumed cases in community",
  "maladm",              "Total malaria admissions (all ages)",
  "maldth",              "Total malaria deaths (all ages)",
  # age-group totals
  "test_u5",             "Total tested under 5 (HF + COM)",
  "test_5_14",           "Total tested 5-14 years (HF + COM)",
  "test_ov15",           "Total tested over 15 years (HF + COM)",
  "test_hf_u5",          "Tested under 5 at health facility",
  "test_hf_5_14",        "Tested 5-14 years at health facility",
  "test_hf_ov15",        "Tested over 15 years at health facility",
  "test_com_u5",         "Tested under 5 in community",
  "test_com_5_14",       "Tested 5-14 years in community",
  "test_com_ov15",       "Tested over 15 years in community",
  "susp_u5",             "Suspected cases under 5 (HF + COM)",
  "susp_5_14",           "Suspected cases 5-14 years (HF + COM)",
  "susp_ov15",           "Suspected cases over 15 years (HF + COM)",
  "susp_hf_u5",          "Suspected cases under 5 at health facility",
  "susp_hf_5_14",        "Suspected cases 5-14 years at health facility",
  "susp_hf_ov15",        "Suspected cases over 15 years at health facility",
  "susp_com_u5",         "Suspected cases under 5 in community",
  "susp_com_5_14",       "Suspected cases 5-14 years in community",
  "susp_com_ov15",       "Suspected cases over 15 years in community",
  "conf_u5",             "Confirmed cases under 5 (HF + COM)",
  "conf_5_14",           "Confirmed cases 5-14 years (HF + COM)",
  "conf_ov15",           "Confirmed cases over 15 years (HF + COM)",
  "conf_hf_u5",          "Confirmed cases under 5 at health facility",
  "conf_hf_5_14",        "Confirmed cases 5-14 years at health facility",
  "conf_hf_ov15",        "Confirmed cases over 15 years at health facility",
  "conf_com_u5",         "Confirmed cases under 5 in community",
  "conf_com_5_14",       "Confirmed cases 5-14 years in community",
  "conf_com_ov15",       "Confirmed cases over 15 years in community",
  "maltreat_u5",         "Treated cases under 5 (HF + COM)",
  "maltreat_5_14",       "Treated cases 5-14 years (HF + COM)",
  "maltreat_ov15",       "Treated cases over 15 years (HF + COM)",
  "maltreat_hf_u5",      "Treated cases under 5 at health facility",
  "maltreat_hf_5_14",    "Treated cases 5-14 years at health facility",
  "maltreat_hf_ov15",    "Treated cases over 15 years at health facility",
  "maltreat_com_u5",     "Treated cases under 5 in community",
  "maltreat_com_5_14",   "Treated cases 5-14 years in community",
  "maltreat_com_ov15",   "Treated cases over 15 years in community",
  "maltreat_u24_hf",     "Treated cases within 24hrs at health facility",
  "maltreat_ov24_hf",    "Treated cases after 24hrs at health facility",
  "maltreat_u24_com",    "Treated cases within 24hrs in community",
  "maltreat_ov24_com",   "Treated cases after 24hrs in community",
  "maltreat_u24_total",  "Total treated cases within 24hrs (HF + COM)",
  "maltreat_ov24_total", "Total treated cases after 24hrs (HF + COM)",
  "pres_u5",             "Presumed cases under 5 (HF + COM)",
  "pres_5_14",           "Presumed cases 5-14 years (HF + COM)",
  "pres_ov15",           "Presumed cases over 15 years (HF + COM)",
  "pres_hf_u5",          "Presumed cases under 5 at health facility",
  "pres_hf_5_14",        "Presumed cases 5-14 years at health facility",
  "pres_hf_ov15",        "Presumed cases over 15 years at health facility",
  "pres_com_u5",         "Presumed cases under 5 in community",
  "pres_com_5_14",       "Presumed cases 5-14 years in community",
  "pres_com_ov15",       "Presumed cases over 15 years in community"
)

# combine original and computed dictionaries
full_data_dict <- dplyr::bind_rows(
  data_dict,
  computed_vars
) |>
  # keep only columns present in final dataset
  dplyr::filter(snt_var %in% names(dhis2_df)) |>
  dplyr::arrange(snt_var) |>
  dplyr::select(snt_variable = snt_var,  label = indicator_label)

# check
full_data_dict |>
    head()
NoteOutput
snt_variable label
adm0 orgunitlevel1
adm1 orgunitlevel2
adm2 orgunitlevel3
adm3 orgunitlevel4
allout Total outpatient visits (allout_u5 + allout_ov5)
allout_ov5 OPD (New and follow-up curative) 5+y_X

To adapt the code:

  • Lines 4–77: Review and update computed_vars to match your computed variables. Add or remove rows as needed.
  • Lines 85–91: Update the output file path for the final data dictionary.

Once updated, run the code to generate and export the final data dictionary.

Show the code
# define descriptions for computed/derived columns
computed_vars = pd.DataFrame([
    # time columns
    {"snt_var": "date", "indicator_label": "Report date (YYYY-MM-DD)"},
    {"snt_var": "year", "indicator_label": "Report year"},
    {"snt_var": "month", "indicator_label": "Report month (1-12)"},
    {"snt_var": "yearmon", "indicator_label": "Year-month label (e.g., Jan 2020)"},
    # identifier columns
    {"snt_var": "hf_uid", "indicator_label": "Unique health facility identifier (hash)"},
    {"snt_var": "record_id", "indicator_label": "Unique record identifier (facility + month hash)"},
    {"snt_var": "location_short", "indicator_label": "Location label: adm1 ~ adm2"},
    {"snt_var": "location_full", "indicator_label": "Location label: adm1 ~ adm2 ~ adm3 ~ hf"},
    {"snt_var": "facility_type", "indicator_label": "Facility type (IPD/OPD)"},
    # aggregated totals
    {"snt_var": "allout", "indicator_label": "Total outpatient visits (allout_u5 + allout_ov5)"},
    {"snt_var": "susp", "indicator_label": "Total suspected cases (all ages, HF + COM)"},
    {"snt_var": "test", "indicator_label": "Total tested (test_hf + test_com)"},
    {"snt_var": "test_hf", "indicator_label": "Total tested at health facility"},
    {"snt_var": "test_com", "indicator_label": "Total tested in community"},
    {"snt_var": "conf", "indicator_label": "Total confirmed cases (conf_hf + conf_com)"},
    {"snt_var": "conf_hf", "indicator_label": "Total confirmed cases at health facility"},
    {"snt_var": "conf_com", "indicator_label": "Total confirmed cases in community"},
    {"snt_var": "maltreat", "indicator_label": "Total treated cases (maltreat_hf + maltreat_com)"},
    {"snt_var": "maltreat_hf", "indicator_label": "Total treated cases at health facility"},
    {"snt_var": "maltreat_com", "indicator_label": "Total treated cases in community"},
    {"snt_var": "pres", "indicator_label": "Total presumed cases (pres_hf + pres_com)"},
    {"snt_var": "pres_hf", "indicator_label": "Total presumed cases at health facility"},
    {"snt_var": "pres_com", "indicator_label": "Total presumed cases in community"},
    {"snt_var": "maladm", "indicator_label": "Total malaria admissions (all ages)"},
    {"snt_var": "maldth", "indicator_label": "Total malaria deaths (all ages)"},
    # age-group totals
    {"snt_var": "test_u5", "indicator_label": "Total tested under 5 (HF + COM)"},
    {"snt_var": "test_5_14", "indicator_label": "Total tested 5-14 years (HF + COM)"},
    {"snt_var": "test_ov15", "indicator_label": "Total tested over 15 years (HF + COM)"},
    {"snt_var": "conf_u5", "indicator_label": "Confirmed cases under 5 (HF + COM)"},
    {"snt_var": "conf_5_14", "indicator_label": "Confirmed cases 5-14 years (HF + COM)"},
    {"snt_var": "conf_ov15", "indicator_label": "Confirmed cases over 15 years (HF + COM)"},
    {"snt_var": "maltreat_u5", "indicator_label": "Treated cases under 5 (HF + COM)"},
    {"snt_var": "maltreat_5_14", "indicator_label": "Treated cases 5-14 years (HF + COM)"},
    {"snt_var": "maltreat_ov15", "indicator_label": "Treated cases over 15 years (HF + COM)"},
    {"snt_var": "pres_u5", "indicator_label": "Presumed cases under 5 (HF + COM)"},
    {"snt_var": "pres_5_14", "indicator_label": "Presumed cases 5-14 years (HF + COM)"},
    {"snt_var": "pres_ov15", "indicator_label": "Presumed cases over 15 years (HF + COM)"},
    # add remaining age-group variables as needed...
])

# combine original and computed dictionaries
full_data_dict = (
    pd.concat([data_dict, computed_vars], ignore_index=True)
    .rename(columns={"snt_var": "snt_variable", "indicator_label": "label"})
    [["snt_variable", "label"]]
)

# keep only columns present in final dataset
full_data_dict = (
    full_data_dict[full_data_dict["snt_variable"].isin(dhis2_df.columns)]
    .sort_values("snt_variable")
)

# check
full_data_dict.head(10)
NoteOutput
snt_variable label
adm0 orgunitlevel1
adm1 orgunitlevel2
adm2 orgunitlevel3
adm3 orgunitlevel4
allout Total outpatient visits (allout_u5 + allout_ov5)
allout_ov5 OPD (New and follow-up curative) 5+y_X
allout_u5 OPD (New and follow-up curative) 0-59m_X
conf Total confirmed cases (conf_hf + conf_com)
conf_5_14 Confirmed cases 5-14 years (HF + COM)
conf_com Total confirmed cases in community

To adapt the code:

  • Lines 7–49: Review and update computed_vars to match your computed variables. Add or remove rows as needed.
  • Lines 55–61: Update the output file path for the final data dictionary.

Once updated, run the code to generate and export the final data dictionary.

Step 7.3: Arrange final column order

Before exporting, we organise columns in a logical order to make the dataset easier to navigate. Identifiers and location columns come first, followed by time variables, then all indicator columns.

  • R
  • Python
Show the code
# arrange columns in logical order
dhis2_df <- dhis2_df |>
  dplyr::select(
    # identifiers
    record_id,
    # location hierarchy
    adm0,
    adm1,
    adm2,
    adm3,
    hf,
    hf_uid,
    location_short,
    location_full,
    facility_type,
    # time variables
    date,
    yearmon,
    year,
    month,
    # all remaining indicator columns
    dplyr::everything()
  )

# check column order
colnames(dhis2_df) |> head(20)
NoteOutput
 [1] "record_id"      "adm0"           "adm1"           "adm2"          
 [5] "adm3"           "hf"             "hf_uid"         "location_short"
 [9] "location_full"  "facility_type"  "date"           "yearmon"       
[13] "year"           "month"          "periodname"     "allout_u5"     
[17] "allout_ov5"     "maladm_u5"      "maladm_5_14"    "maladm_ov15"   

To adapt the code:

  • Lines 6–7: Adjust identifier columns based on your dataset.
  • Lines 9–16: Modify location columns to match your administrative hierarchy.
  • Lines 18–21: Update time columns if using different temporal variables (e.g., epiweek).

Once updated, run the code to reorder columns before exporting the final dataset.

Show the code
# define column order
id_cols = ["record_id"]
location_cols = ["adm0", "adm1", "adm2", "adm3", "hf", "hf_uid",
                 "location_short", "location_full", "facility_type"]
time_cols = ["date", "yearmon", "year", "month"]

# get remaining columns in original order
ordered_cols = id_cols + location_cols + time_cols
remaining_cols = [col for col in dhis2_df.columns if col not in ordered_cols]

# reorder dataframe
dhis2_df = dhis2_df[ordered_cols + remaining_cols]

# check column order
print(dhis2_df.columns[:20].tolist())
NoteOutput
['record_id', 'adm0', 'adm1', 'adm2', 'adm3', 'hf', 'hf_uid', 'location_short', 'location_full', 'facility_type', 'date', 'yearmon', 'year', 'month', 'periodname', 'allout_u5', 'allout_ov5', 'maladm_u5', 'maladm_5_14', 'maladm_ov15']

To adapt the code:

  • Line 2: Adjust identifier columns based on your dataset.
  • Lines 3–4: Modify location columns to match your administrative hierarchy.
  • Line 5: Update time columns if using different temporal variables (e.g., epiweek).

Once updated, run the code to reorder columns before exporting the final dataset.

Step 8: Aggregate and Save Data

Step 8.1: Save data at the health facility level

Since some SNT analyses rely on facility-specific information, we will now save the data at the health facility level. Keeping the data at health facility level ensures that facility-level analyses can be done accurately and without loss of detail.

  • R
  • Python
Show the code
# set up output path
save_path <- here::here(
  "01_data",
  "1.2_epidemiology",
  "1.2a_routine_surveillance",
  "processed"
)

# save the canonical clean dataset under the name that downstream pages
# (missing_data.qmd, active_status.qmd, ...) load from. write both .rds
# (for R) and .parquet (for Python) so both languages see identical data.
saveRDS(
  dhis2_df,
  here::here(save_path, "clean_malaria_routine_data_final.rds")
)

arrow::write_parquet(
  dhis2_df,
  here::here(save_path, "clean_malaria_routine_data_final.parquet"),
  compression = "zstd"
)

saveRDS(
  full_data_dict,
  here::here(save_path, "clean_malaria_routine_dict.rds")
)

arrow::write_parquet(
  full_data_dict,
  here::here(save_path, "clean_malaria_routine_dict.parquet"),
  compression = "zstd"
)

# additional export formats for sharing with non-R users
dhis2_hf_list <- list(
  data = dhis2_df,
  dictionary = full_data_dict
)

rio::export(
  dhis2_hf_list,
  here::here(save_path, "sle_dhis2_health_facility_data.xlsx")
)

To adapt the code:

  • Lines 2-7: Update save_path to your preferred output directory.
  • Lines 10-13: The list includes both the cleaned data and data dictionary. Add or remove items as needed.
  • Lines 16-19: Update the filename prefix (e.g., sle_) to match your country code.

Once updated, run the code to export the final preprocessed dataset and data dictionary.

Code
save_path = Path(
    here("01_data/1.2_epidemiology/1.2a_routine_surveillance/processed")
)

# save the canonical clean dataset under the name that downstream pages
# (missing_data.qmd, active_status.qmd, ...) load from. mirrors the .rds
# the R chunk above writes so both languages see identical data.
dhis2_df.to_parquet(
    save_path / "clean_malaria_routine_data_final.parquet",
    compression="zstd",
    index=False
)

full_data_dict.to_parquet(
    save_path / "clean_malaria_routine_dict.parquet",
    compression="zstd",
    index=False
)

# additional export formats for sharing with non-Python users
dhis2_df.to_excel(
    save_path / "sle_dhis2_health_facility_data.xlsx",
    index=False
)

full_data_dict.to_excel(
    save_path / "sle_dhis2_health_facility_dict.xlsx",
    index=False
)

To adapt the code:

  • Lines 2-4: Update save_path to your preferred output directory.
  • Lines 7-10: Update the filename prefix (e.g., sle_) to match your country code.
  • Lines 13-16: Save the dictionary separately if needed.
  • Lines 19-23: The .parquet format is efficient for large datasets. Use .csv instead if sharing with non-Python users.

Once updated, run the code to export the final preprocessed dataset and data dictionary.

Step 8.2: Aggregate and save data at each admin unit level

Now we aggregate and save the data at different administrative levels to support various types of analysis and ensure alignment with how interventions are typically planned and monitored. This allows flexibility when calculating indicators, comparing trends across regions, or linking with other datasets structured at adm0, adm1, adm2, adm3, or facility level.

WarningAvoid Summing Pre-Calculated Rates

When aggregating data to administrative levels, not all indicators can simply be summed. For example, if a test positivity rate or treatment rate has already been calculated, these indicators must be re-calculated at the new administrative level.

  • R
  • Python
Show the code
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols <- c(
  # totals
  "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
  # by location
  "test_hf", "test_com", "conf_hf", "conf_com",
  "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
  # by age group - totals
  "allout_u5", "allout_ov5",
  "test_u5", "test_5_14", "test_ov15",
  "conf_u5", "conf_5_14", "conf_ov15",
  "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
  "pres_u5", "pres_5_14", "pres_ov15",
  "susp_u5", "susp_5_14", "susp_ov15",
  "maladm_u5", "maladm_5_14", "maladm_ov15",
  "maldth_u5", "maldth_5_14", "maldth_ov15",
  # by age and location
  "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
  "test_com_u5", "test_com_5_14", "test_com_ov15",
  "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
  "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
  "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
  "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
  "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
  "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
  # treatment timing
  "maltreat_u24_hf", "maltreat_ov24_hf",
  "maltreat_u24_com", "maltreat_ov24_com",
  "maltreat_u24_total", "maltreat_ov24_total"
)

# function to aggregate at a given admin level
aggregate_admin <- function(df, group_cols, sum_cols) {
  df |>
    dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |>
    dplyr::summarise(
      dplyr::across(
        dplyr::any_of(sum_cols),
        ~ sum(.x, na.rm = TRUE)
      ),
      n_facilities = dplyr::n(),
      .groups = "drop"
    )
}

# aggregate at adm3 level
group_cols_adm3 <- c("adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon")
dhis2_adm3 <- aggregate_admin(dhis2_df, group_cols_adm3, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, adm3, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, sep = " ~ ")
  )

# aggregate at adm2 level
group_cols_adm2 <- c("adm0", "adm1", "adm2", "year", "month", "yearmon")
dhis2_adm2 <- aggregate_admin(dhis2_df, group_cols_adm2, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, sep = " ~ ")
  )

# aggregate at adm1 level
group_cols_adm1 <- c("adm0", "adm1", "year", "month", "yearmon")
dhis2_adm1 <- aggregate_admin(dhis2_df, group_cols_adm1, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, yearmon),
      algo = "xxhash32"
    ),
    location_short = adm1,
    location_full = adm1
  )

# create data dictionaries for each level (filter to relevant columns only)
id_cols <- c("n_facilities", "record_id", "location_short", "location_full")

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_dict <- full_data_dict |>
  dplyr::filter(snt_variable %in% c(group_cols_adm3, sum_cols, id_cols))

# adm2 dict: excludes adm3 (not present at this level)
adm2_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm2, sum_cols, id_cols),
    !snt_variable %in% c("adm3")
  )

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm1, sum_cols, id_cols),
    !snt_variable %in% c("adm2", "adm3")
  )

# save adm3 data with dictionary
rio::export(
  list(data = dhis2_adm3, dictionary = adm3_dict),
  here::here(save_path, "sle_dhis2_adm3_data.xlsx")
)
rio::export(dhis2_adm3, here::here(save_path, "sle_dhis2_adm3_data.rds"))

# save adm2 data with dictionary
rio::export(
  list(data = dhis2_adm2, dictionary = adm2_dict),
  here::here(save_path, "sle_dhis2_adm2_data.xlsx")
)
rio::export(dhis2_adm2, here::here(save_path, "sle_dhis2_adm2_data.rds"))

# save adm1 data with dictionary
rio::export(
  list(data = dhis2_adm1, dictionary = adm1_dict),
  here::here(save_path, "sle_dhis2_adm1_data.xlsx")
)
rio::export(dhis2_adm1, here::here(save_path, "sle_dhis2_adm1_data.rds"))
NoteOutput

ADM3 aggregation preview:

adm3 yearmon record_id location_full conf n_facilities
DEA Jan 2015 a13ef3b8 EASTERN ~ KAILAHUN ~ DEA 336 3
DEA Feb 2015 a6e77ff1 EASTERN ~ KAILAHUN ~ DEA 353 3
DEA Mar 2015 1a075bd7 EASTERN ~ KAILAHUN ~ DEA 318 3
DEA Apr 2015 9e096499 EASTERN ~ KAILAHUN ~ DEA 235 3
DEA May 2015 cf41c82e EASTERN ~ KAILAHUN ~ DEA 450 3

To adapt the code:

  • Lines 2-30: Update sum_cols to include all numeric indicators relevant to your dataset.
  • Lines 33-42: The aggregate_admin() function handles grouping by admin level.
  • Lines 45-79: Adjust grouping columns if your admin hierarchy differs. The record_id is created using sntutils::vdigest() with the full admin hierarchy (e.g., adm0 + adm1 + adm2 + adm3 + yearmon for adm3 level) to ensure unique identifiers. The code also creates location_short and location_full labels for each level.
  • Lines 82-94: Filter the data dictionary to match columns at each level, excluding admin columns not present at that level (e.g., adm3 excluded from adm2 dictionary).
  • Lines 97-112: Update filename prefixes (e.g., sle_) to match your country code.

Once updated, run the code to aggregate and save data at adm3, adm2, and adm1 levels.

Show the code
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols = [
    # totals
    "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
    # by location
    "test_hf", "test_com", "conf_hf", "conf_com",
    "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
    # by age group - totals
    "allout_u5", "allout_ov5",
    "test_u5", "test_5_14", "test_ov15",
    "conf_u5", "conf_5_14", "conf_ov15",
    "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
    "pres_u5", "pres_5_14", "pres_ov15",
    "susp_u5", "susp_5_14", "susp_ov15",
    "maladm_u5", "maladm_5_14", "maladm_ov15",
    "maldth_u5", "maldth_5_14", "maldth_ov15",
    # by age and location
    "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
    "test_com_u5", "test_com_5_14", "test_com_ov15",
    "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
    "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
    "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
    "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
    "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
    "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
    # treatment timing
    "maltreat_u24_hf", "maltreat_ov24_hf",
    "maltreat_u24_com", "maltreat_ov24_com",
    "maltreat_u24_total", "maltreat_ov24_total"
]

# function to aggregate at a given admin level
def aggregate_admin(df, group_cols, sum_cols):
    # filter to columns that exist in dataframe
    existing_sum_cols = [c for c in sum_cols if c in df.columns]

    # aggregate
    agg_df = (
        df
        .groupby(group_cols, as_index=False)
        .agg(
            **{col: (col, "sum") for col in existing_sum_cols},
            n_facilities=("hf_uid", "count")
        )
    )

    return agg_df

# aggregate at adm3 level
group_cols_adm3 = ["adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon"]
dhis2_adm3 = aggregate_admin(dhis2_df, group_cols_adm3, sum_cols)
dhis2_adm3["record_id"] = vdigest(
    dhis2_adm3["adm0"] + " " + dhis2_adm3["adm1"] + " " +
    dhis2_adm3["adm2"] + " " + dhis2_adm3["adm3"] + " " +
    dhis2_adm3["yearmon"].astype(str)
)
dhis2_adm3["location_short"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"]
dhis2_adm3["location_full"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"] + " ~ " + dhis2_adm3["adm3"]

# aggregate at adm2 level
group_cols_adm2 = ["adm0", "adm1", "adm2", "year", "month", "yearmon"]
dhis2_adm2 = aggregate_admin(dhis2_df, group_cols_adm2, sum_cols)
dhis2_adm2["record_id"] = vdigest(
    dhis2_adm2["adm0"] + " " + dhis2_adm2["adm1"] + " " +
    dhis2_adm2["adm2"] + " " + dhis2_adm2["yearmon"].astype(str)
)
dhis2_adm2["location_short"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
dhis2_adm2["location_full"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]

# aggregate at adm1 level
group_cols_adm1 = ["adm0", "adm1", "year", "month", "yearmon"]
dhis2_adm1 = aggregate_admin(dhis2_df, group_cols_adm1, sum_cols)
dhis2_adm1["record_id"] = vdigest(
    dhis2_adm1["adm0"] + " " + dhis2_adm1["adm1"] + " " +
    dhis2_adm1["yearmon"].astype(str)
)
dhis2_adm1["location_short"] = dhis2_adm1["adm1"]
dhis2_adm1["location_full"] = dhis2_adm1["adm1"]

# create data dictionaries for each level (filter to relevant columns only)
id_cols = ["n_facilities", "record_id", "location_short", "location_full"]

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_cols = group_cols_adm3 + [c for c in sum_cols if c in dhis2_adm3.columns] + id_cols
adm3_dict = full_data_dict[full_data_dict["snt_variable"].isin(adm3_cols)]

# adm2 dict: excludes adm3 (not present at this level)
adm2_cols = group_cols_adm2 + [c for c in sum_cols if c in dhis2_adm2.columns] + id_cols
adm2_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm2_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm3"]))
]

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_cols = group_cols_adm1 + [c for c in sum_cols if c in dhis2_adm1.columns] + id_cols
adm1_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm1_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm2", "adm3"]))
]

# save adm3 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm3_data.xlsx") as writer:
    dhis2_adm3.to_excel(writer, sheet_name="data", index=False)
    adm3_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm3.to_parquet(save_path / "sle_dhis2_adm3_data.parquet", compression="zstd", index=False)

# save adm2 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm2_data.xlsx") as writer:
    dhis2_adm2.to_excel(writer, sheet_name="data", index=False)
    adm2_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm2.to_parquet(save_path / "sle_dhis2_adm2_data.parquet", compression="zstd", index=False)

# save adm1 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm1_data.xlsx") as writer:
    dhis2_adm1.to_excel(writer, sheet_name="data", index=False)
    adm1_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm1.to_parquet(save_path / "sle_dhis2_adm1_data.parquet", compression="zstd", index=False)
NoteOutput

ADM3 aggregation preview:

adm3 yearmon record_id location_full conf n_facilities
DEA Jan 2015 a593ce45 EASTERN ~ KAILAHUN ~ DEA 336 3
DEA Feb 2015 44899848 EASTERN ~ KAILAHUN ~ DEA 353 3
DEA Mar 2015 aa072f84 EASTERN ~ KAILAHUN ~ DEA 318 3
DEA Apr 2015 b2cd9502 EASTERN ~ KAILAHUN ~ DEA 235 3
DEA May 2015 80f6a15c EASTERN ~ KAILAHUN ~ DEA 450 3

To adapt the code:

  • Lines 2-30: Update sum_cols to include all numeric indicators relevant to your dataset.
  • Lines 33-46: The aggregate_admin() function handles grouping by admin level.
  • Lines 49-75: Adjust grouping columns if your admin hierarchy differs. The record_id is created using vdigest() with the full admin hierarchy (e.g., adm0 + adm1 + adm2 + adm3 + yearmon for adm3 level) to ensure unique identifiers. The code also creates location_short and location_full labels for each level.
  • Lines 78-90: Filter the data dictionary to match columns at each level, excluding admin columns not present at that level (e.g., adm3 excluded from adm2 dictionary).
  • Lines 93-108: Update filename prefixes (e.g., sle_) to match your country code.

Once updated, run the code to aggregate and save data at adm3, adm2, and adm1 levels.

Summary

We have now walked through the key steps for cleaning, reshaping, and aggregating DHIS2 malaria data to make it SNT-ready. The code covered everything from importing raw files, standardising column names, computing key indicators, and saving outputs at multiple administrative levels. For convenience, a full end-to-end script is included below in a folded code block. Reuse or adapt it for the specific country context by adjusting column names, file paths, and administrative levels as needed.

Full Code

Find the full code script for accessing and processing routine data below.

  • R
  • Python
Show full code
################################################################################
#################### ~ DHIS2 data preprocessing full code ~ ####################
################################################################################

### Step 1: Import Packages and Data -------------------------------------------

#### Step 1.1: Import packages -------------------------------------------------

# install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# install or load relevant packages
pacman::p_load(
  tidyverse, # data manipulation, reshaping and plotting
  rio,       # import/export multiple file types
  DT,        # interactive table preview
  here,      # project-relative file paths
  readxl,    # read excel files
  writexl,   # write excel files
  knitr      # render code in quarto/rmarkdown
)

#### Step 1.2: Import data -----------------------------------------------------

# set the path to the data
core_routine_path <- here::here(
  "01_data",
  "1.2_epidemiology",
  "1.2a_routine_surveillance",
  "raw"
)

# set the full DHIS2 path
path_to_dhis2 <- here::here(core_routine_path, "sle_dhis2_2015_2022.xlsx")

# read the data
dhis2_df <- rio::import(file = path_to_dhis2)

dhis2_df <- rio::import_list(
  file = path_to_dhis2,
  rbind = TRUE,        # stack all tabs into one table
  rbind_label = "year" # tab name stored in column 'year'
)

dhis2_df <- rio::import_list(
  file = list.files(
    path = core_routine_path,
    pattern = "\\.(xls)$",    # the file extensions
    full.names = TRUE
  ),
  rbind = TRUE,                # stack all tabs/files into one table
  rbind_label = "sheet_admin", # source name stored in column 'sheet_admin'
  rbind_fill = TRUE            # fill missing columns with NA when stacking
)

# check data
dhis2_df |>
  dplyr::select(1:20) |>
  dplyr::glimpse()

### Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts --

#### Step 2.1: Verify all expected groups are present after import -------------

# check the adm's' in our data
dhis2_df |>
  dplyr::distinct(orgunitlevel3)

#### Step 2.2: Check for completely missing variables --------------------------

# identify complete missing values per column
dhis2_df |>
  dplyr::summarise(
    dplyr::across(
      dplyr::everything(),
      ~ mean(is.na(.x)) * 100
    )
  ) |>
  tidyr::pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "pct_missing"
  ) |>
  dplyr::filter(pct_missing == 100)

### Step 3: Rename Columns -----------------------------------------------------

# import data dictionary
data_dict <- rio::import(
  here::here(core_routine_path, "sle_dhis2_data_dict.xlsx")
)
# create rename vector: object = old names, names = new names
rename_vector <- stats::setNames(
  object = data_dict$indicator_label,
  nm = data_dict$snt_var
)

# rename DHIS2 data
dhis2_df <- dhis2_df |>
  dplyr::rename(
    !!!rename_vector
  ) |>
  # we drop orgunitlevel5 because it s same as hf
  dplyr::select(-orgunitlevel5)

# check the names
colnames(dhis2_df)

### Step 4: Standardise the Date Columns ---------------------------------------

# create dummy data
dates_ex1 <- tibble::tibble(
  raw_date = c("2020-01-15", "2021-03-01", "2022-12-20")
)

# parse dates
dates_ex1 |>
  dplyr::mutate(
    date = lubridate::ymd(raw_date)
  )

# create dummy data
dates_ex2 <- tibble::tibble(
  raw_date = c("Jan 2020", "February 2021", "Mar 2022")
)

# parse dates
dates_ex2 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(
      raw_date,
      orders = c("b Y", "B Y")
    ) |>
      as.Date()
  )

# create dummy data
dates_ex3 <- tibble::tibble(
  raw_date = c(
    "Janvier 2020",
    "Février 2021",
    "décembre 2022",
    "Jan 2021",
    "Fe 2022",
    "Dec 2023"
  )
)

# parse dates
dates_ex3 |>
  dplyr::mutate(
    date = paste0("01 ", raw_date) |>
      lubridate::dmy(locale = "fr_FR") |>
      as.Date()
  )

# create dummy data
dates_ex4 <- tibble::tibble(
  raw_date = c("Janeiro 2020", "fevereiro 2021", "Dezembro 2022",
               "Jan 2021", "Fev 2022", "Dez 2023")
)

# parse dates
dates_ex4 |>
  dplyr::mutate(
    date = paste0("01 ", raw_date) |>
      lubridate::dmy(locale = "pt_PT") |>
      as.Date()
  )

# create dummy data
dates_ex5 <- tibble::tibble(
  raw_date = c("2020-01", "2021/05", "2023-12")
)

# parse dates
dates_ex5 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(raw_date, orders = c("Y-m", "Y/m")),
    date = lubridate::floor_date(date, unit = "month") |> as.Date()
  )

# create dummy data
dates_ex6 <- tibble::tibble(
  raw_date = c("01/2020", "06-2021", "11/2022")
)

# parse dates
dates_ex6 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(raw_date, orders = c("m/Y", "m-Y")),
    date = lubridate::floor_date(date, unit = "month") |> as.Date()
  )

# create dummy data
dates_ex7 <- tibble::tibble(
  raw_date = c("202301", "202102", "202212")
)

dates_ex7 |>
  dplyr::mutate(
    year = substr(raw_date, 1, 4),
    month = substr(raw_date, 5, 6),
    date = lubridate::ymd(sprintf("%s-%s-01", year, month)) |> as.Date()
  )

# create dummy data
dates_ex8 <- tibble::tibble(
  raw_date = c(
    "2020-01-15", # ISO full date
    "Jan 2021",   # English abbreviated month
    "202203",     # compact YYYYMM
    "03/2022",    # month/year with slash
    "2021-07",    # year-month
    "July 2020",  # English full month
    "15-02-2021", # DD-MM-YYYY
    "2020/11/05", # YYYY/MM/DD
    "2020.12.25", # dotted date
    "2022.07",    # YYYY.MM
    "10-2020",    # MM-YYYY
    "20210105",   # YYYYMMDD
    "2020 Jan",   # Year then month (English)
    "2020.01.01"  # dotted YYYY.MM.DD
  )
)

# parse dates
dates_ex8 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(
      raw_date,
      orders = c(
        "Y-m-d", # 2020-01-15
        "Y/m/d", # 2020/11/05
        "Y.m.d", # 2020.12.25
        "b Y",   # Jan 2021
        "B Y",   # January 2021
        "Y b",   # 2020 Jan
        "Y B",   # 2020 January
        "Ym",    # 202203
        "m/Y",   # 03/2022
        "m-Y",   # 10-2020
        "Y-m",   # 2021-07
        "Y.m",   # 2022.07
        "d-m-Y", # 15-02-2021
        "d/m/Y", # 15/02/2021 (if present)
        "Ymd"    #  20210105
      )
    ) |>
      as.Date()
  )

# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
Sys.setlocale("LC_TIME", "en_US.UTF-8")

dhis2_df <- dhis2_df |>
  # parse the raw date into a proper Date object
  dplyr::mutate(
    date = lubridate::parse_date_time(
      periodname,
      orders = c("B Y", "b Y")
    ) |>
      as.Date()
  ) |>
  # create year, month, and ordered yearmon label
  dplyr::mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    # human-readable label, e.g. "Jan 2020"
    yearmon = format(date, "%b %Y"),
    # order factor by actual chronological date
    yearmon = factor(yearmon, levels = unique(yearmon[order(date)]))
  )

# check the head
dhis2_df |>
  dplyr::select(date, year, month, yearmon) |>
  head()

### Step 5: Managing Location Columns ------------------------------------------

#### Step 5.1: Harmonise administrative names ----------------------------------

# install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")

# set up location to save cache
cache_path <- "01_data/1.1_foundational/1.1f_cache_files/processed"

# get adm3 shapefile to use as reference lookup
shp_adm3 <- sntutils::read(
  here::here("01_data/1.1_foundational/1.1a_admin_boundaries/processed/sle_spatial_adm3_2021.rds")
) |>
  sf::st_drop_geometry() # drop geometry, we just need the admin names

# standardise admin names
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    adm0 = toupper(adm0),
    adm2 = toupper(adm1),
    adm3 = toupper(adm3),
    hf = toupper(hf),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2 = stringr::str_remove_all(adm2, " DISTRICT"),
    adm3 = stringr::str_remove_all(adm3, " CHIEFDOM")
  ) |>
  dplyr::mutate(
    # assign provinces based on district groupings
    # (no adm1 as current adm1 is adm2)
    adm1 = dplyr::case_when(
      adm2 %in% c("KAILAHUN", "KENEMA", "KONO") ~ "EASTERN",
      adm2 %in% c("BOMBALI", "FALABA", "KOINADUGU", "TONKOLILI") ~ "NORTH EAST",
      adm2 %in% c("KAMBIA", "KARENE", "PORT LOKO") ~ "NORTH WEST",
      adm2 %in% c("BO", "BONTHE", "MOYAMBA", "PUJEHUN") ~ "SOUTHERN",
      adm2 %in% c("WESTERN AREA RURAL", "WESTERN AREA URBAN") ~ "WESTERN"
    )
  )

# harmonise admin names between DHIS2 data and shapefile
dhis2_df <-
  sntutils::prep_geonames(
    target_df = dhis2_df,
    lookup_df = lookup_keys,
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    interactive = TRUE,
    cache_path = here::here(cache_path, "geoname_decisions_cache.xlsx")
  )

#### Step 5.2: Create unique identifiers and location labels -------------------

# create identifiers
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # create unique health facility identifier from admin hierarchy
    hf_uid = sntutils::vdigest(
      paste0(adm0, adm1, adm2, adm3, hf),
      algo = "xxhash32"
    ),
    # create location labels for mapping
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, hf, sep = " ~ "),
    # generate consistent record id (facility + time period)
    record_id = sntutils::vdigest(
      paste(hf_uid, yearmon),
      algo = "xxhash32"
    )
  )

# check
dhis2_df |>
  dplyr::arrange(location_full) |>
  dplyr::distinct(location_full, hf_uid, record_id) |>
  head()

### Step 5.5: Harmonise Health Facility Names with the Master Facility List ----

# load the fuzzy-matched DHIS2 + MFL output produced on the master facility
# lists page
mfl_matched <- readRDS(here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed",
  "final_dhis2_mfl_integrated.rds"
))

# build a distinct composite-keyed lookup; same facility name (hf) can
# appear in several adm2/adm3 areas, so key on (adm1, adm2, adm3, hf) to
# keep the lookup unique per facility-location
mfl_lookup <- mfl_matched |>
  dplyr::distinct(
    adm1, adm2, adm3, hf, hf_uid_new, hf_clean = hf_mfl_raw
  )

# attach canonical MFL name and stable id; relationship = "many-to-one"
# raises an error if mfl_lookup is not unique per join key, catching any
# accidental row multiplication
dhis2_df <- dhis2_df |>
  dplyr::left_join(
    mfl_lookup,
    by = c("adm1", "adm2", "adm3", "hf"),
    relationship = "many-to-one"
  )

# inspect a few rows of raw vs canonical name
dhis2_df |>
  dplyr::distinct(adm1, adm2, adm3, hf, hf_clean, hf_uid_new) |>
  dplyr::arrange(adm1, adm2, adm3, hf) |>
  utils::head(10)

### Step 6: Compute Variables --------------------------------------------------

#### Step 6.1: Compute indicator totals and new variables ----------------------

# smart row-wise sum with missing data handling
fallback_row_sum <- function(..., min_present = 1, .keep_zero_as_zero = TRUE) {
  vars_matrix <- cbind(...)
  valid_count <- rowSums(!is.na(vars_matrix))
  raw_sum <- rowSums(vars_matrix, na.rm = TRUE)

  ifelse(valid_count >= min_present, raw_sum, NA_real_)
}

# fallback absolute difference between two vectors
fallback_diff <- function(col1, col2, minimum = 0) {
  dplyr::case_when(
    is.na(col1) & is.na(col2) ~ NA_real_,
    is.na(col1) ~ pmax(col2, minimum),
    is.na(col2) ~ pmax(col1, minimum),
    TRUE ~ pmax(col1 - col2, minimum)
  )
}

# compute indicator totals in DHIS2 data
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # outpatient visits
    allout = fallback_row_sum(allout_u5, allout_ov5),

    # suspected cases
    susp = fallback_row_sum(
      susp_u5_hf,
      susp_5_14_hf,
      susp_ov15_hf,
      susp_u5_com,
      susp_5_14_com,
      susp_ov15_com
    ),

    # tested cases
    test_hf = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    test_com = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com,
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com,
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    test = fallback_row_sum(test_hf, test_com),

    # confirmed cases (hf and com)
    conf_hf = fallback_row_sum(
      test_pos_mic_u5_hf,
      test_pos_mic_5_14_hf,
      test_pos_mic_ov15_hf,
      tes_pos_rdt_u5_hf,
      tes_pos_rdt_5_14_hf,
      tes_pos_rdt_ov15_hf
    ),

    conf_com = fallback_row_sum(
      tes_pos_rdt_u5_com,
      tes_pos_rdt_5_14_com,
      tes_pos_rdt_ov15_com
    ),

    conf = fallback_row_sum(conf_hf, conf_com),

    # treated cases
    maltreat_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com,
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    maltreat_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf,
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    maltreat = fallback_row_sum(maltreat_hf, maltreat_com),

    # presumed cases
    pres_com = fallback_diff(maltreat_com, conf_com),
    pres_hf = fallback_diff(maltreat_hf, conf_hf),
    pres = fallback_row_sum(pres_com, pres_hf),

    # malaria admissions
    maladm = fallback_row_sum(
      maladm_u5,
      maladm_5_14,
      maladm_ov15
    ),

    # malaria deaths
    maldth = fallback_row_sum(
      maldth_u5,
      maldth_1_59m,
      maldth_10_14,
      maldth_5_9,
      maldth_5_14,
      maldth_ov15,
      maldth_fem_ov15,
      maldth_mal_ov15
    ),

    # age-group specific aggregations
    # tested cases by age group (HF only)
    test_hf_u5 = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    test_hf_5_14 = fallback_row_sum(
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    test_hf_ov15 = fallback_row_sum(
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # tested cases by age group (Community only)
    test_com_u5 = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com
    ),

    test_com_5_14 = fallback_row_sum(
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com
    ),

    test_com_ov15 = fallback_row_sum(
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    # total tested by age group (HF + Community)
    test_u5 = fallback_row_sum(test_hf_u5, test_com_u5),
    test_5_14 = fallback_row_sum(test_hf_5_14, test_com_5_14),
    test_ov15 = fallback_row_sum(test_hf_ov15, test_com_ov15),

    # suspected cases by age group (HF only)
    susp_hf_u5 = susp_u5_hf,

    susp_hf_5_14 = susp_5_14_hf,

    susp_hf_ov15 = susp_ov15_hf,

    # suspected cases by age group (Community only)
    susp_com_u5 = susp_u5_com,

    susp_com_5_14 = susp_5_14_com,

    susp_com_ov15 = susp_ov15_com,

    # total suspected by age group (HF + Community)
    susp_u5 = fallback_row_sum(susp_hf_u5, susp_com_u5),
    susp_5_14 = fallback_row_sum(susp_hf_5_14, susp_com_5_14),
    susp_ov15 = fallback_row_sum(susp_hf_ov15, susp_com_ov15),

    # confirmed cases by age group (HF only)
    conf_hf_u5 = fallback_row_sum(
      test_pos_mic_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    conf_hf_5_14 = fallback_row_sum(
      test_pos_mic_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    conf_hf_ov15 = fallback_row_sum(
      test_pos_mic_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # confirmed cases by age group (Community only)
    conf_com_u5 = tes_pos_rdt_u5_com,
    conf_com_5_14 = tes_pos_rdt_5_14_com,
    conf_com_ov15 = tes_pos_rdt_ov15_com,

    # total confirmed cases by age group (HF + Community)
    conf_u5 = fallback_row_sum(conf_hf_u5, conf_com_u5),
    conf_5_14 = fallback_row_sum(conf_hf_5_14, conf_com_5_14),
    conf_ov15 = fallback_row_sum(conf_hf_ov15, conf_com_ov15),

    # treated cases by age group (HF only)
    maltreat_hf_u5 = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf
    ),

    maltreat_hf_5_14 = fallback_row_sum(
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf
    ),

    maltreat_hf_ov15 = fallback_row_sum(
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    # treated cases by age group (Community only)
    maltreat_com_u5 = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com
    ),

    maltreat_com_5_14 = fallback_row_sum(
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com
    ),

    maltreat_com_ov15 = fallback_row_sum(
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    # total treated cases by age group (HF + Community)
    maltreat_u5 = fallback_row_sum(maltreat_hf_u5, maltreat_com_u5),
    maltreat_5_14 = fallback_row_sum(maltreat_hf_5_14, maltreat_com_5_14),
    maltreat_ov15 = fallback_row_sum(maltreat_hf_ov15, maltreat_com_ov15),

    # total treated cases within 24 hours (HF only)
    maltreat_u24_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_u24_ov15_hf
    ),

    # total treated cases after 24 hours (HF only)
    maltreat_ov24_hf = fallback_row_sum(
      maltreat_ov24_u5_hf,
      maltreat_ov24_5_14_hf,
      maltreat_ov24_ov15_hf
    ),

    # total treated cases within 24 hours (Community only)
    maltreat_u24_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_u24_ov15_com
    ),

    # total treated cases after 24 hours (Community only)
    maltreat_ov24_com = fallback_row_sum(
      maltreat_ov24_u5_com,
      maltreat_ov24_5_14_com,
      maltreat_ov24_ov15_com
    ),

    # overall totals (HF + Community)
    maltreat_u24_total = fallback_row_sum(maltreat_u24_hf, maltreat_u24_com),
    maltreat_ov24_total = fallback_row_sum(maltreat_ov24_hf, maltreat_ov24_com),

    # presumed cases by age group
    pres_com_u5 = fallback_diff(maltreat_com_u5, conf_com_u5),
    pres_com_5_14 = fallback_diff(maltreat_com_5_14, conf_com_5_14),
    pres_com_ov15 = fallback_diff(maltreat_com_ov15, conf_com_ov15),

    pres_hf_u5 = fallback_diff(maltreat_hf_u5, conf_hf_u5),
    pres_hf_5_14 = fallback_diff(maltreat_hf_5_14, conf_hf_5_14),
    pres_hf_ov15 = fallback_diff(maltreat_hf_ov15, conf_hf_ov15),

    pres_u5 = fallback_row_sum(pres_com_u5, pres_hf_u5),
    pres_5_14 = fallback_row_sum(pres_com_5_14, pres_hf_5_14),
    pres_ov15 = fallback_row_sum(pres_com_ov15, pres_hf_ov15)
  )

# check to see the aggregation worked
dhis2_df |>
  dplyr::filter(
    record_id %in%
      c("6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842")
  ) |>
  dplyr::arrange(allout) |>
  dplyr::select(
    allout,
    allout_u5,
    allout_ov5,
    pres_hf_u5,
    maltreat_hf_u5,
    conf_hf_u5
  ) |>
  head()

#### Step 6.2: Quality control of indicator totals -----------------------------

# create mismatch flags for each indicator group
mismatch_summary <- dhis2_df |>
  dplyr::summarise(
    # outpatient checks
    allout_mismatch = sum(
      allout != (allout_u5 + allout_ov5),
      na.rm = TRUE
    ),

    # malaria admissions check
    maladm_mismatch = sum(
      maladm != (maladm_u5 + maladm_5_14 + maladm_ov15),
      na.rm = TRUE
    ),

    # total tests check
    test_mismatch = sum(
      test != (test_hf + test_com),
      na.rm = TRUE
    ),

    # total confirmed check
    conf_mismatch = sum(
      conf != (conf_hf + conf_com),
      na.rm = TRUE
    ),

    # total treated check
    maltreat_mismatch = sum(
      maltreat != (maltreat_hf + maltreat_com),
      na.rm = TRUE
    ),

    # total presumed check
    pres_mismatch = sum(
      pres != (pres_hf + pres_com),
      na.rm = TRUE
    ),

    # malaria deaths check
    maldth_mismatch = sum(
      maldth !=
        (maldth_1_59m +
          maldth_u5 +
          maldth_5_9 +
          maldth_10_14 +
          maldth_5_14 +
          maldth_fem_ov15 +
          maldth_mal_ov15 +
          maldth_ov15),
      na.rm = TRUE
    ),

    # tested cases by age group
    test_u5_mismatch = sum(
      test_u5 != (test_hf_u5 + test_com_u5),
      na.rm = TRUE
    ),
    test_5_14_mismatch = sum(
      test_5_14 != (test_hf_5_14 + test_com_5_14),
      na.rm = TRUE
    ),
    test_ov15_mismatch = sum(
      test_ov15 != (test_hf_ov15 + test_com_ov15),
      na.rm = TRUE
    ),

    # confirmed cases by age group
    conf_u5_mismatch = sum(
      conf_u5 != (conf_hf_u5 + conf_com_u5),
      na.rm = TRUE
    ),
    conf_5_14_mismatch = sum(
      conf_5_14 != (conf_hf_5_14 + conf_com_5_14),
      na.rm = TRUE
    ),
    conf_ov15_mismatch = sum(
      conf_ov15 != (conf_hf_ov15 + conf_com_ov15),
      na.rm = TRUE
    ),

    # presumed cases by age group
    pres_u5_mismatch = sum(
      pres_u5 != (pres_hf_u5 + pres_com_u5),
      na.rm = TRUE
    ),
    pres_5_14_mismatch = sum(
      pres_5_14 != (pres_hf_5_14 + pres_com_5_14),
      na.rm = TRUE
    ),
    pres_ov15_mismatch = sum(
      pres_ov15 != (pres_hf_ov15 + pres_com_ov15),
      na.rm = TRUE
    ),

    # suspected cases by age group
    susp_u5_mismatch = sum(
      susp_u5 != (susp_u5_hf + susp_u5_com),
      na.rm = TRUE
    ),
    susp_5_14_mismatch = sum(
      susp_5_14 != (susp_5_14_hf + susp_5_14_com),
      na.rm = TRUE
    ),
    susp_ov15_mismatch = sum(
      susp_ov15 != (susp_ov15_hf + susp_ov15_com),
      na.rm = TRUE
    ),

    # treated cases by age group
    maltreat_u5_mismatch = sum(
      maltreat_u5 != (maltreat_hf_u5 + maltreat_com_u5),
      na.rm = TRUE
    ),
    maltreat_5_14_mismatch = sum(
      maltreat_5_14 != (maltreat_hf_5_14 + maltreat_com_5_14),
      na.rm = TRUE
    ),
    maltreat_ov15_mismatch = sum(
      maltreat_ov15 != (maltreat_hf_ov15 + maltreat_com_ov15),
      na.rm = TRUE
    ),

    # treatment timing checks
    maltreat_u24_mismatch = sum(
      maltreat_u24_total != (maltreat_u24_hf + maltreat_u24_com),
      na.rm = TRUE
    ),
    maltreat_ov24_mismatch = sum(
      maltreat_ov24_total != (maltreat_ov24_hf + maltreat_ov24_com),
      na.rm = TRUE
    )
  ) |>
  # pivot to long format for easier viewing
  tidyr::pivot_longer(
    cols = dplyr::everything(),
    names_to = "indicator",
    values_to = "n_mismatches"
  ) |>
  # filter to show only indicators with mismatches
  dplyr::filter(n_mismatches > 0)

mismatch_summary

#### Step 6.3: Export rows with incoherent totals ------------------------------

# identify rows with incoherent totals
incoherent_rows <- dhis2_df |>
  dplyr::filter(
    # outpatient checks
    allout != (allout_u5 + allout_ov5) |
    # malaria admissions check
    maladm != (maladm_u5 + maladm_5_14 + maladm_ov15) |
    # total tests check
    test != (test_hf + test_com) |
    # total confirmed check
    conf != (conf_hf + conf_com) |
    # total treated check
    maltreat != (maltreat_hf + maltreat_com) |
    # total presumed check
    pres != (pres_hf + pres_com)
  ) |>
  dplyr::select(
    hf, periodname,
    dplyr::matches("allout|maladm|test|conf|maltreat|pres")
  )

# set path for saving
output_file <- here::here(
  "03_outputs",
  "3.1_validation",
  "tables",
  "sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
rio::export(incoherent_rows, file = output_file)

#### Step 6.4: Add IPD/OPD specification ---------------------------------------

# option 1: classify based on facility name patterns
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    facility_type = dplyr::case_when(
      # inpatient facilities (hospitals)
      stringr::str_detect(
        hf,
        regex("hospital|hosp", ignore_case = TRUE)
      ) ~ "IPD",
      stringr::str_detect(
        hf,
        regex("district hospital|regional hospital", ignore_case = TRUE)
      ) ~ "IPD",
      # outpatient facilities (clinics, health posts, community health)
      stringr::str_detect(
        hf,
        regex(
          "CHP|CHC|MCHP|clinic|health post|health centre",
          ignore_case = TRUE
        )
      ) ~ "OPD",
      # default to OPD if no pattern matched
      TRUE ~ "OPD"
    )
  )

# option 2: classify based on presence of inpatient indicators
dhis2_df <- dhis2_df |>
  dplyr::group_by(hf_uid) |>
  dplyr::mutate(
    # facility has ipd if it ever reports admissions or deaths
    has_ipd = any(!is.na(maladm) & maladm > 0, na.rm = TRUE) |
      any(!is.na(maldth) & maldth > 0, na.rm = TRUE),
    facility_type = dplyr::if_else(has_ipd, "IPD", "OPD")
  ) |>
  dplyr::ungroup() |>
  dplyr::select(-has_ipd)

# option 3: use a reference lookup file
hf_path <-
  rio::import(
    here::here(
      "01_data",
      "1.1_foundational",
      "1.1c_health_facilities",
      "processed",
      "health_facility_master_list.xlsx"
    )
  )

dhis2_df <- dhis2_df |>
  dplyr::left_join(
    facility_lookup |> dplyr::select(hf_uid, facility_type),
    by = "hf_uid"
  )

# check classification distribution
dhis2_df |>
  dplyr::distinct(hf_uid, facility_type) |>
  dplyr::count(facility_type)

### Step 7: Finalize Data ------------------------------------------------------

#### Step 7.1: Troubleshoot duplicate HF-month records with different data -----

# identify duplicate hf-month combinations
duplicates <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::ungroup()

# count number of duplicate pairs
n_duplicates <- duplicates |>
  dplyr::distinct(record_id) |>
  nrow()

# if duplicates exist, inspect them
if (n_duplicates > 0) {
  # view duplicate records with key indicators
  duplicate_details <- duplicates |>
    dplyr::select(
      record_id,
      adm0, adm1, adm2, adm3, hf, yearmon,
      allout,
      test,
      conf,
      maltreat
    ) |>
    dplyr::arrange(record_id)

  print(duplicate_details)

  # export for review with SNT team
  rio::export(
    duplicate_details,
    here::here(
      "03_outputs",
      "3.1_validation",
      "tables",
      "sle_duplicate_records_dhis2.xlsx"
    )
  )
}

# option 1: keep first record (if duplicates are exact copies)
dhis2_df <- dhis2_df |>
  dplyr::distinct(record_id, .keep_all = TRUE)

# option 2: keep record with most complete data
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::slice_max(
    # count non-NA values across indicator columns
    order_by = rowSums(!is.na(dplyr::across(dplyr::where(is.numeric)))),
    n = 1,
    with_ties = FALSE
  ) |>
  dplyr::ungroup()

# option 3: sum duplicates (if records represent partial reports)
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id, hf, adm0, adm1, adm2, adm3) |>
  dplyr::summarise(
    dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)),
    .groups = "drop"
  )

#### Step 7.2: Generate final data dictionary ----------------------------------

# define descriptions for computed/derived columns
computed_vars <- tibble::tribble(
  ~snt_var,              ~indicator_label,
  # time columns
  "date",                "Report date (YYYY-MM-DD)",
  "year",                "Report year",
  "month",               "Report month (1-12)",
  "yearmon",             "Year-month label (e.g., Jan 2020)",
  # identifier columns
  "hf_uid",              "Unique health facility identifier (hash)",
  "record_id",           "Unique record identifier (facility + month hash)",
  "location_short",      "Location label: adm1 ~ adm2",
  "location_full",       "Location label: adm1 ~ adm2 ~ adm3 ~ hf",
  "facility_type",       "Facility type (IPD/OPD)",
  # aggregated totals
  "allout",              "Total outpatient visits (allout_u5 + allout_ov5)",
  "susp",                "Total suspected cases (all ages, HF + COM)",
  "test",                "Total tested (test_hf + test_com)",
  "test_hf",             "Total tested at health facility",
  "test_com",            "Total tested in community",
  "conf",                "Total confirmed cases (conf_hf + conf_com)",
  "conf_hf",             "Total confirmed cases at health facility",
  "conf_com",            "Total confirmed cases in community",
  "maltreat",            "Total treated cases (maltreat_hf + maltreat_com)",
  "maltreat_hf",         "Total treated cases at health facility",
  "maltreat_com",        "Total treated cases in community",
  "pres",                "Total presumed cases (pres_hf + pres_com)",
  "pres_hf",             "Total presumed cases at health facility",
  "pres_com",            "Total presumed cases in community",
  "maladm",              "Total malaria admissions (all ages)",
  "maldth",              "Total malaria deaths (all ages)",
  # age-group totals
  "test_u5",             "Total tested under 5 (HF + COM)",
  "test_5_14",           "Total tested 5-14 years (HF + COM)",
  "test_ov15",           "Total tested over 15 years (HF + COM)",
  "test_hf_u5",          "Tested under 5 at health facility",
  "test_hf_5_14",        "Tested 5-14 years at health facility",
  "test_hf_ov15",        "Tested over 15 years at health facility",
  "test_com_u5",         "Tested under 5 in community",
  "test_com_5_14",       "Tested 5-14 years in community",
  "test_com_ov15",       "Tested over 15 years in community",
  "susp_u5",             "Suspected cases under 5 (HF + COM)",
  "susp_5_14",           "Suspected cases 5-14 years (HF + COM)",
  "susp_ov15",           "Suspected cases over 15 years (HF + COM)",
  "susp_hf_u5",          "Suspected cases under 5 at health facility",
  "susp_hf_5_14",        "Suspected cases 5-14 years at health facility",
  "susp_hf_ov15",        "Suspected cases over 15 years at health facility",
  "susp_com_u5",         "Suspected cases under 5 in community",
  "susp_com_5_14",       "Suspected cases 5-14 years in community",
  "susp_com_ov15",       "Suspected cases over 15 years in community",
  "conf_u5",             "Confirmed cases under 5 (HF + COM)",
  "conf_5_14",           "Confirmed cases 5-14 years (HF + COM)",
  "conf_ov15",           "Confirmed cases over 15 years (HF + COM)",
  "conf_hf_u5",          "Confirmed cases under 5 at health facility",
  "conf_hf_5_14",        "Confirmed cases 5-14 years at health facility",
  "conf_hf_ov15",        "Confirmed cases over 15 years at health facility",
  "conf_com_u5",         "Confirmed cases under 5 in community",
  "conf_com_5_14",       "Confirmed cases 5-14 years in community",
  "conf_com_ov15",       "Confirmed cases over 15 years in community",
  "maltreat_u5",         "Treated cases under 5 (HF + COM)",
  "maltreat_5_14",       "Treated cases 5-14 years (HF + COM)",
  "maltreat_ov15",       "Treated cases over 15 years (HF + COM)",
  "maltreat_hf_u5",      "Treated cases under 5 at health facility",
  "maltreat_hf_5_14",    "Treated cases 5-14 years at health facility",
  "maltreat_hf_ov15",    "Treated cases over 15 years at health facility",
  "maltreat_com_u5",     "Treated cases under 5 in community",
  "maltreat_com_5_14",   "Treated cases 5-14 years in community",
  "maltreat_com_ov15",   "Treated cases over 15 years in community",
  "maltreat_u24_hf",     "Treated cases within 24hrs at health facility",
  "maltreat_ov24_hf",    "Treated cases after 24hrs at health facility",
  "maltreat_u24_com",    "Treated cases within 24hrs in community",
  "maltreat_ov24_com",   "Treated cases after 24hrs in community",
  "maltreat_u24_total",  "Total treated cases within 24hrs (HF + COM)",
  "maltreat_ov24_total", "Total treated cases after 24hrs (HF + COM)",
  "pres_u5",             "Presumed cases under 5 (HF + COM)",
  "pres_5_14",           "Presumed cases 5-14 years (HF + COM)",
  "pres_ov15",           "Presumed cases over 15 years (HF + COM)",
  "pres_hf_u5",          "Presumed cases under 5 at health facility",
  "pres_hf_5_14",        "Presumed cases 5-14 years at health facility",
  "pres_hf_ov15",        "Presumed cases over 15 years at health facility",
  "pres_com_u5",         "Presumed cases under 5 in community",
  "pres_com_5_14",       "Presumed cases 5-14 years in community",
  "pres_com_ov15",       "Presumed cases over 15 years in community"
)

# combine original and computed dictionaries
full_data_dict <- dplyr::bind_rows(
  data_dict,
  computed_vars
) |>
  # keep only columns present in final dataset
  dplyr::filter(snt_var %in% names(dhis2_df)) |>
  dplyr::arrange(snt_var) |>
  dplyr::select(snt_variable = snt_var,  label = indicator_label)

# check
full_data_dict |>
    head()

#### Step 7.3: Arrange final column order --------------------------------------

# arrange columns in logical order
dhis2_df <- dhis2_df |>
  dplyr::select(
    # identifiers
    record_id,
    # location hierarchy
    adm0,
    adm1,
    adm2,
    adm3,
    hf,
    hf_uid,
    location_short,
    location_full,
    facility_type,
    # time variables
    date,
    yearmon,
    year,
    month,
    # all remaining indicator columns
    dplyr::everything()
  )

# check column order
colnames(dhis2_df) |> head(20)

### Step 8: Aggregate and Save Data --------------------------------------------

#### Step 8.1: Save data at the health facility level --------------------------

# set up output path
save_path <- here::here(
  "01_data",
  "1.2_epidemiology",
  "1.2a_routine_surveillance",
  "processed"
)

# save the canonical clean dataset under the name that downstream pages
# (missing_data.qmd, active_status.qmd, ...) load from. write both .rds
# (for R) and .parquet (for Python) so both languages see identical data.
saveRDS(
  dhis2_df,
  here::here(save_path, "clean_malaria_routine_data_final.rds")
)

arrow::write_parquet(
  dhis2_df,
  here::here(save_path, "clean_malaria_routine_data_final.parquet"),
  compression = "zstd"
)

saveRDS(
  full_data_dict,
  here::here(save_path, "clean_malaria_routine_dict.rds")
)

arrow::write_parquet(
  full_data_dict,
  here::here(save_path, "clean_malaria_routine_dict.parquet"),
  compression = "zstd"
)

# additional export formats for sharing with non-R users
dhis2_hf_list <- list(
  data = dhis2_df,
  dictionary = full_data_dict
)

rio::export(
  dhis2_hf_list,
  here::here(save_path, "sle_dhis2_health_facility_data.xlsx")
)

#### Step 8.2: Aggregate and save data at each admin unit level ----------------

# define numeric columns to sum (excluding HF-level identifiers)
sum_cols <- c(
  # totals
  "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
  # by location
  "test_hf", "test_com", "conf_hf", "conf_com",
  "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
  # by age group - totals
  "allout_u5", "allout_ov5",
  "test_u5", "test_5_14", "test_ov15",
  "conf_u5", "conf_5_14", "conf_ov15",
  "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
  "pres_u5", "pres_5_14", "pres_ov15",
  "susp_u5", "susp_5_14", "susp_ov15",
  "maladm_u5", "maladm_5_14", "maladm_ov15",
  "maldth_u5", "maldth_5_14", "maldth_ov15",
  # by age and location
  "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
  "test_com_u5", "test_com_5_14", "test_com_ov15",
  "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
  "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
  "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
  "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
  "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
  "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
  # treatment timing
  "maltreat_u24_hf", "maltreat_ov24_hf",
  "maltreat_u24_com", "maltreat_ov24_com",
  "maltreat_u24_total", "maltreat_ov24_total"
)

# function to aggregate at a given admin level
aggregate_admin <- function(df, group_cols, sum_cols) {
  df |>
    dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |>
    dplyr::summarise(
      dplyr::across(
        dplyr::any_of(sum_cols),
        ~ sum(.x, na.rm = TRUE)
      ),
      n_facilities = dplyr::n(),
      .groups = "drop"
    )
}

# aggregate at adm3 level
group_cols_adm3 <- c("adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon")
dhis2_adm3 <- aggregate_admin(dhis2_df, group_cols_adm3, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, adm3, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, sep = " ~ ")
  )

# aggregate at adm2 level
group_cols_adm2 <- c("adm0", "adm1", "adm2", "year", "month", "yearmon")
dhis2_adm2 <- aggregate_admin(dhis2_df, group_cols_adm2, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, sep = " ~ ")
  )

# aggregate at adm1 level
group_cols_adm1 <- c("adm0", "adm1", "year", "month", "yearmon")
dhis2_adm1 <- aggregate_admin(dhis2_df, group_cols_adm1, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, yearmon),
      algo = "xxhash32"
    ),
    location_short = adm1,
    location_full = adm1
  )

# create data dictionaries for each level (filter to relevant columns only)
id_cols <- c("n_facilities", "record_id", "location_short", "location_full")

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_dict <- full_data_dict |>
  dplyr::filter(snt_variable %in% c(group_cols_adm3, sum_cols, id_cols))

# adm2 dict: excludes adm3 (not present at this level)
adm2_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm2, sum_cols, id_cols),
    !snt_variable %in% c("adm3")
  )

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm1, sum_cols, id_cols),
    !snt_variable %in% c("adm2", "adm3")
  )

# save adm3 data with dictionary
rio::export(
  list(data = dhis2_adm3, dictionary = adm3_dict),
  here::here(save_path, "sle_dhis2_adm3_data.xlsx")
)
rio::export(dhis2_adm3, here::here(save_path, "sle_dhis2_adm3_data.rds"))

# save adm2 data with dictionary
rio::export(
  list(data = dhis2_adm2, dictionary = adm2_dict),
  here::here(save_path, "sle_dhis2_adm2_data.xlsx")
)
rio::export(dhis2_adm2, here::here(save_path, "sle_dhis2_adm2_data.rds"))

# save adm1 data with dictionary
rio::export(
  list(data = dhis2_adm1, dictionary = adm1_dict),
  here::here(save_path, "sle_dhis2_adm1_data.xlsx")
)
rio::export(dhis2_adm1, here::here(save_path, "sle_dhis2_adm1_data.rds"))
Show full code
################################################################################
#################### ~ DHIS2 data preprocessing full code ~ ####################
################################################################################

### Step 1: Import Packages and Data -------------------------------------------

#### Step 1.1: Import packages -------------------------------------------------

import pandas as pd  # load core data tools
from pathlib import Path  # handle file system paths
from pyprojroot import here  # build project-relative paths
import os  # optional path utilities
import locale  # for setting language locale
import geopandas as gpd  # for importing shapefiles
import xxhash  # for hashing
import pyarrow.parquet as pq  # for exportin parquet files
import pyarrow as pa  # for exportin parquet files

#### Step 1.2: Import data -----------------------------------------------------

# set the full dhis2 path
core_routine_path = Path(
    here("01_data/1.2_epidemiology/1.2a_routine_surveillance/raw")
)

# set the full dhis2 path
path_to_dhis2 = core_routine_path / "sle_dhis2_2015_2022.xlsx"

# read the data
dhis2_df = pd.read_excel(path_to_dhis2)

# import all tabs and stack into a single table
dhis2_df = pd.concat(
    [
        pd.read_excel(path_to_dhis2, sheet_name=s).assign(year=s)
        for s in pd.ExcelFile(path_to_dhis2).sheet_names
    ],
    ignore_index=True,
)
dhis2_df.head(10).style

# set file extension
extension = "xls"

# find all files with specified extension in directory
files = list(core_routine_path.glob(f"*.{extension}"))

# initialise dataframe
dhis2_df = pd.DataFrame()

# iterate over files, concatenate stack as one dataframe
for file in files:
    temp = pd.read_excel(file, sheet_name="Sheet1")
    dhis2_df = pd.concat([dhis2_df, temp])

# Inspect data
dhis2_df.head(10)

### Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts --

#### Step 2.1: Verify all expected groups are present after import -------------

# check the adm's' in our data
dhis2_df[["orgunitlevel3"]].drop_duplicates().reset_index(drop=True)

#### Step 2.2: Check for completely missing variables --------------------------

# compute percent missing for each column
pct_missing = dhis2_df.isna().mean() * 100

# convert to tidy/long format
missing_table = (
    pct_missing.reset_index()
    .rename(columns={"index": "variable", 0: "pct_missing"})
)

# filter for completely missing columns
missing_table[missing_table["pct_missing"] == 100]

### Step 3: Rename Columns -----------------------------------------------------

# import data dictionary
dict_path = os.path.join(core_routine_path, "sle_dhis2_data_dict.xlsx")
data_dict = pd.read_excel(dict_path)

# build rename dictionary: {old_name: new_name}
rename_dict = dict(zip(data_dict["indicator_label"], data_dict["snt_var"]))

# rename dhis2 data
dhis2_df = dhis2_df.rename(columns=rename_dict).drop(
    columns=["orgunitlevel5"], errors="ignore"
)

# view updated column names
dhis2_df.columns.tolist()

### Step 4: Standardise the Date Columns ---------------------------------------

# create dummy data
dates_ex1 = pd.DataFrame({
    "raw_date": ["2020-01-15", "2021-03-01", "2022-12-20"]
})

# parse dates
dates_ex1["date"] = pd.to_datetime(dates_ex1["raw_date"], format="%Y-%m-%d")

dates_ex1

# create dummy data
dates_ex2 = pd.DataFrame({
    "raw_date": ["Jan 2020", "February 2021", "Mar 2022"]
})

# parse dates
dates_ex2["date"] = pd.to_datetime(
    dates_ex2["raw_date"],
    format=None,  # allow flexible parsing of abbreviated and full month names
)

dates_ex2

# set French locale (adjust if needed depending on system)
# common options: 'fr_FR.UTF-8', 'fr_FR'
locale.setlocale(locale.LC_TIME, 'fr_FR.UTF-8')

# create dummy data
dates_ex3 = pd.DataFrame({
    "raw_date": [
        "Janvier 2020",
        "Février 2021",
        "décembre 2022"
    ]
})

# ensure day is present for parsing
dates_ex3["date"] = pd.to_datetime(
    "01 " + dates_ex3["raw_date"],
    format="%d %B %Y",
    errors="coerce"
)

# fallback for abbreviated months
mask_missing = dates_ex3["date"].isna()
dates_ex3.loc[mask_missing, "date"] = pd.to_datetime(
    "01 " + dates_ex3.loc[mask_missing, "raw_date"],
    format="%d %b %Y",
    errors="coerce"
)

dates_ex3

# set Portuguese locale
# common options: 'pt_PT.UTF-8', 'pt_PT'
locale.setlocale(locale.LC_TIME, 'pt_PT.UTF-8')

# create dummy data
dates_ex4 = pd.DataFrame({
    "raw_date": [
        "Janeiro 2020",
        "fevereiro 2021",
        "Dezembro 2022",
        "Jan 2021",
        "Fev 2022",
        "Dez 2023"
    ]
})

# first attempt: full Portuguese month names
dates_ex4["date"] = pd.to_datetime(
    "01 " + dates_ex4["raw_date"],
    format="%d %B %Y",
    errors="coerce"
)

# fallback: abbreviated Portuguese month names (e.g. Jan, Fev, Dez)
mask_missing = dates_ex4["date"].isna()
dates_ex4.loc[mask_missing, "date"] = pd.to_datetime(
    "01 " + dates_ex4.loc[mask_missing, "raw_date"],
    format="%d %b %Y",
    errors="coerce"
)

dates_ex4

# create dummy data
dates_ex5 = pd.DataFrame(
    {
        "raw_date": [
            "2020-01-15",  # ISO full date
            "Jan 2021",    # English abbreviated month
            "202203",      # compact YYYYMM
            "03/2022",     # month/year
            "2021-07",     # year-month
            "July 2020",   # English full month
            "15-02-2021",  # DD-MM-YYYY
            "2020/11/05",  # YYYY/MM/DD
            "2020.12.25",  # YYYY.MM.DD
            "2022.07",     # YYYY.MM
            "10-2020",     # MM-YYYY
            "20210105",    # YYYYMMDD
            "2020 Jan",    # year then month
            "2020.01.01",  # dotted full date
        ]
    }
)

# list of possible formats
formats = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%Y.%m.%d",
    "%d-%m-%Y",
    "%d/%m/%Y",
    "%b %Y",
    "%B %Y",
    "%Y %b",
    "%Y %B",
    "%Y-%m",
    "%Y/%m",
    "%Y.%m",
    "%m-%Y",
    "%m/%Y",
    "%Y%m",
    "%Y%m%d",
]

def try_parse(value):
    # try strict formats first
    for fmt in formats:
        try:
            return pd.to_datetime(value, format=fmt)
        except:
            pass
    # fallback: let pandas guess freely
    return pd.to_datetime(value, errors="coerce")

dates_ex5["date"] = dates_ex5["raw_date"].apply(try_parse)

dates_ex5["date"] = dates_ex5["date"].dt.to_period("M").dt.to_timestamp()
dates_ex5

# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
locale.setlocale(locale.LC_TIME, "en_US.UTF-8")

# parse the raw date into a proper datetime object
dhis2_df["date"] = pd.to_datetime(
    dhis2_df["periodname"],
    format="%B %Y",
    errors="coerce"
)

# create year, month, and ordered yearmon columns
dhis2_df["year"] = dhis2_df["date"].dt.year
dhis2_df["month"] = dhis2_df["date"].dt.month
dhis2_df["yearmon"] = dhis2_df["date"].dt.strftime("%b %Y")

# check the output
dhis2_df[["date", "year", "month", "yearmon"]].head()

### Step 5: Managing Location Columns ------------------------------------------

#### Step 5.1: Harmonise administrative names ----------------------------------

# If not already, install the sntutils python package from GitHub
# has a number of useful helper functions
pip install git+https://github.com/ahadi-analytics/sntutils-py.git
from sntutils.geo import prep_geonames # for name matching

# set up location to save cache
cache_path = Path("01_data/1.1_foundational/1.1f_cache_files/processed")

# get adm3 shapefile to use as reference lookup
shp_adm3 = gpd.read_file(
    Path(here("01_data/1.1_foundational/1.1a_admin_boundaries/processed/sle_spatial_adm3_2021.geojson"))
).drop(columns="geometry")

# standardise admin names
dhis2_df = dhis2_df.assign(
    adm0=lambda x: x["adm0"].str.upper(),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2=lambda x: x["adm1"]
    .str.upper()
    .str.replace(" DISTRICT", "", regex=False)
    .str.strip(),
    adm3=lambda x: x["adm3"]
    .str.upper()
    .str.replace(" CHIEFDOM", "", regex=False)
    .str.strip(),
    hf=lambda x: x["hf"]
    .str.upper()
)

# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
district_to_province = {
    "KAILAHUN": "EASTERN", "KENEMA": "EASTERN", "KONO": "EASTERN",
    "BOMBALI": "NORTH EAST", "FALABA": "NORTH EAST",
    "KOINADUGU": "NORTH EAST", "TONKOLILI": "NORTH EAST",
    "KAMBIA": "NORTH WEST", "KARENE": "NORTH WEST", "PORT LOKO": "NORTH WEST",
    "BO": "SOUTHERN", "BONTHE": "SOUTHERN",
    "MOYAMBA": "SOUTHERN", "PUJEHUN": "SOUTHERN",
    "WESTERN AREA RURAL": "WESTERN", "WESTERN AREA URBAN": "WESTERN"
}

dhis2_df["adm1"] = dhis2_df["adm2"].map(district_to_province)

# harmonise admin names between dhis2 data and shapefile
# note: sntutils::prep_geonames() has no direct python equivalent
# manual fuzzy matching or exact merge required
dhis2_df = dhis2_df.merge(
    lookup_keys,
    on=["adm0", "adm1", "adm2", "adm3"],
    how="left"
)

# Harmonise admin names between population data and shapefile
dhis2_df = prep_geonames(
    target_df=dhis2_df,
    lookup_df=shp_adm3,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    cache_path=here(cache_path, "geoname_decisions_cache.xlsx")
)

#### Step 5.2: Create unique identifiers and location labels -------------------

# helper function to create xxhash32 digest (equivalent to sntutils::vdigest)
def vdigest(x, algo="xxhash32"):
    """Create vectorised hash digest."""
    return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())

# create identifiers
dhis2_df = dhis2_df.assign(
    # create unique health facility identifier from admin hierarchy
    # use paste0 semantics (no separator) to match R's paste0(adm0, adm1, adm2, adm3, hf)
    hf_uid=lambda x: vdigest(
        x["adm0"] + x["adm1"] + x["adm2"] + x["adm3"] + x["hf"]
    ),
    # create location labels for mapping
    location_short=lambda x: x["adm1"] + " ~ " + x["adm2"],
    location_full=lambda x: (
        x["adm1"] + " ~ " + x["adm2"] + " ~ " + x["adm3"] + " ~ " + x["hf"]
    ),
)

# generate consistent record id (facility + time period)
# use space to match R's paste(hf_uid, yearmon) default sep=" "
dhis2_df["record_id"] = vdigest(
    dhis2_df["hf_uid"] + " " + dhis2_df["yearmon"].astype(str)
)

# check
dhis2_df[["location_full", "hf_uid", "record_id"]].drop_duplicates().sort_values(
    "location_full"
).head()

### Step 5.5: Harmonise Health Facility Names with the Master Facility List ----

# load the fuzzy-matched DHIS2 + MFL output produced on the master facility
# lists page
mfl_matched = pd.read_parquet(
    Path(here(
        "01_data/1.1_foundational/1.1c_health_facilities/processed/"
        "final_dhis2_mfl_integrated.parquet"
    ))
)

# build a distinct composite-keyed lookup; same facility name (hf) can
# appear in several adm2/adm3 areas, so key on (adm1, adm2, adm3, hf) to
# keep the lookup unique per facility-location
mfl_lookup = (
    mfl_matched[
        ["adm1", "adm2", "adm3", "hf", "hf_uid_new", "hf_mfl_raw"]
    ]
    .drop_duplicates(subset=["adm1", "adm2", "adm3", "hf"])
    .rename(columns={"hf_mfl_raw": "hf_clean"})
)

# attach canonical MFL name and stable id; validate="many_to_one" raises
# an error if mfl_lookup is not unique per join key, catching any
# accidental row multiplication
dhis2_df = dhis2_df.merge(
    mfl_lookup,
    on=["adm1", "adm2", "adm3", "hf"],
    how="left",
    validate="many_to_one",
)

# inspect a few rows of raw vs canonical name
(
    dhis2_df[["adm1", "adm2", "adm3", "hf", "hf_clean", "hf_uid_new"]]
    .drop_duplicates()
    .sort_values(["adm1", "adm2", "adm3", "hf"])
    .head(10)
)

### Step 6: Compute Variables --------------------------------------------------

#### Step 6.1: Compute indicator totals and new variables ----------------------

# helper function: row-wise sum that returns value if only one non-NA
def fallback_row_sum(df, cols):
    return df[cols].sum(axis=1, skipna=True, min_count=1)

# helper function: difference with floor at 0, handling NA the same way R does
# when both values are NA return NA; when only one is NA return the other
# clamped at 0; when both present return max(col1 - col2, 0)
import numpy as np

def fallback_diff(df, col1, col2):
    a = df[col1]
    b = df[col2]
    both_na = a.isna() & b.isna()
    only_b_na = a.notna() & b.isna()
    only_a_na = a.isna() & b.notna()
    result = (a - b).clip(lower=0)
    result = result.where(~only_b_na, a.clip(lower=0))
    result = result.where(~only_a_na, b.clip(lower=0))
    result = result.where(~both_na, np.nan)
    return result

dhis2_df = (
    dhis2_df.assign(
        # outpatient visits
        allout=lambda x: fallback_row_sum(x, ["allout_u5", "allout_ov5"]),
        # suspected cases
        susp=lambda x: fallback_row_sum(
            x,
            [
                "susp_u5_hf",
                "susp_5_14_hf",
                "susp_ov15_hf",
                "susp_u5_com",
                "susp_5_14_com",
                "susp_ov15_com",
            ],
        ),
        # tested cases
        test_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        test_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_neg_rdt_u5_com",
                "tes_pos_rdt_u5_com",
                "tes_neg_rdt_5_14_com",
                "tes_pos_rdt_5_14_com",
                "tes_neg_rdt_ov15_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # confirmed cases (HF and COM)
        conf_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_pos_mic_u5_hf",
                "test_pos_mic_5_14_hf",
                "test_pos_mic_ov15_hf",
                "tes_pos_rdt_u5_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        conf_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_pos_rdt_u5_com",
                "tes_pos_rdt_5_14_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # treated cases
        maltreat_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_ov24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_ov24_5_14_com",
                "maltreat_u24_ov15_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
        maltreat_hf=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_hf",
                "maltreat_ov24_u5_hf",
                "maltreat_u24_5_14_hf",
                "maltreat_ov24_5_14_hf",
                "maltreat_u24_ov15_hf",
                "maltreat_ov24_ov15_hf",
            ],
        ),
        # malaria admissions
        maladm=lambda x: fallback_row_sum(
            x, ["maladm_u5", "maladm_5_14", "maladm_ov15"]
        ),
        # malaria deaths
        maldth=lambda x: fallback_row_sum(
            x,
            [
                "maldth_u5",
                "maldth_1_59m",
                "maldth_10_14",
                "maldth_5_9",
                "maldth_5_14",
                "maldth_ov15",
                "maldth_fem_ov15",
                "maldth_mal_ov15",
            ],
        ),
        # AGE-GROUP SPECIFIC AGGREGATIONS
        # Tested cases by age group (HF only)
        test_hf_u5=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
            ],
        ),
        test_hf_5_14=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
            ],
        ),
        test_hf_ov15=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        # Tested cases by age group (Community only)
        test_com_u5=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_u5_com", "tes_pos_rdt_u5_com"]
        ),
        test_com_5_14=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_5_14_com", "tes_pos_rdt_5_14_com"]
        ),
        test_com_ov15=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_ov15_com", "tes_pos_rdt_ov15_com"]
        ),
        # Suspected cases by age group (rename for consistency)
        susp_hf_u5=lambda x: x["susp_u5_hf"],
        susp_hf_5_14=lambda x: x["susp_5_14_hf"],
        susp_hf_ov15=lambda x: x["susp_ov15_hf"],
        susp_com_u5=lambda x: x["susp_u5_com"],
        susp_com_5_14=lambda x: x["susp_5_14_com"],
        susp_com_ov15=lambda x: x["susp_ov15_com"],
        # Confirmed cases by age group (HF only)
        conf_hf_u5=lambda x: fallback_row_sum(
            x, ["test_pos_mic_u5_hf", "tes_pos_rdt_u5_hf"]
        ),
        conf_hf_5_14=lambda x: fallback_row_sum(
            x, ["test_pos_mic_5_14_hf", "tes_pos_rdt_5_14_hf"]
        ),
        conf_hf_ov15=lambda x: fallback_row_sum(
            x, ["test_pos_mic_ov15_hf", "tes_pos_rdt_ov15_hf"]
        ),
        # Confirmed cases by age group (Community only)
        conf_com_u5=lambda x: x["tes_pos_rdt_u5_com"],
        conf_com_5_14=lambda x: x["tes_pos_rdt_5_14_com"],
        conf_com_ov15=lambda x: x["tes_pos_rdt_ov15_com"],
        # Treated cases by age group (HF only)
        maltreat_hf_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_hf", "maltreat_ov24_u5_hf"]
        ),
        maltreat_hf_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_hf", "maltreat_ov24_5_14_hf"]
        ),
        maltreat_hf_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_hf", "maltreat_ov24_ov15_hf"]
        ),
        # Treated cases by age group (Community only)
        maltreat_com_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_com", "maltreat_ov24_u5_com"]
        ),
        maltreat_com_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_com", "maltreat_ov24_5_14_com"]
        ),
        maltreat_com_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_com", "maltreat_ov24_ov15_com"]
        ),
        # Total treated cases within/after 24 hours
        maltreat_u24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_u24_u5_hf", "maltreat_u24_5_14_hf", "maltreat_u24_ov15_hf"],
        ),
        maltreat_ov24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_ov24_u5_hf", "maltreat_ov24_5_14_hf", "maltreat_ov24_ov15_hf"],
        ),
        maltreat_u24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_u24_ov15_com",
            ],
        ),
        maltreat_ov24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_ov24_u5_com",
                "maltreat_ov24_5_14_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
    )
    .assign(
        # second pass: computed from first pass columns
        test=lambda x: fallback_row_sum(x, ["test_hf", "test_com"]),
        conf=lambda x: fallback_row_sum(x, ["conf_hf", "conf_com"]),
        maltreat=lambda x: fallback_row_sum(x, ["maltreat_hf", "maltreat_com"]),
        # totals by age group
        test_u5=lambda x: fallback_row_sum(x, ["test_hf_u5", "test_com_u5"]),
        test_5_14=lambda x: fallback_row_sum(x, ["test_hf_5_14", "test_com_5_14"]),
        test_ov15=lambda x: fallback_row_sum(x, ["test_hf_ov15", "test_com_ov15"]),
        susp_u5=lambda x: fallback_row_sum(x, ["susp_hf_u5", "susp_com_u5"]),
        susp_5_14=lambda x: fallback_row_sum(x, ["susp_hf_5_14", "susp_com_5_14"]),
        susp_ov15=lambda x: fallback_row_sum(x, ["susp_hf_ov15", "susp_com_ov15"]),
        conf_u5=lambda x: fallback_row_sum(x, ["conf_hf_u5", "conf_com_u5"]),
        conf_5_14=lambda x: fallback_row_sum(x, ["conf_hf_5_14", "conf_com_5_14"]),
        conf_ov15=lambda x: fallback_row_sum(x, ["conf_hf_ov15", "conf_com_ov15"]),
        maltreat_u5=lambda x: fallback_row_sum(
            x, ["maltreat_hf_u5", "maltreat_com_u5"]
        ),
        maltreat_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_hf_5_14", "maltreat_com_5_14"]
        ),
        maltreat_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_hf_ov15", "maltreat_com_ov15"]
        ),
        maltreat_u24_total=lambda x: fallback_row_sum(
            x, ["maltreat_u24_hf", "maltreat_u24_com"]
        ),
        maltreat_ov24_total=lambda x: fallback_row_sum(
            x, ["maltreat_ov24_hf", "maltreat_ov24_com"]
        ),
        # presumed cases
        pres_com=lambda x: fallback_diff(x, "maltreat_com", "conf_com"),
        pres_hf=lambda x: fallback_diff(x, "maltreat_hf", "conf_hf"),
        pres_com_u5=lambda x: fallback_diff(x, "maltreat_com_u5", "conf_com_u5"),
        pres_com_5_14=lambda x: fallback_diff(x, "maltreat_com_5_14", "conf_com_5_14"),
        pres_com_ov15=lambda x: fallback_diff(x, "maltreat_com_ov15", "conf_com_ov15"),
        pres_hf_u5=lambda x: fallback_diff(x, "maltreat_hf_u5", "conf_hf_u5"),
        pres_hf_5_14=lambda x: fallback_diff(x, "maltreat_hf_5_14", "conf_hf_5_14"),
        pres_hf_ov15=lambda x: fallback_diff(x, "maltreat_hf_ov15", "conf_hf_ov15"),
    )
    .assign(
        # third pass: totals from presumed
        pres=lambda x: fallback_row_sum(x, ["pres_com", "pres_hf"]),
        pres_u5=lambda x: fallback_row_sum(x, ["pres_com_u5", "pres_hf_u5"]),
        pres_5_14=lambda x: fallback_row_sum(x, ["pres_com_5_14", "pres_hf_5_14"]),
        pres_ov15=lambda x: fallback_row_sum(x, ["pres_com_ov15", "pres_hf_ov15"]),
    )
)

# inspect results
(
    dhis2_df[
        dhis2_df["record_id"].isin(
            ["6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842"]
        )
    ][
        [
            "allout",
            "allout_u5",
            "allout_ov5",
            "pres_hf_u5",
            "maltreat_hf_u5",
            "conf_hf_u5",
        ]
    ]
    .sort_values("allout")
    .head(6)
)

#### Step 6.2: Quality control of indicator totals -----------------------------

# create mismatch flags for each indicator group
# helper function to count mismatches, ignoring NAs
def count_mismatch(total, components):
    """Count mismatches between total and sum of components, ignoring NAs."""
    calculated = components.sum(axis=1)
    # only compare where both total and calculated are not NA
    mask = total.notna() & calculated.notna()
    return (total[mask] != calculated[mask]).sum()

mismatch_summary = pd.DataFrame({
    # outpatient checks
    "allout_mismatch": [
        count_mismatch(
            dhis2_df["allout"],
            dhis2_df[["allout_u5", "allout_ov5"]]
        )
    ],

    # malaria admissions check
    "maladm_mismatch": [
        count_mismatch(
            dhis2_df["maladm"],
            dhis2_df[["maladm_u5", "maladm_5_14", "maladm_ov15"]]
        )
    ],

    # total tests check
    "test_mismatch": [
        count_mismatch(
            dhis2_df["test"],
            dhis2_df[["test_hf", "test_com"]]
        )
    ],

    # total confirmed check
    "conf_mismatch": [
        count_mismatch(
            dhis2_df["conf"],
            dhis2_df[["conf_hf", "conf_com"]]
        )
    ],

    # total treated check
    "maltreat_mismatch": [
        count_mismatch(
            dhis2_df["maltreat"],
            dhis2_df[["maltreat_hf", "maltreat_com"]]
        )
    ],

    # total presumed check
    "pres_mismatch": [
        count_mismatch(
            dhis2_df["pres"],
            dhis2_df[["pres_hf", "pres_com"]]
        )
    ],

    # malaria deaths check
    "maldth_mismatch": [
        count_mismatch(
            dhis2_df["maldth"],
            dhis2_df[[
                "maldth_1_59m", "maldth_u5", "maldth_5_9", "maldth_10_14",
                "maldth_5_14", "maldth_fem_ov15", "maldth_mal_ov15", "maldth_ov15"
            ]]
        )
    ],

    # tested cases by age group
    "test_u5_mismatch": [
        count_mismatch(
            dhis2_df["test_u5"],
            dhis2_df[["test_hf_u5", "test_com_u5"]]
        )
    ],
    "test_5_14_mismatch": [
        count_mismatch(
            dhis2_df["test_5_14"],
            dhis2_df[["test_hf_5_14", "test_com_5_14"]]
        )
    ],
    "test_ov15_mismatch": [
        count_mismatch(
            dhis2_df["test_ov15"],
            dhis2_df[["test_hf_ov15", "test_com_ov15"]]
        )
    ],

    # confirmed cases by age group
    "conf_u5_mismatch": [
        count_mismatch(
            dhis2_df["conf_u5"],
            dhis2_df[["conf_hf_u5", "conf_com_u5"]]
        )
    ],
    "conf_5_14_mismatch": [
        count_mismatch(
            dhis2_df["conf_5_14"],
            dhis2_df[["conf_hf_5_14", "conf_com_5_14"]]
        )
    ],
    "conf_ov15_mismatch": [
        count_mismatch(
            dhis2_df["conf_ov15"],
            dhis2_df[["conf_hf_ov15", "conf_com_ov15"]]
        )
    ],

    # presumed cases by age group
    "pres_u5_mismatch": [
        count_mismatch(
            dhis2_df["pres_u5"],
            dhis2_df[["pres_hf_u5", "pres_com_u5"]]
        )
    ],
    "pres_5_14_mismatch": [
        count_mismatch(
            dhis2_df["pres_5_14"],
            dhis2_df[["pres_hf_5_14", "pres_com_5_14"]]
        )
    ],
    "pres_ov15_mismatch": [
        count_mismatch(
            dhis2_df["pres_ov15"],
            dhis2_df[["pres_hf_ov15", "pres_com_ov15"]]
        )
    ],

    # suspected cases by age group
    "susp_u5_mismatch": [
        count_mismatch(
            dhis2_df["susp_u5"],
            dhis2_df[["susp_hf_u5", "susp_com_u5"]]
        )
    ],
    "susp_5_14_mismatch": [
        count_mismatch(
            dhis2_df["susp_5_14"],
            dhis2_df[["susp_hf_5_14", "susp_com_5_14"]]
        )
    ],
    "susp_ov15_mismatch": [
        count_mismatch(
            dhis2_df["susp_ov15"],
            dhis2_df[["susp_hf_ov15", "susp_com_ov15"]]
        )
    ],

    # treated cases by age group
    "maltreat_u5_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u5"],
            dhis2_df[["maltreat_hf_u5", "maltreat_com_u5"]]
        )
    ],
    "maltreat_5_14_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_5_14"],
            dhis2_df[["maltreat_hf_5_14", "maltreat_com_5_14"]]
        )
    ],
    "maltreat_ov15_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov15"],
            dhis2_df[["maltreat_hf_ov15", "maltreat_com_ov15"]]
        )
    ],

    # treatment timing checks
    "maltreat_u24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u24_total"],
            dhis2_df[["maltreat_u24_hf", "maltreat_u24_com"]]
        )
    ],
    "maltreat_ov24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov24_total"],
            dhis2_df[["maltreat_ov24_hf", "maltreat_ov24_com"]]
        )
    ]
})

# pivot to long format for easier viewing
mismatch_summary = (
    mismatch_summary
    .melt(var_name="indicator", value_name="n_mismatches")
    # filter to show only indicators with mismatches
    .query("n_mismatches > 0")
)

mismatch_summary

#### Step 6.3: Export rows with incoherent totals ------------------------------

# identify rows with incoherent totals
incoherent_rows = dhis2_df[
    # outpatient checks
    (dhis2_df["allout"] != (dhis2_df["allout_u5"] + dhis2_df["allout_ov5"])) |
    # malaria admissions check
    (dhis2_df["maladm"] != (dhis2_df["maladm_u5"] + dhis2_df["maladm_5_14"] + dhis2_df["maladm_ov15"])) |
    # total tests check
    (dhis2_df["test"] != (dhis2_df["test_hf"] + dhis2_df["test_com"])) |
    # total confirmed check
    (dhis2_df["conf"] != (dhis2_df["conf_hf"] + dhis2_df["conf_com"])) |
    # total treated check
    (dhis2_df["maltreat"] != (dhis2_df["maltreat_hf"] + dhis2_df["maltreat_com"])) |
    # total presumed check
    (dhis2_df["pres"] != (dhis2_df["pres_hf"] + dhis2_df["pres_com"]))
]

# select relevant columns
incoherent_rows = incoherent_rows[
    ["hf", "periodname"] +
    [col for col in incoherent_rows.columns
     if any(x in col for x in ["allout", "maladm", "test", "conf", "maltreat", "pres"])]
]

# set path for saving
output_file = Path(
    here("03_outputs/3.1_validation/tables/sle_incoherent_totals_dhis2.xlsx")
)

# export to xlsx
incoherent_rows.to_excel(output_file, index=False)

#### Step 6.4: Add IPD/OPD specification ---------------------------------------

# option 1: classify based on facility name patterns
def classify_facility(hf_name):
    """Classify facility as IPD or OPD based on name patterns."""
    hf_lower = hf_name.lower() if pd.notna(hf_name) else ""

    # inpatient facilities (hospitals)
    if any(x in hf_lower for x in ["hospital", "hosp"]):
        return "IPD"
    # outpatient facilities (clinics, health posts, community health)
    elif any(x in hf_lower for x in ["chp", "chc", "mchp", "clinic", "health post", "health centre"]):
        return "OPD"
    # default to OPD
    else:
        return "OPD"

dhis2_df["facility_type"] = dhis2_df["hf"].apply(classify_facility)

# option 2: classify based on presence of inpatient indicators
has_ipd = (
    dhis2_df
    .groupby("hf_uid")
    .apply(
        lambda x: ((x["maladm"].notna() & (x["maladm"] > 0)).any() |
                   (x["maldth"].notna() & (x["maldth"] > 0)).any())
    )
    .reset_index(name="has_ipd")
)

dhis2_df = dhis2_df.merge(has_ipd, on="hf_uid", how="left")
dhis2_df["facility_type"] = dhis2_df["has_ipd"].map({True: "IPD", False: "OPD"})
dhis2_df = dhis2_df.drop(columns="has_ipd")

# option 3: use a reference lookup file
facility_lookup = pd.read_excel(
    Path("path/to/facility_classification.xlsx")
)

dhis2_df = dhis2_df.merge(
    facility_lookup[["hf_uid", "facility_type"]],
    on="hf_uid",
    how="left"
)

# check classification distribution
dhis2_df[["hf_uid", "facility_type"]].drop_duplicates()["facility_type"].value_counts()

### Step 7: Finalize Data ------------------------------------------------------

#### Step 7.1: Troubleshoot duplicate HF-month records with different data -----

# identify duplicate hf-month combinations
duplicates = dhis2_df.groupby(["hf_uid", "yearmon"]).filter(lambda x: len(x) > 1)

# count number of duplicate pairs
n_duplicates = duplicates[["hf_uid", "yearmon"]].drop_duplicates().shape[0]

# if duplicates exist, inspect them
if n_duplicates > 0:
    # view duplicate records with key indicators
    duplicate_details = duplicates[
        ["record_id", "allout", "test", "conf", "maltreat"]
    ].sort_values(["hf_uid", "yearmon"])

    print(duplicate_details)

    # export for review with SNT team
    duplicate_details.to_excel(
        Path(
            here(
                "03_outputs/3.1_validation/tables/"
                "sle_duplicate_records_dhis2.xlsx"
            )
        ),
        index=False,
    )

# option 1: keep first record (if duplicates are exact copies)
dhis2_df = dhis2_df.drop_duplicates(subset=["hf_uid", "yearmon"], keep="first")

# option 2: keep record with most complete data
dhis2_df = (
    dhis2_df.assign(
        n_complete=lambda x: x.select_dtypes(include="number").notna().sum(axis=1)
    )
    .sort_values("n_complete", ascending=False)
    .drop_duplicates(subset=["record_id"], keep="first")
    .drop(columns="n_complete")
)

# option 3: sum duplicates (if records represent partial reports)
group_cols = ["hf_uid", "yearmon", "hf", "adm0", "adm1", "adm2", "adm3"]
numeric_cols = dhis2_df.select_dtypes(include="number").columns.tolist()

dhis2_df = dhis2_df.groupby(group_cols, as_index=False)[numeric_cols].sum()

# verify duplicates resolved
n_remaining = dhis2_df.groupby(["record_id"]).filter(lambda x: len(x) > 1).shape[0]

#### Step 7.2: Generate final data dictionary ----------------------------------

# define descriptions for computed/derived columns
computed_vars = pd.DataFrame([
    # time columns
    {"snt_var": "date", "indicator_label": "Report date (YYYY-MM-DD)"},
    {"snt_var": "year", "indicator_label": "Report year"},
    {"snt_var": "month", "indicator_label": "Report month (1-12)"},
    {"snt_var": "yearmon", "indicator_label": "Year-month label (e.g., Jan 2020)"},
    # identifier columns
    {"snt_var": "hf_uid", "indicator_label": "Unique health facility identifier (hash)"},
    {"snt_var": "record_id", "indicator_label": "Unique record identifier (facility + month hash)"},
    {"snt_var": "location_short", "indicator_label": "Location label: adm1 ~ adm2"},
    {"snt_var": "location_full", "indicator_label": "Location label: adm1 ~ adm2 ~ adm3 ~ hf"},
    {"snt_var": "facility_type", "indicator_label": "Facility type (IPD/OPD)"},
    # aggregated totals
    {"snt_var": "allout", "indicator_label": "Total outpatient visits (allout_u5 + allout_ov5)"},
    {"snt_var": "susp", "indicator_label": "Total suspected cases (all ages, HF + COM)"},
    {"snt_var": "test", "indicator_label": "Total tested (test_hf + test_com)"},
    {"snt_var": "test_hf", "indicator_label": "Total tested at health facility"},
    {"snt_var": "test_com", "indicator_label": "Total tested in community"},
    {"snt_var": "conf", "indicator_label": "Total confirmed cases (conf_hf + conf_com)"},
    {"snt_var": "conf_hf", "indicator_label": "Total confirmed cases at health facility"},
    {"snt_var": "conf_com", "indicator_label": "Total confirmed cases in community"},
    {"snt_var": "maltreat", "indicator_label": "Total treated cases (maltreat_hf + maltreat_com)"},
    {"snt_var": "maltreat_hf", "indicator_label": "Total treated cases at health facility"},
    {"snt_var": "maltreat_com", "indicator_label": "Total treated cases in community"},
    {"snt_var": "pres", "indicator_label": "Total presumed cases (pres_hf + pres_com)"},
    {"snt_var": "pres_hf", "indicator_label": "Total presumed cases at health facility"},
    {"snt_var": "pres_com", "indicator_label": "Total presumed cases in community"},
    {"snt_var": "maladm", "indicator_label": "Total malaria admissions (all ages)"},
    {"snt_var": "maldth", "indicator_label": "Total malaria deaths (all ages)"},
    # age-group totals
    {"snt_var": "test_u5", "indicator_label": "Total tested under 5 (HF + COM)"},
    {"snt_var": "test_5_14", "indicator_label": "Total tested 5-14 years (HF + COM)"},
    {"snt_var": "test_ov15", "indicator_label": "Total tested over 15 years (HF + COM)"},
    {"snt_var": "conf_u5", "indicator_label": "Confirmed cases under 5 (HF + COM)"},
    {"snt_var": "conf_5_14", "indicator_label": "Confirmed cases 5-14 years (HF + COM)"},
    {"snt_var": "conf_ov15", "indicator_label": "Confirmed cases over 15 years (HF + COM)"},
    {"snt_var": "maltreat_u5", "indicator_label": "Treated cases under 5 (HF + COM)"},
    {"snt_var": "maltreat_5_14", "indicator_label": "Treated cases 5-14 years (HF + COM)"},
    {"snt_var": "maltreat_ov15", "indicator_label": "Treated cases over 15 years (HF + COM)"},
    {"snt_var": "pres_u5", "indicator_label": "Presumed cases under 5 (HF + COM)"},
    {"snt_var": "pres_5_14", "indicator_label": "Presumed cases 5-14 years (HF + COM)"},
    {"snt_var": "pres_ov15", "indicator_label": "Presumed cases over 15 years (HF + COM)"},
    # add remaining age-group variables as needed...
])

# combine original and computed dictionaries
full_data_dict = (
    pd.concat([data_dict, computed_vars], ignore_index=True)
    .rename(columns={"snt_var": "snt_variable", "indicator_label": "label"})
    [["snt_variable", "label"]]
)

# keep only columns present in final dataset
full_data_dict = (
    full_data_dict[full_data_dict["snt_variable"].isin(dhis2_df.columns)]
    .sort_values("snt_variable")
)

# check
full_data_dict.head(10)

#### Step 7.3: Arrange final column order --------------------------------------

# define column order
id_cols = ["record_id"]
location_cols = ["adm0", "adm1", "adm2", "adm3", "hf", "hf_uid",
                 "location_short", "location_full", "facility_type"]
time_cols = ["date", "yearmon", "year", "month"]

# get remaining columns in original order
ordered_cols = id_cols + location_cols + time_cols
remaining_cols = [col for col in dhis2_df.columns if col not in ordered_cols]

# reorder dataframe
dhis2_df = dhis2_df[ordered_cols + remaining_cols]

# check column order
print(dhis2_df.columns[:20].tolist())

### Step 8: Aggregate and Save Data --------------------------------------------

#### Step 8.1: Save data at the health facility level --------------------------

save_path = Path(
    here("01_data/1.2_epidemiology/1.2a_routine_surveillance/processed")
)

# save the canonical clean dataset under the name that downstream pages
# (missing_data.qmd, active_status.qmd, ...) load from. mirrors the .rds
# the R chunk above writes so both languages see identical data.
dhis2_df.to_parquet(
    save_path / "clean_malaria_routine_data_final.parquet",
    compression="zstd",
    index=False
)

full_data_dict.to_parquet(
    save_path / "clean_malaria_routine_dict.parquet",
    compression="zstd",
    index=False
)

# additional export formats for sharing with non-Python users
dhis2_df.to_excel(
    save_path / "sle_dhis2_health_facility_data.xlsx",
    index=False
)

full_data_dict.to_excel(
    save_path / "sle_dhis2_health_facility_dict.xlsx",
    index=False
)

#### Step 8.2: Aggregate and save data at each admin unit level ----------------

# define numeric columns to sum (excluding HF-level identifiers)
sum_cols = [
    # totals
    "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
    # by location
    "test_hf", "test_com", "conf_hf", "conf_com",
    "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
    # by age group - totals
    "allout_u5", "allout_ov5",
    "test_u5", "test_5_14", "test_ov15",
    "conf_u5", "conf_5_14", "conf_ov15",
    "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
    "pres_u5", "pres_5_14", "pres_ov15",
    "susp_u5", "susp_5_14", "susp_ov15",
    "maladm_u5", "maladm_5_14", "maladm_ov15",
    "maldth_u5", "maldth_5_14", "maldth_ov15",
    # by age and location
    "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
    "test_com_u5", "test_com_5_14", "test_com_ov15",
    "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
    "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
    "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
    "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
    "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
    "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
    # treatment timing
    "maltreat_u24_hf", "maltreat_ov24_hf",
    "maltreat_u24_com", "maltreat_ov24_com",
    "maltreat_u24_total", "maltreat_ov24_total"
]

# function to aggregate at a given admin level
def aggregate_admin(df, group_cols, sum_cols):
    # filter to columns that exist in dataframe
    existing_sum_cols = [c for c in sum_cols if c in df.columns]

    # aggregate
    agg_df = (
        df
        .groupby(group_cols, as_index=False)
        .agg(
            **{col: (col, "sum") for col in existing_sum_cols},
            n_facilities=("hf_uid", "count")
        )
    )

    return agg_df

# aggregate at adm3 level
group_cols_adm3 = ["adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon"]
dhis2_adm3 = aggregate_admin(dhis2_df, group_cols_adm3, sum_cols)
dhis2_adm3["record_id"] = vdigest(
    dhis2_adm3["adm0"] + " " + dhis2_adm3["adm1"] + " " +
    dhis2_adm3["adm2"] + " " + dhis2_adm3["adm3"] + " " +
    dhis2_adm3["yearmon"].astype(str)
)
dhis2_adm3["location_short"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"]
dhis2_adm3["location_full"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"] + " ~ " + dhis2_adm3["adm3"]

# aggregate at adm2 level
group_cols_adm2 = ["adm0", "adm1", "adm2", "year", "month", "yearmon"]
dhis2_adm2 = aggregate_admin(dhis2_df, group_cols_adm2, sum_cols)
dhis2_adm2["record_id"] = vdigest(
    dhis2_adm2["adm0"] + " " + dhis2_adm2["adm1"] + " " +
    dhis2_adm2["adm2"] + " " + dhis2_adm2["yearmon"].astype(str)
)
dhis2_adm2["location_short"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
dhis2_adm2["location_full"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]

# aggregate at adm1 level
group_cols_adm1 = ["adm0", "adm1", "year", "month", "yearmon"]
dhis2_adm1 = aggregate_admin(dhis2_df, group_cols_adm1, sum_cols)
dhis2_adm1["record_id"] = vdigest(
    dhis2_adm1["adm0"] + " " + dhis2_adm1["adm1"] + " " +
    dhis2_adm1["yearmon"].astype(str)
)
dhis2_adm1["location_short"] = dhis2_adm1["adm1"]
dhis2_adm1["location_full"] = dhis2_adm1["adm1"]

# create data dictionaries for each level (filter to relevant columns only)
id_cols = ["n_facilities", "record_id", "location_short", "location_full"]

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_cols = group_cols_adm3 + [c for c in sum_cols if c in dhis2_adm3.columns] + id_cols
adm3_dict = full_data_dict[full_data_dict["snt_variable"].isin(adm3_cols)]

# adm2 dict: excludes adm3 (not present at this level)
adm2_cols = group_cols_adm2 + [c for c in sum_cols if c in dhis2_adm2.columns] + id_cols
adm2_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm2_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm3"]))
]

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_cols = group_cols_adm1 + [c for c in sum_cols if c in dhis2_adm1.columns] + id_cols
adm1_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm1_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm2", "adm3"]))
]

# save adm3 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm3_data.xlsx") as writer:
    dhis2_adm3.to_excel(writer, sheet_name="data", index=False)
    adm3_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm3.to_parquet(save_path / "sle_dhis2_adm3_data.parquet", compression="zstd", index=False)

# save adm2 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm2_data.xlsx") as writer:
    dhis2_adm2.to_excel(writer, sheet_name="data", index=False)
    adm2_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm2.to_parquet(save_path / "sle_dhis2_adm2_data.parquet", compression="zstd", index=False)

# save adm1 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm1_data.xlsx") as writer:
    dhis2_adm1.to_excel(writer, sheet_name="data", index=False)
    adm1_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm1.to_parquet(save_path / "sle_dhis2_adm1_data.parquet", compression="zstd", index=False)
 

©2026 Applied Health Analytics for Delivery and Innovation. All rights reserved