Dev Site — You are viewing the development build. Go to Main Site

  • English
  • Français
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. DHIS2 data preprocessing
  • Code library for subnational tailoring
    English version
  • 1. Getting Started
    • 1.1 About and Contact Information
    • 1.2 For Everyone
    • 1.3 For the SNT Team
    • 1.4 For Analysts
    • 1.5 Producing High-Quality Outputs
  • 2. Data Assembly and Management
    • 2.1 Working with Shapefiles
      • Spatial data overview
      • Basic shapefile use and visualization
      • Shapefile management and customization
      • Merging shapefiles with tabular data
    • 2.2 Health Facilities Data
      • Fuzzy matching of names across datasets
      • Health facility coordinates and point data
    • 2.3 Routine Surveillance Data
      • Routine data extraction
      • DHIS2 data preprocessing
      • Determining active and inactive status
      • Contextual considerations
      • Missing data detection methods
      • Health facility reporting rate
      • Data coherency checks
      • Outlier detection methods
      • Imputation methods
      • Final database
    • 2.4 Stock Data
      • LMIS
    • 2.5 Population Data
      • National population data
      • WorldPop population raster
    • 2.6 National Household Survey Data
      • DHS data overview and preparation
      • Prevalence of malaria infection
      • All-cause child mortality
      • Treatment-seeking rates
      • ITN ownership, access, and usage
      • Wealth quintiles analysis
    • 2.7 Entomological Data
      • Entomological data
    • 2.8 Climate and Environmental Data
      • Climate and environment data extraction from raster
    • 2.9 Modeled Data
      • Generating spatial modeled estimates
      • Working with geospatial model estimates
      • Modeled estimates of malaria mortality and proxies
      • Modeled estimates of entomological indicators
  • 3. Stratification
    • 3.1 Epidemiological Stratification
      • Incidence overview and crude incidence
      • Incidence adjustment 1: incomplete testing
      • Incidence adjustment 2: incomplete reporting
      • Incidence adjustment 3: treatment-seeking
      • Incidence stratification
      • Prevalence and mortality stratification
      • Combined risk categorization
      • Risk categorization REMOVE?
      • Risk categorization REMOVE?
    • 3.2 Stratification of Determinants of Malaria Transmission
      • Seasonality
      • Access to Care
  • 4. Review of Past Interventions
    • 4.1 Case Management
    • 4.2 Routine Interventions
    • 4.3 Campaign Interventions
    • 4.4 Other Interventions
  • 5. Targeting of Interventions
  • 6. Retrospective Analysis
    • 6.1: Trend analysis
  • 7. Urban Microstratification

On this page

  • Overview
  • Key Considerations
    • Troubleshooting
    • Best Practices When Working with Routine Data
    • What a Clean and Well-Structured Dataset Should Look Like
  • Step-by-Step
    • Step 1: Import Packages and Data
      • Step 1.1: Import Packages
      • Step 1.2: Import Data
    • Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts
      • Step 2.1: Verify All Expected Groups Are Present After Import
      • Step 2.2: Check for Completely Missing Variables
    • Step 3: Rename Columns
    • Step 4: Standardise the Date Columns
    • Step 5: Managing Location Columns
      • Steps 5.1: Harmonise Administrative Names
      • Steps 5.1: Create unique identifiers and location labels
    • Step 6: Compute Variables
      • Step 6.1: Compute Indicator Totals and New Variables
      • Step 6.2: Quality Control of Indicator Totals
      • Step 6.3: Export Rows with Incoherent Totals
      • Step 6.3: Add IPD/OPD Specification
    • Step 7: Finalize Data
      • Step 7.1: Troubleshoot Duplicate HF-Month Records with Different Data
      • Step 7.2: Generate Final Data Dictionary
      • Step 7.3: Arrange Final Column Order
    • Step 8: Aggregate and Save Data
      • Step 8.1: Save Data at the Health Facility Level
      • Step 8.2: Aggregate and Save Data at Each Admin Unit Level
  • Summary
  • Full code
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. DHIS2 data preprocessing

DHIS2 data preprocessing

Beginner

Overview

DHIS2 is a core data source for routine malaria surveillance, capturing indicators such as outpatient case counts, diagnostic testing, and treatment provision. However, raw DHIS2 exports may not be ready for SNT analysis if they have structural or quality issues, are messy, include inconsistent or unreadable column names, missing values, mismatched facility names, duplicate records, and formats that vary across months or locations. To make this data usable for SNT, it needs to be cleaned, reshaped, and aligned with a spatial and temporal structure that supports later steps of analysis: identifying active and inactive health facilities, calculating reporting rates, managing outliers, calculating malaria incidence, etc.

In this section, we walk through how to preprocess monthly DHIS2 malaria records to make them SNT-ready, using example data from Sierra Leone. This includes cleaning and standardizing facility-level data, parsing and standardizing date formats, harmonizing administrative names with reference shapefiles for spatial joins, computing derived indicators, and aggregating to administrative levels where needed to produce structured datasets.

The DHIS2 data you are working with may require some or all of the specific steps detailed below, or additional steps may be required to prepare the data for analysis.

Objectives
  • Import and combine DHIS2 data files from multiple sources
  • Verify data completeness after import (check all groups present, identify missing variables)
  • Rename columns using a data dictionary for standardized naming
  • Standardize date columns (parse dates, create year/month/yearmon fields)
  • Harmonize administrative names with reference shapefiles for spatial joins
  • Create unique identifiers (UIDs) for facilities and records
  • Compute indicator totals and derived variables (e.g. presumed cases)
  • Perform quality control checks on computed indicators
  • Identify and resolve duplicate HF-month records
  • Aggregate data at the selected administrative level
  • Export cleaned facility-level and aggregated datasets

Key Considerations

Troubleshooting

Routine surveillance data can be trickier to work with than other data sources that are more standardized. Expect to encounter and resolve data issues throughout the DHIS2 data preprocessing process. You will likely need to:

  • Investigate unexpected patterns in your data
  • Make judgment calls about how to handle inconsistencies
  • Collaborate with your SNT team on country-specific issues
  • Develop custom solutions for problems not covered here

The troubleshooting approaches shown in this guide provide a starting framework, but real-world data cleaning often requires iterative problem-solving and local expertise. While this guide addresses challenges others have encountered when preparing DHIS2 data for analysis, every country is different. Your specific context may present unique problems that are not yet covered in the code library. We encourage you to be persistent and creative if needed and to consult others for help if you are unable to resolve on your own.

If you encounter a new challenge when preprocessing DHIS2 data and would like new content to be added to the code library, please fill out our feedback form.

Best Practices When Working with Routine Data

1. Understand the System’s Context

  • Reporting workflows: Know how data moves from clinics to DHIS2 (paper vs. direct entry, aggregation rules).
  • Indicator definitions: Confirm case defintions and inclusions

2. Triangulate Strategically

Cross-check with:

  • External benchmarks (e.g., HMIS reports, survey data)
  • Parallel systems (e.g., stock management data for testing kits)
  • Temporal patterns (compare dry vs. rainy season trends)

3. Document Your Process

Maintain a troubleshooting log tracking:

  • Issues encountered
  • Data sources consulted
  • Decisions made (with rationale)

Timing lags: Facilities may report late, especially after holidays or system outages.

What a Clean and Well-Structured Dataset Should Look Like

Your data might come in different format to Sierra Leone example below. Whether you are able to use the chunk of code provided, or parts of it, to import your data, please make sure your data, after import, contains the following information:

  • A column with the date (year and month) of the report
  • A column with the health facility name
  • Various columns with parent admin units - at mimimum you will need the name of the admin unit for SNT analysis (for example, chiefdom for Sierra Leone)
  • Your data indicators columns

Below is an example of what the data should look like after import:

adm0 adm1 adm2 adm3 hf date month year allout_u5 allout_ov5 conf_u5 conf_5_14
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-01 01 2021 44 48 36 22
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-02 02 2021 37 27 20 30
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-03 03 2021 44 42 18 22
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-04 04 2021 28 23 30 30
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-05 05 2021 138 141 72 40
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-06 06 2021 127 82 59 16
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-07 07 2021 130 175 86 24
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-08 08 2021 144 203 93 17
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-09 09 2021 159 159 78 18
SIERRA LEONE BO DISTRICT BO CITY COUNCIL BO CITY AETHEL CHP 2021-10 10 2021 88 128 6 80

Step-by-Step

In this section, we walk through the key preprocessing steps needed to get DHIS2 malaria data ready for SNT analysis, using example DHIS2 data from Sierra Leone. The focus here is on cleaning, reshaping, and aggregating facility-level data to produce structured outputs at both the facility and administrative levels.

Each step is designed to guide you through the process cleanly. Make sure to follow the notes in the code, especially where edits are required (e.g. column names or admin levels). The goal is to make the data analysis-ready without breaking linkages to spatial units or reporting structures.

Remember that you will need to adapt this process to the specific country context in which you are working.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Import Packages and Data

We’ll use Sierra Leone DHIS2 malaria data as a working example. Variable names or database structures may differ across countries. Be sure to review and update the variable names to match your country’s DHIS2 configuration.

We begin by installing and loading the required packages for our preprocessing, then import the data.

If you are using python and need help installing packages, please refer to the Getting Started section.

Step 1.1: Import Packages

  • R
  • Python
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# install or load relevant packages
pacman::p_load(
  tidyverse, # data manipulation, reshaping and plotting
  rio,       # import/export multiple file types
  DT,        # interactive table preview
  here,      # project-relative file paths
  readxl,    # read excel files
  writexl,   # write excel files
  knitr      # render code in quarto/rmarkdown
)
#|message: false
#|echo: true
#|eval: true
#| label: import packages - py (fullcode)
import pandas as pd  # load core data tools
from pathlib import Path  # handle file system paths
from pyprojroot import here  # build project-relative paths
import os  # optional path utilities
import locale  # for setting language locale
import geopandas as gpd  # for importing shapefiles
import xxhash  # for hashing
import pyarrow.parquet as pq  # for exportin parquet files
import pyarrow as pa  # for exportin parquet files

Step 1.2: Import Data

The code chunk below defines a helper function to import different types of data, for example Excel or csv data. If your DHIS2 extraction resulted in multiple files, this function will merge them together into a single dataset. Then we use it to read in our data file(s).

When calling the DHIS2 data import function, we read files from a folder using the here package. The here package builds paths from the project root, keeping code portable and reliable.

Inspect Your Data Before Proceeding

Before running any code, take a moment to visually inspect your data. This is especially important when working with DHIS2 data or other routine health sources. As explained in the Data Structures of the Getting Started: For Analysts page, your data should follow tidy data principles: each column represents a variable, each row an observation, and each cell contains a single value.

To check this, open the Excel or CSV file directly and review:

  • Headers: Are the column names in the first row? Are they clear and consistent?
  • Columns and rows:
    • Each column should represent a variable (e.g. health_facility, month, confirmed_cases).
    • Each row should represent a single observation (e.g. one facility-month).
  • Cells: Each cell should contain a single value (not merged or grouped). Avoid cells that contain multiple values (e.g. 10/5 for tested/positive).
  • Extra rows: Remove any top rows used for titles, notes, or merged headers.

If your dataset is already tidy, you’re ready to move on. If not, try to clean it up before working with it. You can do this:

  • In Excel, by editing a copy of the file and removing any extra formatting, then saving your new copy to import and work with in later steps.

  • Or in R, using packages like:

    • rio::import(file, skip = X) to skip X number of extra header rows
    • tidyr::pivot_longer() to reshape wide data into long format
    • dplyr::mutate() and separate() to split combined values into separate columns
  • Or in Python, using:

    • pandas.read_excel(file, skiprows=X) to skip X number of unwanted header rows
    • pandas.melt() to reshape wide data into long format
    • pandas.Series.str.split() and pandas.assign() to split and create new columns

For more complex formatting issues, like multi-row or merged headers, see this excellent step-by-step guide: Tidying Multi-Header Excel Data with R by Paul Campbell (2019).

Getting this right early will save time and reduce confusion later in the workflow, as all the subsequent steps assume that the data is already in tidy format and ready for analysis.

  • R
  • Python

Set the paths to the DHIS2 data initially.

# set the path to the data
core_routine_path <- here::here(
  "1.1.2_epidemiology",
  "1.1.2a_routine_surveillance",
  "raw"
)

# set the full dhis2 path
path_to_dhis2 <- here::here(core_routine_path, "sle_dhis2_2015_2022.xlsx")

If the dataset is a single Excel spreadsheet, it can be imported directly using rio::import, which reads multiple file formats. This approach assumes all indicators, locations, and time periods are stored in one file

# read the data
dhis2_df <- rio::import(file = path_to_dhis2)

In cases where the data is spread across multiple tabs, for example split into yearly tabs, use the following approach.

dhis2_df <- rio::import_list(
  file = path_to_dhis2,
  rbind = TRUE,        # stack all tabs into one table
  rbind_label = "year" # tab name stored in column 'year'
)

DHIS2 exports can be difficult when data spans many facilities or years. File size and memory use increase when stacking many tabs or files. For this reason, data may be exported across multiple files or sheets. If the data is in a zipped archive, or split into multiple files in the same format, import them together with rio::import_list() to import them together. This assumes all files share the same structure, column names, and data types. It also assumes sheets contain rectangular data only, and the file extension is consistent so rio can detect the format. Examples of this include district or admin named files exported by year ranges, such as Bombali_District_2015-2023.xls or Kambia_District_2015-2023.xls.

dhis2_df <- rio::import_list(
  file = list.files(
    path = core_routine_path,
    pattern = "\\.(xls)$",    # the file extensions
    full.names = TRUE
  ),
  rbind = TRUE,                # stack all tabs/files into one table
  rbind_label = "sheet_admin", # source name stored in column 'sheet_admin'
  rbind_fill = TRUE            # fill missing columns with NA when stacking
)

# check data
dhis2_df |>
  dplyr::select(1:20) |>
  dplyr::glimpse()
Output
Rows: 184,056
Columns: 10
$ orgunitlevel1                                  <chr> "Sierra Leone", "Sierra…
$ orgunitlevel2                                  <chr> "Bo District", "Bo Dist…
$ orgunitlevel3                                  <chr> "Bo City Council", "Bo …
$ orgunitlevel4                                  <chr> "Bo City", "Bo City", "…
$ orgunitlevel5                                  <chr> "Aethel CHP", "Aethel C…
$ organisationunitname                           <chr> "Aethel CHP", "Aethel C…
$ periodname                                     <chr> "January 2015", "Februa…
$ `OPD (New and follow-up curative) 0-59m_X`     <dbl> NA, NA, NA, NA, NA, NA,…
$ `OPD (New and follow-up curative) 5+y_X`       <dbl> NA, NA, NA, NA, NA, NA,…
$ `Admission - Child with malaria 0-59 months_X` <dbl> NA, NA, NA, NA, NA, NA,…

To adapt the code:

  • Line 3: Set core_routine_path to your target folder.
  • Line 4: Modify the pattern to match your extension (xls, xlsx, csv).
  • Line 8: Set rbind_label = 'sheet_admin' to store the source tab or file, such as admin units, use year for yearly tabs/files.

Set the paths to the DHIS2 data initially.

# set the full dhis2 path
core_routine_path = Path(
    here("1.1.2_epidemiology/1.1.2a_routine_surveillance/raw")
)

# set the full dhis2 path
path_to_dhis2 = core_routine_path / "sle_dhis2_2015_2022.xlsx"

If the dataset is a single Excel file, import it directly using pd.read_excel. This assumes all indicators, locations, and time periods are in one spreadsheet.

# read the data
dhis2_df = pd.read_excel(path_to_dhis2)

In cases where data is stored in multiple tabs, such as yearly splits, load all tabs and stack them into one table. Add a column to keep the tab source, like year.

# import all tabs and stack into a single table
dhis2_df = pd.concat(
    [
        pd.read_excel(path_to_dhis2, sheet_name=s).assign(year=s)
        for s in pd.ExcelFile(path_to_dhis2).sheet_names
    ],
    ignore_index=True,
)
dhis2_df.head(10).style

Cumulative exports across many facilities or years are often split into multiple files. File size and memory use increase when stacking. All files and sheets are assumed to share the same columns and types. Use pandas to detect the sheets, read each file, stack them, and store the file or tab source in a label column, such as sheet_admin. Examples of this include district or admin named files exported by year ranges, such as Bombali_District_2015-2023.xls or Kambia_District_2015-2023.xls.

# set file extension
extension = "xls"

# find all files with specified extension in directory
files = list(core_routine_path.glob(f"*.{extension}"))

# initialise dataframe
dhis2_df = pd.DataFrame()

# iterate over files, concatenate stack as one dataframe
for file in files:
    temp = pd.read_excel(file, sheet_name="Sheet1")
    dhis2_df = pd.concat([dhis2_df, temp])

# Inspect data
dhis2_df.head(10)
  orgunitlevel1  ... Malaria treated with ACT >24 hours 15+y_X
0  Sierra Leone  ...                                       NaN
1  Sierra Leone  ...                                       1.0
2  Sierra Leone  ...                                       NaN
3  Sierra Leone  ...                                       2.0
4  Sierra Leone  ...                                       7.0
5  Sierra Leone  ...                                       5.0
6  Sierra Leone  ...                                       5.0
7  Sierra Leone  ...                                       4.0
8  Sierra Leone  ...                                       7.0
9  Sierra Leone  ...                                       3.0

[10 rows x 56 columns]

To adapt the code:

  • Line 1: Update extension to xls, xlsx, or csv.
  • Line 4: Set core_routine_path to your data folder.
  • Line 11: Remove sheet_name="Sheet1" if your files do not use a fixed tab name.

Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts

Before any cleaning or indicator construction, confirm that the merged dataset contains all groups and components from the original DHIS2 exports.

Step 2.1: Verify All Expected Groups Are Present After Import

DHIS2 exports are often split across several files or sheets. The structure varies by country. Files may be separated by admin units, by years, or by other DHIS2-defined groupings such as monthly batches, subnational regions, wards, partner-assigned districts, or custom DHIS2 groups such as outpatient tallies etc., After importing and appending these extracts into a single dataset, you should run a set of checks to confirm that all components were included correctly and that nothing was missed or duplicated.

The example below uses adm2 (currently orgunitlevel3 in our data) as the grouping variable, but you should replace adm2 with year or any other grouping used in your export.

  • R
  • Python
# check the adm's' in our data
dhis2_df |>
  dplyr::distinct(orgunitlevel3)
                         orgunitlevel3
1                      Bo City Council
2                  Bo District Council
3             Bombali District Council
4                  Makeni City Council
5              Bonthe District Council
6             Bonthe Municipal Council
7              Falaba District Council
8            Kailahun District Council
9              Kambia District Council
10             Karene District Council
11                 Kenema City Council
12             Kenema District Council
13          Koinadugu District Council
14               Kono District Council
15     Koidu New Sembehun City Council
16            Moyamba District Council
17          Port Loko District Council
18              Port Loko City Council
19            Pujehun District Council
20          Tonkolili District Council
21 Western Area Rural District Council
22               Freetown City Council

To adapt the code:

  • Line 3: Replace orgunitlevel3 with the grouping used in your export (for example, adm2, adm1, year, ward, district, or any custom DHIS2 grouping).
# check the adm's' in our data
dhis2_df[["orgunitlevel3"]].drop_duplicates().reset_index(drop=True)
                          orgunitlevel3
0              Moyamba District Council
1              Bombali District Council
2                   Makeni City Council
3               Kambia District Council
4            Tonkolili District Council
5            Port Loko District Council
6                Port Loko City Council
7               Karene District Council
8            Koinadugu District Council
9                 Kono District Council
10      Koidu New Sembehun City Council
11  Western Area Rural District Council
12            Kailahun District Council
13             Pujehun District Council
14                  Kenema City Council
15              Kenema District Council
16              Bonthe District Council
17             Bonthe Municipal Council
18                      Bo City Council
19                  Bo District Council
20              Falaba District Council
21                Freetown City Council

To adapt the code:

  • Line 2: Replace orgunitlevel3 with the grouping used in your export (for example, adm2, adm1, year, ward, district, or any custom DHIS2 grouping).

This diagnostic checks that all expected groups from your extracts appear in the merged dataset, whether files were split by admin units, years, wards, or any other DHIS2 grouping.

If you imported 15 files or sheets, you should see 15 distinct groups. Fewer groups usually indicate a missing file, inconsistent tab names, or spelling differences. Row counts will vary across groups, so focus on group presence, not size.

If groups are missing, confirm that all source files are present, sheet names follow a consistent pattern, and group names match across extracts. After fixing any issues and re-importing, continue with the preprocessing steps.

Step 2.2: Check for Completely Missing Variables

Some variables, such as health facility name, reporting date, and administrative unit, must not be missing. Rows with missing values in these fields cannot be used directly and should be investigated with the DHIS2 focal person.

Completely missing columns should be reviewed before continuing. They may reflect changes in DHIS2 indicators, missing files, or indicators that were never collected. Check with the SNT team to confirm whether the variable should be kept, renamed, or removed. Resolving this early prevents issues in all downstream SNT steps.

The code below prints the number of missing values for each variable and highlights any missingness in the critical identifiers needed for SNT analysis.

  • R
  • Python
# identify complete missing values per column
dhis2_df |>
  dplyr::summarise(
    dplyr::across(
      dplyr::everything(),
      ~ mean(is.na(.x)) * 100
    )
  ) |>
  tidyr::pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "pct_missing"
  ) |>
  dplyr::filter(pct_missing == 100)
# A tibble: 0 × 2
# ℹ 2 variables: variable <chr>, pct_missing <dbl>
# compute percent missing for each column
pct_missing = dhis2_df.isna().mean() * 100

# convert to tidy/long format
missing_table = (
    pct_missing.reset_index()
    .rename(columns={"index": "variable", 0: "pct_missing"})
)

# filter for completely missing columns
missing_table[missing_table["pct_missing"] == 100]
Empty DataFrame
Columns: [variable, pct_missing]
Index: []

Do not change or adapt the code above.

Step 3: Rename Columns

A practical way to standardise and renames DHIS2 column names is to use a data dictionary. The dictionary contains at least two columns: 1. the original DHIS2 column names as they appear in the raw export 2. the corresponding SNT column names you prefer to use

The SNT names are typically shorter and more consistent, which helps maintain a clear structure across the analytical pipeline.

Data Dictionary
indicator_label snt_var
orgunitlevel1 adm0
orgunitlevel2 adm1
orgunitlevel3 adm2
orgunitlevel4 adm3
organisationunitname hf
OPD (New and follow-up curative) 0-59m_X allout_u5
OPD (New and follow-up curative) 5+y_X allout_ov5
Admission - Child with malaria 0-59 months_X maladm_u5
Admission - Child with malaria 5-14 years_X maladm_5_14
Admission - Malaria 15+ years_X maladm_ov15
Child death - Malaria 1-59m_X maldth_1_59m
Child death - Malaria 10-14y_X maldth_10_14
Child death - Malaria 5-9y_X maldth_5_9
Death malaria 15+ years Female maldth_fem_ov15
Death malaria 15+ years Male maldth_mal_ov15
Separation - Child with malaria 0-59 months_X Death maldth_u5
Separation - Child with malaria 5-14 years_X Death maldth_5_14
Separation - Malaria 15+ years_X Death maldth_ov15
Fever case - suspected Malaria 0-59m_X susp_u5_hf
Fever case - suspected Malaria 5-14y_X susp_5_14_hf
Fever case - suspected Malaria 15+y_X susp_ov15_hf
Fever case in community (Suspected Malaria) 0-59m_X susp_u5_com
Fever case in community (Suspected Malaria) 5-14y_X susp_5_14_com
Fever case in community (Suspected Malaria) 15+y_X susp_ov15_com
Fever case in community tested for Malaria (RDT) - Negative 0-59m_X tes_neg_rdt_u5_com
Fever case in community tested for Malaria (RDT) - Positive 0-59m_X tes_pos_rdt_u5_com
Fever case in community tested for Malaria (RDT) - Negative 5-14y_X tes_neg_rdt_5_14_com
Fever case in community tested for Malaria (RDT) - Positive 5-14y_X tes_pos_rdt_5_14_com
Fever case in community tested for Malaria (RDT) - Negative 15+y_X tes_neg_rdt_ov15_com
Fever case in community tested for Malaria (RDT) - Positive 15+y_X tes_pos_rdt_ov15_com
Fever case tested for Malaria (Microscopy) - Negative 0-59m_X test_neg_mic_u5_hf
Fever case tested for Malaria (Microscopy) - Positive 0-59m_X test_pos_mic_u5_hf
Fever case tested for Malaria (Microscopy) - Negative 5-14y_X test_neg_mic_5_14_hf
Fever case tested for Malaria (Microscopy) - Positive 5-14y_X test_pos_mic_5_14_hf
Fever case tested for Malaria (Microscopy) - Negative 15+y_X test_neg_mic_ov15_hf
Fever case tested for Malaria (Microscopy) - Positive 15+y_X test_pos_mic_ov15_hf
Fever case tested for Malaria (RDT) - Negative 0-59m_X tes_neg_rdt_u5_hf
Fever case tested for Malaria (RDT) - Positive 0-59m_X tes_pos_rdt_u5_hf
Fever case tested for Malaria (RDT) - Negative 5-14y_X tes_neg_rdt_5_14_hf
Fever case tested for Malaria (RDT) - Positive 5-14y_X tes_pos_rdt_5_14_hf
Fever case tested for Malaria (RDT) - Negative 15+y_X tes_neg_rdt_ov15_hf
Fever case tested for Malaria (RDT) - Positive 15+y_X tes_pos_rdt_ov15_hf
Malaria treated in community with ACT <24 hours 0-59m_X maltreat_u24_u5_com
Malaria treated in community with ACT >24 hours 0-59m_X maltreat_ov24_u5_com
Malaria treated in community with ACT <24 hours 5-14y_X maltreat_u24_5_14_com
Malaria treated in community with ACT >24 hours 5-14y_X maltreat_ov24_5_14_com
Malaria treated in community with ACT <24 hours 15+y_X maltreat_u24_ov15_com
Malaria treated in community with ACT >24 hours 15+y_X maltreat_ov24_ov15_com
Malaria treated with ACT <24 hours 0-59m_X maltreat_u24_u5_hf
Malaria treated with ACT >24 hours 5-14y_X maltreat_ov24_5_14_hf
Malaria treated with ACT <24 hours 5-14y_X maltreat_u24_5_14_hf
Malaria treated with ACT >24 hours 15+y_X maltreat_ov24_ov15_hf
Malaria treated with ACT >24 hours 0-59m_X maltreat_ov24_u5_hf
Malaria treated with ACT <24 hours 15+y_X maltreat_u24_ov15_hf

Keeping this dictionary in an external file (for example, Excel or CSV) allows others to review and update it, and national teams can confirm indicator naming directly. This also reduces the risk of errors from hard-coded renaming rules, which may break when column formats change or when scripts are modified.

Using a dictionary creates a shared reference for DHIS2 renaming and helps maintain a consistent naming approach throughout the SNT workflow.

Consult SNT team

Review variable definitions for your country to better understand and appropriately analyze the data. Consider investigating questions like:

  • Are admissions included in outpatients?
  • Are data on pregnant women included in the data for adults?
  • Is there double counting between RDT and microscopy results? If yes, is it appropriate to only use RDT results?
  • Are data from the private sector included here? If so, what percentage of the private sector reports into DHIS2?
  • Are data from commmunity health workers included in their assigned health facility data or are data separate?
  • Have any variables been included or adapted throughout the years? If so, how should they be treated throughout the time series?

Always confirm variable definitions with the SNT team.

  • R
  • Python
# import data dictionary
data_dict <- rio::import(
  here::here(core_routine_path, "sle_dhis2_data_dict.xlsx")
)
# create rename vector: object = old names, names = new names
rename_vector <- stats::setNames(
  object = data_dict$indicator_label,
  nm = data_dict$snt_var
)

# rename dhis2 data
dhis2_df <- dhis2_df |>
  dplyr::rename(
    !!!rename_vector
  ) |>
  # we drop orgunitlevel5 because it s same as hf
  dplyr::select(-orgunitlevel5)

# check the names
colnames(dhis2_df)
Output
 [1] "adm0"                   "adm1"                   "adm2"                  
 [4] "adm3"                   "hf"                     "periodname"            
 [7] "allout_u5"              "allout_ov5"             "maladm_u5"             
[10] "maladm_5_14"            "maladm_ov15"            "maldth_u5"             
[13] "maldth_5_14"            "maldth_ov15"            "susp_u5_hf"            
[16] "susp_5_14_hf"           "susp_ov15_hf"           "susp_u5_com"           
[19] "susp_5_14_com"          "susp_ov15_com"          "maldth_fem_ov15"       
[22] "maldth_mal_ov15"        "maldth_1_59m"           "maldth_10_14"          
[25] "maldth_5_9"             "tes_neg_rdt_u5_com"     "tes_pos_rdt_u5_com"    
[28] "tes_neg_rdt_5_14_com"   "tes_pos_rdt_5_14_com"   "tes_neg_rdt_ov15_com"  
[31] "tes_pos_rdt_ov15_com"   "test_neg_mic_u5_hf"     "test_pos_mic_u5_hf"    
[34] "test_neg_mic_5_14_hf"   "test_pos_mic_5_14_hf"   "test_neg_mic_ov15_hf"  
[37] "test_pos_mic_ov15_hf"   "tes_neg_rdt_u5_hf"      "tes_pos_rdt_u5_hf"     
[40] "tes_neg_rdt_5_14_hf"    "tes_pos_rdt_5_14_hf"    "tes_neg_rdt_ov15_hf"   
[43] "tes_pos_rdt_ov15_hf"    "maltreat_u24_u5_com"    "maltreat_ov24_u5_com"  
[46] "maltreat_u24_5_14_com"  "maltreat_ov24_5_14_com" "maltreat_u24_ov15_com" 
[49] "maltreat_ov24_ov15_com" "maltreat_u24_u5_hf"     "maltreat_ov24_u5_hf"   
[52] "maltreat_u24_5_14_hf"   "maltreat_ov24_5_14_hf"  "maltreat_u24_ov15_hf"  
[55] "maltreat_ov24_ov15_hf"  "sheet_admin"           

To adapt the code:

  • Line 3: Point to your own dictionary file and folder.
  • Lines 7–8: Update the dictionary column names for original and SNT variables.
  • Line 12: Replace dhis2_df with your DHIS2 dataset.
  • Line 17: Remove or change orgunitlevel5 if this column is not present.
# import data dictionary
dict_path = os.path.join(core_routine_path, "sle_dhis2_data_dict.xlsx")
data_dict = pd.read_excel(dict_path)

# build rename dictionary: {old_name: new_name}
rename_dict = dict(zip(data_dict["indicator_label"], data_dict["snt_var"]))

# rename dhis2 data
dhis2_df = dhis2_df.rename(columns=rename_dict).drop(
    columns=["orgunitlevel5"], errors="ignore"
)

# view updated column names
dhis2_df.columns.tolist()
Output
['adm0', 'adm1', 'adm2', 'adm3', 'hf', 'periodname', 'allout_u5', 'allout_ov5', 'maladm_u5', 'maladm_5_14', 'maladm_ov15', 'maldth_u5', 'maldth_5_14', 'maldth_ov15', 'susp_u5_hf', 'susp_5_14_hf', 'susp_ov15_hf', 'susp_u5_com', 'susp_5_14_com', 'susp_ov15_com', 'maldth_fem_ov15', 'maldth_mal_ov15', 'maldth_1_59m', 'maldth_10_14', 'maldth_5_9', 'tes_neg_rdt_u5_com', 'tes_pos_rdt_u5_com', 'tes_neg_rdt_5_14_com', 'tes_pos_rdt_5_14_com', 'tes_neg_rdt_ov15_com', 'tes_pos_rdt_ov15_com', 'test_neg_mic_u5_hf', 'test_pos_mic_u5_hf', 'test_neg_mic_5_14_hf', 'test_pos_mic_5_14_hf', 'test_neg_mic_ov15_hf', 'test_pos_mic_ov15_hf', 'tes_neg_rdt_u5_hf', 'tes_pos_rdt_u5_hf', 'tes_neg_rdt_5_14_hf', 'tes_pos_rdt_5_14_hf', 'tes_neg_rdt_ov15_hf', 'tes_pos_rdt_ov15_hf', 'maltreat_u24_u5_com', 'maltreat_ov24_u5_com', 'maltreat_u24_5_14_com', 'maltreat_ov24_5_14_com', 'maltreat_u24_ov15_com', 'maltreat_ov24_ov15_com', 'maltreat_u24_u5_hf', 'maltreat_ov24_u5_hf', 'maltreat_u24_5_14_hf', 'maltreat_ov24_5_14_hf', 'maltreat_u24_ov15_hf', 'maltreat_ov24_ov15_hf']

To adapt the code:

  • Line 2: Update the file path if your dictionary is stored elsewhere.
  • Line 6: Change indicator_label and snt_var to match your dictionary column names.
  • Line 9: Replace dhis2_df with your DHIS2 dataset.
  • Line 10: Adjust or remove "orgunitlevel5" if this column is not present.

The Excel data dictionary used here is based on the Sierra Leone DHIS2 export. If you are working with another country, update the rows in your Excel file so that the “original DHIS2 name” and “SNT name” columns match your indicators.

For example, if your dataset uses “Admission of child with malaria - 5–14yrs”, enter that exact name in the dictionary and set your preferred SNT name beside it. Once the Excel file is updated, the renaming code applies your mapping automatically.

Warning

The dictionary must reflect your country’s DHIS2 export. Add, edit, or remove rows in the Excel file so that it aligns with your indicator naming and reporting structure.

Step 4: Standardise the Date Columns

Most SNT workflows require a consistent set of date variables: a YYYY-MM-DD date column, separate year and month fields, and a yearmon field in YYYY-MM for ordered time-series work. The goal of this step is to take whatever date format appears in the DHIS2 export and convert it into this standard structure.

The steps you apply depend on your starting format. Some formats parse automatically (for example, YYYY-MM-DD). Others, such as "Jan 2020" or French "Janvier 2020", need small additional steps.

The code block below shows how to parse the main date formats seen in DHIS2 exports and convert them into a consistent structure. Once the date field is standardised, creating the year, month, and yearmon fields becomes straightforward.

Parsing Dates
  • R
  • Python

Format already in YYYY-MM-DD

# create dummy data
dates_ex1 <- tibble::tibble(
  raw_date = c("2020-01-15", "2021-03-01", "2022-12-20")
)

# parse dates
dates_ex1 |>
  dplyr::mutate(
    date = lubridate::ymd(raw_date)
  )
Output
# A tibble: 3 × 2
  raw_date   date      
  <chr>      <date>    
1 2020-01-15 2020-01-15
2 2021-03-01 2021-03-01
3 2022-12-20 2022-12-20

Format “Jan 2020” or “January 2020”

# create dummy data
dates_ex2 <- tibble::tibble(
  raw_date = c("Jan 2020", "February 2021", "Mar 2022")
)

# parse dates
dates_ex2 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(
      raw_date,
      orders = c("b Y", "B Y")
    ) |>
      as.Date()
  )
Output
# A tibble: 3 × 2
  raw_date      date      
  <chr>         <date>    
1 Jan 2020      2020-01-01
2 February 2021 2021-02-01
3 Mar 2022      2022-03-01

French month names (for example, “Janvier 2020”)

# create dummy data
dates_ex3 <- tibble::tibble(
  raw_date = c(
    "Janvier 2020",
    "Février 2021",
    "décembre 2022",
    "Jan 2021",
    "Fe 2022",
    "Dec 2023"
  )
)

# parse dates
dates_ex3 |>
  dplyr::mutate(
    date = paste0("01 ", raw_date) |>
      lubridate::dmy(locale = "fr_FR") |>
      as.Date()
  )
Output
# A tibble: 6 × 2
  raw_date      date      
  <chr>         <date>    
1 Janvier 2020  2020-01-01
2 Février 2021  2021-02-01
3 décembre 2022 2022-12-01
4 Jan 2021      2021-01-01
5 Fe 2022       2022-02-01
6 Dec 2023      2023-12-01

Portuguese month names (for example, “Janeiro 2020”)

# create dummy data
dates_ex4 <- tibble::tibble(
  raw_date = c("Janeiro 2020", "fevereiro 2021", "Dezembro 2022",
               "Jan 2021", "Fev 2022", "Dez 2023")
)

# parse dates
dates_ex4 |>
  dplyr::mutate(
    date = paste0("01 ", raw_date) |>
      lubridate::dmy(locale = "pt_PT") |>
      as.Date()
  )
Output
# A tibble: 6 × 2
  raw_date       date      
  <chr>          <date>    
1 Janeiro 2020   2020-01-01
2 fevereiro 2021 2021-02-01
3 Dezembro 2022  2022-12-01
4 Jan 2021       2021-01-01
5 Fev 2022       2022-02-01
6 Dez 2023       2023-12-01

Format “2023-01” or “2023/01”

# create dummy data
dates_ex5 <- tibble::tibble(
  raw_date = c("2020-01", "2021/05", "2023-12")
)

# parse dates
dates_ex5 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(raw_date, orders = c("Y-m", "Y/m")),
    date = lubridate::floor_date(date, unit = "month") |> as.Date()
  )
Output
# A tibble: 3 × 2
  raw_date date      
  <chr>    <date>    
1 2020-01  2020-01-01
2 2021/05  2021-05-01
3 2023-12  2023-12-01

Format “01/2020” or “01-2020” (Month–Year)

# create dummy data
dates_ex6 <- tibble::tibble(
  raw_date = c("01/2020", "06-2021", "11/2022")
)

# parse dates
dates_ex6 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(raw_date, orders = c("m/Y", "m-Y")),
    date = lubridate::floor_date(date, unit = "month") |> as.Date()
  )
Output
# A tibble: 3 × 2
  raw_date date      
  <chr>    <date>    
1 01/2020  2020-01-01
2 06-2021  2021-06-01
3 11/2022  2022-11-01

Compact format “202301” (YYYYMM)

# create dummy data
dates_ex7 <- tibble::tibble(
  raw_date = c("202301", "202102", "202212")
)

dates_ex7 |>
  dplyr::mutate(
    year = substr(raw_date, 1, 4),
    month = substr(raw_date, 5, 6),
    date = lubridate::ymd(sprintf("%s-%s-01", year, month)) |> as.Date()
  )
Output
# A tibble: 3 × 4
  raw_date year  month date      
  <chr>    <chr> <chr> <date>    
1 202301   2023  01    2023-01-01
2 202102   2021  02    2021-02-01
3 202212   2022  12    2022-12-01

Mixed or unclear formats

# create dummy data
dates_ex8 <- tibble::tibble(
  raw_date = c(
    "2020-01-15", # ISO full date
    "Jan 2021",   # English abbreviated month
    "202203",     # compact YYYYMM
    "03/2022",    # month/year with slash
    "2021-07",    # year-month
    "July 2020",  # English full month
    "15-02-2021", # DD-MM-YYYY
    "2020/11/05", # YYYY/MM/DD
    "2020.12.25", # dotted date
    "2022.07",    # YYYY.MM
    "10-2020",    # MM-YYYY
    "20210105",   # YYYYMMDD
    "2020 Jan",   # Year then month (English)
    "2020.01.01"  # dotted YYYY.MM.DD
  )
)

# parse dates
dates_ex8 |>
  dplyr::mutate(
    date = lubridate::parse_date_time(
      raw_date,
      orders = c(
        "Y-m-d", # 2020-01-15
        "Y/m/d", # 2020/11/05
        "Y.m.d", # 2020.12.25
        "b Y",   # Jan 2021
        "B Y",   # January 2021
        "Y b",   # 2020 Jan
        "Y B",   # 2020 January
        "Ym",    # 202203
        "m/Y",   # 03/2022
        "m-Y",   # 10-2020
        "Y-m",   # 2021-07
        "Y.m",   # 2022.07
        "d-m-Y", # 15-02-2021
        "d/m/Y", # 15/02/2021 (if present)
        "Ymd"    #  20210105
      )
    ) |>
      as.Date()
  )
Output
# A tibble: 14 × 2
   raw_date   date      
   <chr>      <date>    
 1 2020-01-15 2020-01-15
 2 Jan 2021   2021-01-01
 3 202203     2022-03-01
 4 03/2022    2022-03-01
 5 2021-07    2021-07-01
 6 July 2020  2020-07-01
 7 15-02-2021 2021-02-15
 8 2020/11/05 2020-11-05
 9 2020.12.25 2020-12-25
10 2022.07    2022-07-01
11 10-2020    2020-10-01
12 20210105   2021-01-05
13 2020 Jan   2020-01-01
14 2020.01.01 2020-01-01

Format already in YYYY-MM-DD

# create dummy data
dates_ex1 = pd.DataFrame({
    "raw_date": ["2020-01-15", "2021-03-01", "2022-12-20"]
})

# parse dates
dates_ex1["date"] = pd.to_datetime(dates_ex1["raw_date"], format="%Y-%m-%d")

dates_ex1
Output
     raw_date       date
0  2020-01-15 2020-01-15
1  2021-03-01 2021-03-01
2  2022-12-20 2022-12-20

Format “Jan 2020” or “January 2020”

# create dummy data
dates_ex2 = pd.DataFrame({
    "raw_date": ["Jan 2020", "February 2021", "Mar 2022"]
})

# parse dates
dates_ex2["date"] = pd.to_datetime(
    dates_ex2["raw_date"],
    format=None,  # allow flexible parsing of abbreviated and full month names
)

dates_ex2
Output
        raw_date       date
0       Jan 2020 2020-01-01
1  February 2021 2021-02-01
2       Mar 2022 2022-03-01

French month names (for example, “Janvier 2020”)

# set French locale (adjust if needed depending on system)
# common options: 'fr_FR.UTF-8', 'fr_FR'
locale.setlocale(locale.LC_TIME, 'fr_FR.UTF-8')

# create dummy data
dates_ex3 = pd.DataFrame({
    "raw_date": [
        "Janvier 2020",
        "Février 2021",
        "décembre 2022"
    ]
})

# ensure day is present for parsing
dates_ex3["date"] = pd.to_datetime(
    "01 " + dates_ex3["raw_date"],
    format="%d %B %Y",
    errors="coerce"
)

# fallback for abbreviated months
mask_missing = dates_ex3["date"].isna()
dates_ex3.loc[mask_missing, "date"] = pd.to_datetime(
    "01 " + dates_ex3.loc[mask_missing, "raw_date"],
    format="%d %b %Y",
    errors="coerce"
)

dates_ex3
Output
'fr_FR.UTF-8'
        raw_date       date
0   Janvier 2020 2020-01-01
1   Février 2021 2021-02-01
2  décembre 2022 2022-12-01

Portuguese month names (for example, “Janeiro 2020”)

# set Portuguese locale
# common options: 'pt_PT.UTF-8', 'pt_PT'
locale.setlocale(locale.LC_TIME, 'pt_PT.UTF-8')

# create dummy data
dates_ex4 = pd.DataFrame({
    "raw_date": [
        "Janeiro 2020",
        "fevereiro 2021",
        "Dezembro 2022",
        "Jan 2021",
        "Fev 2022",
        "Dez 2023"
    ]
})

# first attempt: full Portuguese month names
dates_ex4["date"] = pd.to_datetime(
    "01 " + dates_ex4["raw_date"],
    format="%d %B %Y",
    errors="coerce"
)

# fallback: abbreviated Portuguese month names (e.g. Jan, Fev, Dez)
mask_missing = dates_ex4["date"].isna()
dates_ex4.loc[mask_missing, "date"] = pd.to_datetime(
    "01 " + dates_ex4.loc[mask_missing, "raw_date"],
    format="%d %b %Y",
    errors="coerce"
)

dates_ex4
Output
'pt_PT.UTF-8'
         raw_date       date
0    Janeiro 2020 2020-01-01
1  fevereiro 2021 2021-02-01
2   Dezembro 2022 2022-12-01
3        Jan 2021 2021-01-01
4        Fev 2022 2022-02-01
5        Dez 2023 2023-12-01

Mixed or unclear formats

# create dummy data
dates_ex5 = pd.DataFrame(
    {
        "raw_date": [
            "2020-01-15",  # ISO full date
            "Jan 2021",    # English abbreviated month
            "202203",      # compact YYYYMM
            "03/2022",     # month/year
            "2021-07",     # year-month
            "July 2020",   # English full month
            "15-02-2021",  # DD-MM-YYYY
            "2020/11/05",  # YYYY/MM/DD
            "2020.12.25",  # YYYY.MM.DD
            "2022.07",     # YYYY.MM
            "10-2020",     # MM-YYYY
            "20210105",    # YYYYMMDD
            "2020 Jan",    # year then month
            "2020.01.01",  # dotted full date
        ]
    }
)

# list of possible formats
formats = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%Y.%m.%d",
    "%d-%m-%Y",
    "%d/%m/%Y",
    "%b %Y",
    "%B %Y",
    "%Y %b",
    "%Y %B",
    "%Y-%m",
    "%Y/%m",
    "%Y.%m",
    "%m-%Y",
    "%m/%Y",
    "%Y%m",
    "%Y%m%d",
]


def try_parse(value):
    # try strict formats first
    for fmt in formats:
        try:
            return pd.to_datetime(value, format=fmt)
        except:
            pass
    # fallback: let pandas guess freely
    return pd.to_datetime(value, errors="coerce")


dates_ex5["date"] = dates_ex5["raw_date"].apply(try_parse)

dates_ex5["date"] = dates_ex5["date"].dt.to_period("M").dt.to_timestamp()
dates_ex5
Output
      raw_date       date
0   2020-01-15 2020-01-01
1     Jan 2021 2021-01-01
2       202203 2022-03-01
3      03/2022 2022-03-01
4      2021-07 2021-07-01
5    July 2020 2020-07-01
6   15-02-2021 2021-02-01
7   2020/11/05 2020-11-01
8   2020.12.25 2020-12-01
9      2022.07 2022-07-01
10     10-2020 2020-10-01
11    20210105 2021-01-01
12    2020 Jan 2020-01-01
13  2020.01.01 2020-01-01

For the Sierra Leone dataset, the date field is stored in periodname and appears in formats such as “January 2019” or “February 2019”. We first standardise this into a proper YYYY-MM-DD date. Once the date column is cleaned, we can generate the year, month, and yearmon fields used throughout the SNT workflow.

  • R
  • Python
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
Sys.setlocale("LC_TIME", "en_US.UTF-8")

dhis2_df <- dhis2_df |>
  # parse the raw date into a proper Date object
  dplyr::mutate(
    date = lubridate::parse_date_time(
      periodname,
      orders = c("B Y", "b Y")
    ) |>
      as.Date()
  ) |>
  # create year, month, and ordered yearmon label
  dplyr::mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    # human-readable label, e.g. "Jan 2020"
    yearmon = format(date, "%b %Y"),
    # order factor by actual chronological date
    yearmon = factor(yearmon, levels = unique(yearmon[order(date)]))
  )

# check the head
dhis2_df |>
  dplyr::select(date, year, month, yearmon) |>
  head()
Output
[1] "en_US.UTF-8"
        date year month  yearmon
1 2015-01-01 2015     1 Jan 2015
2 2015-02-01 2015     2 Feb 2015
3 2015-03-01 2015     3 Mar 2015
4 2015-04-01 2015     4 Apr 2015
5 2015-05-01 2015     5 May 2015
6 2015-06-01 2015     6 Jun 2015

To adapt the code:

  • Line 3: Set the locale to match the language of month names in your export, for example "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8".
  • Line 9: Update the column name if your date field is not called periodname.
  • Line 10: Adjust the orders = c("B Y", "b Y") list only if your dates use a different structure.
  • Line 19 Change the yearmon display (for example, "%b %Y" to "%B %Y") based on how you want month names shown.
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
locale.setlocale(locale.LC_TIME, "en_US.UTF-8")

# parse the raw date into a proper datetime object
dhis2_df["date"] = pd.to_datetime(
    dhis2_df["periodname"],
    format="%B %Y",
    errors="coerce"
)

# create year, month, and ordered yearmon columns
dhis2_df["year"] = dhis2_df["date"].dt.year
dhis2_df["month"] = dhis2_df["date"].dt.month
dhis2_df["yearmon"] = dhis2_df["date"].dt.strftime("%b %Y")

# check the output
dhis2_df[["date", "year", "month", "yearmon"]].head()
Output
'en_US.UTF-8'
        date  year  month   yearmon
0 2015-01-01  2015      1  Jan 2015
1 2015-02-01  2015      2  Feb 2015
2 2015-03-01  2015      3  Mar 2015
3 2015-04-01  2015      4  Apr 2015
4 2015-05-01  2015      5  May 2015

To adapt the code:

  • Line 3: Set the locale to match the language of month names in your export, for example "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8".
  • Line 6: Update the column name if your date field is not called periodname.
  • Line 7: Adjust the format="%B %Y" if your dates use a different structure (for example, "%b %Y" for abbreviated months).
  • Line 14: Change the strftime format (for example, "%b %Y" to "%B %Y") based on how you want month names shown.

Step 5: Managing Location Columns

Steps 5.1: Harmonise Administrative Names

Differences in spelling, capitalisation, spacing, or naming conventions between DHIS2 exports and shapefiles are common. These differences will cause joins to fail and lead to missing or duplicated geographic units. Before any spatial join, you should harmonise administrative names used in your DHIS2 dataset with those defined in the shapefile.

A practical way to do this is to use the prep_geonames() function from the sntutils function from the sntutils package. The function detects mismatches, suggests likely matches using string distance methods, and lets you confirm or edit them interactively. It also saves your decisions to a cache file so that harmonisation is reproducible and consistent across analysts and across reruns.

We only show the basic structure here. The full workflow, including interactive matching and scripted reuse, is explained in the Merging shapefiles with tabular data section of the SNT Code Library. For an in-depth blog post based on this method, see Cleaning and standardising geographic names in R with prep_geonames() from the AMMnet Hackathon.

Before any interactive matching, we ensure that the columns are correctly assigned. From observation of the admin names, it appeared that adm1 was actually adm2, so we create a simple lookup to assign districts (adm2) to their respective provinces (adm1).

  • R
  • Python
# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")

# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"

# get adm3 shapfile to use as reference lookup
shp_adm3 <- sntutils::read(
  here::here("english/data_r/shapefiles/sle_spatial_adm3_2021.rds")
) |>
  sf::st_drop_geometry() # drop geometry, we just need the admin names

# standardise admin names
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    adm0 = toupper(adm0),
    adm2 = toupper(adm1),
    adm3 = toupper(adm3),
    hf = toupper(hf),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2 = stringr::str_remove_all(adm2, " DISTRICT"),
    adm3 = stringr::str_remove_all(adm3, " CHIEFDOM")
  ) |>
  dplyr::mutate(
    # assign provinces based on district groupings
    # (no adm1 as current adm1 is adm2)
    adm1 = dplyr::case_when(
      adm2 %in% c("KAILAHUN", "KENEMA", "KONO") ~ "EASTERN",
      adm2 %in% c("BOMBALI", "FALABA", "KOINADUGU", "TONKOLILI") ~ "NORTH EAST",
      adm2 %in% c("KAMBIA", "KARENE", "PORT LOKO") ~ "NORTH WEST",
      adm2 %in% c("BO", "BONTHE", "MOYAMBA", "PUJEHUN") ~ "SOUTHERN",
      adm2 %in% c("WESTERN AREA RURAL", "WESTERN AREA URBAN") ~ "WESTERN"
    )
  )

# harmonise admin names between dhis2 data and shapefile
dhis2_df <-
  sntutils::prep_geonames(
    target_df = dhis2_df,
    lookup_df = lookup_keys,
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    interactive = TRUE,
    cache_path = here::here(cache_path, "geoname_decisions_cache.xlsx")
  )
Output
── ℹ Match Summary ─────────────────────────────────────────────────────────────
ℹ Lookup has extra names not in data.

Target data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 16 matched                                       
• adm3 (level3): 207 out of 207 matched                                     
Lookup data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 16 matched                                       
• adm3 (level3): 207 out of 208 matched                                     
✔ All records matched; process completed. Exiting...

To adapt the code:

  • Line 6: Update cache_path to your preferred location for saving cached files.
  • Lines 9–12: Replace the shapefile path with the path to your reference shapefile.
  • Line 15: Replace dhis2_df with your DHIS2 or target dataset.
  • Lines 15–36: This standardisation block is specific to Sierra Leone. Adapt the toupper(), str_remove_all(), and case_when() logic to match your country’s administrative structure and naming conventions.
  • Line 39: Replace dhis2_df with your target dataframe name.
  • Line 42: Replace lookup_keys with your lookup or reference dataset.
  • Lines 43–46: Adjust level0, level1, level2, and level3 to match your actual admin column names if they differ (e.g., “country”, “region”, “district”, “ward”). Ensure the columns exist in both datasets.
  • Line 48: Update the cache filename if desired.

Once updated, run the code to harmonise your data’s admin names with the shapefile.

# If not already, install the sntutils python package from GitHub
# has a number of useful helper functions
pip install git+https://github.com/ahadi-analytics/sntutils-py.git
from sntutils.geo import prep_geonames # for name matching

# set up location to save cache
cache_path = Path("1.1_foundational/1d_cache_files")

# get adm3 shapefile to use as reference lookup
shp_adm3 = gpd.read_file(
    Path(here("english/data_r/shapefiles/sle_spatial_adm3_2021.geojson"))
).drop(columns="geometry")

# standardise admin names
dhis2_df = dhis2_df.assign(
    adm0=lambda x: x["adm0"].str.upper(),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2=lambda x: x["adm1"]
    .str.upper()
    .str.replace(" DISTRICT", "", regex=False)
    .str.strip(),
    adm3=lambda x: x["adm3"]
    .str.upper()
    .str.replace(" CHIEFDOM", "", regex=False)
    .str.strip(),
    hf=lambda x: x["hf"]
    .str.upper()
)

# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
district_to_province = {
    "KAILAHUN": "EASTERN", "KENEMA": "EASTERN", "KONO": "EASTERN",
    "BOMBALI": "NORTH EAST", "FALABA": "NORTH EAST",
    "KOINADUGU": "NORTH EAST", "TONKOLILI": "NORTH EAST",
    "KAMBIA": "NORTH WEST", "KARENE": "NORTH WEST", "PORT LOKO": "NORTH WEST",
    "BO": "SOUTHERN", "BONTHE": "SOUTHERN",
    "MOYAMBA": "SOUTHERN", "PUJEHUN": "SOUTHERN",
    "WESTERN AREA RURAL": "WESTERN", "WESTERN AREA URBAN": "WESTERN"
}

dhis2_df["adm1"] = dhis2_df["adm2"].map(district_to_province)

# harmonise admin names between dhis2 data and shapefile
# note: sntutils::prep_geonames() has no direct python equivalent
# manual fuzzy matching or exact merge required
dhis2_df = dhis2_df.merge(
    lookup_keys,
    on=["adm0", "adm1", "adm2", "adm3"],
    how="left"
)

# Harmonise admin names between population data and shapefile
dhis2_df = prep_geonames(
    target_df=dhis2_df,
    lookup_df=shp_adm3,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    cache_path=here(cache_path, "geoname_decisions_cache.xlsx")
)
Output
Info: Loaded cache from /Users/mohamedyusuf/ahadi-analytics/code/GitHub/snt-code-library/english/data_r/routine_cases/geoname_decisions_cache.xlsx

ℹ Match Summary

╭───────────────┬─────────┬───────────────┬───────────────╮
│ Level         │ Matched │ Target Data N │ Lookup Data N │
├───────────────┼─────────┼───────────────┼───────────────┤
│ adm0 (level0) │    1    │       1       │       1       │
│ adm1 (level1) │    5    │       5       │       5       │
│ adm2 (level2) │   16    │      16       │      16       │
│ adm3 (level3) │   207   │      207      │      208      │
╰───────────────┴─────────┴───────────────┴───────────────╯
ℹ Lookup has extra names not in data                       

Success: All records matched; process completed. Exiting...

To adapt the code:

  • Line 10: Run once in terminal to install the sntutils-py package from GitHub.
  • Line 11: Import statement for the prep_geonames function.
  • Line 14: Update cache_path to your preferred location for saving cached files.
  • Lines 17–19: Replace the shapefile path with the path to your reference shapefile.
  • Line 22: Replace dhis2_df with your DHIS2 or target dataset.
  • Lines 22–46: This standardisation block is specific to Sierra Leone. Adapt the .str.upper(), .str.replace(), and district_to_province dictionary to match your country’s administrative structure and naming conventions.
  • Lines 51–55: Remove this merge block if using prep_geonames() below, as it duplicates the matching step.
  • Line 58: Replace dhis2_df with your target dataframe name.
  • Line 59: Replace shp_adm3 with your lookup or reference dataset.
  • Lines 60–63: Adjust level0, level1, level2, and level3 to match your actual admin column names if they differ (e.g., “country”, “region”, “district”, “ward”).
    • Ensure the columns exist in both datasets
  • Line 64: Update the cache filename if desired.

Once updated, run the code to harmonise your data’s admin names with the shapefile.

Once the interactive matching is complete and the results are saved, the output message confirms that all administrative levels have been successfully aligned between your dataset and the reference shapefile. If mismatches remain, the prep_geoname function will prompt you to resolve them interactively.

To double-check your decisions, you can share the cache file geoname_decisions_cache.xlsx with the country SNT team to ensure all matching decisions are correct.

Steps 5.1: Create unique identifiers and location labels

Using our corrected admin names, we can now create key columns for downstream analysis and mapping:

  • hf_uid: A unique health facility identifier generated by hashing the full administrative hierarchy and facility name. This ensures each facility has a consistent, reproducible ID.
  • location_short: A concatenated label of province and district (adm1 ~ adm2), useful for summary tables and aggregated views.
  • location_full: The complete administrative hierarchy including chiefdom and facility name, useful for detailed map tooltips and drill-down labelling.
  • record_id: A unique record identifier combining the facility ID and time period, ensuring each facility-month combination is uniquely identifiable.
  • R
  • Python
# create identifiers
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # create unique health facility identifier from admin hierarchy
    hf_uid = sntutils::vdigest(
      paste0(adm0, adm1, adm2, adm3, hf),
      algo = "xxhash32"
    ),
    # create location labels for mapping
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, hf, sep = " ~ "),
    # generate consistent record id (facility + time period)
    record_id = sntutils::vdigest(
      paste(hf_uid, yearmon),
      algo = "xxhash32"
    )
  )

# check
dhis2_df |>
  dplyr::arrange(location_full) |>
  dplyr::distinct(location_full, hf_uid, record_id) |>
  head()
Output
                           location_full   hf_uid record_id
1 EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b  20724f8d
2 EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b  cc1562d5
3 EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b  2d42fbf9
4 EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b  6958bd08
5 EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b  d2cf4c99
6 EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP cff1ec2b  0bcfc2a4

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Line 6: Adjust the paste0() arguments to match your administrative column names (e.g., adm0, adm1, adm2, adm3) and facility column (e.g., hf).
  • Line 10: Modify location_short to include the admin levels you want for summary views.
  • Line 11: Modify location_full to include all admin levels and facility name for detailed labelling.
  • Line 14: Ensure yearmon matches your time period column name. If using a different temporal unit (e.g., year, epiweek), update accordingly.

Once updated, run the code to generate unique identifiers for your health facilities and records.

# helper function to create xxhash32 digest (equivalent to sntutils::vdigest)
def vdigest(x, algo="xxhash32"):
    """Create vectorised hash digest."""
    return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())


# create identifiers
dhis2_df = dhis2_df.assign(
    # create unique health facility identifier from admin hierarchy
    # use spaces to match R's paste() behavior for consistent hashes
    hf_uid=lambda x: vdigest(
        x["adm0"] + " " + x["adm1"] + " " + x["adm2"] + " " + x["adm3"] + " " + x["hf"]
    ),
    # create location labels for mapping
    location_short=lambda x: x["adm1"] + " ~ " + x["adm2"],
    location_full=lambda x: (
        x["adm1"] + " ~ " + x["adm2"] + " ~ " + x["adm3"] + " ~ " + x["hf"]
    ),
)

# generate consistent record id (facility + time period)
# use space to match R's paste() behavior
dhis2_df["record_id"] = vdigest(
    dhis2_df["hf_uid"] + " " + dhis2_df["yearmon"].astype(str)
)

# check
dhis2_df[["location_full", "hf_uid", "record_id"]].drop_duplicates().sort_values(
    "location_full"
).head()
Output
                              location_full    hf_uid record_id
128  EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP  b3bcce19  5a724d8c
96   EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP  b3bcce19  a94e6130
97   EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP  b3bcce19  85fe198a
98   EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP  b3bcce19  ca527347
99   EASTERN ~ KAILAHUN ~ DEA ~ BAIWALA CHP  b3bcce19  a0c8d216

To adapt the code:

  • Line 16: Replace dhis2_df with your target dataset.
  • Lines 18–20: Adjust the concatenated columns to match your administrative column names (e.g., adm0, adm1, adm2, adm3) and facility column (e.g., hf).
  • Line 22: Modify location_short to include the admin levels you want for summary views.
  • Lines 23–25: Modify location_full to include all admin levels and facility name for detailed labelling.
  • Lines 29–31: Ensure yearmon matches your time period column name. If using a different temporal unit (e.g., year, date), update accordingly.

Once updated, run the code to generate unique identifiers for your health facilities and records.

Step 6: Compute Variables

Step 6.1: Compute Indicator Totals and New Variables

Now that we have imported our DHIS2 data and cleaned up the columns, we will compute derived variables to create totals for specific indicators. DHIS2 can provide totals, but these columns must specifically be extracted. In the raw DHIS2 dataset, indicators are typically disaggregated by age group, community, and health facility level. Analysts should strive to obtain the most disaggregated databases possible.

For example, if we want to calculate the total number of outpatient visits, DHIS2 may not provide a direct total, but includes components that can be summed, such as allout_u5 and allout_ov5. Remember to review variable definitions to ensure your aggregations do not double count cases.

The same logic applies to other indicators, including total suspected cases (susp), tested cases (test), confirmed cases (conf), treated cases (maltreat), and presumed cases (pres). The code snippet below shows how to compute these totals by combining the relevant components. The code also includes indicator aggregation by age groups, such as test_hf_u5 which captures all children under five tested by either RDT or microscopy at a health facility.

It is best to verify with the SNT team which specific data elements from DHIS2 should be summed to obtain the correct total. While some calculations may seem obvious, others are not. The box in Step 3 lists example questions that the analyst should get verification for before summing across disaggregated data elements. Depending on your country’s DHIS2 and data reporting practice, other questions may also be relevant.

Consult SNT team

While some countries have presumed cases in their DHIS2 data, not all do! Here are three options for presumed cases that countries have used:

  • Option 1: Your data already has a pres column, no further calculation is needed.
  • Option 2: Calculate presumed cases using the difference between treated cases and confirmed cases: maltreat - conf
  • Option 3: Calculate presumed cases using the difference between suspected cases and tested cases: susp - test

The code below uses Option 2 (lines 52 to 55). Testing is subject to resource availability meaning facilities handle presumed cases differently.

Consult the SNT team to determine how best to approach presumed cases calculations, whether it’s one of the options shown above or a different approach.

  • R
  • Python

Below we create to new functions: fallback_row_sum() and fallback_diff(). The fallback_row_sum() function replaces rowSums(..., na.rm = TRUE), which returns 0 when all values in a row are NA. This default behavior can obscure missingness.

In routine health data, it’s critical to distinguish between:

  • True zeros: e.g. a facility reported 0 cases
  • Missing values: e.g. the facility did not report at all

fallback_row_sum() returns NA when too few values are present, preserving this distinction and preventing overestimation of completeness.

Similarly, fallback_diff() returns the absolute difference if both values are present, the non-missing value if only one is present, and NA if both are missing. It applies pmax() to ensure results meet a minimum threshold.

Both functions guard against misleading outputs when data are incomplete.

# smart row-wise sum with missing data handling
fallback_row_sum <- function(..., min_present = 1, .keep_zero_as_zero = TRUE) {
  vars_matrix <- cbind(...)
  valid_count <- rowSums(!is.na(vars_matrix))
  raw_sum <- rowSums(vars_matrix, na.rm = TRUE)

  ifelse(valid_count >= min_present, raw_sum, NA_real_)
}

# fallback absolute difference between two vectors
fallback_diff <- function(col1, col2, minimum = 0) {
  dplyr::case_when(
    is.na(col1) & is.na(col2) ~ NA_real_,
    is.na(col1) ~ pmax(col2, minimum),
    is.na(col2) ~ pmax(col1, minimum),
    TRUE ~ pmax(col1 - col2, minimum)
  )
}

Now we use these functions to aggregate our columns.

Show the code
# compute indicator totals in DHIS2 data
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # outpatient visits
    allout = fallback_row_sum(allout_u5, allout_ov5),

    # suspected cases
    susp = fallback_row_sum(
      susp_u5_hf,
      susp_5_14_hf,
      susp_ov15_hf,
      susp_u5_com,
      susp_5_14_com,
      susp_ov15_com
    ),

    # tested cases
    test_hf = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    test_com = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com,
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com,
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    test = fallback_row_sum(test_hf, test_com),

    # confirmed cases (HF and COM)
    conf_hf = fallback_row_sum(
      test_pos_mic_u5_hf,
      test_pos_mic_5_14_hf,
      test_pos_mic_ov15_hf,
      tes_pos_rdt_u5_hf,
      tes_pos_rdt_5_14_hf,
      tes_pos_rdt_ov15_hf
    ),

    conf_com = fallback_row_sum(
      tes_pos_rdt_u5_com,
      tes_pos_rdt_5_14_com,
      tes_pos_rdt_ov15_com
    ),

    conf = fallback_row_sum(conf_hf, conf_com),

    # treated cases
    maltreat_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com,
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    maltreat_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf,
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    maltreat = fallback_row_sum(maltreat_hf, maltreat_com),

    # presumed cases
    pres_com = fallback_diff(maltreat_com, conf_com),
    pres_hf = fallback_diff(maltreat_hf, conf_hf),
    pres = fallback_row_sum(pres_com, pres_hf),

    # malaria admissions
    maladm = fallback_row_sum(
      maladm_u5,
      maladm_5_14,
      maladm_ov15
    ),

    # malaria deaths
    maldth = fallback_row_sum(
      maldth_u5,
      maldth_1_59m,
      maldth_10_14,
      maldth_5_9,
      maldth_5_14,
      maldth_ov15,
      maldth_fem_ov15,
      maldth_mal_ov15
    ),

    # AGE-GROUP SPECIFIC AGGREGATIONS
    # Tested cases by age group (HF only)
    test_hf_u5 = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    test_hf_5_14 = fallback_row_sum(
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    test_hf_ov15 = fallback_row_sum(
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # Tested cases by age group (Community only)
    test_com_u5 = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com
    ),

    test_com_5_14 = fallback_row_sum(
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com
    ),

    test_com_ov15 = fallback_row_sum(
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    # Total tested by age group (HF + Community)
    test_u5 = fallback_row_sum(test_hf_u5, test_com_u5),
    test_5_14 = fallback_row_sum(test_hf_5_14, test_com_5_14),
    test_ov15 = fallback_row_sum(test_hf_ov15, test_com_ov15),

    # Suspected cases by age group (HF only)
    susp_hf_u5 = susp_u5_hf,

    susp_hf_5_14 = susp_5_14_hf,

    susp_hf_ov15 = susp_ov15_hf,

    # Suspected cases by age group (Community only)
    susp_com_u5 = susp_u5_com,

    susp_com_5_14 = susp_5_14_com,

    susp_com_ov15 = susp_ov15_com,

    # Total suspected by age group (HF + Community)
    susp_u5 = fallback_row_sum(susp_hf_u5, susp_com_u5),
    susp_5_14 = fallback_row_sum(susp_hf_5_14, susp_com_5_14),
    susp_ov15 = fallback_row_sum(susp_hf_ov15, susp_com_ov15),

    # Confirmed cases by age group (HF only)
    conf_hf_u5 = fallback_row_sum(
      test_pos_mic_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    conf_hf_5_14 = fallback_row_sum(
      test_pos_mic_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    conf_hf_ov15 = fallback_row_sum(
      test_pos_mic_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # Confirmed cases by age group (Community only)
    conf_com_u5 = tes_pos_rdt_u5_com,
    conf_com_5_14 = tes_pos_rdt_5_14_com,
    conf_com_ov15 = tes_pos_rdt_ov15_com,

    # Total confirmed cases by age group (HF + Community)
    conf_u5 = fallback_row_sum(conf_hf_u5, conf_com_u5),
    conf_5_14 = fallback_row_sum(conf_hf_5_14, conf_com_5_14),
    conf_ov15 = fallback_row_sum(conf_hf_ov15, conf_com_ov15),

    # Treated cases by age group (HF only)
    maltreat_hf_u5 = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf
    ),

    maltreat_hf_5_14 = fallback_row_sum(
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf
    ),

    maltreat_hf_ov15 = fallback_row_sum(
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    # Treated cases by age group (Community only)
    maltreat_com_u5 = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com
    ),

    maltreat_com_5_14 = fallback_row_sum(
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com
    ),

    maltreat_com_ov15 = fallback_row_sum(
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    # Total treated cases by age group (HF + Community)
    maltreat_u5 = fallback_row_sum(maltreat_hf_u5, maltreat_com_u5),
    maltreat_5_14 = fallback_row_sum(maltreat_hf_5_14, maltreat_com_5_14),
    maltreat_ov15 = fallback_row_sum(maltreat_hf_ov15, maltreat_com_ov15),

    # Total treated cases within 24 hours (HF only)
    maltreat_u24_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_u24_ov15_hf
    ),

    # Total treated cases after 24 hours (HF only)
    maltreat_ov24_hf = fallback_row_sum(
      maltreat_ov24_u5_hf,
      maltreat_ov24_5_14_hf,
      maltreat_ov24_ov15_hf
    ),

    # Total treated cases within 24 hours (Community only)
    maltreat_u24_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_u24_ov15_com
    ),

    # Total treated cases after 24 hours (Community only)
    maltreat_ov24_com = fallback_row_sum(
      maltreat_ov24_u5_com,
      maltreat_ov24_5_14_com,
      maltreat_ov24_ov15_com
    ),

    # Overall totals (HF + Community)
    maltreat_u24_total = fallback_row_sum(maltreat_u24_hf, maltreat_u24_com),
    maltreat_ov24_total = fallback_row_sum(maltreat_ov24_hf, maltreat_ov24_com),

    # Presumed cases by age group
    pres_com_u5 = fallback_diff(maltreat_com_u5, conf_com_u5),
    pres_com_5_14 = fallback_diff(maltreat_com_5_14, conf_com_5_14),
    pres_com_ov15 = fallback_diff(maltreat_com_ov15, conf_com_ov15),

    pres_hf_u5 = fallback_diff(maltreat_hf_u5, conf_hf_u5),
    pres_hf_5_14 = fallback_diff(maltreat_hf_5_14, conf_hf_5_14),
    pres_hf_ov15 = fallback_diff(maltreat_hf_ov15, conf_hf_ov15),

    pres_u5 = fallback_row_sum(pres_com_u5, pres_hf_u5),
    pres_5_14 = fallback_row_sum(pres_com_5_14, pres_hf_5_14),
    pres_ov15 = fallback_row_sum(pres_com_ov15, pres_hf_ov15)
  )

# check to see the aggregation worked
dhis2_df |>
  dplyr::filter(
    record_id %in%
      c("6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842")
  ) |>
  dplyr::arrange(allout) |>
  dplyr::select(
    allout,
    allout_u5,
    allout_ov5,
    pres_hf_u5,
    maltreat_hf_u5,
    conf_hf_u5
  ) |>
  head()
Output
  allout allout_u5 allout_ov5 pres_hf_u5 maltreat_hf_u5 conf_hf_u5
1      2         2         NA          0             19         19
2     30        NA         30          0             41         58
3    123        NA        123          0            118        118
4    139        86         53         46             NA         46
5    165       156          9         59             NA         59
6    378       256        122         50             50         NA

To adapt the code: - Lines 4–64: Adjust the variable names to reflect those relevant to your dataset when calculating new variables - Once updated, run the code

Show the code
# helper function: row-wise sum that returns value if only one non-NA
def fallback_row_sum(df, cols):
    return df[cols].sum(axis=1, skipna=True, min_count=1)


# helper function: difference with floor at 0
def fallback_diff(df, col1, col2):
    return (df[col1] - df[col2]).clip(lower=0)


dhis2_df = (
    dhis2_df.assign(
        # outpatient visits
        allout=lambda x: fallback_row_sum(x, ["allout_u5", "allout_ov5"]),
        # suspected cases
        susp=lambda x: fallback_row_sum(
            x,
            [
                "susp_u5_hf",
                "susp_5_14_hf",
                "susp_ov15_hf",
                "susp_u5_com",
                "susp_5_14_com",
                "susp_ov15_com",
            ],
        ),
        # tested cases
        test_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        test_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_neg_rdt_u5_com",
                "tes_pos_rdt_u5_com",
                "tes_neg_rdt_5_14_com",
                "tes_pos_rdt_5_14_com",
                "tes_neg_rdt_ov15_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # confirmed cases (HF and COM)
        conf_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_pos_mic_u5_hf",
                "test_pos_mic_5_14_hf",
                "test_pos_mic_ov15_hf",
                "tes_pos_rdt_u5_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        conf_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_pos_rdt_u5_com",
                "tes_pos_rdt_5_14_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # treated cases
        maltreat_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_ov24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_ov24_5_14_com",
                "maltreat_u24_ov15_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
        maltreat_hf=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_hf",
                "maltreat_ov24_u5_hf",
                "maltreat_u24_5_14_hf",
                "maltreat_ov24_5_14_hf",
                "maltreat_u24_ov15_hf",
                "maltreat_ov24_ov15_hf",
            ],
        ),
        # malaria admissions
        maladm=lambda x: fallback_row_sum(
            x, ["maladm_u5", "maladm_5_14", "maladm_ov15"]
        ),
        # malaria deaths
        maldth=lambda x: fallback_row_sum(
            x,
            [
                "maldth_u5",
                "maldth_1_59m",
                "maldth_10_14",
                "maldth_5_9",
                "maldth_5_14",
                "maldth_ov15",
                "maldth_fem_ov15",
                "maldth_mal_ov15",
            ],
        ),
        # AGE-GROUP SPECIFIC AGGREGATIONS
        # Tested cases by age group (HF only)
        test_hf_u5=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
            ],
        ),
        test_hf_5_14=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
            ],
        ),
        test_hf_ov15=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        # Tested cases by age group (Community only)
        test_com_u5=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_u5_com", "tes_pos_rdt_u5_com"]
        ),
        test_com_5_14=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_5_14_com", "tes_pos_rdt_5_14_com"]
        ),
        test_com_ov15=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_ov15_com", "tes_pos_rdt_ov15_com"]
        ),
        # Suspected cases by age group (rename for consistency)
        susp_hf_u5=lambda x: x["susp_u5_hf"],
        susp_hf_5_14=lambda x: x["susp_5_14_hf"],
        susp_hf_ov15=lambda x: x["susp_ov15_hf"],
        susp_com_u5=lambda x: x["susp_u5_com"],
        susp_com_5_14=lambda x: x["susp_5_14_com"],
        susp_com_ov15=lambda x: x["susp_ov15_com"],
        # Confirmed cases by age group (HF only)
        conf_hf_u5=lambda x: fallback_row_sum(
            x, ["test_pos_mic_u5_hf", "tes_pos_rdt_u5_hf"]
        ),
        conf_hf_5_14=lambda x: fallback_row_sum(
            x, ["test_pos_mic_5_14_hf", "tes_pos_rdt_5_14_hf"]
        ),
        conf_hf_ov15=lambda x: fallback_row_sum(
            x, ["test_pos_mic_ov15_hf", "tes_pos_rdt_ov15_hf"]
        ),
        # Confirmed cases by age group (Community only)
        conf_com_u5=lambda x: x["tes_pos_rdt_u5_com"],
        conf_com_5_14=lambda x: x["tes_pos_rdt_5_14_com"],
        conf_com_ov15=lambda x: x["tes_pos_rdt_ov15_com"],
        # Treated cases by age group (HF only)
        maltreat_hf_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_hf", "maltreat_ov24_u5_hf"]
        ),
        maltreat_hf_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_hf", "maltreat_ov24_5_14_hf"]
        ),
        maltreat_hf_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_hf", "maltreat_ov24_ov15_hf"]
        ),
        # Treated cases by age group (Community only)
        maltreat_com_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_com", "maltreat_ov24_u5_com"]
        ),
        maltreat_com_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_com", "maltreat_ov24_5_14_com"]
        ),
        maltreat_com_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_com", "maltreat_ov24_ov15_com"]
        ),
        # Total treated cases within/after 24 hours
        maltreat_u24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_u24_u5_hf", "maltreat_u24_5_14_hf", "maltreat_u24_ov15_hf"],
        ),
        maltreat_ov24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_ov24_u5_hf", "maltreat_ov24_5_14_hf", "maltreat_ov24_ov15_hf"],
        ),
        maltreat_u24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_u24_ov15_com",
            ],
        ),
        maltreat_ov24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_ov24_u5_com",
                "maltreat_ov24_5_14_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
    )
    .assign(
        # second pass: computed from first pass columns
        test=lambda x: fallback_row_sum(x, ["test_hf", "test_com"]),
        conf=lambda x: fallback_row_sum(x, ["conf_hf", "conf_com"]),
        maltreat=lambda x: fallback_row_sum(x, ["maltreat_hf", "maltreat_com"]),
        # totals by age group
        test_u5=lambda x: fallback_row_sum(x, ["test_hf_u5", "test_com_u5"]),
        test_5_14=lambda x: fallback_row_sum(x, ["test_hf_5_14", "test_com_5_14"]),
        test_ov15=lambda x: fallback_row_sum(x, ["test_hf_ov15", "test_com_ov15"]),
        susp_u5=lambda x: fallback_row_sum(x, ["susp_hf_u5", "susp_com_u5"]),
        susp_5_14=lambda x: fallback_row_sum(x, ["susp_hf_5_14", "susp_com_5_14"]),
        susp_ov15=lambda x: fallback_row_sum(x, ["susp_hf_ov15", "susp_com_ov15"]),
        conf_u5=lambda x: fallback_row_sum(x, ["conf_hf_u5", "conf_com_u5"]),
        conf_5_14=lambda x: fallback_row_sum(x, ["conf_hf_5_14", "conf_com_5_14"]),
        conf_ov15=lambda x: fallback_row_sum(x, ["conf_hf_ov15", "conf_com_ov15"]),
        maltreat_u5=lambda x: fallback_row_sum(
            x, ["maltreat_hf_u5", "maltreat_com_u5"]
        ),
        maltreat_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_hf_5_14", "maltreat_com_5_14"]
        ),
        maltreat_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_hf_ov15", "maltreat_com_ov15"]
        ),
        maltreat_u24_total=lambda x: fallback_row_sum(
            x, ["maltreat_u24_hf", "maltreat_u24_com"]
        ),
        maltreat_ov24_total=lambda x: fallback_row_sum(
            x, ["maltreat_ov24_hf", "maltreat_ov24_com"]
        ),
        # presumed cases
        pres_com=lambda x: fallback_diff(x, "maltreat_com", "conf_com"),
        pres_hf=lambda x: fallback_diff(x, "maltreat_hf", "conf_hf"),
        pres_com_u5=lambda x: fallback_diff(x, "maltreat_com_u5", "conf_com_u5"),
        pres_com_5_14=lambda x: fallback_diff(x, "maltreat_com_5_14", "conf_com_5_14"),
        pres_com_ov15=lambda x: fallback_diff(x, "maltreat_com_ov15", "conf_com_ov15"),
        pres_hf_u5=lambda x: fallback_diff(x, "maltreat_hf_u5", "conf_hf_u5"),
        pres_hf_5_14=lambda x: fallback_diff(x, "maltreat_hf_5_14", "conf_hf_5_14"),
        pres_hf_ov15=lambda x: fallback_diff(x, "maltreat_hf_ov15", "conf_hf_ov15"),
    )
    .assign(
        # third pass: totals from presumed
        pres=lambda x: fallback_row_sum(x, ["pres_com", "pres_hf"]),
        pres_u5=lambda x: fallback_row_sum(x, ["pres_com_u5", "pres_hf_u5"]),
        pres_5_14=lambda x: fallback_row_sum(x, ["pres_com_5_14", "pres_hf_5_14"]),
        pres_ov15=lambda x: fallback_row_sum(x, ["pres_com_ov15", "pres_hf_ov15"]),
    )
)

# inspect results
(
    dhis2_df[
        dhis2_df["record_id"].isin(
            ["a80a6352", "aa2eb756", "c8c29999", "eb63621c", "27ee9eeb", "93b16cfe"]
        )
    ][
        [
            "allout",
            "allout_u5",
            "allout_ov5",
            "pres_hf_u5",
            "maltreat_hf_u5",
            "conf_hf_u5",
        ]
    ]
    .sort_values("allout")
    .head(6)
)
Output
      allout  allout_u5  allout_ov5  pres_hf_u5  maltreat_hf_u5  conf_hf_u5
2426     2.0        2.0         NaN         0.0            19.0        19.0
656     30.0        NaN        30.0         0.0            41.0        58.0
421    123.0        NaN       123.0         0.0           118.0       118.0
2939   139.0       86.0        53.0         NaN             NaN        46.0
2908   165.0      156.0         9.0         NaN             NaN        59.0
189    378.0      256.0       122.0         NaN            50.0         NaN

To adapt the code:

  • Lines 11–270: Adjust the variable names to reflect those relevant to your dataset when calculating new variables

Once updated, run the code

We can see that the aggregation worked as expected using our fallback functions. The allout column is the sum of allout_u5 and allout_ov5—when either value is NA, the non-NA value is preserved rather than returning NA. Similarly, pres_hf_u5 is derived from the difference between maltreat_hf_u5 and conf_hf_u5. When either value is NA, the result reflects the availabel data rather than defaulting to NA.

Step 6.2: Quality Control of Indicator Totals

Now let’s check that indicator totals are equal to the sum of their disaggregated components. The code chunk below counts the number of rows where the indicator total does not equal the sum of its components. For example, rows where allout is not equal to the sum of allout_u5 and allout_ov5.

Why check indicator totals?

If all totals were manually summed using the previous step, you may choose to skip this step. However, even if totals were manually calculated in the previous step, checking totals here can help you identify errors.

In the case that total columns are extracted from DHIS2, it is essential to perform this quality control check for coherency. This check confirms which components go into the aggregated totals extracted from DHIS2.

  • R
  • Python
Show the code
# create mismatch flags for each indicator group
mismatch_summary <- dhis2_df |>
  dplyr::summarise(
    # outpatient checks
    allout_mismatch = sum(
      allout != (allout_u5 + allout_ov5),
      na.rm = TRUE
    ),

    # malaria admissions check
    maladm_mismatch = sum(
      maladm != (maladm_u5 + maladm_5_14 + maladm_ov15),
      na.rm = TRUE
    ),

    # total tests check
    test_mismatch = sum(
      test != (test_hf + test_com),
      na.rm = TRUE
    ),

    # total confirmed check
    conf_mismatch = sum(
      conf != (conf_hf + conf_com),
      na.rm = TRUE
    ),

    # total treated check
    maltreat_mismatch = sum(
      maltreat != (maltreat_hf + maltreat_com),
      na.rm = TRUE
    ),

    # total presumed check
    pres_mismatch = sum(
      pres != (pres_hf + pres_com),
      na.rm = TRUE
    ),

    # malaria deaths check
    maldth_mismatch = sum(
      maldth !=
        (maldth_1_59m +
          maldth_u5 +
          maldth_5_9 +
          maldth_10_14 +
          maldth_5_14 +
          maldth_fem_ov15 +
          maldth_mal_ov15 +
          maldth_ov15),
      na.rm = TRUE
    ),

    # tested cases by age group
    test_u5_mismatch = sum(
      test_u5 != (test_hf_u5 + test_com_u5),
      na.rm = TRUE
    ),
    test_5_14_mismatch = sum(
      test_5_14 != (test_hf_5_14 + test_com_5_14),
      na.rm = TRUE
    ),
    test_ov15_mismatch = sum(
      test_ov15 != (test_hf_ov15 + test_com_ov15),
      na.rm = TRUE
    ),

    # confirmed cases by age group
    conf_u5_mismatch = sum(
      conf_u5 != (conf_hf_u5 + conf_com_u5),
      na.rm = TRUE
    ),
    conf_5_14_mismatch = sum(
      conf_5_14 != (conf_hf_5_14 + conf_com_5_14),
      na.rm = TRUE
    ),
    conf_ov15_mismatch = sum(
      conf_ov15 != (conf_hf_ov15 + conf_com_ov15),
      na.rm = TRUE
    ),

    # presumed cases by age group
    pres_u5_mismatch = sum(
      pres_u5 != (pres_hf_u5 + pres_com_u5),
      na.rm = TRUE
    ),
    pres_5_14_mismatch = sum(
      pres_5_14 != (pres_hf_5_14 + pres_com_5_14),
      na.rm = TRUE
    ),
    pres_ov15_mismatch = sum(
      pres_ov15 != (pres_hf_ov15 + pres_com_ov15),
      na.rm = TRUE
    ),

    # suspected cases by age group
    susp_u5_mismatch = sum(
      susp_u5 != (susp_u5_hf + susp_u5_com),
      na.rm = TRUE
    ),
    susp_5_14_mismatch = sum(
      susp_5_14 != (susp_5_14_hf + susp_5_14_com),
      na.rm = TRUE
    ),
    susp_ov15_mismatch = sum(
      susp_ov15 != (susp_ov15_hf + susp_ov15_com),
      na.rm = TRUE
    ),

    # treated cases by age group
    maltreat_u5_mismatch = sum(
      maltreat_u5 != (maltreat_hf_u5 + maltreat_com_u5),
      na.rm = TRUE
    ),
    maltreat_5_14_mismatch = sum(
      maltreat_5_14 != (maltreat_hf_5_14 + maltreat_com_5_14),
      na.rm = TRUE
    ),
    maltreat_ov15_mismatch = sum(
      maltreat_ov15 != (maltreat_hf_ov15 + maltreat_com_ov15),
      na.rm = TRUE
    ),

    # treatment timing checks
    maltreat_u24_mismatch = sum(
      maltreat_u24_total != (maltreat_u24_hf + maltreat_u24_com),
      na.rm = TRUE
    ),
    maltreat_ov24_mismatch = sum(
      maltreat_ov24_total != (maltreat_ov24_hf + maltreat_ov24_com),
      na.rm = TRUE
    )
  ) |>
  # pivot to long format for easier viewing
  tidyr::pivot_longer(
    cols = dplyr::everything(),
    names_to = "indicator",
    values_to = "n_mismatches"
  ) |>
  # filter to show only indicators with mismatches
  dplyr::filter(n_mismatches > 0)

mismatch_summary
Output

Note: because this code displays rows where indicator totals are not equal to the sum of their components, we should expect to see output with <0 rows> if all calculations are correct. If rows do appear in the output, use the hf and periodname to further investigate the discrepancy.

# A tibble: 0 × 2
# ℹ 2 variables: indicator <chr>, n_mismatches <int>

To adapt the code:

  • Line 3: Replace dhis2_df with your target dataset.
  • Lines 6–52: Adjust the indicator checks to match your data. Each check compares a total column against the sum of its components. Add or remove checks based on the indicators availabel in your dataset.
  • Lines 54–126: These check age-disaggregated indicators. Modify column names to match your age group naming conventions (e.g., _u5, _5_14, _ov15).
  • Line 135: The output filters to show only indicators with mismatches. Remove this filter if you want to see all indicators regardless of mismatch status.
Show the code
# create mismatch flags for each indicator group
# helper function to count mismatches, ignoring NAs
def count_mismatch(total, components):
    """Count mismatches between total and sum of components, ignoring NAs."""
    calculated = components.sum(axis=1)
    # only compare where both total and calculated are not NA
    mask = total.notna() & calculated.notna()
    return (total[mask] != calculated[mask]).sum()

mismatch_summary = pd.DataFrame({
    # outpatient checks
    "allout_mismatch": [
        count_mismatch(
            dhis2_df["allout"],
            dhis2_df[["allout_u5", "allout_ov5"]]
        )
    ],

    # malaria admissions check
    "maladm_mismatch": [
        count_mismatch(
            dhis2_df["maladm"],
            dhis2_df[["maladm_u5", "maladm_5_14", "maladm_ov15"]]
        )
    ],

    # total tests check
    "test_mismatch": [
        count_mismatch(
            dhis2_df["test"],
            dhis2_df[["test_hf", "test_com"]]
        )
    ],

    # total confirmed check
    "conf_mismatch": [
        count_mismatch(
            dhis2_df["conf"],
            dhis2_df[["conf_hf", "conf_com"]]
        )
    ],

    # total treated check
    "maltreat_mismatch": [
        count_mismatch(
            dhis2_df["maltreat"],
            dhis2_df[["maltreat_hf", "maltreat_com"]]
        )
    ],

    # total presumed check
    "pres_mismatch": [
        count_mismatch(
            dhis2_df["pres"],
            dhis2_df[["pres_hf", "pres_com"]]
        )
    ],

    # malaria deaths check
    "maldth_mismatch": [
        count_mismatch(
            dhis2_df["maldth"],
            dhis2_df[[
                "maldth_1_59m", "maldth_u5", "maldth_5_9", "maldth_10_14",
                "maldth_5_14", "maldth_fem_ov15", "maldth_mal_ov15", "maldth_ov15"
            ]]
        )
    ],

    # tested cases by age group
    "test_u5_mismatch": [
        count_mismatch(
            dhis2_df["test_u5"],
            dhis2_df[["test_hf_u5", "test_com_u5"]]
        )
    ],
    "test_5_14_mismatch": [
        count_mismatch(
            dhis2_df["test_5_14"],
            dhis2_df[["test_hf_5_14", "test_com_5_14"]]
        )
    ],
    "test_ov15_mismatch": [
        count_mismatch(
            dhis2_df["test_ov15"],
            dhis2_df[["test_hf_ov15", "test_com_ov15"]]
        )
    ],

    # confirmed cases by age group
    "conf_u5_mismatch": [
        count_mismatch(
            dhis2_df["conf_u5"],
            dhis2_df[["conf_hf_u5", "conf_com_u5"]]
        )
    ],
    "conf_5_14_mismatch": [
        count_mismatch(
            dhis2_df["conf_5_14"],
            dhis2_df[["conf_hf_5_14", "conf_com_5_14"]]
        )
    ],
    "conf_ov15_mismatch": [
        count_mismatch(
            dhis2_df["conf_ov15"],
            dhis2_df[["conf_hf_ov15", "conf_com_ov15"]]
        )
    ],

    # presumed cases by age group
    "pres_u5_mismatch": [
        count_mismatch(
            dhis2_df["pres_u5"],
            dhis2_df[["pres_hf_u5", "pres_com_u5"]]
        )
    ],
    "pres_5_14_mismatch": [
        count_mismatch(
            dhis2_df["pres_5_14"],
            dhis2_df[["pres_hf_5_14", "pres_com_5_14"]]
        )
    ],
    "pres_ov15_mismatch": [
        count_mismatch(
            dhis2_df["pres_ov15"],
            dhis2_df[["pres_hf_ov15", "pres_com_ov15"]]
        )
    ],

    # suspected cases by age group
    "susp_u5_mismatch": [
        count_mismatch(
            dhis2_df["susp_u5"],
            dhis2_df[["susp_hf_u5", "susp_com_u5"]]
        )
    ],
    "susp_5_14_mismatch": [
        count_mismatch(
            dhis2_df["susp_5_14"],
            dhis2_df[["susp_hf_5_14", "susp_com_5_14"]]
        )
    ],
    "susp_ov15_mismatch": [
        count_mismatch(
            dhis2_df["susp_ov15"],
            dhis2_df[["susp_hf_ov15", "susp_com_ov15"]]
        )
    ],

    # treated cases by age group
    "maltreat_u5_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u5"],
            dhis2_df[["maltreat_hf_u5", "maltreat_com_u5"]]
        )
    ],
    "maltreat_5_14_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_5_14"],
            dhis2_df[["maltreat_hf_5_14", "maltreat_com_5_14"]]
        )
    ],
    "maltreat_ov15_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov15"],
            dhis2_df[["maltreat_hf_ov15", "maltreat_com_ov15"]]
        )
    ],

    # treatment timing checks
    "maltreat_u24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u24_total"],
            dhis2_df[["maltreat_u24_hf", "maltreat_u24_com"]]
        )
    ],
    "maltreat_ov24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov24_total"],
            dhis2_df[["maltreat_ov24_hf", "maltreat_ov24_com"]]
        )
    ]
})

# pivot to long format for easier viewing
mismatch_summary = (
    mismatch_summary
    .melt(var_name="indicator", value_name="n_mismatches")
    # filter to show only indicators with mismatches
    .query("n_mismatches > 0")
)

mismatch_summary
Output

Note: because this code displays rows where indicator totals are not equal to the sum of their components, we should expect to see output with <0 rows> if all calculations are correct. If rows do appear in the output, use the hf and periodname to further investigate the discrepancy.

Empty DataFrame
Columns: [indicator, n_mismatches]
Index: []

To adapt the code:

  • Lines 10–156: Replace dhis2_df with your target dataset.
  • Lines 10–58: Adjust the indicator checks to match your data. Each check compares a total column against the sum of its components. Add or remove checks based on the indicators availabel in your dataset.
  • Lines 60–156: These check age-disaggregated indicators. Modify column names to match your age group naming conventions (e.g., _u5, _5_14, _ov15).
  • Line 164: The output filters to show only indicators with mismatches. Remove .query("n_mismatches > 0") if you want to see all indicators regardless of mismatch status.

Step 6.3: Export Rows with Incoherent Totals

If there are incoherent totals detected, they can be exported for further assessment using the below code. Exported outputs must be shared with the SNT team for review and guidance.

  • R
  • Python
Show the code
# Identify rows with incoherent totals
incoherent_rows <- dhis2_df |>
  dplyr::filter(
    # outpatient checks
    allout != (allout_u5 + allout_ov5) |
    # malaria admissions check
    maladm != (maladm_u5 + maladm_5_14 + maladm_ov15) |
    # total tests check
    test != (test_hf + test_com) |
    # total confirmed check
    conf != (conf_hf + conf_com) |
    # total treated check
    maltreat != (maltreat_hf + maltreat_com) |
    # total presumed check
    pres != (pres_hf + pres_com)
  ) |>
  dplyr::select(
    hf, periodname,
    dplyr::matches("allout|maladm|test|conf|maltreat|pres")
  )

# set path for saving
output_file <- here::here(
  "1.1.2_epidemiology",
  "1.1.2a_routine_surveillance",
  "processed",
  "sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
rio::export(incoherent_rows, file = output_file)

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Lines 3–16: Adjust the filter conditions to match your indicator totals and components.
  • Lines 23–29: Update the output file path to your preferred location.

Once updated, run the code to export rows with incoherent totals for review with the SNT team.

Show the code
# identify rows with incoherent totals
incoherent_rows = dhis2_df[
    # outpatient checks
    (dhis2_df["allout"] != (dhis2_df["allout_u5"] + dhis2_df["allout_ov5"])) |
    # malaria admissions check
    (dhis2_df["maladm"] != (dhis2_df["maladm_u5"] + dhis2_df["maladm_5_14"] + dhis2_df["maladm_ov15"])) |
    # total tests check
    (dhis2_df["test"] != (dhis2_df["test_hf"] + dhis2_df["test_com"])) |
    # total confirmed check
    (dhis2_df["conf"] != (dhis2_df["conf_hf"] + dhis2_df["conf_com"])) |
    # total treated check
    (dhis2_df["maltreat"] != (dhis2_df["maltreat_hf"] + dhis2_df["maltreat_com"])) |
    # total presumed check
    (dhis2_df["pres"] != (dhis2_df["pres_hf"] + dhis2_df["pres_com"]))
]

# select relevant columns
incoherent_rows = incoherent_rows[
    ["hf", "periodname"] +
    [col for col in incoherent_rows.columns
     if any(x in col for x in ["allout", "maladm", "test", "conf", "maltreat", "pres"])]
]

# set path for saving
output_file = Path(
    "1.1.2_epidemiology",
    "1.1.2a_routine_surveillance",
    "processed",
    "sle_incoherent_totals_dhis2.xlsx"
)

# export to xlsx
incoherent_rows.to_excel(output_file, index=False)

To adapt the code:

  • Line 2: Replace dhis2_df with your target dataset.
  • Lines 2–15: Adjust the filter conditions to match your indicator totals and components.
  • Lines 25–30: Update the output file path to your preferred location.

Once updated, run the code to export rows with incoherent totals for review with the SNT team.

Step 6.3: Add IPD/OPD Specification

It may be useful in later analyses to subset routine data to only inpatient or only outpatient facilities or departments. For example, you may want to analyse malaria admissions only from facilities with inpatient capacity, or focus on outpatient case management indicators from primary care facilities.

Here, we create a column that specifies whether a facility provides inpatient (IPD) or outpatient (OPD) services. This classification can be based on:

  • Facility type naming conventions (e.g., “Hospital” = IPD, “CHP” = OPD)
  • Presence of inpatient indicators (e.g., facilities reporting admissions or deaths)
  • A reference lookup file from the SNT team or health ministry
Consult SNT team

Facility classification varies by country. Confirm with the SNT team which facility types should be classified as IPD vs OPD. Some facilities may offer both services and require special handling.

  • R
  • Python
Show the code
# option 1: classify based on facility name patterns
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    facility_type = dplyr::case_when(
      # inpatient facilities (hospitals)
      stringr::str_detect(
        hf,
        regex("hospital|hosp", ignore_case = TRUE)
      ) ~ "IPD",
      stringr::str_detect(
        hf,
        regex("district hospital|regional hospital", ignore_case = TRUE)
      ) ~ "IPD",
      # outpatient facilities (clinics, health posts, community health)
      stringr::str_detect(
        hf,
        regex(
          "CHP|CHC|MCHP|clinic|health post|health centre",
          ignore_case = TRUE
        )
      ) ~ "OPD",
      # default to OPD if no pattern matched
      TRUE ~ "OPD"
    )
  )

# option 2: classify based on presence of inpatient indicators
dhis2_df <- dhis2_df |>
  dplyr::group_by(hf_uid) |>
  dplyr::mutate(
    # facility has IPD if it ever reports admissions or deaths
    has_ipd = any(!is.na(maladm) & maladm > 0, na.rm = TRUE) |
      any(!is.na(maldth) & maldth > 0, na.rm = TRUE),
    facility_type = dplyr::if_else(has_ipd, "IPD", "OPD")
  ) |>
  dplyr::ungroup() |>
  dplyr::select(-has_ipd)

# option 3: use a reference lookup file
hf_path <-
  rio::import(
    here::here(
      "01_data",
      "1.1_foundational",
      "1.1b_health_facilities",
      "processed",
      "health_facility_master_list.xlsx"
    )
  )

dhis2_df <- dhis2_df |>
  dplyr::left_join(
    facility_lookup |> dplyr::select(hf_uid, facility_type),
    by = "hf_uid"
  )

# check classification distribution
dhis2_df |>
  dplyr::distinct(hf_uid, facility_type) |>
  dplyr::count(facility_type)
Output
  facility_type    n
1           IPD   89
2           OPD 1682

To adapt the code:

  • Line 12: Replace dhis2_df with your target dataset.
  • Lines 16–31 (Option 1): Adjust the str_detect() patterns to match your country’s facility naming conventions (e.g., “dispensary”, “health center”, “referral hospital”).
  • Lines 40–43 (Option 2): Modify the indicator columns (maladm, maldth) if your dataset uses different names for inpatient indicators.
  • Lines 47–55 (Option 3): Update the path to your facility classification lookup file if using a reference list from the SNT team.

Choose the option that best fits your data and context. Option 3 (reference lookup) is recommended when an official facility classification exists.

Show the code
# option 1: classify based on facility name patterns
def classify_facility(hf_name):
    """Classify facility as IPD or OPD based on name patterns."""
    hf_lower = hf_name.lower() if pd.notna(hf_name) else ""

    # inpatient facilities (hospitals)
    if any(x in hf_lower for x in ["hospital", "hosp"]):
        return "IPD"
    # outpatient facilities (clinics, health posts, community health)
    elif any(x in hf_lower for x in ["chp", "chc", "mchp", "clinic", "health post", "health centre"]):
        return "OPD"
    # default to OPD
    else:
        return "OPD"

dhis2_df["facility_type"] = dhis2_df["hf"].apply(classify_facility)

# option 2: classify based on presence of inpatient indicators
has_ipd = (
    dhis2_df
    .groupby("hf_uid")
    .apply(
        lambda x: ((x["maladm"].notna() & (x["maladm"] > 0)).any() |
                   (x["maldth"].notna() & (x["maldth"] > 0)).any())
    )
    .reset_index(name="has_ipd")
)

dhis2_df = dhis2_df.merge(has_ipd, on="hf_uid", how="left")
dhis2_df["facility_type"] = dhis2_df["has_ipd"].map({True: "IPD", False: "OPD"})
dhis2_df = dhis2_df.drop(columns="has_ipd")

# option 3: use a reference lookup file
facility_lookup = pd.read_excel(
    Path("path/to/facility_classification.xlsx")
)

dhis2_df = dhis2_df.merge(
    facility_lookup[["hf_uid", "facility_type"]],
    on="hf_uid",
    how="left"
)

# check classification distribution
dhis2_df[["hf_uid", "facility_type"]].drop_duplicates()["facility_type"].value_counts()
Output
  facility_type     n
0           OPD  1682
1           IPD    89

To adapt the code:

  • Lines 6–13 (Option 1): Adjust the str_detect() / string patterns to match your country’s facility naming conventions (e.g., “dispensary”, “health center”, “referral hospital”).
  • Lines 17–25 (Option 2): Modify the indicator columns (maladm, maldth) if your dataset uses different names for inpatient indicators.
  • Lines 28–35 (Option 3): Update the path to your facility classification lookup file if using a reference list from the SNT team.

Choose the option that best fits your data and context. Option 3 (reference lookup) is recommended when an official facility classification exists.

Step 7: Finalize Data

Step 7.1: Troubleshoot Duplicate HF-Month Records with Different Data

Here we search for rows in the dataset that correspond to the same HF-month report but happen to have different data. If any are found, these need to be cleaned at this stage to make sure the dataset contains only one report for each combination of HF and month. Expect to work closely with the SNT team or DHIS2 focal person at this stage to ensure correct data handling decisions are made.

Duplicate HF-month records can arise from multiple data entry submissions for the same reporting period, data imports from different sources being combined, system errors during DHIS2 extraction, or facilities reporting through both paper and electronic systems.

Unresolved duplicates will inflate case counts and distort indicators. Identifying and resolving them is essential before any analysis.

  • R
  • Python
Show the code
# identify duplicate hf-month combinations
duplicates <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::ungroup()

# count number of duplicate pairs
n_duplicates <- duplicates |>
  dplyr::distinct(record_id) |>
  nrow()

# if duplicates exist, inspect them
if (n_duplicates > 0) {
  # view duplicate records with key indicators
  duplicate_details <- duplicates |>
    dplyr::select(
      record_id,
      adm0, adm1, adm2, adm3, hfm yearmon,
      allout,
      test,
      conf,
      maltreat
    ) |>
    dplyr::arrange(record_id)

  print(duplicate_details)

  # export for review with SNT team
  rio::export(
    duplicate_details,
    here::here(
      "1.1.2_epidemiology",
      "1.1.2a_routine_surveillance",
      "processed",
      "sle_duplicate_records_dhis2.xlsx"
    )
  )
}

# option 1: keep first record (if duplicates are exact copies)
dhis2_df <- dhis2_df |>
  dplyr::distinct(record_id, .keep_all = TRUE)

# option 2: keep record with most complete data
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::slice_max(
    # count non-NA values across indicator columns
    order_by = rowSums(!is.na(dplyr::across(dplyr::where(is.numeric)))),
    n = 1,
    with_ties = FALSE
  ) |>
  dplyr::ungroup()

# option 3: sum duplicates (if records represent partial reports)
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id, hf, adm0, adm1, adm2, adm3) |>
  dplyr::summarise(
    dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)),
    .groups = "drop"
  )

To adapt the code:

  • Line 3: Replace dhis2_df with your target dataset.
  • Lines 18–21: Adjust the columns selected for duplicate inspection based on your key indicators.
  • Lines 26–33: Update the output file path for exporting duplicates.
  • Lines 36–56: Choose the appropriate resolution option based on your context

Once updated, run the code to identify and resolve duplicate HF-month records.

Show the code
# identify duplicate hf-month combinations
duplicates = dhis2_df.groupby(["hf_uid", "yearmon"]).filter(lambda x: len(x) > 1)

# count number of duplicate pairs
n_duplicates = duplicates[["hf_uid", "yearmon"]].drop_duplicates().shape[0]

# if duplicates exist, inspect them
if n_duplicates > 0:
    # view duplicate records with key indicators
    duplicate_details = duplicates[
        ["record_id", "allout", "test", "conf", "maltreat"]
    ].sort_values(["hf_uid", "yearmon"])

    print(duplicate_details)

    # export for review with SNT team
    duplicate_details.to_excel(
        Path(
            "1.1.2_epidemiology",
            "1.1.2a_routine_surveillance",
            "processed",
            "sle_duplicate_records_dhis2.xlsx",
        ),
        index=False,
    )

# option 1: keep first record (if duplicates are exact copies)
dhis2_df = dhis2_df.drop_duplicates(subset=["hf_uid", "yearmon"], keep="first")

# option 2: keep record with most complete data
dhis2_df = (
    dhis2_df.assign(
        n_complete=lambda x: x.select_dtypes(include="number").notna().sum(axis=1)
    )
    .sort_values("n_complete", ascending=False)
    .drop_duplicates(subset=["record_id"], keep="first")
    .drop(columns="n_complete")
)

# option 3: sum duplicates (if records represent partial reports)
group_cols = ["hf_uid", "yearmon", "hf", "adm0", "adm1", "adm2", "adm3"]
numeric_cols = dhis2_df.select_dtypes(include="number").columns.tolist()

dhis2_df = dhis2_df.groupby(group_cols, as_index=False)[numeric_cols].sum()

# verify duplicates resolved
n_remaining = dhis2_df.groupby(["record_id"]).filter(lambda x: len(x) > 1).shape[0]

To adapt the code:

  • Line 3: Replace dhis2_df with your target dataset.
  • Lines 18–21: Adjust the columns selected for duplicate inspection based on your key indicators.
  • Lines 26–33: Update the output file path for exporting duplicates.
  • Lines 36–56: Choose the appropriate resolution option based on your context

Once updated, run the code to identify and resolve duplicate HF-month records.

Choosing a resolution strategy

Consult with the SNT team before choosing how to resolve duplicates. The correct approach depends on why duplicates exist. Exact duplicates from system errors can be handled by keeping the first record (Option 1). When records have different completeness from re-submissions, keep the most complete record (Option 2). Partial reports from split submissions may need to be summed together (Option 3). Conflicting data requiring investigation should be exported and resolved manually with the SNT team.

Step 7.2: Generate Final Data Dictionary

A complete data dictionary documents all columns in the preprocessed dataset, including both the original DHIS2 variables and the computed/derived columns created during preprocessing. This dictionary serves as a reference for downstream analyses and facilitates collaboration with the SNT team.

  • R
  • Python
Show the code
# define descriptions for computed/derived columns
computed_vars <- tibble::tribble(
  ~snt_var,              ~indicator_label,
  # time columns
  "date",                "Report date (YYYY-MM-DD)",
  "year",                "Report year",
  "month",               "Report month (1-12)",
  "yearmon",             "Year-month label (e.g., Jan 2020)",
  # identifier columns
  "hf_uid",              "Unique health facility identifier (hash)",
  "record_id",           "Unique record identifier (facility + month hash)",
  "location_short",      "Location label: adm1 ~ adm2",
  "location_full",       "Location label: adm1 ~ adm2 ~ adm3 ~ hf",
  "facility_type",       "Facility type (IPD/OPD)",
  # aggregated totals
  "allout",              "Total outpatient visits (allout_u5 + allout_ov5)",
  "susp",                "Total suspected cases (all ages, HF + COM)",
  "test",                "Total tested (test_hf + test_com)",
  "test_hf",             "Total tested at health facility",
  "test_com",            "Total tested in community",
  "conf",                "Total confirmed cases (conf_hf + conf_com)",
  "conf_hf",             "Total confirmed cases at health facility",
  "conf_com",            "Total confirmed cases in community",
  "maltreat",            "Total treated cases (maltreat_hf + maltreat_com)",
  "maltreat_hf",         "Total treated cases at health facility",
  "maltreat_com",        "Total treated cases in community",
  "pres",                "Total presumed cases (pres_hf + pres_com)",
  "pres_hf",             "Total presumed cases at health facility",
  "pres_com",            "Total presumed cases in community",
  "maladm",              "Total malaria admissions (all ages)",
  "maldth",              "Total malaria deaths (all ages)",
  # age-group totals
  "test_u5",             "Total tested under 5 (HF + COM)",
  "test_5_14",           "Total tested 5-14 years (HF + COM)",
  "test_ov15",           "Total tested over 15 years (HF + COM)",
  "test_hf_u5",          "Tested under 5 at health facility",
  "test_hf_5_14",        "Tested 5-14 years at health facility",
  "test_hf_ov15",        "Tested over 15 years at health facility",
  "test_com_u5",         "Tested under 5 in community",
  "test_com_5_14",       "Tested 5-14 years in community",
  "test_com_ov15",       "Tested over 15 years in community",
  "susp_u5",             "Suspected cases under 5 (HF + COM)",
  "susp_5_14",           "Suspected cases 5-14 years (HF + COM)",
  "susp_ov15",           "Suspected cases over 15 years (HF + COM)",
  "susp_hf_u5",          "Suspected cases under 5 at health facility",
  "susp_hf_5_14",        "Suspected cases 5-14 years at health facility",
  "susp_hf_ov15",        "Suspected cases over 15 years at health facility",
  "susp_com_u5",         "Suspected cases under 5 in community",
  "susp_com_5_14",       "Suspected cases 5-14 years in community",
  "susp_com_ov15",       "Suspected cases over 15 years in community",
  "conf_u5",             "Confirmed cases under 5 (HF + COM)",
  "conf_5_14",           "Confirmed cases 5-14 years (HF + COM)",
  "conf_ov15",           "Confirmed cases over 15 years (HF + COM)",
  "conf_hf_u5",          "Confirmed cases under 5 at health facility",
  "conf_hf_5_14",        "Confirmed cases 5-14 years at health facility",
  "conf_hf_ov15",        "Confirmed cases over 15 years at health facility",
  "conf_com_u5",         "Confirmed cases under 5 in community",
  "conf_com_5_14",       "Confirmed cases 5-14 years in community",
  "conf_com_ov15",       "Confirmed cases over 15 years in community",
  "maltreat_u5",         "Treated cases under 5 (HF + COM)",
  "maltreat_5_14",       "Treated cases 5-14 years (HF + COM)",
  "maltreat_ov15",       "Treated cases over 15 years (HF + COM)",
  "maltreat_hf_u5",      "Treated cases under 5 at health facility",
  "maltreat_hf_5_14",    "Treated cases 5-14 years at health facility",
  "maltreat_hf_ov15",    "Treated cases over 15 years at health facility",
  "maltreat_com_u5",     "Treated cases under 5 in community",
  "maltreat_com_5_14",   "Treated cases 5-14 years in community",
  "maltreat_com_ov15",   "Treated cases over 15 years in community",
  "maltreat_u24_hf",     "Treated cases within 24hrs at health facility",
  "maltreat_ov24_hf",    "Treated cases after 24hrs at health facility",
  "maltreat_u24_com",    "Treated cases within 24hrs in community",
  "maltreat_ov24_com",   "Treated cases after 24hrs in community",
  "maltreat_u24_total",  "Total treated cases within 24hrs (HF + COM)",
  "maltreat_ov24_total", "Total treated cases after 24hrs (HF + COM)",
  "pres_u5",             "Presumed cases under 5 (HF + COM)",
  "pres_5_14",           "Presumed cases 5-14 years (HF + COM)",
  "pres_ov15",           "Presumed cases over 15 years (HF + COM)",
  "pres_hf_u5",          "Presumed cases under 5 at health facility",
  "pres_hf_5_14",        "Presumed cases 5-14 years at health facility",
  "pres_hf_ov15",        "Presumed cases over 15 years at health facility",
  "pres_com_u5",         "Presumed cases under 5 in community",
  "pres_com_5_14",       "Presumed cases 5-14 years in community",
  "pres_com_ov15",       "Presumed cases over 15 years in community"
)

# combine original and computed dictionaries
full_data_dict <- dplyr::bind_rows(
  data_dict,
  computed_vars
) |>
  # keep only columns present in final dataset
  dplyr::filter(snt_var %in% names(dhis2_df)) |>
  dplyr::arrange(snt_var) |>
  dplyr::select(snt_variable = snt_var,  label = indicator_label)

# check
full_data_dict |>
    head()
Output
  snt_variable                                            label
1         adm0                                    orgunitlevel1
2         adm1                                    orgunitlevel2
3         adm2                                    orgunitlevel3
4         adm3                                    orgunitlevel4
5       allout Total outpatient visits (allout_u5 + allout_ov5)
6   allout_ov5           OPD (New and follow-up curative) 5+y_X

To adapt the code:

  • Lines 4–77: Review and update computed_vars to match your computed variables. Add or remove rows as needed.
  • Lines 85–91: Update the output file path for the final data dictionary.

Once updated, run the code to generate and export the final data dictionary.

Show the code
# define descriptions for computed/derived columns
computed_vars = pd.DataFrame([
    # time columns
    {"snt_var": "date", "indicator_label": "Report date (YYYY-MM-DD)"},
    {"snt_var": "year", "indicator_label": "Report year"},
    {"snt_var": "month", "indicator_label": "Report month (1-12)"},
    {"snt_var": "yearmon", "indicator_label": "Year-month label (e.g., Jan 2020)"},
    # identifier columns
    {"snt_var": "hf_uid", "indicator_label": "Unique health facility identifier (hash)"},
    {"snt_var": "record_id", "indicator_label": "Unique record identifier (facility + month hash)"},
    {"snt_var": "location_short", "indicator_label": "Location label: adm1 ~ adm2"},
    {"snt_var": "location_full", "indicator_label": "Location label: adm1 ~ adm2 ~ adm3 ~ hf"},
    {"snt_var": "facility_type", "indicator_label": "Facility type (IPD/OPD)"},
    # aggregated totals
    {"snt_var": "allout", "indicator_label": "Total outpatient visits (allout_u5 + allout_ov5)"},
    {"snt_var": "susp", "indicator_label": "Total suspected cases (all ages, HF + COM)"},
    {"snt_var": "test", "indicator_label": "Total tested (test_hf + test_com)"},
    {"snt_var": "test_hf", "indicator_label": "Total tested at health facility"},
    {"snt_var": "test_com", "indicator_label": "Total tested in community"},
    {"snt_var": "conf", "indicator_label": "Total confirmed cases (conf_hf + conf_com)"},
    {"snt_var": "conf_hf", "indicator_label": "Total confirmed cases at health facility"},
    {"snt_var": "conf_com", "indicator_label": "Total confirmed cases in community"},
    {"snt_var": "maltreat", "indicator_label": "Total treated cases (maltreat_hf + maltreat_com)"},
    {"snt_var": "maltreat_hf", "indicator_label": "Total treated cases at health facility"},
    {"snt_var": "maltreat_com", "indicator_label": "Total treated cases in community"},
    {"snt_var": "pres", "indicator_label": "Total presumed cases (pres_hf + pres_com)"},
    {"snt_var": "pres_hf", "indicator_label": "Total presumed cases at health facility"},
    {"snt_var": "pres_com", "indicator_label": "Total presumed cases in community"},
    {"snt_var": "maladm", "indicator_label": "Total malaria admissions (all ages)"},
    {"snt_var": "maldth", "indicator_label": "Total malaria deaths (all ages)"},
    # age-group totals
    {"snt_var": "test_u5", "indicator_label": "Total tested under 5 (HF + COM)"},
    {"snt_var": "test_5_14", "indicator_label": "Total tested 5-14 years (HF + COM)"},
    {"snt_var": "test_ov15", "indicator_label": "Total tested over 15 years (HF + COM)"},
    {"snt_var": "conf_u5", "indicator_label": "Confirmed cases under 5 (HF + COM)"},
    {"snt_var": "conf_5_14", "indicator_label": "Confirmed cases 5-14 years (HF + COM)"},
    {"snt_var": "conf_ov15", "indicator_label": "Confirmed cases over 15 years (HF + COM)"},
    {"snt_var": "maltreat_u5", "indicator_label": "Treated cases under 5 (HF + COM)"},
    {"snt_var": "maltreat_5_14", "indicator_label": "Treated cases 5-14 years (HF + COM)"},
    {"snt_var": "maltreat_ov15", "indicator_label": "Treated cases over 15 years (HF + COM)"},
    {"snt_var": "pres_u5", "indicator_label": "Presumed cases under 5 (HF + COM)"},
    {"snt_var": "pres_5_14", "indicator_label": "Presumed cases 5-14 years (HF + COM)"},
    {"snt_var": "pres_ov15", "indicator_label": "Presumed cases over 15 years (HF + COM)"},
    # add remaining age-group variables as needed...
])

# combine original and computed dictionaries
full_data_dict = (
    pd.concat([data_dict, computed_vars], ignore_index=True)
    .rename(columns={"snt_var": "snt_variable", "indicator_label": "label"})
    [["snt_variable", "label"]]
)

# keep only columns present in final dataset
full_data_dict = (
    full_data_dict[full_data_dict["snt_variable"].isin(dhis2_df.columns)]
    .sort_values("snt_variable")
)

# check
full_data_dict.head(10)
Output
   snt_variable                                             label
0          adm0                                     orgunitlevel1
1          adm1                                     orgunitlevel2
2          adm2                                     orgunitlevel3
3          adm3                                     orgunitlevel4
63       allout  Total outpatient visits (allout_u5 + allout_ov5)
6    allout_ov5            OPD (New and follow-up curative) 5+y_X
5     allout_u5          OPD (New and follow-up curative) 0-59m_X
68         conf        Total confirmed cases (conf_hf + conf_com)
83    conf_5_14             Confirmed cases 5-14 years (HF + COM)
70     conf_com                Total confirmed cases in community

To adapt the code:

  • Lines 7–49: Review and update computed_vars to match your computed variables. Add or remove rows as needed.
  • Lines 55–61: Update the output file path for the final data dictionary.

Once updated, run the code to generate and export the final data dictionary.

Step 7.3: Arrange Final Column Order

Before exporting, we organise columns in a logical order to make the dataset easier to navigate. Identifiers and location columns come first, followed by time variables, then all indicator columns.

  • R
  • Python
Show the code
# arrange columns in logical order
dhis2_df <- dhis2_df |>
  dplyr::select(
    # identifiers
    record_id,
    # location hierarchy
    adm0,
    adm1,
    adm2,
    adm3,
    hf,
    hf_uid,
    location_short,
    location_full,
    facility_type,
    # time variables
    date,
    yearmon,
    year,
    month,
    # all remaining indicator columns
    dplyr::everything()
  )

# check column order
colnames(dhis2_df) |> head(20)
Output
 [1] "record_id"      "adm0"           "adm1"           "adm2"          
 [5] "adm3"           "hf"             "hf_uid"         "location_short"
 [9] "location_full"  "facility_type"  "date"           "yearmon"       
[13] "year"           "month"          "periodname"     "allout_u5"     
[17] "allout_ov5"     "maladm_u5"      "maladm_5_14"    "maladm_ov15"   

To adapt the code:

  • Lines 6–7: Adjust identifier columns based on your dataset.
  • Lines 9–16: Modify location columns to match your administrative hierarchy.
  • Lines 18–21: Update time columns if using different temporal variables (e.g., epiweek).

Once updated, run the code to reorder columns before exporting the final dataset.

Show the code
# define column order
id_cols = ["record_id"]
location_cols = ["adm0", "adm1", "adm2", "adm3", "hf", "hf_uid",
                 "location_short", "location_full", "facility_type"]
time_cols = ["date", "yearmon", "year", "month"]

# get remaining columns in original order
ordered_cols = id_cols + location_cols + time_cols
remaining_cols = [col for col in dhis2_df.columns if col not in ordered_cols]

# reorder dataframe
dhis2_df = dhis2_df[ordered_cols + remaining_cols]

# check column order
print(dhis2_df.columns[:20].tolist())
Output
['record_id', 'adm0', 'adm1', 'adm2', 'adm3', 'hf', 'hf_uid', 'location_short', 'location_full', 'facility_type', 'date', 'yearmon', 'year', 'month', 'periodname', 'allout_u5', 'allout_ov5', 'maladm_u5', 'maladm_5_14', 'maladm_ov15']

To adapt the code:

  • Line 4: Adjust identifier columns based on your dataset.
  • Lines 5–6: Modify location columns to match your administrative hierarchy.
  • Lines 7: Update time columns if using different temporal variables (e.g., epiweek).

Once updated, run the code to reorder columns before exporting the final dataset.

Step 8: Aggregate and Save Data

Step 8.1: Save Data at the Health Facility Level

Since some SNT analyses rely on facility-specific information, we will now save the data at the health facility level. Keeping the data at health facility level ensures that facility-level analyses can be done accurately and without loss of detail.

  • R
  • Python
Show the code
# set up output path
save_path <- here::here(
  "01_data",
  "02_epidemiology",
  "2a_routine_surveillance",
  "processed"
)

# create list to save
dhis2_hf_list <- list(
  data = dhis2_df,
  dictionary = full_data_dict
)

# save to xlsx
rio::export(
  dhis2_hf_list,
  here::here(save_path, "sle_dhis2_health_facility_data.xlsx")
)

# save to rds
rio::export(
  dhis2_hf_list,
  here::here(save_path, "sle_dhis2_health_facility_data.rds")
)

To adapt the code:

  • Lines 2-7: Update save_path to your preferred output directory.
  • Lines 10-13: The list includes both the cleaned data and data dictionary. Add or remove items as needed.
  • Lines 16-19: Update the filename prefix (e.g., sle_) to match your country code.

Once updated, run the code to export the final preprocessed dataset and data dictionary.

Code
save_path = Path(
    here("01_data/02_epidemiology/2a_routine_surveillance/processed")
)

# save data to xlsx
dhis2_df.to_excel(
    save_path / "sle_dhis2_health_facility_data.xlsx",
    index=False
)

# save dictionary to xlsx
full_data_dict.to_excel(
    save_path / "sle_dhis2_health_facility_dict.xlsx",
    index=False
)

# save to parquet
dhis2_df.to_parquet(
    save_path / "sle_dhis2_health_facility_data.parquet",
    compression="zstd",
    index=False
)

To adapt the code:

  • Lines 2-4: Update save_path to your preferred output directory.
  • Lines 7-10: Update the filename prefix (e.g., sle_) to match your country code.
  • Lines 13-16: Save the dictionary separately if needed.
  • Lines 19-23: The .parquet format is efficient for large datasets. Use .csv instead if sharing with non-Python users.

Once updated, run the code to export the final preprocessed dataset and data dictionary.

Step 8.2: Aggregate and Save Data at Each Admin Unit Level

Now we aggregate and save the data at different administrative levels to support various types of analysis and ensure alignment with how interventions are typically planned and monitored. This allows flexibility when calculating indicators, comparing trends across regions, or linking with other datasets structured at adm0, adm1, adm2, adm3, or facility level.

Avoid Summing Pre-Calculated Rates

When aggregating data to administrative levels, not all indicators can simply be summed. For example, if you have already calculated a test positivity rate or treatment rate in your dataset, these indicators must be re-calculated at the new administrative level.

  • R
  • Python
Show the code
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols <- c(
  # totals
  "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
  # by location
  "test_hf", "test_com", "conf_hf", "conf_com",
  "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
  # by age group - totals
  "allout_u5", "allout_ov5",
  "test_u5", "test_5_14", "test_ov15",
  "conf_u5", "conf_5_14", "conf_ov15",
  "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
  "pres_u5", "pres_5_14", "pres_ov15",
  "susp_u5", "susp_5_14", "susp_ov15",
  "maladm_u5", "maladm_5_14", "maladm_ov15",
  "maldth_u5", "maldth_5_14", "maldth_ov15",
  # by age and location
  "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
  "test_com_u5", "test_com_5_14", "test_com_ov15",
  "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
  "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
  "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
  "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
  "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
  "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
  # treatment timing
  "maltreat_u24_hf", "maltreat_ov24_hf",
  "maltreat_u24_com", "maltreat_ov24_com",
  "maltreat_u24_total", "maltreat_ov24_total"
)

# function to aggregate at a given admin level
aggregate_admin <- function(df, group_cols, sum_cols) {
  df |>
    dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |>
    dplyr::summarise(
      dplyr::across(
        dplyr::any_of(sum_cols),
        ~ sum(.x, na.rm = TRUE)
      ),
      n_facilities = dplyr::n(),
      .groups = "drop"
    )
}

# aggregate at adm3 level
group_cols_adm3 <- c("adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon")
dhis2_adm3 <- aggregate_admin(dhis2_df, group_cols_adm3, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, adm3, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, sep = " ~ ")
  )

# aggregate at adm2 level
group_cols_adm2 <- c("adm0", "adm1", "adm2", "year", "month", "yearmon")
dhis2_adm2 <- aggregate_admin(dhis2_df, group_cols_adm2, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, sep = " ~ ")
  )

# aggregate at adm1 level
group_cols_adm1 <- c("adm0", "adm1", "year", "month", "yearmon")
dhis2_adm1 <- aggregate_admin(dhis2_df, group_cols_adm1, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, yearmon),
      algo = "xxhash32"
    ),
    location_short = adm1,
    location_full = adm1
  )

# create data dictionaries for each level (filter to relevant columns only)
id_cols <- c("n_facilities", "record_id", "location_short", "location_full")

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_dict <- full_data_dict |>
  dplyr::filter(snt_variable %in% c(group_cols_adm3, sum_cols, id_cols))

# adm2 dict: excludes adm3 (not present at this level)
adm2_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm2, sum_cols, id_cols),
    !snt_variable %in% c("adm3")
  )

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm1, sum_cols, id_cols),
    !snt_variable %in% c("adm2", "adm3")
  )

# save adm3 data with dictionary
rio::export(
  list(data = dhis2_adm3, dictionary = adm3_dict),
  here::here(save_path, "sle_dhis2_adm3_data.xlsx")
)
rio::export(dhis2_adm3, here::here(save_path, "sle_dhis2_adm3_data.rds"))

# save adm2 data with dictionary
rio::export(
  list(data = dhis2_adm2, dictionary = adm2_dict),
  here::here(save_path, "sle_dhis2_adm2_data.xlsx")
)
rio::export(dhis2_adm2, here::here(save_path, "sle_dhis2_adm2_data.rds"))

# save adm1 data with dictionary
rio::export(
  list(data = dhis2_adm1, dictionary = adm1_dict),
  here::here(save_path, "sle_dhis2_adm1_data.xlsx")
)
rio::export(dhis2_adm1, here::here(save_path, "sle_dhis2_adm1_data.rds"))
Output
ADM3 aggregation preview:
# A tibble: 5 × 6
  adm3  yearmon  record_id location_full             conf n_facilities
  <chr> <fct>    <chr>     <chr>                    <dbl>        <int>
1 DEA   Jan 2015 a13ef3b8  EASTERN ~ KAILAHUN ~ DEA   336            3
2 DEA   Feb 2015 a6e77ff1  EASTERN ~ KAILAHUN ~ DEA   353            3
3 DEA   Mar 2015 1a075bd7  EASTERN ~ KAILAHUN ~ DEA   318            3
4 DEA   Apr 2015 9e096499  EASTERN ~ KAILAHUN ~ DEA   235            3
5 DEA   May 2015 cf41c82e  EASTERN ~ KAILAHUN ~ DEA   450            3

To adapt the code:

  • Lines 2-30: Update sum_cols to include all numeric indicators relevant to your dataset.
  • Lines 33-42: The aggregate_admin() function handles grouping by admin level.
  • Lines 45-79: Adjust grouping columns if your admin hierarchy differs. The record_id is created using sntutils::vdigest() with the full admin hierarchy (e.g., adm0 + adm1 + adm2 + adm3 + yearmon for adm3 level) to ensure unique identifiers. The code also creates location_short and location_full labels for each level.
  • Lines 82-94: Filter the data dictionary to match columns at each level, excluding admin columns not present at that level (e.g., adm3 excluded from adm2 dictionary).
  • Lines 97-112: Update filename prefixes (e.g., sle_) to match your country code.

Once updated, run the code to aggregate and save data at adm3, adm2, and adm1 levels.

Show the code
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols = [
    # totals
    "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
    # by location
    "test_hf", "test_com", "conf_hf", "conf_com",
    "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
    # by age group - totals
    "allout_u5", "allout_ov5",
    "test_u5", "test_5_14", "test_ov15",
    "conf_u5", "conf_5_14", "conf_ov15",
    "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
    "pres_u5", "pres_5_14", "pres_ov15",
    "susp_u5", "susp_5_14", "susp_ov15",
    "maladm_u5", "maladm_5_14", "maladm_ov15",
    "maldth_u5", "maldth_5_14", "maldth_ov15",
    # by age and location
    "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
    "test_com_u5", "test_com_5_14", "test_com_ov15",
    "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
    "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
    "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
    "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
    "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
    "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
    # treatment timing
    "maltreat_u24_hf", "maltreat_ov24_hf",
    "maltreat_u24_com", "maltreat_ov24_com",
    "maltreat_u24_total", "maltreat_ov24_total"
]

# function to aggregate at a given admin level
def aggregate_admin(df, group_cols, sum_cols):
    # filter to columns that exist in dataframe
    existing_sum_cols = [c for c in sum_cols if c in df.columns]

    # aggregate
    agg_df = (
        df
        .groupby(group_cols, as_index=False)
        .agg(
            **{col: (col, "sum") for col in existing_sum_cols},
            n_facilities=("hf_uid", "count")
        )
    )

    return agg_df

# aggregate at adm3 level
group_cols_adm3 = ["adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon"]
dhis2_adm3 = aggregate_admin(dhis2_df, group_cols_adm3, sum_cols)
dhis2_adm3["record_id"] = vdigest(
    dhis2_adm3["adm0"] + " " + dhis2_adm3["adm1"] + " " +
    dhis2_adm3["adm2"] + " " + dhis2_adm3["adm3"] + " " +
    dhis2_adm3["yearmon"].astype(str)
)
dhis2_adm3["location_short"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"]
dhis2_adm3["location_full"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"] + " ~ " + dhis2_adm3["adm3"]

# aggregate at adm2 level
group_cols_adm2 = ["adm0", "adm1", "adm2", "year", "month", "yearmon"]
dhis2_adm2 = aggregate_admin(dhis2_df, group_cols_adm2, sum_cols)
dhis2_adm2["record_id"] = vdigest(
    dhis2_adm2["adm0"] + " " + dhis2_adm2["adm1"] + " " +
    dhis2_adm2["adm2"] + " " + dhis2_adm2["yearmon"].astype(str)
)
dhis2_adm2["location_short"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
dhis2_adm2["location_full"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]

# aggregate at adm1 level
group_cols_adm1 = ["adm0", "adm1", "year", "month", "yearmon"]
dhis2_adm1 = aggregate_admin(dhis2_df, group_cols_adm1, sum_cols)
dhis2_adm1["record_id"] = vdigest(
    dhis2_adm1["adm0"] + " " + dhis2_adm1["adm1"] + " " +
    dhis2_adm1["yearmon"].astype(str)
)
dhis2_adm1["location_short"] = dhis2_adm1["adm1"]
dhis2_adm1["location_full"] = dhis2_adm1["adm1"]

# create data dictionaries for each level (filter to relevant columns only)
id_cols = ["n_facilities", "record_id", "location_short", "location_full"]

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_cols = group_cols_adm3 + [c for c in sum_cols if c in dhis2_adm3.columns] + id_cols
adm3_dict = full_data_dict[full_data_dict["snt_variable"].isin(adm3_cols)]

# adm2 dict: excludes adm3 (not present at this level)
adm2_cols = group_cols_adm2 + [c for c in sum_cols if c in dhis2_adm2.columns] + id_cols
adm2_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm2_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm3"]))
]

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_cols = group_cols_adm1 + [c for c in sum_cols if c in dhis2_adm1.columns] + id_cols
adm1_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm1_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm2", "adm3"]))
]

# save adm3 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm3_data.xlsx") as writer:
    dhis2_adm3.to_excel(writer, sheet_name="data", index=False)
    adm3_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm3.to_parquet(save_path / "sle_dhis2_adm3_data.parquet", compression="zstd", index=False)

# save adm2 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm2_data.xlsx") as writer:
    dhis2_adm2.to_excel(writer, sheet_name="data", index=False)
    adm2_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm2.to_parquet(save_path / "sle_dhis2_adm2_data.parquet", compression="zstd", index=False)

# save adm1 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm1_data.xlsx") as writer:
    dhis2_adm1.to_excel(writer, sheet_name="data", index=False)
    adm1_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm1.to_parquet(save_path / "sle_dhis2_adm1_data.parquet", compression="zstd", index=False)
Output

To adapt the code:

  • Lines 2-30: Update sum_cols to include all numeric indicators relevant to your dataset.
  • Lines 33-46: The aggregate_admin() function handles grouping by admin level.
  • Lines 49-75: Adjust grouping columns if your admin hierarchy differs. The record_id is created using vdigest() with the full admin hierarchy (e.g., adm0 + adm1 + adm2 + adm3 + yearmon for adm3 level) to ensure unique identifiers. The code also creates location_short and location_full labels for each level.
  • Lines 78-90: Filter the data dictionary to match columns at each level, excluding admin columns not present at that level (e.g., adm3 excluded from adm2 dictionary).
  • Lines 93-108: Update filename prefixes (e.g., sle_) to match your country code.

Once updated, run the code to aggregate and save data at adm3, adm2, and adm1 levels.

Summary

We’ve now walked through the key steps for cleaning, reshaping, and aggregating DHIS2 malaria data to make it SNT-ready. The code covered everything from importing raw files, standardising column names, computing key indicators, and saving outputs at multiple administrative levels. For convenience, a full end-to-end script is included below in a folded code block. You can reuse or adapt this for your own country context by adjusting column names, file paths, and administrative levels as needed.

Full code

Find the full code script for accessing and processing routine data below.

  • R
  • Python
Show full code
################################################################################
#################### ~ DHIS2 data preprocessing full code ~ ####################
################################################################################

### Step 1: Import Packages and Data -------------------------------------------

#### Step 1.1: Import Packages -------------------------------------------------

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# install or load relevant packages
pacman::p_load(
  tidyverse, # data manipulation, reshaping and plotting
  rio,       # import/export multiple file types
  DT,        # interactive table preview
  here,      # project-relative file paths
  readxl,    # read excel files
  writexl,   # write excel files
  knitr      # render code in quarto/rmarkdown
)

#### Step 1.2: Import Data -----------------------------------------------------

# set the path to the data
core_routine_path <- here::here(
  "1.1.2_epidemiology",
  "1.1.2a_routine_surveillance",
  "raw"
)

# set the full dhis2 path
path_to_dhis2 <- here::here(core_routine_path, "sle_dhis2_2015_2022.xlsx")

dhis2_df <- rio::import_list(
  file = list.files(
    path = core_routine_path,
    pattern = "\\.(xls)$",    # the file extensions
    full.names = TRUE
  ),
  rbind = TRUE,                # stack all tabs/files into one table
  rbind_label = "sheet_admin", # source name stored in column 'sheet_admin'
  rbind_fill = TRUE            # fill missing columns with NA when stacking
)

# check data
dhis2_df |>
  dplyr::select(1:20) |>
  dplyr::glimpse()

### Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts --

#### Step 2.1: Verify All Expected Groups Are Present After Import -------------

# check the adm's' in our data
dhis2_df |>
  dplyr::distinct(orgunitlevel3)

#### Step 2.2: Check for Completely Missing Variables --------------------------

# identify complete missing values per column
dhis2_df |>
  dplyr::summarise(
    dplyr::across(
      dplyr::everything(),
      ~ mean(is.na(.x)) * 100
    )
  ) |>
  tidyr::pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "pct_missing"
  ) |>
  dplyr::filter(pct_missing == 100)

### Step 3: Rename Columns -----------------------------------------------------

# import data dictionary
data_dict <- rio::import(
  here::here(core_routine_path, "sle_dhis2_data_dict.xlsx")
)
# create rename vector: object = old names, names = new names
rename_vector <- stats::setNames(
  object = data_dict$indicator_label,
  nm = data_dict$snt_var
)

# rename dhis2 data
dhis2_df <- dhis2_df |>
  dplyr::rename(
    !!!rename_vector
  ) |>
  # we drop orgunitlevel5 because it s same as hf
  dplyr::select(-orgunitlevel5)

# check the names
colnames(dhis2_df)

### Step 4: Standardise the Date Columns ---------------------------------------

# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
Sys.setlocale("LC_TIME", "en_US.UTF-8")

dhis2_df <- dhis2_df |>
  # parse the raw date into a proper Date object
  dplyr::mutate(
    date = lubridate::parse_date_time(
      periodname,
      orders = c("B Y", "b Y")
    ) |>
      as.Date()
  ) |>
  # create year, month, and ordered yearmon label
  dplyr::mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    # human-readable label, e.g. "Jan 2020"
    yearmon = format(date, "%b %Y"),
    # order factor by actual chronological date
    yearmon = factor(yearmon, levels = unique(yearmon[order(date)]))
  )

# check the head
dhis2_df |>
  dplyr::select(date, year, month, yearmon) |>
  head()

### Step 5: Managing Location Columns ------------------------------------------

#### Steps 5.1: Harmonise Administrative Names ---------------------------------

# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")

# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"

# get adm3 shapfile to use as reference lookup
shp_adm3 <- sntutils::read(
  here::here("english/data_r/shapefiles/sle_spatial_adm3_2021.rds")
) |>
  sf::st_drop_geometry() # drop geometry, we just need the admin names

# standardise admin names
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    adm0 = toupper(adm0),
    adm2 = toupper(adm1),
    adm3 = toupper(adm3),
    hf = toupper(hf),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2 = stringr::str_remove_all(adm2, " DISTRICT"),
    adm3 = stringr::str_remove_all(adm3, " CHIEFDOM")
  ) |>
  dplyr::mutate(
    # assign provinces based on district groupings
    # (no adm1 as current adm1 is adm2)
    adm1 = dplyr::case_when(
      adm2 %in% c("KAILAHUN", "KENEMA", "KONO") ~ "EASTERN",
      adm2 %in% c("BOMBALI", "FALABA", "KOINADUGU", "TONKOLILI") ~ "NORTH EAST",
      adm2 %in% c("KAMBIA", "KARENE", "PORT LOKO") ~ "NORTH WEST",
      adm2 %in% c("BO", "BONTHE", "MOYAMBA", "PUJEHUN") ~ "SOUTHERN",
      adm2 %in% c("WESTERN AREA RURAL", "WESTERN AREA URBAN") ~ "WESTERN"
    )
  )

# harmonise admin names between dhis2 data and shapefile
dhis2_df <-
  sntutils::prep_geonames(
    target_df = dhis2_df,
    lookup_df = lookup_keys,
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    interactive = TRUE,
    cache_path = here::here(cache_path, "geoname_decisions_cache.xlsx")
  )

#### Steps 5.1: Create unique identifiers and location labels ------------------

# create identifiers
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # create unique health facility identifier from admin hierarchy
    hf_uid = sntutils::vdigest(
      paste0(adm0, adm1, adm2, adm3, hf),
      algo = "xxhash32"
    ),
    # create location labels for mapping
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, hf, sep = " ~ "),
    # generate consistent record id (facility + time period)
    record_id = sntutils::vdigest(
      paste(hf_uid, yearmon),
      algo = "xxhash32"
    )
  )

# check
dhis2_df |>
  dplyr::arrange(location_full) |>
  dplyr::distinct(location_full, hf_uid, record_id) |>
  head()

### Step 6: Compute Variables --------------------------------------------------

#### Step 6.1: Compute Indicator Totals and New Variables ----------------------

# smart row-wise sum with missing data handling
fallback_row_sum <- function(..., min_present = 1, .keep_zero_as_zero = TRUE) {
  vars_matrix <- cbind(...)
  valid_count <- rowSums(!is.na(vars_matrix))
  raw_sum <- rowSums(vars_matrix, na.rm = TRUE)

  ifelse(valid_count >= min_present, raw_sum, NA_real_)
}

# fallback absolute difference between two vectors
fallback_diff <- function(col1, col2, minimum = 0) {
  dplyr::case_when(
    is.na(col1) & is.na(col2) ~ NA_real_,
    is.na(col1) ~ pmax(col2, minimum),
    is.na(col2) ~ pmax(col1, minimum),
    TRUE ~ pmax(col1 - col2, minimum)
  )
}

# compute indicator totals in DHIS2 data
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # outpatient visits
    allout = fallback_row_sum(allout_u5, allout_ov5),

    # suspected cases
    susp = fallback_row_sum(
      susp_u5_hf,
      susp_5_14_hf,
      susp_ov15_hf,
      susp_u5_com,
      susp_5_14_com,
      susp_ov15_com
    ),

    # tested cases
    test_hf = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    test_com = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com,
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com,
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    test = fallback_row_sum(test_hf, test_com),

    # confirmed cases (HF and COM)
    conf_hf = fallback_row_sum(
      test_pos_mic_u5_hf,
      test_pos_mic_5_14_hf,
      test_pos_mic_ov15_hf,
      tes_pos_rdt_u5_hf,
      tes_pos_rdt_5_14_hf,
      tes_pos_rdt_ov15_hf
    ),

    conf_com = fallback_row_sum(
      tes_pos_rdt_u5_com,
      tes_pos_rdt_5_14_com,
      tes_pos_rdt_ov15_com
    ),

    conf = fallback_row_sum(conf_hf, conf_com),

    # treated cases
    maltreat_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com,
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    maltreat_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf,
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    maltreat = fallback_row_sum(maltreat_hf, maltreat_com),

    # presumed cases
    pres_com = fallback_diff(maltreat_com, conf_com),
    pres_hf = fallback_diff(maltreat_hf, conf_hf),
    pres = fallback_row_sum(pres_com, pres_hf),

    # malaria admissions
    maladm = fallback_row_sum(
      maladm_u5,
      maladm_5_14,
      maladm_ov15
    ),

    # malaria deaths
    maldth = fallback_row_sum(
      maldth_u5,
      maldth_1_59m,
      maldth_10_14,
      maldth_5_9,
      maldth_5_14,
      maldth_ov15,
      maldth_fem_ov15,
      maldth_mal_ov15
    ),

    # AGE-GROUP SPECIFIC AGGREGATIONS
    # Tested cases by age group (HF only)
    test_hf_u5 = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    test_hf_5_14 = fallback_row_sum(
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    test_hf_ov15 = fallback_row_sum(
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # Tested cases by age group (Community only)
    test_com_u5 = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com
    ),

    test_com_5_14 = fallback_row_sum(
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com
    ),

    test_com_ov15 = fallback_row_sum(
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    # Total tested by age group (HF + Community)
    test_u5 = fallback_row_sum(test_hf_u5, test_com_u5),
    test_5_14 = fallback_row_sum(test_hf_5_14, test_com_5_14),
    test_ov15 = fallback_row_sum(test_hf_ov15, test_com_ov15),

    # Suspected cases by age group (HF only)
    susp_hf_u5 = susp_u5_hf,

    susp_hf_5_14 = susp_5_14_hf,

    susp_hf_ov15 = susp_ov15_hf,

    # Suspected cases by age group (Community only)
    susp_com_u5 = susp_u5_com,

    susp_com_5_14 = susp_5_14_com,

    susp_com_ov15 = susp_ov15_com,

    # Total suspected by age group (HF + Community)
    susp_u5 = fallback_row_sum(susp_hf_u5, susp_com_u5),
    susp_5_14 = fallback_row_sum(susp_hf_5_14, susp_com_5_14),
    susp_ov15 = fallback_row_sum(susp_hf_ov15, susp_com_ov15),

    # Confirmed cases by age group (HF only)
    conf_hf_u5 = fallback_row_sum(
      test_pos_mic_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    conf_hf_5_14 = fallback_row_sum(
      test_pos_mic_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    conf_hf_ov15 = fallback_row_sum(
      test_pos_mic_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # Confirmed cases by age group (Community only)
    conf_com_u5 = tes_pos_rdt_u5_com,
    conf_com_5_14 = tes_pos_rdt_5_14_com,
    conf_com_ov15 = tes_pos_rdt_ov15_com,

    # Total confirmed cases by age group (HF + Community)
    conf_u5 = fallback_row_sum(conf_hf_u5, conf_com_u5),
    conf_5_14 = fallback_row_sum(conf_hf_5_14, conf_com_5_14),
    conf_ov15 = fallback_row_sum(conf_hf_ov15, conf_com_ov15),

    # Treated cases by age group (HF only)
    maltreat_hf_u5 = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf
    ),

    maltreat_hf_5_14 = fallback_row_sum(
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf
    ),

    maltreat_hf_ov15 = fallback_row_sum(
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    # Treated cases by age group (Community only)
    maltreat_com_u5 = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com
    ),

    maltreat_com_5_14 = fallback_row_sum(
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com
    ),

    maltreat_com_ov15 = fallback_row_sum(
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    # Total treated cases by age group (HF + Community)
    maltreat_u5 = fallback_row_sum(maltreat_hf_u5, maltreat_com_u5),
    maltreat_5_14 = fallback_row_sum(maltreat_hf_5_14, maltreat_com_5_14),
    maltreat_ov15 = fallback_row_sum(maltreat_hf_ov15, maltreat_com_ov15),

    # Total treated cases within 24 hours (HF only)
    maltreat_u24_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_u24_ov15_hf
    ),

    # Total treated cases after 24 hours (HF only)
    maltreat_ov24_hf = fallback_row_sum(
      maltreat_ov24_u5_hf,
      maltreat_ov24_5_14_hf,
      maltreat_ov24_ov15_hf
    ),

    # Total treated cases within 24 hours (Community only)
    maltreat_u24_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_u24_ov15_com
    ),

    # Total treated cases after 24 hours (Community only)
    maltreat_ov24_com = fallback_row_sum(
      maltreat_ov24_u5_com,
      maltreat_ov24_5_14_com,
      maltreat_ov24_ov15_com
    ),

    # Overall totals (HF + Community)
    maltreat_u24_total = fallback_row_sum(maltreat_u24_hf, maltreat_u24_com),
    maltreat_ov24_total = fallback_row_sum(maltreat_ov24_hf, maltreat_ov24_com),

    # Presumed cases by age group
    pres_com_u5 = fallback_diff(maltreat_com_u5, conf_com_u5),
    pres_com_5_14 = fallback_diff(maltreat_com_5_14, conf_com_5_14),
    pres_com_ov15 = fallback_diff(maltreat_com_ov15, conf_com_ov15),

    pres_hf_u5 = fallback_diff(maltreat_hf_u5, conf_hf_u5),
    pres_hf_5_14 = fallback_diff(maltreat_hf_5_14, conf_hf_5_14),
    pres_hf_ov15 = fallback_diff(maltreat_hf_ov15, conf_hf_ov15),

    pres_u5 = fallback_row_sum(pres_com_u5, pres_hf_u5),
    pres_5_14 = fallback_row_sum(pres_com_5_14, pres_hf_5_14),
    pres_ov15 = fallback_row_sum(pres_com_ov15, pres_hf_ov15)
  )

# check to see the aggregation worked
dhis2_df |>
  dplyr::filter(
    record_id %in%
      c("6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842")
  ) |>
  dplyr::arrange(allout) |>
  dplyr::select(
    allout,
    allout_u5,
    allout_ov5,
    pres_hf_u5,
    maltreat_hf_u5,
    conf_hf_u5
  ) |>
  head()

#### Step 6.2: Quality Control of Indicator Totals -----------------------------

# create mismatch flags for each indicator group
mismatch_summary <- dhis2_df |>
  dplyr::summarise(
    # outpatient checks
    allout_mismatch = sum(
      allout != (allout_u5 + allout_ov5),
      na.rm = TRUE
    ),

    # malaria admissions check
    maladm_mismatch = sum(
      maladm != (maladm_u5 + maladm_5_14 + maladm_ov15),
      na.rm = TRUE
    ),

    # total tests check
    test_mismatch = sum(
      test != (test_hf + test_com),
      na.rm = TRUE
    ),

    # total confirmed check
    conf_mismatch = sum(
      conf != (conf_hf + conf_com),
      na.rm = TRUE
    ),

    # total treated check
    maltreat_mismatch = sum(
      maltreat != (maltreat_hf + maltreat_com),
      na.rm = TRUE
    ),

    # total presumed check
    pres_mismatch = sum(
      pres != (pres_hf + pres_com),
      na.rm = TRUE
    ),

    # malaria deaths check
    maldth_mismatch = sum(
      maldth !=
        (maldth_1_59m +
          maldth_u5 +
          maldth_5_9 +
          maldth_10_14 +
          maldth_5_14 +
          maldth_fem_ov15 +
          maldth_mal_ov15 +
          maldth_ov15),
      na.rm = TRUE
    ),

    # tested cases by age group
    test_u5_mismatch = sum(
      test_u5 != (test_hf_u5 + test_com_u5),
      na.rm = TRUE
    ),
    test_5_14_mismatch = sum(
      test_5_14 != (test_hf_5_14 + test_com_5_14),
      na.rm = TRUE
    ),
    test_ov15_mismatch = sum(
      test_ov15 != (test_hf_ov15 + test_com_ov15),
      na.rm = TRUE
    ),

    # confirmed cases by age group
    conf_u5_mismatch = sum(
      conf_u5 != (conf_hf_u5 + conf_com_u5),
      na.rm = TRUE
    ),
    conf_5_14_mismatch = sum(
      conf_5_14 != (conf_hf_5_14 + conf_com_5_14),
      na.rm = TRUE
    ),
    conf_ov15_mismatch = sum(
      conf_ov15 != (conf_hf_ov15 + conf_com_ov15),
      na.rm = TRUE
    ),

    # presumed cases by age group
    pres_u5_mismatch = sum(
      pres_u5 != (pres_hf_u5 + pres_com_u5),
      na.rm = TRUE
    ),
    pres_5_14_mismatch = sum(
      pres_5_14 != (pres_hf_5_14 + pres_com_5_14),
      na.rm = TRUE
    ),
    pres_ov15_mismatch = sum(
      pres_ov15 != (pres_hf_ov15 + pres_com_ov15),
      na.rm = TRUE
    ),

    # suspected cases by age group
    susp_u5_mismatch = sum(
      susp_u5 != (susp_u5_hf + susp_u5_com),
      na.rm = TRUE
    ),
    susp_5_14_mismatch = sum(
      susp_5_14 != (susp_5_14_hf + susp_5_14_com),
      na.rm = TRUE
    ),
    susp_ov15_mismatch = sum(
      susp_ov15 != (susp_ov15_hf + susp_ov15_com),
      na.rm = TRUE
    ),

    # treated cases by age group
    maltreat_u5_mismatch = sum(
      maltreat_u5 != (maltreat_hf_u5 + maltreat_com_u5),
      na.rm = TRUE
    ),
    maltreat_5_14_mismatch = sum(
      maltreat_5_14 != (maltreat_hf_5_14 + maltreat_com_5_14),
      na.rm = TRUE
    ),
    maltreat_ov15_mismatch = sum(
      maltreat_ov15 != (maltreat_hf_ov15 + maltreat_com_ov15),
      na.rm = TRUE
    ),

    # treatment timing checks
    maltreat_u24_mismatch = sum(
      maltreat_u24_total != (maltreat_u24_hf + maltreat_u24_com),
      na.rm = TRUE
    ),
    maltreat_ov24_mismatch = sum(
      maltreat_ov24_total != (maltreat_ov24_hf + maltreat_ov24_com),
      na.rm = TRUE
    )
  ) |>
  # pivot to long format for easier viewing
  tidyr::pivot_longer(
    cols = dplyr::everything(),
    names_to = "indicator",
    values_to = "n_mismatches"
  ) |>
  # filter to show only indicators with mismatches
  dplyr::filter(n_mismatches > 0)

mismatch_summary

#### Step 6.3: Export Rows with Incoherent Totals ------------------------------

# Identify rows with incoherent totals
incoherent_rows <- dhis2_df |>
  dplyr::filter(
    # outpatient checks
    allout != (allout_u5 + allout_ov5) |
    # malaria admissions check
    maladm != (maladm_u5 + maladm_5_14 + maladm_ov15) |
    # total tests check
    test != (test_hf + test_com) |
    # total confirmed check
    conf != (conf_hf + conf_com) |
    # total treated check
    maltreat != (maltreat_hf + maltreat_com) |
    # total presumed check
    pres != (pres_hf + pres_com)
  ) |>
  dplyr::select(
    hf, periodname,
    dplyr::matches("allout|maladm|test|conf|maltreat|pres")
  )

# set path for saving
output_file <- here::here(
  "1.1.2_epidemiology",
  "1.1.2a_routine_surveillance",
  "processed",
  "sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
rio::export(incoherent_rows, file = output_file)

#### Step 6.3: Add IPD/OPD Specification ---------------------------------------

# option 1: classify based on facility name patterns
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    facility_type = dplyr::case_when(
      # inpatient facilities (hospitals)
      stringr::str_detect(
        hf,
        regex("hospital|hosp", ignore_case = TRUE)
      ) ~ "IPD",
      stringr::str_detect(
        hf,
        regex("district hospital|regional hospital", ignore_case = TRUE)
      ) ~ "IPD",
      # outpatient facilities (clinics, health posts, community health)
      stringr::str_detect(
        hf,
        regex(
          "CHP|CHC|MCHP|clinic|health post|health centre",
          ignore_case = TRUE
        )
      ) ~ "OPD",
      # default to OPD if no pattern matched
      TRUE ~ "OPD"
    )
  )

# option 2: classify based on presence of inpatient indicators
dhis2_df <- dhis2_df |>
  dplyr::group_by(hf_uid) |>
  dplyr::mutate(
    # facility has IPD if it ever reports admissions or deaths
    has_ipd = any(!is.na(maladm) & maladm > 0, na.rm = TRUE) |
      any(!is.na(maldth) & maldth > 0, na.rm = TRUE),
    facility_type = dplyr::if_else(has_ipd, "IPD", "OPD")
  ) |>
  dplyr::ungroup() |>
  dplyr::select(-has_ipd)

# option 3: use a reference lookup file
hf_path <-
  rio::import(
    here::here(
      "01_data",
      "1.1_foundational",
      "1.1b_health_facilities",
      "processed",
      "health_facility_master_list.xlsx"
    )
  )

dhis2_df <- dhis2_df |>
  dplyr::left_join(
    facility_lookup |> dplyr::select(hf_uid, facility_type),
    by = "hf_uid"
  )

# check classification distribution
dhis2_df |>
  dplyr::distinct(hf_uid, facility_type) |>
  dplyr::count(facility_type)

### Step 7: Finalize Data ------------------------------------------------------

#### Step 7.1: Troubleshoot Duplicate HF-Month Records with Different Data -----

# identify duplicate hf-month combinations
duplicates <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::ungroup()

# count number of duplicate pairs
n_duplicates <- duplicates |>
  dplyr::distinct(record_id) |>
  nrow()

# if duplicates exist, inspect them
if (n_duplicates > 0) {
  # view duplicate records with key indicators
  duplicate_details <- duplicates |>
    dplyr::select(
      record_id,
      adm0, adm1, adm2, adm3, hfm yearmon,
      allout,
      test,
      conf,
      maltreat
    ) |>
    dplyr::arrange(record_id)

  print(duplicate_details)

  # export for review with SNT team
  rio::export(
    duplicate_details,
    here::here(
      "1.1.2_epidemiology",
      "1.1.2a_routine_surveillance",
      "processed",
      "sle_duplicate_records_dhis2.xlsx"
    )
  )
}

# option 1: keep first record (if duplicates are exact copies)
dhis2_df <- dhis2_df |>
  dplyr::distinct(record_id, .keep_all = TRUE)

# option 2: keep record with most complete data
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id) |>
  dplyr::slice_max(
    # count non-NA values across indicator columns
    order_by = rowSums(!is.na(dplyr::across(dplyr::where(is.numeric)))),
    n = 1,
    with_ties = FALSE
  ) |>
  dplyr::ungroup()

# option 3: sum duplicates (if records represent partial reports)
dhis2_df <- dhis2_df |>
  dplyr::group_by(record_id, hf, adm0, adm1, adm2, adm3) |>
  dplyr::summarise(
    dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)),
    .groups = "drop"
  )

#### Step 7.2: Generate Final Data Dictionary ----------------------------------

# define descriptions for computed/derived columns
computed_vars <- tibble::tribble(
  ~snt_var,              ~indicator_label,
  # time columns
  "date",                "Report date (YYYY-MM-DD)",
  "year",                "Report year",
  "month",               "Report month (1-12)",
  "yearmon",             "Year-month label (e.g., Jan 2020)",
  # identifier columns
  "hf_uid",              "Unique health facility identifier (hash)",
  "record_id",           "Unique record identifier (facility + month hash)",
  "location_short",      "Location label: adm1 ~ adm2",
  "location_full",       "Location label: adm1 ~ adm2 ~ adm3 ~ hf",
  "facility_type",       "Facility type (IPD/OPD)",
  # aggregated totals
  "allout",              "Total outpatient visits (allout_u5 + allout_ov5)",
  "susp",                "Total suspected cases (all ages, HF + COM)",
  "test",                "Total tested (test_hf + test_com)",
  "test_hf",             "Total tested at health facility",
  "test_com",            "Total tested in community",
  "conf",                "Total confirmed cases (conf_hf + conf_com)",
  "conf_hf",             "Total confirmed cases at health facility",
  "conf_com",            "Total confirmed cases in community",
  "maltreat",            "Total treated cases (maltreat_hf + maltreat_com)",
  "maltreat_hf",         "Total treated cases at health facility",
  "maltreat_com",        "Total treated cases in community",
  "pres",                "Total presumed cases (pres_hf + pres_com)",
  "pres_hf",             "Total presumed cases at health facility",
  "pres_com",            "Total presumed cases in community",
  "maladm",              "Total malaria admissions (all ages)",
  "maldth",              "Total malaria deaths (all ages)",
  # age-group totals
  "test_u5",             "Total tested under 5 (HF + COM)",
  "test_5_14",           "Total tested 5-14 years (HF + COM)",
  "test_ov15",           "Total tested over 15 years (HF + COM)",
  "test_hf_u5",          "Tested under 5 at health facility",
  "test_hf_5_14",        "Tested 5-14 years at health facility",
  "test_hf_ov15",        "Tested over 15 years at health facility",
  "test_com_u5",         "Tested under 5 in community",
  "test_com_5_14",       "Tested 5-14 years in community",
  "test_com_ov15",       "Tested over 15 years in community",
  "susp_u5",             "Suspected cases under 5 (HF + COM)",
  "susp_5_14",           "Suspected cases 5-14 years (HF + COM)",
  "susp_ov15",           "Suspected cases over 15 years (HF + COM)",
  "susp_hf_u5",          "Suspected cases under 5 at health facility",
  "susp_hf_5_14",        "Suspected cases 5-14 years at health facility",
  "susp_hf_ov15",        "Suspected cases over 15 years at health facility",
  "susp_com_u5",         "Suspected cases under 5 in community",
  "susp_com_5_14",       "Suspected cases 5-14 years in community",
  "susp_com_ov15",       "Suspected cases over 15 years in community",
  "conf_u5",             "Confirmed cases under 5 (HF + COM)",
  "conf_5_14",           "Confirmed cases 5-14 years (HF + COM)",
  "conf_ov15",           "Confirmed cases over 15 years (HF + COM)",
  "conf_hf_u5",          "Confirmed cases under 5 at health facility",
  "conf_hf_5_14",        "Confirmed cases 5-14 years at health facility",
  "conf_hf_ov15",        "Confirmed cases over 15 years at health facility",
  "conf_com_u5",         "Confirmed cases under 5 in community",
  "conf_com_5_14",       "Confirmed cases 5-14 years in community",
  "conf_com_ov15",       "Confirmed cases over 15 years in community",
  "maltreat_u5",         "Treated cases under 5 (HF + COM)",
  "maltreat_5_14",       "Treated cases 5-14 years (HF + COM)",
  "maltreat_ov15",       "Treated cases over 15 years (HF + COM)",
  "maltreat_hf_u5",      "Treated cases under 5 at health facility",
  "maltreat_hf_5_14",    "Treated cases 5-14 years at health facility",
  "maltreat_hf_ov15",    "Treated cases over 15 years at health facility",
  "maltreat_com_u5",     "Treated cases under 5 in community",
  "maltreat_com_5_14",   "Treated cases 5-14 years in community",
  "maltreat_com_ov15",   "Treated cases over 15 years in community",
  "maltreat_u24_hf",     "Treated cases within 24hrs at health facility",
  "maltreat_ov24_hf",    "Treated cases after 24hrs at health facility",
  "maltreat_u24_com",    "Treated cases within 24hrs in community",
  "maltreat_ov24_com",   "Treated cases after 24hrs in community",
  "maltreat_u24_total",  "Total treated cases within 24hrs (HF + COM)",
  "maltreat_ov24_total", "Total treated cases after 24hrs (HF + COM)",
  "pres_u5",             "Presumed cases under 5 (HF + COM)",
  "pres_5_14",           "Presumed cases 5-14 years (HF + COM)",
  "pres_ov15",           "Presumed cases over 15 years (HF + COM)",
  "pres_hf_u5",          "Presumed cases under 5 at health facility",
  "pres_hf_5_14",        "Presumed cases 5-14 years at health facility",
  "pres_hf_ov15",        "Presumed cases over 15 years at health facility",
  "pres_com_u5",         "Presumed cases under 5 in community",
  "pres_com_5_14",       "Presumed cases 5-14 years in community",
  "pres_com_ov15",       "Presumed cases over 15 years in community"
)

# combine original and computed dictionaries
full_data_dict <- dplyr::bind_rows(
  data_dict,
  computed_vars
) |>
  # keep only columns present in final dataset
  dplyr::filter(snt_var %in% names(dhis2_df)) |>
  dplyr::arrange(snt_var) |>
  dplyr::select(snt_variable = snt_var,  label = indicator_label)

# check
full_data_dict |>
    head()

#### Step 7.3: Arrange Final Column Order --------------------------------------

# arrange columns in logical order
dhis2_df <- dhis2_df |>
  dplyr::select(
    # identifiers
    record_id,
    # location hierarchy
    adm0,
    adm1,
    adm2,
    adm3,
    hf,
    hf_uid,
    location_short,
    location_full,
    facility_type,
    # time variables
    date,
    yearmon,
    year,
    month,
    # all remaining indicator columns
    dplyr::everything()
  )

# check column order
colnames(dhis2_df) |> head(20)

### Step 8: Aggregate and Save Data --------------------------------------------

#### Step 8.1: Save Data at the Health Facility Level --------------------------

# set up output path
save_path <- here::here(
  "01_data",
  "02_epidemiology",
  "2a_routine_surveillance",
  "processed"
)

# create list to save
dhis2_hf_list <- list(
  data = dhis2_df,
  dictionary = full_data_dict
)

# save to xlsx
rio::export(
  dhis2_hf_list,
  here::here(save_path, "sle_dhis2_health_facility_data.xlsx")
)

# save to rds
rio::export(
  dhis2_hf_list,
  here::here(save_path, "sle_dhis2_health_facility_data.rds")
)

#### Step 8.2: Aggregate and Save Data at Each Admin Unit Level ----------------

# define numeric columns to sum (excluding HF-level identifiers)
sum_cols <- c(
  # totals
  "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
  # by location
  "test_hf", "test_com", "conf_hf", "conf_com",
  "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
  # by age group - totals
  "allout_u5", "allout_ov5",
  "test_u5", "test_5_14", "test_ov15",
  "conf_u5", "conf_5_14", "conf_ov15",
  "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
  "pres_u5", "pres_5_14", "pres_ov15",
  "susp_u5", "susp_5_14", "susp_ov15",
  "maladm_u5", "maladm_5_14", "maladm_ov15",
  "maldth_u5", "maldth_5_14", "maldth_ov15",
  # by age and location
  "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
  "test_com_u5", "test_com_5_14", "test_com_ov15",
  "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
  "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
  "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
  "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
  "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
  "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
  # treatment timing
  "maltreat_u24_hf", "maltreat_ov24_hf",
  "maltreat_u24_com", "maltreat_ov24_com",
  "maltreat_u24_total", "maltreat_ov24_total"
)

# function to aggregate at a given admin level
aggregate_admin <- function(df, group_cols, sum_cols) {
  df |>
    dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |>
    dplyr::summarise(
      dplyr::across(
        dplyr::any_of(sum_cols),
        ~ sum(.x, na.rm = TRUE)
      ),
      n_facilities = dplyr::n(),
      .groups = "drop"
    )
}

# aggregate at adm3 level
group_cols_adm3 <- c("adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon")
dhis2_adm3 <- aggregate_admin(dhis2_df, group_cols_adm3, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, adm3, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, adm3, sep = " ~ ")
  )

# aggregate at adm2 level
group_cols_adm2 <- c("adm0", "adm1", "adm2", "year", "month", "yearmon")
dhis2_adm2 <- aggregate_admin(dhis2_df, group_cols_adm2, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, adm2, yearmon),
      algo = "xxhash32"
    ),
    location_short = paste(adm1, adm2, sep = " ~ "),
    location_full = paste(adm1, adm2, sep = " ~ ")
  )

# aggregate at adm1 level
group_cols_adm1 <- c("adm0", "adm1", "year", "month", "yearmon")
dhis2_adm1 <- aggregate_admin(dhis2_df, group_cols_adm1, sum_cols) |>
  dplyr::mutate(
    record_id = sntutils::vdigest(
      paste(adm0, adm1, yearmon),
      algo = "xxhash32"
    ),
    location_short = adm1,
    location_full = adm1
  )

# create data dictionaries for each level (filter to relevant columns only)
id_cols <- c("n_facilities", "record_id", "location_short", "location_full")

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_dict <- full_data_dict |>
  dplyr::filter(snt_variable %in% c(group_cols_adm3, sum_cols, id_cols))

# adm2 dict: excludes adm3 (not present at this level)
adm2_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm2, sum_cols, id_cols),
    !snt_variable %in% c("adm3")
  )

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_dict <- full_data_dict |>
  dplyr::filter(
    snt_variable %in% c(group_cols_adm1, sum_cols, id_cols),
    !snt_variable %in% c("adm2", "adm3")
  )

# save adm3 data with dictionary
rio::export(
  list(data = dhis2_adm3, dictionary = adm3_dict),
  here::here(save_path, "sle_dhis2_adm3_data.xlsx")
)
rio::export(dhis2_adm3, here::here(save_path, "sle_dhis2_adm3_data.rds"))

# save adm2 data with dictionary
rio::export(
  list(data = dhis2_adm2, dictionary = adm2_dict),
  here::here(save_path, "sle_dhis2_adm2_data.xlsx")
)
rio::export(dhis2_adm2, here::here(save_path, "sle_dhis2_adm2_data.rds"))

# save adm1 data with dictionary
rio::export(
  list(data = dhis2_adm1, dictionary = adm1_dict),
  here::here(save_path, "sle_dhis2_adm1_data.xlsx")
)
rio::export(dhis2_adm1, here::here(save_path, "sle_dhis2_adm1_data.rds"))
Show full code
################################################################################
#################### ~ DHIS2 data preprocessing full code ~ ####################
################################################################################

### Step 1: Import Packages and Data -------------------------------------------

#### Step 1.1: Import Packages -------------------------------------------------

import pandas as pd  # load core data tools
from pathlib import Path  # handle file system paths
from pyprojroot import here  # build project-relative paths
import os  # optional path utilities
import locale  # for setting language locale
import geopandas as gpd  # for importing shapefiles
import xxhash  # for hashing
import pyarrow.parquet as pq  # for exportin parquet files
import pyarrow as pa  # for exportin parquet files

#### Step 1.2: Import Data -----------------------------------------------------

# set the full dhis2 path
core_routine_path = Path(
    here("1.1.2_epidemiology/1.1.2a_routine_surveillance/raw")
)

# set the full dhis2 path
path_to_dhis2 = core_routine_path / "sle_dhis2_2015_2022.xlsx"

# set file extension
extension = "xls"

# find all files with specified extension in directory
files = list(core_routine_path.glob(f"*.{extension}"))

# initialise dataframe
dhis2_df = pd.DataFrame()

# iterate over files, concatenate stack as one dataframe
for file in files:
    temp = pd.read_excel(file, sheet_name="Sheet1")
    dhis2_df = pd.concat([dhis2_df, temp])

# Inspect data
dhis2_df.head(10)

### Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts --

#### Step 2.1: Verify All Expected Groups Are Present After Import -------------

# check the adm's' in our data
dhis2_df[["orgunitlevel3"]].drop_duplicates().reset_index(drop=True)

#### Step 2.2: Check for Completely Missing Variables --------------------------

# compute percent missing for each column
pct_missing = dhis2_df.isna().mean() * 100

# convert to tidy/long format
missing_table = (
    pct_missing.reset_index()
    .rename(columns={"index": "variable", 0: "pct_missing"})
)

# filter for completely missing columns
missing_table[missing_table["pct_missing"] == 100]

### Step 3: Rename Columns -----------------------------------------------------

# import data dictionary
dict_path = os.path.join(core_routine_path, "sle_dhis2_data_dict.xlsx")
data_dict = pd.read_excel(dict_path)

# build rename dictionary: {old_name: new_name}
rename_dict = dict(zip(data_dict["indicator_label"], data_dict["snt_var"]))

# rename dhis2 data
dhis2_df = dhis2_df.rename(columns=rename_dict).drop(
    columns=["orgunitlevel5"], errors="ignore"
)

# view updated column names
dhis2_df.columns.tolist()

### Step 4: Standardise the Date Columns ---------------------------------------

# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
locale.setlocale(locale.LC_TIME, "en_US.UTF-8")

# parse the raw date into a proper datetime object
dhis2_df["date"] = pd.to_datetime(
    dhis2_df["periodname"],
    format="%B %Y",
    errors="coerce"
)

# create year, month, and ordered yearmon columns
dhis2_df["year"] = dhis2_df["date"].dt.year
dhis2_df["month"] = dhis2_df["date"].dt.month
dhis2_df["yearmon"] = dhis2_df["date"].dt.strftime("%b %Y")

# check the output
dhis2_df[["date", "year", "month", "yearmon"]].head()

### Step 5: Managing Location Columns ------------------------------------------

#### Steps 5.1: Harmonise Administrative Names ---------------------------------

# If not already, install the sntutils python package from GitHub
# has a number of useful helper functions
pip install git+https://github.com/ahadi-analytics/sntutils-py.git
from sntutils.geo import prep_geonames # for name matching

# set up location to save cache
cache_path = Path("1.1_foundational/1d_cache_files")

# get adm3 shapefile to use as reference lookup
shp_adm3 = gpd.read_file(
    Path(here("english/data_r/shapefiles/sle_spatial_adm3_2021.geojson"))
).drop(columns="geometry")

# standardise admin names
dhis2_df = dhis2_df.assign(
    adm0=lambda x: x["adm0"].str.upper(),
    # the adm2 in the lookup shapefile don't have
    # "DISTRICT" in the name, so we remove it to enable matching
    adm2=lambda x: x["adm1"]
    .str.upper()
    .str.replace(" DISTRICT", "", regex=False)
    .str.strip(),
    adm3=lambda x: x["adm3"]
    .str.upper()
    .str.replace(" CHIEFDOM", "", regex=False)
    .str.strip(),
    hf=lambda x: x["hf"]
    .str.upper()
)

# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
district_to_province = {
    "KAILAHUN": "EASTERN", "KENEMA": "EASTERN", "KONO": "EASTERN",
    "BOMBALI": "NORTH EAST", "FALABA": "NORTH EAST",
    "KOINADUGU": "NORTH EAST", "TONKOLILI": "NORTH EAST",
    "KAMBIA": "NORTH WEST", "KARENE": "NORTH WEST", "PORT LOKO": "NORTH WEST",
    "BO": "SOUTHERN", "BONTHE": "SOUTHERN",
    "MOYAMBA": "SOUTHERN", "PUJEHUN": "SOUTHERN",
    "WESTERN AREA RURAL": "WESTERN", "WESTERN AREA URBAN": "WESTERN"
}

dhis2_df["adm1"] = dhis2_df["adm2"].map(district_to_province)

# harmonise admin names between dhis2 data and shapefile
# note: sntutils::prep_geonames() has no direct python equivalent
# manual fuzzy matching or exact merge required
dhis2_df = dhis2_df.merge(
    lookup_keys,
    on=["adm0", "adm1", "adm2", "adm3"],
    how="left"
)

# Harmonise admin names between population data and shapefile
dhis2_df = prep_geonames(
    target_df=dhis2_df,
    lookup_df=shp_adm3,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    cache_path=here(cache_path, "geoname_decisions_cache.xlsx")
)

#### Steps 5.1: Create unique identifiers and location labels ------------------

# helper function to create xxhash32 digest (equivalent to sntutils::vdigest)
def vdigest(x, algo="xxhash32"):
    """Create vectorised hash digest."""
    return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())

# create identifiers
dhis2_df = dhis2_df.assign(
    # create unique health facility identifier from admin hierarchy
    # use spaces to match R's paste() behavior for consistent hashes
    hf_uid=lambda x: vdigest(
        x["adm0"] + " " + x["adm1"] + " " + x["adm2"] + " " + x["adm3"] + " " + x["hf"]
    ),
    # create location labels for mapping
    location_short=lambda x: x["adm1"] + " ~ " + x["adm2"],
    location_full=lambda x: (
        x["adm1"] + " ~ " + x["adm2"] + " ~ " + x["adm3"] + " ~ " + x["hf"]
    ),
)

# generate consistent record id (facility + time period)
# use space to match R's paste() behavior
dhis2_df["record_id"] = vdigest(
    dhis2_df["hf_uid"] + " " + dhis2_df["yearmon"].astype(str)
)

# check
dhis2_df[["location_full", "hf_uid", "record_id"]].drop_duplicates().sort_values(
    "location_full"
).head()

### Step 6: Compute Variables --------------------------------------------------

#### Step 6.1: Compute Indicator Totals and New Variables ----------------------

# helper function: row-wise sum that returns value if only one non-NA
def fallback_row_sum(df, cols):
    return df[cols].sum(axis=1, skipna=True, min_count=1)

# helper function: difference with floor at 0
def fallback_diff(df, col1, col2):
    return (df[col1] - df[col2]).clip(lower=0)

dhis2_df = (
    dhis2_df.assign(
        # outpatient visits
        allout=lambda x: fallback_row_sum(x, ["allout_u5", "allout_ov5"]),
        # suspected cases
        susp=lambda x: fallback_row_sum(
            x,
            [
                "susp_u5_hf",
                "susp_5_14_hf",
                "susp_ov15_hf",
                "susp_u5_com",
                "susp_5_14_com",
                "susp_ov15_com",
            ],
        ),
        # tested cases
        test_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        test_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_neg_rdt_u5_com",
                "tes_pos_rdt_u5_com",
                "tes_neg_rdt_5_14_com",
                "tes_pos_rdt_5_14_com",
                "tes_neg_rdt_ov15_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # confirmed cases (HF and COM)
        conf_hf=lambda x: fallback_row_sum(
            x,
            [
                "test_pos_mic_u5_hf",
                "test_pos_mic_5_14_hf",
                "test_pos_mic_ov15_hf",
                "tes_pos_rdt_u5_hf",
                "tes_pos_rdt_5_14_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        conf_com=lambda x: fallback_row_sum(
            x,
            [
                "tes_pos_rdt_u5_com",
                "tes_pos_rdt_5_14_com",
                "tes_pos_rdt_ov15_com",
            ],
        ),
        # treated cases
        maltreat_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_ov24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_ov24_5_14_com",
                "maltreat_u24_ov15_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
        maltreat_hf=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_hf",
                "maltreat_ov24_u5_hf",
                "maltreat_u24_5_14_hf",
                "maltreat_ov24_5_14_hf",
                "maltreat_u24_ov15_hf",
                "maltreat_ov24_ov15_hf",
            ],
        ),
        # malaria admissions
        maladm=lambda x: fallback_row_sum(
            x, ["maladm_u5", "maladm_5_14", "maladm_ov15"]
        ),
        # malaria deaths
        maldth=lambda x: fallback_row_sum(
            x,
            [
                "maldth_u5",
                "maldth_1_59m",
                "maldth_10_14",
                "maldth_5_9",
                "maldth_5_14",
                "maldth_ov15",
                "maldth_fem_ov15",
                "maldth_mal_ov15",
            ],
        ),
        # AGE-GROUP SPECIFIC AGGREGATIONS
        # Tested cases by age group (HF only)
        test_hf_u5=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_u5_hf",
                "test_pos_mic_u5_hf",
                "tes_neg_rdt_u5_hf",
                "tes_pos_rdt_u5_hf",
            ],
        ),
        test_hf_5_14=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_5_14_hf",
                "test_pos_mic_5_14_hf",
                "tes_neg_rdt_5_14_hf",
                "tes_pos_rdt_5_14_hf",
            ],
        ),
        test_hf_ov15=lambda x: fallback_row_sum(
            x,
            [
                "test_neg_mic_ov15_hf",
                "test_pos_mic_ov15_hf",
                "tes_neg_rdt_ov15_hf",
                "tes_pos_rdt_ov15_hf",
            ],
        ),
        # Tested cases by age group (Community only)
        test_com_u5=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_u5_com", "tes_pos_rdt_u5_com"]
        ),
        test_com_5_14=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_5_14_com", "tes_pos_rdt_5_14_com"]
        ),
        test_com_ov15=lambda x: fallback_row_sum(
            x, ["tes_neg_rdt_ov15_com", "tes_pos_rdt_ov15_com"]
        ),
        # Suspected cases by age group (rename for consistency)
        susp_hf_u5=lambda x: x["susp_u5_hf"],
        susp_hf_5_14=lambda x: x["susp_5_14_hf"],
        susp_hf_ov15=lambda x: x["susp_ov15_hf"],
        susp_com_u5=lambda x: x["susp_u5_com"],
        susp_com_5_14=lambda x: x["susp_5_14_com"],
        susp_com_ov15=lambda x: x["susp_ov15_com"],
        # Confirmed cases by age group (HF only)
        conf_hf_u5=lambda x: fallback_row_sum(
            x, ["test_pos_mic_u5_hf", "tes_pos_rdt_u5_hf"]
        ),
        conf_hf_5_14=lambda x: fallback_row_sum(
            x, ["test_pos_mic_5_14_hf", "tes_pos_rdt_5_14_hf"]
        ),
        conf_hf_ov15=lambda x: fallback_row_sum(
            x, ["test_pos_mic_ov15_hf", "tes_pos_rdt_ov15_hf"]
        ),
        # Confirmed cases by age group (Community only)
        conf_com_u5=lambda x: x["tes_pos_rdt_u5_com"],
        conf_com_5_14=lambda x: x["tes_pos_rdt_5_14_com"],
        conf_com_ov15=lambda x: x["tes_pos_rdt_ov15_com"],
        # Treated cases by age group (HF only)
        maltreat_hf_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_hf", "maltreat_ov24_u5_hf"]
        ),
        maltreat_hf_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_hf", "maltreat_ov24_5_14_hf"]
        ),
        maltreat_hf_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_hf", "maltreat_ov24_ov15_hf"]
        ),
        # Treated cases by age group (Community only)
        maltreat_com_u5=lambda x: fallback_row_sum(
            x, ["maltreat_u24_u5_com", "maltreat_ov24_u5_com"]
        ),
        maltreat_com_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_u24_5_14_com", "maltreat_ov24_5_14_com"]
        ),
        maltreat_com_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_u24_ov15_com", "maltreat_ov24_ov15_com"]
        ),
        # Total treated cases within/after 24 hours
        maltreat_u24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_u24_u5_hf", "maltreat_u24_5_14_hf", "maltreat_u24_ov15_hf"],
        ),
        maltreat_ov24_hf=lambda x: fallback_row_sum(
            x,
            ["maltreat_ov24_u5_hf", "maltreat_ov24_5_14_hf", "maltreat_ov24_ov15_hf"],
        ),
        maltreat_u24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_u24_u5_com",
                "maltreat_u24_5_14_com",
                "maltreat_u24_ov15_com",
            ],
        ),
        maltreat_ov24_com=lambda x: fallback_row_sum(
            x,
            [
                "maltreat_ov24_u5_com",
                "maltreat_ov24_5_14_com",
                "maltreat_ov24_ov15_com",
            ],
        ),
    )
    .assign(
        # second pass: computed from first pass columns
        test=lambda x: fallback_row_sum(x, ["test_hf", "test_com"]),
        conf=lambda x: fallback_row_sum(x, ["conf_hf", "conf_com"]),
        maltreat=lambda x: fallback_row_sum(x, ["maltreat_hf", "maltreat_com"]),
        # totals by age group
        test_u5=lambda x: fallback_row_sum(x, ["test_hf_u5", "test_com_u5"]),
        test_5_14=lambda x: fallback_row_sum(x, ["test_hf_5_14", "test_com_5_14"]),
        test_ov15=lambda x: fallback_row_sum(x, ["test_hf_ov15", "test_com_ov15"]),
        susp_u5=lambda x: fallback_row_sum(x, ["susp_hf_u5", "susp_com_u5"]),
        susp_5_14=lambda x: fallback_row_sum(x, ["susp_hf_5_14", "susp_com_5_14"]),
        susp_ov15=lambda x: fallback_row_sum(x, ["susp_hf_ov15", "susp_com_ov15"]),
        conf_u5=lambda x: fallback_row_sum(x, ["conf_hf_u5", "conf_com_u5"]),
        conf_5_14=lambda x: fallback_row_sum(x, ["conf_hf_5_14", "conf_com_5_14"]),
        conf_ov15=lambda x: fallback_row_sum(x, ["conf_hf_ov15", "conf_com_ov15"]),
        maltreat_u5=lambda x: fallback_row_sum(
            x, ["maltreat_hf_u5", "maltreat_com_u5"]
        ),
        maltreat_5_14=lambda x: fallback_row_sum(
            x, ["maltreat_hf_5_14", "maltreat_com_5_14"]
        ),
        maltreat_ov15=lambda x: fallback_row_sum(
            x, ["maltreat_hf_ov15", "maltreat_com_ov15"]
        ),
        maltreat_u24_total=lambda x: fallback_row_sum(
            x, ["maltreat_u24_hf", "maltreat_u24_com"]
        ),
        maltreat_ov24_total=lambda x: fallback_row_sum(
            x, ["maltreat_ov24_hf", "maltreat_ov24_com"]
        ),
        # presumed cases
        pres_com=lambda x: fallback_diff(x, "maltreat_com", "conf_com"),
        pres_hf=lambda x: fallback_diff(x, "maltreat_hf", "conf_hf"),
        pres_com_u5=lambda x: fallback_diff(x, "maltreat_com_u5", "conf_com_u5"),
        pres_com_5_14=lambda x: fallback_diff(x, "maltreat_com_5_14", "conf_com_5_14"),
        pres_com_ov15=lambda x: fallback_diff(x, "maltreat_com_ov15", "conf_com_ov15"),
        pres_hf_u5=lambda x: fallback_diff(x, "maltreat_hf_u5", "conf_hf_u5"),
        pres_hf_5_14=lambda x: fallback_diff(x, "maltreat_hf_5_14", "conf_hf_5_14"),
        pres_hf_ov15=lambda x: fallback_diff(x, "maltreat_hf_ov15", "conf_hf_ov15"),
    )
    .assign(
        # third pass: totals from presumed
        pres=lambda x: fallback_row_sum(x, ["pres_com", "pres_hf"]),
        pres_u5=lambda x: fallback_row_sum(x, ["pres_com_u5", "pres_hf_u5"]),
        pres_5_14=lambda x: fallback_row_sum(x, ["pres_com_5_14", "pres_hf_5_14"]),
        pres_ov15=lambda x: fallback_row_sum(x, ["pres_com_ov15", "pres_hf_ov15"]),
    )
)

# inspect results
(
    dhis2_df[
        dhis2_df["record_id"].isin(
            ["a80a6352", "aa2eb756", "c8c29999", "eb63621c", "27ee9eeb", "93b16cfe"]
        )
    ][
        [
            "allout",
            "allout_u5",
            "allout_ov5",
            "pres_hf_u5",
            "maltreat_hf_u5",
            "conf_hf_u5",
        ]
    ]
    .sort_values("allout")
    .head(6)
)

#### Step 6.2: Quality Control of Indicator Totals -----------------------------

# create mismatch flags for each indicator group
# helper function to count mismatches, ignoring NAs
def count_mismatch(total, components):
    """Count mismatches between total and sum of components, ignoring NAs."""
    calculated = components.sum(axis=1)
    # only compare where both total and calculated are not NA
    mask = total.notna() & calculated.notna()
    return (total[mask] != calculated[mask]).sum()

mismatch_summary = pd.DataFrame({
    # outpatient checks
    "allout_mismatch": [
        count_mismatch(
            dhis2_df["allout"],
            dhis2_df[["allout_u5", "allout_ov5"]]
        )
    ],

    # malaria admissions check
    "maladm_mismatch": [
        count_mismatch(
            dhis2_df["maladm"],
            dhis2_df[["maladm_u5", "maladm_5_14", "maladm_ov15"]]
        )
    ],

    # total tests check
    "test_mismatch": [
        count_mismatch(
            dhis2_df["test"],
            dhis2_df[["test_hf", "test_com"]]
        )
    ],

    # total confirmed check
    "conf_mismatch": [
        count_mismatch(
            dhis2_df["conf"],
            dhis2_df[["conf_hf", "conf_com"]]
        )
    ],

    # total treated check
    "maltreat_mismatch": [
        count_mismatch(
            dhis2_df["maltreat"],
            dhis2_df[["maltreat_hf", "maltreat_com"]]
        )
    ],

    # total presumed check
    "pres_mismatch": [
        count_mismatch(
            dhis2_df["pres"],
            dhis2_df[["pres_hf", "pres_com"]]
        )
    ],

    # malaria deaths check
    "maldth_mismatch": [
        count_mismatch(
            dhis2_df["maldth"],
            dhis2_df[[
                "maldth_1_59m", "maldth_u5", "maldth_5_9", "maldth_10_14",
                "maldth_5_14", "maldth_fem_ov15", "maldth_mal_ov15", "maldth_ov15"
            ]]
        )
    ],

    # tested cases by age group
    "test_u5_mismatch": [
        count_mismatch(
            dhis2_df["test_u5"],
            dhis2_df[["test_hf_u5", "test_com_u5"]]
        )
    ],
    "test_5_14_mismatch": [
        count_mismatch(
            dhis2_df["test_5_14"],
            dhis2_df[["test_hf_5_14", "test_com_5_14"]]
        )
    ],
    "test_ov15_mismatch": [
        count_mismatch(
            dhis2_df["test_ov15"],
            dhis2_df[["test_hf_ov15", "test_com_ov15"]]
        )
    ],

    # confirmed cases by age group
    "conf_u5_mismatch": [
        count_mismatch(
            dhis2_df["conf_u5"],
            dhis2_df[["conf_hf_u5", "conf_com_u5"]]
        )
    ],
    "conf_5_14_mismatch": [
        count_mismatch(
            dhis2_df["conf_5_14"],
            dhis2_df[["conf_hf_5_14", "conf_com_5_14"]]
        )
    ],
    "conf_ov15_mismatch": [
        count_mismatch(
            dhis2_df["conf_ov15"],
            dhis2_df[["conf_hf_ov15", "conf_com_ov15"]]
        )
    ],

    # presumed cases by age group
    "pres_u5_mismatch": [
        count_mismatch(
            dhis2_df["pres_u5"],
            dhis2_df[["pres_hf_u5", "pres_com_u5"]]
        )
    ],
    "pres_5_14_mismatch": [
        count_mismatch(
            dhis2_df["pres_5_14"],
            dhis2_df[["pres_hf_5_14", "pres_com_5_14"]]
        )
    ],
    "pres_ov15_mismatch": [
        count_mismatch(
            dhis2_df["pres_ov15"],
            dhis2_df[["pres_hf_ov15", "pres_com_ov15"]]
        )
    ],

    # suspected cases by age group
    "susp_u5_mismatch": [
        count_mismatch(
            dhis2_df["susp_u5"],
            dhis2_df[["susp_hf_u5", "susp_com_u5"]]
        )
    ],
    "susp_5_14_mismatch": [
        count_mismatch(
            dhis2_df["susp_5_14"],
            dhis2_df[["susp_hf_5_14", "susp_com_5_14"]]
        )
    ],
    "susp_ov15_mismatch": [
        count_mismatch(
            dhis2_df["susp_ov15"],
            dhis2_df[["susp_hf_ov15", "susp_com_ov15"]]
        )
    ],

    # treated cases by age group
    "maltreat_u5_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u5"],
            dhis2_df[["maltreat_hf_u5", "maltreat_com_u5"]]
        )
    ],
    "maltreat_5_14_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_5_14"],
            dhis2_df[["maltreat_hf_5_14", "maltreat_com_5_14"]]
        )
    ],
    "maltreat_ov15_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov15"],
            dhis2_df[["maltreat_hf_ov15", "maltreat_com_ov15"]]
        )
    ],

    # treatment timing checks
    "maltreat_u24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_u24_total"],
            dhis2_df[["maltreat_u24_hf", "maltreat_u24_com"]]
        )
    ],
    "maltreat_ov24_mismatch": [
        count_mismatch(
            dhis2_df["maltreat_ov24_total"],
            dhis2_df[["maltreat_ov24_hf", "maltreat_ov24_com"]]
        )
    ]
})

# pivot to long format for easier viewing
mismatch_summary = (
    mismatch_summary
    .melt(var_name="indicator", value_name="n_mismatches")
    # filter to show only indicators with mismatches
    .query("n_mismatches > 0")
)

mismatch_summary

#### Step 6.3: Export Rows with Incoherent Totals ------------------------------

# identify rows with incoherent totals
incoherent_rows = dhis2_df[
    # outpatient checks
    (dhis2_df["allout"] != (dhis2_df["allout_u5"] + dhis2_df["allout_ov5"])) |
    # malaria admissions check
    (dhis2_df["maladm"] != (dhis2_df["maladm_u5"] + dhis2_df["maladm_5_14"] + dhis2_df["maladm_ov15"])) |
    # total tests check
    (dhis2_df["test"] != (dhis2_df["test_hf"] + dhis2_df["test_com"])) |
    # total confirmed check
    (dhis2_df["conf"] != (dhis2_df["conf_hf"] + dhis2_df["conf_com"])) |
    # total treated check
    (dhis2_df["maltreat"] != (dhis2_df["maltreat_hf"] + dhis2_df["maltreat_com"])) |
    # total presumed check
    (dhis2_df["pres"] != (dhis2_df["pres_hf"] + dhis2_df["pres_com"]))
]

# select relevant columns
incoherent_rows = incoherent_rows[
    ["hf", "periodname"] +
    [col for col in incoherent_rows.columns
     if any(x in col for x in ["allout", "maladm", "test", "conf", "maltreat", "pres"])]
]

# set path for saving
output_file = Path(
    "1.1.2_epidemiology",
    "1.1.2a_routine_surveillance",
    "processed",
    "sle_incoherent_totals_dhis2.xlsx"
)

# export to xlsx
incoherent_rows.to_excel(output_file, index=False)

#### Step 6.3: Add IPD/OPD Specification ---------------------------------------

# option 1: classify based on facility name patterns
def classify_facility(hf_name):
    """Classify facility as IPD or OPD based on name patterns."""
    hf_lower = hf_name.lower() if pd.notna(hf_name) else ""

    # inpatient facilities (hospitals)
    if any(x in hf_lower for x in ["hospital", "hosp"]):
        return "IPD"
    # outpatient facilities (clinics, health posts, community health)
    elif any(x in hf_lower for x in ["chp", "chc", "mchp", "clinic", "health post", "health centre"]):
        return "OPD"
    # default to OPD
    else:
        return "OPD"

dhis2_df["facility_type"] = dhis2_df["hf"].apply(classify_facility)

# option 2: classify based on presence of inpatient indicators
has_ipd = (
    dhis2_df
    .groupby("hf_uid")
    .apply(
        lambda x: ((x["maladm"].notna() & (x["maladm"] > 0)).any() |
                   (x["maldth"].notna() & (x["maldth"] > 0)).any())
    )
    .reset_index(name="has_ipd")
)

dhis2_df = dhis2_df.merge(has_ipd, on="hf_uid", how="left")
dhis2_df["facility_type"] = dhis2_df["has_ipd"].map({True: "IPD", False: "OPD"})
dhis2_df = dhis2_df.drop(columns="has_ipd")

# option 3: use a reference lookup file
facility_lookup = pd.read_excel(
    Path("path/to/facility_classification.xlsx")
)

dhis2_df = dhis2_df.merge(
    facility_lookup[["hf_uid", "facility_type"]],
    on="hf_uid",
    how="left"
)

# check classification distribution
dhis2_df[["hf_uid", "facility_type"]].drop_duplicates()["facility_type"].value_counts()

### Step 7: Finalize Data ------------------------------------------------------

#### Step 7.1: Troubleshoot Duplicate HF-Month Records with Different Data -----

# identify duplicate hf-month combinations
duplicates = dhis2_df.groupby(["hf_uid", "yearmon"]).filter(lambda x: len(x) > 1)

# count number of duplicate pairs
n_duplicates = duplicates[["hf_uid", "yearmon"]].drop_duplicates().shape[0]

# if duplicates exist, inspect them
if n_duplicates > 0:
    # view duplicate records with key indicators
    duplicate_details = duplicates[
        ["record_id", "allout", "test", "conf", "maltreat"]
    ].sort_values(["hf_uid", "yearmon"])

    print(duplicate_details)

    # export for review with SNT team
    duplicate_details.to_excel(
        Path(
            "1.1.2_epidemiology",
            "1.1.2a_routine_surveillance",
            "processed",
            "sle_duplicate_records_dhis2.xlsx",
        ),
        index=False,
    )

# option 1: keep first record (if duplicates are exact copies)
dhis2_df = dhis2_df.drop_duplicates(subset=["hf_uid", "yearmon"], keep="first")

# option 2: keep record with most complete data
dhis2_df = (
    dhis2_df.assign(
        n_complete=lambda x: x.select_dtypes(include="number").notna().sum(axis=1)
    )
    .sort_values("n_complete", ascending=False)
    .drop_duplicates(subset=["record_id"], keep="first")
    .drop(columns="n_complete")
)

# option 3: sum duplicates (if records represent partial reports)
group_cols = ["hf_uid", "yearmon", "hf", "adm0", "adm1", "adm2", "adm3"]
numeric_cols = dhis2_df.select_dtypes(include="number").columns.tolist()

dhis2_df = dhis2_df.groupby(group_cols, as_index=False)[numeric_cols].sum()

# verify duplicates resolved
n_remaining = dhis2_df.groupby(["record_id"]).filter(lambda x: len(x) > 1).shape[0]

#### Step 7.2: Generate Final Data Dictionary ----------------------------------

# define descriptions for computed/derived columns
computed_vars = pd.DataFrame([
    # time columns
    {"snt_var": "date", "indicator_label": "Report date (YYYY-MM-DD)"},
    {"snt_var": "year", "indicator_label": "Report year"},
    {"snt_var": "month", "indicator_label": "Report month (1-12)"},
    {"snt_var": "yearmon", "indicator_label": "Year-month label (e.g., Jan 2020)"},
    # identifier columns
    {"snt_var": "hf_uid", "indicator_label": "Unique health facility identifier (hash)"},
    {"snt_var": "record_id", "indicator_label": "Unique record identifier (facility + month hash)"},
    {"snt_var": "location_short", "indicator_label": "Location label: adm1 ~ adm2"},
    {"snt_var": "location_full", "indicator_label": "Location label: adm1 ~ adm2 ~ adm3 ~ hf"},
    {"snt_var": "facility_type", "indicator_label": "Facility type (IPD/OPD)"},
    # aggregated totals
    {"snt_var": "allout", "indicator_label": "Total outpatient visits (allout_u5 + allout_ov5)"},
    {"snt_var": "susp", "indicator_label": "Total suspected cases (all ages, HF + COM)"},
    {"snt_var": "test", "indicator_label": "Total tested (test_hf + test_com)"},
    {"snt_var": "test_hf", "indicator_label": "Total tested at health facility"},
    {"snt_var": "test_com", "indicator_label": "Total tested in community"},
    {"snt_var": "conf", "indicator_label": "Total confirmed cases (conf_hf + conf_com)"},
    {"snt_var": "conf_hf", "indicator_label": "Total confirmed cases at health facility"},
    {"snt_var": "conf_com", "indicator_label": "Total confirmed cases in community"},
    {"snt_var": "maltreat", "indicator_label": "Total treated cases (maltreat_hf + maltreat_com)"},
    {"snt_var": "maltreat_hf", "indicator_label": "Total treated cases at health facility"},
    {"snt_var": "maltreat_com", "indicator_label": "Total treated cases in community"},
    {"snt_var": "pres", "indicator_label": "Total presumed cases (pres_hf + pres_com)"},
    {"snt_var": "pres_hf", "indicator_label": "Total presumed cases at health facility"},
    {"snt_var": "pres_com", "indicator_label": "Total presumed cases in community"},
    {"snt_var": "maladm", "indicator_label": "Total malaria admissions (all ages)"},
    {"snt_var": "maldth", "indicator_label": "Total malaria deaths (all ages)"},
    # age-group totals
    {"snt_var": "test_u5", "indicator_label": "Total tested under 5 (HF + COM)"},
    {"snt_var": "test_5_14", "indicator_label": "Total tested 5-14 years (HF + COM)"},
    {"snt_var": "test_ov15", "indicator_label": "Total tested over 15 years (HF + COM)"},
    {"snt_var": "conf_u5", "indicator_label": "Confirmed cases under 5 (HF + COM)"},
    {"snt_var": "conf_5_14", "indicator_label": "Confirmed cases 5-14 years (HF + COM)"},
    {"snt_var": "conf_ov15", "indicator_label": "Confirmed cases over 15 years (HF + COM)"},
    {"snt_var": "maltreat_u5", "indicator_label": "Treated cases under 5 (HF + COM)"},
    {"snt_var": "maltreat_5_14", "indicator_label": "Treated cases 5-14 years (HF + COM)"},
    {"snt_var": "maltreat_ov15", "indicator_label": "Treated cases over 15 years (HF + COM)"},
    {"snt_var": "pres_u5", "indicator_label": "Presumed cases under 5 (HF + COM)"},
    {"snt_var": "pres_5_14", "indicator_label": "Presumed cases 5-14 years (HF + COM)"},
    {"snt_var": "pres_ov15", "indicator_label": "Presumed cases over 15 years (HF + COM)"},
    # add remaining age-group variables as needed...
])

# combine original and computed dictionaries
full_data_dict = (
    pd.concat([data_dict, computed_vars], ignore_index=True)
    .rename(columns={"snt_var": "snt_variable", "indicator_label": "label"})
    [["snt_variable", "label"]]
)

# keep only columns present in final dataset
full_data_dict = (
    full_data_dict[full_data_dict["snt_variable"].isin(dhis2_df.columns)]
    .sort_values("snt_variable")
)

# check
full_data_dict.head(10)

#### Step 7.3: Arrange Final Column Order --------------------------------------

# define column order
id_cols = ["record_id"]
location_cols = ["adm0", "adm1", "adm2", "adm3", "hf", "hf_uid",
                 "location_short", "location_full", "facility_type"]
time_cols = ["date", "yearmon", "year", "month"]

# get remaining columns in original order
ordered_cols = id_cols + location_cols + time_cols
remaining_cols = [col for col in dhis2_df.columns if col not in ordered_cols]

# reorder dataframe
dhis2_df = dhis2_df[ordered_cols + remaining_cols]

# check column order
print(dhis2_df.columns[:20].tolist())

### Step 8: Aggregate and Save Data --------------------------------------------

#### Step 8.1: Save Data at the Health Facility Level --------------------------

save_path = Path(
    here("01_data/02_epidemiology/2a_routine_surveillance/processed")
)

# save data to xlsx
dhis2_df.to_excel(
    save_path / "sle_dhis2_health_facility_data.xlsx",
    index=False
)

# save dictionary to xlsx
full_data_dict.to_excel(
    save_path / "sle_dhis2_health_facility_dict.xlsx",
    index=False
)

# save to parquet
dhis2_df.to_parquet(
    save_path / "sle_dhis2_health_facility_data.parquet",
    compression="zstd",
    index=False
)

#### Step 8.2: Aggregate and Save Data at Each Admin Unit Level ----------------

# define numeric columns to sum (excluding HF-level identifiers)
sum_cols = [
    # totals
    "allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
    # by location
    "test_hf", "test_com", "conf_hf", "conf_com",
    "maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
    # by age group - totals
    "allout_u5", "allout_ov5",
    "test_u5", "test_5_14", "test_ov15",
    "conf_u5", "conf_5_14", "conf_ov15",
    "maltreat_u5", "maltreat_5_14", "maltreat_ov15",
    "pres_u5", "pres_5_14", "pres_ov15",
    "susp_u5", "susp_5_14", "susp_ov15",
    "maladm_u5", "maladm_5_14", "maladm_ov15",
    "maldth_u5", "maldth_5_14", "maldth_ov15",
    # by age and location
    "test_hf_u5", "test_hf_5_14", "test_hf_ov15",
    "test_com_u5", "test_com_5_14", "test_com_ov15",
    "conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
    "conf_com_u5", "conf_com_5_14", "conf_com_ov15",
    "maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
    "maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
    "pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
    "pres_com_u5", "pres_com_5_14", "pres_com_ov15",
    # treatment timing
    "maltreat_u24_hf", "maltreat_ov24_hf",
    "maltreat_u24_com", "maltreat_ov24_com",
    "maltreat_u24_total", "maltreat_ov24_total"
]

# function to aggregate at a given admin level
def aggregate_admin(df, group_cols, sum_cols):
    # filter to columns that exist in dataframe
    existing_sum_cols = [c for c in sum_cols if c in df.columns]

    # aggregate
    agg_df = (
        df
        .groupby(group_cols, as_index=False)
        .agg(
            **{col: (col, "sum") for col in existing_sum_cols},
            n_facilities=("hf_uid", "count")
        )
    )

    return agg_df

# aggregate at adm3 level
group_cols_adm3 = ["adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon"]
dhis2_adm3 = aggregate_admin(dhis2_df, group_cols_adm3, sum_cols)
dhis2_adm3["record_id"] = vdigest(
    dhis2_adm3["adm0"] + " " + dhis2_adm3["adm1"] + " " +
    dhis2_adm3["adm2"] + " " + dhis2_adm3["adm3"] + " " +
    dhis2_adm3["yearmon"].astype(str)
)
dhis2_adm3["location_short"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"]
dhis2_adm3["location_full"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"] + " ~ " + dhis2_adm3["adm3"]

# aggregate at adm2 level
group_cols_adm2 = ["adm0", "adm1", "adm2", "year", "month", "yearmon"]
dhis2_adm2 = aggregate_admin(dhis2_df, group_cols_adm2, sum_cols)
dhis2_adm2["record_id"] = vdigest(
    dhis2_adm2["adm0"] + " " + dhis2_adm2["adm1"] + " " +
    dhis2_adm2["adm2"] + " " + dhis2_adm2["yearmon"].astype(str)
)
dhis2_adm2["location_short"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
dhis2_adm2["location_full"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]

# aggregate at adm1 level
group_cols_adm1 = ["adm0", "adm1", "year", "month", "yearmon"]
dhis2_adm1 = aggregate_admin(dhis2_df, group_cols_adm1, sum_cols)
dhis2_adm1["record_id"] = vdigest(
    dhis2_adm1["adm0"] + " " + dhis2_adm1["adm1"] + " " +
    dhis2_adm1["yearmon"].astype(str)
)
dhis2_adm1["location_short"] = dhis2_adm1["adm1"]
dhis2_adm1["location_full"] = dhis2_adm1["adm1"]

# create data dictionaries for each level (filter to relevant columns only)
id_cols = ["n_facilities", "record_id", "location_short", "location_full"]

# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_cols = group_cols_adm3 + [c for c in sum_cols if c in dhis2_adm3.columns] + id_cols
adm3_dict = full_data_dict[full_data_dict["snt_variable"].isin(adm3_cols)]

# adm2 dict: excludes adm3 (not present at this level)
adm2_cols = group_cols_adm2 + [c for c in sum_cols if c in dhis2_adm2.columns] + id_cols
adm2_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm2_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm3"]))
]

# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_cols = group_cols_adm1 + [c for c in sum_cols if c in dhis2_adm1.columns] + id_cols
adm1_dict = full_data_dict[
    (full_data_dict["snt_variable"].isin(adm1_cols)) &
    (~full_data_dict["snt_variable"].isin(["adm2", "adm3"]))
]

# save adm3 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm3_data.xlsx") as writer:
    dhis2_adm3.to_excel(writer, sheet_name="data", index=False)
    adm3_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm3.to_parquet(save_path / "sle_dhis2_adm3_data.parquet", compression="zstd", index=False)

# save adm2 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm2_data.xlsx") as writer:
    dhis2_adm2.to_excel(writer, sheet_name="data", index=False)
    adm2_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm2.to_parquet(save_path / "sle_dhis2_adm2_data.parquet", compression="zstd", index=False)

# save adm1 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm1_data.xlsx") as writer:
    dhis2_adm1.to_excel(writer, sheet_name="data", index=False)
    adm1_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm1.to_parquet(save_path / "sle_dhis2_adm1_data.parquet", compression="zstd", index=False)
 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved