• English
  • Français
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. DHIS2 data preprocessing
  • Code library for subnational tailoring
    English version
  • 1. Getting Started
    • 1.1 About and Contact Information
    • 1.2 For Everyone
    • 1.3 For the SNT Team
    • 1.4 For Analysts
    • 1.5 Producing High-Quality Outputs
  • 2. Data Assembly and Management
    • 2.1 Working with Shapefiles
      • Spatial data overview
      • Basic shapefile use and visualization
      • Shapefile management and customization
      • Merging shapefiles with tabular data
    • 2.2 Health Facilities Data
      • Fuzzy matching of names across datasets
      • Health facility coordinates and point data
      • Determining active and inactive status
    • 2.3 Routine Surveillance Data
      • Routine data extraction
      • DHIS2 data preprocessing
      • Assessing missing data
      • Health facility reporting rate
      • Data coherency checks
      • Outlier detection methods
      • Imputing missing data and correcting outliers
      • Final database
    • 2.4 Stock Data
      • LMIS
    • 2.5 Population Data
      • National population data
      • WorldPop population raster
    • 2.6 National Household Survey Data
      • DHS data overview and preparation
      • Prevalence of malaria infection
      • All-cause child mortality
      • Treatment-seeking rates
      • ITN ownership, access, and usage
    • 2.7 Entomological Data
      • Entomological data
    • 2.8 Climate and Environmental Data
      • Climate and environment data extraction from raster
    • 2.9 Modeled Data
      • Generating spatial modeled estimates
      • Working with geospatial model estimates
      • Modeled estimates of malaria mortality and proxies
      • Modeled estimates of entomological indicators
  • 3. Stratification
    • 3.1 Epidemiological Stratification
      • Incidence overview and crude incidence
      • Incidence adjustment 1: incomplete testing
      • Incidence adjustment 2: incomplete reporting
      • Incidence adjustment 3: treatment-seeking
      • Incidence stratification
      • Prevalence and mortality stratification
      • Combined risk categorization
    • 3.2 Stratification of Determinants of Malaria Transmission
      • Seasonality
      • Access to care
  • 4. Review of Past Interventions
    • 4.1 Case Management
    • 4.2 Routine Interventions
    • 4.3 Campaign Interventions
    • 4.4 Other Interventions
  • 5. Targeting of Interventions
  • 6. Retrospective Analysis
  • 7. Urban Microstratification

On this page

  • Overview
  • Key considerations
    • Troubleshooting
    • Best practices when working with routine data
    • What a clean and well-structured dataset should look like
  • Step-by-Step
    • Step 1: Import packages and data
      • Step 1.1: Import packages
      • Step 1.2: Import data
    • Step 2: First pass troubleshooting
      • Step 2.1: Basic cleaning
      • Step 2.2: (If needed) Troubleshoot initial import (maybe multiple steps here)
    • Step 3: Rename columns
      • Step 3.1: Assign renaming rules
      • Step 3.2: Apply renaming
    • Step 4: (If needed) Compute variables
      • Step 4.1: Compute indicator totals and new variables
      • Step 4.2: (If needed) Quality control of indicator totals
      • Step 4.3: indicator totals coherency visualization
      • Step 4.4: Export rows with incoherent totals
    • Step 5: Create a unique year-month column
      • Step 5.1: Example data date format
      • Step 5.2: Handling alternative DHIS2 date formats
      • Step 5.3: Unclear data format
    • Step 6: (If needed) Create a unique ID
      • Step 6.1: Create unique IDs for admin units
      • Step 6.2: Create unique IDs for health facilities
    • Step 7: Second pass troubleshooting
      • Step 7.1: Highlight duplicate HF-month records with different data
      • Step 7.2: ETC OTHER PROBLEMS
    • Step 8: (If needed) Prepare optional ancillary datasets and versions
      • Step 8.1: Create and save your admin unit hierarchy
      • Step 8.2: Fill in missing HF-month reports
      • Step 8.3: IPD/OPD Specification
    • Step 9: Aggregate and save data
      • Step 9.1: Save data at the health facility level
      • Step 9.2: Aggregate and save data at each admin unit level level
  • Summary
  • Full code
  1. 2. Data Assembly and Management
  2. 2.3 Routine Surveillance Data
  3. DHIS2 data preprocessing

DHIS2 data preprocessing

Overview

DHIS2 is a core data source for routine malaria surveillance, capturing indicators such as outpatient case counts, diagnostic testing, and treatment provision. However, raw DHIS2 exports may not be ready for SNT analysis if they have structural or quality issues, are messy, include inconsistent or unreadable column names, missing values, mismatched facility names, duplicate records, and formats that vary across months or locations. To make this data usable for SNT, it needs to be cleaned, reshaped, and aligned with a spatial and temporal structure that supports later steps of analysis: identifying active and inactive health facilities, calculating reporting rates, managing outliers, calculating malaria incidence, etc.

In this section, we walk through how to preprocess monthly DHIS2 malaria records to make them SNT-ready, using example data from Sierra Leone. This includes cleaning and standardising facility-level data, aggregating to administrative levels where needed, to produce structured datasets.

The DHIS2 data you are working with may require some or all of the specific steps detailed below, or additional steps may be required to prepare the data for analysis.

Objectives
  • Import and/or combine DHIS2 data files
  • Select and rename key variables
  • Compute new variables at the facility level (e.g. presumed cases)
  • Set up any UIDs needed
  • Troubleshoot potential issues along the way
  • Aggregate data monthly and yearly at the selected administrative level
  • Export cleaned and aggregated datasets

Key considerations

Troubleshooting

Routine surveillance data can be trickier to work with than other data sources that are more standardized. Expect to encounter and resolve data issues throughout the DHIS2 data preprocessing process. You will likely need to:

  • Investigate unexpected patterns in your data
  • Make judgment calls about how to handle inconsistencies
  • Collaborate with your SNT team on country-specific issues
  • Develop custom solutions for problems not covered here

The troubleshooting approaches shown in this guide provide a starting framework, but real-world data cleaning often requires iterative problem-solving and local expertise. While this guide addresses challenges others have encountered when preparing DHIS2 data for analysis, every country is different. Your specific context may present unique problems that are not yet covered in the code library. We encourage you to be persistent and creative if needed and to consult others for help if you are unable to resolve on your own.

If you encounter a new challenge when preprocessing DHIS2 data and would like new content to be added to the code library, please fill out our feedback form.

Best practices when working with routine data

1. Understand the System’s Context

  • Reporting workflows: Know how data moves from clinics to DHIS2 (paper vs. direct entry, aggregation rules).
  • Indicator definitions: Confirm case defintions and inclusions

2. Triangulate Strategically

Cross-check with:

  • External benchmarks (e.g., HMIS reports, survey data)
  • Parallel systems (e.g., stock management data for testing kits)
  • Temporal patterns (compare dry vs. rainy season trends)

3. Document Your Process

Maintain a troubleshooting log tracking:

  • Issues encountered
  • Data sources consulted
  • Decisions made (with rationale)

Timing lags: Facilities may report late, especially after holidays or system outages.

What a clean and well-structured dataset should look like

Your data might come in different format to Sierra Leone example below. Whether you are able to use the chunk of code provided, or parts of it, to import your data, please make sure your data, after import, contains the following information:

  • A column with the date (year and month) of the report
  • A column with the health facility name
  • Various columns with parent admin units - at mimimum you will need the name of the admin unit for SNT analysis (for example, chiefdom for Sierra Leone)
  • Your data indicators columns

Below is an example of what the data should look like after import:

VT: Do we want to select specific indicator columns to avoid showing too many NA values?

  adm0 adm1 adm2 adm3 orgunitlevel5 hf date month year allout_u5 allout_ov5 maladm_u5 maladm_5_14 maladm_ov15 maldth_u5 maldth_5_14 maldth_ov15 susp_u5_hf susp_5_14_hf susp_ov15_hf susp_u5_com susp_5_14_com susp_ov15_com maldth_fem_ov15 maldth_mal_ov15 maldth_1_59m maldth_10_14 maldth_5_9 tes_neg_rdt_u5_com tes_pos_rdt_u5_com tes_neg_rdt_5_14_com tes_pos_rdt_5_14_com tes_neg_rdt_ov15_com tes_pos_rdt_ov15_com test_neg_mic_u5_hf test_pos_mic_u5_hf test_neg_mic_5_14_hf test_pos_mic_5_14_hf test_neg_mic_ov15_hf test_pos_mic_ov15_hf tes_neg_rdt_u5_hf tes_pos_rdt_u5_hf tes_neg_rdt_5_14_hf tes_pos_rdt_5_14_hf tes_neg_rdt_ov15_hf tes_pos_rdt_ov15_hf maltreat_u24_u5_com maltreat_ov24_u5_com maltreat_u24_5_14_com maltreat_ov24_5_14_com maltreat_u24_ov15_com maltreat_ov24_ov15_com maltreat_u24_u5_hf maltreat_ov24_u5_hf maltreat_u24_5_14_hf maltreat_ov24_5_14_hf maltreat_u24_ov15_hf maltreat_ov24_ov15_hf allout susp test_hf test_com test conf_hf conf_com conf maltreat_com maltreat_hf maltreat pres_com pres_hf pres maladm maldth test_hf_u5 test_hf_5_14 test_hf_ov15 test_com_u5 test_com_5_14 test_com_ov15 test_u5 test_5_14 test_ov15 susp_hf_u5 susp_hf_5_14 susp_hf_ov15 susp_com_u5 susp_com_5_14 susp_com_ov15 susp_u5 susp_5_14 susp_ov15 conf_hf_u5 conf_hf_5_14 conf_hf_ov15 conf_com_u5 conf_com_5_14 conf_com_ov15 conf_u5 conf_5_14 conf_ov15 maltreat_hf_u5 maltreat_hf_5_14 maltreat_hf_ov15 maltreat_com_u5 maltreat_com_5_14 maltreat_com_ov15 maltreat_u5 maltreat_5_14 maltreat_ov15 maltreat_u24_hf maltreat_ov24_hf maltreat_u24_com maltreat_ov24_com maltreat_u24_total maltreat_ov24_total pres_com_u5 pres_com_5_14 pres_com_ov15 pres_hf_u5 pres_hf_5_14 pres_hf_ov15 pres_u5 pres_5_14 pres_ov15 hf_uid
0 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-01 01 2021 44 48 nan nan nan nan nan nan 45 30 45 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 9 36 8 22 9 36 nan nan nan nan nan nan 6 30 nan 20 4 31 92 120 120 nan 120 94 nan 94 nan 91 91 nan 0 0 nan nan 45 30 45 nan nan nan 45 30 45 45 30 45 nan nan nan 45 30 45 36 22 36 nan nan nan 36 22 36 36 20 35 nan nan nan 36 20 35 10 81 nan nan 10 81 nan nan nan 0 0 0 0 0 0 hf_0001
1 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-02 02 2021 37 27 nan nan nan nan nan nan 28 40 24 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 8 20 10 30 8 16 nan nan nan nan nan nan 3 17 25 5 nan 16 64 92 92 nan 92 66 nan 66 nan 66 66 nan 0 0 nan nan 28 40 24 nan nan nan 28 40 24 28 40 24 nan nan nan 28 40 24 20 30 16 nan nan nan 20 30 16 20 30 16 nan nan nan 20 30 16 28 38 nan nan 28 38 nan nan nan 0 0 0 0 0 0 hf_0001
2 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-03 03 2021 44 42 nan nan nan nan nan nan 21 32 201 nan nan nan nan nan nan nan nan nan nan nan nan nan nan 10 nan 15 nan 100 nan nan 18 nan 22 nan 186 nan nan nan nan nan nan 8 nan 7 nan 86 nan 86 254 351 nan 351 226 nan 226 nan 101 101 nan 0 0 nan nan 28 37 286 nan nan nan 28 37 286 21 32 201 nan nan nan 21 32 201 18 22 186 nan nan nan 18 22 186 8 7 86 nan nan nan 8 7 86 101 nan nan nan 101 nan nan nan nan 0 0 0 0 0 0 hf_0001
3 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-04 04 2021 28 23 nan nan nan nan nan nan 37 40 42 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 7 30 10 30 12 30 nan nan nan nan nan nan 10 20 30 nan 20 10 51 119 119 nan 119 90 nan 90 nan 90 90 nan 0 0 nan nan 37 40 42 nan nan nan 37 40 42 37 40 42 nan nan nan 37 40 42 30 30 30 nan nan nan 30 30 30 30 30 30 nan nan nan 30 30 30 60 30 nan nan 60 30 nan nan nan 0 0 0 0 0 0 hf_0001
4 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-05 05 2021 138 141 nan nan nan nan nan nan 100 63 36 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 28 72 23 40 1 35 nan nan nan nan nan nan 10 35 15 25 10 25 279 199 199 nan 199 147 nan 147 nan 120 120 nan 0 0 nan nan 100 63 36 nan nan nan 100 63 36 100 63 36 nan nan nan 100 63 36 72 40 35 nan nan nan 72 40 35 45 40 35 nan nan nan 45 40 35 35 85 nan nan 35 85 nan nan nan 0 0 0 0 0 0 hf_0001
5 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-06 06 2021 127 82 nan nan nan nan nan nan 91 19 75 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 32 59 3 16 24 51 nan nan nan nan nan nan nan 36 nan 16 nan 34 209 185 185 nan 185 126 nan 126 nan 86 86 nan 0 0 nan nan 91 19 75 nan nan nan 91 19 75 91 19 75 nan nan nan 91 19 75 59 16 51 nan nan nan 59 16 51 36 16 34 nan nan nan 36 16 34 nan 86 nan nan nan 86 nan nan nan 0 0 0 0 0 0 hf_0001
6 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-07 07 2021 130 175 nan nan nan nan nan nan 114 32 118 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 28 86 8 24 22 96 nan nan nan nan nan nan nan 86 nan 23 nan 96 305 264 264 nan 264 206 nan 206 nan 205 205 nan 0 0 nan nan 114 32 118 nan nan nan 114 32 118 114 32 118 nan nan nan 114 32 118 86 24 96 nan nan nan 86 24 96 86 23 96 nan nan nan 86 23 96 nan 205 nan nan nan 205 nan nan nan 0 0 0 0 0 0 hf_0001
7 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-08 08 2021 144 203 nan nan nan nan nan nan 125 22 129 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 32 93 5 17 31 98 nan nan nan nan nan nan 9 84 1 16 18 80 347 276 276 nan 276 208 nan 208 nan 208 208 nan 0 0 nan nan 125 22 129 nan nan nan 125 22 129 125 22 129 nan nan nan 125 22 129 93 17 98 nan nan nan 93 17 98 93 17 98 nan nan nan 93 17 98 28 180 nan nan 28 180 nan nan nan 0 0 0 0 0 0 hf_0001
8 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-09 09 2021 159 159 nan nan nan nan nan nan 135 28 102 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 57 78 10 18 32 70 nan nan nan nan nan nan nan 78 nan 18 nan 70 318 265 265 nan 265 166 nan 166 nan 166 166 nan 0 0 nan nan 135 28 102 nan nan nan 135 28 102 135 28 102 nan nan nan 135 28 102 78 18 70 nan nan nan 78 18 70 78 18 70 nan nan nan 78 18 70 nan 166 nan nan nan 166 nan nan nan 0 0 0 0 0 0 hf_0001
9 Sierra Leone Bo District Bo City Council Bo City Aethel CHP Aethel CHP 2021-10 10 2021 88 128 nan nan nan nan nan nan 9 92 100 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 3 6 12 80 27 73 nan nan nan nan nan nan nan 6 31 49 13 60 216 201 201 nan 201 159 nan 159 nan 159 159 nan 0 0 nan nan 9 92 100 nan nan nan 9 92 100 9 92 100 nan nan nan 9 92 100 6 80 73 nan nan nan 6 80 73 6 80 73 nan nan nan 6 80 73 44 115 nan nan 44 115 nan nan nan 0 0 0 0 0 0 hf_0001

Step-by-Step

In this section, we walk through the key preprocessing steps needed to get DHIS2 malaria data ready for SNT analysis, using example DHIS2 data from Sierra Leone. The focus here is on cleaning, reshaping, and aggregating facility-level data to produce structured outputs at both the facility and administrative levels.

Each step is designed to guide you through the process cleanly. Make sure to follow the notes in the code, especially where edits are required (e.g. column names or admin levels). The goal is to make the data analysis-ready without breaking linkages to spatial units or reporting structures.

Remember that you will need to adapt this process to the specific country context in which you are working.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Import packages and data

We’ll use Sierra Leone DHIS2 malaria data as a working example. Variable names or database structures may differ across countries. Be sure to review and update the variable names to match your country’s DHIS2 configuration.

We begin by installing and loading the required packages for our preprocessing, then import the data.

If you are using python and need help installing packages, please refer to the Getting Started section.

Step 1.1: Import packages

  • R
  • Python
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# install or load relevant packages
pacman::p_load(
  tidyverse,  # core tidy tools (dplyr, tidyr, etc.)
  rio,        # import and export and file format
  DT,         # view interactive tables
  here        # for easy file referencing
)
import pandas as pd
from pathlib import Path
from pyhere import here

Step 1.2: Import data

The code chunk below defines a helper function to import different types of data, for example Excel or csv data. If your DHIS2 extraction resulted in multiple files, this function will merge them together into a single dataset. Then we use it to read in our data file(s).

When calling the DHIS2 data import function, we read files from a folder using the here package. The here package builds paths from the project root, keeping code portable and reliable.

Inspect Your Data Before Proceeding

Before running any code, take a moment to visually inspect your data. This is especially important when working with DHIS2 data or other routine health sources. As explained in the Data Structures of the Getting Started: For Analysts page, your data should follow tidy data principles: each column represents a variable, each row an observation, and each cell contains a single value.

To check this, open the Excel or CSV file directly and review:

  • Headers: Are the column names in the first row? Are they clear and consistent?
  • Columns and rows:
    • Each column should represent a variable (e.g. health_facility, month, confirmed_cases).
    • Each row should represent a single observation (e.g. one facility-month).
  • Cells: Each cell should contain a single value (not merged or grouped). Avoid cells that contain multiple values (e.g. 10/5 for tested/positive).
  • Extra rows: Remove any top rows used for titles, notes, or merged headers.

If your dataset is already tidy, you’re ready to move on. If not, try to clean it up before working with it. You can do this:

  • In Excel, by editing a copy of the file and removing any extra formatting, then saving your new copy to import and work with in later steps.

  • Or in R, using packages like:

    • rio::import(file, skip = X) to skip X number of extra header rows
    • tidyr::pivot_longer() to reshape wide data into long format
    • dplyr::mutate() and separate() to split combined values into separate columns
  • Or in Python, using:

    • pandas.read_excel(file, skiprows=X) to skip X number of unwanted header rows
    • pandas.melt() to reshape wide data into long format
    • pandas.Series.str.split() and pandas.assign() to split and create new columns

For more complex formatting issues, like multi-row or merged headers, see this excellent step-by-step guide: Tidying Multi-Header Excel Data with R by Paul Campbell (2019).

Getting this right early will save time and reduce confusion later in the workflow, as all the subsequent steps assume that the data is already in tidy format and ready for analysis.

  • R
  • Python
Show the code
#' Reads and combines multiple Excel or CSV files from a directory into a single
#' dataframe.
#'
#' This function searches for files with specified extension in the given
#' directory and combines them into a single dataframe. It handles both Excel
#' (.xls, .xlsx) and CSV files.
#'
#' @param directory Path to directory containing files
#' @param pattern File extension (default: "xls"). Options: "xls", "xlsx",
#'   "csv" and and other formats available in rio package.
#' @return A dataframe combining all files with trimmed column names. Returns
#'    NULL if no files found or if an error occurs during reading/combining.
#' @examples
#' # Read Excel files
#' df1 <- read_and_combine_files("path/to/excel", "xlsx")
#' # Read CSV files
#' df2 <- read_and_combine_files("path/to/csv", "csv")
#'
read_and_combine_files <- function(
  directory,
  pattern = "xls"
) {
  # directory <- here::here(directory)
  pattern <- paste0("\\.", pattern, "$")
  # Find files matching the pattern
  file_paths <- list.files(
    path = directory,
    pattern = pattern,
    full.names = TRUE
  )

  # Stop if no files found
  if (length(file_paths) == 0) {
    stop("No files found with specified extension in the directory.")
  }

  # Read all files based on extension
  data_files <- lapply(file_paths, function(file) {
    tryCatch(
      {
        rio::import(file)
      },
      error = function(e) {
        NULL
      }
    )
  })

  # Combine all data frames
  combined_df <- dplyr::bind_rows(data_files) |>
    dplyr::as_tibble()

  # Trim whitespace from column names
  names(combined_df) <- trimws(names(combined_df))

  return(combined_df)
}

# read in data files
dhis2_df <- read_and_combine_files(
  directory = here::here(
    "1.1.2_epidemiology",
    "1.1.2a_routine_surveillance",
    "raw"
  ),
  pattern = "xls"
)

head(dhis2_df)
Output
# A tibble: 6 × 10
  orgunitlevel1 orgunitlevel2 orgunitlevel3   orgunitlevel4 orgunitlevel5
  <chr>         <chr>         <chr>           <chr>         <chr>        
1 Sierra Leone  Bo District   Bo City Council Bo City       Aethel CHP   
2 Sierra Leone  Bo District   Bo City Council Bo City       Aethel CHP   
3 Sierra Leone  Bo District   Bo City Council Bo City       Aethel CHP   
4 Sierra Leone  Bo District   Bo City Council Bo City       Aethel CHP   
5 Sierra Leone  Bo District   Bo City Council Bo City       Aethel CHP   
6 Sierra Leone  Bo District   Bo City Council Bo City       Aethel CHP   
# ℹ 5 more variables: organisationunitname <chr>, periodname <chr>,
#   `OPD (New and follow-up curative) 0-59m_X` <dbl>,
#   `OPD (New and follow-up curative) 5+y_X` <dbl>,
#   `Admission - Child with malaria 0-59 months_X` <dbl>

To adapt the code:

  • Line 58: Change directory = "1.1.2_epidemiology..." to your current working directory
  • Line 63: Modify the file extension pattern according to your file type:
    • For .xls files: pattern = "xls"
    • For .xlsx files: pattern = "xlsx
    • For .csv files: pattern = "csv"
Show the code
# set up directory
directory = Path(here("english/data_r/routine_cases/dhis2_data"))
extension = 'xls'

# find all files with specified extension in directory
files = list(directory.glob(f'*.{extension}'))

# initialise dataframe
dhis2_df = pd.DataFrame()

# iterate over files, concatenate into one dataframe
for file in files:
    temp = pd.read_excel(file, sheet_name = 'Sheet1')
    dhis2_df = pd.concat([dhis2_df, temp])

# Inspect results
dhis2_df.head(10).style
Output
  orgunitlevel1 orgunitlevel2 orgunitlevel3 orgunitlevel4 orgunitlevel5 organisationunitname periodname OPD (New and follow-up curative) 0-59m_X OPD (New and follow-up curative) 5+y_X Admission - Child with malaria 0-59 months_X Admission - Child with malaria 5-14 years_X Admission - Malaria 15+ years_X Separation - Child with malaria 0-59 months_X Death Separation - Child with malaria 5-14 years_X Death Separation - Malaria 15+ years_X Death Fever case - suspected Malaria 0-59m_X Fever case - suspected Malaria 5-14y_X Fever case - suspected Malaria 15+y_X Fever case in community (Suspected Malaria) 0-59m_X Fever case in community (Suspected Malaria) 5-14y_X Fever case in community (Suspected Malaria) 15+y_X Death malaria 15+ years Female Death malaria 15+ years Male Child death - Malaria 1-59m_X Child death - Malaria 10-14y_X Child death - Malaria 5-9y_X Fever case in community tested for Malaria (RDT) - Negative 0-59m_X Fever case in community tested for Malaria (RDT) - Positive 0-59m_X Fever case in community tested for Malaria (RDT) - Negative 5-14y_X Fever case in community tested for Malaria (RDT) - Positive 5-14y_X Fever case in community tested for Malaria (RDT) - Negative 15+y_X Fever case in community tested for Malaria (RDT) - Positive 15+y_X Fever case tested for Malaria (Microscopy) - Negative 0-59m_X Fever case tested for Malaria (Microscopy) - Positive 0-59m_X Fever case tested for Malaria (Microscopy) - Negative 5-14y_X Fever case tested for Malaria (Microscopy) - Positive 5-14y_X Fever case tested for Malaria (Microscopy) - Negative 15+y_X Fever case tested for Malaria (Microscopy) - Positive 15+y_X Fever case tested for Malaria (RDT) - Negative 0-59m_X Fever case tested for Malaria (RDT) - Positive 0-59m_X Fever case tested for Malaria (RDT) - Negative 5-14y_X Fever case tested for Malaria (RDT) - Positive 5-14y_X Fever case tested for Malaria (RDT) - Negative 15+y_X Fever case tested for Malaria (RDT) - Positive 15+y_X Malaria treated in community with ACT <24 hours 0-59m_X Malaria treated in community with ACT >24 hours 0-59m_X Malaria treated in community with ACT <24 hours 5-14y_X Malaria treated in community with ACT >24 hours 5-14y_X Malaria treated in community with ACT <24 hours 15+y_X Malaria treated in community with ACT >24 hours 15+y_X Malaria treated with ACT <24 hours 0-59m_X Malaria treated with ACT >24 hours 0-59m_X Malaria treated with ACT <24 hours 5-14y_X Malaria treated with ACT >24 hours 5-14y_X Malaria treated with ACT <24 hours 15+y_X Malaria treated with ACT >24 hours 15+y_X
0 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP January 2015 44.000000 43.000000 nan nan nan nan nan nan 21.000000 4.000000 4.000000 10.000000 7.000000 4.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 21.000000 nan 4.000000 nan 4.000000 15.000000 6.000000 5.000000 2.000000 3.000000 1.000000 18.000000 3.000000 4.000000 nan 2.000000 nan
1 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP February 2015 49.000000 34.000000 nan nan nan nan nan nan 37.000000 3.000000 4.000000 10.000000 7.000000 4.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 10.000000 27.000000 nan 3.000000 nan 4.000000 15.000000 6.000000 5.000000 2.000000 3.000000 1.000000 17.000000 4.000000 2.000000 nan 2.000000 1.000000
2 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP March 2015 102.000000 83.000000 nan nan nan nan nan nan 46.000000 5.000000 14.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 36.000000 nan 3.000000 nan 8.000000 nan nan nan nan nan nan 36.000000 nan 3.000000 nan 5.000000 nan
3 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP April 2015 68.000000 68.000000 nan nan nan nan nan nan 44.000000 10.000000 10.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 9.000000 35.000000 4.000000 6.000000 5.000000 5.000000 nan nan nan nan nan nan 27.000000 8.000000 4.000000 2.000000 3.000000 2.000000
4 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP May 2015 118.000000 76.000000 nan nan nan nan nan nan 105.000000 15.000000 22.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 20.000000 85.000000 4.000000 11.000000 5.000000 17.000000 nan nan nan nan nan nan 67.000000 18.000000 7.000000 4.000000 10.000000 7.000000
5 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP June 2015 145.000000 89.000000 nan nan nan nan nan nan 110.000000 10.000000 21.000000 10.000000 11.000000 9.000000 nan nan nan nan nan 2.000000 8.000000 1.000000 10.000000 1.000000 8.000000 nan nan nan nan nan nan 24.000000 86.000000 6.000000 4.000000 10.000000 11.000000 13.000000 4.000000 9.000000 1.000000 6.000000 1.000000 70.000000 16.000000 3.000000 1.000000 6.000000 5.000000
6 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP July 2015 95.000000 94.000000 nan nan nan nan nan nan 67.000000 31.000000 26.000000 10.000000 13.000000 9.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 17.000000 50.000000 9.000000 22.000000 8.000000 18.000000 20.000000 nan 10.000000 nan 8.000000 nan 43.000000 6.000000 13.000000 5.000000 10.000000 5.000000
7 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP August 2015 86.000000 63.000000 nan nan nan nan nan nan 72.000000 6.000000 15.000000 15.000000 17.000000 14.000000 nan nan nan nan nan 2.000000 13.000000 3.000000 14.000000 2.000000 12.000000 nan nan nan nan nan nan 8.000000 64.000000 nan 4.000000 nan 15.000000 20.000000 4.000000 11.000000 3.000000 6.000000 6.000000 52.000000 12.000000 3.000000 1.000000 11.000000 4.000000
8 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP September 2015 117.000000 94.000000 nan nan nan nan nan nan 66.000000 15.000000 24.000000 16.000000 19.000000 16.000000 nan nan nan nan nan 2.000000 14.000000 3.000000 16.000000 2.000000 14.000000 nan nan nan nan nan nan 16.000000 50.000000 9.000000 6.000000 2.000000 22.000000 22.000000 4.000000 13.000000 3.000000 6.000000 3.000000 42.000000 8.000000 3.000000 3.000000 5.000000 7.000000
9 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP Bai Largo MCHP October 2015 97.000000 68.000000 nan nan nan nan nan nan 84.000000 5.000000 11.000000 16.000000 5.000000 10.000000 nan nan nan nan nan 5.000000 11.000000 4.000000 1.000000 4.000000 6.000000 nan nan nan nan nan nan 11.000000 73.000000 2.000000 3.000000 3.000000 8.000000 23.000000 3.000000 1.000000 nan 4.000000 nan 66.000000 7.000000 2.000000 1.000000 5.000000 3.000000

To adapt the code:

  • Line 2: Change directory = "1.1.2_epidemiology..." to your current working directory
  • Line 3: Modify the file extension pattern according to your file type:
    • For .xls files: pattern = "xls"
    • For .xlsx files: pattern = "xlsx
    • For .csv files: pattern = "csv"

Step 2: First pass troubleshooting

Step 2.1: Basic cleaning

Basic cleaning should be performed based on initial inspection of the imported data. Cleaning at this stage covers best practices of preprocessing.

HERE I WANT TO PUT EXTREMELY UNCONTROVERTIAL THINGS THAT IT WOULDN’T BE NEEDED TO CONSULT SNT TEAM ON, LIKE IN VAL’S EXAMPLE 2 ITEMS. WE COULD EVEN STASH THE YEAR-MONTH THING HERE.

Notes: include a diagnostic where you print number of NA per variable. some variables (admin unit name, periodname for ex) really ought not be NA and if they appear, they ought to be investigated with the DHIS2 expert / SNT team focal person for routine data

  • R
  • Python
# Remove whitespace from column names
dhis2_df <- dhis2_df |>
  dplyr::rename_with(~ stringr::str_trim(.x))

# Remove duplicate rows and reset index
dhis2_df <- dhis2_df |>
  dplyr::distinct() |>
  dplyr::mutate(index = dplyr::row_number(), .before = 1) |>
  dplyr::select(-index)

#  NA count by column
na_counts <- base::colSums(base::is.na(dhis2_df))
na_counts <- na_counts[na_counts > 0]  # Show only variables with NAs

if (base::length(na_counts) > 0) {
  message("Variables with missing values:")
  print(na_counts)

  # Check critical variables
  critical_vars <- c("adm0", "adm1", "adm2", "hf", "periodname")
  critical_na <- na_counts[base::names(na_counts) %in% critical_vars]

  if (base::length(critical_na) > 0) {
    message("\n❌ CRITICAL ISSUE: Missing values in:")
    print(critical_na)
    message("Contact DHIS2 expert/SNT team immediately")
  }
} else {
  message("✅ No missing values found in any variables")
}
Output
                           OPD (New and follow-up curative) 0-59m_X 
                                                              51839 
                             OPD (New and follow-up curative) 5+y_X 
                                                              52113 
                       Admission - Child with malaria 0-59 months_X 
                                                             181613 
                        Admission - Child with malaria 5-14 years_X 
                                                             181766 
                                    Admission - Malaria 15+ years_X 
                                                             181681 
                Separation - Child with malaria 0-59 months_X Death 
                                                             182482 
                 Separation - Child with malaria 5-14 years_X Death 
                                                             183804 
                             Separation - Malaria 15+ years_X Death 
                                                             183442 
                             Fever case - suspected Malaria 0-59m_X 
                                                              47744 
                             Fever case - suspected Malaria 5-14y_X 
                                                              50442 
                              Fever case - suspected Malaria 15+y_X 
                                                              48364 
                Fever case in community (Suspected Malaria) 0-59m_X 
                                                             117905 
                Fever case in community (Suspected Malaria) 5-14y_X 
                                                             138029 
                 Fever case in community (Suspected Malaria) 15+y_X 
                                                             141873 
                                     Death malaria 15+ years Female 
                                                             183620 
                                       Death malaria 15+ years Male 
                                                             183520 
                                      Child death - Malaria 1-59m_X 
                                                             181450 
                                     Child death - Malaria 10-14y_X 
                                                             183901 
                                       Child death - Malaria 5-9y_X 
                                                             183358 
Fever case in community tested for Malaria (RDT) - Negative 0-59m_X 
                                                             141254 
Fever case in community tested for Malaria (RDT) - Positive 0-59m_X 
                                                             127075 
Fever case in community tested for Malaria (RDT) - Negative 5-14y_X 
                                                             155147 
Fever case in community tested for Malaria (RDT) - Positive 5-14y_X 
                                                             142447 
 Fever case in community tested for Malaria (RDT) - Negative 15+y_X 
                                                             156455 
 Fever case in community tested for Malaria (RDT) - Positive 15+y_X 
                                                             145204 
      Fever case tested for Malaria (Microscopy) - Negative 0-59m_X 
                                                             181768 
      Fever case tested for Malaria (Microscopy) - Positive 0-59m_X 
                                                             181815 
      Fever case tested for Malaria (Microscopy) - Negative 5-14y_X 
                                                             181938 
      Fever case tested for Malaria (Microscopy) - Positive 5-14y_X 
                                                             181753 
       Fever case tested for Malaria (Microscopy) - Negative 15+y_X 
                                                             181501 
       Fever case tested for Malaria (Microscopy) - Positive 15+y_X 
                                                             181365 
             Fever case tested for Malaria (RDT) - Negative 0-59m_X 
                                                              53711 
             Fever case tested for Malaria (RDT) - Positive 0-59m_X 
                                                              50162 
             Fever case tested for Malaria (RDT) - Negative 5-14y_X 
                                                              80925 
             Fever case tested for Malaria (RDT) - Positive 5-14y_X 
                                                              53992 
              Fever case tested for Malaria (RDT) - Negative 15+y_X 
                                                              58479 
              Fever case tested for Malaria (RDT) - Positive 15+y_X 
                                                              51629 
            Malaria treated in community with ACT <24 hours 0-59m_X 
                                                             128887 
            Malaria treated in community with ACT >24 hours 0-59m_X 
                                                             149402 
            Malaria treated in community with ACT <24 hours 5-14y_X 
                                                             145310 
            Malaria treated in community with ACT >24 hours 5-14y_X 
                                                             161293 
             Malaria treated in community with ACT <24 hours 15+y_X 
                                                             148234 
             Malaria treated in community with ACT >24 hours 15+y_X 
                                                             162307 
                         Malaria treated with ACT <24 hours 0-59m_X 
                                                              61097 
                         Malaria treated with ACT >24 hours 0-59m_X 
                                                              89784 
                         Malaria treated with ACT <24 hours 5-14y_X 
                                                              72240 
                         Malaria treated with ACT >24 hours 5-14y_X 
                                                             101811 
                          Malaria treated with ACT <24 hours 15+y_X 
                                                              68481 
                          Malaria treated with ACT >24 hours 15+y_X 
                                                              94301 
Show the code
# remove whitespace from column names
dhis2_df.columns = dhis2_df.columns.str.strip()
dhis2_df = dhis2_df.reset_index(drop=True)

# remove duplicate rows
dhis2_df = dhis2_df.drop_duplicates().reset_index(drop = True)
Output

To adapt the code:

Step 2.2: (If needed) Troubleshoot initial import (maybe multiple steps here)

Check for problems and handle stuff like:

  • Inconsistent columns across files
  • Missing admin unit names
  • Name matching of HFs across multiple extracts

Step 3: Rename columns

Consult SNT team

Review variable definitions for your country to better understand and appropriately analyze the data. Consider investigating questions like:

  • Are admissions included in outpatients?
  • Are data on pregnant women included in the data for adults?
  • Is there double counting between RDT and microscopy results? If yes, is it appropriate to only use RDT results?
  • Are data from the private sector included here? If so, what percentage of the private sector reports into DHIS2?
  • Are data from commmunity health workers included in their assigned health facility data or are data separate?
  • Have any variables been included or adapted throughout the years? If so, how should they be treated throughout the time series?

Always confirm variable definitions with the SNT team.

Step 3.1: Assign renaming rules

The rename_columns() function (for R) or renaming dictionaries (for python) standardises the column names in the Sierra Leone DHIS2 malaria dataset. If you’re working with your own country’s data, you only need to adjust the column names to match your DHIS2 data dictionary. For example, if your dataset uses “Admission of child with malaria - 5-14yrs” instead of “Admission - Child with malaria 5-14 years_X”, simply update the relevant names in rename_columns() (for R), d_cols_admin and/or d_cols_data (for python) accordingly.

Warning

The renaming here is based on variables from the Sierra Leone DHIS2. You must add, edit, or remove variables from the renaming function (R) or dictionaries (python) based on your country-specific data and needs.

  • R
  • Python
Show the code
#' Rename columns in a dataframe according to a predefined mapping
#'#' @param df A dataframe containing columns to be renamed
#'
#' @return
#' Returns the dataframe with renamed columns. Returns NULL if an error occurs.

#' @examples
#' df <- data.frame(adm0 = 1:3, allout_u5 = 4:6)
#' renamed_df <- rename_columns(df)
rename_columns <- function(df) {
  # rename columns
  rename_cols <- c(

    # Organisational units
    "adm0" = "orgunitlevel1",
    "adm1" = "orgunitlevel2",
    "adm2" = "orgunitlevel3",
    "adm3" = "orgunitlevel4",
    "hf" = "organisationunitname",

    # All outpatient visits
    "allout_u5" = "OPD (New and follow-up curative) 0-59m_X",
    "allout_ov5" = "OPD (New and follow-up curative) 5+y_X",

    # Malaria admissions
    "maladm_u5" = "Admission - Child with malaria 0-59 months_X",
    "maladm_5_14" = "Admission - Child with malaria 5-14 years_X",
    "maladm_ov15" = "Admission - Malaria 15+ years_X",

    # Malaria deaths
    "maldth_1_59m" = "Child death - Malaria 1-59m_X",
    "maldth_10_14" = "Child death - Malaria 10-14y_X",
    "maldth_5_9" = "Child death - Malaria 5-9y_X",
    "maldth_fem_ov15" = "Death malaria 15+ years Female",
    "maldth_mal_ov15" = "Death malaria 15+ years Male",
    "maldth_u5" = "Separation - Child with malaria 0-59 months_X Death",
    "maldth_5_14" = "Separation - Child with malaria 5-14 years_X Death",
    "maldth_ov15" = "Separation - Malaria 15+ years_X Death",

    # Suspected cases
    "susp_u5_hf" = "Fever case - suspected Malaria 0-59m_X",
    "susp_5_14_hf" = "Fever case - suspected Malaria 5-14y_X",
    "susp_ov15_hf" = "Fever case - suspected Malaria 15+y_X",
    "susp_u5_com" = "Fever case in community (Suspected Malaria) 0-59m_X",
    "susp_5_14_com" = "Fever case in community (Suspected Malaria) 5-14y_X",
    "susp_ov15_com" = "Fever case in community (Suspected Malaria) 15+y_X",

    # Malaria tests conducted at community level
    "tes_neg_rdt_u5_com" =
      "Fever case in community tested for Malaria (RDT) - Negative 0-59m_X",
    "tes_pos_rdt_u5_com" =
      "Fever case in community tested for Malaria (RDT) - Positive 0-59m_X",
    "tes_neg_rdt_5_14_com" =
      "Fever case in community tested for Malaria (RDT) - Negative 5-14y_X",
    "tes_pos_rdt_5_14_com" =
      "Fever case in community tested for Malaria (RDT) - Positive 5-14y_X",
    "tes_neg_rdt_ov15_com" =
      "Fever case in community tested for Malaria (RDT) - Negative 15+y_X",
    "tes_pos_rdt_ov15_com" =
      "Fever case in community tested for Malaria (RDT) - Positive 15+y_X",

    # Malaria tests conducted at health facility level
    "test_neg_mic_u5_hf" =
      "Fever case tested for Malaria (Microscopy) - Negative 0-59m_X",
    "test_pos_mic_u5_hf" =
      "Fever case tested for Malaria (Microscopy) - Positive 0-59m_X",
    "test_neg_mic_5_14_hf" =
      "Fever case tested for Malaria (Microscopy) - Negative 5-14y_X",
    "test_pos_mic_5_14_hf" =
      "Fever case tested for Malaria (Microscopy) - Positive 5-14y_X",
    "test_neg_mic_ov15_hf" =
      "Fever case tested for Malaria (Microscopy) - Negative 15+y_X",
    "test_pos_mic_ov15_hf" =
      "Fever case tested for Malaria (Microscopy) - Positive 15+y_X",
    "tes_neg_rdt_u5_hf" =
      "Fever case tested for Malaria (RDT) - Negative 0-59m_X",
    "tes_pos_rdt_u5_hf" =
      "Fever case tested for Malaria (RDT) - Positive 0-59m_X",
    "tes_neg_rdt_5_14_hf" =
      "Fever case tested for Malaria (RDT) - Negative 5-14y_X",
    "tes_pos_rdt_5_14_hf" =
      "Fever case tested for Malaria (RDT) - Positive 5-14y_X",
    "tes_neg_rdt_ov15_hf" =
      "Fever case tested for Malaria (RDT) - Negative 15+y_X",
    "tes_pos_rdt_ov15_hf" =
      "Fever case tested for Malaria (RDT) - Positive 15+y_X",

    # Malaria tests conducted at community level
    "maltreat_u24_u5_com" =
      "Malaria treated in community with ACT <24 hours 0-59m_X",
    "maltreat_ov24_u5_com" =
      "Malaria treated in community with ACT >24 hours 0-59m_X",
    "maltreat_u24_5_14_com" =
      "Malaria treated in community with ACT <24 hours 5-14y_X",
    "maltreat_ov24_5_14_com" =
      "Malaria treated in community with ACT >24 hours 5-14y_X",
    "maltreat_u24_ov15_com" =
      "Malaria treated in community with ACT <24 hours 15+y_X",
    "maltreat_ov24_ov15_com" =
      "Malaria treated in community with ACT >24 hours 15+y_X",

    # Malaria tests conducted at community level
    "maltreat_u24_u5_hf" =
      "Malaria treated with ACT <24 hours 0-59m_X",
    "maltreat_ov24_u5_hf" =
      "Malaria treated with ACT >24 hours 0-59m_X",
    "maltreat_u24_5_14_hf" =
      "Malaria treated with ACT <24 hours 5-14y_X",
    "maltreat_ov24_5_14_hf" =
      "Malaria treated with ACT >24 hours 5-14y_X",
    "maltreat_u24_ov15_hf" =
      "Malaria treated with ACT <24 hours 15+y_X",
    "maltreat_ov24_ov15_hf" =
      "Malaria treated with ACT >24 hours 15+y_X"
  )

  df <- df |>
    dplyr::rename(!!!rename_cols)

  return(df)
}
Output

To adapt the code:

  • Lines 15–114: Modify the column names (add, edit, remove) to match those specific to your country’s data dictionary and its SNT needs
  • Once updated, run the code
Show the code
# Here we define standardised column names, in relation to the names we have in the dataset. We will then use this dictionary to rename columns.
d_cols_admin = {
    # Organisational units
    'adm0' : 'orgunitlevel1',
    'adm1' : 'orgunitlevel2',
    'adm2' : 'orgunitlevel3',
    'adm3' : 'orgunitlevel4',
    'hf' : 'organisationunitname',
    'YM': 'periodname'}

d_cols_data = {
    # All outpatient visits
    'allout_u5' : 'OPD (New and follow-up curative) 0-59m_X',
    'allout_ov5' : 'OPD (New and follow-up curative) 5+y_X',

    # Malaria admissions
    'maladm_u5' : 'Admission - Child with malaria 0-59 months_X',
    'maladm_5_14' : 'Admission - Child with malaria 5-14 years_X',
    'maladm_ov15' : 'Admission - Malaria 15+ years_X',

    # Malaria deaths
    'maldth_1_59m' : 'Child death - Malaria 1-59m_X',
    'maldth_10_14' : 'Child death - Malaria 10-14y_X',
    'maldth_5_9' : 'Child death - Malaria 5-9y_X',
    'maldth_fem_ov15' : 'Death malaria 15+ years Female',
    'maldth_mal_ov15' : 'Death malaria 15+ years Male',
    'maldth_u5' : 'Separation - Child with malaria 0-59 months_X Death',
    'maldth_5_14' : 'Separation - Child with malaria 5-14 years_X Death',
    'maldth_ov15' : 'Separation - Malaria 15+ years_X Death',

    # Suspected cases
    'susp_hf_u5' : 'Fever case - suspected Malaria 0-59m_X',
    'susp_hf_5_14' : 'Fever case - suspected Malaria 5-14y_X',
    'susp_hf_ov15' : 'Fever case - suspected Malaria 15+y_X',
    'susp_com_u5' : 'Fever case in community (Suspected Malaria) 0-59m_X',
    'susp_com_5_14' : 'Fever case in community (Suspected Malaria) 5-14y_X',
    'susp_com_ov15' : 'Fever case in community (Suspected Malaria) 15+y_X',

    # Malaria tests conducted at community level
    'test_neg_rdt_u5_com' : 'Fever case in community tested for Malaria (RDT) - Negative 0-59m_X',
    'test_pos_rdt_u5_com' : 'Fever case in community tested for Malaria (RDT) - Positive 0-59m_X',
    'test_neg_rdt_5_14_com' : 'Fever case in community tested for Malaria (RDT) - Negative 5-14y_X',
    'test_pos_rdt_5_14_com' : 'Fever case in community tested for Malaria (RDT) - Positive 5-14y_X',
    'test_neg_rdt_ov15_com' : 'Fever case in community tested for Malaria (RDT) - Negative 15+y_X',
    'test_pos_rdt_ov15_com' : 'Fever case in community tested for Malaria (RDT) - Positive 15+y_X',

    # Malaria tests conducted at health facility level
    'test_neg_mic_u5_hf' : 'Fever case tested for Malaria (Microscopy) - Negative 0-59m_X',
    'test_pos_mic_u5_hf' : 'Fever case tested for Malaria (Microscopy) - Positive 0-59m_X',
    'test_neg_mic_5_14_hf' : 'Fever case tested for Malaria (Microscopy) - Negative 5-14y_X',
    'test_pos_mic_5_14_hf' : 'Fever case tested for Malaria (Microscopy) - Positive 5-14y_X',
    'test_neg_mic_ov15_hf' : 'Fever case tested for Malaria (Microscopy) - Negative 15+y_X',
    'test_pos_mic_ov15_hf' : 'Fever case tested for Malaria (Microscopy) - Positive 15+y_X',
    'test_neg_rdt_u5_hf' : 'Fever case tested for Malaria (RDT) - Negative 0-59m_X',
    'test_pos_rdt_u5_hf' : 'Fever case tested for Malaria (RDT) - Positive 0-59m_X',
    'test_neg_rdt_5_14_hf' : 'Fever case tested for Malaria (RDT) - Negative 5-14y_X',
    'test_pos_rdt_5_14_hf' : 'Fever case tested for Malaria (RDT) - Positive 5-14y_X',
    'test_neg_rdt_ov15_hf' : 'Fever case tested for Malaria (RDT) - Negative 15+y_X',
    'test_pos_rdt_ov15_hf' : 'Fever case tested for Malaria (RDT) - Positive 15+y_X',

    # Malaria tests conducted at community level
    'maltreat_u24_u5_com' : 'Malaria treated in community with ACT <24 hours 0-59m_X',
    'maltreat_ov24_u5_com' : 'Malaria treated in community with ACT >24 hours 0-59m_X',
    'maltreat_u24_5_14_com' : 'Malaria treated in community with ACT <24 hours 5-14y_X',
    'maltreat_ov24_5_14_com' : 'Malaria treated in community with ACT >24 hours 5-14y_X',
    'maltreat_u24_ov15_com' : 'Malaria treated in community with ACT <24 hours 15+y_X',
    'maltreat_ov24_ov15_com' : 'Malaria treated in community with ACT >24 hours 15+y_X',

    # Malaria tests conducted at community level
    'maltreat_u24_u5_hf' : 'Malaria treated with ACT <24 hours 0-59m_X',
    'maltreat_ov24_u5_hf' : 'Malaria treated with ACT >24 hours 0-59m_X',
    'maltreat_u24_5_14_hf' : 'Malaria treated with ACT <24 hours 5-14y_X',
    'maltreat_ov24_5_14_hf' : 'Malaria treated with ACT >24 hours 5-14y_X',
    'maltreat_u24_ov15_hf' : 'Malaria treated with ACT <24 hours 15+y_X',
    'maltreat_ov24_ov15_hf' : 'Malaria treated with ACT >24 hours 15+y_X'
}
Output

To adapt the code:

  • Lines XXX: Modify the column names (add, edit, remove) to match those specific to your country’s data dictionary and its SNT needs
  • Once updated, run the code

Step 3.2: Apply renaming

Now that the renaming rules are defined, we rename columns.

  • R
  • Python
# clean columns using helper function
dhis2_df <- rename_columns(dhis2_df) # clean names

# print new column names to check that the renaming of columns worked
colnames(dhis2_df) |> print()
Output
 [1] "adm0"                   "adm1"                   "adm2"                  
 [4] "adm3"                   "orgunitlevel5"          "hf"                    
 [7] "periodname"             "allout_u5"              "allout_ov5"            
[10] "maladm_u5"              "maladm_5_14"            "maladm_ov15"           
[13] "maldth_u5"              "maldth_5_14"            "maldth_ov15"           
[16] "susp_u5_hf"             "susp_5_14_hf"           "susp_ov15_hf"          
[19] "susp_u5_com"            "susp_5_14_com"          "susp_ov15_com"         
[22] "maldth_fem_ov15"        "maldth_mal_ov15"        "maldth_1_59m"          
[25] "maldth_10_14"           "maldth_5_9"             "tes_neg_rdt_u5_com"    
[28] "tes_pos_rdt_u5_com"     "tes_neg_rdt_5_14_com"   "tes_pos_rdt_5_14_com"  
[31] "tes_neg_rdt_ov15_com"   "tes_pos_rdt_ov15_com"   "test_neg_mic_u5_hf"    
[34] "test_pos_mic_u5_hf"     "test_neg_mic_5_14_hf"   "test_pos_mic_5_14_hf"  
[37] "test_neg_mic_ov15_hf"   "test_pos_mic_ov15_hf"   "tes_neg_rdt_u5_hf"     
[40] "tes_pos_rdt_u5_hf"      "tes_neg_rdt_5_14_hf"    "tes_pos_rdt_5_14_hf"   
[43] "tes_neg_rdt_ov15_hf"    "tes_pos_rdt_ov15_hf"    "maltreat_u24_u5_com"   
[46] "maltreat_ov24_u5_com"   "maltreat_u24_5_14_com"  "maltreat_ov24_5_14_com"
[49] "maltreat_u24_ov15_com"  "maltreat_ov24_ov15_com" "maltreat_u24_u5_hf"    
[52] "maltreat_ov24_u5_hf"    "maltreat_u24_5_14_hf"   "maltreat_ov24_5_14_hf" 
[55] "maltreat_u24_ov15_hf"   "maltreat_ov24_ov15_hf" 

To adapt the code:

  • Line 16: Change "english/data_r" to your current working directory
  • Once updated, run the code
Show the code
# rename columns
for d in [d_cols_admin, d_cols_data]:
  temp = pd.DataFrame(d.items(), columns = ['new_name', 'old_name'])
  dhis2_df = dhis2_df.rename(columns = dict(zip(temp['old_name'], temp['new_name'])))

# remove unecessary columns
cols_to_remove = ['orgunitlevel5']
dhis2_df = dhis2_df.drop(cols_to_remove, axis = 1)

# Inspect results
dhis2_df.head(10).style
  adm0 adm1 adm2 adm3 hf YM allout_u5 allout_ov5 maladm_u5 maladm_5_14 maladm_ov15 maldth_u5 maldth_5_14 maldth_ov15 susp_hf_u5 susp_hf_5_14 susp_hf_ov15 susp_com_u5 susp_com_5_14 susp_com_ov15 maldth_fem_ov15 maldth_mal_ov15 maldth_1_59m maldth_10_14 maldth_5_9 test_neg_rdt_u5_com test_pos_rdt_u5_com test_neg_rdt_5_14_com test_pos_rdt_5_14_com test_neg_rdt_ov15_com test_pos_rdt_ov15_com test_neg_mic_u5_hf test_pos_mic_u5_hf test_neg_mic_5_14_hf test_pos_mic_5_14_hf test_neg_mic_ov15_hf test_pos_mic_ov15_hf test_neg_rdt_u5_hf test_pos_rdt_u5_hf test_neg_rdt_5_14_hf test_pos_rdt_5_14_hf test_neg_rdt_ov15_hf test_pos_rdt_ov15_hf maltreat_u24_u5_com maltreat_ov24_u5_com maltreat_u24_5_14_com maltreat_ov24_5_14_com maltreat_u24_ov15_com maltreat_ov24_ov15_com maltreat_u24_u5_hf maltreat_ov24_u5_hf maltreat_u24_5_14_hf maltreat_ov24_5_14_hf maltreat_u24_ov15_hf maltreat_ov24_ov15_hf
0 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP January 2015 44.000000 43.000000 nan nan nan nan nan nan 21.000000 4.000000 4.000000 10.000000 7.000000 4.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 21.000000 nan 4.000000 nan 4.000000 15.000000 6.000000 5.000000 2.000000 3.000000 1.000000 18.000000 3.000000 4.000000 nan 2.000000 nan
1 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP February 2015 49.000000 34.000000 nan nan nan nan nan nan 37.000000 3.000000 4.000000 10.000000 7.000000 4.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 10.000000 27.000000 nan 3.000000 nan 4.000000 15.000000 6.000000 5.000000 2.000000 3.000000 1.000000 17.000000 4.000000 2.000000 nan 2.000000 1.000000
2 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP March 2015 102.000000 83.000000 nan nan nan nan nan nan 46.000000 5.000000 14.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 36.000000 nan 3.000000 nan 8.000000 nan nan nan nan nan nan 36.000000 nan 3.000000 nan 5.000000 nan
3 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP April 2015 68.000000 68.000000 nan nan nan nan nan nan 44.000000 10.000000 10.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 9.000000 35.000000 4.000000 6.000000 5.000000 5.000000 nan nan nan nan nan nan 27.000000 8.000000 4.000000 2.000000 3.000000 2.000000
4 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP May 2015 118.000000 76.000000 nan nan nan nan nan nan 105.000000 15.000000 22.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 20.000000 85.000000 4.000000 11.000000 5.000000 17.000000 nan nan nan nan nan nan 67.000000 18.000000 7.000000 4.000000 10.000000 7.000000
5 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP June 2015 145.000000 89.000000 nan nan nan nan nan nan 110.000000 10.000000 21.000000 10.000000 11.000000 9.000000 nan nan nan nan nan 2.000000 8.000000 1.000000 10.000000 1.000000 8.000000 nan nan nan nan nan nan 24.000000 86.000000 6.000000 4.000000 10.000000 11.000000 13.000000 4.000000 9.000000 1.000000 6.000000 1.000000 70.000000 16.000000 3.000000 1.000000 6.000000 5.000000
6 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP July 2015 95.000000 94.000000 nan nan nan nan nan nan 67.000000 31.000000 26.000000 10.000000 13.000000 9.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 17.000000 50.000000 9.000000 22.000000 8.000000 18.000000 20.000000 nan 10.000000 nan 8.000000 nan 43.000000 6.000000 13.000000 5.000000 10.000000 5.000000
7 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP August 2015 86.000000 63.000000 nan nan nan nan nan nan 72.000000 6.000000 15.000000 15.000000 17.000000 14.000000 nan nan nan nan nan 2.000000 13.000000 3.000000 14.000000 2.000000 12.000000 nan nan nan nan nan nan 8.000000 64.000000 nan 4.000000 nan 15.000000 20.000000 4.000000 11.000000 3.000000 6.000000 6.000000 52.000000 12.000000 3.000000 1.000000 11.000000 4.000000
8 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP September 2015 117.000000 94.000000 nan nan nan nan nan nan 66.000000 15.000000 24.000000 16.000000 19.000000 16.000000 nan nan nan nan nan 2.000000 14.000000 3.000000 16.000000 2.000000 14.000000 nan nan nan nan nan nan 16.000000 50.000000 9.000000 6.000000 2.000000 22.000000 22.000000 4.000000 13.000000 3.000000 6.000000 3.000000 42.000000 8.000000 3.000000 3.000000 5.000000 7.000000
9 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP October 2015 97.000000 68.000000 nan nan nan nan nan nan 84.000000 5.000000 11.000000 16.000000 5.000000 10.000000 nan nan nan nan nan 5.000000 11.000000 4.000000 1.000000 4.000000 6.000000 nan nan nan nan nan nan 11.000000 73.000000 2.000000 3.000000 3.000000 8.000000 23.000000 3.000000 1.000000 nan 4.000000 nan 66.000000 7.000000 2.000000 1.000000 5.000000 3.000000

Step 4: (If needed) Compute variables

Step 4.1: Compute indicator totals and new variables

Now that we have imported our DHIS2 data and cleaned up the columns, we will compute derived variables to create totals for specific indicators. DHIS2 can provide totals, but these columns must specifically be extracted. In the raw DHIS2 dataset, indicators are typically disaggregated by age group, community, and health facility level. Analysts should strive to obtain the most disaggregated databases possible.

For example, if we want to calculate the total number of outpatient visits, DHIS2 may not provide a direct total, but includes components that can be summed, such as allout_u5 and allout_ov5. Remember to review variable definitions to ensure your aggregations do not double count cases.

The same logic applies to other indicators, including total suspected cases (susp), tested cases (test), confirmed cases (conf), treated cases (maltreat), and presumed cases (pres). The code snippet below shows how to compute these totals by combining the relevant components. The code also includes indicator aggregation by age groups, such as test_hf_u5 which captures all children under five tested by either RDT or microscopy at a health facility.

Consult the SNT team

It is best to verify with the SNT team which specific data elements from DHIS2 should be summed to obtain the correct total. While some calculations may seem obvious, others are not. The box in Step 1.2 lists example questions that the analyst should get verification for before summing across disaggregated data elements. Depending on your country’s DHIS2 and data reporting practice, other questions may also be relevant.

Consult the SNT team

While some countries have presumed cases in their DHIS2 data, not all do! Here are three options for presumed cases that countries have used:

  • Option 1: Your data already has a pres column, no further calculation is needed.
  • Option 2: Calculate presumed cases using the difference between treated cases and confirmed cases: maltreat - conf
  • Option 3: Calculate presumed cases using the difference between suspected cases and tested cases: susp - test

The code below uses Option 2 (lines 52 to 55). Testing is subject to resource availability meaning facilities handle presumed cases differently.

Consult the SNT team to determine how best to approach presumed cases calculations, whether it’s one of the options shown above or a different approach.

  • R
  • Python

Below we create to new functions: fallback_row_sum() and fallback_diff(). The fallback_row_sum() function replaces rowSums(..., na.rm = TRUE), which returns 0 when all values in a row are NA. This default behavior can obscure missingness.

In routine health data, it’s critical to distinguish between:

  • True zeros: e.g. a facility reported 0 cases
  • Missing values: e.g. the facility did not report at all

fallback_row_sum() returns NA when too few values are present, preserving this distinction and preventing overestimation of completeness.

Similarly, fallback_diff() returns the absolute difference if both values are present, the non-missing value if only one is present, and NA if both are missing. It applies pmax() to ensure results meet a minimum threshold.

Both functions guard against misleading outputs when data are incomplete.

Show the code
# smart row-wise sum with missing data handling
fallback_row_sum <- function(..., min_present = 1, .keep_zero_as_zero = TRUE) {
  vars_matrix <- cbind(...)
  valid_count <- rowSums(!is.na(vars_matrix))
  raw_sum <- rowSums(vars_matrix, na.rm = TRUE)

  ifelse(valid_count >= min_present, raw_sum, NA_real_)
}

# fallback absolute difference between two vectors
fallback_diff <- function(col1, col2, minimum = 0) {
  dplyr::case_when(
    is.na(col1) & is.na(col2) ~ NA_real_,
    is.na(col1) ~ pmax(col2, minimum),
    is.na(col2) ~ pmax(col1, minimum),
    TRUE ~ pmax(col1 - col2, minimum)
  )
}

# compute indicator totals in DHIS2 data
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # outpatient visits
    allout = fallback_row_sum(allout_u5, allout_ov5),

    # suspected cases
    susp = fallback_row_sum(
      susp_u5_hf,
      susp_5_14_hf,
      susp_ov15_hf,
      susp_u5_com,
      susp_5_14_com,
      susp_ov15_com
    ),

    # tested cases
    test_hf = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    test_com = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com,
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com,
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    test = fallback_row_sum(test_hf, test_com),

    # confirmed cases (HF and COM)
    conf_hf = fallback_row_sum(
      test_pos_mic_u5_hf,
      test_pos_mic_5_14_hf,
      test_pos_mic_ov15_hf,
      tes_pos_rdt_u5_hf,
      tes_pos_rdt_5_14_hf,
      tes_pos_rdt_ov15_hf
    ),

    conf_com = fallback_row_sum(
      tes_pos_rdt_u5_com,
      tes_pos_rdt_5_14_com,
      tes_pos_rdt_ov15_com
    ),

    conf = fallback_row_sum(conf_hf, conf_com),

    # treated cases
    maltreat_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com,
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    maltreat_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf,
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    maltreat = fallback_row_sum(maltreat_hf, maltreat_com),

    # presumed cases
    pres_com = fallback_diff(maltreat_com, conf_com),
    pres_hf = fallback_diff(maltreat_hf, conf_hf),
    pres = fallback_row_sum(pres_com, pres_hf),

    # malaria admissions
    maladm = fallback_row_sum(
      maladm_u5,
      maladm_5_14,
      maladm_ov15
    ),

    # malaria deaths
    maldth = fallback_row_sum(
      maldth_u5,
      maldth_1_59m,
      maldth_10_14,
      maldth_5_9,
      maldth_5_14,
      maldth_ov15,
      maldth_fem_ov15,
      maldth_mal_ov15
    ),

    # AGE-GROUP SPECIFIC AGGREGATIONS
    # Tested cases by age group (HF only)
    test_hf_u5 = fallback_row_sum(
      test_neg_mic_u5_hf,
      test_pos_mic_u5_hf,
      tes_neg_rdt_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    test_hf_5_14 = fallback_row_sum(
      test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf,
      tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    test_hf_ov15 = fallback_row_sum(
      test_neg_mic_ov15_hf,
      test_pos_mic_ov15_hf,
      tes_neg_rdt_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # Tested cases by age group (Community only)
    test_com_u5 = fallback_row_sum(
      tes_neg_rdt_u5_com,
      tes_pos_rdt_u5_com
    ),

    test_com_5_14 = fallback_row_sum(
      tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com
    ),

    test_com_ov15 = fallback_row_sum(
      tes_neg_rdt_ov15_com,
      tes_pos_rdt_ov15_com
    ),

    # Total tested by age group (HF + Community)
    test_u5 = fallback_row_sum(test_hf_u5, test_com_u5),
    test_5_14 = fallback_row_sum(test_hf_5_14, test_com_5_14),
    test_ov15 = fallback_row_sum(test_hf_ov15, test_com_ov15),

    # Suspected cases by age group (HF only)
    susp_hf_u5 = susp_u5_hf,

    susp_hf_5_14 = susp_5_14_hf,

    susp_hf_ov15 = susp_ov15_hf,

    # Suspected cases by age group (Community only)
    susp_com_u5 = susp_u5_com,

    susp_com_5_14 = susp_5_14_com,

    susp_com_ov15 = susp_ov15_com,

    # Total suspected by age group (HF + Community)
    susp_u5 = fallback_row_sum(susp_hf_u5, susp_com_u5),
    susp_5_14 = fallback_row_sum(susp_hf_5_14, susp_com_5_14),
    susp_ov15 = fallback_row_sum(susp_hf_ov15, susp_com_ov15),

    # Confirmed cases by age group (HF only)
    conf_hf_u5 = fallback_row_sum(
      test_pos_mic_u5_hf,
      tes_pos_rdt_u5_hf
    ),

    conf_hf_5_14 = fallback_row_sum(
      test_pos_mic_5_14_hf,
      tes_pos_rdt_5_14_hf
    ),

    conf_hf_ov15 = fallback_row_sum(
      test_pos_mic_ov15_hf,
      tes_pos_rdt_ov15_hf
    ),

    # Confirmed cases by age group (Community only)
    conf_com_u5 = tes_pos_rdt_u5_com,
    conf_com_5_14 = tes_pos_rdt_5_14_com,
    conf_com_ov15 = tes_pos_rdt_ov15_com,

    # Total confirmed by age group (HF + Community)
    conf_u5 = fallback_row_sum(conf_hf_u5, conf_com_u5),
    conf_5_14 = fallback_row_sum(conf_hf_5_14, conf_com_5_14),
    conf_ov15 = fallback_row_sum(conf_hf_ov15, conf_com_ov15),

    # Treated cases by age group (HF only)
    maltreat_hf_u5 = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_ov24_u5_hf
    ),

    maltreat_hf_5_14 = fallback_row_sum(
      maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf
    ),

    maltreat_hf_ov15 = fallback_row_sum(
      maltreat_u24_ov15_hf,
      maltreat_ov24_ov15_hf
    ),

    # Treated cases by age group (Community only)
    maltreat_com_u5 = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_ov24_u5_com
    ),

    maltreat_com_5_14 = fallback_row_sum(
      maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com
    ),

    maltreat_com_ov15 = fallback_row_sum(
      maltreat_u24_ov15_com,
      maltreat_ov24_ov15_com
    ),

    # Total treated by age group (HF + Community)
    maltreat_u5 = fallback_row_sum(maltreat_hf_u5, maltreat_com_u5),
    maltreat_5_14 = fallback_row_sum(maltreat_hf_5_14, maltreat_com_5_14),
    maltreat_ov15 = fallback_row_sum(maltreat_hf_ov15, maltreat_com_ov15),

    # Total treated within 24 hours (HF only)
    maltreat_u24_hf = fallback_row_sum(
      maltreat_u24_u5_hf,
      maltreat_u24_5_14_hf,
      maltreat_u24_ov15_hf
    ),

    # Total treated after 24 hours (HF only)
    maltreat_ov24_hf = fallback_row_sum(
      maltreat_ov24_u5_hf,
      maltreat_ov24_5_14_hf,
      maltreat_ov24_ov15_hf
    ),

    # Total treated within 24 hours (Community only)
    maltreat_u24_com = fallback_row_sum(
      maltreat_u24_u5_com,
      maltreat_u24_5_14_com,
      maltreat_u24_ov15_com
    ),

    # Total treated after 24 hours (Community only)
    maltreat_ov24_com = fallback_row_sum(
      maltreat_ov24_u5_com,
      maltreat_ov24_5_14_com,
      maltreat_ov24_ov15_com
    ),

    # Overall totals (HF + Community)
    maltreat_u24_total = fallback_row_sum(maltreat_u24_hf, maltreat_u24_com),
    maltreat_ov24_total = fallback_row_sum(maltreat_ov24_hf, maltreat_ov24_com),

    # Presumed cases by age group
    pres_com_u5 = fallback_diff(maltreat_com_u5, conf_com_u5),
    pres_com_5_14 = fallback_diff(maltreat_com_5_14, conf_com_5_14),
    pres_com_ov15 = fallback_diff(maltreat_com_ov15, conf_com_ov15),

    pres_hf_u5 = fallback_diff(maltreat_hf_u5, conf_hf_u5),
    pres_hf_5_14 = fallback_diff(maltreat_hf_5_14, conf_hf_5_14),
    pres_hf_ov15 = fallback_diff(maltreat_hf_ov15, conf_hf_ov15),

    pres_u5 = fallback_row_sum(pres_com_u5, pres_hf_u5),
    pres_5_14 = fallback_row_sum(pres_com_5_14, pres_hf_5_14),
    pres_ov15 = fallback_row_sum(pres_com_ov15, pres_hf_ov15)
  )
Output

To adapt the code: - Lines 4–64: Adjust the variable names to reflect those relevant to your dataset when calculating new variables - Once updated, run the code

Show the code
d_sums = {
    # outpatient visits
    'allout' : ['allout_u5',
                'allout_ov5'],

    # suspected cases
    'susp' : ['susp_hf_u5',
              'susp_hf_5_14',
              'susp_hf_ov15',
              'susp_com_u5',
              'susp_com_5_14',
              'susp_com_ov15'],

    # tested cases
    'test_hf' : ['test_neg_mic_u5_hf',
                 'test_pos_mic_u5_hf',
                 'test_neg_mic_5_14_hf',
                 'test_pos_mic_5_14_hf',
                 'test_neg_mic_ov15_hf',
                 'test_pos_mic_ov15_hf',
                 'test_neg_rdt_u5_hf',
                 'test_pos_rdt_u5_hf',
                 'test_neg_rdt_5_14_hf',
                 'test_pos_rdt_5_14_hf',
                 'test_neg_rdt_ov15_hf',
                 'test_pos_rdt_ov15_hf'],

    'test_com' : ['test_neg_rdt_u5_com',
                  'test_pos_rdt_u5_com',
                  'test_neg_rdt_5_14_com',
                  'test_pos_rdt_5_14_com',
                  'test_neg_rdt_ov15_com',
                  'test_pos_rdt_ov15_com'],

    'test' : ['test_hf',
              'test_com'],

    # confirmed cases (HF and COM)
    'conf_hf' : ['test_pos_mic_u5_hf',
                 'test_pos_mic_5_14_hf',
                 'test_pos_mic_ov15_hf',
                 'test_pos_rdt_u5_hf',
                 'test_pos_rdt_5_14_hf',
                 'test_pos_rdt_ov15_hf'],

    'conf_com' : ['test_pos_rdt_u5_com',
                  'test_pos_rdt_5_14_com',
                  'test_pos_rdt_ov15_com'],

    'conf' : ['conf_hf',
              'conf_com'],

    # treated cases
    'maltreat_com' : ['maltreat_u24_u5_com',
                      'maltreat_ov24_u5_com',
                      'maltreat_u24_5_14_com',
                      'maltreat_ov24_5_14_com',
                      'maltreat_u24_ov15_com',
                      'maltreat_ov24_ov15_com'],

    'maltreat_hf' : ['maltreat_u24_u5_hf',
                      'maltreat_ov24_u5_hf',
                      'maltreat_u24_5_14_hf',
                      'maltreat_ov24_5_14_hf',
                      'maltreat_u24_ov15_hf',
                      'maltreat_ov24_ov15_hf'],

    'maltreat' : ['maltreat_hf',
                  'maltreat_com'],

    # malaria admissions
    'maladm' : ['maladm_u5',
                'maladm_5_14',
                'maladm_ov15'],

    # malaria deaths
    'maldth' : ['maldth_u5',
                'maldth_1_59m',
                'maldth_10_14',
                'maldth_5_9',
                'maldth_5_14',
                'maldth_ov15',
                'maldth_fem_ov15',
                'maldth_mal_ov15'],

    # AGE-GROUP SPECIFIC AGGREGATIONS
    # Tested cases by age group (HF only)
    'test_hf_u5' : ['test_neg_mic_u5_hf',
                    'test_pos_mic_u5_hf',
                    'test_neg_rdt_u5_hf',
                    'test_pos_rdt_u5_hf'],

    'test_hf_5_14' : ['test_neg_mic_5_14_hf',
                      'test_pos_mic_5_14_hf',
                      'test_neg_rdt_5_14_hf',
                      'test_pos_rdt_5_14_hf'],

    'test_hf_ov15' : ['test_neg_mic_ov15_hf',
                      'test_pos_mic_ov15_hf',
                      'test_neg_rdt_ov15_hf',
                      'test_pos_rdt_ov15_hf'],

    # Tested cases by age group (Community only)
    'test_com_u5' : ['test_neg_rdt_u5_com',
                     'test_pos_rdt_u5_com'],

    'test_com_5_14' : ['test_neg_rdt_5_14_com',
                       'test_pos_rdt_5_14_com'],

    'test_com_ov15' : ['test_neg_rdt_ov15_com',
                       'test_pos_rdt_ov15_com'],

    # Total tested by age group (HF + Community)
    'test_u5' : ['test_hf_u5',
                 'test_com_u5'],
    'test_5_14' : ['test_hf_5_14',
                   'test_com_5_14'],
    'test_ov15' : ['test_hf_ov15',
                   'test_com_ov15'],

   # Suspected cases by age group
    # Total suspected by age group (HF + Community)
    'susp_u5' : ['susp_hf_u5',
                 'susp_com_u5'],
    'susp_5_14' : ['susp_hf_5_14',
                   'susp_com_5_14'],
    'susp_ov15' : ['susp_hf_ov15',
                   'susp_com_ov15'],

    # Confirmed cases by age group (HF only)
    'conf_hf_u5' : ['test_pos_mic_u5_hf',
                    'test_pos_rdt_u5_hf'],

    'conf_hf_5_14' : ['test_pos_mic_5_14_hf',
                      'test_pos_rdt_5_14_hf'],

    'conf_hf_ov15' : ['test_pos_mic_ov15_hf',
                      'test_pos_rdt_ov15_hf'],

    # Confirmed cases by age group (Community only)
    'conf_com_u5' : ['test_pos_rdt_u5_com'],
    'conf_com_5_14' : ['test_pos_rdt_5_14_com'],
    'conf_com_ov15' : ['test_pos_rdt_ov15_com'],

    # Total confirmed by age group (HF + Community)
    'conf_u5' : ['conf_hf_u5',
                 'conf_com_u5'],
    'conf_5_14' : ['conf_hf_5_14',
                   'conf_com_5_14'],
    'conf_ov15' : ['conf_hf_ov15',
                   'conf_com_ov15'],

    # Treated cases by age group (HF only)
    'maltreat_hf_u5' : ['maltreat_u24_u5_hf',
                        'maltreat_ov24_u5_hf'],

    'maltreat_hf_5_14' : ['maltreat_u24_5_14_hf',
                          'maltreat_ov24_5_14_hf'],

    'maltreat_hf_ov15' : ['maltreat_u24_ov15_hf',
                          'maltreat_ov24_ov15_hf'],

    # Treated cases by age group (Community only)
    'maltreat_com_u5' : ['maltreat_u24_u5_com',
                         'maltreat_ov24_u5_com'],

    'maltreat_com_5_14' : ['maltreat_u24_5_14_com',
                           'maltreat_ov24_5_14_com'],

    'maltreat_com_ov15' : ['maltreat_u24_ov15_com',
                           'maltreat_ov24_ov15_com'],

    # Total treated by age group (HF + Community)
    'maltreat_u5' : ['maltreat_hf_u5',
                     'maltreat_com_u5'],
    'maltreat_5_14' : ['maltreat_hf_5_14',
                       'maltreat_com_5_14'],
    'maltreat_ov15' : ['maltreat_hf_ov15',
                       'maltreat_com_ov15'],

    # Total treated within 24 hours (HF only)
    'maltreat_u24_hf' : ['maltreat_u24_u5_hf',
                         'maltreat_u24_5_14_hf',
                         'maltreat_u24_ov15_hf'],

    # Total treated after 24 hours (HF only)
    'maltreat_ov24_hf' : ['maltreat_ov24_u5_hf',
                          'maltreat_ov24_5_14_hf',
                          'maltreat_ov24_ov15_hf'],

    # Total treated within 24 hours (Community only)
    'maltreat_u24_com' : ['maltreat_u24_u5_com',
                          'maltreat_u24_5_14_com',
                          'maltreat_u24_ov15_com'],

    # Total treated after 24 hours (Community only)
    'maltreat_ov24_com' : ['maltreat_ov24_u5_com',
                           'maltreat_ov24_5_14_com',
                           'maltreat_ov24_ov15_com'],

    # Overall totals (HF + Community)
    'maltreat_u24_total' : ['maltreat_u24_hf',
                            'maltreat_u24_com'],
    'maltreat_ov24_total' : ['maltreat_ov24_hf',
                             'maltreat_ov24_com'],
}

for col_sum, cols in d_sums.items():
    # create a new column with the sum of the columns in the list
    dhis2_df[col_sum] = dhis2_df[cols].sum(axis = 1, skipna = True, min_count = 1)

# substractions
# presumed cases
# The order of the two columns in the dictionary is important.
# If we want to perform the operation pres = maltreat-conf, then we need to prepare the dictionary as : d = {'pres': ['maltreat', 'conf']}
d_pres = {'pres_com' : ['maltreat_com', 'conf_com'],
          'pres_hf' : ['maltreat_hf', 'conf_hf'],
          # Presumed cases by age group
          'pres_com_u5' : ['maltreat_com_u5', 'conf_com_u5'],
          'pres_com_5_14' : ['maltreat_com_5_14', 'conf_com_5_14'],
          'pres_com_ov15' : ['maltreat_com_ov15', 'conf_com_ov15'],
          'pres_hf_u5' : ['maltreat_hf_u5', 'conf_hf_u5'],
          'pres_hf_5_14' : ['maltreat_hf_5_14', 'conf_hf_5_14'],
          'pres_hf_ov15' : ['maltreat_hf_ov15', 'conf_hf_ov15'],
          }
for col_pres, cols in d_pres.items():
    # create a new column with the difference between the two columns in the list
    dhis2_df[col_pres] = dhis2_df[cols[0]].sub(dhis2_df[cols[1]])
    # ensure that we do not have negative values
    dhis2_df[col_pres] = dhis2_df[col_pres].clip(lower=0)

dhis2_df['pres'] = dhis2_df[['pres_com', 'pres_hf']].sum(axis = 1, skipna = True, min_count = 1)
dhis2_df['pres_u5'] = dhis2_df[['pres_com_u5', 'pres_hf_u5']].sum(axis = 1, skipna = True, min_count = 1)
dhis2_df['pres_5_14'] = dhis2_df[['pres_com_5_14', 'pres_hf_5_14']].sum(axis = 1, skipna = True, min_count = 1)
dhis2_df['pres_ov15'] =  dhis2_df[['pres_com_ov15', 'pres_hf_ov15']].sum(axis = 1, skipna = True, min_count = 1)

# Inspect results
dhis2_df.head(10).style
Output
  adm0 adm1 adm2 adm3 hf YM allout_u5 allout_ov5 maladm_u5 maladm_5_14 maladm_ov15 maldth_u5 maldth_5_14 maldth_ov15 susp_hf_u5 susp_hf_5_14 susp_hf_ov15 susp_com_u5 susp_com_5_14 susp_com_ov15 maldth_fem_ov15 maldth_mal_ov15 maldth_1_59m maldth_10_14 maldth_5_9 test_neg_rdt_u5_com test_pos_rdt_u5_com test_neg_rdt_5_14_com test_pos_rdt_5_14_com test_neg_rdt_ov15_com test_pos_rdt_ov15_com test_neg_mic_u5_hf test_pos_mic_u5_hf test_neg_mic_5_14_hf test_pos_mic_5_14_hf test_neg_mic_ov15_hf test_pos_mic_ov15_hf test_neg_rdt_u5_hf test_pos_rdt_u5_hf test_neg_rdt_5_14_hf test_pos_rdt_5_14_hf test_neg_rdt_ov15_hf test_pos_rdt_ov15_hf maltreat_u24_u5_com maltreat_ov24_u5_com maltreat_u24_5_14_com maltreat_ov24_5_14_com maltreat_u24_ov15_com maltreat_ov24_ov15_com maltreat_u24_u5_hf maltreat_ov24_u5_hf maltreat_u24_5_14_hf maltreat_ov24_5_14_hf maltreat_u24_ov15_hf maltreat_ov24_ov15_hf allout susp test_hf test_com test conf_hf conf_com conf maltreat_com maltreat_hf maltreat maladm maldth test_hf_u5 test_hf_5_14 test_hf_ov15 test_com_u5 test_com_5_14 test_com_ov15 test_u5 test_5_14 test_ov15 susp_u5 susp_5_14 susp_ov15 conf_hf_u5 conf_hf_5_14 conf_hf_ov15 conf_com_u5 conf_com_5_14 conf_com_ov15 conf_u5 conf_5_14 conf_ov15 maltreat_hf_u5 maltreat_hf_5_14 maltreat_hf_ov15 maltreat_com_u5 maltreat_com_5_14 maltreat_com_ov15 maltreat_u5 maltreat_5_14 maltreat_ov15 maltreat_u24_hf maltreat_ov24_hf maltreat_u24_com maltreat_ov24_com maltreat_u24_total maltreat_ov24_total pres_com pres_hf pres_com_u5 pres_com_5_14 pres_com_ov15 pres_hf_u5 pres_hf_5_14 pres_hf_ov15 pres pres_u5 pres_5_14 pres_ov15
0 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP January 2015 44.000000 43.000000 nan nan nan nan nan nan 21.000000 4.000000 4.000000 10.000000 7.000000 4.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 21.000000 nan 4.000000 nan 4.000000 15.000000 6.000000 5.000000 2.000000 3.000000 1.000000 18.000000 3.000000 4.000000 nan 2.000000 nan 87.000000 50.000000 29.000000 nan 29.000000 29.000000 nan 29.000000 32.000000 27.000000 59.000000 nan nan 21.000000 4.000000 4.000000 nan nan nan 21.000000 4.000000 4.000000 31.000000 11.000000 8.000000 21.000000 4.000000 4.000000 nan nan nan 21.000000 4.000000 4.000000 21.000000 4.000000 2.000000 21.000000 7.000000 4.000000 42.000000 11.000000 6.000000 24.000000 3.000000 23.000000 9.000000 47.000000 12.000000 nan 0.000000 nan nan nan 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP February 2015 49.000000 34.000000 nan nan nan nan nan nan 37.000000 3.000000 4.000000 10.000000 7.000000 4.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 10.000000 27.000000 nan 3.000000 nan 4.000000 15.000000 6.000000 5.000000 2.000000 3.000000 1.000000 17.000000 4.000000 2.000000 nan 2.000000 1.000000 83.000000 65.000000 44.000000 nan 44.000000 34.000000 nan 34.000000 32.000000 26.000000 58.000000 nan nan 37.000000 3.000000 4.000000 nan nan nan 37.000000 3.000000 4.000000 47.000000 10.000000 8.000000 27.000000 3.000000 4.000000 nan nan nan 27.000000 3.000000 4.000000 21.000000 2.000000 3.000000 21.000000 7.000000 4.000000 42.000000 9.000000 7.000000 21.000000 5.000000 23.000000 9.000000 44.000000 14.000000 nan 0.000000 nan nan nan 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP March 2015 102.000000 83.000000 nan nan nan nan nan nan 46.000000 5.000000 14.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 36.000000 nan 3.000000 nan 8.000000 nan nan nan nan nan nan 36.000000 nan 3.000000 nan 5.000000 nan 185.000000 65.000000 47.000000 nan 47.000000 47.000000 nan 47.000000 nan 44.000000 44.000000 nan nan 36.000000 3.000000 8.000000 nan nan nan 36.000000 3.000000 8.000000 46.000000 5.000000 14.000000 36.000000 3.000000 8.000000 nan nan nan 36.000000 3.000000 8.000000 36.000000 3.000000 5.000000 nan nan nan 36.000000 3.000000 5.000000 44.000000 nan nan nan 44.000000 nan nan 0.000000 nan nan nan 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP April 2015 68.000000 68.000000 nan nan nan nan nan nan 44.000000 10.000000 10.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 9.000000 35.000000 4.000000 6.000000 5.000000 5.000000 nan nan nan nan nan nan 27.000000 8.000000 4.000000 2.000000 3.000000 2.000000 136.000000 64.000000 64.000000 nan 64.000000 46.000000 nan 46.000000 nan 46.000000 46.000000 nan nan 44.000000 10.000000 10.000000 nan nan nan 44.000000 10.000000 10.000000 44.000000 10.000000 10.000000 35.000000 6.000000 5.000000 nan nan nan 35.000000 6.000000 5.000000 35.000000 6.000000 5.000000 nan nan nan 35.000000 6.000000 5.000000 34.000000 12.000000 nan nan 34.000000 12.000000 nan 0.000000 nan nan nan 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP May 2015 118.000000 76.000000 nan nan nan nan nan nan 105.000000 15.000000 22.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 20.000000 85.000000 4.000000 11.000000 5.000000 17.000000 nan nan nan nan nan nan 67.000000 18.000000 7.000000 4.000000 10.000000 7.000000 194.000000 142.000000 142.000000 nan 142.000000 113.000000 nan 113.000000 nan 113.000000 113.000000 nan nan 105.000000 15.000000 22.000000 nan nan nan 105.000000 15.000000 22.000000 105.000000 15.000000 22.000000 85.000000 11.000000 17.000000 nan nan nan 85.000000 11.000000 17.000000 85.000000 11.000000 17.000000 nan nan nan 85.000000 11.000000 17.000000 84.000000 29.000000 nan nan 84.000000 29.000000 nan 0.000000 nan nan nan 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP June 2015 145.000000 89.000000 nan nan nan nan nan nan 110.000000 10.000000 21.000000 10.000000 11.000000 9.000000 nan nan nan nan nan 2.000000 8.000000 1.000000 10.000000 1.000000 8.000000 nan nan nan nan nan nan 24.000000 86.000000 6.000000 4.000000 10.000000 11.000000 13.000000 4.000000 9.000000 1.000000 6.000000 1.000000 70.000000 16.000000 3.000000 1.000000 6.000000 5.000000 234.000000 171.000000 141.000000 30.000000 171.000000 101.000000 26.000000 127.000000 34.000000 101.000000 135.000000 nan nan 110.000000 10.000000 21.000000 10.000000 11.000000 9.000000 120.000000 21.000000 30.000000 120.000000 21.000000 30.000000 86.000000 4.000000 11.000000 8.000000 10.000000 8.000000 94.000000 14.000000 19.000000 86.000000 4.000000 11.000000 17.000000 10.000000 7.000000 103.000000 14.000000 18.000000 79.000000 22.000000 28.000000 6.000000 107.000000 28.000000 8.000000 0.000000 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 9.000000 0.000000 0.000000
6 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP July 2015 95.000000 94.000000 nan nan nan nan nan nan 67.000000 31.000000 26.000000 10.000000 13.000000 9.000000 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 17.000000 50.000000 9.000000 22.000000 8.000000 18.000000 20.000000 nan 10.000000 nan 8.000000 nan 43.000000 6.000000 13.000000 5.000000 10.000000 5.000000 189.000000 156.000000 124.000000 nan 124.000000 90.000000 nan 90.000000 38.000000 82.000000 120.000000 nan nan 67.000000 31.000000 26.000000 nan nan nan 67.000000 31.000000 26.000000 77.000000 44.000000 35.000000 50.000000 22.000000 18.000000 nan nan nan 50.000000 22.000000 18.000000 49.000000 18.000000 15.000000 20.000000 10.000000 8.000000 69.000000 28.000000 23.000000 66.000000 16.000000 38.000000 nan 104.000000 16.000000 nan 0.000000 nan nan nan 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP August 2015 86.000000 63.000000 nan nan nan nan nan nan 72.000000 6.000000 15.000000 15.000000 17.000000 14.000000 nan nan nan nan nan 2.000000 13.000000 3.000000 14.000000 2.000000 12.000000 nan nan nan nan nan nan 8.000000 64.000000 nan 4.000000 nan 15.000000 20.000000 4.000000 11.000000 3.000000 6.000000 6.000000 52.000000 12.000000 3.000000 1.000000 11.000000 4.000000 149.000000 139.000000 91.000000 46.000000 137.000000 83.000000 39.000000 122.000000 50.000000 83.000000 133.000000 nan nan 72.000000 4.000000 15.000000 15.000000 17.000000 14.000000 87.000000 21.000000 29.000000 87.000000 23.000000 29.000000 64.000000 4.000000 15.000000 13.000000 14.000000 12.000000 77.000000 18.000000 27.000000 64.000000 4.000000 15.000000 24.000000 14.000000 12.000000 88.000000 18.000000 27.000000 66.000000 17.000000 37.000000 13.000000 103.000000 30.000000 11.000000 0.000000 11.000000 0.000000 0.000000 0.000000 0.000000 0.000000 11.000000 11.000000 0.000000 0.000000
8 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP September 2015 117.000000 94.000000 nan nan nan nan nan nan 66.000000 15.000000 24.000000 16.000000 19.000000 16.000000 nan nan nan nan nan 2.000000 14.000000 3.000000 16.000000 2.000000 14.000000 nan nan nan nan nan nan 16.000000 50.000000 9.000000 6.000000 2.000000 22.000000 22.000000 4.000000 13.000000 3.000000 6.000000 3.000000 42.000000 8.000000 3.000000 3.000000 5.000000 7.000000 211.000000 156.000000 105.000000 51.000000 156.000000 78.000000 44.000000 122.000000 51.000000 68.000000 119.000000 nan nan 66.000000 15.000000 24.000000 16.000000 19.000000 16.000000 82.000000 34.000000 40.000000 82.000000 34.000000 40.000000 50.000000 6.000000 22.000000 14.000000 16.000000 14.000000 64.000000 22.000000 36.000000 50.000000 6.000000 12.000000 26.000000 16.000000 9.000000 76.000000 22.000000 21.000000 50.000000 18.000000 41.000000 10.000000 91.000000 28.000000 7.000000 0.000000 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 7.000000 12.000000 0.000000 0.000000
9 Sierra Leone Moyamba District Moyamba District Council Kori Chiefdom Bai Largo MCHP October 2015 97.000000 68.000000 nan nan nan nan nan nan 84.000000 5.000000 11.000000 16.000000 5.000000 10.000000 nan nan nan nan nan 5.000000 11.000000 4.000000 1.000000 4.000000 6.000000 nan nan nan nan nan nan 11.000000 73.000000 2.000000 3.000000 3.000000 8.000000 23.000000 3.000000 1.000000 nan 4.000000 nan 66.000000 7.000000 2.000000 1.000000 5.000000 3.000000 165.000000 131.000000 100.000000 31.000000 131.000000 84.000000 18.000000 102.000000 31.000000 84.000000 115.000000 nan nan 84.000000 5.000000 11.000000 16.000000 5.000000 10.000000 100.000000 10.000000 21.000000 100.000000 10.000000 21.000000 73.000000 3.000000 8.000000 11.000000 1.000000 6.000000 84.000000 4.000000 14.000000 73.000000 3.000000 8.000000 26.000000 1.000000 4.000000 99.000000 4.000000 12.000000 73.000000 11.000000 28.000000 3.000000 101.000000 14.000000 13.000000 0.000000 15.000000 0.000000 0.000000 0.000000 0.000000 0.000000 13.000000 15.000000 0.000000 0.000000

To adapt the code: - Lines X–X: Adjust the variable names to reflect those relevant to your dataset when calculating new variables - Once updated, run the code

Step 4.2: (If needed) Quality control of indicator totals

Now let’s check that indicator totals are equal to the sum of their disaggregated components. The code chunk below filters and displays rows of the data frame where the indicator total does not equal the sum of its components. For example, rows where allout is not equal to the sum of allout_u5 and allout_ov5. Age group aggregates are also checked in this code.

Why check indicator totals?

If all totals were manually summed using the previous step, you may choose to skip this step. However, even if totals were manually calculated in the previous step, checking totals here can help you identify errors.

In the case that total columns are extracted from DHIS2, it is essential to perform this quality control check for coherency. This check confirms which components go into the aggregated totals extracted from DHIS2.

  • R
  • Python
Show the code
# Check indicator totals
dhis2_df |>
  filter(
    # Outpatient checks
    allout != allout_u5 + allout_ov5 |

    # Malaria admissions check
    maladm != maladm_u5 + maladm_5_14 + maladm_ov15 |

    # Total tests check
    test != test_hf + test_com |

    # Total confirmed check
    conf !=  conf_hf + conf_com |

    # Total treated check
    maltreat != maltreat_hf + maltreat_com |

    # Total presumed check
    pres != pres_hf + pres_com |

    # Malaria deaths
    maldth != maldth_1_59m + maldth_u5 + maldth_5_9 + maldth_10_14 +
      maldth_5_14 +maldth_fem_ov15 + maldth_mal_ov15 + maldth_ov15 |

    # Tested cases checks
    test_u5 != test_hf_u5 + test_com_u5 |
    test_5_14 != test_hf_5_14 + test_com_5_14 |
    test_ov15 != test_hf_ov15 + test_com_ov15 |

    # Confirmed cases checks
    conf_u5 != conf_hf_u5 + conf_com_u5 |
    conf_5_14 != conf_hf_5_14 + conf_com_5_14 |
    conf_ov15 != conf_hf_ov15 + conf_com_ov15 |

    # Presumed cases checks
    pres_u5 != pres_hf_u5 + pres_com_u5 |
    pres_5_14 != pres_hf_5_14 + pres_com_5_14 |
    pres_ov15 != pres_hf_ov15 + pres_com_ov15|

    # Suspected cases checks
    susp_u5 != susp_u5_hf + susp_u5_com |
    susp_5_14 != susp_5_14_hf + susp_5_14_com |
    susp_ov15 != susp_ov15_hf + susp_ov15_com|

   # Treated cases checks
    maltreat_u5 != maltreat_hf_u5 + maltreat_com_u5 |
    maltreat_5_14 != maltreat_hf_5_14 + maltreat_com_5_14 |
    maltreat_ov15 != maltreat_hf_ov15 + maltreat_com_ov15 |

    # Treatment timing checks
    maltreat_u24_total != maltreat_u24_hf + maltreat_u24_com |
    maltreat_ov24_total != maltreat_ov24_hf + maltreat_ov24_com
  ) |>
  select(
    hf, periodname, allout,
    maldth, maladm, test,
    conf, maltreat, pres,
    test_u5, test_5_14, test_ov15,
    conf_u5, conf_5_14, conf_ov15,
    pres_u5, pres_5_14, pres_ov15,
    maltreat_u5, maltreat_5_14, maltreat_ov15,
    maltreat_u24_total, maltreat_ov24_total
  )
Output

Note: because this code displays rows where indicator totals are not equal to the sum of their components, we should expect to see output with <0 rows> if all calculations are correct. If rows do appear in the output, use the hf and periodname to further investigate the discrepancy.

# A tibble: 0 × 23
# ℹ 23 variables: hf <chr>, periodname <chr>, allout <dbl>, maldth <dbl>,
#   maladm <dbl>, test <dbl>, conf <dbl>, maltreat <dbl>, pres <dbl>,
#   test_u5 <dbl>, test_5_14 <dbl>, test_ov15 <dbl>, conf_u5 <dbl>,
#   conf_5_14 <dbl>, conf_ov15 <dbl>, pres_u5 <dbl>, pres_5_14 <dbl>,
#   pres_ov15 <dbl>, maltreat_u5 <dbl>, maltreat_5_14 <dbl>,
#   maltreat_ov15 <dbl>, maltreat_u24_total <dbl>, maltreat_ov24_total <dbl>

Step 4.3: indicator totals coherency visualization

Incoherent totals can be visualized to facilitate further investigation. The Sierra Leone example DHIS2 data does not contain any rows with incoherent totals. A deteriorated version of Sierra Leone’s DHIS2 data is used below to exemplify the case of incoherent totals. Any coherency issues demonstrated below are not representative of Sierra Leone’s DHIS2 data quality.

  • R
  • Python
Show the code
# Define component checks for visualization
component_checks <- list(
  list(total = "allout", parts = c("allout_u5", "allout_ov5")),
  list(total = "maladm", parts = c("maladm_u5", "maladm_5_14", "maladm_ov15")),
  list(total = "test", parts = c("test_hf", "test_com")),
  list(total = "conf", parts = c("conf_hf", "conf_com")),
  list(total = "maltreat", parts = c("maltreat_hf", "maltreat_com")),
  list(total = "pres", parts = c("pres_hf", "pres_com"))
)

# Create visualization function
plot_coherence_check <- function(data, facility = NULL, period = NULL) {
  # Filter data if specified
  if (!is.null(facility) && !is.null(period)) {
    data <- data[data$hf == facility & data$periodname == period, ]
  }

  # Calculate all comparisons
  results <- base::lapply(component_checks, function(x) {
    if (!x$total %in% base::names(data)) return(NULL)

    parts_exist <- x$parts[x$parts %in% base::names(data)]
    if (base::length(parts_exist) == 0) return(NULL)

    base::data.frame(
      indicator = x$total,
      reported = data[[x$total]],
      calculated = base::rowSums(data[parts_exist], na.rm = TRUE)
    )
  }) |>
    (\(x) base::do.call(base::rbind, base::Filter(base::NROW, x)))()

  # Create scatterplot
  graphics::plot(
    x = results$calculated,
    y = results$reported,
    xlab = "Calculated Sum of Components",
    ylab = "Reported Total",
    main = "Data Coherence Check: Reported vs Calculated Totals",
    pch = 19,
    col = base::ifelse(results$reported == results$calculated, "#4E79A7", "#E15759")
  )

  # Add identity line
  graphics::abline(a = 0, b = 1, col = "darkgray", lty = 2)

  # Add legend
  graphics::legend("topleft",
         legend = c("Coherent", "Incoherent"),
         pch = 19,
         col = c("#4E79A7", "#E15759"),
         bty = "n")

  # Label outliers
  incoherent <- base::which(results$reported != results$calculated)
  if (base::length(incoherent) > 0) {
    graphics::text(
      x = results$calculated[incoherent],
      y = results$reported[incoherent],
      labels = results$indicator[incoherent],
      pos = 4,
      cex = 0.8,
      col = "#E15759"
    )
  }
}
Output

The identity represents where reported or extracted totals equal calculated totals for aggregated columns. Incoherent totals, where this equality does not hold true, are coloured in red. The example deteriorated Sierra Leone DHIS2 data shows incoherencies in the allout column.

Show the code
import matplotlib.pyplot as plt
import seaborn as sns

def plot_totals_check(data, total_col, component_cols, title="Reported vs Calculated Totals"):
    """Plot reported totals against calculated sums of components."""
    # Calculate sum of components
    calculated = data[component_cols].sum(axis=1)

    # Create plot
    plt.figure(figsize=(8, 6))
    sns.scatterplot(
        x=calculated,
        y=data[total_col],
        hue=(calculated == data[total_col]),
        palette={True: "#4E79A7", False: "#E15759"},
        s=100,
        alpha=0.7
    )

    # Add identity line
    max_val = max(calculated.max(), data[total_col].max())
    plt.plot([0, max_val], [0, max_val], '--', color='gray')

    # Add labels and title
    plt.xlabel("Calculated sum of components")
    plt.ylabel("Reported total")
    plt.title(title)

    # Label mismatches
    mismatches = data[calculated != data[total_col]]
    for i, row in mismatches.iterrows():
        plt.text(
            x=calculated[i],
            y=data[total_col][i],
            s=f"{row['hf']}\n{row['periodname']}",
            color="#E15759",
            fontsize=8
        )

    plt.legend().remove()
    plt.show()

# List of indicators to check
indicator_checks = [
    {"total": "allout", "components": ["allout_u5", "allout_ov5"]},
    {"total": "maladm", "components": ["maladm_u5", "maladm_5_14", "maladm_ov15"]},
    {"total": "test", "components": ["test_hf", "test_com"]},
    {"total": "conf", "components": ["conf_hf", "conf_com"]},
    {"total": "maltreat", "components": ["maltreat_hf", "maltreat_com"]},
    {"total": "pres", "components": ["pres_hf", "pres_com"]}
]

# Generate all plots
for ind in indicator_checks:
    if all(col in dhis2_df.columns for col in [ind["total"]] + ind["components"]):
        plot_totals_check(dhis2_df, ind["total"], ind["components"])
Consult SNT team

Incoherent totals must be shared for review with the SNT team as they may signal critical data quality issues. These discrepancies can reflect reporting errors like overcounting or missing component data, but may also reveal systemic problems such as mismatched indicator definitions or inclusions. Regardless of origin, each case requires source-level verification to distinguish between true data errors and legitimate program exceptions. The SNT team can advise you on how to manage incoherent totals.

Step 4.4: Export rows with incoherent totals

Incoherent totals visualized in the previous step can be exported using this code. Exported outputs must be shared with the SNT team for review and guidance.

  • R
  • Python
Show the code
# Identify rows with incoherent totals
incoherent_rows <- dhis2_df_example |>
  dplyr::filter(
    # Outpatient checks
    allout != allout_u5 + allout_ov5 |
    # Malaria admissions check
    maladm != maladm_u5 + maladm_5_14 + maladm_ov15 |
    # Total tests check
    test != test_hf + test_com |
    # Total confirmed check
    conf != conf_hf + conf_com |
    # Total treated check
    maltreat != maltreat_hf + maltreat_com |
    # Total presumed check
    pres != pres_hf + pres_com
  ) |>
  dplyr::select(
    hf, periodname,
    dplyr::matches("allout|maladm|test|conf|maltreat|pres")
  )

# Export to CSV with timestamp
output_file <- paste0("incoherent_totals_", format(Sys.time(), "%Y%m%d_%H%M"), ".csv")
readr::write_csv(incoherent_rows, file = output_file)

# Print summary
if (nrow(incoherent_rows) > 0) {
  message(
    "Found ", nrow(incoherent_rows),
    " rows with incoherent totals. Exported to: ", output_file
  )
  print(incoherent_rows)
} else {
  message("No incoherent totals found in the data.")
}
Output
# A tibble: 2 × 81
  hf           periodname allout_u5 allout_ov5 maladm_u5 maladm_5_14 maladm_ov15
  <chr>        <chr>          <dbl>      <dbl>     <dbl>       <dbl>       <dbl>
1 Jenner Wrig… November …       725        137        NA          NA          NA
2 Masimera CHC February …        63          9        NA          NA          NA
# ℹ 74 more variables: test_neg_mic_u5_hf <dbl>, test_pos_mic_u5_hf <dbl>,
#   test_neg_mic_5_14_hf <dbl>, test_pos_mic_5_14_hf <dbl>,
#   test_neg_mic_ov15_hf <dbl>, test_pos_mic_ov15_hf <dbl>,
#   maltreat_u24_u5_com <dbl>, maltreat_ov24_u5_com <dbl>,
#   maltreat_u24_5_14_com <dbl>, maltreat_ov24_5_14_com <dbl>,
#   maltreat_u24_ov15_com <dbl>, maltreat_ov24_ov15_com <dbl>,
#   maltreat_u24_u5_hf <dbl>, maltreat_ov24_u5_hf <dbl>, …

Step 5: Create a unique year-month column

Step 5.1: Example data date format

Here we want to end up with a column featuring the report date in format YYYY-MM, which will come in handy for future data merges and other operations. Note, the chunk of code below only applied to the specific SLE example dataset - you need to adapt it to the format of your Year-Month column. Reformatting the date also helps avoid overcrowding on the x-axis when data is plotted and keeps the labels clear and readable.

  • R
  • Python
Show the code
dhis2_df <- dhis2_df |>
  # seperate periodname to month and year
  tidyr::separate(
    col = tidyselect::all_of("periodname"),
    into = c("month", "year"),
    sep = " ",
    remove = TRUE
  )

dhis2_df <- dhis2_df |>
  # recode month
  dplyr::mutate(
    month = recode(
      month,
      "January" = "01",
      "February" = "02",
      "March" = "03",
      "April" = "04",
      "May" = "05",
      "June" = "06",
      "July" = "07",
      "August" = "08",
      "September" = "09",
      "October" = "10",
      "November" = "11",
      "December" = "12"
    )
  ) |>
  # create a new date column
   unite("date", year, month, sep = "-", remove = FALSE)
Output

To adapt the code:

  • Line 4: Change the periodname in the code above to match the one in your dataset
  • Modify the code if the names of months in your data are in another language
  • Once updated, run the code
Show the code
# format YM column
# this would only work for the format 'January 2020'
# if the format is different, we need to adjust the format string
# we need to make sure the result is in the format 'YYYY-MM'
dhis2_df['YM'] = pd.to_datetime(dhis2_df['periodname'], format='%B %Y').dt.strftime('%Y-%m')

# Inspect results
dhis2_df.head(10).style
Output

To adapt the code:

Step 5.2: Handling alternative DHIS2 date formats

The example data illustrates one possible formatting of dates in DHIS2 extractions. However, not all data have this format. The table below outlines different date formats and the respective codes that can be used for reformatting. Example uses of these codes are also provided below the table.

Format Example Description R Approach Python Approach
Janvier 2023 Month name + Year (French) mutate(month = recode(month, "Janvier"="01", "Février"="02", ...)) Set locale first, then use %B %Y format
2023-Jan Year-Month abbreviation ymd(paste0(date_column,"-01")) %>% format("%Y-%m") pd.to_datetime(df['YM'], format='%Y-%b')
01/2023 Month/Year (numeric) my(date_column) %>% format("%Y-%m") pd.to_datetime(df['YM'], format='%m/%Y')
2023-01-15 Full date (YYYY-MM-DD) ymd(date_column) %>% format("%Y-%m") pd.to_datetime(df['YM']).dt.strftime('%Y-%m')
202301 Compact YYYYMM format ymd(paste0(date_column,"01")) %>% format("%Y-%m") pd.to_datetime(df['YM'], format='%Y%m')

Example Usage:

  • R
  • Python

Example 1: French month names

df <- df|>
  mutate(month = recode(month,
    "Janvier" = "01",
    "Février" = "02", # Note accent aigu
    "Mars" = "03",
    "Avril" = "04",
    "Mai" = "05",
    "Juin" = "06",
    "Juillet" = "07",
    "Août" = "08",   # Note accent circonflexe
    "Septembre" = "09",
    "Octobre" = "10",
    "Novembre" = "11",
    "Décembre" = "12"
  )) |>
  unite("YM", year, month, sep = "-")

Example 2: Portuguese month names

df <- df |>
  mutate(month = recode(month,
    "Janeiro"   = "01",
    "Fevereiro" = "02",
    "Março"     = "03",  # Note cedilha (ç) character
    "Abril"     = "04",
    "Maio"      = "05",
    "Junho"     = "06",
    "Julho"     = "07",
    "Agosto"    = "08",
    "Setembro"  = "09",
    "Outubro"   = "10",
    "Novembro"  = "11",
    "Dezembro"  = "12"
  )) |>
  unite("YM", year, month, sep = "-")

Example 3: Compact YYYYMM format

df <- df |>
  mutate(YM = format(ymd(paste0(periodname, "01")), "%Y-%m"))

Example 1: French month names:

locale.setlocale(locale.LC_TIME, 'fr_FR.UTF-8')  # Set French locale
df['YM'] = pd.to_datetime(df['YM'], format='%B %Y').dt.strftime('%Y-%m')

Example 2: Portuguese month names

locale.setlocale(locale.LC_TIME, 'pt_PT.UTF-8')  # Set Portuguese locale
df['YM'] = pd.to_datetime(df['YM'], format='%B %Y').dt.strftime('%Y-%m')

Example 3: Compact YYYYMM format:

df['YM'] = pd.to_datetime(df['periodname'], format='%Y%m').dt.strftime('%Y-%m')

Step 5.3: Unclear data format

If the date format in your data is not clear, the code below will analyze and extract the format.

  • R
  • Python
Show the code
# Extract the values in the 'period_name' column to process
period_values <- df$period_name

# Initialize empty vectors to store month, year, and year_month ("YYYY-MM") values
month_vec <- integer(length(period_values))
year_vec <- integer(length(period_values))
year_month_vec <- character(length(period_values))

# Loop through each value in the 'period_name' column
for (i in seq_along(period_values)) {

  # Clean up the string by trimming spaces
  val <- stringr::str_trim(period_values[i])

  # Initialize temporary variables for each value
  m <- NA_integer_
  y <- NA_integer_
  ym_str <- NA_character_

  # Check for missing or empty values
  if (is.na(val) || val == "") {
    dt <- NA

  # Format: "Jan-2015"
  } else if (nchar(val) == 8 && substr(val, 4,4) == "-") {
    dt <- tryCatch(lubridate::parse_date_time(val, "b-Y"), error = function(e) NA)

  # Format: "Jan 2015"
  } else if (nchar(val) == 8 && substr(val, 4,4) == " ") {
    dt <- tryCatch(lubridate::parse_date_time(val, "b Y"), error = function(e) NA)

  # Format: "January-2015"
  } else if (stringr::str_detect(val, "-") && nchar(stringr::word(val, 1, sep = "-")) > 3) {
    dt <- tryCatch(lubridate::parse_date_time(val, "B-Y"), error = function(e) NA)

  # Format: "January 2015"
  } else if (stringr::str_detect(val, " ") && nchar(stringr::word(val, 1, sep = " ")) > 3) {
    dt <- tryCatch(lubridate::parse_date_time(val, "B Y"), error = function(e) NA)

  # Format: "01-2015"
  } else if (stringr::str_detect(val, "-") && grepl("^[0-9]{2}-[0-9]{4}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "m-Y"), error = function(e) NA)

  # Format: "01/2015"
  } else if (stringr::str_detect(val, "/") && grepl("^[0-9]{2}/[0-9]{4}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "m/Y"), error = function(e) NA)

  # Format: "01.2015"
  } else if (stringr::str_detect(val, "\\.") && grepl("^[0-9]{2}\\.[0-9]{4}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "m.Y"), error = function(e) NA)

  # Format: "2015-01"
  } else if (stringr::str_detect(val, "-") && grepl("^[0-9]{4}-[0-9]{2}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "Y-m"), error = function(e) NA)

  # Format: "2015/01"
  } else if (stringr::str_detect(val, "/") && grepl("^[0-9]{4}/[0-9]{2}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "Y/m"), error = function(e) NA)

  # Format: "2015.01"
  } else if (stringr::str_detect(val, "\\.") && grepl("^[0-9]{4}\\.[0-9]{2}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "Y.m"), error = function(e) NA)

  # Format: "2015 Jan"
  } else if (length(stringr::str_split(val, " ", simplify = TRUE)) == 2 && grepl("^[0-9]{4}", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "Y b"), error = function(e) NA)

  # Format: "2015 January"
  } else if (length(stringr::str_split(val, " ", simplify = TRUE)) == 2 && grepl("^[0-9]{4}", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "Y B"), error = function(e) NA)

  # Format: "January2015"
  } else if (stringr::str_detect(val, "^[A-Za-z]{3,}[0-9]{4}$")) {
    dt <- tryCatch(lubridate::parse_date_time(val, "BY"), error = function(e) NA)

  # Format: "Jan2015"
  } else if (stringr::str_detect(val, "^[A-Za-z]{3}[0-9]{4}$")) {
    dt <- tryCatch(lubridate::parse_date_time(val, "bY"), error = function(e) NA)

  # Format: "201501"
  } else if (nchar(val) == 6 && grepl("^[0-9]{6}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "Ym"), error = function(e) NA)

  # Format: "012015"
  } else if (nchar(val) == 6 && grepl("^[0-9]{6}$", val)) {
    dt <- tryCatch(lubridate::parse_date_time(val, "mY"), error = function(e) NA)

  # Format not recognized
  } else {
    dt <- NA
  }

  # If parsing succeeded, extract month and year, and build "YYYY-MM" string
  if (!is.na(dt)) {
    m <- lubridate::month(dt)
    y <- lubridate::year(dt)
    ym_str <- sprintf("%04d-%02d", y, m)
  }

  # Save values into pre-created vectors
  month_vec[i] <- m
  year_vec[i] <- y
  year_month_vec[i] <- ym_str
}

# Add extracted values as new columns in the dataframe
df <- df |>
  dplyr::mutate(
    month = month_vec,
    year = year_vec,
    year_month = year_month_vec
  )

Step 6: (If needed) Create a unique ID

Step 6.1: Create unique IDs for admin units

VT: we could simplify this section by calling the hf column adm4 (or whatever level it is) and use the same UID column creation function for all levels in one go, instead of doing the same thing in two sections (one for admin uits, one for hf). Happy to discuss

Warning

Here we define important admin unit levels variables to simplify subsequent chunks of code. You must edit those based on your country-specific data.

  • R
  • Python
Show the code
# Set admin levels (EDIT THESE)
min_level <- 1  # Set highest admin unit level available in dataset (usually the Country, or Region/Province)
hf_level <- 5   # Set health facility admin unit level in the dataset (usually the lowest one)

# Core UID creation function
create_uid <- function(df, level) {
  cols <- paste0("adm", 1:level)

  df |>
    dplyr::arrange(across(dplyr::all_of(cols))) |>
    dplyr::group_by(across(dplyr::all_of(cols))) |>
    dplyr::mutate(
      "{paste0('adm', level, '_uid')}" := paste0(
        "adm", level, "_",
        sprintf("%05d", dplyr::cur_group_id())
      )
    ) |>
    dplyr::ungroup()
}

# Create all UIDs
for (level in min_level:hf_level) {
  df <- create_uid(df, level)
}

# Column reordering (matches Python logic)
base_cols <- lapply(
  min_level:hf_level,
  \(i) c(paste0("adm", i), paste0("adm", i, "_uid"))
) |>
  unlist()

cols_data <- df |>
  names() |>
  setdiff(c("YM", base_cols)) |>
  sort()

# Final output (identical to Python)
df <- df |>
  dplyr::select(dplyr::all_of(c("YM", base_cols, cols_data))) |>
  dplyr::arrange(across(dplyr::all_of(base_cols)))

# Verify first 10 rows
utils::head(df, 10)
Show the code
# If your data does NOT contain admin unit UIDs:
def create_uid(df, level):
    cols = [f'adm{i}' for i in range(0, level+1)]
    # sort admin units alphabetically
    df = df.sort_values(by = cols).reset_index(drop = True)
    # assign a unique number to each admin unit at this level, starting at 1
    df[f'adm{level}_uid'] = df.groupby(cols).ngroup() + 1
    # format the uid to a string with leading zeros, to ensure the number of characters in each uid is identical
    df[f'adm{level}_uid'] = df[f'adm{level}_uid'].apply(lambda x: f'adm{level}_{str(x).zfill(5)}')

    return df

# create UID columns
admin_unit_levels = [0, 1, 2, 3]
for level in admin_unit_levels:
    dhis2_df = create_uid(dhis2_df, level)

# reorder columns for display
base_cols = [[f'adm{i}', f'adm{i}_uid'] for i in admin_unit_levels]
base_cols = [i for j in base_cols for i in j]
base_cols = base_cols + ['hf']
base_cols = ['YM'] + base_cols
cols_data = sorted([c for c in dhis2_df.columns if c not in base_cols])
cols = base_cols + cols_data

dhis2_df = (dhis2_df[cols]
    .sort_values(by = base_cols)
    .reset_index(drop = True))

dhis2_df.head(10).style
Output

To adapt the code:

Step 6.2: Create unique IDs for health facilities

Most DHIS2 identify health facilities by unique IDs, but sometimes this information is not extracted with other columns. Even if your data was extracted with the unique health facility ID column, the column should be checked against the master facilities list (MFL) to identify errors such as blank IDs or multiple IDs for the same facility.

If the unique ID column was not downloaded, creating an ID column is essential.In many cases, different facilities share the same name but are located in different areas. Without a unique identifier these can be mistakenly treated as one, leading to errors in analysis and miscounts in SNT outputs. Assigning a distinct ID to each facility ensures accurate tracking and prevents naming overlaps from affecting results. Note that this code does not handle the case where a health facility is listed under two different names, such as typos or transliterations. In this case, the health facility will be counted as two separate facilities. To avoid this, it is strongly recommended to obtain the MFL prior to preprocessing and to download the health facility ID when extracting from DHIS2.

  • R
  • Python
Show the code
dhis2_df <- dhis2_df |>
  dplyr::group_by(adm1, adm2, adm3, hf) |>
  dplyr::mutate(hf_uid = sprintf("hf_%04d", dplyr::cur_group_id())) |>
  dplyr::ungroup()

# check it worked
dhis2_df |>
  dplyr::distinct(adm2, adm3, hf, hf_uid) |>
  tail()
Output
# A tibble: 6 × 4
  adm2                  adm3        hf                                hf_uid 
  <chr>                 <chr>       <chr>                             <chr>  
1 Freetown City Council West 1 Zone Grey Bush CHC                     hf_1698
2 Freetown City Council West 2 Zone Kingharman Road Under Fives CHP   hf_1716
3 Freetown City Council West 1 Zone Kingtom Police Under Fives CHP    hf_1700
4 Freetown City Council East 2 Zone Mabella CHC                       hf_1634
5 Freetown City Council West 1 Zone Methodist Community Health Clinic hf_1702
6 Freetown City Council East 3 Zone Rokupa Under Fives CHP            hf_1677

To adapt the code:

Show the code
# Set the following boolean to indicate whether your dataset containes UID columns for parent admin units (above HF-level)
admin_units_uids = True

def create_hf_uid(df, admin_unit_levels):
    if admin_units_uids:
        cols = [f'adm{i}_uid' for i in admin_unit_levels] + ['hf']
    else:
        cols = [f'adm{i}' for i in admin_unit_levels] + ['hf']
    # sort admin units alphabetically
    df = df.sort_values(by = cols).reset_index(drop = True)
    # assign a unique number to each admin unit at this level, starting at 1
    df['hf_uid'] = df.groupby(cols).ngroup() + 1
    # format the uid to a string with leading zeros, to ensure the number of characters in each uid is identical
    df['hf_uid'] = df['hf_uid'].apply(lambda x: f'HF_{str(x).zfill(5)}')

    return df

# set list of all levels above hf
admin_unit_levels = [0, 1, 2, 3]
dhis2_df = create_hf_uid(dhis2_df, admin_unit_levels)

# reorder columns for display
if admin_units_uids:
    base_cols = [[f'adm{i}', f'adm{i}_uid'] for i in admin_unit_levels]
    base_cols = [i for j in base_cols for i in j]

else:
    base_cols = [[f'adm{i}', f'adm{i}_uid'] for i in admin_unit_levels]

base_cols = base_cols + ['hf', 'hf_uid']
base_cols = ['YM'] + base_cols
cols_data = sorted([c for c in dhis2_df.columns if c not in base_cols])
cols = base_cols + cols_data

dhis2_df = (dhis2_df[cols]
    .sort_values(by = base_cols)
    .reset_index(drop = True))

# Inspect results
dhis2_df.head(10).style
Output

To adapt the code:

Step 7: Second pass troubleshooting

Step 7.1: Highlight duplicate HF-month records with different data

Here we search for rows in the dataset that correspond to the same HF-month report but happen to have different data. If any are found, these need to be cleaned at this stage to make sure the dataset contains only one report for each combination of HF and month.

ADD APPROACH FOR CLEANING: IN STEP 7 TROUBLESHOOTING, MORE LIKELY TO NEED TO CONSULT SNT TEAM

  • R
  • Python
Output

To adapt the code:

Show the code
# identify rows with same HF-YM but different data
duplicates = dhis2_df[dhis2_df.duplicated(subset = ['YM', 'hf_uid'], keep = False)]

# Inspect results
if len(duplicates) == 0:
  print('No duplicated HF-month reports with different data')
else:
  duplicates.style
Output

To adapt the code:

Step 7.2: ETC OTHER PROBLEMS

Step 8: (If needed) Prepare optional ancillary datasets and versions

Step 8.1: Create and save your admin unit hierarchy

I am creating the orgunit tree below from the dataset - we could add a step before that to compare with the some reference (shapefil or HFML). I use this a lot throughout the process to simplify merge and other operations, would suggest to make it feature thourghout the SNT process but happy to work without it if the team prefers not to.

  • R
  • Python
Output

To adapt the code:

Show the code
# Set the list of admin unit levels in your dataset, above HF-level
admin_units_levels = [0, 1, 2, 3]
cols = [[f'adm{level}', f'adm{level}_uid'] for level in admin_units_levels]
cols = [i for j in cols for i in j]
cols = cols + ['hf', 'hf_uid']

dftree = (dhis2_df[cols]
    .drop_duplicates()
    .sort_values(by = cols)
    .reset_index(drop = True))

# Save to file
# discuss with team if useful or not depending on use in other sections

# Inspect results
dftree.head(10).style
Output

To adapt the code:

Step 8.2: Fill in missing HF-month reports

VT:I would move this step to Step 2 I am assuming we are at the stage where we have checked for duplicates and have the following columns only: + YM (which is called periodname or pe or period in DHIS2, basically a string of character YYYYMM) + hf_uid + all the data columns + No other columns

I am also assuming the existence of dftree, which is the orgunit tree, which I simply construct from the original dataset. I use this a lot throughout the process to simplify operations, would suggest to make it feature but happy to work without it if the team prefers not to.

When examining monthly reports per health facility, you may choose to fill in gaps for health facility month records that are entirely missing. This enables identification of when a health facility became active and started reporting. The reporting rate page covers calculations to identify active and inactive facilities in more detail. Remember, data needs to be fully preprocessed prior to any reporting rate calculations.

  • R
  • Python
Show the code
# # ensure date column is parsed as year-month
# ym_df <- dhis2_df |>
#   dplyr::transmute(ym = lubridate::ym(date))

# # get full range of months
# ym_seq <- seq(
#   min(ym_df$ym, na.rm = TRUE), max(ym_df$ym, na.rm = TRUE),
#   by = "1 month") |> format("%Y-%m")

# # build base grid of all HF-months
# base <- tidyr::expand_grid(
#   date = ym_seq, hf_uid = unique(dhis2_df$hf_uid)
# )

# # left join to full grid and merge admin structure
# dhis2_df_complete <- base |>
#   dplyr::left_join(dhis2_df, by = c("date", "hf_uid"))
# dplyr::left_join(dftree, by = "hf_uid")
Output

To adapt the code:

Show the code
# select key columns - need to discuss how to list data columns in the simplest possible way
cols = [c for c in dhis2_df.columns if c not in ['Year', 'Month', 'hf', 'adm0', 'adm0_uid', 'adm1', 'adm1_uid', 'adm2', 'adm2_uid', 'adm3', 'adm3_uid']]

dhis2_df = (dhis2_df[cols]
      .sort_values(by = ['YM', 'hf_uid'])
      .reset_index(drop = True))

# keep track of current number of HF-month records
raw_length = len(dhis2_df)

# define dataset range of years
min_year = min(dhis2_df['YM'].apply(lambda x: int(x[0:4])))
max_year = max(dhis2_df['YM'].apply(lambda x: int(x[0:4])))
years = range(min_year, max_year+1)

# define base dataframe with all periods and health facilities
# periods = [str(year*100 + month) for year in years for month in range(1, 13)]
periods = [f'{year}-{month:02d}' for year in range(min_year, max_year + 1) for month in range(1, 13)]
hfs = sorted(dhis2_df['hf_uid'].unique())
cols = ['YM', 'hf_uid']
base = pd.DataFrame([(p, hf) for p in periods for hf in hfs], columns = cols)
# merge raw data to this base, to ensure all HF-month are represented, even if just with NA in data columns
dhis2_df = base.merge(dhis2_df, how = 'left', on = cols, validate = '1:1')

# reconstitute parent admin units
dhis2_df = dhis2_df.merge(dftree, on = 'hf_uid', how = 'left', validate = "m:1")

# Make Year and Month column
dhis2_df['Year'] = dhis2_df['YM'].apply(lambda x: int(x[0:4]))
dhis2_df['Month'] = dhis2_df['YM'].apply(lambda x: int(x[5:7]))

# Inform user about how many missing HF-month records were added
print(f"There were {len(dhis2_df)-raw_length} HF-month records missing")
Output

To adapt the code:

Step 8.3: IPD/OPD Specification

PLACEHOLDER STEP BASED ON JALINE AND VAL DISCUSSION 08/07

Step 9: Aggregate and save data

Step 9.1: Save data at the health facility level

Since some SNT analyses rely on facility-specific information, we will now save the data at the health facility level. Keeping the data at health facility level ensures that facility-level analyses can be done accurately and without loss of detail.

  • R
  • Python
Show the code
# set up output path
save_path <- here::here("01_data",
                        "02_epidemiology",
                        "2a_routine_surveillance",
                        "processed")

# save to xlsx
rio::export(dhis2_df, here::here(save_path, "health_facility_data.xlsx"))

# save to csv
rio::export(dhis2_df, here::here(save_path, "health_facility_data.csv"))

# save to dta
rio::export(dhis2_df, here::here(save_path, "health_facility_data.dta"))

# save to local area
rio::export(
  dhis2_df,
  here::here("english/data_r/routine_cases/clean_malaria_routine_data_final.rds"))
Output

To adapt the code:

Show the code
# Set your filepath here
output_filepath = Path(here('english/data_r/routine_cases/', 'dhis2_processed_data_python'))

base_cols = ['Year', 'Month', 'YM'] + list(dftree.columns)
cols_data = sorted([c for c in dhis2_df.columns if c not in base_cols])
cols = base_cols + cols_data

dhis2_df = (dhis2_df[cols]
      .sort_values(by = base_cols)
      .reset_index(drop = True))

# Save to csv
dhis2_df.to_csv(f'{output_filepath}.csv', index = None)

# Save to pickle
dhis2_df.to_pickle(f'{output_filepath}.pkl')

# Inspect results
dhis2_df.head(10).style
Output

To adapt the code:

Step 9.2: Aggregate and save data at each admin unit level level

Now we’ll save the data at different administrative levels to support various types of analysis and ensure alignment with how interventions are typically planned and monitored. This allows flexibility when calculating indicators, comparing trends across regions, or linking with other datasets structured at adm0, adm1, adm2, or facility level. The first code chunk aggregates data at the adm2 level.

Avoid Summing Pre-Calculated Rates

When aggregating data at to administrative levels, not all indicators can simply be summed. For example, if you have already calculated a test positivity rate or treatment rate in your dataset, these indicators must be re-calculated at the new administrative level.

  • R
  • Python
Show the code
df <- dhis2_df

# --------------------------
# ADM1 level
# --------------------------
if (all(c("adm0", "adm1", "year", "month") %in% names(df))) {
  # If all columns needed for ADM1 level grouping exist,
  # group by adm0, adm1, year, and month, and sum numeric columns
  adm1_monthly <- df |>
    dplyr::group_by(adm0, adm1, year, month) |>
    dplyr::summarise(
      dplyr::across(dplyr::where(is.numeric), \(x) sum(x, na.rm = TRUE)),
      .groups = "drop"
    )

  rio::export(adm1_monthly, here::here(save_path, "aggregated_data_monthly_adm1_data.xlsx"))
  rio::export(adm1_monthly, here::here(save_path, "aggregated_data_monthly_adm1_data.csv"))

} else {
  message("Skipping adm1: one or more required columns are missing.")
}

# --------------------------
# ADM2 level
# --------------------------
if (all(c("adm0", "adm1", "adm2", "year", "month") %in% names(df))) {
  # If all columns needed for ADM2 level grouping exist,
  # group by adm0, adm1, adm2, year, and month, and sum numeric columns
  adm2_monthly <- df |>
    dplyr::group_by(adm0, adm1, adm2, year, month) |>
    dplyr::summarise(
      dplyr::across(dplyr::where(is.numeric), \(x) sum(x, na.rm = TRUE)),
      .groups = "drop"
    )

  rio::export(adm2_monthly, here::here(save_path, "aggregated_data_monthly_adm2_data.xlsx"))
  rio::export(adm2_monthly, here::here(save_path, "aggregated_data_monthly_adm2_data.csv"))

} else {
  message("Skipping adm2: one or more required columns are missing.")
}

# --------------------------
# ADM3 level
# --------------------------
if (all(c("adm0", "adm1", "adm2", "adm3", "year", "month") %in% names(df))) {
  # If all columns needed for ADM3 level grouping exist,
  # group by adm0, adm1, adm2, adm3, year, and month, and sum numeric columns
  adm3_monthly <- df |>
    dplyr::group_by(adm0, adm1, adm2, adm3, year, month) |>
    dplyr::summarise(
      dplyr::across(dplyr::where(is.numeric), \(x) sum(x, na.rm = TRUE)),
      .groups = "drop"
    )

  rio::export(adm3_monthly, here::here(save_path, "aggregated_data_monthly_adm3_data.xlsx"))
  rio::export(adm3_monthly, here::here(save_path, "aggregated_data_monthly_adm3_data.csv"))

} else {
  message("Skipping adm3: one or more required columns are missing.")
}

# --------------------------
# ADM4 level
# --------------------------
if (all(c("adm0", "adm1", "adm2", "adm3", "adm4", "year", "month") %in% names(df))) {
  # If all columns needed for ADM4 level grouping exist,
  # group by adm0, adm1, adm2, adm3, adm4, year, and month, and sum numeric columns
  adm4_monthly <- df |>
    dplyr::group_by(adm0, adm1, adm2, adm3, adm4, year, month) |>
    dplyr::summarise(
      dplyr::across(dplyr::where(is.numeric), \(x) sum(x, na.rm = TRUE)),
      .groups = "drop"
    )

  rio::export(adm4_monthly, here::here(save_path, "aggregated_data_monthly_adm4_data.xlsx"))
  rio::export(adm4_monthly, here::here(save_path, "aggregated_data_monthly_adm4_data.csv"))

} else {
  message("Skipping adm4: one or more required columns are missing.")
}
Output

To adapt the code:

Show the code
# first we define a function
def aggregate_data(df, aggregation_period, aggregation_level, output_filepath):

   # Step 1: Validate input
    try:
        if aggregation_period not in ['Year', 'Month']:
            raise ValueError("Parameter 'aggregation_period' must be either 'Year' or 'Month'.")

    except ValueError as e:
        print("Input Error:", e)
        return None  # or: return pd.DataFrame()

    # Define columns used for aggregation
    cols = [[f'adm{i}', f'adm{i}_uid'] for i in range(1, aggregation_level +1)]
    cols = [i for j in cols for i in j]
    if aggregation_period == 'Year':
        cols = ['Year'] + cols
    elif aggregation_period == 'Month':
        cols = ['YM'] + cols

    # aggregate data
    aggregated_df = df.groupby(cols)[cols_data].sum(min_count = 1).reset_index()

    return aggregated_df
Output

Similarly, you can repeat the above aggregation at the adm3-year level by:

  • (in R): grouping with dplyr::group_by(adm1, adm2, adm3, year).
  • (in python): Edit the aggregation_period, aggregation_level and output_filepath variables

This allows you to generate annual summaries for each administrative unit.

Summary

We’ve now walked through the key steps for cleaning, reshaping, and aggregating DHIS2 malaria data to make it SNT-ready. The code covered everything from importing raw files, standardising column names, computing key indicators, and saving outputs at multiple administrative levels. For convenience, a full end-to-end script is included below in a folded code block. You can reuse or adapt this for your own country context by adjusting column names, file paths, and administrative levels as needed.

Full code

Find the full code script for DHIS2 data preprocessing below.

  • R
  • Python
Show full code
# install or load relevant packages
pacman::p_load(
  tidyverse,  # core tidy tools (dplyr, tidyr, etc.)
  rio,        # import and export and file format
  DT,         # view interactive tables
  here        # for easy file referencing
)

#===============================================================================
# Step 1: Define Functions and Import Data
#===============================================================================


#' Reads and combines multiple Excel or CSV files from a directory into a single
#' dataframe.
#'
#' This function searches for files with specified extension in the given
#' directory and combines them into a single dataframe. It handles both Excel
#' (.xls, .xlsx) and CSV files.
#'
#' @param directory Path to directory containing files
#'    (default: "english/data_r")
#' @param pattern File extension (default: "xls"). Options: "xls", "xlsx",
#'   "csv" and and other formats available in rio package.
#' @return A dataframe combining all files with trimmed column names. Returns
#'    NULL if no files found or if an error occurs during reading/combining.
#' @examples
#' # Read Excel files
#' df1 <- read_and_combine_files("path/to/excel", "xlsx")
#' # Read CSV files
#' df2 <- read_and_combine_files("path/to/csv", "csv")
#'
read_and_combine_files <- function(directory = "english/data_r",
                                   pattern = "xls") {
  # directory <- here::here(directory)
  pattern <- paste0("\\.", pattern, "$")
  # Find files matching the pattern
  file_paths <- list.files(path = directory, pattern = pattern,
                           full.names = TRUE)

  # Stop if no files found
  if (length(file_paths) == 0) {
    stop("No files found with specified extension in the directory.")
  }

  # Read all files based on extension
  data_files <- lapply(file_paths, function(file) {
    tryCatch(
      {
        rio::import(file)
      },
      error = function(e) {
        NULL
      }
    )
  })

  # Combine all data frames
  combined_df <- dplyr::bind_rows(data_files)

  # Trim whitespace from column names
  names(combined_df) <- trimws(names(combined_df))

  return(combined_df)
}

#' Rename columns in a dataframe according to a predefined mapping
#'#' @param df A dataframe containing columns to be renamed
#'
#' @return
#' Returns the dataframe with renamed columns. Returns NULL if an error occurs.

#' @examples
#' df <- data.frame(adm0 = 1:3, allout_u5 = 4:6)
#' renamed_df <- rename_columns(df)
rename_columns <- function(df) {
  # rename columns
  rename_cols <- c(

    # Organisational units
    "adm0" = "orgunitlevel1",
    "adm1" = "orgunitlevel2",
    "adm2" = "orgunitlevel3",
    "adm3" = "orgunitlevel4",
    "hf" = "organisationunitname",

    # All outpatient visits
    "allout_u5" = "OPD (New and follow-up curative) 0-59m_X",
    "allout_ov5" = "OPD (New and follow-up curative) 5+y_X",

    # Malaria admissions
    "maladm_u5" = "Admission - Child with malaria 0-59 months_X",
    "maladm_5_14" = "Admission - Child with malaria 5-14 years_X",
    "maladm_ov15" = "Admission - Malaria 15+ years_X",

    # Malaria deaths
    "maldth_1_59m" = "Child death - Malaria 1-59m_X",
    "maldth_10_14" = "Child death - Malaria 10-14y_X",
    "maldth_5_9" = "Child death - Malaria 5-9y_X",
    "maldth_fem_ov15" = "Death malaria 15+ years Female",
    "maldth_mal_ov15" = "Death malaria 15+ years Male",
    "maldth_u5" = "Separation - Child with malaria 0-59 months_X Death",
    "maldth_5_14" = "Separation - Child with malaria 5-14 years_X Death",
    "maldth_ov15" = "Separation - Malaria 15+ years_X Death",

    # Suspected cases
    "susp_u5_hf" = "Fever case - suspected Malaria 0-59m_X",
    "susp_5_14_hf" = "Fever case - suspected Malaria 5-14y_X",
    "susp_ov15_hf" = "Fever case - suspected Malaria 15+y_X",
    "susp_u5_com" = "Fever case in community (Suspected Malaria) 0-59m_X",
    "susp_5_14_com" = "Fever case in community (Suspected Malaria) 5-14y_X",
    "susp_ov15_com" = "Fever case in community (Suspected Malaria) 15+y_X",

    # Malaria tests conducted at community level
    "tes_neg_rdt_u5_com" =
      "Fever case in community tested for Malaria (RDT) - Negative 0-59m_X",
    "tes_pos_rdt_u5_com" =
      "Fever case in community tested for Malaria (RDT) - Positive 0-59m_X",
    "tes_neg_rdt_5_14_com" =
      "Fever case in community tested for Malaria (RDT) - Negative 5-14y_X",
    "tes_pos_rdt_5_14_com" =
      "Fever case in community tested for Malaria (RDT) - Positive 5-14y_X",
    "tes_neg_rdt_ov15_com" =
      "Fever case in community tested for Malaria (RDT) - Negative 15+y_X",
    "tes_pos_rdt_ov15_com" =
      "Fever case in community tested for Malaria (RDT) - Positive 15+y_X",

    # Malaria tests conducted at health facility level
    "test_neg_mic_u5_hf" =
      "Fever case tested for Malaria (Microscopy) - Negative 0-59m_X",
    "test_pos_mic_u5_hf" =
      "Fever case tested for Malaria (Microscopy) - Positive 0-59m_X",
    "test_neg_mic_5_14_hf" =
      "Fever case tested for Malaria (Microscopy) - Negative 5-14y_X",
    "test_pos_mic_5_14_hf" =
      "Fever case tested for Malaria (Microscopy) - Positive 5-14y_X",
    "test_neg_mic_ov15_hf" =
      "Fever case tested for Malaria (Microscopy) - Negative 15+y_X",
    "test_pos_mic_ov15_hf" =
      "Fever case tested for Malaria (Microscopy) - Positive 15+y_X",
    "tes_neg_rdt_u5_hf" =
      "Fever case tested for Malaria (RDT) - Negative 0-59m_X",
    "tes_pos_rdt_u5_hf" =
      "Fever case tested for Malaria (RDT) - Positive 0-59m_X",
    "tes_neg_rdt_5_14_hf" =
      "Fever case tested for Malaria (RDT) - Negative 5-14y_X",
    "tes_pos_rdt_5_14_hf" =
      "Fever case tested for Malaria (RDT) - Positive 5-14y_X",
    "tes_neg_rdt_ov15_hf" =
      "Fever case tested for Malaria (RDT) - Negative 15+y_X",
    "tes_pos_rdt_ov15_hf" =
      "Fever case tested for Malaria (RDT) - Positive 15+y_X",

    # Malaria tests conducted at community level
    "maltreat_u24_u5_com" =
      "Malaria treated in community with ACT <24 hours 0-59m_X",
    "maltreat_ov24_u5_com" =
      "Malaria treated in community with ACT >24 hours 0-59m_X",
    "maltreat_u24_5_14_com" =
      "Malaria treated in community with ACT <24 hours 5-14y_X",
    "maltreat_ov24_5_14_com" =
      "Malaria treated in community with ACT >24 hours 5-14y_X",
    "maltreat_u24_ov15_com" =
      "Malaria treated in community with ACT <24 hours 15+y_X",
    "maltreat_ov24_ov15_com" =
      "Malaria treated in community with ACT >24 hours 15+y_X",

    # Malaria tests conducted at community level
    "maltreat_u24_u5_hf" =
      "Malaria treated with ACT <24 hours 0-59m_X",
    "maltreat_ov24_u5_hf" =
      "Malaria treated with ACT >24 hours 0-59m_X",
    "maltreat_u24_5_14_hf" =
      "Malaria treated with ACT <24 hours 5-14y_X",
    "maltreat_ov24_5_14_hf" =
      "Malaria treated with ACT >24 hours 5-14y_X",
    "maltreat_u24_ov15_hf" =
      "Malaria treated with ACT <24 hours 15+y_X",
    "maltreat_ov24_ov15_hf" =
      "Malaria treated with ACT >24 hours 15+y_X"
  )

  df <- df |>
    dplyr::rename(!!!rename_cols)

  return(df)
}

# import dhis2 data and clean columns using helper function
dhis2_df <- read_and_combine_files(
  here::here("english/data_r")
  ) |>
  rename_columns()  # clean names

# check that the renaming of columns worked
colnames(dhis2_df) |> print()

#===============================================================================
# Step 2: Compute New Variables
#===============================================================================

# Calculate totals
dhis2_df <- dhis2_df |>
  dplyr::mutate(
    # outpatient visits
    allout = rowSums(dplyr::across(c(allout_u5, allout_ov5)), na.rm = TRUE),

    # suspected cases
    susp = rowSums(dplyr::across(c(
      susp_u5_hf, susp_5_14_hf, susp_ov15_hf,
      susp_u5_com, susp_5_14_com, susp_ov15_com
    )), na.rm = TRUE),

    # tested cases
    test_hf = rowSums(dplyr::across(c(
      test_neg_mic_u5_hf, test_pos_mic_u5_hf, test_neg_mic_5_14_hf,
      test_pos_mic_5_14_hf, test_neg_mic_ov15_hf, test_pos_mic_ov15_hf,
      tes_neg_rdt_u5_hf, tes_pos_rdt_u5_hf, tes_neg_rdt_5_14_hf,
      tes_pos_rdt_5_14_hf, tes_neg_rdt_ov15_hf, tes_pos_rdt_ov15_hf
    )), na.rm = TRUE),

    test_com = rowSums(dplyr::across(c(
      tes_neg_rdt_u5_com, tes_pos_rdt_u5_com, tes_neg_rdt_5_14_com,
      tes_pos_rdt_5_14_com, tes_neg_rdt_ov15_com, tes_pos_rdt_ov15_com
    )), na.rm = TRUE),

    test = test_hf + test_com,

    # confirmed cases (HF and COM)
    conf_hf = rowSums(dplyr::across(c(
      test_pos_mic_u5_hf, test_pos_mic_5_14_hf, test_pos_mic_ov15_hf,
      tes_pos_rdt_u5_hf, tes_pos_rdt_5_14_hf, tes_pos_rdt_ov15_hf
    )), na.rm = TRUE),

    conf_com = rowSums(dplyr::across(c(
      tes_pos_rdt_u5_com, tes_pos_rdt_5_14_com, tes_pos_rdt_ov15_com
    )), na.rm = TRUE),

    conf = conf_hf + conf_com,

    # treated cases
    maltreat_com = rowSums(dplyr::across(c(
      maltreat_u24_u5_com, maltreat_ov24_u5_com, maltreat_u24_5_14_com,
      maltreat_ov24_5_14_com, maltreat_u24_ov15_com, maltreat_ov24_ov15_com
    )), na.rm = TRUE),

    maltreat_hf = rowSums(dplyr::across(c(
      maltreat_u24_u5_hf, maltreat_ov24_u5_hf, maltreat_u24_5_14_hf,
      maltreat_ov24_5_14_hf, maltreat_u24_ov15_hf, maltreat_ov24_ov15_hf
    )), na.rm = TRUE),

    maltreat = maltreat_hf + maltreat_com,

    # presumed cases
    pres_com = pmax(maltreat_com - conf_com, 0, na.rm = TRUE),
    pres_hf = pmax(maltreat_hf - conf_hf, 0, na.rm = TRUE),
    pres = pres_com + pres_hf,

    # malaria admissions
    maladm = rowSums(dplyr::across(
      c(maladm_u5, maladm_5_14, maladm_ov15)), na.rm = TRUE),

    # malaria deaths
    maldth = rowSums(dplyr::across(c(
      maldth_u5, maldth_1_59m, maldth_10_14, maldth_5_9,
      maldth_5_14, maldth_ov15, maldth_fem_ov15, maldth_mal_ov15
    )), na.rm = TRUE),

    # AGE-GROUP SPECIFIC AGGREGATIONS
    # Tested cases by age group (HF only)
    test_hf_u5 = rowSums(dplyr::across(c(
      test_neg_mic_u5_hf, test_pos_mic_u5_hf,
      tes_neg_rdt_u5_hf, tes_pos_rdt_u5_hf
    )), na.rm = TRUE),

    test_hf_5_14 = rowSums(dplyr::across(c(
      test_neg_mic_5_14_hf, test_pos_mic_5_14_hf,
      tes_neg_rdt_5_14_hf, tes_pos_rdt_5_14_hf
    )), na.rm = TRUE),

    test_hf_ov15 = rowSums(dplyr::across(c(
      test_neg_mic_ov15_hf, test_pos_mic_ov15_hf,
      tes_neg_rdt_ov15_hf, tes_pos_rdt_ov15_hf
    )), na.rm = TRUE),

    # Tested cases by age group (Community only)
    test_com_u5 = rowSums(dplyr::across(c(
      tes_neg_rdt_u5_com, tes_pos_rdt_u5_com
    )), na.rm = TRUE),

    test_com_5_14 = rowSums(dplyr::across(c(
      tes_neg_rdt_5_14_com, tes_pos_rdt_5_14_com
    )), na.rm = TRUE),

    test_com_ov15 = rowSums(dplyr::across(c(
      tes_neg_rdt_ov15_com, tes_pos_rdt_ov15_com
    )), na.rm = TRUE),

    # Total tested by age group (HF + Community)
    test_u5 = test_hf_u5 + test_com_u5,
    test_5_14 = test_hf_5_14 + test_com_5_14,
    test_ov15 = test_hf_ov15 + test_com_ov15,

   # Suspected cases by age group (HF only)
    susp_hf_u5 = rowSums(dplyr::across(c(
      susp_u5_hf
    )), na.rm = TRUE),

    susp_hf_5_14 = rowSums(dplyr::across(c(
      susp_5_14_hf
    )), na.rm = TRUE),

    susp_hf_ov15 = rowSums(dplyr::across(c(
      susp_ov15_hf
    )), na.rm = TRUE),

    # Suspected cases by age group (Community only)
    susp_com_u5 = rowSums(dplyr::across(c(
      susp_u5_com
    )), na.rm = TRUE),

    susp_com_5_14 = rowSums(dplyr::across(c(
      susp_5_14_com
    )), na.rm = TRUE),

    susp_com_ov15 = rowSums(dplyr::across(c(
      susp_ov15_com
    )), na.rm = TRUE),

    # Total suspected by age group (HF + Community)
    susp_u5 = susp_hf_u5 + susp_com_u5,
    susp_5_14 = susp_hf_5_14 + susp_com_5_14,
    susp_ov15 = susp_hf_ov15 + susp_com_ov15,

    # Confirmed cases by age group (HF only)
    conf_hf_u5 = rowSums(dplyr::across(c(
      test_pos_mic_u5_hf, tes_pos_rdt_u5_hf
    )), na.rm = TRUE),

    conf_hf_5_14 = rowSums(dplyr::across(c(
      test_pos_mic_5_14_hf, tes_pos_rdt_5_14_hf
    )), na.rm = TRUE),

    conf_hf_ov15 = rowSums(dplyr::across(c(
      test_pos_mic_ov15_hf, tes_pos_rdt_ov15_hf
    )), na.rm = TRUE),

    # Confirmed cases by age group (Community only)
    conf_com_u5 = tes_pos_rdt_u5_com,
    conf_com_5_14 = tes_pos_rdt_5_14_com,
    conf_com_ov15 = tes_pos_rdt_ov15_com,

    # Total confirmed by age group (HF + Community)
    conf_u5 = conf_hf_u5 + conf_com_u5,
    conf_5_14 = conf_hf_5_14 + conf_com_5_14,
    conf_ov15 = conf_hf_ov15 + conf_com_ov15,

    # Treated cases by age group (HF only)
    maltreat_hf_u5 = rowSums(dplyr::across(c(
      maltreat_u24_u5_hf, maltreat_ov24_u5_hf
    )), na.rm = TRUE),

    maltreat_hf_5_14 = rowSums(dplyr::across(c(
      maltreat_u24_5_14_hf, maltreat_ov24_5_14_hf
    )), na.rm = TRUE),

    maltreat_hf_ov15 = rowSums(dplyr::across(c(
      maltreat_u24_ov15_hf, maltreat_ov24_ov15_hf
    )), na.rm = TRUE),

    # Treated cases by age group (Community only)
    maltreat_com_u5 = rowSums(dplyr::across(c(
      maltreat_u24_u5_com, maltreat_ov24_u5_com
    )), na.rm = TRUE),

    maltreat_com_5_14 = rowSums(dplyr::across(c(
      maltreat_u24_5_14_com, maltreat_ov24_5_14_com
    )), na.rm = TRUE),

    maltreat_com_ov15 = rowSums(dplyr::across(c(
      maltreat_u24_ov15_com, maltreat_ov24_ov15_com
    )), na.rm = TRUE),

    # Total treated by age group (HF + Community)
    maltreat_u5 = maltreat_hf_u5 + maltreat_com_u5,
    maltreat_5_14 = maltreat_hf_5_14 + maltreat_com_5_14,
    maltreat_ov15 = maltreat_hf_ov15 + maltreat_com_ov15,

    # Total treated within 24 hours (HF only)
    maltreat_u24_hf = rowSums(dplyr::across(c(
      maltreat_u24_u5_hf, maltreat_u24_5_14_hf, maltreat_u24_ov15_hf
    )), na.rm = TRUE),

    # Total treated after 24 hours (HF only)
    maltreat_ov24_hf = rowSums(dplyr::across(c(
      maltreat_ov24_u5_hf, maltreat_ov24_5_14_hf, maltreat_ov24_ov15_hf
    )), na.rm = TRUE),

    # Total treated within 24 hours (Community only)
    maltreat_u24_com = rowSums(dplyr::across(c(
      maltreat_u24_u5_com, maltreat_u24_5_14_com, maltreat_u24_ov15_com
    )), na.rm = TRUE),

    # Total treated after 24 hours (Community only)
    maltreat_ov24_com = rowSums(dplyr::across(c(
      maltreat_ov24_u5_com, maltreat_ov24_5_14_com, maltreat_ov24_ov15_com
    )), na.rm = TRUE),

    # Overall totals (HF + Community)
    maltreat_u24_total = maltreat_u24_hf + maltreat_u24_com,
    maltreat_ov24_total = maltreat_ov24_hf + maltreat_ov24_com,

    # Presumed cases by age group
    pres_com_u5 = pmax(maltreat_com_u5 - conf_com_u5, 0, na.rm = TRUE),
    pres_com_5_14 = pmax(maltreat_com_5_14 - conf_com_5_14, 0, na.rm = TRUE),
    pres_com_ov15 = pmax(maltreat_com_ov15 - conf_com_ov15, 0, na.rm = TRUE),

    pres_hf_u5 = pmax(maltreat_hf_u5 - conf_hf_u5, 0, na.rm = TRUE),
    pres_hf_5_14 = pmax(maltreat_hf_5_14 - conf_hf_5_14, 0, na.rm = TRUE),
    pres_hf_ov15 = pmax(maltreat_hf_ov15 - conf_hf_ov15, 0, na.rm = TRUE),

    pres_u5 = pres_com_u5 + pres_hf_u5,
    pres_5_14 = pres_com_5_14 + pres_hf_5_14,
    pres_ov15 = pres_com_ov15 + pres_hf_ov15
  )

  # Check totals
dhis2_df |>
  filter(
    # Outpatient checks
    allout != allout_u5 + allout_ov5 |

    # Malaria admissions check
    maladm != maladm_u5 + maladm_5_14 + maladm_ov15 |

    # Total tests check
    test != test_hf + test_com |

    # Total confirmed check
    conf !=  conf_hf + conf_com |

    # Total treated check
    maltreat != maltreat_hf + maltreat_com |

    # Total presumed check
    pres != pres_hf + pres_com |

    # Tested cases checks
    test_u5 != test_hf_u5 + test_com_u5 |
    test_5_14 != test_hf_5_14 + test_com_5_14 |
    test_ov15 != test_hf_ov15 + test_com_ov15 |

    # Confirmed cases checks
    conf_u5 != conf_hf_u5 + conf_com_u5 |
    conf_5_14 != conf_hf_5_14 + conf_com_5_14 |
    conf_ov15 != conf_hf_ov15 + conf_com_ov15 |

    # Presumed cases checks
    pres_u5 != pres_hf_u5 + pres_com_u5 |
    pres_5_14 != pres_hf_5_14 + pres_com_5_14 |
    pres_ov15 != pres_hf_ov15 + pres_com_ov15|

    # Suspected cases checks
    susp_u5 != susp_u5_hf + susp_u5_com |
    susp_5_14 != susp_5_14_hf + susp_5_14_com |
    susp_ov15 != susp_ov15_hf + susp_ov15_com|

   # Treated cases checks
    maltreat_u5 != maltreat_hf_u5 + maltreat_com_u5 |
    maltreat_5_14 != maltreat_hf_5_14 + maltreat_com_5_14 |
    maltreat_ov15 != maltreat_hf_ov15 + maltreat_com_ov15 |

    # Treatment timing checks
    maltreat_u24_total != maltreat_u24_hf + maltreat_u24_com |
    maltreat_ov24_total != maltreat_ov24_hf + maltreat_ov24_com
  ) |>
  select(
    hf, periodname, allout,
    maldth, maladm, test,
    conf, maltreat, pres,
    test_u5, test_5_14, test_ov15,
    conf_u5, conf_5_14, conf_ov15,
    pres_u5, pres_5_14, pres_ov15,
    maltreat_u5, maltreat_5_14, maltreat_ov15,
    maltreat_u24_total, maltreat_ov24_total
  )
#===============================================================================
# Step 3: Fix Year and Month Columns
#===============================================================================

dhis2_df <- dhis2_df |>
  # seperate periodname to month and year
  tidyr::separate(
    col = tidyselect::all_of("periodname"),
    into = c("month", "year"),
    sep = " ",
    remove = TRUE
  ) |>
  # recode month
  dplyr::mutate(
    month = recode(
      month,
      "January" = "01",
      "February" = "02",
      "March" = "03",
      "April" = "04",
      "May" = "05",
      "June" = "06",
      "July" = "07",
      "August" = "08",
      "September" = "09",
      "October" = "10",
      "November" = "11",
      "December" = "12"
    )
  ) |>
  # create a new date column
  unite("date", year, month, sep = "-", remove = FALSE)


#===============================================================================
# Step 4: Create a Unique `hf` Id
#===============================================================================

dhis2_df <- dhis2_df |>
  dplyr::group_by(adm1, adm2, adm3, hf) |>
  dplyr::mutate(hf_uid = sprintf("hf_%04d", dplyr::cur_group_id())) |>
  dplyr::ungroup()

# check it worked
dhis2_df |>
  dplyr::distinct(adm2, adm3, hf, hf_uid) |>
  tail()

#===============================================================================
# Step 5: Save Health Facility Data
#===============================================================================

# set up output path
save_path <- here::here("01_data",
                        "02_epidemiology",
                        "2a_routine_surveillance",
                        "processed")

# save to xlsx
rio::export(dhis2_df, here::here(save_path, "health_facility_data.xlsx"))

# save to csv
rio::export(dhis2_df, here::here(save_path, "health_facility_data.csv"))

# save to dta
rio::export(dhis2_df, here::here(save_path, "health_facility_data.dta"))

##==============================================================================
# Step 6: Group and aggregate monthly data at the adm1 level
#===============================================================================

adm2_monthly_data <- dhis2_df |>
  dplyr::group_by(adm2, adm1, year, month, date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::where(
        is.numeric), \(x) sum(x, na.rm = TRUE)),
    .groups = "drop")

# save to xlsx
rio::export(adm2_monthly_data,
             here::here(save_path, "aggregated_data_monthly_adm2_data.xlsx"))

# save to csv
rio::export(adm2_monthly_data,
             here::here(save_path, "aggregated_data_monthly_adm2_data.csv"))

# save to dta
rio::export(adm2_monthly_data,
            here::here(save_path, "aggregated_data_monthly_adm2_data.dta"))


adm1_monthly_data <- dhis2_df |>
  dplyr::group_by(adm1, year, month, date) |>
  dplyr::summarise(
    dplyr::across(
      dplyr::where(
        is.numeric), \(x) sum(x, na.rm = TRUE)),
    .groups = "drop")

# save to xlsx
rio::export(adm1_monthly_data,
             here::here(save_path, "haggregated_data_monthly_adm1_data.xlsx"))

# save to csv
rio::export(adm1_monthly_data,
             here::here(save_path, "aggregated_data_monthly_adm1_data.csv"))

# save to dta
rio::export(adm1_monthly_data,
             here::here(save_path, "aggregated_data_monthly_adm1_data.dta"))

#===============================================================================
# End of Script
#===============================================================================
 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved