| adm0 | adm1 | adm2 | adm3 | hf | date | month | year | allout_u5 | allout_ov5 | conf_u5 | conf_5_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-01 | 01 | 2021 | 44 | 48 | 36 | 22 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-02 | 02 | 2021 | 37 | 27 | 20 | 30 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-03 | 03 | 2021 | 44 | 42 | 18 | 22 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-04 | 04 | 2021 | 28 | 23 | 30 | 30 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-05 | 05 | 2021 | 138 | 141 | 72 | 40 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-06 | 06 | 2021 | 127 | 82 | 59 | 16 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-07 | 07 | 2021 | 130 | 175 | 86 | 24 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-08 | 08 | 2021 | 144 | 203 | 93 | 17 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-09 | 09 | 2021 | 159 | 159 | 78 | 18 |
| SIERRA LEONE | BO DISTRICT | BO CITY COUNCIL | BO CITY | AETHEL CHP | 2021-10 | 10 | 2021 | 88 | 128 | 6 | 80 |
DHIS2 data preprocessing
Beginner
Overview
DHIS2 is a core data source for routine malaria surveillance, capturing indicators such as outpatient case counts, diagnostic testing, and treatment provision. However, raw DHIS2 exports are often not ready for SNT analysis: they may have structural or quality issues, including inconsistent or unreadable column names, missing values, mismatched facility names, duplicate records, and formats that vary across months or locations. To make this data usable for SNT, it needs to be cleaned, reshaped, and aligned with a spatial and temporal structure that supports later steps of analysis: identifying active and inactive health facilities, calculating reporting rates, managing outliers, calculating malaria incidence, and more.
In this section, we walk through how to preprocess monthly DHIS2 malaria records to make them SNT-ready, using example data from Sierra Leone. This includes cleaning and standardizing facility-level data, parsing and standardizing date formats, harmonizing administrative names with reference shapefiles for spatial joins, computing derived indicators, and aggregating to administrative levels where needed to produce structured datasets.
The DHIS2 data you are working with may require some or all of the specific steps detailed below, or additional steps may be required to prepare the data for analysis.
- Import and combine DHIS2 data files from multiple sources
- Verify data completeness after import (check all groups present, identify missing variables)
- Rename columns using a data dictionary for standardized naming
- Standardize date columns (parse dates, create year/month/yearmon fields)
- Harmonize administrative names with reference shapefiles for spatial joins
- Create unique identifiers (UIDs) for facilities and records
- Compute indicator totals and derived variables (e.g. presumed cases)
- Perform quality control checks on computed indicators
- Identify and resolve duplicate HF-month records
- Aggregate data at the selected administrative level
- Export cleaned facility-level and aggregated datasets
Key Considerations
Troubleshooting
Routine surveillance data can be trickier to work with than more standardized data sources. Expect to encounter and resolve data issues throughout DHIS2 preprocessing. You will likely need to:
- Investigate unexpected patterns in your data
- Make judgment calls about how to handle inconsistencies
- Collaborate with your SNT team on country-specific issues
- Develop custom solutions for problems not covered here
The troubleshooting approaches shown in this guide provide a starting framework, but real-world data cleaning often requires iterative problem-solving and local expertise. While this guide addresses challenges others have encountered when preparing DHIS2 data for analysis, every country is different, and your specific context may present unique problems not yet covered in the code library. We encourage you to be persistent and creative, and to consult others for help if you cannot resolve an issue on your own.
If you encounter a new challenge when preprocessing DHIS2 data and would like new content to be added to the code library, please fill out our feedback form.
Best Practices When Working with Routine Data
1. Understand the System’s Context
- Reporting workflows: Know how data moves from clinics to DHIS2 (paper vs. direct entry, aggregation rules).
- Indicator definitions: Confirm case definitions and what each indicator includes
2. Triangulate Strategically
Cross-check with:
- External benchmarks (e.g., HMIS reports, survey data)
- Parallel systems (e.g., stock management data for testing kits)
- Temporal patterns (compare dry vs. rainy season trends)
3. Document Your Process
Maintain a troubleshooting log tracking:
- Issues encountered
- Data sources consulted
- Decisions made (with rationale)
4. Expect Timing Lags
Facilities may report late, especially after holidays or system outages.
What a Clean and Well-Structured Dataset Should Look Like
Your data might come in a different format from the Sierra Leone example. Whether you are able to use the code chunk provided in full, or only parts of it, to import your data, please make sure your data, after import, contains the following information:
- A column with the date (year and month) of the report
- A column with the health facility name
- Various columns with parent admin units - at minimum you will need the name of the admin unit for SNT analysis (for example, chiefdom for Sierra Leone)
- Your data indicators columns
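The checklist above can be verified programmatically after import. The sketch below uses the Sierra Leone column names (`adm0` to `hf`, `date`); the required-column list is an assumption and should be adjusted to your own naming.

```python
import pandas as pd

# toy imported dataset using the Sierra Leone column names (assumed)
dhis2_df = pd.DataFrame(columns=["adm0", "adm1", "adm2", "adm3",
                                 "hf", "date", "month", "year",
                                 "allout_u5", "conf_u5"])

# minimum columns needed for SNT analysis: date, facility name, admin unit
required = ["date", "hf", "adm3"]
missing = [col for col in required if col not in dhis2_df.columns]
print(missing)  # expect an empty list if the import is complete
```

If `missing` is non-empty, revisit the import step before continuing.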
An example of what the data should look like after import is shown in the table at the top of this page.
Step-by-Step
In this section, we walk through the key preprocessing steps needed to get DHIS2 malaria data ready for SNT analysis, using example DHIS2 data from Sierra Leone. The focus here is on cleaning, reshaping, and aggregating facility-level data to produce structured outputs at both the facility and administrative levels.
Each step is designed to guide you through the process cleanly. Make sure to follow the notes in the code, especially where edits are required (e.g. column names or admin levels). The goal is to make the data analysis-ready without breaking linkages to spatial units or reporting structures.
Remember that you will need to adapt this process to the specific country context in which you are working.
To skip the step-by-step explanation, jump to the full code at the end of this page.
Step 1: Import Packages and Data
We’ll use Sierra Leone DHIS2 malaria data as a working example. Variable names or database structures may differ across countries. Be sure to review and update the variable names to match your country’s DHIS2 configuration.
We begin by installing and loading the required packages for our preprocessing, then import the data.
If you are using Python and need help installing packages, please refer to the Getting Started section.
Step 1.1: Import Packages
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# install or load relevant packages
pacman::p_load(
tidyverse, # data manipulation, reshaping and plotting
rio, # import/export multiple file types
DT, # interactive table preview
here, # project-relative file paths
readxl, # read excel files
writexl, # write excel files
knitr # render code in quarto/rmarkdown
)
import pandas as pd # load core data tools
from pathlib import Path # handle file system paths
from pyprojroot import here # build project-relative paths
import os # optional path utilities
import locale # for setting language locale
import geopandas as gpd # for importing shapefiles
import xxhash # for hashing
import pyarrow.parquet as pq # for exporting parquet files
import pyarrow as pa # for exporting parquet files
Step 1.2: Import Data
The code chunk below defines a helper function to import different types of data, for example Excel or CSV files. If your DHIS2 extraction resulted in multiple files, this function will merge them into a single dataset. We then use it to read in our data file(s).
When calling the DHIS2 data import function, we read files from a folder using the here package. The here package builds paths from the project root, keeping code portable and reliable.
Before running any code, take a moment to visually inspect your data. This is especially important when working with DHIS2 data or other routine health sources. As explained in the Data Structures section of the Getting Started: For Analysts page, your data should follow tidy data principles: each column represents a variable, each row an observation, and each cell contains a single value.
To check this, open the Excel or CSV file directly and review:
- Headers: Are the column names in the first row? Are they clear and consistent?
- Columns and rows:
  - Each column should represent a variable (e.g. health_facility, month, confirmed_cases).
  - Each row should represent a single observation (e.g. one facility-month).
- Cells: Each cell should contain a single value (not merged or grouped). Avoid cells that contain multiple values (e.g. 10/5 for tested/positive).
- Extra rows: Remove any top rows used for titles, notes, or merged headers.
If your dataset is already tidy, you’re ready to move on. If not, try to clean it up before working with it. You can do this:
In Excel, by editing a copy of the file and removing any extra formatting, then saving your new copy to import and work with in later steps.
Or in R, using packages like:
- rio::import(file, skip = X) to skip X extra header rows
- tidyr::pivot_longer() to reshape wide data into long format
- dplyr::mutate() and separate() to split combined values into separate columns
Or in Python, using:
- pandas.read_excel(file, skiprows=X) to skip X unwanted header rows
- pandas.melt() to reshape wide data into long format
- pandas.Series.str.split() and pandas.assign() to split and create new columns
For more complex formatting issues, like multi-row or merged headers, see this excellent step-by-step guide: Tidying Multi-Header Excel Data with R by Paul Campbell (2019).
Getting this right early will save time and reduce confusion later in the workflow, as all the subsequent steps assume that the data is already in tidy format and ready for analysis.
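As a minimal illustration of two of these fixes in pandas, the sketch below reshapes a wide, one-column-per-month export into tidy long format and splits combined "tested/positive" cells. The column names and layout are hypothetical.

```python
import pandas as pd

# toy wide export: one column per month, with combined "tested/positive" cells
wide = pd.DataFrame({
    "health_facility": ["HF A", "HF B"],
    "2021-01": ["10/5", "8/3"],
    "2021-02": ["12/6", "9/4"],
})

# reshape wide -> long: one row per facility-month observation
tidy = wide.melt(id_vars="health_facility",
                 var_name="month", value_name="tested_positive")

# split the combined cells into two separate numeric variables
tidy[["tested", "positive"]] = (
    tidy["tested_positive"].str.split("/", expand=True).astype(int)
)
tidy = tidy.drop(columns="tested_positive")
```

After these two operations, each row is a single facility-month and each cell holds a single value, satisfying the tidy-data checks above.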
First, set the paths to the DHIS2 data.
If the dataset is a single Excel spreadsheet, it can be imported directly using rio::import, which reads multiple file formats. This approach assumes all indicators, locations, and time periods are stored in one file.
In cases where the data is spread across multiple tabs, for example split into yearly tabs, use the following approach.
DHIS2 exports can be difficult to handle when data spans many facilities or years: file size and memory use increase when stacking many tabs or files, so data may be exported across multiple files or sheets. If the data is in a zipped archive, or split into multiple files in the same format, import them together with rio::import_list(). This assumes all files share the same structure, column names, and data types; that sheets contain rectangular data only; and that the file extension is consistent so rio can detect the format. Examples include district- or admin-named files exported by year ranges, such as Bombali_District_2015-2023.xls or Kambia_District_2015-2023.xls.
dhis2_df <- rio::import_list(
file = list.files(
path = core_routine_path,
pattern = "\\.(xls)$", # the file extensions
full.names = TRUE
),
rbind = TRUE, # stack all tabs/files into one table
rbind_label = "sheet_admin", # source name stored in column 'sheet_admin'
rbind_fill = TRUE # fill missing columns with NA when stacking
)
# check data
dhis2_df |>
dplyr::select(1:20) |>
dplyr::glimpse()
To adapt the code:
- Line 3: Set core_routine_path to your target folder.
- Line 4: Modify the pattern to match your extension (xls, xlsx, csv).
- Line 8: Set rbind_label = 'sheet_admin' to store the source tab or file (such as admin units); use year for yearly tabs/files.
First, set the paths to the DHIS2 data.
If the dataset is a single Excel file, import it directly using pd.read_excel. This assumes all indicators, locations, and time periods are in one spreadsheet.
In cases where data is stored in multiple tabs, such as yearly splits, load all tabs and stack them into one table. Add a column to keep the tab source, like year.
Cumulative exports across many facilities or years are often split into multiple files, and file size and memory use increase when stacking them. All files and sheets are assumed to share the same columns and types. Use pandas to find the files, read each one, stack them, and store the file or tab source in a label column such as sheet_admin. Examples include district- or admin-named files exported by year ranges, such as Bombali_District_2015-2023.xls or Kambia_District_2015-2023.xls.
# set file extension
extension = "xls"
# find all files with specified extension in directory
files = list(core_routine_path.glob(f"*.{extension}"))
# initialise dataframe
dhis2_df = pd.DataFrame()
# iterate over files, concatenate stack as one dataframe
for file in files:
temp = pd.read_excel(file, sheet_name="Sheet1")
dhis2_df = pd.concat([dhis2_df, temp])
# Inspect data
dhis2_df.head(10)
orgunitlevel1 ... Malaria treated with ACT >24 hours 15+y_X
0 Sierra Leone ... NaN
1 Sierra Leone ... 1.0
2 Sierra Leone ... NaN
3 Sierra Leone ... 2.0
4 Sierra Leone ... 7.0
5 Sierra Leone ... 5.0
6 Sierra Leone ... 5.0
7 Sierra Leone ... 4.0
8 Sierra Leone ... 7.0
9 Sierra Leone ... 3.0
[10 rows x 56 columns]
To adapt the code:
- Line 1: Update extension to xls, xlsx, or csv.
- Line 4: Set core_routine_path to your data folder.
- Line 11: Remove sheet_name="Sheet1" if your files do not use a fixed tab name.
Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts
Before any cleaning or indicator construction, confirm that the merged dataset contains all groups and components from the original DHIS2 exports.
Step 2.1: Verify All Expected Groups Are Present After Import
DHIS2 exports are often split across several files or sheets, and the structure varies by country. Files may be separated by admin units, by years, or by other DHIS2-defined groupings such as monthly batches, subnational regions, wards, partner-assigned districts, or custom DHIS2 groups (for example, outpatient tallies). After importing and appending these extracts into a single dataset, run a set of checks to confirm that all components were included correctly and that nothing was missed or duplicated.
The example below uses adm2 (currently orgunitlevel3 in our data) as the grouping variable, but you should replace adm2 with year or any other grouping used in your export.
orgunitlevel3
1 Bo City Council
2 Bo District Council
3 Bombali District Council
4 Makeni City Council
5 Bonthe District Council
6 Bonthe Municipal Council
7 Falaba District Council
8 Kailahun District Council
9 Kambia District Council
10 Karene District Council
11 Kenema City Council
12 Kenema District Council
13 Koinadugu District Council
14 Kono District Council
15 Koidu New Sembehun City Council
16 Moyamba District Council
17 Port Loko District Council
18 Port Loko City Council
19 Pujehun District Council
20 Tonkolili District Council
21 Western Area Rural District Council
22 Freetown City Council
To adapt the code:
- Line 3: Replace orgunitlevel3 with the grouping used in your export (for example, adm2, adm1, year, ward, district, or any custom DHIS2 grouping).
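A minimal Python sketch of this distinct-groups check could look like the following. Toy data is used here for illustration; in practice you would run it on your merged dataset, and the column name follows this example.

```python
import pandas as pd

# toy stand-in for the merged DHIS2 dataset (column name is an assumption)
dhis2_df = pd.DataFrame({"orgunitlevel3": ["Moyamba District Council",
                                           "Moyamba District Council",
                                           "Bo City Council"]})

# list each distinct group once, in order of first appearance
groups = (
    dhis2_df[["orgunitlevel3"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(groups)
```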
orgunitlevel3
0 Moyamba District Council
1 Bombali District Council
2 Makeni City Council
3 Kambia District Council
4 Tonkolili District Council
5 Port Loko District Council
6 Port Loko City Council
7 Karene District Council
8 Koinadugu District Council
9 Kono District Council
10 Koidu New Sembehun City Council
11 Western Area Rural District Council
12 Kailahun District Council
13 Pujehun District Council
14 Kenema City Council
15 Kenema District Council
16 Bonthe District Council
17 Bonthe Municipal Council
18 Bo City Council
19 Bo District Council
20 Falaba District Council
21 Freetown City Council
To adapt the code:
- Line 2: Replace orgunitlevel3 with the grouping used in your export (for example, adm2, adm1, year, ward, district, or any custom DHIS2 grouping).
This diagnostic checks that all expected groups from your extracts appear in the merged dataset, whether files were split by admin units, years, wards, or any other DHIS2 grouping.
If you imported 15 files or sheets, you should see 15 distinct groups. Fewer groups usually indicate a missing file, inconsistent tab names, or spelling differences. Row counts will vary across groups, so focus on group presence, not size.
If groups are missing, confirm that all source files are present, sheet names follow a consistent pattern, and group names match across extracts. After fixing any issues and re-importing, continue with the preprocessing steps.
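One way to make the comparison explicit is to hold the expected group names in a set and compare them with the groups found in the merged data. This is a sketch under the assumption that a `sheet_admin` label column was added at import; the expected names are placeholders.

```python
import pandas as pd

# toy merged dataset; 'sheet_admin' is the label column added at import (assumed)
dhis2_df = pd.DataFrame({"sheet_admin": ["Bo", "Bo", "Kambia", "Kenema"]})

# one entry per source file or sheet you expected to import (placeholder names)
expected = {"Bo", "Kambia", "Kenema", "Pujehun"}
found = set(dhis2_df["sheet_admin"].unique())

missing = expected - found      # groups absent from the merged data
unexpected = found - expected   # groups not in your expected list
print(missing, unexpected)
```

Any names in `missing` point to a file or sheet that was not imported; names in `unexpected` usually indicate spelling differences between extracts.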
Step 2.2: Check for Completely Missing Variables
Some variables, such as health facility name, reporting date, and administrative unit, must not be missing. Rows with missing values in these fields cannot be used directly and should be investigated with the DHIS2 focal person.
Completely missing columns should be reviewed before continuing. They may reflect changes in DHIS2 indicators, missing files, or indicators that were never collected. Check with the SNT team to confirm whether the variable should be kept, renamed, or removed. Resolving this early prevents issues in all downstream SNT steps.
The code below prints the number of missing values for each variable and highlights any missingness in the critical identifiers needed for SNT analysis.
# identify complete missing values per column
dhis2_df |>
dplyr::summarise(
dplyr::across(
dplyr::everything(),
~ mean(is.na(.x)) * 100
)
) |>
tidyr::pivot_longer(
everything(),
names_to = "variable",
values_to = "pct_missing"
) |>
dplyr::filter(pct_missing == 100)
# A tibble: 0 × 2
# ℹ 2 variables: variable <chr>, pct_missing <dbl>
# compute percent missing for each column
pct_missing = dhis2_df.isna().mean() * 100
# convert to tidy/long format
missing_table = (
pct_missing.reset_index()
.rename(columns={"index": "variable", 0: "pct_missing"})
)
# filter for completely missing columns
missing_table[missing_table["pct_missing"] == 100]
Empty DataFrame
Columns: [variable, pct_missing]
Index: []
Do not change or adapt the code above.
Step 3: Rename Columns
A practical way to standardise and rename DHIS2 column names is to use a data dictionary. The dictionary contains at least two columns:
1. the original DHIS2 column names as they appear in the raw export
2. the corresponding SNT column names you prefer to use
The SNT names are typically shorter and more consistent, which helps maintain a clear structure across the analytical pipeline.
| indicator_label | snt_var |
|---|---|
| orgunitlevel1 | adm0 |
| orgunitlevel2 | adm1 |
| orgunitlevel3 | adm2 |
| orgunitlevel4 | adm3 |
| organisationunitname | hf |
| OPD (New and follow-up curative) 0-59m_X | allout_u5 |
| OPD (New and follow-up curative) 5+y_X | allout_ov5 |
| Admission - Child with malaria 0-59 months_X | maladm_u5 |
| Admission - Child with malaria 5-14 years_X | maladm_5_14 |
| Admission - Malaria 15+ years_X | maladm_ov15 |
| Child death - Malaria 1-59m_X | maldth_1_59m |
| Child death - Malaria 10-14y_X | maldth_10_14 |
| Child death - Malaria 5-9y_X | maldth_5_9 |
| Death malaria 15+ years Female | maldth_fem_ov15 |
| Death malaria 15+ years Male | maldth_mal_ov15 |
| Separation - Child with malaria 0-59 months_X Death | maldth_u5 |
| Separation - Child with malaria 5-14 years_X Death | maldth_5_14 |
| Separation - Malaria 15+ years_X Death | maldth_ov15 |
| Fever case - suspected Malaria 0-59m_X | susp_u5_hf |
| Fever case - suspected Malaria 5-14y_X | susp_5_14_hf |
| Fever case - suspected Malaria 15+y_X | susp_ov15_hf |
| Fever case in community (Suspected Malaria) 0-59m_X | susp_u5_com |
| Fever case in community (Suspected Malaria) 5-14y_X | susp_5_14_com |
| Fever case in community (Suspected Malaria) 15+y_X | susp_ov15_com |
| Fever case in community tested for Malaria (RDT) - Negative 0-59m_X | tes_neg_rdt_u5_com |
| Fever case in community tested for Malaria (RDT) - Positive 0-59m_X | tes_pos_rdt_u5_com |
| Fever case in community tested for Malaria (RDT) - Negative 5-14y_X | tes_neg_rdt_5_14_com |
| Fever case in community tested for Malaria (RDT) - Positive 5-14y_X | tes_pos_rdt_5_14_com |
| Fever case in community tested for Malaria (RDT) - Negative 15+y_X | tes_neg_rdt_ov15_com |
| Fever case in community tested for Malaria (RDT) - Positive 15+y_X | tes_pos_rdt_ov15_com |
| Fever case tested for Malaria (Microscopy) - Negative 0-59m_X | test_neg_mic_u5_hf |
| Fever case tested for Malaria (Microscopy) - Positive 0-59m_X | test_pos_mic_u5_hf |
| Fever case tested for Malaria (Microscopy) - Negative 5-14y_X | test_neg_mic_5_14_hf |
| Fever case tested for Malaria (Microscopy) - Positive 5-14y_X | test_pos_mic_5_14_hf |
| Fever case tested for Malaria (Microscopy) - Negative 15+y_X | test_neg_mic_ov15_hf |
| Fever case tested for Malaria (Microscopy) - Positive 15+y_X | test_pos_mic_ov15_hf |
| Fever case tested for Malaria (RDT) - Negative 0-59m_X | tes_neg_rdt_u5_hf |
| Fever case tested for Malaria (RDT) - Positive 0-59m_X | tes_pos_rdt_u5_hf |
| Fever case tested for Malaria (RDT) - Negative 5-14y_X | tes_neg_rdt_5_14_hf |
| Fever case tested for Malaria (RDT) - Positive 5-14y_X | tes_pos_rdt_5_14_hf |
| Fever case tested for Malaria (RDT) - Negative 15+y_X | tes_neg_rdt_ov15_hf |
| Fever case tested for Malaria (RDT) - Positive 15+y_X | tes_pos_rdt_ov15_hf |
| Malaria treated in community with ACT <24 hours 0-59m_X | maltreat_u24_u5_com |
| Malaria treated in community with ACT >24 hours 0-59m_X | maltreat_ov24_u5_com |
| Malaria treated in community with ACT <24 hours 5-14y_X | maltreat_u24_5_14_com |
| Malaria treated in community with ACT >24 hours 5-14y_X | maltreat_ov24_5_14_com |
| Malaria treated in community with ACT <24 hours 15+y_X | maltreat_u24_ov15_com |
| Malaria treated in community with ACT >24 hours 15+y_X | maltreat_ov24_ov15_com |
| Malaria treated with ACT <24 hours 0-59m_X | maltreat_u24_u5_hf |
| Malaria treated with ACT >24 hours 5-14y_X | maltreat_ov24_5_14_hf |
| Malaria treated with ACT <24 hours 5-14y_X | maltreat_u24_5_14_hf |
| Malaria treated with ACT >24 hours 15+y_X | maltreat_ov24_ov15_hf |
| Malaria treated with ACT >24 hours 0-59m_X | maltreat_ov24_u5_hf |
| Malaria treated with ACT <24 hours 15+y_X | maltreat_u24_ov15_hf |
Keeping this dictionary in an external file (for example, Excel or CSV) allows others to review and update it, and national teams can confirm indicator naming directly. This also reduces the risk of errors from hard-coded renaming rules, which may break when column formats change or when scripts are modified.
Using a dictionary creates a shared reference for DHIS2 renaming and helps maintain a consistent naming approach throughout the SNT workflow.
Review variable definitions for your country to better understand and appropriately analyze the data. Consider investigating questions like:
- Are admissions included in outpatients?
- Are data on pregnant women included in the data for adults?
- Is there double counting between RDT and microscopy results? If yes, is it appropriate to only use RDT results?
- Are data from the private sector included here? If so, what percentage of the private sector reports into DHIS2?
- Are data from community health workers included in their assigned health facility data or are data separate?
- Have any variables been included or adapted throughout the years? If so, how should they be treated throughout the time series?
Always confirm variable definitions with the SNT team.
# import data dictionary
data_dict <- rio::import(
here::here(core_routine_path, "sle_dhis2_data_dict.xlsx")
)
# create rename vector: object = old names, names = new names
rename_vector <- stats::setNames(
object = data_dict$indicator_label,
nm = data_dict$snt_var
)
# rename dhis2 data
dhis2_df <- dhis2_df |>
dplyr::rename(
!!!rename_vector
) |>
# we drop orgunitlevel5 because it is the same as hf
dplyr::select(-orgunitlevel5)
# check the names
colnames(dhis2_df)
To adapt the code:
- Line 3: Point to your own dictionary file and folder.
- Lines 7–8: Update the dictionary column names for original and SNT variables.
- Line 12: Replace dhis2_df with your DHIS2 dataset.
- Line 17: Remove or change orgunitlevel5 if this column is not present.
# import data dictionary
dict_path = os.path.join(core_routine_path, "sle_dhis2_data_dict.xlsx")
data_dict = pd.read_excel(dict_path)
# build rename dictionary: {old_name: new_name}
rename_dict = dict(zip(data_dict["indicator_label"], data_dict["snt_var"]))
# rename dhis2 data
dhis2_df = dhis2_df.rename(columns=rename_dict).drop(
columns=["orgunitlevel5"], errors="ignore"
)
# view updated column names
dhis2_df.columns.tolist()
To adapt the code:
- Line 2: Update the file path if your dictionary is stored elsewhere.
- Line 6: Change indicator_label and snt_var to match your dictionary column names.
- Line 9: Replace dhis2_df with your DHIS2 dataset.
- Line 10: Adjust or remove "orgunitlevel5" if this column is not present.
The Excel data dictionary used here is based on the Sierra Leone DHIS2 export. If you are working with another country, update the rows in your Excel file so that the “original DHIS2 name” and “SNT name” columns match your indicators.
For example, if your dataset uses “Admission of child with malaria - 5–14yrs”, enter that exact name in the dictionary and set your preferred SNT name beside it. Once the Excel file is updated, the renaming code applies your mapping automatically.
Step 4: Standardise the Date Columns
Most SNT workflows require a consistent set of date variables: a YYYY-MM-DD date column, separate year and month fields, and a yearmon field in YYYY-MM for ordered time-series work. The goal of this step is to take whatever date format appears in the DHIS2 export and convert it into this standard structure.
The steps you apply depend on your starting format. Some formats parse automatically (for example, YYYY-MM-DD). Others, such as "Jan 2020" or French "Janvier 2020", need small additional steps.
The code block below shows how to parse the main date formats seen in DHIS2 exports and convert them into a consistent structure. Once the date field is standardised, creating the year, month, and yearmon fields becomes straightforward.
Format already in YYYY-MM-DD
Format “Jan 2020” or “January 2020”
French month names (for example, “Janvier 2020”)
Portuguese month names (for example, “Janeiro 2020”)
Format “2023-01” or “2023/01”
Format “01/2020” or “01-2020” (Month–Year)
Compact format “202301” (YYYYMM)
Mixed or unclear formats
# create dummy data
dates_ex8 <- tibble::tibble(
raw_date = c(
"2020-01-15", # ISO full date
"Jan 2021", # English abbreviated month
"202203", # compact YYYYMM
"03/2022", # month/year with slash
"2021-07", # year-month
"July 2020", # English full month
"15-02-2021", # DD-MM-YYYY
"2020/11/05", # YYYY/MM/DD
"2020.12.25", # dotted date
"2022.07", # YYYY.MM
"10-2020", # MM-YYYY
"20210105", # YYYYMMDD
"2020 Jan", # Year then month (English)
"2020.01.01" # dotted YYYY.MM.DD
)
)
# parse dates
dates_ex8 |>
dplyr::mutate(
date = lubridate::parse_date_time(
raw_date,
orders = c(
"Y-m-d", # 2020-01-15
"Y/m/d", # 2020/11/05
"Y.m.d", # 2020.12.25
"b Y", # Jan 2021
"B Y", # January 2021
"Y b", # 2020 Jan
"Y B", # 2020 January
"Ym", # 202203
"m/Y", # 03/2022
"m-Y", # 10-2020
"Y-m", # 2021-07
"Y.m", # 2022.07
"d-m-Y", # 15-02-2021
"d/m/Y", # 15/02/2021 (if present)
"Ymd" # 20210105
)
) |>
as.Date()
)
Format already in YYYY-MM-DD
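Dates already in ISO YYYY-MM-DD format parse directly with an explicit format string. A sketch with dummy data:

```python
import pandas as pd

# dummy data in ISO format
dates_iso = pd.DataFrame({"raw_date": ["2020-01-15", "2021-07-01"]})

# parse with an explicit format; unparseable values become NaT
dates_iso["date"] = pd.to_datetime(dates_iso["raw_date"],
                                   format="%Y-%m-%d", errors="coerce")
```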
Format “Jan 2020” or “January 2020”
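English month-year strings need a day prepended so there is a complete date to parse. A sketch with dummy data, trying abbreviated month names first and falling back to full names:

```python
import pandas as pd

# dummy data with English month-year strings
dates_en = pd.DataFrame({"raw_date": ["Jan 2020", "January 2020"]})

# prepend a day, then try abbreviated month names
dates_en["date"] = pd.to_datetime("01 " + dates_en["raw_date"],
                                  format="%d %b %Y", errors="coerce")

# fallback for full month names
mask = dates_en["date"].isna()
dates_en.loc[mask, "date"] = pd.to_datetime(
    "01 " + dates_en.loc[mask, "raw_date"],
    format="%d %B %Y", errors="coerce"
)
```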
French month names (for example, “Janvier 2020”)
# set French locale (adjust if needed depending on system)
# common options: 'fr_FR.UTF-8', 'fr_FR'
locale.setlocale(locale.LC_TIME, 'fr_FR.UTF-8')
# create dummy data
dates_ex3 = pd.DataFrame({
"raw_date": [
"Janvier 2020",
"Février 2021",
"décembre 2022"
]
})
# ensure day is present for parsing
dates_ex3["date"] = pd.to_datetime(
"01 " + dates_ex3["raw_date"],
format="%d %B %Y",
errors="coerce"
)
# fallback for abbreviated months
mask_missing = dates_ex3["date"].isna()
dates_ex3.loc[mask_missing, "date"] = pd.to_datetime(
"01 " + dates_ex3.loc[mask_missing, "raw_date"],
format="%d %b %Y",
errors="coerce"
)
dates_ex3
Portuguese month names (for example, “Janeiro 2020”)
# set Portuguese locale
# common options: 'pt_PT.UTF-8', 'pt_PT'
locale.setlocale(locale.LC_TIME, 'pt_PT.UTF-8')
# create dummy data
dates_ex4 = pd.DataFrame({
"raw_date": [
"Janeiro 2020",
"fevereiro 2021",
"Dezembro 2022",
"Jan 2021",
"Fev 2022",
"Dez 2023"
]
})
# first attempt: full Portuguese month names
dates_ex4["date"] = pd.to_datetime(
"01 " + dates_ex4["raw_date"],
format="%d %B %Y",
errors="coerce"
)
# fallback: abbreviated Portuguese month names (e.g. Jan, Fev, Dez)
mask_missing = dates_ex4["date"].isna()
dates_ex4.loc[mask_missing, "date"] = pd.to_datetime(
"01 " + dates_ex4.loc[mask_missing, "raw_date"],
format="%d %b %Y",
errors="coerce"
)
dates_ex4Mixed or unclear formats
# create dummy data
dates_ex5 = pd.DataFrame(
{
"raw_date": [
"2020-01-15", # ISO full date
"Jan 2021", # English abbreviated month
"202203", # compact YYYYMM
"03/2022", # month/year
"2021-07", # year-month
"July 2020", # English full month
"15-02-2021", # DD-MM-YYYY
"2020/11/05", # YYYY/MM/DD
"2020.12.25", # YYYY.MM.DD
"2022.07", # YYYY.MM
"10-2020", # MM-YYYY
"20210105", # YYYYMMDD
"2020 Jan", # year then month
"2020.01.01", # dotted full date
]
}
)
# list of possible formats
formats = [
"%Y-%m-%d",
"%Y/%m/%d",
"%Y.%m.%d",
"%d-%m-%Y",
"%d/%m/%Y",
"%b %Y",
"%B %Y",
"%Y %b",
"%Y %B",
"%Y-%m",
"%Y/%m",
"%Y.%m",
"%m-%Y",
"%m/%Y",
"%Y%m",
"%Y%m%d",
]
def try_parse(value):
    # try strict formats first
    for fmt in formats:
        try:
            return pd.to_datetime(value, format=fmt)
        except (ValueError, TypeError):
            continue
    # fallback: let pandas guess freely
    return pd.to_datetime(value, errors="coerce")
dates_ex5["date"] = dates_ex5["raw_date"].apply(try_parse)
dates_ex5["date"] = dates_ex5["date"].dt.to_period("M").dt.to_timestamp()
dates_ex5
For the Sierra Leone dataset, the date field is stored in periodname and appears in formats such as “January 2019” or “February 2019”. We first standardise this into a proper YYYY-MM-DD date. Once the date column is cleaned, we can generate the year, month, and yearmon fields used throughout the SNT workflow.
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
Sys.setlocale("LC_TIME", "en_US.UTF-8")
dhis2_df <- dhis2_df |>
# parse the raw date into a proper Date object
dplyr::mutate(
date = lubridate::parse_date_time(
periodname,
orders = c("B Y", "b Y")
) |>
as.Date()
) |>
# create year, month, and ordered yearmon label
dplyr::mutate(
year = lubridate::year(date),
month = lubridate::month(date),
# human-readable label, e.g. "Jan 2020"
yearmon = format(date, "%b %Y"),
# order factor by actual chronological date
yearmon = factor(yearmon, levels = unique(yearmon[order(date)]))
)
# check the head
dhis2_df |>
dplyr::select(date, year, month, yearmon) |>
head()
To adapt the code:
- Line 3: Set the locale to match the language of month names in your export, for example "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8".
- Line 9: Update the column name if your date field is not called periodname.
- Line 10: Adjust the orders = c("B Y", "b Y") list only if your dates use a different structure.
- Line 19: Change the yearmon display (for example, "%b %Y" to "%B %Y") based on how you want month names shown.
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
locale.setlocale(locale.LC_TIME, "en_US.UTF-8")
# parse the raw date into a proper datetime object
dhis2_df["date"] = pd.to_datetime(
dhis2_df["periodname"],
format="%B %Y",
errors="coerce"
)
# create year, month, and ordered yearmon columns
dhis2_df["year"] = dhis2_df["date"].dt.year
dhis2_df["month"] = dhis2_df["date"].dt.month
dhis2_df["yearmon"] = dhis2_df["date"].dt.strftime("%b %Y")
# check the output
dhis2_df[["date", "year", "month", "yearmon"]].head()

To adapt the code:

- Line 3: Set the locale to match the language of month names in your export, for example "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8".
- Line 6: Update the column name if your date field is not called periodname.
- Line 7: Adjust the format="%B %Y" if your dates use a different structure (for example, "%b %Y" for abbreviated months).
- Line 14: Change the strftime format (for example, "%b %Y" to "%B %Y") based on how you want month names shown.
Step 5: Managing Location Columns
Step 5.1: Harmonise Administrative Names
Differences in spelling, capitalisation, spacing, or naming conventions between DHIS2 exports and shapefiles are common. These differences will cause joins to fail and lead to missing or duplicated geographic units. Before any spatial join, you should harmonise administrative names used in your DHIS2 dataset with those defined in the shapefile.
A practical way to do this is to use the prep_geonames() function from the sntutils package. The function detects mismatches, suggests likely matches using string-distance methods, and lets you confirm or edit them interactively. It also saves your decisions to a cache file so that harmonisation is reproducible and consistent across analysts and across reruns.
We only show the basic structure here. The full workflow, including interactive matching and scripted reuse, is explained in the Merging shapefiles with tabular data section of the SNT Code Library. For an in-depth blog post based on this method, see Cleaning and standardising geographic names in R with prep_geonames() from the AMMnet Hackathon.
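To give a feel for the string-distance idea behind such tools, here is a minimal sketch using Python's standard-library difflib. It is a simplified stand-in, not the actual prep_geonames() implementation, and the misspelled export names are hypothetical:

```python
from difflib import get_close_matches

# admin names as they might appear in a DHIS2 export (hypothetical misspellings)
dhis2_names = ["KENEMA", "PORTLOKO", "KAILAHN"]

# reference names from the shapefile
shapefile_names = ["BO", "KENEMA", "PORT LOKO", "KAILAHUN"]

# suggest the closest reference name for each export name
for name in dhis2_names:
    match = get_close_matches(name, shapefile_names, n=1, cutoff=0.6)
    print(name, "->", match[0] if match else "NO MATCH")
```

prep_geonames() goes further than this sketch: it lets you confirm each suggestion interactively and caches the decisions for reuse.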
Before any interactive matching, we ensure that the columns are correctly assigned. From observation of the admin names, it appeared that adm1 was actually adm2, so we create a simple lookup to assign districts (adm2) to their respective provinces (adm1).
# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")
# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"
# get adm3 shapefile to use as reference lookup
shp_adm3 <- sntutils::read(
here::here("english/data_r/shapefiles/sle_spatial_adm3_2021.rds")
) |>
sf::st_drop_geometry() # drop geometry, we just need the admin names
# standardise admin names
dhis2_df <- dhis2_df |>
dplyr::mutate(
adm0 = toupper(adm0),
adm2 = toupper(adm1),
adm3 = toupper(adm3),
hf = toupper(hf),
# the adm2 in the lookup shapefile don't have
# "DISTRICT" in the name, so we remove it to enable matching
adm2 = stringr::str_remove_all(adm2, " DISTRICT"),
adm3 = stringr::str_remove_all(adm3, " CHIEFDOM")
) |>
dplyr::mutate(
# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
adm1 = dplyr::case_when(
adm2 %in% c("KAILAHUN", "KENEMA", "KONO") ~ "EASTERN",
adm2 %in% c("BOMBALI", "FALABA", "KOINADUGU", "TONKOLILI") ~ "NORTH EAST",
adm2 %in% c("KAMBIA", "KARENE", "PORT LOKO") ~ "NORTH WEST",
adm2 %in% c("BO", "BONTHE", "MOYAMBA", "PUJEHUN") ~ "SOUTHERN",
adm2 %in% c("WESTERN AREA RURAL", "WESTERN AREA URBAN") ~ "WESTERN"
)
)
# harmonise admin names between dhis2 data and shapefile
dhis2_df <-
sntutils::prep_geonames(
target_df = dhis2_df,
lookup_df = lookup_keys,
level0 = "adm0",
level1 = "adm1",
level2 = "adm2",
level3 = "adm3",
interactive = TRUE,
cache_path = here::here(cache_path, "geoname_decisions_cache.xlsx")
)

To adapt the code:

- Line 6: Update cache_path to your preferred location for saving cached files.
- Lines 9–12: Replace the shapefile path with the path to your reference shapefile.
- Line 15: Replace dhis2_df with your DHIS2 or target dataset.
- Lines 15–36: This standardisation block is specific to Sierra Leone. Adapt the toupper(), str_remove_all(), and case_when() logic to match your country's administrative structure and naming conventions.
- Line 39: Replace dhis2_df with your target dataframe name.
- Line 42: Replace lookup_keys with your lookup or reference dataset.
- Lines 43–46: Adjust level0, level1, level2, and level3 to match your actual admin column names if they differ (e.g., "country", "region", "district", "ward"). Ensure the columns exist in both datasets.
- Line 48: Update the cache filename if desired.
Once updated, run the code to harmonise your data’s admin names with the shapefile.
# if not already installed, install the sntutils python package
# from GitHub (run this once in a terminal, not in the script):
# pip install git+https://github.com/ahadi-analytics/sntutils-py.git
from sntutils.geo import prep_geonames # for name matching
# set up location to save cache
cache_path = Path("1.1_foundational/1d_cache_files")
# get adm3 shapefile to use as reference lookup
shp_adm3 = gpd.read_file(
Path(here("english/data_r/shapefiles/sle_spatial_adm3_2021.geojson"))
).drop(columns="geometry")
# standardise admin names
dhis2_df = dhis2_df.assign(
adm0=lambda x: x["adm0"].str.upper(),
# the adm2 in the lookup shapefile don't have
# "DISTRICT" in the name, so we remove it to enable matching
adm2=lambda x: x["adm1"]
.str.upper()
.str.replace(" DISTRICT", "", regex=False)
.str.strip(),
adm3=lambda x: x["adm3"]
.str.upper()
.str.replace(" CHIEFDOM", "", regex=False)
.str.strip(),
hf=lambda x: x["hf"]
.str.upper()
)
# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
district_to_province = {
"KAILAHUN": "EASTERN", "KENEMA": "EASTERN", "KONO": "EASTERN",
"BOMBALI": "NORTH EAST", "FALABA": "NORTH EAST",
"KOINADUGU": "NORTH EAST", "TONKOLILI": "NORTH EAST",
"KAMBIA": "NORTH WEST", "KARENE": "NORTH WEST", "PORT LOKO": "NORTH WEST",
"BO": "SOUTHERN", "BONTHE": "SOUTHERN",
"MOYAMBA": "SOUTHERN", "PUJEHUN": "SOUTHERN",
"WESTERN AREA RURAL": "WESTERN", "WESTERN AREA URBAN": "WESTERN"
}
dhis2_df["adm1"] = dhis2_df["adm2"].map(district_to_province)
# harmonise admin names between dhis2 data and shapefile
# note: this exact merge duplicates the matching done by
# prep_geonames() below; keep only one of the two approaches
dhis2_df = dhis2_df.merge(
lookup_keys,
on=["adm0", "adm1", "adm2", "adm3"],
how="left"
)
# harmonise admin names between DHIS2 data and shapefile
dhis2_df = prep_geonames(
target_df=dhis2_df,
lookup_df=shp_adm3,
level0="adm0",
level1="adm1",
level2="adm2",
level3="adm3",
cache_path=here(cache_path, "geoname_decisions_cache.xlsx")
)

To adapt the code:

- Line 10: Run once in a terminal to install the sntutils-py package from GitHub.
- Line 11: Import statement for the prep_geonames function.
- Line 14: Update cache_path to your preferred location for saving cached files.
- Lines 17–19: Replace the shapefile path with the path to your reference shapefile.
- Line 22: Replace dhis2_df with your DHIS2 or target dataset.
- Lines 22–46: This standardisation block is specific to Sierra Leone. Adapt the .str.upper(), .str.replace(), and district_to_province dictionary to match your country's administrative structure and naming conventions.
- Lines 51–55: Remove this merge block if using prep_geonames() below, as it duplicates the matching step.
- Line 58: Replace dhis2_df with your target dataframe name.
- Line 59: Replace shp_adm3 with your lookup or reference dataset.
- Lines 60–63: Adjust level0, level1, level2, and level3 to match your actual admin column names if they differ (e.g., "country", "region", "district", "ward"). Ensure the columns exist in both datasets.
- Line 64: Update the cache filename if desired.
Once updated, run the code to harmonise your data’s admin names with the shapefile.
Once the interactive matching is complete and the results are saved, the output message confirms that all administrative levels have been successfully aligned between your dataset and the reference shapefile. If mismatches remain, the prep_geonames() function will prompt you to resolve them interactively.
To double-check your decisions, you can share the cache file geoname_decisions_cache.xlsx with the country SNT team to ensure all matching decisions are correct.
Step 5.2: Create unique identifiers and location labels
Using our corrected admin names, we can now create key columns for downstream analysis and mapping:
- hf_uid: A unique health facility identifier generated by hashing the full administrative hierarchy and facility name. This ensures each facility has a consistent, reproducible ID.
- location_short: A concatenated label of province and district (adm1 ~ adm2), useful for summary tables and aggregated views.
- location_full: The complete administrative hierarchy including chiefdom and facility name, useful for detailed map tooltips and drill-down labelling.
- record_id: A unique record identifier combining the facility ID and time period, ensuring each facility-month combination is uniquely identifiable.
# create identifiers
dhis2_df <- dhis2_df |>
dplyr::mutate(
# create unique health facility identifier from admin hierarchy
hf_uid = sntutils::vdigest(
paste0(adm0, adm1, adm2, adm3, hf),
algo = "xxhash32"
),
# create location labels for mapping
location_short = paste(adm1, adm2, sep = " ~ "),
location_full = paste(adm1, adm2, adm3, hf, sep = " ~ "),
# generate consistent record id (facility + time period)
record_id = sntutils::vdigest(
paste(hf_uid, yearmon),
algo = "xxhash32"
)
)
# check
dhis2_df |>
dplyr::arrange(location_full) |>
dplyr::distinct(location_full, hf_uid, record_id) |>
head()

To adapt the code:

- Line 2: Replace dhis2_df with your target dataset.
- Line 6: Adjust the paste0() arguments to match your administrative column names (e.g., adm0, adm1, adm2, adm3) and facility column (e.g., hf).
- Line 10: Modify location_short to include the admin levels you want for summary views.
- Line 11: Modify location_full to include all admin levels and facility name for detailed labelling.
- Line 14: Ensure yearmon matches your time period column name. If using a different temporal unit (e.g., year, epiweek), update accordingly.
Once updated, run the code to generate unique identifiers for your health facilities and records.
# helper function to create xxhash32 digest (equivalent to sntutils::vdigest)
def vdigest(x, algo="xxhash32"):
"""Create vectorised hash digest."""
return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())
# create identifiers
dhis2_df = dhis2_df.assign(
# create unique health facility identifier from admin hierarchy
# use spaces to match R's paste() behavior for consistent hashes
hf_uid=lambda x: vdigest(
x["adm0"] + " " + x["adm1"] + " " + x["adm2"] + " " + x["adm3"] + " " + x["hf"]
),
# create location labels for mapping
location_short=lambda x: x["adm1"] + " ~ " + x["adm2"],
location_full=lambda x: (
x["adm1"] + " ~ " + x["adm2"] + " ~ " + x["adm3"] + " ~ " + x["hf"]
),
)
# generate consistent record id (facility + time period)
# use space to match R's paste() behavior
dhis2_df["record_id"] = vdigest(
dhis2_df["hf_uid"] + " " + dhis2_df["yearmon"].astype(str)
)
# check
dhis2_df[["location_full", "hf_uid", "record_id"]].drop_duplicates().sort_values(
"location_full"
).head()

To adapt the code:

- Line 16: Replace dhis2_df with your target dataset.
- Lines 18–20: Adjust the concatenated columns to match your administrative column names (e.g., adm0, adm1, adm2, adm3) and facility column (e.g., hf).
- Line 22: Modify location_short to include the admin levels you want for summary views.
- Lines 23–25: Modify location_full to include all admin levels and facility name for detailed labelling.
- Lines 29–31: Ensure yearmon matches your time period column name. If using a different temporal unit (e.g., year, date), update accordingly.
Once updated, run the code to generate unique identifiers for your health facilities and records.
Step 6: Compute Variables
Step 6.1: Compute Indicator Totals and New Variables
Now that we have imported our DHIS2 data and cleaned up the columns, we will compute derived variables that total specific indicators. DHIS2 can provide totals, but those columns must be explicitly extracted. In the raw DHIS2 dataset, indicators are typically disaggregated by age group, community, and health facility level. Analysts should strive to obtain the most disaggregated data possible.
For example, if we want to calculate the total number of outpatient visits, DHIS2 may not provide a direct total, but includes components that can be summed, such as allout_u5 and allout_ov5. Remember to review variable definitions to ensure your aggregations do not double count cases.
The same logic applies to other indicators, including total suspected cases (susp), tested cases (test), confirmed cases (conf), treated cases (maltreat), and presumed cases (pres). The code snippet below shows how to compute these totals by combining the relevant components. The code also includes indicator aggregation by age groups, such as test_hf_u5 which captures all children under five tested by either RDT or microscopy at a health facility.
It is best to verify with the SNT team which specific data elements from DHIS2 should be summed to obtain the correct total. While some calculations may seem obvious, others are not. The box in Step 3 lists example questions the analyst should have answered before summing across disaggregated data elements. Depending on your country's DHIS2 and data reporting practices, other questions may also be relevant.
While some countries have presumed cases in their DHIS2 data, not all do! Here are three options for presumed cases that countries have used:
- Option 1: Your data already has a pres column; no further calculation is needed.
- Option 2: Calculate presumed cases as the difference between treated and confirmed cases: maltreat - conf
- Option 3: Calculate presumed cases as the difference between suspected and tested cases: susp - test
The code below uses Option 2 (lines 52 to 55). Testing is subject to resource availability, meaning facilities handle presumed cases differently.
Consult the SNT team to determine how best to approach presumed cases calculations, whether it’s one of the options shown above or a different approach.
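As a minimal pandas sketch of how Options 2 and 3 compare (the counts below are made up for illustration), note that both are floored at zero so a facility can never have negative presumed cases:

```python
import pandas as pd

# toy facility-month rows (made-up counts)
df = pd.DataFrame({
    "maltreat": [30, 25],  # treated cases
    "conf": [22, 28],      # confirmed cases
    "susp": [60, 70],      # suspected cases
    "test": [45, 70],      # tested cases
})

# Option 2: treated minus confirmed, floored at zero
df["pres_opt2"] = (df["maltreat"] - df["conf"]).clip(lower=0)

# Option 3: suspected minus tested, floored at zero
df["pres_opt3"] = (df["susp"] - df["test"]).clip(lower=0)

print(df[["pres_opt2", "pres_opt3"]])
```

The two options can disagree substantially for the same facility-month, which is why the choice should be agreed with the SNT team rather than made by the analyst alone.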
Below we create two new functions: fallback_row_sum() and fallback_diff(). The fallback_row_sum() function replaces rowSums(..., na.rm = TRUE), which returns 0 when all values in a row are NA. This default behavior can obscure missingness.
In routine health data, it’s critical to distinguish between:
- True zeros: e.g. a facility reported 0 cases
- Missing values: e.g. the facility did not report at all
fallback_row_sum() returns NA when too few values are present, preserving this distinction and preventing overestimation of completeness.
Similarly, fallback_diff() returns the difference when both values are present, the non-missing value if only one is present, and NA if both are missing. It applies pmax() to floor results at a minimum threshold (0 by default).
Both functions guard against misleading outputs when data are incomplete.
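The distinction is easy to see in a small pandas example (illustrative values only): a naive row sum turns an all-missing row into 0, while a min_count-style sum, the same guard fallback_row_sum() applies, keeps it as NA:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "allout_u5": [10, np.nan, np.nan],
    "allout_ov5": [5, 7, np.nan],
})

# naive sum: the all-NA third row silently becomes 0
naive = df[["allout_u5", "allout_ov5"]].sum(axis=1)

# guarded sum: requires at least one reported value, else NA
guarded = df[["allout_u5", "allout_ov5"]].sum(axis=1, min_count=1)

print(pd.DataFrame({"naive": naive, "guarded": guarded}))
```

In the naive version, a facility that reported nothing looks identical to one that reported zero cases; the guarded version preserves the difference.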
# smart row-wise sum with missing data handling
fallback_row_sum <- function(..., min_present = 1) {
vars_matrix <- cbind(...)
valid_count <- rowSums(!is.na(vars_matrix))
raw_sum <- rowSums(vars_matrix, na.rm = TRUE)
ifelse(valid_count >= min_present, raw_sum, NA_real_)
}
# fallback absolute difference between two vectors
fallback_diff <- function(col1, col2, minimum = 0) {
dplyr::case_when(
is.na(col1) & is.na(col2) ~ NA_real_,
is.na(col1) ~ pmax(col2, minimum),
is.na(col2) ~ pmax(col1, minimum),
TRUE ~ pmax(col1 - col2, minimum)
)
}

Now we use these functions to aggregate our columns.
Show the code
# compute indicator totals in DHIS2 data
dhis2_df <- dhis2_df |>
dplyr::mutate(
# outpatient visits
allout = fallback_row_sum(allout_u5, allout_ov5),
# suspected cases
susp = fallback_row_sum(
susp_u5_hf,
susp_5_14_hf,
susp_ov15_hf,
susp_u5_com,
susp_5_14_com,
susp_ov15_com
),
# tested cases
test_hf = fallback_row_sum(
test_neg_mic_u5_hf,
test_pos_mic_u5_hf,
test_neg_mic_5_14_hf,
test_pos_mic_5_14_hf,
test_neg_mic_ov15_hf,
test_pos_mic_ov15_hf,
tes_neg_rdt_u5_hf,
tes_pos_rdt_u5_hf,
tes_neg_rdt_5_14_hf,
tes_pos_rdt_5_14_hf,
tes_neg_rdt_ov15_hf,
tes_pos_rdt_ov15_hf
),
test_com = fallback_row_sum(
tes_neg_rdt_u5_com,
tes_pos_rdt_u5_com,
tes_neg_rdt_5_14_com,
tes_pos_rdt_5_14_com,
tes_neg_rdt_ov15_com,
tes_pos_rdt_ov15_com
),
test = fallback_row_sum(test_hf, test_com),
# confirmed cases (HF and COM)
conf_hf = fallback_row_sum(
test_pos_mic_u5_hf,
test_pos_mic_5_14_hf,
test_pos_mic_ov15_hf,
tes_pos_rdt_u5_hf,
tes_pos_rdt_5_14_hf,
tes_pos_rdt_ov15_hf
),
conf_com = fallback_row_sum(
tes_pos_rdt_u5_com,
tes_pos_rdt_5_14_com,
tes_pos_rdt_ov15_com
),
conf = fallback_row_sum(conf_hf, conf_com),
# treated cases
maltreat_com = fallback_row_sum(
maltreat_u24_u5_com,
maltreat_ov24_u5_com,
maltreat_u24_5_14_com,
maltreat_ov24_5_14_com,
maltreat_u24_ov15_com,
maltreat_ov24_ov15_com
),
maltreat_hf = fallback_row_sum(
maltreat_u24_u5_hf,
maltreat_ov24_u5_hf,
maltreat_u24_5_14_hf,
maltreat_ov24_5_14_hf,
maltreat_u24_ov15_hf,
maltreat_ov24_ov15_hf
),
maltreat = fallback_row_sum(maltreat_hf, maltreat_com),
# presumed cases
pres_com = fallback_diff(maltreat_com, conf_com),
pres_hf = fallback_diff(maltreat_hf, conf_hf),
pres = fallback_row_sum(pres_com, pres_hf),
# malaria admissions
maladm = fallback_row_sum(
maladm_u5,
maladm_5_14,
maladm_ov15
),
# malaria deaths
maldth = fallback_row_sum(
maldth_u5,
maldth_1_59m,
maldth_10_14,
maldth_5_9,
maldth_5_14,
maldth_ov15,
maldth_fem_ov15,
maldth_mal_ov15
),
# AGE-GROUP SPECIFIC AGGREGATIONS
# Tested cases by age group (HF only)
test_hf_u5 = fallback_row_sum(
test_neg_mic_u5_hf,
test_pos_mic_u5_hf,
tes_neg_rdt_u5_hf,
tes_pos_rdt_u5_hf
),
test_hf_5_14 = fallback_row_sum(
test_neg_mic_5_14_hf,
test_pos_mic_5_14_hf,
tes_neg_rdt_5_14_hf,
tes_pos_rdt_5_14_hf
),
test_hf_ov15 = fallback_row_sum(
test_neg_mic_ov15_hf,
test_pos_mic_ov15_hf,
tes_neg_rdt_ov15_hf,
tes_pos_rdt_ov15_hf
),
# Tested cases by age group (Community only)
test_com_u5 = fallback_row_sum(
tes_neg_rdt_u5_com,
tes_pos_rdt_u5_com
),
test_com_5_14 = fallback_row_sum(
tes_neg_rdt_5_14_com,
tes_pos_rdt_5_14_com
),
test_com_ov15 = fallback_row_sum(
tes_neg_rdt_ov15_com,
tes_pos_rdt_ov15_com
),
# Total tested by age group (HF + Community)
test_u5 = fallback_row_sum(test_hf_u5, test_com_u5),
test_5_14 = fallback_row_sum(test_hf_5_14, test_com_5_14),
test_ov15 = fallback_row_sum(test_hf_ov15, test_com_ov15),
# Suspected cases by age group (HF only)
susp_hf_u5 = susp_u5_hf,
susp_hf_5_14 = susp_5_14_hf,
susp_hf_ov15 = susp_ov15_hf,
# Suspected cases by age group (Community only)
susp_com_u5 = susp_u5_com,
susp_com_5_14 = susp_5_14_com,
susp_com_ov15 = susp_ov15_com,
# Total suspected by age group (HF + Community)
susp_u5 = fallback_row_sum(susp_hf_u5, susp_com_u5),
susp_5_14 = fallback_row_sum(susp_hf_5_14, susp_com_5_14),
susp_ov15 = fallback_row_sum(susp_hf_ov15, susp_com_ov15),
# Confirmed cases by age group (HF only)
conf_hf_u5 = fallback_row_sum(
test_pos_mic_u5_hf,
tes_pos_rdt_u5_hf
),
conf_hf_5_14 = fallback_row_sum(
test_pos_mic_5_14_hf,
tes_pos_rdt_5_14_hf
),
conf_hf_ov15 = fallback_row_sum(
test_pos_mic_ov15_hf,
tes_pos_rdt_ov15_hf
),
# Confirmed cases by age group (Community only)
conf_com_u5 = tes_pos_rdt_u5_com,
conf_com_5_14 = tes_pos_rdt_5_14_com,
conf_com_ov15 = tes_pos_rdt_ov15_com,
# Total confirmed cases by age group (HF + Community)
conf_u5 = fallback_row_sum(conf_hf_u5, conf_com_u5),
conf_5_14 = fallback_row_sum(conf_hf_5_14, conf_com_5_14),
conf_ov15 = fallback_row_sum(conf_hf_ov15, conf_com_ov15),
# Treated cases by age group (HF only)
maltreat_hf_u5 = fallback_row_sum(
maltreat_u24_u5_hf,
maltreat_ov24_u5_hf
),
maltreat_hf_5_14 = fallback_row_sum(
maltreat_u24_5_14_hf,
maltreat_ov24_5_14_hf
),
maltreat_hf_ov15 = fallback_row_sum(
maltreat_u24_ov15_hf,
maltreat_ov24_ov15_hf
),
# Treated cases by age group (Community only)
maltreat_com_u5 = fallback_row_sum(
maltreat_u24_u5_com,
maltreat_ov24_u5_com
),
maltreat_com_5_14 = fallback_row_sum(
maltreat_u24_5_14_com,
maltreat_ov24_5_14_com
),
maltreat_com_ov15 = fallback_row_sum(
maltreat_u24_ov15_com,
maltreat_ov24_ov15_com
),
# Total treated cases by age group (HF + Community)
maltreat_u5 = fallback_row_sum(maltreat_hf_u5, maltreat_com_u5),
maltreat_5_14 = fallback_row_sum(maltreat_hf_5_14, maltreat_com_5_14),
maltreat_ov15 = fallback_row_sum(maltreat_hf_ov15, maltreat_com_ov15),
# Total treated cases within 24 hours (HF only)
maltreat_u24_hf = fallback_row_sum(
maltreat_u24_u5_hf,
maltreat_u24_5_14_hf,
maltreat_u24_ov15_hf
),
# Total treated cases after 24 hours (HF only)
maltreat_ov24_hf = fallback_row_sum(
maltreat_ov24_u5_hf,
maltreat_ov24_5_14_hf,
maltreat_ov24_ov15_hf
),
# Total treated cases within 24 hours (Community only)
maltreat_u24_com = fallback_row_sum(
maltreat_u24_u5_com,
maltreat_u24_5_14_com,
maltreat_u24_ov15_com
),
# Total treated cases after 24 hours (Community only)
maltreat_ov24_com = fallback_row_sum(
maltreat_ov24_u5_com,
maltreat_ov24_5_14_com,
maltreat_ov24_ov15_com
),
# Overall totals (HF + Community)
maltreat_u24_total = fallback_row_sum(maltreat_u24_hf, maltreat_u24_com),
maltreat_ov24_total = fallback_row_sum(maltreat_ov24_hf, maltreat_ov24_com),
# Presumed cases by age group
pres_com_u5 = fallback_diff(maltreat_com_u5, conf_com_u5),
pres_com_5_14 = fallback_diff(maltreat_com_5_14, conf_com_5_14),
pres_com_ov15 = fallback_diff(maltreat_com_ov15, conf_com_ov15),
pres_hf_u5 = fallback_diff(maltreat_hf_u5, conf_hf_u5),
pres_hf_5_14 = fallback_diff(maltreat_hf_5_14, conf_hf_5_14),
pres_hf_ov15 = fallback_diff(maltreat_hf_ov15, conf_hf_ov15),
pres_u5 = fallback_row_sum(pres_com_u5, pres_hf_u5),
pres_5_14 = fallback_row_sum(pres_com_5_14, pres_hf_5_14),
pres_ov15 = fallback_row_sum(pres_com_ov15, pres_hf_ov15)
)
# check to see the aggregation worked
dhis2_df |>
dplyr::filter(
record_id %in%
c("6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842")
) |>
dplyr::arrange(allout) |>
dplyr::select(
allout,
allout_u5,
allout_ov5,
pres_hf_u5,
maltreat_hf_u5,
conf_hf_u5
) |>
head()

To adapt the code:

- Lines 4–64: Adjust the variable names to reflect those relevant to your dataset when calculating new variables.

Once updated, run the code.
Show the code
# helper function: row-wise sum that returns value if only one non-NA
def fallback_row_sum(df, cols):
return df[cols].sum(axis=1, skipna=True, min_count=1)
# helper function: fallback difference, floored at 0
# (mirrors the R fallback_diff: keeps the non-missing value
# when only one input is present, NA when both are missing)
def fallback_diff(df, col1, col2, minimum=0):
    a, b = df[col1], df[col2]
    diff = (a - b).clip(lower=minimum)         # both present
    diff = diff.fillna(a.clip(lower=minimum))  # only col1 present
    diff = diff.fillna(b.clip(lower=minimum))  # only col2 present
    return diff                                # NA if both missing
dhis2_df = (
dhis2_df.assign(
# outpatient visits
allout=lambda x: fallback_row_sum(x, ["allout_u5", "allout_ov5"]),
# suspected cases
susp=lambda x: fallback_row_sum(
x,
[
"susp_u5_hf",
"susp_5_14_hf",
"susp_ov15_hf",
"susp_u5_com",
"susp_5_14_com",
"susp_ov15_com",
],
),
# tested cases
test_hf=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_u5_hf",
"test_pos_mic_u5_hf",
"test_neg_mic_5_14_hf",
"test_pos_mic_5_14_hf",
"test_neg_mic_ov15_hf",
"test_pos_mic_ov15_hf",
"tes_neg_rdt_u5_hf",
"tes_pos_rdt_u5_hf",
"tes_neg_rdt_5_14_hf",
"tes_pos_rdt_5_14_hf",
"tes_neg_rdt_ov15_hf",
"tes_pos_rdt_ov15_hf",
],
),
test_com=lambda x: fallback_row_sum(
x,
[
"tes_neg_rdt_u5_com",
"tes_pos_rdt_u5_com",
"tes_neg_rdt_5_14_com",
"tes_pos_rdt_5_14_com",
"tes_neg_rdt_ov15_com",
"tes_pos_rdt_ov15_com",
],
),
# confirmed cases (HF and COM)
conf_hf=lambda x: fallback_row_sum(
x,
[
"test_pos_mic_u5_hf",
"test_pos_mic_5_14_hf",
"test_pos_mic_ov15_hf",
"tes_pos_rdt_u5_hf",
"tes_pos_rdt_5_14_hf",
"tes_pos_rdt_ov15_hf",
],
),
conf_com=lambda x: fallback_row_sum(
x,
[
"tes_pos_rdt_u5_com",
"tes_pos_rdt_5_14_com",
"tes_pos_rdt_ov15_com",
],
),
# treated cases
maltreat_com=lambda x: fallback_row_sum(
x,
[
"maltreat_u24_u5_com",
"maltreat_ov24_u5_com",
"maltreat_u24_5_14_com",
"maltreat_ov24_5_14_com",
"maltreat_u24_ov15_com",
"maltreat_ov24_ov15_com",
],
),
maltreat_hf=lambda x: fallback_row_sum(
x,
[
"maltreat_u24_u5_hf",
"maltreat_ov24_u5_hf",
"maltreat_u24_5_14_hf",
"maltreat_ov24_5_14_hf",
"maltreat_u24_ov15_hf",
"maltreat_ov24_ov15_hf",
],
),
# malaria admissions
maladm=lambda x: fallback_row_sum(
x, ["maladm_u5", "maladm_5_14", "maladm_ov15"]
),
# malaria deaths
maldth=lambda x: fallback_row_sum(
x,
[
"maldth_u5",
"maldth_1_59m",
"maldth_10_14",
"maldth_5_9",
"maldth_5_14",
"maldth_ov15",
"maldth_fem_ov15",
"maldth_mal_ov15",
],
),
# AGE-GROUP SPECIFIC AGGREGATIONS
# Tested cases by age group (HF only)
test_hf_u5=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_u5_hf",
"test_pos_mic_u5_hf",
"tes_neg_rdt_u5_hf",
"tes_pos_rdt_u5_hf",
],
),
test_hf_5_14=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_5_14_hf",
"test_pos_mic_5_14_hf",
"tes_neg_rdt_5_14_hf",
"tes_pos_rdt_5_14_hf",
],
),
test_hf_ov15=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_ov15_hf",
"test_pos_mic_ov15_hf",
"tes_neg_rdt_ov15_hf",
"tes_pos_rdt_ov15_hf",
],
),
# Tested cases by age group (Community only)
test_com_u5=lambda x: fallback_row_sum(
x, ["tes_neg_rdt_u5_com", "tes_pos_rdt_u5_com"]
),
test_com_5_14=lambda x: fallback_row_sum(
x, ["tes_neg_rdt_5_14_com", "tes_pos_rdt_5_14_com"]
),
test_com_ov15=lambda x: fallback_row_sum(
x, ["tes_neg_rdt_ov15_com", "tes_pos_rdt_ov15_com"]
),
# Suspected cases by age group (rename for consistency)
susp_hf_u5=lambda x: x["susp_u5_hf"],
susp_hf_5_14=lambda x: x["susp_5_14_hf"],
susp_hf_ov15=lambda x: x["susp_ov15_hf"],
susp_com_u5=lambda x: x["susp_u5_com"],
susp_com_5_14=lambda x: x["susp_5_14_com"],
susp_com_ov15=lambda x: x["susp_ov15_com"],
# Confirmed cases by age group (HF only)
conf_hf_u5=lambda x: fallback_row_sum(
x, ["test_pos_mic_u5_hf", "tes_pos_rdt_u5_hf"]
),
conf_hf_5_14=lambda x: fallback_row_sum(
x, ["test_pos_mic_5_14_hf", "tes_pos_rdt_5_14_hf"]
),
conf_hf_ov15=lambda x: fallback_row_sum(
x, ["test_pos_mic_ov15_hf", "tes_pos_rdt_ov15_hf"]
),
# Confirmed cases by age group (Community only)
conf_com_u5=lambda x: x["tes_pos_rdt_u5_com"],
conf_com_5_14=lambda x: x["tes_pos_rdt_5_14_com"],
conf_com_ov15=lambda x: x["tes_pos_rdt_ov15_com"],
# Treated cases by age group (HF only)
maltreat_hf_u5=lambda x: fallback_row_sum(
x, ["maltreat_u24_u5_hf", "maltreat_ov24_u5_hf"]
),
maltreat_hf_5_14=lambda x: fallback_row_sum(
x, ["maltreat_u24_5_14_hf", "maltreat_ov24_5_14_hf"]
),
maltreat_hf_ov15=lambda x: fallback_row_sum(
x, ["maltreat_u24_ov15_hf", "maltreat_ov24_ov15_hf"]
),
# Treated cases by age group (Community only)
maltreat_com_u5=lambda x: fallback_row_sum(
x, ["maltreat_u24_u5_com", "maltreat_ov24_u5_com"]
),
maltreat_com_5_14=lambda x: fallback_row_sum(
x, ["maltreat_u24_5_14_com", "maltreat_ov24_5_14_com"]
),
maltreat_com_ov15=lambda x: fallback_row_sum(
x, ["maltreat_u24_ov15_com", "maltreat_ov24_ov15_com"]
),
# Total treated cases within/after 24 hours
maltreat_u24_hf=lambda x: fallback_row_sum(
x,
["maltreat_u24_u5_hf", "maltreat_u24_5_14_hf", "maltreat_u24_ov15_hf"],
),
maltreat_ov24_hf=lambda x: fallback_row_sum(
x,
["maltreat_ov24_u5_hf", "maltreat_ov24_5_14_hf", "maltreat_ov24_ov15_hf"],
),
maltreat_u24_com=lambda x: fallback_row_sum(
x,
[
"maltreat_u24_u5_com",
"maltreat_u24_5_14_com",
"maltreat_u24_ov15_com",
],
),
maltreat_ov24_com=lambda x: fallback_row_sum(
x,
[
"maltreat_ov24_u5_com",
"maltreat_ov24_5_14_com",
"maltreat_ov24_ov15_com",
],
),
)
.assign(
# second pass: computed from first pass columns
test=lambda x: fallback_row_sum(x, ["test_hf", "test_com"]),
conf=lambda x: fallback_row_sum(x, ["conf_hf", "conf_com"]),
maltreat=lambda x: fallback_row_sum(x, ["maltreat_hf", "maltreat_com"]),
# totals by age group
test_u5=lambda x: fallback_row_sum(x, ["test_hf_u5", "test_com_u5"]),
test_5_14=lambda x: fallback_row_sum(x, ["test_hf_5_14", "test_com_5_14"]),
test_ov15=lambda x: fallback_row_sum(x, ["test_hf_ov15", "test_com_ov15"]),
susp_u5=lambda x: fallback_row_sum(x, ["susp_hf_u5", "susp_com_u5"]),
susp_5_14=lambda x: fallback_row_sum(x, ["susp_hf_5_14", "susp_com_5_14"]),
susp_ov15=lambda x: fallback_row_sum(x, ["susp_hf_ov15", "susp_com_ov15"]),
conf_u5=lambda x: fallback_row_sum(x, ["conf_hf_u5", "conf_com_u5"]),
conf_5_14=lambda x: fallback_row_sum(x, ["conf_hf_5_14", "conf_com_5_14"]),
conf_ov15=lambda x: fallback_row_sum(x, ["conf_hf_ov15", "conf_com_ov15"]),
maltreat_u5=lambda x: fallback_row_sum(
x, ["maltreat_hf_u5", "maltreat_com_u5"]
),
maltreat_5_14=lambda x: fallback_row_sum(
x, ["maltreat_hf_5_14", "maltreat_com_5_14"]
),
maltreat_ov15=lambda x: fallback_row_sum(
x, ["maltreat_hf_ov15", "maltreat_com_ov15"]
),
maltreat_u24_total=lambda x: fallback_row_sum(
x, ["maltreat_u24_hf", "maltreat_u24_com"]
),
maltreat_ov24_total=lambda x: fallback_row_sum(
x, ["maltreat_ov24_hf", "maltreat_ov24_com"]
),
# presumed cases
pres_com=lambda x: fallback_diff(x, "maltreat_com", "conf_com"),
pres_hf=lambda x: fallback_diff(x, "maltreat_hf", "conf_hf"),
pres_com_u5=lambda x: fallback_diff(x, "maltreat_com_u5", "conf_com_u5"),
pres_com_5_14=lambda x: fallback_diff(x, "maltreat_com_5_14", "conf_com_5_14"),
pres_com_ov15=lambda x: fallback_diff(x, "maltreat_com_ov15", "conf_com_ov15"),
pres_hf_u5=lambda x: fallback_diff(x, "maltreat_hf_u5", "conf_hf_u5"),
pres_hf_5_14=lambda x: fallback_diff(x, "maltreat_hf_5_14", "conf_hf_5_14"),
pres_hf_ov15=lambda x: fallback_diff(x, "maltreat_hf_ov15", "conf_hf_ov15"),
)
.assign(
# third pass: totals from presumed
pres=lambda x: fallback_row_sum(x, ["pres_com", "pres_hf"]),
pres_u5=lambda x: fallback_row_sum(x, ["pres_com_u5", "pres_hf_u5"]),
pres_5_14=lambda x: fallback_row_sum(x, ["pres_com_5_14", "pres_hf_5_14"]),
pres_ov15=lambda x: fallback_row_sum(x, ["pres_com_ov15", "pres_hf_ov15"]),
)
)
# inspect results
(
dhis2_df[
dhis2_df["record_id"].isin(
["a80a6352", "aa2eb756", "c8c29999", "eb63621c", "27ee9eeb", "93b16cfe"]
)
][
[
"allout",
"allout_u5",
"allout_ov5",
"pres_hf_u5",
"maltreat_hf_u5",
"conf_hf_u5",
]
]
.sort_values("allout")
.head(6)
)

To adapt the code:

- Lines 11–270: Adjust the variable names to reflect those relevant to your dataset when calculating new variables.

Once updated, run the code.
We can see that the aggregation worked as expected using our fallback functions. The allout column is the sum of allout_u5 and allout_ov5: when either value is NA, the non-NA value is preserved rather than returning NA. Similarly, pres_hf_u5 is derived from the difference between maltreat_hf_u5 and conf_hf_u5. When either value is NA, the result reflects the available data rather than defaulting to NA.
Step 6.2: Quality Control of Indicator Totals
Now let’s check that indicator totals are equal to the sum of their disaggregated components. The code chunk below counts the number of rows where the indicator total does not equal the sum of its components. For example, rows where allout is not equal to the sum of allout_u5 and allout_ov5.
If all totals were manually summed in the previous step, this check is optional; running it anyway can still help you catch calculation errors.
If total columns are extracted directly from DHIS2, it is essential to perform this quality control check for coherence. It confirms which components feed into the aggregated totals extracted from DHIS2.
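In pandas, the same coherence check can be sketched as follows (toy values based on the sample table, with one stored total deliberately wrong so the check fires):

```python
import pandas as pd

# toy rows: the third stored total disagrees with its components
df = pd.DataFrame({
    "allout": [92, 64, 90],
    "allout_u5": [44, 37, 44],
    "allout_ov5": [48, 27, 42],
})

# count rows where the stored total differs from the component sum
allout_mismatch = (df["allout"] != df["allout_u5"] + df["allout_ov5"]).sum()
print(allout_mismatch)
```

With real data, missing values need the same care as in the R chunk: rows where any component is NA should be excluded before counting, otherwise they inflate the mismatch tally.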
Show the code
# create mismatch flags for each indicator group
mismatch_summary <- dhis2_df |>
dplyr::summarise(
# outpatient checks
allout_mismatch = sum(
allout != (allout_u5 + allout_ov5),
na.rm = TRUE
),
# malaria admissions check
maladm_mismatch = sum(
maladm != (maladm_u5 + maladm_5_14 + maladm_ov15),
na.rm = TRUE
),
# total tests check
test_mismatch = sum(
test != (test_hf + test_com),
na.rm = TRUE
),
# total confirmed check
conf_mismatch = sum(
conf != (conf_hf + conf_com),
na.rm = TRUE
),
# total treated check
maltreat_mismatch = sum(
maltreat != (maltreat_hf + maltreat_com),
na.rm = TRUE
),
# total presumed check
pres_mismatch = sum(
pres != (pres_hf + pres_com),
na.rm = TRUE
),
# malaria deaths check
maldth_mismatch = sum(
maldth !=
(maldth_1_59m +
maldth_u5 +
maldth_5_9 +
maldth_10_14 +
maldth_5_14 +
maldth_fem_ov15 +
maldth_mal_ov15 +
maldth_ov15),
na.rm = TRUE
),
# tested cases by age group
test_u5_mismatch = sum(
test_u5 != (test_hf_u5 + test_com_u5),
na.rm = TRUE
),
test_5_14_mismatch = sum(
test_5_14 != (test_hf_5_14 + test_com_5_14),
na.rm = TRUE
),
test_ov15_mismatch = sum(
test_ov15 != (test_hf_ov15 + test_com_ov15),
na.rm = TRUE
),
# confirmed cases by age group
conf_u5_mismatch = sum(
conf_u5 != (conf_hf_u5 + conf_com_u5),
na.rm = TRUE
),
conf_5_14_mismatch = sum(
conf_5_14 != (conf_hf_5_14 + conf_com_5_14),
na.rm = TRUE
),
conf_ov15_mismatch = sum(
conf_ov15 != (conf_hf_ov15 + conf_com_ov15),
na.rm = TRUE
),
# presumed cases by age group
pres_u5_mismatch = sum(
pres_u5 != (pres_hf_u5 + pres_com_u5),
na.rm = TRUE
),
pres_5_14_mismatch = sum(
pres_5_14 != (pres_hf_5_14 + pres_com_5_14),
na.rm = TRUE
),
pres_ov15_mismatch = sum(
pres_ov15 != (pres_hf_ov15 + pres_com_ov15),
na.rm = TRUE
),
# suspected cases by age group
susp_u5_mismatch = sum(
susp_u5 != (susp_hf_u5 + susp_com_u5),
na.rm = TRUE
),
susp_5_14_mismatch = sum(
susp_5_14 != (susp_hf_5_14 + susp_com_5_14),
na.rm = TRUE
),
susp_ov15_mismatch = sum(
susp_ov15 != (susp_hf_ov15 + susp_com_ov15),
na.rm = TRUE
),
# treated cases by age group
maltreat_u5_mismatch = sum(
maltreat_u5 != (maltreat_hf_u5 + maltreat_com_u5),
na.rm = TRUE
),
maltreat_5_14_mismatch = sum(
maltreat_5_14 != (maltreat_hf_5_14 + maltreat_com_5_14),
na.rm = TRUE
),
maltreat_ov15_mismatch = sum(
maltreat_ov15 != (maltreat_hf_ov15 + maltreat_com_ov15),
na.rm = TRUE
),
# treatment timing checks
maltreat_u24_mismatch = sum(
maltreat_u24_total != (maltreat_u24_hf + maltreat_u24_com),
na.rm = TRUE
),
maltreat_ov24_mismatch = sum(
maltreat_ov24_total != (maltreat_ov24_hf + maltreat_ov24_com),
na.rm = TRUE
)
) |>
# pivot to long format for easier viewing
tidyr::pivot_longer(
cols = dplyr::everything(),
names_to = "indicator",
values_to = "n_mismatches"
) |>
# filter to show only indicators with mismatches
dplyr::filter(n_mismatches > 0)
mismatch_summary

To adapt the code:
- Line 3: Replace dhis2_df with your target dataset.
- Lines 6–52: Adjust the indicator checks to match your data. Each check compares a total column against the sum of its components. Add or remove checks based on the indicators available in your dataset.
- Lines 54–126: These check age-disaggregated indicators. Modify column names to match your age group naming conventions (e.g., _u5, _5_14, _ov15).
- Line 135: The output filters to show only indicators with mismatches. Remove this filter if you want to see all indicators regardless of mismatch status.
Show the code
# create mismatch flags for each indicator group
# helper function to count mismatches, ignoring NAs
def count_mismatch(total, components):
"""Count mismatches between total and sum of components, ignoring NAs."""
calculated = components.sum(axis=1)
# only compare where both total and calculated are not NA
mask = total.notna() & calculated.notna()
return (total[mask] != calculated[mask]).sum()
mismatch_summary = pd.DataFrame({
# outpatient checks
"allout_mismatch": [
count_mismatch(
dhis2_df["allout"],
dhis2_df[["allout_u5", "allout_ov5"]]
)
],
# malaria admissions check
"maladm_mismatch": [
count_mismatch(
dhis2_df["maladm"],
dhis2_df[["maladm_u5", "maladm_5_14", "maladm_ov15"]]
)
],
# total tests check
"test_mismatch": [
count_mismatch(
dhis2_df["test"],
dhis2_df[["test_hf", "test_com"]]
)
],
# total confirmed check
"conf_mismatch": [
count_mismatch(
dhis2_df["conf"],
dhis2_df[["conf_hf", "conf_com"]]
)
],
# total treated check
"maltreat_mismatch": [
count_mismatch(
dhis2_df["maltreat"],
dhis2_df[["maltreat_hf", "maltreat_com"]]
)
],
# total presumed check
"pres_mismatch": [
count_mismatch(
dhis2_df["pres"],
dhis2_df[["pres_hf", "pres_com"]]
)
],
# malaria deaths check
"maldth_mismatch": [
count_mismatch(
dhis2_df["maldth"],
dhis2_df[[
"maldth_1_59m", "maldth_u5", "maldth_5_9", "maldth_10_14",
"maldth_5_14", "maldth_fem_ov15", "maldth_mal_ov15", "maldth_ov15"
]]
)
],
# tested cases by age group
"test_u5_mismatch": [
count_mismatch(
dhis2_df["test_u5"],
dhis2_df[["test_hf_u5", "test_com_u5"]]
)
],
"test_5_14_mismatch": [
count_mismatch(
dhis2_df["test_5_14"],
dhis2_df[["test_hf_5_14", "test_com_5_14"]]
)
],
"test_ov15_mismatch": [
count_mismatch(
dhis2_df["test_ov15"],
dhis2_df[["test_hf_ov15", "test_com_ov15"]]
)
],
# confirmed cases by age group
"conf_u5_mismatch": [
count_mismatch(
dhis2_df["conf_u5"],
dhis2_df[["conf_hf_u5", "conf_com_u5"]]
)
],
"conf_5_14_mismatch": [
count_mismatch(
dhis2_df["conf_5_14"],
dhis2_df[["conf_hf_5_14", "conf_com_5_14"]]
)
],
"conf_ov15_mismatch": [
count_mismatch(
dhis2_df["conf_ov15"],
dhis2_df[["conf_hf_ov15", "conf_com_ov15"]]
)
],
# presumed cases by age group
"pres_u5_mismatch": [
count_mismatch(
dhis2_df["pres_u5"],
dhis2_df[["pres_hf_u5", "pres_com_u5"]]
)
],
"pres_5_14_mismatch": [
count_mismatch(
dhis2_df["pres_5_14"],
dhis2_df[["pres_hf_5_14", "pres_com_5_14"]]
)
],
"pres_ov15_mismatch": [
count_mismatch(
dhis2_df["pres_ov15"],
dhis2_df[["pres_hf_ov15", "pres_com_ov15"]]
)
],
# suspected cases by age group
"susp_u5_mismatch": [
count_mismatch(
dhis2_df["susp_u5"],
dhis2_df[["susp_hf_u5", "susp_com_u5"]]
)
],
"susp_5_14_mismatch": [
count_mismatch(
dhis2_df["susp_5_14"],
dhis2_df[["susp_hf_5_14", "susp_com_5_14"]]
)
],
"susp_ov15_mismatch": [
count_mismatch(
dhis2_df["susp_ov15"],
dhis2_df[["susp_hf_ov15", "susp_com_ov15"]]
)
],
# treated cases by age group
"maltreat_u5_mismatch": [
count_mismatch(
dhis2_df["maltreat_u5"],
dhis2_df[["maltreat_hf_u5", "maltreat_com_u5"]]
)
],
"maltreat_5_14_mismatch": [
count_mismatch(
dhis2_df["maltreat_5_14"],
dhis2_df[["maltreat_hf_5_14", "maltreat_com_5_14"]]
)
],
"maltreat_ov15_mismatch": [
count_mismatch(
dhis2_df["maltreat_ov15"],
dhis2_df[["maltreat_hf_ov15", "maltreat_com_ov15"]]
)
],
# treatment timing checks
"maltreat_u24_mismatch": [
count_mismatch(
dhis2_df["maltreat_u24_total"],
dhis2_df[["maltreat_u24_hf", "maltreat_u24_com"]]
)
],
"maltreat_ov24_mismatch": [
count_mismatch(
dhis2_df["maltreat_ov24_total"],
dhis2_df[["maltreat_ov24_hf", "maltreat_ov24_com"]]
)
]
})
# pivot to long format for easier viewing
mismatch_summary = (
mismatch_summary
.melt(var_name="indicator", value_name="n_mismatches")
# filter to show only indicators with mismatches
.query("n_mismatches > 0")
)
mismatch_summary

To adapt the code:
- Lines 10–156: Replace dhis2_df with your target dataset.
- Lines 10–58: Adjust the indicator checks to match your data. Each check compares a total column against the sum of its components. Add or remove checks based on the indicators available in your dataset.
- Lines 60–156: These check age-disaggregated indicators. Modify column names to match your age group naming conventions (e.g., _u5, _5_14, _ov15).
- Line 164: The output filters to show only indicators with mismatches. Remove .query("n_mismatches > 0") if you want to see all indicators regardless of mismatch status.
Step 6.3: Export Rows with Incoherent Totals
If incoherent totals are detected, they can be exported for further assessment using the code below. Exported outputs should be shared with the SNT team for review and guidance.
Show the code
# Identify rows with incoherent totals
incoherent_rows <- dhis2_df |>
dplyr::filter(
# outpatient checks
allout != (allout_u5 + allout_ov5) |
# malaria admissions check
maladm != (maladm_u5 + maladm_5_14 + maladm_ov15) |
# total tests check
test != (test_hf + test_com) |
# total confirmed check
conf != (conf_hf + conf_com) |
# total treated check
maltreat != (maltreat_hf + maltreat_com) |
# total presumed check
pres != (pres_hf + pres_com)
) |>
dplyr::select(
hf, periodname,
dplyr::matches("allout|maladm|test|conf|maltreat|pres")
)
# set path for saving
output_file <- here::here(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
rio::export(incoherent_rows, file = output_file)

To adapt the code:
- Line 2: Replace dhis2_df with your target dataset.
- Lines 3–16: Adjust the filter conditions to match your indicator totals and components.
- Lines 23–29: Update the output file path to your preferred location.
Once updated, run the code to export rows with incoherent totals for review with the SNT team.
Show the code
# identify rows with incoherent totals
incoherent_rows = dhis2_df[
# outpatient checks
(dhis2_df["allout"] != (dhis2_df["allout_u5"] + dhis2_df["allout_ov5"])) |
# malaria admissions check
(dhis2_df["maladm"] != (dhis2_df["maladm_u5"] + dhis2_df["maladm_5_14"] + dhis2_df["maladm_ov15"])) |
# total tests check
(dhis2_df["test"] != (dhis2_df["test_hf"] + dhis2_df["test_com"])) |
# total confirmed check
(dhis2_df["conf"] != (dhis2_df["conf_hf"] + dhis2_df["conf_com"])) |
# total treated check
(dhis2_df["maltreat"] != (dhis2_df["maltreat_hf"] + dhis2_df["maltreat_com"])) |
# total presumed check
(dhis2_df["pres"] != (dhis2_df["pres_hf"] + dhis2_df["pres_com"]))
]
# select relevant columns
incoherent_rows = incoherent_rows[
["hf", "periodname"] +
[col for col in incoherent_rows.columns
if any(x in col for x in ["allout", "maladm", "test", "conf", "maltreat", "pres"])]
]
# set path for saving
output_file = Path(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
incoherent_rows.to_excel(output_file, index=False)

To adapt the code:
- Line 2: Replace dhis2_df with your target dataset.
- Lines 2–15: Adjust the filter conditions to match your indicator totals and components.
- Lines 25–30: Update the output file path to your preferred location.
Once updated, run the code to export rows with incoherent totals for review with the SNT team.
Step 6.4: Add IPD/OPD Specification
It may be useful in later analyses to subset routine data to only inpatient or only outpatient facilities or departments. For example, you may want to analyse malaria admissions only from facilities with inpatient capacity, or focus on outpatient case management indicators from primary care facilities.
Here, we create a column that specifies whether a facility provides inpatient (IPD) or outpatient (OPD) services. This classification can be based on:
- Facility type naming conventions (e.g., “Hospital” = IPD, “CHP” = OPD)
- Presence of inpatient indicators (e.g., facilities reporting admissions or deaths)
- A reference lookup file from the SNT team or health ministry
Facility classification varies by country. Confirm with the SNT team which facility types should be classified as IPD vs OPD. Some facilities may offer both services and require special handling.
Show the code
# option 1: classify based on facility name patterns
dhis2_df <- dhis2_df |>
dplyr::mutate(
facility_type = dplyr::case_when(
# inpatient facilities (hospitals)
stringr::str_detect(
hf,
regex("hospital|hosp", ignore_case = TRUE)
) ~ "IPD",
stringr::str_detect(
hf,
regex("district hospital|regional hospital", ignore_case = TRUE)
) ~ "IPD",
# outpatient facilities (clinics, health posts, community health)
stringr::str_detect(
hf,
regex(
"CHP|CHC|MCHP|clinic|health post|health centre",
ignore_case = TRUE
)
) ~ "OPD",
# default to OPD if no pattern matched
TRUE ~ "OPD"
)
)
# option 2: classify based on presence of inpatient indicators
dhis2_df <- dhis2_df |>
dplyr::group_by(hf_uid) |>
dplyr::mutate(
# facility has IPD if it ever reports admissions or deaths
has_ipd = any(!is.na(maladm) & maladm > 0, na.rm = TRUE) |
any(!is.na(maldth) & maldth > 0, na.rm = TRUE),
facility_type = dplyr::if_else(has_ipd, "IPD", "OPD")
) |>
dplyr::ungroup() |>
dplyr::select(-has_ipd)
# option 3: use a reference lookup file
facility_lookup <-
rio::import(
here::here(
"01_data",
"1.1_foundational",
"1.1b_health_facilities",
"processed",
"health_facility_master_list.xlsx"
)
)
dhis2_df <- dhis2_df |>
dplyr::left_join(
facility_lookup |> dplyr::select(hf_uid, facility_type),
by = "hf_uid"
)
# check classification distribution
dhis2_df |>
dplyr::distinct(hf_uid, facility_type) |>
dplyr::count(facility_type)

To adapt the code:
- Line 12: Replace dhis2_df with your target dataset.
- Lines 16–31 (Option 1): Adjust the str_detect() patterns to match your country’s facility naming conventions (e.g., “dispensary”, “health center”, “referral hospital”).
- Lines 40–43 (Option 2): Modify the indicator columns (maladm, maldth) if your dataset uses different names for inpatient indicators.
- Lines 47–55 (Option 3): Update the path to your facility classification lookup file if using a reference list from the SNT team.
Choose the option that best fits your data and context. Option 3 (reference lookup) is recommended when an official facility classification exists.
Show the code
# option 1: classify based on facility name patterns
def classify_facility(hf_name):
"""Classify facility as IPD or OPD based on name patterns."""
hf_lower = hf_name.lower() if pd.notna(hf_name) else ""
# inpatient facilities (hospitals)
if any(x in hf_lower for x in ["hospital", "hosp"]):
return "IPD"
# outpatient facilities (clinics, health posts, community health)
elif any(x in hf_lower for x in ["chp", "chc", "mchp", "clinic", "health post", "health centre"]):
return "OPD"
# default to OPD
else:
return "OPD"
dhis2_df["facility_type"] = dhis2_df["hf"].apply(classify_facility)
# option 2: classify based on presence of inpatient indicators
has_ipd = (
dhis2_df
.groupby("hf_uid")
.apply(
lambda x: ((x["maladm"].notna() & (x["maladm"] > 0)).any() |
(x["maldth"].notna() & (x["maldth"] > 0)).any())
)
.reset_index(name="has_ipd")
)
dhis2_df = dhis2_df.merge(has_ipd, on="hf_uid", how="left")
dhis2_df["facility_type"] = dhis2_df["has_ipd"].map({True: "IPD", False: "OPD"})
dhis2_df = dhis2_df.drop(columns="has_ipd")
# option 3: use a reference lookup file
facility_lookup = pd.read_excel(
Path("path/to/facility_classification.xlsx")
)
dhis2_df = dhis2_df.merge(
facility_lookup[["hf_uid", "facility_type"]],
on="hf_uid",
how="left"
)
# check classification distribution
dhis2_df[["hf_uid", "facility_type"]].drop_duplicates()["facility_type"].value_counts()

To adapt the code:
- Lines 6–13 (Option 1): Adjust the string patterns in classify_facility() to match your country’s facility naming conventions (e.g., “dispensary”, “health center”, “referral hospital”).
- Lines 17–25 (Option 2): Modify the indicator columns (maladm, maldth) if your dataset uses different names for inpatient indicators.
- Lines 28–35 (Option 3): Update the path to your facility classification lookup file if using a reference list from the SNT team.
Choose the option that best fits your data and context. Option 3 (reference lookup) is recommended when an official facility classification exists.
Step 7: Finalize Data
Step 7.1: Troubleshoot Duplicate HF-Month Records with Different Data
Here we search for rows in the dataset that correspond to the same HF-month report but contain different data. If any are found, they must be cleaned at this stage so that the dataset contains only one report for each combination of HF and month. Expect to work closely with the SNT team or DHIS2 focal person at this stage to ensure correct data handling decisions are made.
Duplicate HF-month records can arise from multiple data entry submissions for the same reporting period, data imports from different sources being combined, system errors during DHIS2 extraction, or facilities reporting through both paper and electronic systems.
Unresolved duplicates will inflate case counts and distort indicators. Identifying and resolving them is essential before any analysis.
Show the code
# identify duplicate hf-month combinations
duplicates <- dhis2_df |>
dplyr::group_by(record_id) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::ungroup()
# count number of duplicate pairs
n_duplicates <- duplicates |>
dplyr::distinct(record_id) |>
nrow()
# if duplicates exist, inspect them
if (n_duplicates > 0) {
# view duplicate records with key indicators
duplicate_details <- duplicates |>
dplyr::select(
record_id,
adm0, adm1, adm2, adm3, hf, yearmon,
allout,
test,
conf,
maltreat
) |>
dplyr::arrange(record_id)
print(duplicate_details)
# export for review with SNT team
rio::export(
duplicate_details,
here::here(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_duplicate_records_dhis2.xlsx"
)
)
}
# option 1: keep first record (if duplicates are exact copies)
dhis2_df <- dhis2_df |>
dplyr::distinct(record_id, .keep_all = TRUE)
# option 2: keep record with most complete data
dhis2_df <- dhis2_df |>
dplyr::group_by(record_id) |>
dplyr::slice_max(
# count non-NA values across indicator columns
order_by = rowSums(!is.na(dplyr::across(dplyr::where(is.numeric)))),
n = 1,
with_ties = FALSE
) |>
dplyr::ungroup()
# option 3: sum duplicates (if records represent partial reports)
dhis2_df <- dhis2_df |>
dplyr::group_by(record_id, hf, adm0, adm1, adm2, adm3) |>
dplyr::summarise(
dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)),
.groups = "drop"
)

To adapt the code:
- Line 3: Replace dhis2_df with your target dataset.
- Lines 18–21: Adjust the columns selected for duplicate inspection based on your key indicators.
- Lines 26–33: Update the output file path for exporting duplicates.
- Lines 36–56: Choose the appropriate resolution option based on your context.
Once updated, run the code to identify and resolve duplicate HF-month records.
Show the code
# identify duplicate hf-month combinations
duplicates = dhis2_df.groupby(["hf_uid", "yearmon"]).filter(lambda x: len(x) > 1)
# count number of duplicate pairs
n_duplicates = duplicates[["hf_uid", "yearmon"]].drop_duplicates().shape[0]
# if duplicates exist, inspect them
if n_duplicates > 0:
# view duplicate records with key indicators
duplicate_details = duplicates[
["record_id", "hf_uid", "yearmon", "allout", "test", "conf", "maltreat"]
].sort_values(["hf_uid", "yearmon"])
print(duplicate_details)
# export for review with SNT team
duplicate_details.to_excel(
Path(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_duplicate_records_dhis2.xlsx",
),
index=False,
)
# option 1: keep first record (if duplicates are exact copies)
dhis2_df = dhis2_df.drop_duplicates(subset=["hf_uid", "yearmon"], keep="first")
# option 2: keep record with most complete data
dhis2_df = (
dhis2_df.assign(
n_complete=lambda x: x.select_dtypes(include="number").notna().sum(axis=1)
)
.sort_values("n_complete", ascending=False)
.drop_duplicates(subset=["record_id"], keep="first")
.drop(columns="n_complete")
)
# option 3: sum duplicates (if records represent partial reports)
group_cols = ["hf_uid", "yearmon", "hf", "adm0", "adm1", "adm2", "adm3"]
numeric_cols = dhis2_df.select_dtypes(include="number").columns.tolist()
dhis2_df = dhis2_df.groupby(group_cols, as_index=False)[numeric_cols].sum()
# verify duplicates resolved
n_remaining = dhis2_df.groupby(["record_id"]).filter(lambda x: len(x) > 1).shape[0]

To adapt the code:
- Line 3: Replace dhis2_df with your target dataset.
- Lines 18–21: Adjust the columns selected for duplicate inspection based on your key indicators.
- Lines 26–33: Update the output file path for exporting duplicates.
- Lines 36–56: Choose the appropriate resolution option based on your context.
Once updated, run the code to identify and resolve duplicate HF-month records.
Consult with the SNT team before choosing how to resolve duplicates. The correct approach depends on why duplicates exist. Exact duplicates from system errors can be handled by keeping the first record (Option 1). When records have different completeness from re-submissions, keep the most complete record (Option 2). Partial reports from split submissions may need to be summed together (Option 3). Conflicting data requiring investigation should be exported and resolved manually with the SNT team.
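Before picking an option, it can help to separate exact-copy duplicates from genuinely conflicting ones, since only the latter need manual review. The sketch below uses toy data and assumes a record_id column plus numeric indicator columns, as in the workflow above:

```python
import pandas as pd

# toy data: one exact-copy duplicate pair and one conflicting pair
df = pd.DataFrame({
    "record_id": ["a", "a", "b", "b"],
    "allout": [10, 10, 5, 8],
})

# record_ids that appear more than once
dup_ids = df["record_id"][df["record_id"].duplicated()].unique()
dups = df[df["record_id"].isin(dup_ids)]

# exact copies: every column identical -> safe to keep the first (Option 1)
exact = dups[dups.duplicated(keep=False)]

# conflicting: same record_id but different values -> export for review
conflicting = dups[~dups.duplicated(keep=False)]
```

Here `dups.duplicated(keep=False)` flags rows whose entire contents repeat, so record "a" lands in `exact` and record "b" in `conflicting`.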
Step 7.2: Generate Final Data Dictionary
A complete data dictionary documents all columns in the preprocessed dataset, including both the original DHIS2 variables and the computed/derived columns created during preprocessing. This dictionary serves as a reference for downstream analyses and facilitates collaboration with the SNT team.
Show the code
# define descriptions for computed/derived columns
computed_vars <- tibble::tribble(
~snt_var, ~indicator_label,
# time columns
"date", "Report date (YYYY-MM-DD)",
"year", "Report year",
"month", "Report month (1-12)",
"yearmon", "Year-month label (e.g., Jan 2020)",
# identifier columns
"hf_uid", "Unique health facility identifier (hash)",
"record_id", "Unique record identifier (facility + month hash)",
"location_short", "Location label: adm1 ~ adm2",
"location_full", "Location label: adm1 ~ adm2 ~ adm3 ~ hf",
"facility_type", "Facility type (IPD/OPD)",
# aggregated totals
"allout", "Total outpatient visits (allout_u5 + allout_ov5)",
"susp", "Total suspected cases (all ages, HF + COM)",
"test", "Total tested (test_hf + test_com)",
"test_hf", "Total tested at health facility",
"test_com", "Total tested in community",
"conf", "Total confirmed cases (conf_hf + conf_com)",
"conf_hf", "Total confirmed cases at health facility",
"conf_com", "Total confirmed cases in community",
"maltreat", "Total treated cases (maltreat_hf + maltreat_com)",
"maltreat_hf", "Total treated cases at health facility",
"maltreat_com", "Total treated cases in community",
"pres", "Total presumed cases (pres_hf + pres_com)",
"pres_hf", "Total presumed cases at health facility",
"pres_com", "Total presumed cases in community",
"maladm", "Total malaria admissions (all ages)",
"maldth", "Total malaria deaths (all ages)",
# age-group totals
"test_u5", "Total tested under 5 (HF + COM)",
"test_5_14", "Total tested 5-14 years (HF + COM)",
"test_ov15", "Total tested over 15 years (HF + COM)",
"test_hf_u5", "Tested under 5 at health facility",
"test_hf_5_14", "Tested 5-14 years at health facility",
"test_hf_ov15", "Tested over 15 years at health facility",
"test_com_u5", "Tested under 5 in community",
"test_com_5_14", "Tested 5-14 years in community",
"test_com_ov15", "Tested over 15 years in community",
"susp_u5", "Suspected cases under 5 (HF + COM)",
"susp_5_14", "Suspected cases 5-14 years (HF + COM)",
"susp_ov15", "Suspected cases over 15 years (HF + COM)",
"susp_hf_u5", "Suspected cases under 5 at health facility",
"susp_hf_5_14", "Suspected cases 5-14 years at health facility",
"susp_hf_ov15", "Suspected cases over 15 years at health facility",
"susp_com_u5", "Suspected cases under 5 in community",
"susp_com_5_14", "Suspected cases 5-14 years in community",
"susp_com_ov15", "Suspected cases over 15 years in community",
"conf_u5", "Confirmed cases under 5 (HF + COM)",
"conf_5_14", "Confirmed cases 5-14 years (HF + COM)",
"conf_ov15", "Confirmed cases over 15 years (HF + COM)",
"conf_hf_u5", "Confirmed cases under 5 at health facility",
"conf_hf_5_14", "Confirmed cases 5-14 years at health facility",
"conf_hf_ov15", "Confirmed cases over 15 years at health facility",
"conf_com_u5", "Confirmed cases under 5 in community",
"conf_com_5_14", "Confirmed cases 5-14 years in community",
"conf_com_ov15", "Confirmed cases over 15 years in community",
"maltreat_u5", "Treated cases under 5 (HF + COM)",
"maltreat_5_14", "Treated cases 5-14 years (HF + COM)",
"maltreat_ov15", "Treated cases over 15 years (HF + COM)",
"maltreat_hf_u5", "Treated cases under 5 at health facility",
"maltreat_hf_5_14", "Treated cases 5-14 years at health facility",
"maltreat_hf_ov15", "Treated cases over 15 years at health facility",
"maltreat_com_u5", "Treated cases under 5 in community",
"maltreat_com_5_14", "Treated cases 5-14 years in community",
"maltreat_com_ov15", "Treated cases over 15 years in community",
"maltreat_u24_hf", "Treated cases within 24hrs at health facility",
"maltreat_ov24_hf", "Treated cases after 24hrs at health facility",
"maltreat_u24_com", "Treated cases within 24hrs in community",
"maltreat_ov24_com", "Treated cases after 24hrs in community",
"maltreat_u24_total", "Total treated cases within 24hrs (HF + COM)",
"maltreat_ov24_total", "Total treated cases after 24hrs (HF + COM)",
"pres_u5", "Presumed cases under 5 (HF + COM)",
"pres_5_14", "Presumed cases 5-14 years (HF + COM)",
"pres_ov15", "Presumed cases over 15 years (HF + COM)",
"pres_hf_u5", "Presumed cases under 5 at health facility",
"pres_hf_5_14", "Presumed cases 5-14 years at health facility",
"pres_hf_ov15", "Presumed cases over 15 years at health facility",
"pres_com_u5", "Presumed cases under 5 in community",
"pres_com_5_14", "Presumed cases 5-14 years in community",
"pres_com_ov15", "Presumed cases over 15 years in community"
)
# combine original and computed dictionaries
full_data_dict <- dplyr::bind_rows(
data_dict,
computed_vars
) |>
# keep only columns present in final dataset
dplyr::filter(snt_var %in% names(dhis2_df)) |>
dplyr::arrange(snt_var) |>
dplyr::select(snt_variable = snt_var, label = indicator_label)
# check
full_data_dict |>
head()

To adapt the code:
- Lines 4–77: Review and update computed_vars to match your computed variables. Add or remove rows as needed.
- Lines 85–91: Update the output file path for the final data dictionary.
Once updated, run the code to generate and export the final data dictionary.
Show the code
# define descriptions for computed/derived columns
computed_vars = pd.DataFrame([
# time columns
{"snt_var": "date", "indicator_label": "Report date (YYYY-MM-DD)"},
{"snt_var": "year", "indicator_label": "Report year"},
{"snt_var": "month", "indicator_label": "Report month (1-12)"},
{"snt_var": "yearmon", "indicator_label": "Year-month label (e.g., Jan 2020)"},
# identifier columns
{"snt_var": "hf_uid", "indicator_label": "Unique health facility identifier (hash)"},
{"snt_var": "record_id", "indicator_label": "Unique record identifier (facility + month hash)"},
{"snt_var": "location_short", "indicator_label": "Location label: adm1 ~ adm2"},
{"snt_var": "location_full", "indicator_label": "Location label: adm1 ~ adm2 ~ adm3 ~ hf"},
{"snt_var": "facility_type", "indicator_label": "Facility type (IPD/OPD)"},
# aggregated totals
{"snt_var": "allout", "indicator_label": "Total outpatient visits (allout_u5 + allout_ov5)"},
{"snt_var": "susp", "indicator_label": "Total suspected cases (all ages, HF + COM)"},
{"snt_var": "test", "indicator_label": "Total tested (test_hf + test_com)"},
{"snt_var": "test_hf", "indicator_label": "Total tested at health facility"},
{"snt_var": "test_com", "indicator_label": "Total tested in community"},
{"snt_var": "conf", "indicator_label": "Total confirmed cases (conf_hf + conf_com)"},
{"snt_var": "conf_hf", "indicator_label": "Total confirmed cases at health facility"},
{"snt_var": "conf_com", "indicator_label": "Total confirmed cases in community"},
{"snt_var": "maltreat", "indicator_label": "Total treated cases (maltreat_hf + maltreat_com)"},
{"snt_var": "maltreat_hf", "indicator_label": "Total treated cases at health facility"},
{"snt_var": "maltreat_com", "indicator_label": "Total treated cases in community"},
{"snt_var": "pres", "indicator_label": "Total presumed cases (pres_hf + pres_com)"},
{"snt_var": "pres_hf", "indicator_label": "Total presumed cases at health facility"},
{"snt_var": "pres_com", "indicator_label": "Total presumed cases in community"},
{"snt_var": "maladm", "indicator_label": "Total malaria admissions (all ages)"},
{"snt_var": "maldth", "indicator_label": "Total malaria deaths (all ages)"},
# age-group totals
{"snt_var": "test_u5", "indicator_label": "Total tested under 5 (HF + COM)"},
{"snt_var": "test_5_14", "indicator_label": "Total tested 5-14 years (HF + COM)"},
{"snt_var": "test_ov15", "indicator_label": "Total tested over 15 years (HF + COM)"},
{"snt_var": "conf_u5", "indicator_label": "Confirmed cases under 5 (HF + COM)"},
{"snt_var": "conf_5_14", "indicator_label": "Confirmed cases 5-14 years (HF + COM)"},
{"snt_var": "conf_ov15", "indicator_label": "Confirmed cases over 15 years (HF + COM)"},
{"snt_var": "maltreat_u5", "indicator_label": "Treated cases under 5 (HF + COM)"},
{"snt_var": "maltreat_5_14", "indicator_label": "Treated cases 5-14 years (HF + COM)"},
{"snt_var": "maltreat_ov15", "indicator_label": "Treated cases over 15 years (HF + COM)"},
{"snt_var": "pres_u5", "indicator_label": "Presumed cases under 5 (HF + COM)"},
{"snt_var": "pres_5_14", "indicator_label": "Presumed cases 5-14 years (HF + COM)"},
{"snt_var": "pres_ov15", "indicator_label": "Presumed cases over 15 years (HF + COM)"},
# add remaining age-group variables as needed...
])
# combine original and computed dictionaries
full_data_dict = (
pd.concat([data_dict, computed_vars], ignore_index=True)
.rename(columns={"snt_var": "snt_variable", "indicator_label": "label"})
[["snt_variable", "label"]]
)
# keep only columns present in final dataset
full_data_dict = (
full_data_dict[full_data_dict["snt_variable"].isin(dhis2_df.columns)]
.sort_values("snt_variable")
)
# check
full_data_dict.head(10)

To adapt the code:
- Lines 7–49: Review and update computed_vars to match your computed variables. Add or remove rows as needed.
- Lines 55–61: Update the output file path for the final data dictionary.
Once updated, run the code to generate and export the final data dictionary.
Step 7.3: Arrange Final Column Order
Before exporting, we organise columns in a logical order to make the dataset easier to navigate. Identifiers and location columns come first, followed by time variables, then all indicator columns.
Show the code
# arrange columns in logical order
dhis2_df <- dhis2_df |>
dplyr::select(
# identifiers
record_id,
# location hierarchy
adm0,
adm1,
adm2,
adm3,
hf,
hf_uid,
location_short,
location_full,
facility_type,
# time variables
date,
yearmon,
year,
month,
# all remaining indicator columns
dplyr::everything()
)
# check column order
colnames(dhis2_df) |> head(20)

To adapt the code:
- Lines 6–7: Adjust identifier columns based on your dataset.
- Lines 9–16: Modify location columns to match your administrative hierarchy.
- Lines 18–21: Update time columns if using different temporal variables (e.g., epiweek).
Once updated, run the code to reorder columns before exporting the final dataset.
Show the code
# define column order
id_cols = ["record_id"]
location_cols = ["adm0", "adm1", "adm2", "adm3", "hf", "hf_uid",
"location_short", "location_full", "facility_type"]
time_cols = ["date", "yearmon", "year", "month"]
# get remaining columns in original order
ordered_cols = id_cols + location_cols + time_cols
remaining_cols = [col for col in dhis2_df.columns if col not in ordered_cols]
# reorder dataframe
dhis2_df = dhis2_df[ordered_cols + remaining_cols]
# check column order
print(dhis2_df.columns[:20].tolist())

To adapt the code:
- Line 4: Adjust identifier columns based on your dataset.
- Lines 5–6: Modify location columns to match your administrative hierarchy.
- Line 7: Update time columns if using different temporal variables (e.g., epiweek).
Once updated, run the code to reorder columns before exporting the final dataset.
Step 8: Aggregate and Save Data
Step 8.1: Save Data at the Health Facility Level
Since some SNT analyses rely on facility-specific information, we will now save the data at the health facility level. Keeping the data at health facility level ensures that facility-level analyses can be done accurately and without loss of detail.
Show the code
# set up output path
save_path <- here::here(
"01_data",
"02_epidemiology",
"2a_routine_surveillance",
"processed"
)
# create list to save
dhis2_hf_list <- list(
data = dhis2_df,
dictionary = full_data_dict
)
# save to xlsx
rio::export(
dhis2_hf_list,
here::here(save_path, "sle_dhis2_health_facility_data.xlsx")
)
# save to rds
rio::export(
dhis2_hf_list,
here::here(save_path, "sle_dhis2_health_facility_data.rds")
)
To adapt the code:
- Lines 2-7: Update save_path to your preferred output directory.
- Lines 10-13: The list includes both the cleaned data and the data dictionary. Add or remove items as needed.
- Lines 16-19: Update the filename prefix (e.g., sle_) to match your country code.
Once updated, run the code to export the final preprocessed dataset and data dictionary.
Code
save_path = Path(
here("01_data/02_epidemiology/2a_routine_surveillance/processed")
)
# save data to xlsx
dhis2_df.to_excel(
save_path / "sle_dhis2_health_facility_data.xlsx",
index=False
)
# save dictionary to xlsx
full_data_dict.to_excel(
save_path / "sle_dhis2_health_facility_dict.xlsx",
index=False
)
# save to parquet
dhis2_df.to_parquet(
save_path / "sle_dhis2_health_facility_data.parquet",
compression="zstd",
index=False
)
To adapt the code:
- Lines 2-4: Update save_path to your preferred output directory.
- Lines 7-10: Update the filename prefix (e.g., sle_) to match your country code.
- Lines 13-16: Save the dictionary separately if needed.
- Lines 19-23: The .parquet format is efficient for large datasets. Use .csv instead if sharing with non-Python users.
Once updated, run the code to export the final preprocessed dataset and data dictionary.
Step 8.2: Aggregate and Save Data at Each Admin Unit Level
Now we aggregate and save the data at different administrative levels to support various types of analysis and ensure alignment with how interventions are typically planned and monitored. This allows flexibility when calculating indicators, comparing trends across regions, or linking with other datasets structured at adm0, adm1, adm2, adm3, or facility level.
When aggregating data to administrative levels, not all indicators can simply be summed. For example, if you have already calculated a test positivity rate or treatment rate in your dataset, these ratio indicators must be re-calculated from the summed numerators and denominators at the new administrative level rather than summed or averaged.
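To see why, compare the two approaches on toy numbers (hypothetical facility values, not the real dataset): re-calculating the test positivity rate from summed counts gives the correct district value, while averaging facility-level rates ignores facility size.

```python
import pandas as pd

# two hypothetical facilities in one district
hf = pd.DataFrame({
    "adm2": ["BO", "BO"],
    "conf": [10, 90],     # confirmed cases (numerator)
    "test": [100, 300],   # tests performed (denominator)
})
hf["tpr"] = hf["conf"] / hf["test"]  # facility-level positivity: 0.10 and 0.30

# correct: sum the counts first, then re-calculate the rate
adm2 = hf.groupby("adm2", as_index=False)[["conf", "test"]].sum()
adm2["tpr"] = adm2["conf"] / adm2["test"]  # 100 / 400 = 0.25

# misleading: a simple average of facility rates weights the small
# facility the same as the large one (gives 0.20 here, not 0.25)
naive_tpr = hf["tpr"].mean()
```

The same logic applies to reporting rates and treatment rates: always carry the raw numerator and denominator columns through the aggregation, then divide.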
Show the code
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols <- c(
# totals
"allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
# by location
"test_hf", "test_com", "conf_hf", "conf_com",
"maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
# by age group - totals
"allout_u5", "allout_ov5",
"test_u5", "test_5_14", "test_ov15",
"conf_u5", "conf_5_14", "conf_ov15",
"maltreat_u5", "maltreat_5_14", "maltreat_ov15",
"pres_u5", "pres_5_14", "pres_ov15",
"susp_u5", "susp_5_14", "susp_ov15",
"maladm_u5", "maladm_5_14", "maladm_ov15",
"maldth_u5", "maldth_5_14", "maldth_ov15",
# by age and location
"test_hf_u5", "test_hf_5_14", "test_hf_ov15",
"test_com_u5", "test_com_5_14", "test_com_ov15",
"conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
"conf_com_u5", "conf_com_5_14", "conf_com_ov15",
"maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
"maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
"pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
"pres_com_u5", "pres_com_5_14", "pres_com_ov15",
# treatment timing
"maltreat_u24_hf", "maltreat_ov24_hf",
"maltreat_u24_com", "maltreat_ov24_com",
"maltreat_u24_total", "maltreat_ov24_total"
)
# function to aggregate at a given admin level
aggregate_admin <- function(df, group_cols, sum_cols) {
df |>
dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |>
dplyr::summarise(
dplyr::across(
dplyr::any_of(sum_cols),
~ sum(.x, na.rm = TRUE)
),
n_facilities = dplyr::n(),
.groups = "drop"
)
}
# aggregate at adm3 level
group_cols_adm3 <- c("adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon")
dhis2_adm3 <- aggregate_admin(dhis2_df, group_cols_adm3, sum_cols) |>
dplyr::mutate(
record_id = sntutils::vdigest(
paste(adm0, adm1, adm2, adm3, yearmon),
algo = "xxhash32"
),
location_short = paste(adm1, adm2, sep = " ~ "),
location_full = paste(adm1, adm2, adm3, sep = " ~ ")
)
# aggregate at adm2 level
group_cols_adm2 <- c("adm0", "adm1", "adm2", "year", "month", "yearmon")
dhis2_adm2 <- aggregate_admin(dhis2_df, group_cols_adm2, sum_cols) |>
dplyr::mutate(
record_id = sntutils::vdigest(
paste(adm0, adm1, adm2, yearmon),
algo = "xxhash32"
),
location_short = paste(adm1, adm2, sep = " ~ "),
location_full = paste(adm1, adm2, sep = " ~ ")
)
# aggregate at adm1 level
group_cols_adm1 <- c("adm0", "adm1", "year", "month", "yearmon")
dhis2_adm1 <- aggregate_admin(dhis2_df, group_cols_adm1, sum_cols) |>
dplyr::mutate(
record_id = sntutils::vdigest(
paste(adm0, adm1, yearmon),
algo = "xxhash32"
),
location_short = adm1,
location_full = adm1
)
# create data dictionaries for each level (filter to relevant columns only)
id_cols <- c("n_facilities", "record_id", "location_short", "location_full")
# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_dict <- full_data_dict |>
dplyr::filter(snt_variable %in% c(group_cols_adm3, sum_cols, id_cols))
# adm2 dict: excludes adm3 (not present at this level)
adm2_dict <- full_data_dict |>
dplyr::filter(
snt_variable %in% c(group_cols_adm2, sum_cols, id_cols),
!snt_variable %in% c("adm3")
)
# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_dict <- full_data_dict |>
dplyr::filter(
snt_variable %in% c(group_cols_adm1, sum_cols, id_cols),
!snt_variable %in% c("adm2", "adm3")
)
# save adm3 data with dictionary
rio::export(
list(data = dhis2_adm3, dictionary = adm3_dict),
here::here(save_path, "sle_dhis2_adm3_data.xlsx")
)
rio::export(dhis2_adm3, here::here(save_path, "sle_dhis2_adm3_data.rds"))
# save adm2 data with dictionary
rio::export(
list(data = dhis2_adm2, dictionary = adm2_dict),
here::here(save_path, "sle_dhis2_adm2_data.xlsx")
)
rio::export(dhis2_adm2, here::here(save_path, "sle_dhis2_adm2_data.rds"))
# save adm1 data with dictionary
rio::export(
list(data = dhis2_adm1, dictionary = adm1_dict),
here::here(save_path, "sle_dhis2_adm1_data.xlsx")
)
rio::export(dhis2_adm1, here::here(save_path, "sle_dhis2_adm1_data.rds"))
To adapt the code:
- Lines 2-30: Update sum_cols to include all numeric indicators relevant to your dataset.
- Lines 33-42: The aggregate_admin() function handles grouping by admin level.
- Lines 45-79: Adjust grouping columns if your admin hierarchy differs. The record_id is created using sntutils::vdigest() with the full admin hierarchy (e.g., adm0 + adm1 + adm2 + adm3 + yearmon for the adm3 level) to ensure unique identifiers. The code also creates location_short and location_full labels for each level.
- Lines 82-94: Filter the data dictionary to match the columns at each level, excluding admin columns not present at that level (e.g., adm3 is excluded from the adm2 dictionary).
- Lines 97-112: Update filename prefixes (e.g., sle_) to match your country code.
Once updated, run the code to aggregate and save data at adm3, adm2, and adm1 levels.
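The record_id construction described above can be sketched in plain Python. This is an illustrative stand-in: sntutils::vdigest() uses the xxhash32 algorithm, which requires a third-party package in Python, so the hypothetical helper make_record_id() truncates a standard-library sha256 digest instead. The property that matters is determinism: the same admin hierarchy and period always yield the same identifier.

```python
import hashlib

def make_record_id(*parts: str, length: int = 8) -> str:
    """Deterministic short id from admin hierarchy + period (sketch)."""
    key = " ".join(parts)
    # truncate the hex digest to a short, stable identifier
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length]

rid = make_record_id("SIERRA LEONE", "SOUTHERN", "BO", "Jan 2021")
```

Because the id is derived from the data itself, re-running the pipeline reproduces the same identifiers, which keeps joins against previously exported files stable.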
Show the code
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols = [
# totals
"allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
# by location
"test_hf", "test_com", "conf_hf", "conf_com",
"maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
# by age group - totals
"allout_u5", "allout_ov5",
"test_u5", "test_5_14", "test_ov15",
"conf_u5", "conf_5_14", "conf_ov15",
"maltreat_u5", "maltreat_5_14", "maltreat_ov15",
"pres_u5", "pres_5_14", "pres_ov15",
"susp_u5", "susp_5_14", "susp_ov15",
"maladm_u5", "maladm_5_14", "maladm_ov15",
"maldth_u5", "maldth_5_14", "maldth_ov15",
# by age and location
"test_hf_u5", "test_hf_5_14", "test_hf_ov15",
"test_com_u5", "test_com_5_14", "test_com_ov15",
"conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
"conf_com_u5", "conf_com_5_14", "conf_com_ov15",
"maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
"maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
"pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
"pres_com_u5", "pres_com_5_14", "pres_com_ov15",
# treatment timing
"maltreat_u24_hf", "maltreat_ov24_hf",
"maltreat_u24_com", "maltreat_ov24_com",
"maltreat_u24_total", "maltreat_ov24_total"
]
# function to aggregate at a given admin level
def aggregate_admin(df, group_cols, sum_cols):
    # filter to columns that exist in the dataframe
    existing_sum_cols = [c for c in sum_cols if c in df.columns]
    # aggregate: sum each indicator and count contributing facility rows
    agg_df = (
        df
        .groupby(group_cols, as_index=False)
        .agg(
            **{col: (col, "sum") for col in existing_sum_cols},
            n_facilities=("hf_uid", "count")
        )
    )
    return agg_df
# aggregate at adm3 level
group_cols_adm3 = ["adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon"]
dhis2_adm3 = aggregate_admin(dhis2_df, group_cols_adm3, sum_cols)
dhis2_adm3["record_id"] = vdigest(
dhis2_adm3["adm0"] + " " + dhis2_adm3["adm1"] + " " +
dhis2_adm3["adm2"] + " " + dhis2_adm3["adm3"] + " " +
dhis2_adm3["yearmon"].astype(str)
)
dhis2_adm3["location_short"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"]
dhis2_adm3["location_full"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"] + " ~ " + dhis2_adm3["adm3"]
# aggregate at adm2 level
group_cols_adm2 = ["adm0", "adm1", "adm2", "year", "month", "yearmon"]
dhis2_adm2 = aggregate_admin(dhis2_df, group_cols_adm2, sum_cols)
dhis2_adm2["record_id"] = vdigest(
dhis2_adm2["adm0"] + " " + dhis2_adm2["adm1"] + " " +
dhis2_adm2["adm2"] + " " + dhis2_adm2["yearmon"].astype(str)
)
dhis2_adm2["location_short"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
dhis2_adm2["location_full"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
# aggregate at adm1 level
group_cols_adm1 = ["adm0", "adm1", "year", "month", "yearmon"]
dhis2_adm1 = aggregate_admin(dhis2_df, group_cols_adm1, sum_cols)
dhis2_adm1["record_id"] = vdigest(
dhis2_adm1["adm0"] + " " + dhis2_adm1["adm1"] + " " +
dhis2_adm1["yearmon"].astype(str)
)
dhis2_adm1["location_short"] = dhis2_adm1["adm1"]
dhis2_adm1["location_full"] = dhis2_adm1["adm1"]
# create data dictionaries for each level (filter to relevant columns only)
id_cols = ["n_facilities", "record_id", "location_short", "location_full"]
# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_cols = group_cols_adm3 + [c for c in sum_cols if c in dhis2_adm3.columns] + id_cols
adm3_dict = full_data_dict[full_data_dict["snt_variable"].isin(adm3_cols)]
# adm2 dict: excludes adm3 (not present at this level)
adm2_cols = group_cols_adm2 + [c for c in sum_cols if c in dhis2_adm2.columns] + id_cols
adm2_dict = full_data_dict[
(full_data_dict["snt_variable"].isin(adm2_cols)) &
(~full_data_dict["snt_variable"].isin(["adm3"]))
]
# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_cols = group_cols_adm1 + [c for c in sum_cols if c in dhis2_adm1.columns] + id_cols
adm1_dict = full_data_dict[
(full_data_dict["snt_variable"].isin(adm1_cols)) &
(~full_data_dict["snt_variable"].isin(["adm2", "adm3"]))
]
# save adm3 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm3_data.xlsx") as writer:
    dhis2_adm3.to_excel(writer, sheet_name="data", index=False)
    adm3_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm3.to_parquet(save_path / "sle_dhis2_adm3_data.parquet", compression="zstd", index=False)
# save adm2 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm2_data.xlsx") as writer:
    dhis2_adm2.to_excel(writer, sheet_name="data", index=False)
    adm2_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm2.to_parquet(save_path / "sle_dhis2_adm2_data.parquet", compression="zstd", index=False)
# save adm1 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm1_data.xlsx") as writer:
    dhis2_adm1.to_excel(writer, sheet_name="data", index=False)
    adm1_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm1.to_parquet(save_path / "sle_dhis2_adm1_data.parquet", compression="zstd", index=False)
To adapt the code:
- Lines 2-30: Update sum_cols to include all numeric indicators relevant to your dataset.
- Lines 33-46: The aggregate_admin() function handles grouping by admin level.
- Lines 49-75: Adjust grouping columns if your admin hierarchy differs. The record_id is created using vdigest() with the full admin hierarchy (e.g., adm0 + adm1 + adm2 + adm3 + yearmon for the adm3 level) to ensure unique identifiers. The code also creates location_short and location_full labels for each level.
- Lines 78-90: Filter the data dictionary to match the columns at each level, excluding admin columns not present at that level (e.g., adm3 is excluded from the adm2 dictionary).
- Lines 93-108: Update filename prefixes (e.g., sle_) to match your country code.
Once updated, run the code to aggregate and save data at adm3, adm2, and adm1 levels.
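A useful sanity check after aggregation is that count-type indicators are preserved across levels: summing a counted indicator at adm1 should equal summing it at adm2 and at the facility level. A minimal sketch with toy facility data (hypothetical values, not the Sierra Leone dataset):

```python
import pandas as pd

# toy facility-level table: adm1 / adm2 / confirmed cases
hf = pd.DataFrame({
    "adm1": ["SOUTHERN", "SOUTHERN", "EASTERN"],
    "adm2": ["BO", "BONTHE", "KENEMA"],
    "conf": [120, 80, 200],
})

# aggregate upward, level by level
adm2 = hf.groupby(["adm1", "adm2"], as_index=False)["conf"].sum()
adm1 = adm2.groupby("adm1", as_index=False)["conf"].sum()

# totals must agree at every level of aggregation
assert hf["conf"].sum() == adm2["conf"].sum() == adm1["conf"].sum()
```

If the totals diverge, the usual culprits are duplicate facility-month records or rows dropped by missing admin names, both of which the earlier steps are designed to catch.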
Summary
We’ve now walked through the key steps for cleaning, reshaping, and aggregating DHIS2 malaria data to make it SNT-ready. The code covered everything from importing raw files and standardising column names to computing key indicators and saving outputs at multiple administrative levels. For convenience, a full end-to-end script is included below in a folded code block. You can reuse or adapt it for your own country context by adjusting column names, file paths, and administrative levels as needed.
Full code
Find the full code script for accessing and processing routine data below.
Show full code
################################################################################
#################### ~ DHIS2 data preprocessing full code ~ ####################
################################################################################
### Step 1: Import Packages and Data -------------------------------------------
#### Step 1.1: Import Packages -------------------------------------------------
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# install or load relevant packages
pacman::p_load(
tidyverse, # data manipulation, reshaping and plotting
rio, # import/export multiple file types
DT, # interactive table preview
here, # project-relative file paths
readxl, # read excel files
writexl, # write excel files
knitr # render code in quarto/rmarkdown
)
#### Step 1.2: Import Data -----------------------------------------------------
# set the path to the data
core_routine_path <- here::here(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"raw"
)
# set the full dhis2 path (for reference; the import below scans the folder)
path_to_dhis2 <- here::here(core_routine_path, "sle_dhis2_2015_2022.xlsx")
dhis2_df <- rio::import_list(
file = list.files(
path = core_routine_path,
pattern = "\\.(xlsx)$", # match the expected file extension
full.names = TRUE
),
rbind = TRUE, # stack all tabs/files into one table
rbind_label = "sheet_admin", # source name stored in column 'sheet_admin'
rbind_fill = TRUE # fill missing columns with NA when stacking
)
# check data
dhis2_df |>
dplyr::select(1:20) |>
dplyr::glimpse()
### Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts --
#### Step 2.1: Verify All Expected Groups Are Present After Import -------------
# check the admin units in our data
dhis2_df |>
dplyr::distinct(orgunitlevel3)
#### Step 2.2: Check for Completely Missing Variables --------------------------
# identify complete missing values per column
dhis2_df |>
dplyr::summarise(
dplyr::across(
dplyr::everything(),
~ mean(is.na(.x)) * 100
)
) |>
tidyr::pivot_longer(
everything(),
names_to = "variable",
values_to = "pct_missing"
) |>
dplyr::filter(pct_missing == 100)
### Step 3: Rename Columns -----------------------------------------------------
# import data dictionary
data_dict <- rio::import(
here::here(core_routine_path, "sle_dhis2_data_dict.xlsx")
)
# create rename vector: object = old names, names = new names
rename_vector <- stats::setNames(
object = data_dict$indicator_label,
nm = data_dict$snt_var
)
# rename dhis2 data
dhis2_df <- dhis2_df |>
dplyr::rename(
!!!rename_vector
) |>
# drop orgunitlevel5 because it is the same as hf
dplyr::select(-orgunitlevel5)
# check the names
colnames(dhis2_df)
### Step 4: Standardise the Date Columns ---------------------------------------
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
Sys.setlocale("LC_TIME", "en_US.UTF-8")
dhis2_df <- dhis2_df |>
# parse the raw date into a proper Date object
dplyr::mutate(
date = lubridate::parse_date_time(
periodname,
orders = c("B Y", "b Y")
) |>
as.Date()
) |>
# create year, month, and ordered yearmon label
dplyr::mutate(
year = lubridate::year(date),
month = lubridate::month(date),
# human-readable label, e.g. "Jan 2020"
yearmon = format(date, "%b %Y"),
# order factor by actual chronological date
yearmon = factor(yearmon, levels = unique(yearmon[order(date)]))
)
# check the head
dhis2_df |>
dplyr::select(date, year, month, yearmon) |>
head()
### Step 5: Managing Location Columns ------------------------------------------
#### Step 5.1: Harmonise Administrative Names ----------------------------------
# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")
# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"
# get adm3 shapefile to use as reference lookup
shp_adm3 <- sntutils::read(
here::here("english/data_r/shapefiles/sle_spatial_adm3_2021.rds")
) |>
sf::st_drop_geometry() # drop geometry, we just need the admin names
# standardise admin names
dhis2_df <- dhis2_df |>
dplyr::mutate(
adm0 = toupper(adm0),
# the export's adm1 column actually holds district (adm2) names,
# so we move it into adm2; adm1 is rebuilt below
adm2 = toupper(adm1),
adm3 = toupper(adm3),
hf = toupper(hf),
# the adm2 in the lookup shapefile don't have
# "DISTRICT" in the name, so we remove it to enable matching
adm2 = stringr::str_remove_all(adm2, " DISTRICT"),
adm3 = stringr::str_remove_all(adm3, " CHIEFDOM")
) |>
dplyr::mutate(
# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
adm1 = dplyr::case_when(
adm2 %in% c("KAILAHUN", "KENEMA", "KONO") ~ "EASTERN",
adm2 %in% c("BOMBALI", "FALABA", "KOINADUGU", "TONKOLILI") ~ "NORTH EAST",
adm2 %in% c("KAMBIA", "KARENE", "PORT LOKO") ~ "NORTH WEST",
adm2 %in% c("BO", "BONTHE", "MOYAMBA", "PUJEHUN") ~ "SOUTHERN",
adm2 %in% c("WESTERN AREA RURAL", "WESTERN AREA URBAN") ~ "WESTERN"
)
)
# harmonise admin names between dhis2 data and shapefile
dhis2_df <-
sntutils::prep_geonames(
target_df = dhis2_df,
lookup_df = shp_adm3,
level0 = "adm0",
level1 = "adm1",
level2 = "adm2",
level3 = "adm3",
interactive = TRUE,
cache_path = here::here(cache_path, "geoname_decisions_cache.xlsx")
)
#### Step 5.2: Create Unique Identifiers and Location Labels -------------------
# create identifiers
dhis2_df <- dhis2_df |>
dplyr::mutate(
# create unique health facility identifier from admin hierarchy
hf_uid = sntutils::vdigest(
paste0(adm0, adm1, adm2, adm3, hf),
algo = "xxhash32"
),
# create location labels for mapping
location_short = paste(adm1, adm2, sep = " ~ "),
location_full = paste(adm1, adm2, adm3, hf, sep = " ~ "),
# generate consistent record id (facility + time period)
record_id = sntutils::vdigest(
paste(hf_uid, yearmon),
algo = "xxhash32"
)
)
# check
dhis2_df |>
dplyr::arrange(location_full) |>
dplyr::distinct(location_full, hf_uid, record_id) |>
head()
### Step 6: Compute Variables --------------------------------------------------
#### Step 6.1: Compute Indicator Totals and New Variables ----------------------
# smart row-wise sum with missing data handling
fallback_row_sum <- function(..., min_present = 1) {
vars_matrix <- cbind(...)
valid_count <- rowSums(!is.na(vars_matrix))
raw_sum <- rowSums(vars_matrix, na.rm = TRUE)
ifelse(valid_count >= min_present, raw_sum, NA_real_)
}
# fallback absolute difference between two vectors
fallback_diff <- function(col1, col2, minimum = 0) {
dplyr::case_when(
is.na(col1) & is.na(col2) ~ NA_real_,
is.na(col1) ~ pmax(col2, minimum),
is.na(col2) ~ pmax(col1, minimum),
TRUE ~ pmax(col1 - col2, minimum)
)
}
# compute indicator totals in DHIS2 data
dhis2_df <- dhis2_df |>
dplyr::mutate(
# outpatient visits
allout = fallback_row_sum(allout_u5, allout_ov5),
# suspected cases
susp = fallback_row_sum(
susp_u5_hf,
susp_5_14_hf,
susp_ov15_hf,
susp_u5_com,
susp_5_14_com,
susp_ov15_com
),
# tested cases
test_hf = fallback_row_sum(
test_neg_mic_u5_hf,
test_pos_mic_u5_hf,
test_neg_mic_5_14_hf,
test_pos_mic_5_14_hf,
test_neg_mic_ov15_hf,
test_pos_mic_ov15_hf,
tes_neg_rdt_u5_hf,
tes_pos_rdt_u5_hf,
tes_neg_rdt_5_14_hf,
tes_pos_rdt_5_14_hf,
tes_neg_rdt_ov15_hf,
tes_pos_rdt_ov15_hf
),
test_com = fallback_row_sum(
tes_neg_rdt_u5_com,
tes_pos_rdt_u5_com,
tes_neg_rdt_5_14_com,
tes_pos_rdt_5_14_com,
tes_neg_rdt_ov15_com,
tes_pos_rdt_ov15_com
),
test = fallback_row_sum(test_hf, test_com),
# confirmed cases (HF and COM)
conf_hf = fallback_row_sum(
test_pos_mic_u5_hf,
test_pos_mic_5_14_hf,
test_pos_mic_ov15_hf,
tes_pos_rdt_u5_hf,
tes_pos_rdt_5_14_hf,
tes_pos_rdt_ov15_hf
),
conf_com = fallback_row_sum(
tes_pos_rdt_u5_com,
tes_pos_rdt_5_14_com,
tes_pos_rdt_ov15_com
),
conf = fallback_row_sum(conf_hf, conf_com),
# treated cases
maltreat_com = fallback_row_sum(
maltreat_u24_u5_com,
maltreat_ov24_u5_com,
maltreat_u24_5_14_com,
maltreat_ov24_5_14_com,
maltreat_u24_ov15_com,
maltreat_ov24_ov15_com
),
maltreat_hf = fallback_row_sum(
maltreat_u24_u5_hf,
maltreat_ov24_u5_hf,
maltreat_u24_5_14_hf,
maltreat_ov24_5_14_hf,
maltreat_u24_ov15_hf,
maltreat_ov24_ov15_hf
),
maltreat = fallback_row_sum(maltreat_hf, maltreat_com),
# presumed cases
pres_com = fallback_diff(maltreat_com, conf_com),
pres_hf = fallback_diff(maltreat_hf, conf_hf),
pres = fallback_row_sum(pres_com, pres_hf),
# malaria admissions
maladm = fallback_row_sum(
maladm_u5,
maladm_5_14,
maladm_ov15
),
# malaria deaths
maldth = fallback_row_sum(
maldth_u5,
maldth_1_59m,
maldth_10_14,
maldth_5_9,
maldth_5_14,
maldth_ov15,
maldth_fem_ov15,
maldth_mal_ov15
),
# AGE-GROUP SPECIFIC AGGREGATIONS
# Tested cases by age group (HF only)
test_hf_u5 = fallback_row_sum(
test_neg_mic_u5_hf,
test_pos_mic_u5_hf,
tes_neg_rdt_u5_hf,
tes_pos_rdt_u5_hf
),
test_hf_5_14 = fallback_row_sum(
test_neg_mic_5_14_hf,
test_pos_mic_5_14_hf,
tes_neg_rdt_5_14_hf,
tes_pos_rdt_5_14_hf
),
test_hf_ov15 = fallback_row_sum(
test_neg_mic_ov15_hf,
test_pos_mic_ov15_hf,
tes_neg_rdt_ov15_hf,
tes_pos_rdt_ov15_hf
),
# Tested cases by age group (Community only)
test_com_u5 = fallback_row_sum(
tes_neg_rdt_u5_com,
tes_pos_rdt_u5_com
),
test_com_5_14 = fallback_row_sum(
tes_neg_rdt_5_14_com,
tes_pos_rdt_5_14_com
),
test_com_ov15 = fallback_row_sum(
tes_neg_rdt_ov15_com,
tes_pos_rdt_ov15_com
),
# Total tested by age group (HF + Community)
test_u5 = fallback_row_sum(test_hf_u5, test_com_u5),
test_5_14 = fallback_row_sum(test_hf_5_14, test_com_5_14),
test_ov15 = fallback_row_sum(test_hf_ov15, test_com_ov15),
# Suspected cases by age group (HF only)
susp_hf_u5 = susp_u5_hf,
susp_hf_5_14 = susp_5_14_hf,
susp_hf_ov15 = susp_ov15_hf,
# Suspected cases by age group (Community only)
susp_com_u5 = susp_u5_com,
susp_com_5_14 = susp_5_14_com,
susp_com_ov15 = susp_ov15_com,
# Total suspected by age group (HF + Community)
susp_u5 = fallback_row_sum(susp_hf_u5, susp_com_u5),
susp_5_14 = fallback_row_sum(susp_hf_5_14, susp_com_5_14),
susp_ov15 = fallback_row_sum(susp_hf_ov15, susp_com_ov15),
# Confirmed cases by age group (HF only)
conf_hf_u5 = fallback_row_sum(
test_pos_mic_u5_hf,
tes_pos_rdt_u5_hf
),
conf_hf_5_14 = fallback_row_sum(
test_pos_mic_5_14_hf,
tes_pos_rdt_5_14_hf
),
conf_hf_ov15 = fallback_row_sum(
test_pos_mic_ov15_hf,
tes_pos_rdt_ov15_hf
),
# Confirmed cases by age group (Community only)
conf_com_u5 = tes_pos_rdt_u5_com,
conf_com_5_14 = tes_pos_rdt_5_14_com,
conf_com_ov15 = tes_pos_rdt_ov15_com,
# Total confirmed cases by age group (HF + Community)
conf_u5 = fallback_row_sum(conf_hf_u5, conf_com_u5),
conf_5_14 = fallback_row_sum(conf_hf_5_14, conf_com_5_14),
conf_ov15 = fallback_row_sum(conf_hf_ov15, conf_com_ov15),
# Treated cases by age group (HF only)
maltreat_hf_u5 = fallback_row_sum(
maltreat_u24_u5_hf,
maltreat_ov24_u5_hf
),
maltreat_hf_5_14 = fallback_row_sum(
maltreat_u24_5_14_hf,
maltreat_ov24_5_14_hf
),
maltreat_hf_ov15 = fallback_row_sum(
maltreat_u24_ov15_hf,
maltreat_ov24_ov15_hf
),
# Treated cases by age group (Community only)
maltreat_com_u5 = fallback_row_sum(
maltreat_u24_u5_com,
maltreat_ov24_u5_com
),
maltreat_com_5_14 = fallback_row_sum(
maltreat_u24_5_14_com,
maltreat_ov24_5_14_com
),
maltreat_com_ov15 = fallback_row_sum(
maltreat_u24_ov15_com,
maltreat_ov24_ov15_com
),
# Total treated cases by age group (HF + Community)
maltreat_u5 = fallback_row_sum(maltreat_hf_u5, maltreat_com_u5),
maltreat_5_14 = fallback_row_sum(maltreat_hf_5_14, maltreat_com_5_14),
maltreat_ov15 = fallback_row_sum(maltreat_hf_ov15, maltreat_com_ov15),
# Total treated cases within 24 hours (HF only)
maltreat_u24_hf = fallback_row_sum(
maltreat_u24_u5_hf,
maltreat_u24_5_14_hf,
maltreat_u24_ov15_hf
),
# Total treated cases after 24 hours (HF only)
maltreat_ov24_hf = fallback_row_sum(
maltreat_ov24_u5_hf,
maltreat_ov24_5_14_hf,
maltreat_ov24_ov15_hf
),
# Total treated cases within 24 hours (Community only)
maltreat_u24_com = fallback_row_sum(
maltreat_u24_u5_com,
maltreat_u24_5_14_com,
maltreat_u24_ov15_com
),
# Total treated cases after 24 hours (Community only)
maltreat_ov24_com = fallback_row_sum(
maltreat_ov24_u5_com,
maltreat_ov24_5_14_com,
maltreat_ov24_ov15_com
),
# Overall totals (HF + Community)
maltreat_u24_total = fallback_row_sum(maltreat_u24_hf, maltreat_u24_com),
maltreat_ov24_total = fallback_row_sum(maltreat_ov24_hf, maltreat_ov24_com),
# Presumed cases by age group
pres_com_u5 = fallback_diff(maltreat_com_u5, conf_com_u5),
pres_com_5_14 = fallback_diff(maltreat_com_5_14, conf_com_5_14),
pres_com_ov15 = fallback_diff(maltreat_com_ov15, conf_com_ov15),
pres_hf_u5 = fallback_diff(maltreat_hf_u5, conf_hf_u5),
pres_hf_5_14 = fallback_diff(maltreat_hf_5_14, conf_hf_5_14),
pres_hf_ov15 = fallback_diff(maltreat_hf_ov15, conf_hf_ov15),
pres_u5 = fallback_row_sum(pres_com_u5, pres_hf_u5),
pres_5_14 = fallback_row_sum(pres_com_5_14, pres_hf_5_14),
pres_ov15 = fallback_row_sum(pres_com_ov15, pres_hf_ov15)
)
# check to see the aggregation worked
dhis2_df |>
dplyr::filter(
record_id %in%
c("6a29143b", "0e7ba814", "943c5f5f", "4fbe05fd", "40cc411c", "51194842")
) |>
dplyr::arrange(allout) |>
dplyr::select(
allout,
allout_u5,
allout_ov5,
pres_hf_u5,
maltreat_hf_u5,
conf_hf_u5
) |>
head()
#### Step 6.2: Quality Control of Indicator Totals -----------------------------
# create mismatch flags for each indicator group
mismatch_summary <- dhis2_df |>
dplyr::summarise(
# outpatient checks
allout_mismatch = sum(
allout != (allout_u5 + allout_ov5),
na.rm = TRUE
),
# malaria admissions check
maladm_mismatch = sum(
maladm != (maladm_u5 + maladm_5_14 + maladm_ov15),
na.rm = TRUE
),
# total tests check
test_mismatch = sum(
test != (test_hf + test_com),
na.rm = TRUE
),
# total confirmed check
conf_mismatch = sum(
conf != (conf_hf + conf_com),
na.rm = TRUE
),
# total treated check
maltreat_mismatch = sum(
maltreat != (maltreat_hf + maltreat_com),
na.rm = TRUE
),
# total presumed check
pres_mismatch = sum(
pres != (pres_hf + pres_com),
na.rm = TRUE
),
# malaria deaths check
maldth_mismatch = sum(
maldth !=
(maldth_1_59m +
maldth_u5 +
maldth_5_9 +
maldth_10_14 +
maldth_5_14 +
maldth_fem_ov15 +
maldth_mal_ov15 +
maldth_ov15),
na.rm = TRUE
),
# tested cases by age group
test_u5_mismatch = sum(
test_u5 != (test_hf_u5 + test_com_u5),
na.rm = TRUE
),
test_5_14_mismatch = sum(
test_5_14 != (test_hf_5_14 + test_com_5_14),
na.rm = TRUE
),
test_ov15_mismatch = sum(
test_ov15 != (test_hf_ov15 + test_com_ov15),
na.rm = TRUE
),
# confirmed cases by age group
conf_u5_mismatch = sum(
conf_u5 != (conf_hf_u5 + conf_com_u5),
na.rm = TRUE
),
conf_5_14_mismatch = sum(
conf_5_14 != (conf_hf_5_14 + conf_com_5_14),
na.rm = TRUE
),
conf_ov15_mismatch = sum(
conf_ov15 != (conf_hf_ov15 + conf_com_ov15),
na.rm = TRUE
),
# presumed cases by age group
pres_u5_mismatch = sum(
pres_u5 != (pres_hf_u5 + pres_com_u5),
na.rm = TRUE
),
pres_5_14_mismatch = sum(
pres_5_14 != (pres_hf_5_14 + pres_com_5_14),
na.rm = TRUE
),
pres_ov15_mismatch = sum(
pres_ov15 != (pres_hf_ov15 + pres_com_ov15),
na.rm = TRUE
),
# suspected cases by age group
susp_u5_mismatch = sum(
susp_u5 != (susp_u5_hf + susp_u5_com),
na.rm = TRUE
),
susp_5_14_mismatch = sum(
susp_5_14 != (susp_5_14_hf + susp_5_14_com),
na.rm = TRUE
),
susp_ov15_mismatch = sum(
susp_ov15 != (susp_ov15_hf + susp_ov15_com),
na.rm = TRUE
),
# treated cases by age group
maltreat_u5_mismatch = sum(
maltreat_u5 != (maltreat_hf_u5 + maltreat_com_u5),
na.rm = TRUE
),
maltreat_5_14_mismatch = sum(
maltreat_5_14 != (maltreat_hf_5_14 + maltreat_com_5_14),
na.rm = TRUE
),
maltreat_ov15_mismatch = sum(
maltreat_ov15 != (maltreat_hf_ov15 + maltreat_com_ov15),
na.rm = TRUE
),
# treatment timing checks
maltreat_u24_mismatch = sum(
maltreat_u24_total != (maltreat_u24_hf + maltreat_u24_com),
na.rm = TRUE
),
maltreat_ov24_mismatch = sum(
maltreat_ov24_total != (maltreat_ov24_hf + maltreat_ov24_com),
na.rm = TRUE
)
) |>
# pivot to long format for easier viewing
tidyr::pivot_longer(
cols = dplyr::everything(),
names_to = "indicator",
values_to = "n_mismatches"
) |>
# filter to show only indicators with mismatches
dplyr::filter(n_mismatches > 0)
mismatch_summary
#### Step 6.3: Export Rows with Incoherent Totals ------------------------------
# Identify rows with incoherent totals
incoherent_rows <- dhis2_df |>
dplyr::filter(
# outpatient checks
allout != (allout_u5 + allout_ov5) |
# malaria admissions check
maladm != (maladm_u5 + maladm_5_14 + maladm_ov15) |
# total tests check
test != (test_hf + test_com) |
# total confirmed check
conf != (conf_hf + conf_com) |
# total treated check
maltreat != (maltreat_hf + maltreat_com) |
# total presumed check
pres != (pres_hf + pres_com)
) |>
dplyr::select(
hf, periodname,
dplyr::matches("allout|maladm|test|conf|maltreat|pres")
)
# set path for saving
output_file <- here::here(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
rio::export(incoherent_rows, file = output_file)
#### Step 6.4: Add IPD/OPD Specification ---------------------------------------
# choose ONE of the three options below for assigning facility_type
# option 1: classify based on facility name patterns
dhis2_df <- dhis2_df |>
dplyr::mutate(
facility_type = dplyr::case_when(
# inpatient facilities (hospitals)
stringr::str_detect(
hf,
regex("hospital|hosp", ignore_case = TRUE)
) ~ "IPD",
stringr::str_detect(
hf,
regex("district hospital|regional hospital", ignore_case = TRUE)
) ~ "IPD",
# outpatient facilities (clinics, health posts, community health)
stringr::str_detect(
hf,
regex(
"CHP|CHC|MCHP|clinic|health post|health centre",
ignore_case = TRUE
)
) ~ "OPD",
# default to OPD if no pattern matched
TRUE ~ "OPD"
)
)
# option 2: classify based on presence of inpatient indicators
dhis2_df <- dhis2_df |>
dplyr::group_by(hf_uid) |>
dplyr::mutate(
# facility has IPD if it ever reports admissions or deaths
has_ipd = any(!is.na(maladm) & maladm > 0, na.rm = TRUE) |
any(!is.na(maldth) & maldth > 0, na.rm = TRUE),
facility_type = dplyr::if_else(has_ipd, "IPD", "OPD")
) |>
dplyr::ungroup() |>
dplyr::select(-has_ipd)
# option 3: use a reference lookup file
facility_lookup <-
rio::import(
here::here(
"01_data",
"1.1_foundational",
"1.1b_health_facilities",
"processed",
"health_facility_master_list.xlsx"
)
)
dhis2_df <- dhis2_df |>
dplyr::left_join(
facility_lookup |> dplyr::select(hf_uid, facility_type),
by = "hf_uid"
)
# check classification distribution
dhis2_df |>
dplyr::distinct(hf_uid, facility_type) |>
dplyr::count(facility_type)
### Step 7: Finalize Data ------------------------------------------------------
#### Step 7.1: Troubleshoot Duplicate HF-Month Records with Different Data -----
# identify duplicate hf-month combinations
duplicates <- dhis2_df |>
dplyr::group_by(record_id) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::ungroup()
# count number of duplicate pairs
n_duplicates <- duplicates |>
dplyr::distinct(record_id) |>
nrow()
# if duplicates exist, inspect them
if (n_duplicates > 0) {
# view duplicate records with key indicators
duplicate_details <- duplicates |>
dplyr::select(
record_id,
      adm0, adm1, adm2, adm3, hf, yearmon,
allout,
test,
conf,
maltreat
) |>
dplyr::arrange(record_id)
print(duplicate_details)
# export for review with SNT team
rio::export(
duplicate_details,
here::here(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_duplicate_records_dhis2.xlsx"
)
)
}
# option 1: keep first record (if duplicates are exact copies)
dhis2_df <- dhis2_df |>
dplyr::distinct(record_id, .keep_all = TRUE)
# option 2: keep record with most complete data
dhis2_df <- dhis2_df |>
dplyr::group_by(record_id) |>
dplyr::slice_max(
    # count non-NA values across numeric indicator columns
    order_by = rowSums(!is.na(dplyr::pick(dplyr::where(is.numeric)))),
n = 1,
with_ties = FALSE
) |>
dplyr::ungroup()
# option 3: sum duplicates (if records represent partial reports)
dhis2_df <- dhis2_df |>
dplyr::group_by(record_id, hf, adm0, adm1, adm2, adm3) |>
dplyr::summarise(
dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)),
.groups = "drop"
)
#### Step 7.2: Generate Final Data Dictionary ----------------------------------
# define descriptions for computed/derived columns
computed_vars <- tibble::tribble(
~snt_var, ~indicator_label,
# time columns
"date", "Report date (YYYY-MM-DD)",
"year", "Report year",
"month", "Report month (1-12)",
"yearmon", "Year-month label (e.g., Jan 2020)",
# identifier columns
"hf_uid", "Unique health facility identifier (hash)",
"record_id", "Unique record identifier (facility + month hash)",
"location_short", "Location label: adm1 ~ adm2",
"location_full", "Location label: adm1 ~ adm2 ~ adm3 ~ hf",
"facility_type", "Facility type (IPD/OPD)",
# aggregated totals
"allout", "Total outpatient visits (allout_u5 + allout_ov5)",
"susp", "Total suspected cases (all ages, HF + COM)",
"test", "Total tested (test_hf + test_com)",
"test_hf", "Total tested at health facility",
"test_com", "Total tested in community",
"conf", "Total confirmed cases (conf_hf + conf_com)",
"conf_hf", "Total confirmed cases at health facility",
"conf_com", "Total confirmed cases in community",
"maltreat", "Total treated cases (maltreat_hf + maltreat_com)",
"maltreat_hf", "Total treated cases at health facility",
"maltreat_com", "Total treated cases in community",
"pres", "Total presumed cases (pres_hf + pres_com)",
"pres_hf", "Total presumed cases at health facility",
"pres_com", "Total presumed cases in community",
"maladm", "Total malaria admissions (all ages)",
"maldth", "Total malaria deaths (all ages)",
# age-group totals
"test_u5", "Total tested under 5 (HF + COM)",
"test_5_14", "Total tested 5-14 years (HF + COM)",
"test_ov15", "Total tested over 15 years (HF + COM)",
"test_hf_u5", "Tested under 5 at health facility",
"test_hf_5_14", "Tested 5-14 years at health facility",
"test_hf_ov15", "Tested over 15 years at health facility",
"test_com_u5", "Tested under 5 in community",
"test_com_5_14", "Tested 5-14 years in community",
"test_com_ov15", "Tested over 15 years in community",
"susp_u5", "Suspected cases under 5 (HF + COM)",
"susp_5_14", "Suspected cases 5-14 years (HF + COM)",
"susp_ov15", "Suspected cases over 15 years (HF + COM)",
"susp_hf_u5", "Suspected cases under 5 at health facility",
"susp_hf_5_14", "Suspected cases 5-14 years at health facility",
"susp_hf_ov15", "Suspected cases over 15 years at health facility",
"susp_com_u5", "Suspected cases under 5 in community",
"susp_com_5_14", "Suspected cases 5-14 years in community",
"susp_com_ov15", "Suspected cases over 15 years in community",
"conf_u5", "Confirmed cases under 5 (HF + COM)",
"conf_5_14", "Confirmed cases 5-14 years (HF + COM)",
"conf_ov15", "Confirmed cases over 15 years (HF + COM)",
"conf_hf_u5", "Confirmed cases under 5 at health facility",
"conf_hf_5_14", "Confirmed cases 5-14 years at health facility",
"conf_hf_ov15", "Confirmed cases over 15 years at health facility",
"conf_com_u5", "Confirmed cases under 5 in community",
"conf_com_5_14", "Confirmed cases 5-14 years in community",
"conf_com_ov15", "Confirmed cases over 15 years in community",
"maltreat_u5", "Treated cases under 5 (HF + COM)",
"maltreat_5_14", "Treated cases 5-14 years (HF + COM)",
"maltreat_ov15", "Treated cases over 15 years (HF + COM)",
"maltreat_hf_u5", "Treated cases under 5 at health facility",
"maltreat_hf_5_14", "Treated cases 5-14 years at health facility",
"maltreat_hf_ov15", "Treated cases over 15 years at health facility",
"maltreat_com_u5", "Treated cases under 5 in community",
"maltreat_com_5_14", "Treated cases 5-14 years in community",
"maltreat_com_ov15", "Treated cases over 15 years in community",
"maltreat_u24_hf", "Treated cases within 24hrs at health facility",
"maltreat_ov24_hf", "Treated cases after 24hrs at health facility",
"maltreat_u24_com", "Treated cases within 24hrs in community",
"maltreat_ov24_com", "Treated cases after 24hrs in community",
"maltreat_u24_total", "Total treated cases within 24hrs (HF + COM)",
"maltreat_ov24_total", "Total treated cases after 24hrs (HF + COM)",
"pres_u5", "Presumed cases under 5 (HF + COM)",
"pres_5_14", "Presumed cases 5-14 years (HF + COM)",
"pres_ov15", "Presumed cases over 15 years (HF + COM)",
"pres_hf_u5", "Presumed cases under 5 at health facility",
"pres_hf_5_14", "Presumed cases 5-14 years at health facility",
"pres_hf_ov15", "Presumed cases over 15 years at health facility",
"pres_com_u5", "Presumed cases under 5 in community",
"pres_com_5_14", "Presumed cases 5-14 years in community",
"pres_com_ov15", "Presumed cases over 15 years in community"
)
# combine original and computed dictionaries
full_data_dict <- dplyr::bind_rows(
data_dict,
computed_vars
) |>
# keep only columns present in final dataset
dplyr::filter(snt_var %in% names(dhis2_df)) |>
dplyr::arrange(snt_var) |>
dplyr::select(snt_variable = snt_var, label = indicator_label)
# check
full_data_dict |>
head()
#### Step 7.3: Arrange Final Column Order --------------------------------------
# arrange columns in logical order
dhis2_df <- dhis2_df |>
dplyr::select(
# identifiers
record_id,
# location hierarchy
adm0,
adm1,
adm2,
adm3,
hf,
hf_uid,
location_short,
location_full,
facility_type,
# time variables
date,
yearmon,
year,
month,
# all remaining indicator columns
dplyr::everything()
)
# check column order
colnames(dhis2_df) |> head(20)
### Step 8: Aggregate and Save Data --------------------------------------------
#### Step 8.1: Save Data at the Health Facility Level --------------------------
# set up output path
save_path <- here::here(
"01_data",
"02_epidemiology",
"2a_routine_surveillance",
"processed"
)
# create list to save
dhis2_hf_list <- list(
data = dhis2_df,
dictionary = full_data_dict
)
# save to xlsx
rio::export(
dhis2_hf_list,
here::here(save_path, "sle_dhis2_health_facility_data.xlsx")
)
# save to rds
rio::export(
dhis2_hf_list,
here::here(save_path, "sle_dhis2_health_facility_data.rds")
)
#### Step 8.2: Aggregate and Save Data at Each Admin Unit Level ----------------
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols <- c(
# totals
"allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
# by location
"test_hf", "test_com", "conf_hf", "conf_com",
"maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
# by age group - totals
"allout_u5", "allout_ov5",
"test_u5", "test_5_14", "test_ov15",
"conf_u5", "conf_5_14", "conf_ov15",
"maltreat_u5", "maltreat_5_14", "maltreat_ov15",
"pres_u5", "pres_5_14", "pres_ov15",
"susp_u5", "susp_5_14", "susp_ov15",
"maladm_u5", "maladm_5_14", "maladm_ov15",
"maldth_u5", "maldth_5_14", "maldth_ov15",
# by age and location
"test_hf_u5", "test_hf_5_14", "test_hf_ov15",
"test_com_u5", "test_com_5_14", "test_com_ov15",
"conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
"conf_com_u5", "conf_com_5_14", "conf_com_ov15",
"maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
"maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
"pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
"pres_com_u5", "pres_com_5_14", "pres_com_ov15",
# treatment timing
"maltreat_u24_hf", "maltreat_ov24_hf",
"maltreat_u24_com", "maltreat_ov24_com",
"maltreat_u24_total", "maltreat_ov24_total"
)
# function to aggregate at a given admin level
aggregate_admin <- function(df, group_cols, sum_cols) {
df |>
dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |>
dplyr::summarise(
dplyr::across(
dplyr::any_of(sum_cols),
~ sum(.x, na.rm = TRUE)
),
n_facilities = dplyr::n(),
.groups = "drop"
)
}
# aggregate at adm3 level
group_cols_adm3 <- c("adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon")
dhis2_adm3 <- aggregate_admin(dhis2_df, group_cols_adm3, sum_cols) |>
dplyr::mutate(
record_id = sntutils::vdigest(
paste(adm0, adm1, adm2, adm3, yearmon),
algo = "xxhash32"
),
location_short = paste(adm1, adm2, sep = " ~ "),
location_full = paste(adm1, adm2, adm3, sep = " ~ ")
)
# aggregate at adm2 level
group_cols_adm2 <- c("adm0", "adm1", "adm2", "year", "month", "yearmon")
dhis2_adm2 <- aggregate_admin(dhis2_df, group_cols_adm2, sum_cols) |>
dplyr::mutate(
record_id = sntutils::vdigest(
paste(adm0, adm1, adm2, yearmon),
algo = "xxhash32"
),
location_short = paste(adm1, adm2, sep = " ~ "),
location_full = paste(adm1, adm2, sep = " ~ ")
)
# aggregate at adm1 level
group_cols_adm1 <- c("adm0", "adm1", "year", "month", "yearmon")
dhis2_adm1 <- aggregate_admin(dhis2_df, group_cols_adm1, sum_cols) |>
dplyr::mutate(
record_id = sntutils::vdigest(
paste(adm0, adm1, yearmon),
algo = "xxhash32"
),
location_short = adm1,
location_full = adm1
)
# create data dictionaries for each level (filter to relevant columns only)
id_cols <- c("n_facilities", "record_id", "location_short", "location_full")
# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_dict <- full_data_dict |>
dplyr::filter(snt_variable %in% c(group_cols_adm3, sum_cols, id_cols))
# adm2 dict: excludes adm3 (not present at this level)
adm2_dict <- full_data_dict |>
dplyr::filter(
snt_variable %in% c(group_cols_adm2, sum_cols, id_cols),
!snt_variable %in% c("adm3")
)
# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_dict <- full_data_dict |>
dplyr::filter(
snt_variable %in% c(group_cols_adm1, sum_cols, id_cols),
!snt_variable %in% c("adm2", "adm3")
)
# save adm3 data with dictionary
rio::export(
list(data = dhis2_adm3, dictionary = adm3_dict),
here::here(save_path, "sle_dhis2_adm3_data.xlsx")
)
rio::export(dhis2_adm3, here::here(save_path, "sle_dhis2_adm3_data.rds"))
# save adm2 data with dictionary
rio::export(
list(data = dhis2_adm2, dictionary = adm2_dict),
here::here(save_path, "sle_dhis2_adm2_data.xlsx")
)
rio::export(dhis2_adm2, here::here(save_path, "sle_dhis2_adm2_data.rds"))
# save adm1 data with dictionary
rio::export(
list(data = dhis2_adm1, dictionary = adm1_dict),
here::here(save_path, "sle_dhis2_adm1_data.xlsx")
)
rio::export(dhis2_adm1, here::here(save_path, "sle_dhis2_adm1_data.rds"))
################################################################################
#################### ~ DHIS2 data preprocessing full code ~ ####################
################################################################################
### Step 1: Import Packages and Data -------------------------------------------
#### Step 1.1: Import Packages -------------------------------------------------
import pandas as pd # load core data tools
from pathlib import Path # handle file system paths
from pyprojroot import here # build project-relative paths
import os # optional path utilities
import locale # for setting language locale
import geopandas as gpd # for importing shapefiles
import xxhash # for hashing
import pyarrow.parquet as pq # for exporting parquet files
import pyarrow as pa # for exporting parquet files
#### Step 1.2: Import Data -----------------------------------------------------
# set the full dhis2 path
core_routine_path = Path(
here("1.1.2_epidemiology/1.1.2a_routine_surveillance/raw")
)
# set the path to a single dhis2 export file
path_to_dhis2 = core_routine_path / "sle_dhis2_2015_2022.xlsx"
# set the file extension to search for (matches the .xlsx exports above)
extension = "xlsx"
# find all files with specified extension in directory
files = list(core_routine_path.glob(f"*.{extension}"))
# initialise dataframe
dhis2_df = pd.DataFrame()
# iterate over files, concatenating them into one dataframe
for file in files:
temp = pd.read_excel(file, sheet_name="Sheet1")
dhis2_df = pd.concat([dhis2_df, temp])
# Inspect data
dhis2_df.head(10)
### Step 2: Diagnostics After Importing and Appending Multiple DHIS2 Extracts --
#### Step 2.1: Verify All Expected Groups Are Present After Import -------------
# check the admin units (orgunitlevel3) in our data
dhis2_df[["orgunitlevel3"]].drop_duplicates().reset_index(drop=True)
#### Step 2.2: Check for Completely Missing Variables --------------------------
# compute percent missing for each column
pct_missing = dhis2_df.isna().mean() * 100
# convert to tidy/long format
missing_table = (
pct_missing.reset_index()
.rename(columns={"index": "variable", 0: "pct_missing"})
)
# filter for completely missing columns
missing_table[missing_table["pct_missing"] == 100]
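The same `isna().mean()` pattern can be verified on a toy frame (the column names here are illustrative, not from the export):

```python
import numpy as np
import pandas as pd

# toy frame: one fully populated column, one fully empty column
df = pd.DataFrame({"conf_u5": [36, 20], "conf_ov15": [np.nan, np.nan]})

# percent missing per column, as in the step above
pct_missing = df.isna().mean() * 100

# columns that are 100% missing are candidates for dropping
fully_missing = pct_missing[pct_missing == 100].index.tolist()
print(fully_missing)  # ['conf_ov15']
```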
### Step 3: Rename Columns -----------------------------------------------------
# import data dictionary
dict_path = os.path.join(core_routine_path, "sle_dhis2_data_dict.xlsx")
data_dict = pd.read_excel(dict_path)
# build rename dictionary: {old_name: new_name}
rename_dict = dict(zip(data_dict["indicator_label"], data_dict["snt_var"]))
# rename dhis2 data
dhis2_df = dhis2_df.rename(columns=rename_dict).drop(
columns=["orgunitlevel5"], errors="ignore"
)
# view updated column names
dhis2_df.columns.tolist()
### Step 4: Standardise the Date Columns ---------------------------------------
# set locale depending on the language in your DHIS2 export
# options include: "en_US.UTF-8", "fr_FR.UTF-8", "pt_PT.UTF-8"
locale.setlocale(locale.LC_TIME, "en_US.UTF-8")
# parse the raw date into a proper datetime object
dhis2_df["date"] = pd.to_datetime(
dhis2_df["periodname"],
format="%B %Y",
errors="coerce"
)
# create year, month, and ordered yearmon columns
dhis2_df["year"] = dhis2_df["date"].dt.year
dhis2_df["month"] = dhis2_df["date"].dt.month
dhis2_df["yearmon"] = dhis2_df["date"].dt.strftime("%b %Y")
# check the output
dhis2_df[["date", "year", "month", "yearmon"]].head()
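Because `errors="coerce"` turns unparseable period labels into `NaT` instead of raising, malformed rows can be counted after parsing. A minimal sketch with made-up period names:

```python
import pandas as pd

periods = pd.Series(["January 2021", "October 2021", "not a period"])
dates = pd.to_datetime(periods, format="%B %Y", errors="coerce")

# the malformed label becomes NaT and can be flagged for review
n_bad = dates.isna().sum()
print(n_bad)  # 1
```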
### Step 5: Managing Location Columns ------------------------------------------
#### Step 5.1: Harmonise Administrative Names ---------------------------------
# if not already installed, install the sntutils python package from GitHub
# (it has a number of useful helper functions); run in a terminal:
# pip install git+https://github.com/ahadi-analytics/sntutils-py.git
from sntutils.geo import prep_geonames # for name matching
# set up location to save cache
cache_path = Path("1.1_foundational/1d_cache_files")
# get adm3 shapefile to use as reference lookup
shp_adm3 = gpd.read_file(
Path(here("english/data_r/shapefiles/sle_spatial_adm3_2021.geojson"))
).drop(columns="geometry")
# standardise admin names
dhis2_df = dhis2_df.assign(
adm0=lambda x: x["adm0"].str.upper(),
    # the adm2 names in the lookup shapefile don't include
    # "DISTRICT", so we remove it here to enable matching
adm2=lambda x: x["adm1"]
.str.upper()
.str.replace(" DISTRICT", "", regex=False)
.str.strip(),
adm3=lambda x: x["adm3"]
.str.upper()
.str.replace(" CHIEFDOM", "", regex=False)
.str.strip(),
hf=lambda x: x["hf"]
.str.upper()
)
# assign provinces based on district groupings
# (no adm1 as current adm1 is adm2)
district_to_province = {
"KAILAHUN": "EASTERN", "KENEMA": "EASTERN", "KONO": "EASTERN",
"BOMBALI": "NORTH EAST", "FALABA": "NORTH EAST",
"KOINADUGU": "NORTH EAST", "TONKOLILI": "NORTH EAST",
"KAMBIA": "NORTH WEST", "KARENE": "NORTH WEST", "PORT LOKO": "NORTH WEST",
"BO": "SOUTHERN", "BONTHE": "SOUTHERN",
"MOYAMBA": "SOUTHERN", "PUJEHUN": "SOUTHERN",
"WESTERN AREA RURAL": "WESTERN", "WESTERN AREA URBAN": "WESTERN"
}
dhis2_df["adm1"] = dhis2_df["adm2"].map(district_to_province)
# harmonise admin names between dhis2 data and the shapefile lookup using
# sntutils' prep_geonames (interactive matching with a reusable decision cache)
dhis2_df = prep_geonames(
target_df=dhis2_df,
lookup_df=shp_adm3,
level0="adm0",
level1="adm1",
level2="adm2",
level3="adm3",
cache_path=here(cache_path, "geoname_decisions_cache.xlsx")
)
#### Step 5.2: Create Unique Identifiers and Location Labels ------------------
# helper function to create xxhash32 digest (equivalent to sntutils::vdigest)
def vdigest(x, algo="xxhash32"):
"""Create vectorised hash digest."""
return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())
# create identifiers
dhis2_df = dhis2_df.assign(
# create unique health facility identifier from admin hierarchy
# use spaces to match R's paste() behavior for consistent hashes
hf_uid=lambda x: vdigest(
x["adm0"] + " " + x["adm1"] + " " + x["adm2"] + " " + x["adm3"] + " " + x["hf"]
),
# create location labels for mapping
location_short=lambda x: x["adm1"] + " ~ " + x["adm2"],
location_full=lambda x: (
x["adm1"] + " ~ " + x["adm2"] + " ~ " + x["adm3"] + " ~ " + x["hf"]
),
)
# generate consistent record id (facility + time period)
# use space to match R's paste() behavior
dhis2_df["record_id"] = vdigest(
dhis2_df["hf_uid"] + " " + dhis2_df["yearmon"].astype(str)
)
# check
dhis2_df[["location_full", "hf_uid", "record_id"]].drop_duplicates().sort_values(
"location_full"
).head()
### Step 6: Compute Variables --------------------------------------------------
#### Step 6.1: Compute Indicator Totals and New Variables ----------------------
# helper function: row-wise sum; min_count=1 returns NA only when
# every component is NA (otherwise NAs are skipped)
def fallback_row_sum(df, cols):
    return df[cols].sum(axis=1, skipna=True, min_count=1)
# helper function: difference with floor at 0
def fallback_diff(df, col1, col2):
return (df[col1] - df[col2]).clip(lower=0)
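A quick check of both helpers on toy values: `min_count=1` keeps partial sums but returns NA when every component is missing, and `clip(lower=0)` floors negative differences at zero:

```python
import numpy as np
import pandas as pd

def fallback_row_sum(df, cols):
    # row-wise sum; NA only when all components are NA
    return df[cols].sum(axis=1, skipna=True, min_count=1)

def fallback_diff(df, col1, col2):
    # difference floored at zero
    return (df[col1] - df[col2]).clip(lower=0)

toy = pd.DataFrame({
    "u5": [10.0, np.nan, np.nan],
    "ov5": [5.0, 7.0, np.nan],
})
total = fallback_row_sum(toy, ["u5", "ov5"])
print(total.tolist())  # [15.0, 7.0, nan]

treat = pd.DataFrame({"maltreat": [5, 2], "conf": [3, 4]})
print(fallback_diff(treat, "maltreat", "conf").tolist())  # [2, 0]
```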
dhis2_df = (
dhis2_df.assign(
# outpatient visits
allout=lambda x: fallback_row_sum(x, ["allout_u5", "allout_ov5"]),
# suspected cases
susp=lambda x: fallback_row_sum(
x,
[
"susp_u5_hf",
"susp_5_14_hf",
"susp_ov15_hf",
"susp_u5_com",
"susp_5_14_com",
"susp_ov15_com",
],
),
# tested cases
test_hf=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_u5_hf",
"test_pos_mic_u5_hf",
"test_neg_mic_5_14_hf",
"test_pos_mic_5_14_hf",
"test_neg_mic_ov15_hf",
"test_pos_mic_ov15_hf",
"tes_neg_rdt_u5_hf",
"tes_pos_rdt_u5_hf",
"tes_neg_rdt_5_14_hf",
"tes_pos_rdt_5_14_hf",
"tes_neg_rdt_ov15_hf",
"tes_pos_rdt_ov15_hf",
],
),
test_com=lambda x: fallback_row_sum(
x,
[
"tes_neg_rdt_u5_com",
"tes_pos_rdt_u5_com",
"tes_neg_rdt_5_14_com",
"tes_pos_rdt_5_14_com",
"tes_neg_rdt_ov15_com",
"tes_pos_rdt_ov15_com",
],
),
# confirmed cases (HF and COM)
conf_hf=lambda x: fallback_row_sum(
x,
[
"test_pos_mic_u5_hf",
"test_pos_mic_5_14_hf",
"test_pos_mic_ov15_hf",
"tes_pos_rdt_u5_hf",
"tes_pos_rdt_5_14_hf",
"tes_pos_rdt_ov15_hf",
],
),
conf_com=lambda x: fallback_row_sum(
x,
[
"tes_pos_rdt_u5_com",
"tes_pos_rdt_5_14_com",
"tes_pos_rdt_ov15_com",
],
),
# treated cases
maltreat_com=lambda x: fallback_row_sum(
x,
[
"maltreat_u24_u5_com",
"maltreat_ov24_u5_com",
"maltreat_u24_5_14_com",
"maltreat_ov24_5_14_com",
"maltreat_u24_ov15_com",
"maltreat_ov24_ov15_com",
],
),
maltreat_hf=lambda x: fallback_row_sum(
x,
[
"maltreat_u24_u5_hf",
"maltreat_ov24_u5_hf",
"maltreat_u24_5_14_hf",
"maltreat_ov24_5_14_hf",
"maltreat_u24_ov15_hf",
"maltreat_ov24_ov15_hf",
],
),
# malaria admissions
maladm=lambda x: fallback_row_sum(
x, ["maladm_u5", "maladm_5_14", "maladm_ov15"]
),
# malaria deaths
maldth=lambda x: fallback_row_sum(
x,
[
"maldth_u5",
"maldth_1_59m",
"maldth_10_14",
"maldth_5_9",
"maldth_5_14",
"maldth_ov15",
"maldth_fem_ov15",
"maldth_mal_ov15",
],
),
# AGE-GROUP SPECIFIC AGGREGATIONS
# Tested cases by age group (HF only)
test_hf_u5=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_u5_hf",
"test_pos_mic_u5_hf",
"tes_neg_rdt_u5_hf",
"tes_pos_rdt_u5_hf",
],
),
test_hf_5_14=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_5_14_hf",
"test_pos_mic_5_14_hf",
"tes_neg_rdt_5_14_hf",
"tes_pos_rdt_5_14_hf",
],
),
test_hf_ov15=lambda x: fallback_row_sum(
x,
[
"test_neg_mic_ov15_hf",
"test_pos_mic_ov15_hf",
"tes_neg_rdt_ov15_hf",
"tes_pos_rdt_ov15_hf",
],
),
# Tested cases by age group (Community only)
test_com_u5=lambda x: fallback_row_sum(
x, ["tes_neg_rdt_u5_com", "tes_pos_rdt_u5_com"]
),
test_com_5_14=lambda x: fallback_row_sum(
x, ["tes_neg_rdt_5_14_com", "tes_pos_rdt_5_14_com"]
),
test_com_ov15=lambda x: fallback_row_sum(
x, ["tes_neg_rdt_ov15_com", "tes_pos_rdt_ov15_com"]
),
# Suspected cases by age group (rename for consistency)
susp_hf_u5=lambda x: x["susp_u5_hf"],
susp_hf_5_14=lambda x: x["susp_5_14_hf"],
susp_hf_ov15=lambda x: x["susp_ov15_hf"],
susp_com_u5=lambda x: x["susp_u5_com"],
susp_com_5_14=lambda x: x["susp_5_14_com"],
susp_com_ov15=lambda x: x["susp_ov15_com"],
# Confirmed cases by age group (HF only)
conf_hf_u5=lambda x: fallback_row_sum(
x, ["test_pos_mic_u5_hf", "tes_pos_rdt_u5_hf"]
),
conf_hf_5_14=lambda x: fallback_row_sum(
x, ["test_pos_mic_5_14_hf", "tes_pos_rdt_5_14_hf"]
),
conf_hf_ov15=lambda x: fallback_row_sum(
x, ["test_pos_mic_ov15_hf", "tes_pos_rdt_ov15_hf"]
),
# Confirmed cases by age group (Community only)
conf_com_u5=lambda x: x["tes_pos_rdt_u5_com"],
conf_com_5_14=lambda x: x["tes_pos_rdt_5_14_com"],
conf_com_ov15=lambda x: x["tes_pos_rdt_ov15_com"],
# Treated cases by age group (HF only)
maltreat_hf_u5=lambda x: fallback_row_sum(
x, ["maltreat_u24_u5_hf", "maltreat_ov24_u5_hf"]
),
maltreat_hf_5_14=lambda x: fallback_row_sum(
x, ["maltreat_u24_5_14_hf", "maltreat_ov24_5_14_hf"]
),
maltreat_hf_ov15=lambda x: fallback_row_sum(
x, ["maltreat_u24_ov15_hf", "maltreat_ov24_ov15_hf"]
),
# Treated cases by age group (Community only)
maltreat_com_u5=lambda x: fallback_row_sum(
x, ["maltreat_u24_u5_com", "maltreat_ov24_u5_com"]
),
maltreat_com_5_14=lambda x: fallback_row_sum(
x, ["maltreat_u24_5_14_com", "maltreat_ov24_5_14_com"]
),
maltreat_com_ov15=lambda x: fallback_row_sum(
x, ["maltreat_u24_ov15_com", "maltreat_ov24_ov15_com"]
),
# Total treated cases within/after 24 hours
maltreat_u24_hf=lambda x: fallback_row_sum(
x,
["maltreat_u24_u5_hf", "maltreat_u24_5_14_hf", "maltreat_u24_ov15_hf"],
),
maltreat_ov24_hf=lambda x: fallback_row_sum(
x,
["maltreat_ov24_u5_hf", "maltreat_ov24_5_14_hf", "maltreat_ov24_ov15_hf"],
),
maltreat_u24_com=lambda x: fallback_row_sum(
x,
[
"maltreat_u24_u5_com",
"maltreat_u24_5_14_com",
"maltreat_u24_ov15_com",
],
),
maltreat_ov24_com=lambda x: fallback_row_sum(
x,
[
"maltreat_ov24_u5_com",
"maltreat_ov24_5_14_com",
"maltreat_ov24_ov15_com",
],
),
)
.assign(
# second pass: computed from first pass columns
test=lambda x: fallback_row_sum(x, ["test_hf", "test_com"]),
conf=lambda x: fallback_row_sum(x, ["conf_hf", "conf_com"]),
maltreat=lambda x: fallback_row_sum(x, ["maltreat_hf", "maltreat_com"]),
# totals by age group
test_u5=lambda x: fallback_row_sum(x, ["test_hf_u5", "test_com_u5"]),
test_5_14=lambda x: fallback_row_sum(x, ["test_hf_5_14", "test_com_5_14"]),
test_ov15=lambda x: fallback_row_sum(x, ["test_hf_ov15", "test_com_ov15"]),
susp_u5=lambda x: fallback_row_sum(x, ["susp_hf_u5", "susp_com_u5"]),
susp_5_14=lambda x: fallback_row_sum(x, ["susp_hf_5_14", "susp_com_5_14"]),
susp_ov15=lambda x: fallback_row_sum(x, ["susp_hf_ov15", "susp_com_ov15"]),
conf_u5=lambda x: fallback_row_sum(x, ["conf_hf_u5", "conf_com_u5"]),
conf_5_14=lambda x: fallback_row_sum(x, ["conf_hf_5_14", "conf_com_5_14"]),
conf_ov15=lambda x: fallback_row_sum(x, ["conf_hf_ov15", "conf_com_ov15"]),
maltreat_u5=lambda x: fallback_row_sum(
x, ["maltreat_hf_u5", "maltreat_com_u5"]
),
maltreat_5_14=lambda x: fallback_row_sum(
x, ["maltreat_hf_5_14", "maltreat_com_5_14"]
),
maltreat_ov15=lambda x: fallback_row_sum(
x, ["maltreat_hf_ov15", "maltreat_com_ov15"]
),
maltreat_u24_total=lambda x: fallback_row_sum(
x, ["maltreat_u24_hf", "maltreat_u24_com"]
),
maltreat_ov24_total=lambda x: fallback_row_sum(
x, ["maltreat_ov24_hf", "maltreat_ov24_com"]
),
# presumed cases
pres_com=lambda x: fallback_diff(x, "maltreat_com", "conf_com"),
pres_hf=lambda x: fallback_diff(x, "maltreat_hf", "conf_hf"),
pres_com_u5=lambda x: fallback_diff(x, "maltreat_com_u5", "conf_com_u5"),
pres_com_5_14=lambda x: fallback_diff(x, "maltreat_com_5_14", "conf_com_5_14"),
pres_com_ov15=lambda x: fallback_diff(x, "maltreat_com_ov15", "conf_com_ov15"),
pres_hf_u5=lambda x: fallback_diff(x, "maltreat_hf_u5", "conf_hf_u5"),
pres_hf_5_14=lambda x: fallback_diff(x, "maltreat_hf_5_14", "conf_hf_5_14"),
pres_hf_ov15=lambda x: fallback_diff(x, "maltreat_hf_ov15", "conf_hf_ov15"),
)
.assign(
# third pass: totals from presumed
pres=lambda x: fallback_row_sum(x, ["pres_com", "pres_hf"]),
pres_u5=lambda x: fallback_row_sum(x, ["pres_com_u5", "pres_hf_u5"]),
pres_5_14=lambda x: fallback_row_sum(x, ["pres_com_5_14", "pres_hf_5_14"]),
pres_ov15=lambda x: fallback_row_sum(x, ["pres_com_ov15", "pres_hf_ov15"]),
)
)
# inspect results
(
dhis2_df[
dhis2_df["record_id"].isin(
["a80a6352", "aa2eb756", "c8c29999", "eb63621c", "27ee9eeb", "93b16cfe"]
)
][
[
"allout",
"allout_u5",
"allout_ov5",
"pres_hf_u5",
"maltreat_hf_u5",
"conf_hf_u5",
]
]
.sort_values("allout")
.head(6)
)
#### Step 6.2: Quality Control of Indicator Totals -----------------------------
# create mismatch flags for each indicator group
# helper function to count mismatches, ignoring NAs
def count_mismatch(total, components):
"""Count mismatches between total and sum of components, ignoring NAs."""
calculated = components.sum(axis=1)
# only compare where both total and calculated are not NA
mask = total.notna() & calculated.notna()
return (total[mask] != calculated[mask]).sum()
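A sanity check of `count_mismatch` on a three-row toy frame: one coherent row, one mismatching row, and one row whose total is NA (skipped by the mask):

```python
import numpy as np
import pandas as pd

def count_mismatch(total, components):
    """Count mismatches between total and sum of components, ignoring NAs."""
    calculated = components.sum(axis=1)
    # only compare where both total and calculated are not NA
    mask = total.notna() & calculated.notna()
    return (total[mask] != calculated[mask]).sum()

toy = pd.DataFrame({
    "total": [10.0, 9.0, np.nan],
    "u5": [4, 4, 4],
    "ov5": [6, 6, 6],
})
print(count_mismatch(toy["total"], toy[["u5", "ov5"]]))  # 1
```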
mismatch_summary = pd.DataFrame({
# outpatient checks
"allout_mismatch": [
count_mismatch(
dhis2_df["allout"],
dhis2_df[["allout_u5", "allout_ov5"]]
)
],
# malaria admissions check
"maladm_mismatch": [
count_mismatch(
dhis2_df["maladm"],
dhis2_df[["maladm_u5", "maladm_5_14", "maladm_ov15"]]
)
],
# total tests check
"test_mismatch": [
count_mismatch(
dhis2_df["test"],
dhis2_df[["test_hf", "test_com"]]
)
],
# total confirmed check
"conf_mismatch": [
count_mismatch(
dhis2_df["conf"],
dhis2_df[["conf_hf", "conf_com"]]
)
],
# total treated check
"maltreat_mismatch": [
count_mismatch(
dhis2_df["maltreat"],
dhis2_df[["maltreat_hf", "maltreat_com"]]
)
],
# total presumed check
"pres_mismatch": [
count_mismatch(
dhis2_df["pres"],
dhis2_df[["pres_hf", "pres_com"]]
)
],
# malaria deaths check
"maldth_mismatch": [
count_mismatch(
dhis2_df["maldth"],
dhis2_df[[
"maldth_1_59m", "maldth_u5", "maldth_5_9", "maldth_10_14",
"maldth_5_14", "maldth_fem_ov15", "maldth_mal_ov15", "maldth_ov15"
]]
)
],
# tested cases by age group
"test_u5_mismatch": [
count_mismatch(
dhis2_df["test_u5"],
dhis2_df[["test_hf_u5", "test_com_u5"]]
)
],
"test_5_14_mismatch": [
count_mismatch(
dhis2_df["test_5_14"],
dhis2_df[["test_hf_5_14", "test_com_5_14"]]
)
],
"test_ov15_mismatch": [
count_mismatch(
dhis2_df["test_ov15"],
dhis2_df[["test_hf_ov15", "test_com_ov15"]]
)
],
# confirmed cases by age group
"conf_u5_mismatch": [
count_mismatch(
dhis2_df["conf_u5"],
dhis2_df[["conf_hf_u5", "conf_com_u5"]]
)
],
"conf_5_14_mismatch": [
count_mismatch(
dhis2_df["conf_5_14"],
dhis2_df[["conf_hf_5_14", "conf_com_5_14"]]
)
],
"conf_ov15_mismatch": [
count_mismatch(
dhis2_df["conf_ov15"],
dhis2_df[["conf_hf_ov15", "conf_com_ov15"]]
)
],
# presumed cases by age group
"pres_u5_mismatch": [
count_mismatch(
dhis2_df["pres_u5"],
dhis2_df[["pres_hf_u5", "pres_com_u5"]]
)
],
"pres_5_14_mismatch": [
count_mismatch(
dhis2_df["pres_5_14"],
dhis2_df[["pres_hf_5_14", "pres_com_5_14"]]
)
],
"pres_ov15_mismatch": [
count_mismatch(
dhis2_df["pres_ov15"],
dhis2_df[["pres_hf_ov15", "pres_com_ov15"]]
)
],
# suspected cases by age group
"susp_u5_mismatch": [
count_mismatch(
dhis2_df["susp_u5"],
dhis2_df[["susp_hf_u5", "susp_com_u5"]]
)
],
"susp_5_14_mismatch": [
count_mismatch(
dhis2_df["susp_5_14"],
dhis2_df[["susp_hf_5_14", "susp_com_5_14"]]
)
],
"susp_ov15_mismatch": [
count_mismatch(
dhis2_df["susp_ov15"],
dhis2_df[["susp_hf_ov15", "susp_com_ov15"]]
)
],
# treated cases by age group
"maltreat_u5_mismatch": [
count_mismatch(
dhis2_df["maltreat_u5"],
dhis2_df[["maltreat_hf_u5", "maltreat_com_u5"]]
)
],
"maltreat_5_14_mismatch": [
count_mismatch(
dhis2_df["maltreat_5_14"],
dhis2_df[["maltreat_hf_5_14", "maltreat_com_5_14"]]
)
],
"maltreat_ov15_mismatch": [
count_mismatch(
dhis2_df["maltreat_ov15"],
dhis2_df[["maltreat_hf_ov15", "maltreat_com_ov15"]]
)
],
# treatment timing checks
"maltreat_u24_mismatch": [
count_mismatch(
dhis2_df["maltreat_u24_total"],
dhis2_df[["maltreat_u24_hf", "maltreat_u24_com"]]
)
],
"maltreat_ov24_mismatch": [
count_mismatch(
dhis2_df["maltreat_ov24_total"],
dhis2_df[["maltreat_ov24_hf", "maltreat_ov24_com"]]
)
]
})
# pivot to long format for easier viewing
mismatch_summary = (
mismatch_summary
.melt(var_name="indicator", value_name="n_mismatches")
# filter to show only indicators with mismatches
.query("n_mismatches > 0")
)
mismatch_summary
#### Step 6.3: Export Rows with Incoherent Totals ------------------------------
# identify rows with incoherent totals
incoherent_rows = dhis2_df[
# outpatient checks
(dhis2_df["allout"] != (dhis2_df["allout_u5"] + dhis2_df["allout_ov5"])) |
# malaria admissions check
(dhis2_df["maladm"] != (dhis2_df["maladm_u5"] + dhis2_df["maladm_5_14"] + dhis2_df["maladm_ov15"])) |
# total tests check
(dhis2_df["test"] != (dhis2_df["test_hf"] + dhis2_df["test_com"])) |
# total confirmed check
(dhis2_df["conf"] != (dhis2_df["conf_hf"] + dhis2_df["conf_com"])) |
# total treated check
(dhis2_df["maltreat"] != (dhis2_df["maltreat_hf"] + dhis2_df["maltreat_com"])) |
# total presumed check
(dhis2_df["pres"] != (dhis2_df["pres_hf"] + dhis2_df["pres_com"]))
]
# select relevant columns
incoherent_rows = incoherent_rows[
["hf", "periodname"] +
[col for col in incoherent_rows.columns
if any(x in col for x in ["allout", "maladm", "test", "conf", "maltreat", "pres"])]
]
# set path for saving
output_file = Path(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_incoherent_totals_dhis2.xlsx"
)
# export to xlsx
incoherent_rows.to_excel(output_file, index=False)
#### Step 6.4: Add IPD/OPD Specification ---------------------------------------
# option 1: classify based on facility name patterns
def classify_facility(hf_name):
"""Classify facility as IPD or OPD based on name patterns."""
hf_lower = hf_name.lower() if pd.notna(hf_name) else ""
# inpatient facilities (hospitals)
if any(x in hf_lower for x in ["hospital", "hosp"]):
return "IPD"
# outpatient facilities (clinics, health posts, community health)
elif any(x in hf_lower for x in ["chp", "chc", "mchp", "clinic", "health post", "health centre"]):
return "OPD"
# default to OPD
else:
return "OPD"
dhis2_df["facility_type"] = dhis2_df["hf"].apply(classify_facility)
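The name-based rule can be spot-checked on a few illustrative names; missing names fall through to OPD:

```python
import pandas as pd

def classify_facility(hf_name):
    """Classify a facility as IPD or OPD from its name."""
    hf_lower = hf_name.lower() if pd.notna(hf_name) else ""
    # inpatient facilities (hospitals)
    if any(x in hf_lower for x in ["hospital", "hosp"]):
        return "IPD"
    # everything else, including missing names, defaults to OPD
    return "OPD"

names = pd.Series(["BO GOVT HOSPITAL", "AETHEL CHP", None])
print(names.apply(classify_facility).tolist())  # ['IPD', 'OPD', 'OPD']
```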
# option 2: classify based on presence of inpatient indicators
has_ipd = (
    dhis2_df.assign(
        flag=lambda x: (x["maladm"].fillna(0) > 0) | (x["maldth"].fillna(0) > 0)
    )
    .groupby("hf_uid")["flag"]
    .any()
    .reset_index(name="has_ipd")
)
dhis2_df = dhis2_df.merge(has_ipd, on="hf_uid", how="left")
dhis2_df["facility_type"] = dhis2_df["has_ipd"].map({True: "IPD", False: "OPD"})
dhis2_df = dhis2_df.drop(columns="has_ipd")
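The ever-reports-admissions logic reduces to a grouped `any()` over a per-row flag; a toy check (facility ids are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "hf_uid": ["a1", "a1", "b2", "b2"],
    "maladm": [0, 3, np.nan, 0],
})
# a facility is IPD if any month reports admissions > 0
has_ipd = (
    toy.assign(flag=lambda x: x["maladm"].fillna(0) > 0)
    .groupby("hf_uid")["flag"]
    .any()
)
print(has_ipd.to_dict())  # {'a1': True, 'b2': False}
```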
# option 3: use a reference lookup file
facility_lookup = pd.read_excel(
Path("path/to/facility_classification.xlsx")
)
dhis2_df = dhis2_df.merge(
facility_lookup[["hf_uid", "facility_type"]],
on="hf_uid",
how="left"
)
# check classification distribution
dhis2_df[["hf_uid", "facility_type"]].drop_duplicates()["facility_type"].value_counts()
### Step 7: Finalize Data ------------------------------------------------------
#### Step 7.1: Troubleshoot Duplicate HF-Month Records with Different Data -----
# identify duplicate hf-month combinations
duplicates = dhis2_df.groupby(["hf_uid", "yearmon"]).filter(lambda x: len(x) > 1)
# count number of duplicate pairs
n_duplicates = duplicates[["hf_uid", "yearmon"]].drop_duplicates().shape[0]
# if duplicates exist, inspect them
if n_duplicates > 0:
# view duplicate records with key indicators
duplicate_details = duplicates[
["hf_uid", "yearmon", "record_id", "allout", "test", "conf", "maltreat"]
].sort_values(["hf_uid", "yearmon"])
print(duplicate_details)
# export for review with SNT team
duplicate_details.to_excel(
Path(
"1.1.2_epidemiology",
"1.1.2a_routine_surveillance",
"processed",
"sle_duplicate_records_dhis2.xlsx",
),
index=False,
)
# choose ONE of the options below based on what the review shows
# option 1: keep first record (if duplicates are exact copies)
dhis2_df = dhis2_df.drop_duplicates(subset=["hf_uid", "yearmon"], keep="first")
# option 2: keep record with most complete data
dhis2_df = (
dhis2_df.assign(
n_complete=lambda x: x.select_dtypes(include="number").notna().sum(axis=1)
)
.sort_values("n_complete", ascending=False)
.drop_duplicates(subset=["record_id"], keep="first")
.drop(columns="n_complete")
)
# option 3: sum duplicates (if records represent partial reports)
# year and month must be grouping keys, otherwise they would be summed too
group_cols = ["hf_uid", "yearmon", "hf", "adm0", "adm1", "adm2", "adm3", "year", "month"]
numeric_cols = [c for c in dhis2_df.select_dtypes(include="number").columns if c not in group_cols]
# note: columns outside group_cols and numeric_cols (e.g. record_id, date) are dropped here
dhis2_df = dhis2_df.groupby(group_cols, as_index=False)[numeric_cols].sum()
# verify duplicates resolved (expect 0 remaining)
n_remaining = dhis2_df.duplicated(subset=["hf_uid", "yearmon"]).sum()
print(f"Remaining duplicate hf-month rows: {n_remaining}")
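To see how the options differ, consider a hypothetical pair of conflicting reports for one facility-month (values invented for illustration):

```python
import pandas as pd

# two conflicting reports for the same facility-month
dup = pd.DataFrame({
    "hf_uid": ["x", "x"],
    "yearmon": ["2021-05", "2021-05"],
    "conf": [10, 7],
})

# option 1: keep the first record
first = dup.drop_duplicates(subset=["hf_uid", "yearmon"], keep="first")

# option 3: treat them as partial reports and sum
summed = dup.groupby(["hf_uid", "yearmon"], as_index=False)["conf"].sum()

print(first["conf"].tolist(), summed["conf"].tolist())  # [10] [17]
```

The two choices imply different case counts (10 vs 17 here), which is why duplicates should be reviewed with the SNT team before picking a rule.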
#### Step 7.2: Generate Final Data Dictionary ----------------------------------
# define descriptions for computed/derived columns
computed_vars = pd.DataFrame([
# time columns
{"snt_var": "date", "indicator_label": "Report date (YYYY-MM-DD)"},
{"snt_var": "year", "indicator_label": "Report year"},
{"snt_var": "month", "indicator_label": "Report month (1-12)"},
{"snt_var": "yearmon", "indicator_label": "Year-month label (e.g., Jan 2020)"},
# identifier columns
{"snt_var": "hf_uid", "indicator_label": "Unique health facility identifier (hash)"},
{"snt_var": "record_id", "indicator_label": "Unique record identifier (facility + month hash)"},
{"snt_var": "location_short", "indicator_label": "Location label: adm1 ~ adm2"},
{"snt_var": "location_full", "indicator_label": "Location label: adm1 ~ adm2 ~ adm3 ~ hf"},
{"snt_var": "facility_type", "indicator_label": "Facility type (IPD/OPD)"},
# aggregated totals
{"snt_var": "allout", "indicator_label": "Total outpatient visits (allout_u5 + allout_ov5)"},
{"snt_var": "susp", "indicator_label": "Total suspected cases (all ages, HF + COM)"},
{"snt_var": "test", "indicator_label": "Total tested (test_hf + test_com)"},
{"snt_var": "test_hf", "indicator_label": "Total tested at health facility"},
{"snt_var": "test_com", "indicator_label": "Total tested in community"},
{"snt_var": "conf", "indicator_label": "Total confirmed cases (conf_hf + conf_com)"},
{"snt_var": "conf_hf", "indicator_label": "Total confirmed cases at health facility"},
{"snt_var": "conf_com", "indicator_label": "Total confirmed cases in community"},
{"snt_var": "maltreat", "indicator_label": "Total treated cases (maltreat_hf + maltreat_com)"},
{"snt_var": "maltreat_hf", "indicator_label": "Total treated cases at health facility"},
{"snt_var": "maltreat_com", "indicator_label": "Total treated cases in community"},
{"snt_var": "pres", "indicator_label": "Total presumed cases (pres_hf + pres_com)"},
{"snt_var": "pres_hf", "indicator_label": "Total presumed cases at health facility"},
{"snt_var": "pres_com", "indicator_label": "Total presumed cases in community"},
{"snt_var": "maladm", "indicator_label": "Total malaria admissions (all ages)"},
{"snt_var": "maldth", "indicator_label": "Total malaria deaths (all ages)"},
# age-group totals
{"snt_var": "test_u5", "indicator_label": "Total tested under 5 (HF + COM)"},
{"snt_var": "test_5_14", "indicator_label": "Total tested 5-14 years (HF + COM)"},
{"snt_var": "test_ov15", "indicator_label": "Total tested over 15 years (HF + COM)"},
{"snt_var": "conf_u5", "indicator_label": "Confirmed cases under 5 (HF + COM)"},
{"snt_var": "conf_5_14", "indicator_label": "Confirmed cases 5-14 years (HF + COM)"},
{"snt_var": "conf_ov15", "indicator_label": "Confirmed cases over 15 years (HF + COM)"},
{"snt_var": "maltreat_u5", "indicator_label": "Treated cases under 5 (HF + COM)"},
{"snt_var": "maltreat_5_14", "indicator_label": "Treated cases 5-14 years (HF + COM)"},
{"snt_var": "maltreat_ov15", "indicator_label": "Treated cases over 15 years (HF + COM)"},
{"snt_var": "pres_u5", "indicator_label": "Presumed cases under 5 (HF + COM)"},
{"snt_var": "pres_5_14", "indicator_label": "Presumed cases 5-14 years (HF + COM)"},
{"snt_var": "pres_ov15", "indicator_label": "Presumed cases over 15 years (HF + COM)"},
# add remaining age-group variables as needed...
])
# combine original and computed dictionaries
full_data_dict = (
pd.concat([data_dict, computed_vars], ignore_index=True)
.rename(columns={"snt_var": "snt_variable", "indicator_label": "label"})
[["snt_variable", "label"]]
)
# keep only columns present in final dataset
full_data_dict = (
full_data_dict[full_data_dict["snt_variable"].isin(dhis2_df.columns)]
.sort_values("snt_variable")
)
# check
full_data_dict.head(10)
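A useful sanity check before saving is that every column in the final dataset has a dictionary entry. A sketch with stand-in frames (`df` and `d` are illustrative substitutes for `dhis2_df` and `full_data_dict`):

```python
import pandas as pd

# stand-ins for the final dataset and its dictionary
df = pd.DataFrame({"conf": [1], "test": [2], "newcol": [3]})
d = pd.DataFrame({"snt_variable": ["conf", "test"], "label": ["Confirmed", "Tested"]})

# columns present in the data but missing from the dictionary
missing = sorted(set(df.columns) - set(d["snt_variable"]))
print(missing)  # ['newcol']
```

Any names reported here should either get a label added to `computed_vars` or be dropped from the dataset.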
#### Step 7.3: Arrange Final Column Order --------------------------------------
# define column order
id_cols = ["record_id"]
location_cols = ["adm0", "adm1", "adm2", "adm3", "hf", "hf_uid",
"location_short", "location_full", "facility_type"]
time_cols = ["date", "yearmon", "year", "month"]
# get remaining columns in original order
ordered_cols = id_cols + location_cols + time_cols
remaining_cols = [col for col in dhis2_df.columns if col not in ordered_cols]
# reorder dataframe
dhis2_df = dhis2_df[ordered_cols + remaining_cols]
# check column order
print(dhis2_df.columns[:20].tolist())
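The same front-loading pattern, shown on a toy frame (column names illustrative): listed columns come first, everything else keeps its original order.

```python
import pandas as pd

df = pd.DataFrame({"conf": [1], "record_id": ["r1"], "adm1": ["BO"]})
front = ["record_id", "adm1"]
rest = [c for c in df.columns if c not in front]
df = df[front + rest]
print(df.columns.tolist())  # ['record_id', 'adm1', 'conf']
```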
### Step 8: Aggregate and Save Data --------------------------------------------
#### Step 8.1: Save Data at the Health Facility Level --------------------------
save_path = Path(
here("1.1.2_epidemiology/1.1.2a_routine_surveillance/processed")
)
# save data to xlsx
dhis2_df.to_excel(
save_path / "sle_dhis2_health_facility_data.xlsx",
index=False
)
# save dictionary to xlsx
full_data_dict.to_excel(
save_path / "sle_dhis2_health_facility_dict.xlsx",
index=False
)
# save to parquet
dhis2_df.to_parquet(
save_path / "sle_dhis2_health_facility_data.parquet",
compression="zstd",
index=False
)
#### Step 8.2: Aggregate and Save Data at Each Admin Unit Level ----------------
# define numeric columns to sum (excluding HF-level identifiers)
sum_cols = [
# totals
"allout", "susp", "test", "conf", "pres", "maltreat", "maladm", "maldth",
# by location
"test_hf", "test_com", "conf_hf", "conf_com",
"maltreat_hf", "maltreat_com", "pres_hf", "pres_com",
# by age group - totals
"allout_u5", "allout_ov5",
"test_u5", "test_5_14", "test_ov15",
"conf_u5", "conf_5_14", "conf_ov15",
"maltreat_u5", "maltreat_5_14", "maltreat_ov15",
"pres_u5", "pres_5_14", "pres_ov15",
"susp_u5", "susp_5_14", "susp_ov15",
"maladm_u5", "maladm_5_14", "maladm_ov15",
"maldth_u5", "maldth_5_14", "maldth_ov15",
# by age and location
"test_hf_u5", "test_hf_5_14", "test_hf_ov15",
"test_com_u5", "test_com_5_14", "test_com_ov15",
"conf_hf_u5", "conf_hf_5_14", "conf_hf_ov15",
"conf_com_u5", "conf_com_5_14", "conf_com_ov15",
"maltreat_hf_u5", "maltreat_hf_5_14", "maltreat_hf_ov15",
"maltreat_com_u5", "maltreat_com_5_14", "maltreat_com_ov15",
"pres_hf_u5", "pres_hf_5_14", "pres_hf_ov15",
"pres_com_u5", "pres_com_5_14", "pres_com_ov15",
# treatment timing
"maltreat_u24_hf", "maltreat_ov24_hf",
"maltreat_u24_com", "maltreat_ov24_com",
"maltreat_u24_total", "maltreat_ov24_total"
]
# function to aggregate at a given admin level
def aggregate_admin(df, group_cols, sum_cols):
# filter to columns that exist in dataframe
existing_sum_cols = [c for c in sum_cols if c in df.columns]
# aggregate
agg_df = (
df
.groupby(group_cols, as_index=False)
.agg(
**{col: (col, "sum") for col in existing_sum_cols},
n_facilities=("hf_uid", "nunique")  # distinct facilities per group
)
)
return agg_df
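As a quick check of the helper, a toy run (data invented; a condensed copy of the function is included so the sketch is self-contained, counting distinct facilities with `nunique`; note that `"test"` is requested in `sum_cols` but absent from the frame, so it is silently skipped):

```python
import pandas as pd

def aggregate_admin(df, group_cols, sum_cols):
    # keep only the requested sum columns that actually exist
    existing = [c for c in sum_cols if c in df.columns]
    return df.groupby(group_cols, as_index=False).agg(
        **{c: (c, "sum") for c in existing},
        n_facilities=("hf_uid", "nunique"),
    )

toy = pd.DataFrame({
    "adm1": ["BO", "BO", "BO"],
    "yearmon": ["2021-01", "2021-01", "2021-02"],
    "hf_uid": ["a", "b", "a"],
    "conf": [10, 5, 20],
})
out = aggregate_admin(toy, ["adm1", "yearmon"], ["conf", "test"])
print(out)  # Jan: conf 15 from 2 facilities; Feb: conf 20 from 1
```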
# aggregate at adm3 level
group_cols_adm3 = ["adm0", "adm1", "adm2", "adm3", "year", "month", "yearmon"]
dhis2_adm3 = aggregate_admin(dhis2_df, group_cols_adm3, sum_cols)
dhis2_adm3["record_id"] = vdigest(
dhis2_adm3["adm0"] + " " + dhis2_adm3["adm1"] + " " +
dhis2_adm3["adm2"] + " " + dhis2_adm3["adm3"] + " " +
dhis2_adm3["yearmon"].astype(str)
)
dhis2_adm3["location_short"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"]
dhis2_adm3["location_full"] = dhis2_adm3["adm1"] + " ~ " + dhis2_adm3["adm2"] + " ~ " + dhis2_adm3["adm3"]
# aggregate at adm2 level
group_cols_adm2 = ["adm0", "adm1", "adm2", "year", "month", "yearmon"]
dhis2_adm2 = aggregate_admin(dhis2_df, group_cols_adm2, sum_cols)
dhis2_adm2["record_id"] = vdigest(
dhis2_adm2["adm0"] + " " + dhis2_adm2["adm1"] + " " +
dhis2_adm2["adm2"] + " " + dhis2_adm2["yearmon"].astype(str)
)
dhis2_adm2["location_short"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
dhis2_adm2["location_full"] = dhis2_adm2["adm1"] + " ~ " + dhis2_adm2["adm2"]
# aggregate at adm1 level
group_cols_adm1 = ["adm0", "adm1", "year", "month", "yearmon"]
dhis2_adm1 = aggregate_admin(dhis2_df, group_cols_adm1, sum_cols)
dhis2_adm1["record_id"] = vdigest(
dhis2_adm1["adm0"] + " " + dhis2_adm1["adm1"] + " " +
dhis2_adm1["yearmon"].astype(str)
)
dhis2_adm1["location_short"] = dhis2_adm1["adm1"]
dhis2_adm1["location_full"] = dhis2_adm1["adm1"]
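`vdigest` comes from the earlier setup steps. For readers skipping ahead, a hypothetical stand-in built on `hashlib` behaves similarly (assumption: `vdigest` maps an array of strings to stable hex digests; this is an illustrative sketch, not the pipeline's definition):

```python
import hashlib
import numpy as np
import pandas as pd

# hypothetical stand-in for the vectorized digest used above
vdigest = np.vectorize(lambda s: hashlib.md5(s.encode()).hexdigest())

ids = pd.Series(["SIERRA LEONE BO 2021-01", "SIERRA LEONE BO 2021-02"])
out = vdigest(ids)
print(len(set(out)))  # 2 distinct record ids
```

The point is only that distinct location-month strings map to distinct, reproducible identifiers.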
# create data dictionaries for each level (filter to relevant columns only)
# create data dictionaries for each level (filter to relevant columns only)
# use a new name here so we don't clobber id_cols from Step 7.3
derived_cols = ["n_facilities", "record_id", "location_short", "location_full"]
# adm3 dict: includes adm0, adm1, adm2, adm3
adm3_cols = group_cols_adm3 + [c for c in sum_cols if c in dhis2_adm3.columns] + derived_cols
adm3_dict = full_data_dict[full_data_dict["snt_variable"].isin(adm3_cols)]
# adm2 dict: excludes adm3 (not present at this level)
adm2_cols = group_cols_adm2 + [c for c in sum_cols if c in dhis2_adm2.columns] + derived_cols
adm2_dict = full_data_dict[
(full_data_dict["snt_variable"].isin(adm2_cols)) &
(~full_data_dict["snt_variable"].isin(["adm3"]))
]
# adm1 dict: excludes adm2, adm3 (not present at this level)
adm1_cols = group_cols_adm1 + [c for c in sum_cols if c in dhis2_adm1.columns] + derived_cols
adm1_dict = full_data_dict[
(full_data_dict["snt_variable"].isin(adm1_cols)) &
(~full_data_dict["snt_variable"].isin(["adm2", "adm3"]))
]
# save adm3 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm3_data.xlsx") as writer:
dhis2_adm3.to_excel(writer, sheet_name="data", index=False)
adm3_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm3.to_parquet(save_path / "sle_dhis2_adm3_data.parquet", compression="zstd", index=False)
# save adm2 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm2_data.xlsx") as writer:
dhis2_adm2.to_excel(writer, sheet_name="data", index=False)
adm2_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm2.to_parquet(save_path / "sle_dhis2_adm2_data.parquet", compression="zstd", index=False)
# save adm1 data
with pd.ExcelWriter(save_path / "sle_dhis2_adm1_data.xlsx") as writer:
dhis2_adm1.to_excel(writer, sheet_name="data", index=False)
adm1_dict.to_excel(writer, sheet_name="dictionary", index=False)
dhis2_adm1.to_parquet(save_path / "sle_dhis2_adm1_data.parquet", compression="zstd", index=False)