The DHS workflow turns Demographic and Health Survey (DHS) and Malaria Indicator Survey (MIS) microdata into survey-weighted indicator estimates. Every estimator accounts for the survey design (weights, clustering, stratification), returns long-format admin-stratified tibbles, and ships a machine-readable data dictionary.

The recommended order of operations is: (0) discover what surveys exist and inspect the variables they contain, (1) read the recodes you need, then (2) compute indicators. Step 0 matters - DHS variable names drift between phases and survey types, so you should confirm the variables an indicator depends on are actually present before you build it.

A complete, runnable version of this workflow ships with the package at inst/scripts/example_dhs_analysis.R.

Reading DHS data: pick a path

Every calc_*_dhs() function accepts a plain data frame, so how you read the data is up to you. The walkthrough below uses dhs_read() against a parquet archive because it is the route AHADI uses internally and what run_mbg_pipeline() is built on. It is not the only route.

  • One or a few DHS files (.dta, .csv, .rds, .sav, …) - read each recode with sntutils::read() (or haven::read_dta()) and skip straight to Step 2 below. Most analysts should take this path.

    kr <- sntutils::read("TGKR81FL.DTA")
    ge <- sntutils::read("TGGE8AFL.dta")
    shp_admin <- sf::read_sf("path/to/admin.shp")
    
    fever <- calc_fever_dhs(dhs_kr = kr, gps_data = ge,
                            shapefile = shp_admin,
                            admin_level = c("adm0", "adm1"))

    In this mode you do not need dhs_read(), you do not need a parquet archive, and you do not need any directory layout convention. Step 0 (discovery / dictionary inspection) still works - run make_dhs_raw_dictionary(kr) and list_dhs_var_labels(kr) on the data frame you just loaded.

  • Archive of many surveys (the path used below). Build a hive-partitioned parquet archive in the layout documented in ?dhs_read and use dhs_read() to query it by file type / country / year. The archive is what enables cross-survey discovery and feeds run_mbg_pipeline(). See Build your own archive for a recipe.

library(sntmethods)

country_iso2 <- "TG"
country_iso3 <- "tgo"
path_dhs_parquet <- here::here(dhs_data_path(), "01_data/parquet")

Step 0a: discover available surveys

The GE (geographic) recode is the cheapest way to enumerate the survey years available in your parquet archive for a given country and survey type:

survey_years_dhs <- dhs_read(
  path = path_dhs_parquet, file_type = "GE",
  survey_type = "DHS", country_code = country_iso2
) |>
  dplyr::pull(DHSYEAR) |>
  unique()

survey_years_mis <- dhs_read(
  path = path_dhs_parquet, file_type = "GE",
  survey_type = "MIS", country_code = country_iso2
) |>
  dplyr::pull(DHSYEAR) |>
  unique()
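The two year vectors can be folded into a single inventory table that later steps loop over. This is a minimal base-R sketch: the year values are placeholders standing in for whatever dhs_read() returned, and make_inventory() is a hypothetical helper, not part of sntmethods.

```r
# Placeholder vectors standing in for the dhs_read() results above;
# the years are illustrative, not the contents of any real archive.
survey_years_dhs <- c(2013, 1998, 2013)
survey_years_mis <- c(2017)

# Hypothetical helper: one row per (survey_type, survey_year),
# de-duplicated and sorted. as.numeric() keeps an empty input well-typed.
make_inventory <- function(years, type) {
  years <- sort(unique(as.numeric(years)))
  data.frame(
    survey_type = rep(type, length(years)),
    survey_year = years,
    stringsAsFactors = FALSE
  )
}

survey_inventory <- rbind(
  make_inventory(survey_years_dhs, "DHS"),
  make_inventory(survey_years_mis, "MIS")
)
survey_inventory
```

Each row of survey_inventory can then be passed as the survey_type / survey_year pair when reading the recodes.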

Step 0b: inspect the variables BEFORE building indicators

Run this step before computing any indicator: build the variable list / data dictionary for each recode so you can confirm that the variables the indicator needs are present (and named as expected) in this survey.

make_dhs_raw_dictionary() returns a tidy table of every variable in a raw DHS dataset, with its label - a complete “what’s in this file” listing:

kr <- dhs_read(path = path_dhs_parquet, file_type = "KR",
               survey_type = "DHS", country_code = country_iso2,
               survey_year = 2017)

# Full variable list for the recode (variable name + label + type)
kr_vars <- make_dhs_raw_dictionary(kr)
kr_vars
#> # variable, label, ... - one row per column in the recode

list_dhs_var_labels() is a lighter lookup of variable labels, handy for spot-checking a single candidate variable (e.g. does this survey carry the fever item h22, or the ACT items ml13*?). Call it on the recode you just loaded, as in list_dhs_var_labels(kr).
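For intuition, the same kind of spot check can be done by hand in base R. The sketch below assumes variable labels are stored in a "label" attribute on each column (as haven-style imports do); whether your data carries labels this way depends on how it was read. find_vars() is a hypothetical helper and the labels are toy text, not the official DHS wording.

```r
# Toy stand-in for a KR recode with toy labels attached as attributes.
kr_toy <- data.frame(
  h22   = c(0, 1, 0),
  ml13e = c(0, 0, 1),
  v005  = c(1e6, 2e6, 1.5e6)
)
attr(kr_toy$h22, "label")   <- "toy label: fever in last two weeks"
attr(kr_toy$ml13e, "label") <- "toy label: took ACT for fever"
attr(kr_toy$v005, "label")  <- "toy label: sample weight"

# Hypothetical helper (not part of sntmethods): match column names against
# a regular expression and return the matching names with their labels.
find_vars <- function(data, pattern) {
  hits <- grep(pattern, names(data), value = TRUE)
  labels <- vapply(hits, function(v) {
    lab <- attr(data[[v]], "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  }, character(1))
  data.frame(variable = hits, label = unname(labels),
             stringsAsFactors = FALSE)
}

find_vars(kr_toy, "^h22$")   # exact variable
find_vars(kr_toy, "^ml13")   # whole ml13* family
find_vars(kr_toy, "^hml")    # zero rows: not in this recode
```

A zero-row result is the signal to drop, or adapt, any indicator that depends on that variable for this survey.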

A robust pattern - used in the example script - is to build the dictionary for every recode of every survey up front and save it, so the team can audit variable availability across surveys before any indicator is computed:

recodes <- list(GE = "GE", PR = "PR", HR = "HR", KR = "KR", IR = "IR")

survey_dictionary <- purrr::imap_dfr(recodes, function(ft, nm) {
  dat <- dhs_read(path = path_dhs_parquet, file_type = ft,
                  survey_type = "DHS", country_code = country_iso2,
                  survey_year = 2017)
  if (is.null(dat) || nrow(dat) == 0) return(NULL)
  make_dhs_raw_dictionary(dplyr::slice(dat, 1:2)) |>
    dplyr::mutate(recode = nm, .before = 1)
})
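Once the dictionary is saved, availability can be audited without touching the microdata again. The sketch below assumes the dictionary has a recode column (added above) and a variable column; the required list is illustrative only, since the authoritative variable requirements live in the package's curated dictionaries, and check_indicator() is a hypothetical helper.

```r
# Toy dictionary with the two columns this sketch relies on.
survey_dictionary <- data.frame(
  recode   = c("KR", "KR", "KR", "PR"),
  variable = c("v005", "h22", "v021", "hv005"),
  stringsAsFactors = FALSE
)

# Illustrative requirements only (regex patterns per indicator).
required <- list(
  fever = list(recode = "KR", patterns = c("^v005$", "^h22$")),
  act   = list(recode = "KR", patterns = c("^v005$", "^h22$", "^ml13"))
)

# Hypothetical helper: does the recode contain a match for every pattern?
check_indicator <- function(name, req, dictionary) {
  vars  <- dictionary$variable[dictionary$recode == req$recode]
  found <- vapply(req$patterns, function(p) any(grepl(p, vars)), logical(1))
  data.frame(
    indicator = name,
    recode    = req$recode,
    supported = all(found),
    missing   = paste(req$patterns[!found], collapse = ", "),
    stringsAsFactors = FALSE
  )
}

availability <- do.call(rbind, lapply(names(required), function(nm) {
  check_indicator(nm, required[[nm]], survey_dictionary)
}))
availability
```

In this toy case fever is supported and act is not, because no ml13* variable appears in the KR rows of the dictionary.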

The package also ships curated dictionaries describing the indicators (not just the raw variables) - dhs_dictionary() for the unified view across all domains, and per-domain helpers (itn_dictionary(), act_dictionary(), …). See Methodology & conventions.

Step 1: read the recodes

dhs_read() reads a single survey recode from the partitioned parquet archive, preserving value labels and survey-specific variables:

ge <- dhs_read(path = path_dhs_parquet, file_type = "GE",
               survey_type = "DHS", country_code = country_iso2,
               survey_year = 2017)
# "..." below stands for the same survey_type / country_code / survey_year
# arguments as the GE call above.
pr <- dhs_read(path = path_dhs_parquet, file_type = "PR", ...)
hr <- dhs_read(path = path_dhs_parquet, file_type = "HR", ...)
kr <- dhs_read(path = path_dhs_parquet, file_type = "KR", ...)
ir <- dhs_read(path = path_dhs_parquet, file_type = "IR", ...)

Recodes map to indicator families as follows:

| Recode | Contents | Used by |
|--------|----------|---------|
| GE | GPS / cluster coordinates | survey discovery, cluster geolocation |
| HR | Household | ITN ownership/access, IRS, wealth |
| PR | Person (household roster) | PfPR, anemia, ITN use |
| KR | Children (under 5) | fever, care-seeking, testing, ACT, EPI, SMC, U5MR |
| IR | Women (15-49) | ANC, IPTp |

Step 2: compute indicators

Every calc_*_dhs() function takes the recode(s) it needs plus the GPS recode (gps_data), the admin shapefile, and the admin_level(s) to return:

fever <- calc_fever_dhs(
  dhs_kr = kr, gps_data = ge,
  shapefile = shp_admin, admin_level = c("adm0", "adm1")
)

itn <- calc_itn_dhs(
  dhs_hr = hr, dhs_pr = pr, gps_data = ge,
  shapefile = shp_admin, admin_level = c("adm0", "adm1")
)

pfpr   <- calc_pfpr_dhs(dhs_pr = pr, gps_data = ge, shapefile = shp_admin)
csb    <- calc_csb_dhs(dhs_kr = kr, gps_data = ge, shapefile = shp_admin)
act    <- calc_act_dhs(dhs_kr = kr, gps_data = ge, shapefile = shp_admin)
cm     <- calc_case_management_dhs(dhs_kr = kr, gps_data = ge, shapefile = shp_admin)
iptp   <- calc_iptp_dhs(dhs_ir = ir, gps_data = ge, shapefile = shp_admin)
wealth <- calc_wealth_dhs(dhs_hr = hr, gps_data = ge, shapefile = shp_admin)

# A few take extra arguments:
anemia <- calc_severe_anemia_dhs(dhs_pr = pr, gps_data = ge,
                                 shapefile = shp_admin,
                                 altitude_adjusted = FALSE)
u5mr   <- calc_u5mr_dhs(dhs_kr = kr, period_years = 5,
                        gps_data = ge, shapefile = shp_admin)

Keep direct survey estimates at adm0/adm1. DHS/MIS samples are stratified by region, and the sample with its published weights (v005/hv005) is designed to be representative at adm0 and adm1 only. Going to adm2 steps outside that design: clusters per adm2 are sparse, the weights were not built for adm2 domains, and direct estimates carry unstable variance and wide CIs. For estimates below adm1, use the MBG pipeline instead of direct survey aggregation.
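A toy illustration of why adm2 is fragile: the base-R sketch below computes a weighted point estimate and counts the clusters behind it at each level. All values are made up; adm1 / adm2 stand in for the result of the GPS-to-shapefile join, and weighted_prop() is a hypothetical helper. It produces point estimates only and is not how the calc_*_dhs() functions compute design-based variance.

```r
# Toy child-level data. v005 is the DHS sample weight (stored scaled by
# 1e6) and v001 is the cluster number; fever is a 0/1 outcome.
kids <- data.frame(
  v001  = c(1, 1, 2, 2, 3, 3, 4, 4),
  adm1  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  adm2  = c("A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2"),
  v005  = c(1e6, 1e6, 2e6, 2e6, 5e5, 5e5, 1.5e6, 1.5e6),
  fever = c(1, 0, 1, 1, 0, 0, 1, 0)
)
kids$wt <- kids$v005 / 1e6  # rescale the weight before use

# Hypothetical helper: weighted proportion and number of distinct
# clusters per admin unit.
weighted_prop <- function(data, by) {
  groups <- split(data, data[[by]])
  data.frame(
    unit       = names(groups),
    point      = vapply(groups, function(g) weighted.mean(g$fever, g$wt),
                        numeric(1)),
    n_clusters = vapply(groups, function(g) length(unique(g$v001)),
                        integer(1)),
    row.names  = NULL
  )
}

weighted_prop(kids, "adm1")  # two clusters behind each estimate
weighted_prop(kids, "adm2")  # a single cluster behind each estimate
```

With one cluster per unit there is no between-cluster variation to estimate a variance from, which is the practical face of the design limitation described above.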

Indicator coverage

| Domain | Key indicators | Recode | Function |
|--------|----------------|--------|----------|
| Parasite prevalence | PfPR by RDT/microscopy; age groups (u5, 5-10, u10, 2-10) | PR | calc_pfpr_dhs() |
| ITN | Ownership, access, use (all ages + u5/pregnant/age bands), use-if-access | HR, PR | calc_itn_dhs() |
| IRS | Household spraying coverage | HR | calc_irs_dhs() |
| Fever | Fever prevalence in children under 5 | KR | calc_fever_dhs() |
| Care-seeking | By sector (public, private, CHW, pharmacy, trained, none) | KR | calc_csb_dhs() |
| Malaria testing | RDT/microscopy testing among febrile children | KR | calc_malaria_dx_dhs() |
| Antimalarials | Any antimalarial treatment among febrile U5 | KR | calc_antimalarial_dhs() |
| ACT treatment | ACT receipt by source; ACT among antimalarials / care seekers | KR | calc_act_dhs() |
| Case management | Effective coverage (fever → care → test → treat) | KR | calc_case_management_dhs() |
| ANC | Antenatal care visits (1+/2+/3+/4+/8+) | IR | calc_anc_dhs() |
| IPTp | Intermittent preventive treatment doses (1+/2+/3+/4+) | IR | calc_iptp_dhs() |
| EPI vaccines | BCG, DPT, polio, measles, penta, PCV, rota, IPV, HepB, YF, malaria, fully/zero-dose | KR | calc_epi_dhs() |
| Under-5 mortality | U5MR per 1,000 live births (via DHS.rates) | KR | calc_u5mr_dhs() |
| Anemia | Any, moderate+, severe (children 6-59 months) | PR | calc_severe_anemia_dhs() |
| SMC | Seasonal malaria chemoprevention coverage | KR | calc_smc_dhs() |
| Wealth | Quintile distribution; Gini coefficient (Brown formula) | HR | calc_wealth_dhs() |

Each survey indicator has a spatial counterpart (calc_*_mbg() / prep_*_mbg()) - see Spatial modeling (MBG).

Output shape

Every calc_*_dhs() function returns a named list of tibbles, one per admin level (adm0, adm1, …). Each row is one indicator at one admin unit, with standardised columns:

survey_id, iso3, iso2, survey_type, survey_year,
adm0, [adm1], [adm2], type, geo_source,
point, ci_l, ci_u, numerator, denominator,
indicator, indicator_code, numerator_description,
denominator_description, denominator_code

Because the shape is identical across families, you can stack results and treat a whole survey - or many surveys - as one tidy table:

library(dplyr)
all_adm1 <- bind_rows(
  itn$adm1, pfpr$adm1, fever$adm1, csb$adm1, act$adm1, cm$adm1
)
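To keep every admin level in one table, not just adm1, the per-level lists can be stacked with an identifier column. This is a sketch on toy results that mimic the documented shape with only a few of its columns; the values and indicator codes are invented for illustration.

```r
library(dplyr)

# Toy stand-ins for two calc_*_dhs() results (named lists of tibbles).
fever <- list(
  adm0 = tibble(adm0 = "TGO", indicator_code = "toy_fever", point = 0.21),
  adm1 = tibble(adm0 = "TGO", adm1 = c("A", "B"),
                indicator_code = "toy_fever", point = c(0.18, 0.25))
)
itn <- list(
  adm0 = tibble(adm0 = "TGO", indicator_code = "toy_itn_use", point = 0.60),
  adm1 = tibble(adm0 = "TGO", adm1 = c("A", "B"),
                indicator_code = "toy_itn_use", point = c(0.55, 0.66))
)

# bind_rows() on a named list with .id records the list name (adm0, adm1)
# in a new column; columns absent at a level (adm1 at adm0) become NA.
all_levels <- bind_rows(
  bind_rows(fever, .id = "admin_level"),
  bind_rows(itn,   .id = "admin_level")
)

# One tidy table, filterable by level and indicator.
all_levels |>
  filter(admin_level == "adm1") |>
  arrange(indicator_code, adm1)
```

The admin_level column makes it explicit which rows are national and which are regional, so the two are not mixed in a summary by accident.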

Running across many surveys

For a country with several DHS and MIS rounds, read each survey into a bundle, run every indicator defensively (so a single missing recode or variable doesn’t abort the run), and stack by admin level. The full pattern - including the tryCatch wrappers that skip indicators whose required variables are missing - is in inst/scripts/example_dhs_analysis.R. This is exactly why Step 0 matters: the variable dictionary tells you in advance which indicators a given survey can support.
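The defensive pattern can be sketched in a few lines of base R. This is not the code from the example script: calc_ok() and calc_needs_kr() are hypothetical stand-ins for calc_*_dhs() calls, and run_safely() is a hypothetical wrapper, used here only to show how tryCatch() lets one failed indicator be skipped.

```r
# Hypothetical stand-ins for indicator functions operating on a bundle
# of recodes (a named list).
calc_ok <- function(bundle) {
  data.frame(indicator_code = "toy_ok", point = 0.5)
}
calc_needs_kr <- function(bundle) {
  if (is.null(bundle$kr)) stop("KR recode not available for this survey")
  data.frame(indicator_code = "toy_kr", point = 0.2)
}

# Run one indicator; on error, report it and return NULL instead of
# aborting the whole run.
run_safely <- function(name, fn, bundle) {
  tryCatch(
    fn(bundle),
    error = function(e) {
      message("Skipping ", name, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A survey bundle with no KR recode.
bundle <- list(pr = data.frame(), kr = NULL)

indicators <- list(ok = calc_ok, needs_kr = calc_needs_kr)
results <- Map(run_safely, names(indicators), indicators,
               MoreArgs = list(bundle = bundle))

# Drop the skipped indicators and stack what succeeded.
results <- Filter(Negate(is.null), results)
stacked <- do.call(rbind, results)
stacked
```

The skipped indicator is logged through message() and simply drops out of the stacked table, so the remaining indicators for that survey are still produced.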

See also