The DHS workflow turns Demographic and Health Survey (DHS) and Malaria Indicator Survey (MIS) microdata into survey-weighted indicator estimates. Every estimator accounts for the survey design (weights, clustering, stratification), returns long-format admin-stratified tibbles, and ships a machine-readable data dictionary.
The recommended order of operations is: (0) discover what surveys exist and inspect the variables they contain, (1) read the recodes you need, then (2) compute indicators. Step 0 matters - DHS variable names drift between phases and survey types, so you should confirm the variables an indicator depends on are actually present before you build it.
A complete, runnable version of this workflow ships with the package
at inst/scripts/example_dhs_analysis.R.
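
To make "survey-weighted" concrete, the sketch below computes a weighted fever proportion by hand in base R. It assumes the usual DHS conventions that the weight v005 is stored multiplied by 1,000,000 and that h22 equals 1 for a reported fever. The toy values are made up, and only the point estimate is covered; the calc_*_dhs() functions also use clustering and stratification for their confidence intervals. Treat it as an illustration, not as the package's implementation.

```r
# Toy KR-like data (made-up values): h22 = fever item, v005 = sample weight x 1e6
kr_toy <- data.frame(
  h22  = c(1, 0, 1, 0),
  v005 = c(1500000, 500000, 1000000, 1000000)
)

# DHS weights are stored as integers; rescale before use
w <- kr_toy$v005 / 1e6

# Unweighted vs weighted proportion of children with fever
p_unweighted <- mean(kr_toy$h22 == 1)
p_weighted   <- stats::weighted.mean(kr_toy$h22 == 1, w)

p_unweighted
p_weighted
```

The two numbers differ because the first child carries a larger weight than the second; with real data the gap depends on how unequal the sampling probabilities are.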
Reading DHS data: pick a path
Every calc_*_dhs() function accepts a plain data frame,
so how you read the data is up to you. The walkthrough below uses
dhs_read() against a parquet archive because it is the
route AHADI uses internally and what run_mbg_pipeline() is
built on. It is not the only route.
- One or a few DHS files (.dta, .csv, .rds, .sav, …) - read each recode with sntutils::read() (or haven::read_dta()) and skip straight to Step 2 below. Most analysts should take this path.

  kr <- sntutils::read("TGKR81FL.DTA")
  ge <- sntutils::read("TGGE8AFL.dta")
  shp_admin <- sf::read_sf("path/to/admin.shp")
  fever <- calc_fever_dhs(dhs_kr = kr, gps_data = ge,
                          shapefile = shp_admin, admin_level = c("adm0", "adm1"))

  In this mode you do not need dhs_read(), you do not need a parquet archive, and you do not need any directory layout convention. Step 0 (discovery / dictionary inspection) still works - run make_dhs_raw_dictionary(kr) and list_dhs_var_labels(kr) on the data frame you just loaded.

- Archive of many surveys (the path used below). Build a hive-partitioned parquet archive in the layout documented in ?dhs_read and use dhs_read() to query it by file type / country / year. The archive is what enables cross-survey discovery and feeds run_mbg_pipeline(). See Build your own archive for a recipe.

library(sntmethods)
country_iso2 <- "TG"
country_iso3 <- "tgo"
path_dhs_parquet <- here::here(dhs_data_path(), "01_data/parquet")

Step 0a: discover available surveys
The GE (geographic) recode is the cheapest way to
enumerate the survey years available in your parquet archive for a given
country and survey type:
survey_years_dhs <- dhs_read(
path = path_dhs_parquet, file_type = "GE",
survey_type = "DHS", country_code = country_iso2
) |>
dplyr::pull(DHSYEAR) |>
unique()
survey_years_mis <- dhs_read(
path = path_dhs_parquet, file_type = "GE",
survey_type = "MIS", country_code = country_iso2
) |>
dplyr::pull(DHSYEAR) |>
unique()

Step 0b: inspect the variables BEFORE building indicators
This is the step to run first. Before computing any indicator, build the variable list / data dictionary for each recode so you can confirm the variables that indicator needs are present (and named as expected) in this survey.
make_dhs_raw_dictionary() returns a tidy table of every
variable in a raw DHS dataset, with its label - a complete “what’s in
this file” listing:
kr <- dhs_read(path = path_dhs_parquet, file_type = "KR",
survey_type = "DHS", country_code = country_iso2,
survey_year = 2017)
# Full variable list for the recode (variable name + label + type)
kr_vars <- make_dhs_raw_dictionary(kr)
kr_vars
#> # variable, label, ... - one row per column in the recode

list_dhs_var_labels() is a lighter lookup of variable
labels - handy for spot-checking a single candidate variable (e.g. does
this survey carry the fever item h22, or the ACT items
ml13*?):
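
The output shape of list_dhs_var_labels() is not shown here, so the sketch below illustrates the same kind of spot-check in base R. It assumes variable labels are stored in each column's "label" attribute, as haven does when importing Stata files. The helper var_labels(), the toy data, and the label texts are hypothetical stand-ins, not the package's function or real survey labels.

```r
# Toy KR-like data frame; the labels below are illustrative, not copied from a survey
kr_toy <- data.frame(h22 = c(0, 1), ml13e = c(0, 1), v001 = c(1, 2))
attr(kr_toy$h22, "label")   <- "Fever in last two weeks (illustrative)"
attr(kr_toy$ml13e, "label") <- "ACT taken for fever (illustrative)"

# Hypothetical helper: one row per column, with its label (NA when unlabelled)
var_labels <- function(df) {
  labs <- vapply(df, function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  }, character(1))
  data.frame(variable = names(df), label = unname(labs),
             stringsAsFactors = FALSE)
}

labels_toy <- var_labels(kr_toy)

# Spot-check the fever item and any ACT items (ml13a, ml13b, ...)
has_fever_item <- "h22" %in% labels_toy$variable
act_items <- labels_toy[grepl("^ml13", labels_toy$variable), ]

has_fever_item
act_items
```

With a real recode, the same two checks can be run on whatever table the package lookup returns.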
A robust pattern - used in the example script - is to build the dictionary for every recode of every survey up front and save it, so the team can audit variable availability across surveys before any indicator is computed:
recodes <- list(GE = "GE", PR = "PR", HR = "HR", KR = "KR", IR = "IR")
survey_dictionary <- purrr::imap_dfr(recodes, function(ft, nm) {
dat <- dhs_read(path = path_dhs_parquet, file_type = ft,
survey_type = "DHS", country_code = country_iso2,
survey_year = 2017)
if (is.null(dat) || nrow(dat) == 0) return(NULL)
make_dhs_raw_dictionary(dplyr::slice(dat, 1:2)) |>
dplyr::mutate(recode = nm, .before = 1)
})

The package also ships curated dictionaries
describing the indicators (not just the raw variables) -
dhs_dictionary() for the unified view across all domains,
and per-domain helpers (itn_dictionary(),
act_dictionary(), …). See Methodology & conventions.
Step 1: read the recodes
dhs_read() reads a single survey recode from the
partitioned parquet archive, preserving value labels and survey-specific
variables:
ge <- dhs_read(path = path_dhs_parquet, file_type = "GE",
survey_type = "DHS", country_code = country_iso2,
survey_year = 2017)
pr <- dhs_read(path = path_dhs_parquet, file_type = "PR", ...)
hr <- dhs_read(path = path_dhs_parquet, file_type = "HR", ...)
kr <- dhs_read(path = path_dhs_parquet, file_type = "KR", ...)
ir <- dhs_read(path = path_dhs_parquet, file_type = "IR", ...)

Recodes map to indicator families as follows:
| Recode | Contents | Used by |
|---|---|---|
| GE | GPS / cluster coordinates | survey discovery, cluster geolocation |
| HR | Household | ITN ownership/access, IRS, wealth |
| PR | Person (household roster) | PfPR, anemia, ITN use |
| KR | Children (under 5) | fever, care-seeking, testing, ACT, EPI, SMC, U5MR |
| IR | Women (15-49) | ANC, IPTp |
Step 2: compute indicators
Every calc_*_dhs() function takes the recode(s) it needs
plus the GPS recode (gps_data), the admin
shapefile, and the admin_level(s) to
return:
fever <- calc_fever_dhs(
dhs_kr = kr, gps_data = ge,
shapefile = shp_admin, admin_level = c("adm0", "adm1")
)
itn <- calc_itn_dhs(
dhs_hr = hr, dhs_pr = pr, gps_data = ge,
shapefile = shp_admin, admin_level = c("adm0", "adm1")
)
pfpr <- calc_pfpr_dhs(dhs_pr = pr, gps_data = ge, shapefile = shp_admin)
csb <- calc_csb_dhs(dhs_kr = kr, gps_data = ge, shapefile = shp_admin)
act <- calc_act_dhs(dhs_kr = kr, gps_data = ge, shapefile = shp_admin)
cm <- calc_case_management_dhs(dhs_kr = kr, gps_data = ge, shapefile = shp_admin)
iptp <- calc_iptp_dhs(dhs_ir = ir, gps_data = ge, shapefile = shp_admin)
wealth <- calc_wealth_dhs(dhs_hr = hr, gps_data = ge, shapefile = shp_admin)
# A few take extra arguments:
anemia <- calc_severe_anemia_dhs(dhs_pr = pr, gps_data = ge,
shapefile = shp_admin,
altitude_adjusted = FALSE)
u5mr <- calc_u5mr_dhs(dhs_kr = kr, period_years = 5,
gps_data = ge, shapefile = shp_admin)

Keep direct survey estimates at adm0/adm1. DHS/MIS
sample designs are stratified by region and the published weights
(v005/hv005) are representative at adm0 and
adm1 only. Going to adm2 breaks that design assumption: clusters per
adm2 are sparse, the weights were not built to be representative within
an adm2 unit, and direct estimates carry unstable variance and wide CIs.
For sub-regional estimates use the MBG
pipeline instead of direct survey aggregation.
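
The sparsity argument is easy to check before aggregating. The base R sketch below counts clusters per admin unit on a made-up cluster table; the column names, the unit names, and the threshold of 3 clusters are arbitrary illustrations, not package defaults or DHS guidance.

```r
# Toy cluster table (made-up): one row per survey cluster with its admin units
clusters <- data.frame(
  cluster_id = 1:12,
  adm1 = rep(c("Region A", "Region B"), each = 6),
  adm2 = c("A1", "A1", "A1", "A2", "A2", "A3",
           "B1", "B1", "B1", "B1", "B2", "B3")
)

# Number of clusters available at each admin level
n_adm1 <- table(clusters$adm1)
n_adm2 <- table(clusters$adm2)

# Flag adm2 units with too few clusters (illustrative threshold only)
min_clusters <- 3
sparse_adm2 <- names(n_adm2)[n_adm2 < min_clusters]

n_adm1
n_adm2
sparse_adm2
```

In this toy example every adm1 unit has six clusters, while most adm2 units have one or two, which is the situation in which direct estimates become unstable.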
Indicator coverage
| Domain | Key indicators | Recode | Function |
|---|---|---|---|
| Parasite prevalence | PfPR by RDT/microscopy; age groups (u5, 5-10, u10, 2-10) | PR | calc_pfpr_dhs() |
| ITN | Ownership, access, use (all ages + u5/pregnant/age bands), use-if-access | HR, PR | calc_itn_dhs() |
| IRS | Household spraying coverage | HR | calc_irs_dhs() |
| Fever | Fever prevalence in children under 5 | KR | calc_fever_dhs() |
| Care-seeking | By sector (public, private, CHW, pharmacy, trained, none) | KR | calc_csb_dhs() |
| Malaria testing | RDT/microscopy testing among febrile children | KR | calc_malaria_dx_dhs() |
| Antimalarials | Any antimalarial treatment among febrile U5 | KR | calc_antimalarial_dhs() |
| ACT treatment | ACT receipt by source; ACT among antimalarials / care seekers | KR | calc_act_dhs() |
| Case management | Effective coverage (fever → care → test → treat) | KR | calc_case_management_dhs() |
| ANC | Antenatal care visits (1+/2+/3+/4+/8+) | IR | calc_anc_dhs() |
| IPTp | Intermittent preventive treatment doses (1+/2+/3+/4+) | IR | calc_iptp_dhs() |
| EPI vaccines | BCG, DPT, polio, measles, penta, PCV, rota, IPV, HepB, YF, malaria, fully/zero-dose | KR | calc_epi_dhs() |
| Under-5 mortality | U5MR per 1,000 live births (via DHS.rates) | KR | calc_u5mr_dhs() |
| Anemia | Any, moderate+, severe (children 6-59 months) | PR | calc_severe_anemia_dhs() |
| SMC | Seasonal malaria chemoprevention coverage | KR | calc_smc_dhs() |
| Wealth | Quintile distribution; Gini coefficient (Brown formula) | HR | calc_wealth_dhs() |
Each survey indicator has a spatial counterpart
(calc_*_mbg() / prep_*_mbg()) - see Spatial modeling (MBG).
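
The wealth row of the coverage table mentions the Brown formula for the Gini coefficient. As a reminder of what that formula computes, with X_k the cumulative population share and Y_k the cumulative wealth share, G = |1 - sum over k of (X_k - X_{k-1}) (Y_k + Y_{k-1})|. The sketch below is a generic base R version on made-up group shares; gini_brown() is a hypothetical helper, and how calc_wealth_dhs() prepares its inputs is not described here.

```r
# Hypothetical helper: Gini coefficient via the Brown formula
# pop_share   = population share of each group
# value_share = share of total wealth held by each group
gini_brown <- function(pop_share, value_share) {
  stopifnot(length(pop_share) == length(value_share))
  # Order groups from poorest to richest per capita
  ord <- order(value_share / pop_share)
  X <- c(0, cumsum(pop_share[ord]) / sum(pop_share))
  Y <- c(0, cumsum(value_share[ord]) / sum(value_share))
  abs(1 - sum(diff(X) * (Y[-1] + Y[-length(Y)])))
}

# Five equal-sized groups (quintiles), made-up wealth shares
g_equal <- gini_brown(rep(0.2, 5), rep(0.2, 5))      # perfectly equal
g_top   <- gini_brown(rep(0.2, 5), c(0, 0, 0, 0, 1)) # all wealth in top group

g_equal
g_top
```

With only five groups the maximum attainable value is 0.8 rather than 1, which is worth keeping in mind when comparing grouped Gini values across sources.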
Output shape
Every calc_*_dhs() returns a named list of tibbles, one
per admin level (adm0, adm1, …). Each row is
one indicator at one admin unit, with standardised columns:
survey_id, iso3, iso2, survey_type, survey_year,
adm0, [adm1], [adm2], type, geo_source,
point, ci_l, ci_u, numerator, denominator,
indicator, indicator_code, numerator_description,
denominator_description, denominator_code
Because the shape is identical across families, you can stack results and treat a whole survey - or many surveys - as one tidy table:
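
A minimal base R sketch of that stacking step, assuming each result is a named list with one data frame per admin level as described above. The toy results carry only a subset of the standard columns, and the survey_id format, indicator names, and values are made up; stack_level() is a hypothetical helper, not a package function.

```r
# Toy results mimicking the list-per-admin-level shape (made-up values)
fever <- list(
  adm0 = data.frame(survey_id = "TG2017DHS", adm0 = "Togo",
                    indicator = "fever", point = 0.25),
  adm1 = data.frame(survey_id = "TG2017DHS", adm0 = "Togo",
                    adm1 = c("Region A", "Region B"),
                    indicator = "fever", point = c(0.22, 0.30))
)
itn <- list(
  adm0 = data.frame(survey_id = "TG2017DHS", adm0 = "Togo",
                    indicator = "itn_use", point = 0.60),
  adm1 = data.frame(survey_id = "TG2017DHS", adm0 = "Togo",
                    adm1 = c("Region A", "Region B"),
                    indicator = "itn_use", point = c(0.55, 0.66))
)

# Hypothetical helper: row-bind one admin level across indicator families
stack_level <- function(results, level) {
  parts <- lapply(results, function(res) res[[level]])
  parts <- Filter(Negate(is.null), parts)  # drop families lacking this level
  if (length(parts) == 0) return(NULL)
  out <- do.call(rbind, unname(parts))
  rownames(out) <- NULL
  out
}

results  <- list(fever = fever, itn = itn)
adm0_all <- stack_level(results, "adm0")
adm1_all <- stack_level(results, "adm1")

adm1_all
```

Stacking per admin level keeps the column sets identical within each table, since adm1 only exists in the adm1 results.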
Running across many surveys
For a country with several DHS and MIS rounds, read each survey into
a bundle, run every indicator defensively (so a single missing recode or
variable doesn’t abort the run), and stack by admin level. The full
pattern - including the tryCatch wrappers that skip
indicators whose required variables are missing - is in inst/scripts/example_dhs_analysis.R.
This is exactly why Step 0 matters: the variable dictionary tells you in
advance which indicators a given survey can support.
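
The defensive pattern can be sketched in a few lines of base R. This is not the code from the example script; the two indicator functions below are hypothetical stand-ins that fail when a required variable is missing, and safe_calc() is an assumed wrapper name.

```r
# Hypothetical stand-ins for indicator functions (not the package's own)
calc_needs_h22 <- function(kr) {
  if (!"h22" %in% names(kr)) stop("variable h22 not found")
  list(adm0 = data.frame(indicator = "fever", point = mean(kr$h22 == 1)))
}
calc_needs_ml13e <- function(kr) {
  if (!"ml13e" %in% names(kr)) stop("variable ml13e not found")
  list(adm0 = data.frame(indicator = "act", point = mean(kr$ml13e == 1)))
}

# Wrapper: return NULL (with a message) instead of aborting the whole run
safe_calc <- function(name, fn, ...) {
  tryCatch(
    fn(...),
    error = function(e) {
      message("Skipping ", name, ": ", conditionMessage(e))
      NULL
    }
  )
}

kr_toy <- data.frame(h22 = c(1, 0, 0, 1))  # toy survey without ml13e

indicators <- list(fever = calc_needs_h22, act = calc_needs_ml13e)
results <- Map(function(fn, name) safe_calc(name, fn, kr_toy),
               indicators, names(indicators))

# Keep only the indicators this survey could support
results <- Filter(Negate(is.null), results)
names(results)
```

Logging the skipped indicator names, as the message() call does here, leaves a record that can be compared against the Step 0 dictionary.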
See also
- Spatial modeling (MBG) for sub-adm1 estimates
- Methodology & conventions for indicator specs and dictionaries
- Reference for every calc_*_dhs() signature
