Read & clean

Most of an SNT analysis is data wrangling: pulling spreadsheets out of DHIS2 exports, parsing dates that have been entered four different ways, matching district names that disagree across files, and producing a clean tibble the rest of the pipeline can trust. This article walks through the I/O and cleaning functions in sntutils in the order you’d actually use them on a real project.

For the methodology and conceptual background behind the steps in this article, please check the SNT Code Library:

DHIS2 import and preprocessing - what to expect from a DHIS2 export and the cleaning steps it needs.
DHIS2 extraction - how DHIS2 exports are produced upstream.
Routine-data context - background on what’s collected and how.
Final clean database - what the cleaned, joined dataset should look like.

Reading and writing

read() and write() are thin wrappers over rio plus the spatial ecosystem. They pick an importer or exporter based on the file extension, so we can stop thinking about whether something is readr::read_csv() or haven::read_dta() or sf::st_read().

library(sntutils)

# tabular formats
df_csv    <- read("path/to/file.csv", sep = ",")
df_excel  <- read("path/to/file.xlsx", sheet = 1)
df_stata  <- read("path/to/file.dta")
df_spss   <- read("path/to/file.sav")
df_rds    <- read("path/to/file.rds")

# spatial formats - returns an sf object
sf_geojson  <- read("path/to/file.geojson")
sf_shape    <- read("path/to/file.shp")

# write in the format implied by the extension
write(df, "path/to/export.csv")
write(df, "path/to/export.xlsx")
write(df, "path/to/export.dta")
write(sf_geojson, "path/to/export.geojson")

# write multiple tibbles as sheets in one Excel file
write(
  list(facilities = hf_df, cases = case_df, pop = pop_df),
  "path/to/multi_sheet.xlsx"
)

read() defers to the underlying reader for arguments, so you can pass sheet, sep, skip, col_types etc. exactly as you would directly.

Reproducible reads with metadata

For data we want to track across runs - DHIS2 exports, processed intermediate files, anything that will land in 01_data/ - use write_snt_data() and read_snt_data() instead. They write atomically, hash the payload, and drop a sidecar JSON next to the file with the schema, row count, write timestamp and hash. read_snt_data() looks for the file by logical name inside a directory, so the rest of the pipeline doesn’t care about the extension.

# write - atomic, hashed, sidecar metadata next to the file
write_snt_data(
  obj  = cleaned_dhis2,
  path = "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed",
  data_name = "sl_dhis2_clean",
  formats = c("rds", "csv")
)

# read - finds the most recent file with that logical name
sl_dhis2 <- read_snt_data(
  path = "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed",
  data_name = "sl_dhis2_clean"
)

The sidecar makes it easy to detect upstream changes: if the input hash hasn’t moved, downstream steps can be cached.

Standardising column names

DHIS2 exports often arrive with names like "OPD Total - Suspected Malaria (U5)". standardize_names() collapses these into something joinable:

standardize_names(c("Chp Kpalime-III", "Kpalimé CHP 3"))
#> [1] "CHP KPALIME 3" "CHP KPALIME 3"

# applied across a tibble
sl_dhis2 <- sl_dhis2 |>
  dplyr::mutate(
    dplyr::across(c(adm1, adm2, adm3, hf), standardize_names)
  )

The function lowercases or uppercases, strips accents and punctuation, converts Roman numerals to digits, and sorts internal tokens so "CHP Kpalime III" and "III CHP Kpalime" collapse to the same string. Set to_upper = FALSE for lowercase_with_underscores (handy for column names).

clean_filenames() does the equivalent job for file names - useful when unzipping country submissions where filenames carry spaces and accents.

Parsing dates

autoparse_dates() walks a list of candidate formats and detects which one applies per cell. It handles ISO-8601 with timezones, ambiguous dd/mm vs mm/dd cases, and Excel serial dates.

df <- data.frame(
  mixed_dates   = c("2023-10-03", "11.09.2022", "25-12-21 23:59",
                    "2020-08-15T00:00:00Z"),
  iso8601_dates = c("2021-03-20T00:01:00.513+01:00",
                    "2022-11-05T23:15:59.123Z")
)

parsed <- autoparse_dates(
  data          = df,
  date_cols     = c("mixed_dates", "iso8601_dates"),
  output_format = "%Y-%m-%d"
)

parsed$mixed_dates
#> [1] "2023-10-03" "2022-09-11" "2021-12-25" "2020-08-15"

parsed$iso8601_dates
#> [1] "2021-03-20" "2022-11-05"

To see the formats it knows about, inspect available_date_formats:

utils::head(available_date_formats, 8)
#> [1] "%Y-%m-%d"      "%Y/%m/%d"      "%d-%m-%Y"      "%d/%m/%Y"
#> [5] "%d.%m.%Y"      "%d-%b-%Y"      "%d %B %Y"      "%Y-%m-%dT%H:%M:%S"

Pass an additional formats = vector to extend the list for a country whose data carries an unusual encoding (some MIS exports use %d %m %Y with a literal space, for example).

Inferring variable types

DHIS2 / Excel pipelines almost always leave numerics as character and factor-like columns as freeform strings. The two helpers below recover the right types without losing the IDs that look numeric but aren’t.

`auto_parse_types()` - one call, sensible types

Walks each column and converts to the narrowest sensible type. It recognises ID-like columns by name suffix (*_id, *_uid, *_code, etc.) and keeps them as character so leading zeros survive.

df <- tibble::tibble(
  hf_uid    = c("001", "002", "003"),       # preserved as character
  sex       = c("M", "F", "F"),             # detected as factor
  age       = c("1", "2", "3"),             # parsed to integer
  weight_kg = c("12.4", "15.0", "9.8"),     # parsed to double
  dob       = c("2001-05-10", "2003-02-14", "2005-09-30"),  # to Date
  notes     = c("ok", "review", NA_character_)  # stays character
)

auto_parse_types(df)
#> # A tibble: 3 × 6
#>   hf_uid sex     age weight_kg dob        notes
#>   <chr>  <fct> <int>     <dbl> <date>     <chr>
#> 1 001    M         1      12.4 2001-05-10 ok
#> 2 002    F         2      15   2003-02-14 review
#> 3 003    F         3       9.8 2005-09-30 <NA>

Control the factor-vs-character cut-off via max_levels (default 50): columns with more unique values than that stay character, so a facility_name column with 1700 distinct values doesn’t get coerced to a 1700-level factor.

`detect_factors()` - the cardinality check, exposed

detect_factors() is the lower-level helper that powers the factor decision in auto_parse_types(). Call it directly when you want to see what would be factorised before committing to the conversion:

tibble::tibble(
  adm  = c("A", "B", "A", "C", "B"),       # 3 levels - good factor candidate
  code = c("01", "02", "03", "04", "05"),  # 5 unique - looks like an ID
  note = letters[1:5]                       # 5 unique - just text
) |>
  detect_factors(max_levels = 4)
#> # A tibble: 3 × 4
#>   variable n_unique would_factor reason
#>   <chr>       <int> <lgl>        <chr>
#> 1 adm             3 TRUE         under max_levels
#> 2 code            5 FALSE        looks like an ID column
#> 3 note            5 FALSE        over max_levels

The reason column makes it easy to audit type decisions when a country submission looks suspicious.

Harmonising admin names

The big one. prep_geonames() cleans hierarchical admin names by combining string-distance matching with an interactive picker and a durable cache of decisions. It supports up to six admin levels, accepts your own lookup table, and can also be run non-interactively for batch jobs.

# input data with messy spellings
dhis2_dummy <- data.frame(
  country  = c("ANGOLA", "UGA", "ZAMBIA", "KEN"),
  province = c("CABONDA", "TESO", "LUSAKA", "NAIROBY"),
  district = c("BALIZE", "BOKEDEA", "RAFUNSA", "KIBRA")
)

# the reference table to match against
my_lookup <- data.frame(
  country  = c("Angola", "Uganda", "Zambia", "Kenya"),
  province = c("Cabinda", "Teso",   "Lusaka", "Nairobi"),
  district = c("Belize",  "Bukedea", "Rufunsa", "Kibera")
)

cleaned <- prep_geonames(
  target_df  = dhis2_dummy,
  lookup_df  = my_lookup,
  level0     = "country",
  level1     = "province",
  level2     = "district",
  interactive = TRUE
)

If no lookup_df is given, prep_geonames() falls back to WHO geonames shipped with the package. Decisions are persisted to disk, so a second run on the same data is instant.

A demo of the interactive flow is in the README on GitHub.

Dictionaries: producing one, and using the curated SNT schema

Two flavours of dictionary live in sntutils. Both come up on every country project, so they’re worth knowing well.

The curated SNT variable dictionary

The package ships with a canonical schema of the SNT variable names that the rest of sntutils expects (conf, pres, test, allout, maltreat_u5, …). snt_data_dict() returns this curated dictionary as a tibble, with bilingual labels and a category tag for each variable. This is the source of truth for variable naming across country teams.

snt_data_dict() |>
  dplyr::select(variable, category, label_en, label_fr) |>
  dplyr::filter(category == "case_management") |>
  utils::head()
#> # A tibble: 6 × 4
#>   variable    category         label_en                       label_fr
#>   <chr>       <chr>            <chr>                          <chr>
#> 1 allout      case_management  All-cause outpatient visits    Consultations externes ...
#> 2 susp        case_management  Suspected malaria cases        Cas suspects de paludisme
#> 3 test        case_management  Malaria tests performed        Tests de paludisme ...
#> 4 conf        case_management  Confirmed malaria cases        Cas confirmés
#> 5 pres        case_management  Presumed malaria cases         Cas présumés
#> 6 maltreat    case_management  Malaria cases treated          Cas traités

check_snt_var() validates one or more variable names against the schema, with fuzzy matching when names are close-but-not-exact:

check_snt_var("maltreat_u5")
#> ✔ maltreat_u5 - Malaria cases treated, under 5

check_snt_var("conf_uner5")     # typo
#> ✖ Unknown SNT variable: conf_uner5
#> ℹ Did you mean: conf_u5?

Use it as a sanity check at the start of every analysis, before joining DHIS2 to anything else.

Generate a dictionary for your own dataset

Once a dataset is clean, build_dictionary() produces a per-variable summary tibble - name, type, number of unique values, value range or top levels, and (where they’re known) the English label from the SNT schema. This is the artefact we hand to country teams as documentation for any processed dataset.

dd <- build_dictionary(sl_dhis2)

dd |>
  dplyr::select(variable, type, n_unique, label_en) |>
  utils::head()
#> # A tibble: 6 × 4
#>   variable type        n_unique label_en
#>   <chr>    <chr>          <int> <chr>
#> 1 adm0     factor             1 Country (adm0)
#> 2 adm1     factor             5 Region (adm1)
#> 3 adm2     factor            17 District (adm2)
#> 4 adm3     factor           191 Chiefdom (adm3)
#> 5 hf       character       1702 Health facility name
#> 6 conf     numeric            - Confirmed malaria cases

The output drops cleanly into a multi-sheet Excel via write(), so we can ship a data_dictionary.xlsx alongside every processed dataset without writing it by hand.

clear_snt_cache() resets the in-memory cache after editing the underlying labels CSV - call it if you’ve updated the schema mid-session.

A clean pipeline, end to end

Putting it all together - the cleaning step from the Get started example, but with explicit intermediate writes so we can audit each stage:

raw <- read("01_data/1.2_epidemiology/1.2a_routine_surveillance/raw/sl_dhis2.csv")

clean <- raw |>
  standardize_names() |>             # tidy column names
  autoparse_dates(date_cols = "date") |>   # normalise dates
  auto_parse_types() |>              # infer types
  dplyr::mutate(
    hf_uid = vdigest(paste0(adm1, adm2, hf), algo = "xxhash32")
  )

write_snt_data(
  obj       = clean,
  path      = "01_data/1.2_epidemiology/1.2a_routine_surveillance/processed",
  data_name = "sl_dhis2_clean",
  formats   = c("rds", "csv")
)

From here, downstream articles take this clean object and turn it into reporting rates, consistency checks, maps, and climate-adjusted indicators.