The other articles cover the headline analyses. This one rounds up the day-to-day plumbing - folder scaffolding, paths, translation, hashing, image compression, and the small numerical helpers - that makes the rest of sntutils feel coherent.

Project structure

AHADI projects follow a fixed top-level layout so that scripts written for one country drop straight into another. Two functions build it.

Just the data tree

The first function creates only the data tree under 01_data/, with raw/ and processed/ subfolders inside the leaf folders.

01_data/
├── 1.1_foundational/
│   ├── 1.1a_admin_boundaries/
│   ├── 1.1b_physical_features/
│   ├── 1.1c_health_facilities/
│   ├── 1.1d_community_health_workers/
│   ├── 1.1e_population/
│   │   ├── 1.1ei_national/
│   │   ├── 1.1eii_worldpop_rasters/
│   │   └── 1.1eiii_displaced_pop/
│   └── 1.1f_cache_files/
├── 1.2_epidemiology/
│   ├── 1.2a_routine_surveillance/
│   ├── 1.2b_pfpr_estimates/
│   └── 1.2c_mortality_estimates/
├── 1.3_interventions/
│   ├── 1.3a_itns/
│   ├── 1.3b_iptp/
│   ├── 1.3c_smc/
│   ├── 1.3d_vap/
│   ├── 1.3e_anc/
│   └── 1.3f_irs/
├── 1.4_drug_efficacy_resistance/
├── 1.5_environment/
│   ├── 1.5a_climate/
│   ├── 1.5b_accessibility/
│   └── 1.5c_land_use/
├── 1.6_health_systems/
│   └── 1.6a_dhs/
├── 1.7_entomology/
├── 1.8_commodities/
├── 1.9_finance/
└── 1.10_final/

Every leaf folder (except 1.10_final/ and 1.1f_cache_files/) gets its own raw/ and processed/ subfolders automatically.
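The raw/ and processed/ rule above can be sketched in a few lines of base R. This is only an illustration of the idea, not the package's implementation: make_leaf_dirs() and its skip argument are hypothetical names, and the sketch writes into tempdir() so it leaves no trace in a real project.

```r
# Hypothetical helper (not part of sntutils): create each leaf folder and,
# unless it is listed in `skip`, add raw/ and processed/ beneath it.
make_leaf_dirs <- function(root, leaves, skip = character()) {
  for (leaf in leaves) {
    leaf_path <- file.path(root, leaf)
    dir.create(leaf_path, recursive = TRUE, showWarnings = FALSE)
    if (basename(leaf) %in% skip) next
    for (sub in c("raw", "processed")) {
      dir.create(file.path(leaf_path, sub), showWarnings = FALSE)
    }
  }
  invisible(root)
}

# Build a small slice of the tree in a temporary directory
root <- file.path(tempdir(), "01_data")
make_leaf_dirs(
  root,
  leaves = c(
    "1.1_foundational/1.1a_admin_boundaries",
    "1.1_foundational/1.1f_cache_files",
    "1.10_final"
  ),
  skip = c("1.1f_cache_files", "1.10_final")
)

# Inspect what was created
list.dirs(root, recursive = TRUE, full.names = FALSE)
```

Keeping the exceptions in a single skip vector means the rule stays visible in one place.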

Full project skeleton

The second function, initialize_project_structure(), builds the same data tree plus the surrounding scaffolding for scripts, outputs and reports. Use it on the first day of a new project.

initialize_project_structure(base_path = "my_snt_project")
my_snt_project/
├── 01_data/                       # full hierarchical data tree from above
├── 02_scripts/
├── 03_outputs/
│   ├── 3.1_validation/            (figures/ + tables/)
│   ├── 3.2_intermediate_products/ (figures/ + tables/)
│   ├── 3.3_final_snt_outputs/     (figures/ + tables/)
│   └── 3.4_model/                 (figures/ + tables/)
├── 04_reports/
└── 05_metadata_docs/

Resolving paths

Once the folders exist, setup_project_paths() returns a named list of absolute paths to every important location. It detects the project root via here::here() / rprojroot, with getwd() as a fallback. The keys are short and stable so scripts don’t have to know where the project lives on disk.

paths <- setup_project_paths(quiet = TRUE)

paths$admin_shp
#> [1] "/Users/me/projects/sle-snt/01_data/1.1_foundational/1.1a_admin_boundaries"

paths$dhis2
#> [1] "/Users/me/projects/sle-snt/01_data/1.2_epidemiology/1.2a_routine_surveillance"

paths$val_fig
#> [1] "/Users/me/projects/sle-snt/03_outputs/3.1_validation/figures"

Every download, validation and processing function in the package accepts these paths directly, so a typical script reads:

paths  <- setup_project_paths()
shp    <- read(file.path(paths$admin_shp, "sle_adm2.geojson"))
dhis2  <- read_snt_data(paths$dhis2, "sl_dhis2_clean")
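To make the root-detection-with-fallback idea concrete, here is a minimal base R sketch. It is not how setup_project_paths() is written (that relies on here / rprojroot). find_project_root() and build_paths() are hypothetical helpers, and the .Rproj marker is an assumption about what identifies a project root.

```r
# Hypothetical helper: climb from `start` until a folder containing a marker
# file is found; fall back to `start` itself when no marker exists.
find_project_root <- function(start = getwd(), marker = "\\.Rproj$") {
  start <- normalizePath(start, winslash = "/", mustWork = TRUE)
  current <- start
  repeat {
    if (length(list.files(current, pattern = marker)) > 0) {
      return(current)
    }
    parent <- dirname(current)
    if (identical(parent, current)) break  # reached the filesystem root
    current <- parent
  }
  start
}

# Hypothetical helper: short, stable keys mapped to absolute paths
build_paths <- function(root) {
  list(
    admin_shp = file.path(root, "01_data", "1.1_foundational",
                          "1.1a_admin_boundaries"),
    dhis2     = file.path(root, "01_data", "1.2_epidemiology",
                          "1.2a_routine_surveillance")
  )
}

# Demo: a throwaway project with a marker file and a nested scripts folder
proj <- file.path(tempdir(), "demo-snt")
dir.create(file.path(proj, "02_scripts"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(proj, "demo-snt.Rproj"))

root  <- find_project_root(file.path(proj, "02_scripts"))
paths <- build_paths(root)
paths$dhis2
```

Because the keys never change, a script that asks for paths$dhis2 keeps working no matter which folder it is launched from.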

ahadi_path() is a related helper for navigating the AHADI shared OneDrive - it locates a project by name across team members’ synced roots, useful for cross-machine reproducibility.

clear_snt_cache() resets the in-memory caches used by snt_data_dict() and related helpers; call it after editing the dictionary CSV without restarting R.

Translation and localisation

Country teams switch between English, French and Portuguese on a near-weekly basis. Every plot and label function in sntutils accepts a target_language argument, which routes through this small translation stack.

Translate a single string

translate_text() calls Google Translate (via the gtranslate package) once per unique string and caches the response to a local JSON file. The next call for the same (source, target, text) triple returns instantly from cache.

translate_text("Reporting rate by district",
               target_language = "fr",
               source_language = "en")
#> [1] "Taux de rapportage par district"

translate_text("Malaria cases",
               target_language = "pt",
               cache_path = "~/translation_cache")
#> [1] "Casos de malária"
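The caching behaviour is easy to picture with a small sketch. Everything here is illustrative: fake_translate() is a hypothetical stand-in for the remote call, and an in-memory environment replaces the JSON file, so the sketch runs offline and says nothing about the package's actual cache format.

```r
# In-memory cache standing in for the on-disk JSON file
cache <- new.env(parent = emptyenv())

# Hypothetical stand-in for the remote translation call; it counts how often
# it is invoked so we can see the cache working.
n_remote_calls <- 0
fake_translate <- function(text, target, source) {
  n_remote_calls <<- n_remote_calls + 1
  paste0("[", source, "->", target, "] ", text)
}

# Look up the (source, target, text) triple; only call out on a cache miss
cached_translate <- function(text, target, source = "en") {
  key <- paste(source, target, text, sep = "|")
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, fake_translate(text, target, source), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

first  <- cached_translate("Malaria cases", "pt")
second <- cached_translate("Malaria cases", "pt")  # served from the cache
n_remote_calls
```

Keying on the full triple matters: the same English string translated into French and into Portuguese must occupy two separate cache entries.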

Translate a column

translate_text_vec() is a vectorised wrapper for translate_text(): it translates a whole column in one call, with caching shared across the batch.

df <- tibble::tibble(
  label = c("Confirmed cases", "Presumed cases", "Tests performed")
)

df |>
  dplyr::mutate(label_es = translate_text_vec(label, target_language = "es"))
#> # A tibble: 3 × 2
#>   label           label_es
#>   <chr>           <chr>
#> 1 Confirmed cases Casos confirmados
#> 2 Presumed cases  Casos presuntos
#> 3 Tests performed Pruebas realizadas

Locale-aware month-year formatting

translate_yearmon() formats month-year labels in the target language for time-series axes.

dates <- seq(as.Date("2022-01-01"), as.Date("2022-03-01"), by = "month")

translate_yearmon(dates, language = "fr")
#> [1] "janv. 2022" "févr. 2022" "mars 2022"

translate_yearmon(dates, language = "es", format = "%B %Y")
#> [1] "enero 2022" "febrero 2022" "marzo 2022"
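One way to get labels like these without touching the system locale is a plain lookup table. The sketch below is an assumption about the general approach, not the package's code. format_yearmon_fr() is a hypothetical name, and the abbreviations after March follow common French usage rather than anything stated in this article.

```r
# French month abbreviations; the first three match the output shown above
fr_abbr <- c("janv.", "févr.", "mars", "avr.", "mai", "juin",
             "juil.", "août", "sept.", "oct.", "nov.", "déc.")

# Hypothetical helper: map each date's month number to its French label
format_yearmon_fr <- function(dates) {
  month_index <- as.integer(format(dates, "%m"))
  paste(fr_abbr[month_index], format(dates, "%Y"))
}

dates <- seq(as.Date("2022-01-01"), as.Date("2022-03-01"), by = "month")
format_yearmon_fr(dates)
```

A lookup table gives the same labels on every machine, whereas Sys.setlocale() depends on which locales happen to be installed.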

French malaria acronyms

Google Translate consistently butchers technical malaria acronyms when asked to translate plot titles, labels and report text into French - it emits “PIT” instead of TPI, the literal phrase “moustiquaire imprégnée à longue durée d’action” instead of MILDA, and so on. french_malaria_acronyms() is the curated override list that every translation-aware function in sntutils consults to keep the French outputs publication-ready.

french_malaria_acronyms() |> utils::head()
#> # A tibble: 6 × 3
#>   en       fr      definition_fr
#>   <chr>    <chr>   <chr>
#> 1 ITN      MILDA   Moustiquaire imprégnée à longue durée d'action
#> 2 IRS      AID     Aspersion intra-domiciliaire
#> 3 IPT      TPI     Traitement préventif intermittent
#> 4 IPTp     TPIp    Traitement préventif intermittent pendant la grossesse
#> 5 SMC      CPS     Chimioprévention du paludisme saisonnier
#> 6 RDT      TDR     Test de diagnostic rapide

This list is consumed automatically inside translate_text(), reporting_rate_plot(), consistency_check(), the faceted-map helpers and any other function with a target_language argument - we don’t need to call it directly, but it’s worth knowing it exists when a French translation looks off and we want to extend or override the mapping for a country team.
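For a feel of what an override step can look like, here is a minimal sketch that swaps English acronyms for their French equivalents using whole-word matching. How sntutils applies the list internally is not shown in this article, so treat apply_acronyms() as a hypothetical helper; the mapping itself is copied from the table above.

```r
# English -> French acronyms, taken from the table above
overrides <- c(ITN = "MILDA", IRS = "AID", IPT = "TPI",
               IPTp = "TPIp", SMC = "CPS", RDT = "TDR")

# Hypothetical helper: replace each acronym only where it appears as a whole
# word, so "IPT" does not clobber the first three letters of "IPTp".
apply_acronyms <- function(text, map) {
  for (en in names(map)) {
    text <- gsub(paste0("\\b", en, "\\b"), map[[en]], text, perl = TRUE)
  }
  text
}

apply_acronyms("IPTp and ITN coverage", overrides)
#> [1] "TPIp and MILDA coverage"
```

The word boundaries are the important design choice: without them, the shorter acronym would match inside the longer one.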

Image compression

When SNT reports get assembled into PDFs or Word docs, embedded PNGs dominate the file size. compress_png() wraps pngquant (installed on demand) to shrink PNGs, typically by around 70% with no visible quality loss. Both reporting_rate_plot() and consistency_check() call it under the hood when saving with compress_image = TRUE.

# single file
compress_png(
  "path/to/large_image.png",
  output_path = "path/to/consistency_plot.png"
)
#> ── Compression Summary ──
#>
#> ✔ Successfully compressed: consistency_plot.png
#> ℹ Total compression: 200.21 KB (71.54% saved)
#> ℹ Excellent compression!
#>
#> ── File Size
#> Before compression: 279.87 KB
#> After compression:   79.66 KB

# whole directory
compress_png(
  "path/to/image_folder/",
  output_path = "path/to/compressed_folder/",
  verbose     = TRUE
)

Hashing for stable IDs

vdigest() is a vectorised version of digest::digest(). We use it constantly to mint stable IDs for facility-month records, deduplicate across DHIS2 exports, and detect content changes when caching expensive steps.

sl_dhis2 |>
  dplyr::distinct(adm3) |>
  dplyr::mutate(adm3_hash = vdigest(adm3)) |>
  utils::head()
#> # A tibble: 6 × 2
#>   adm3             adm3_hash
#>   <chr>            <chr>
#> 1 Bo City          c810b59ec12efb2ac8b5cc84f46857ce
#> 2 Kakua Chiefdom   27fd84f751fac150c2f8a8f42b71c3da
#> 3 Baoma Chiefdom   462ef3c87dc9b40b2ec2e0e0a54dd63e
#> 4 Valunia Chiefdom df394518e6987ed686d76e83a409f090
#> 5 Bagbwe Chiefdom  3aa7a61247e34ab397ff813fe520c8b7
#> 6 Wonde Chiefdom   196dc9792e2038b41411ec2afae37e61

The default algorithm is md5, but xxhash32 is much faster for the long character vectors typical of SNT data and produces shorter IDs:

sl_dhis2 <- sl_dhis2 |>
  dplyr::mutate(
    hf_uid    = vdigest(paste0(adm1, adm2, hf), algo = "xxhash32"),
    record_id = vdigest(paste(hf_uid, year_mon),  algo = "xxhash32")
  )
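Why a vectorised wrapper is needed at all: digest::digest() hashes its whole input as one object, so a character vector yields a single hash rather than one per element. The sketch below shows the per-element idea under two assumptions: the digest package is installed, and serialize = FALSE is our choice. We have not checked that its hashes match what vdigest() returns, and hash_each() is a hypothetical name.

```r
# Hypothetical helper: hash every element separately (requires digest)
hash_each <- function(x, algo = "md5") {
  vapply(
    as.character(x),
    function(value) digest::digest(value, algo = algo, serialize = FALSE),
    character(1),
    USE.NAMES = FALSE
  )
}

ids <- hash_each(c("Bo City", "Kakua Chiefdom", "Bo City"))

length(ids)        # one hash per element
ids[1] == ids[3]   # identical inputs map to identical IDs
```

Determinism is what makes these usable as keys: the same facility name hashes to the same ID in every export and on every machine.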

Numeric formatting helpers

These are the small but frequently used numeric helpers that save boilerplate in every script.

# thousands separator, with configurable mark and decimals
big_mark(1234567.89)
#> [1] "1,234,567.89"

big_mark(c(1234.56, 7890123.45), decimals = 1, big_mark = " ")
#> [1] "1 234.6"     "7 890 123.5"

# NA-safe wrappers
sum2(c(1, 2, NA, 4))
#> [1] 7

mean2(c(1, 2, NA, 4))
#> [1] 2.333333

median2(c(1, 2, NA, 4, 5))
#> [1] 3
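For readers who want to see what helpers like these save us from typing, rough base R equivalents look like the sketch below. The names are hypothetical and the package versions may well handle more edge cases; this only shows the idea.

```r
# Hypothetical base R equivalents of the NA-safe wrappers
sum_na_rm    <- function(x) sum(x, na.rm = TRUE)
mean_na_rm   <- function(x) mean(x, na.rm = TRUE)
median_na_rm <- function(x) stats::median(x, na.rm = TRUE)

# Hypothetical thousands-separator formatter built on formatC()
format_big <- function(x, decimals = 2, mark = ",") {
  formatC(x, format = "f", digits = decimals, big.mark = mark)
}

sum_na_rm(c(1, 2, NA, 4))
#> [1] 7

format_big(1234567.89)
#> [1] "1,234,567.89"
```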

Defensive numerics: safe_sum(), fallback_row_sum(), fallback_diff()

Routine surveillance data is full of partial rows, all-NA months, and columns that arrived as character because Excel snuck a comma in somewhere. Plain sum() and the subtraction operator (-) either propagate NA everywhere or silently coerce. These three helpers solve the same problem at three different scales and are what correct_outliers(), impute_outlier_ma() and the cascade reconciliation paths call internally. They’re equally useful in your own scripts.

safe_sum() - sum a vector without surprises

Returns NA only when every value is missing; otherwise sums the non-missing values and never silently coerces character input to zero.

safe_sum(c(1, 2, NA, 4))
#> [1] 7

safe_sum(c(NA, NA, NA))
#> [1] NA            # NOT 0 - you wanted "everything missing", you got it
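The contrast with base R is the whole point, so here it is as a sketch. safe_sum_sketch() is a hypothetical reimplementation of the behaviour described above, not the package source. Raising an error on character input is our assumption about one reasonable way to avoid silent coercion.

```r
# Hypothetical sketch of the described behaviour
safe_sum_sketch <- function(x) {
  # refuse character input rather than coercing it (assumed design choice)
  if (is.character(x)) {
    stop("`x` must be numeric, not character", call. = FALSE)
  }
  # every value missing: report missing, not zero
  if (all(is.na(x))) {
    return(NA_real_)
  }
  sum(x, na.rm = TRUE)
}

sum(c(NA, NA, NA), na.rm = TRUE)   # base R turns "all missing" into 0
#> [1] 0

safe_sum_sketch(c(NA, NA, NA))     # the sketch keeps it missing
#> [1] NA

safe_sum_sketch(c(1, 2, NA, 4))
#> [1] 7
```

That zero from base R is exactly the value that ends up plotted as "no cases reported" when the truth is "no report received".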

fallback_row_sum() - sum a set of columns row-wise

Sums any number of columns row-wise with an explicit min_present floor: a row needs at least that many non-missing columns or its total stays NA rather than masquerading as zero.

dplyr::tibble(
  conf_u5  = c(10, NA, 0,  NA),
  conf_5_14 = c( 5, 12, NA, NA),
  conf_ov15 = c(20, 18, NA, NA)
) |>
  dplyr::mutate(
    conf_total = fallback_row_sum(conf_u5, conf_5_14, conf_ov15,
                                  min_present = 2)
  )
#> # A tibble: 4 × 4
#>   conf_u5 conf_5_14 conf_ov15 conf_total
#>     <dbl>     <dbl>     <dbl>      <dbl>
#> 1      10         5        20         35
#> 2      NA        12        18         30   # 2 of 3 present - OK
#> 3       0        NA        NA         NA   # only 1 present - flagged
#> 4      NA        NA        NA         NA

The default treats c(0, 0, 0) as a valid zero total rather than as missing. The .keep_zero_as_zero argument controls this handling of explicit zeros; set it to FALSE to switch the default behaviour off.
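The min_present rule can be sketched with rowSums(). This is a hypothetical reimplementation for illustration only; it covers the presence floor and deliberately leaves out the .keep_zero_as_zero handling.

```r
# Hypothetical sketch: row-wise sum that stays NA unless at least
# `min_present` of the inputs are non-missing in that row.
row_sum_sketch <- function(..., min_present = 1) {
  values    <- cbind(...)                      # one column per input vector
  n_present <- rowSums(!is.na(values))         # non-missing count per row
  totals    <- rowSums(values, na.rm = TRUE)   # sum of what is present
  totals[n_present < min_present] <- NA_real_  # too sparse: keep it missing
  totals
}

row_sum_sketch(
  c(10, NA, 0, NA),
  c(5, 12, NA, NA),
  c(20, 18, NA, NA),
  min_present = 2
)
#> [1] 35 30 NA NA
```

Counting the present values separately from summing them is what lets a genuine zero total be told apart from a row with nothing in it.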

fallback_diff() - subtraction that won’t go negative

conf - maltreat should be non-negative in a clean cascade. When it’s not, you want either NA (mark for review) or a clamped value. By default fallback_diff() does the latter, clamping at a configurable floor; setting minimum = NA gives the former.

fallback_diff(col1 = c(100, 80, 50), col2 = c(40, 100, NA))
#> [1] 60  0 NA          # clamped at the `minimum = 0` floor

# stricter: NA out any negative or invalid subtraction
fallback_diff(c(100, 80, 50), c(40, 100, NA), minimum = NA)
#> [1] 60 NA NA

The same function powers the cascade-reconciliation step in correct_outliers(): when an outlier is replaced, the downstream cascade variable is recomputed via fallback_diff() so the resulting row stays internally consistent.
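The clamping logic itself fits in a few lines. The sketch below is a hypothetical reimplementation that reproduces the two outputs shown above; the real function may treat missing inputs or other edge cases differently.

```r
# Hypothetical sketch: subtract, then replace negative results with `minimum`
diff_with_floor <- function(a, b, minimum = 0) {
  d <- a - b
  below <- !is.na(d) & d < 0   # negative results only; NA stays NA
  d[below] <- minimum
  d
}

diff_with_floor(c(100, 80, 50), c(40, 100, NA))
#> [1] 60  0 NA

diff_with_floor(c(100, 80, 50), c(40, 100, NA), minimum = NA)
#> [1] 60 NA NA
```

Using one minimum argument for both behaviours keeps the choice between "clamp" and "flag for review" explicit at the call site.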

Plot helpers

A handful of internal building blocks are exported because country teams reuse them in custom one-off charts.

Putting it together

A typical script header at AHADI ends up looking like this:

library(sntutils)

# 1. resolve where everything lives
paths <- setup_project_paths()

# 2. read with audit trail
sl_dhis2 <- read_snt_data(paths$dhis2, "sl_dhis2_clean")

# 3. mint stable IDs
sl_dhis2 <- sl_dhis2 |>
  dplyr::mutate(
    hf_uid    = vdigest(paste0(adm1, adm2, hf), algo = "xxhash32"),
    record_id = vdigest(paste(hf_uid, year_mon),  algo = "xxhash32")
  )

# 4. produce country-team-language outputs
reporting_rate_plot(
  data             = sl_dhis2,
  vars_of_interest = "conf",
  x_var            = "year_mon",
  y_var            = "adm2",
  hf_col           = "hf_uid",
  key_indicators   = c("allout", "test", "treat", "conf", "pres"),
  target_language  = "fr",
  compress_image   = TRUE,
  plot_path        = paths$final_fig
)

That’s four short steps for a country-month reporting plot with audit trail, stable IDs, French labels and a compressed PNG ready to drop into a PDF. The plumbing in this article is what keeps the analysis code that follows this header so short.