The other articles cover the headline analyses. This one rounds up
the day-to-day plumbing - folder scaffolding, paths, translation,
hashing, image compression, and the small numerical helpers - that makes
the rest of sntutils feel coherent.
Project structure
AHADI projects follow a fixed top-level layout so that scripts written for one country drop straight into another. Two functions build it.
Just the data tree
create_data_structure() creates only the data tree under 01_data/. Each domain
folder has raw/ and processed/ subfolders.
library(sntutils)
create_data_structure(base_path = ".")
01_data/
├── 1.1_foundational/
│   ├── 1.1a_admin_boundaries/
│   ├── 1.1b_physical_features/
│   ├── 1.1c_health_facilities/
│   ├── 1.1d_community_health_workers/
│   ├── 1.1e_population/
│   │   ├── 1.1ei_national/
│   │   ├── 1.1eii_worldpop_rasters/
│   │   └── 1.1eiii_displaced_pop/
│   └── 1.1f_cache_files/
├── 1.2_epidemiology/
│   ├── 1.2a_routine_surveillance/
│   ├── 1.2b_pfpr_estimates/
│   └── 1.2c_mortality_estimates/
├── 1.3_interventions/
│   ├── 1.3a_itns/
│   ├── 1.3b_iptp/
│   ├── 1.3c_smc/
│   ├── 1.3d_vap/
│   ├── 1.3e_anc/
│   └── 1.3f_irs/
├── 1.4_drug_efficacy_resistance/
├── 1.5_environment/
│   ├── 1.5a_climate/
│   ├── 1.5b_accessibility/
│   └── 1.5c_land_use/
├── 1.6_health_systems/
│   └── 1.6a_dhs/
├── 1.7_entomology/
├── 1.8_commodities/
├── 1.9_finance/
└── 1.10_final/
Every leaf folder (except 1.10_final/ and
1.1f_cache_files/) gets its own raw/ and
processed/ subfolders automatically.
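The raw/ and processed/ convention can be illustrated with base R alone. This is a minimal sketch, not the package's implementation: it builds two of the leaf folders from the tree above inside a temporary directory, and the helper name make_leaf() is hypothetical.

```r
# Hypothetical helper: create a leaf folder with raw/ and processed/ inside it.
make_leaf <- function(base, leaf) {
  for (sub in c("raw", "processed")) {
    dir.create(file.path(base, leaf, sub),
               recursive = TRUE, showWarnings = FALSE)
  }
  invisible(file.path(base, leaf))
}

# Work in a temporary directory so nothing in a real project is touched.
base <- file.path(tempdir(), "01_data")
make_leaf(base, file.path("1.2_epidemiology", "1.2a_routine_surveillance"))
make_leaf(base, file.path("1.3_interventions", "1.3a_itns"))

# List what was created, relative to the data root.
created <- list.dirs(base, full.names = FALSE, recursive = TRUE)
print(created)
```

Listing the directories afterwards is a quick way to confirm that a scaffold matches the expected layout before any data is downloaded into it.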
Full project skeleton
initialize_project_structure() builds the same data tree plus the surrounding scaffolding for scripts, outputs and reports - use this on the first day of a new project.
initialize_project_structure(base_path = "my_snt_project")
my_snt_project/
├── 01_data/                          # full hierarchical data tree from above
├── 02_scripts/
├── 03_outputs/
│   ├── 3.1_validation/               (figures/ + tables/)
│   ├── 3.2_intermediate_products/    (figures/ + tables/)
│   ├── 3.3_final_snt_outputs/        (figures/ + tables/)
│   └── 3.4_model/                    (figures/ + tables/)
├── 04_reports/
└── 05_metadata_docs/
Resolving paths
Once the folders exist, setup_project_paths() returns a
named list of absolute paths to every important location. It detects the
project root via here::here() / rprojroot,
with getwd() as a fallback. The keys are short and stable
so scripts don’t have to know where the project lives on disk.
paths <- setup_project_paths(quiet = TRUE)
paths$admin_shp
#> [1] "/Users/me/projects/sle-snt/01_data/1.1_foundational/1.1a_admin_boundaries"
paths$dhis2
#> [1] "/Users/me/projects/sle-snt/01_data/1.2_epidemiology/1.2a_routine_surveillance"
paths$val_fig
#> [1] "/Users/me/projects/sle-snt/03_outputs/validation/figures"
Every download, validation and processing function in the package accepts these paths directly, so a typical script reads:
paths <- setup_project_paths()
shp <- read(file.path(paths$admin_shp, "sle_adm2.geojson"))
dhis2 <- read_snt_data(paths$dhis2, "sl_dhis2_clean")
ahadi_path() is a related helper for navigating the
AHADI shared OneDrive - it locates a project by name across team
members’ synced roots, useful for cross-machine reproducibility.
clear_snt_cache() resets the in-memory caches used by
snt_data_dict() and related helpers; call it after editing
the dictionary CSV without restarting R.
Translation and localisation
Country teams switch between English, French and Portuguese on a
near-weekly basis. Every plot and label function in
sntutils accepts a target_language argument,
which routes through this small translation stack.
Translate a single string
translate_text() calls Google Translate (via the gtranslate package) once
per unique string and caches the response to a local JSON file. The next
call for the same (source, target, text) triple returns
instantly from cache.
translate_text("Reporting rate by district",
               target_language = "fr",
               source_language = "en")
#> [1] "Taux de rapportage par district"
translate_text("Malaria cases",
               target_language = "pt",
               cache_path = "~/translation_cache")
#> [1] "Casos de malária"
Translate a column
translate_text_vec() is a vectorised wrapper for translate_text(): it translates a
whole column in one call, with caching shared across the batch.
df <- tibble::tibble(
  label = c("Confirmed cases", "Presumed cases", "Tests performed")
)
df |>
  dplyr::mutate(label_es = translate_text_vec(label, target_language = "es"))
#> # A tibble: 3 × 2
#>   label           label_es
#>   <chr>           <chr>
#> 1 Confirmed cases Casos confirmados
#> 2 Presumed cases  Casos presuntos
#> 3 Tests performed Pruebas realizadas
Locale-aware month-year formatting
translate_yearmon() formats month-year labels in the target language for time-series axes.
dates <- seq(as.Date("2022-01-01"), as.Date("2022-03-01"), by = "month")
translate_yearmon(dates, language = "fr")
#> [1] "janv. 2022" "févr. 2022" "mars 2022"
translate_yearmon(dates, language = "es", format = "%B %Y")
#> [1] "enero 2022" "febrero 2022" "marzo 2022"
French malaria acronyms
Google Translate consistently butchers technical malaria acronyms
when asked to translate plot titles, labels and report text into French
- it emits “PIT” instead of TPI, the literal phrase
“moustiquaire imprégnée à longue durée d’action” instead of
MILDA, and so on. french_malaria_acronyms() is
the curated override list that every translation-aware function in
sntutils consults to keep the French outputs
publication-ready.
french_malaria_acronyms() |> utils::head()
#> # A tibble: 6 × 3
#>   en    fr    definition_fr
#>   <chr> <chr> <chr>
#> 1 ITN   MILDA Moustiquaire imprégnée à longue durée d'action
#> 2 IRS   AID   Aspersion intra-domiciliaire
#> 3 IPT   TPI   Traitement préventif intermittent
#> 4 IPTp  TPIp  Traitement préventif intermittent pendant la grossesse
#> 5 SMC   CPS   Chimioprévention du paludisme saisonnier
#> 6 RDT   TDR   Test de diagnostic rapide
This list is consumed automatically inside
translate_text(), reporting_rate_plot(),
consistency_check(), the faceted-map helpers and any other
function with a target_language argument - we don’t need to
call it directly, but it’s worth knowing it exists when a French
translation looks off and we want to extend or override the mapping for
a country team.
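To make the idea of an override concrete, here is a minimal base-R sketch of how a curated mapping could be applied to machine-translated text. It is an illustration under stated assumptions, not how sntutils applies the list internally: the function fix_acronyms() and the shape of the overrides vector are hypothetical, and the only mapping used is the "PIT" to TPI correction mentioned above.

```r
# Hypothetical override table: names are what the translator emits,
# values are the acronym the country team expects.
overrides <- c(PIT = "TPI")

# Hypothetical helper: replace each wrong acronym when it appears as a
# whole word, leaving longer words that merely contain it untouched.
fix_acronyms <- function(text, overrides) {
  for (wrong in names(overrides)) {
    pattern <- paste0("\\b", wrong, "\\b")
    text <- gsub(pattern, overrides[[wrong]], text, perl = TRUE)
  }
  text
}

fixed <- fix_acronyms(
  c("Couverture du PIT par district", "PITON"),
  overrides
)
print(fixed)
```

The whole-word match matters here: short acronyms are common substrings, so an unanchored replacement would corrupt unrelated words.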
Image compression
When SNT reports get assembled into PDFs or Word docs, embedded PNGs
dominate the file size. compress_png() wraps
pngquant (installed on demand) to shrink PNGs by ~70% with
no visible quality loss. Both reporting_rate_plot() and
consistency_check() call it under the hood when saving with
compress_image = TRUE.
# single file
compress_png(
  "path/to/large_image.png",
  output_path = "path/to/consistency_plot.png"
)
#> ── Compression Summary ──
#>
#> ✔ Successfully compressed: consistency_plot.png
#> ℹ Total compression: 200.21 KB (71.54% saved)
#> ℹ Excellent compression!
#>
#> ── File Size
#> Before compression: 279.87 KB
#> After compression: 79.66 KB
# whole directory
compress_png(
  "path/to/image_folder/",
  output_path = "path/to/compressed_folder/",
  verbose = TRUE
)
Hashing for stable IDs
vdigest() is a vectorised version of
digest::digest(). We use it constantly to mint stable IDs
for facility-month records, deduplicate across DHIS2 exports, and detect
content changes when caching expensive steps.
sl_dhis2 |>
  dplyr::distinct(adm3) |>
  dplyr::mutate(adm3_hash = vdigest(adm3)) |>
  utils::head()
#> # A tibble: 6 × 2
#>   adm3             adm3_hash
#>   <chr>            <chr>
#> 1 Bo City          c810b59ec12efb2ac8b5cc84f46857ce
#> 2 Kakua Chiefdom   27fd84f751fac150c2f8a8f42b71c3da
#> 3 Baoma Chiefdom   462ef3c87dc9b40b2ec2e0e0a54dd63e
#> 4 Valunia Chiefdom df394518e6987ed686d76e83a409f090
#> 5 Bagbwe Chiefdom  3aa7a61247e34ab397ff813fe520c8b7
#> 6 Wonde Chiefdom   196dc9792e2038b41411ec2afae37e61
Default algorithm is md5, but xxhash32 is
much faster for the long character vectors typical of SNT data and
produces shorter IDs.
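The difference in ID length can be checked directly with the digest package. The sketch below is an assumption-laden stand-in rather than vdigest() itself: hash_each() is a hypothetical helper that calls digest::digest() once per element, so its hashes are not guaranteed to match what vdigest() returns for the same input.

```r
library(digest)  # assumed to be installed

adm3 <- c("Bo City", "Kakua Chiefdom", "Baoma Chiefdom")

# Hypothetical helper: hash each element separately. vapply() guarantees
# exactly one character ID per input value.
hash_each <- function(x, algo) {
  vapply(x, digest::digest, character(1), algo = algo, USE.NAMES = FALSE)
}

md5_ids <- hash_each(adm3, "md5")
xxh_ids <- hash_each(adm3, "xxhash32")

nchar(md5_ids)  # 32 hex characters per ID
nchar(xxh_ids)  # at most 8 hex characters per ID
```

A 32-bit hash is shorter but has a far smaller ID space than md5, so it is worth checking for duplicated IDs before relying on it as a key for very large tables.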
Numeric formatting helpers
These are the small but frequently used numeric helpers that save boilerplate in every script.
# thousands separator, with configurable mark and decimals
big_mark(1234567.89)
#> [1] "1,234,567.89"
big_mark(c(1234.56, 7890123.45), decimals = 1, big_mark = " ")
#> [1] "1 234.6" "7 890 123.5"
# NA-safe wrappers
sum2(c(1, 2, NA, 4))
#> [1] 7
mean2(c(1, 2, NA, 4))
#> [1] 2.333333
median2(c(1, 2, NA, 4, 5))
#> [1] 3
Defensive numerics: safe_sum(), fallback_row_sum(), fallback_diff()
Routine surveillance data is full of partial rows, all-NA months, and
columns that arrived as character because Excel snuck a comma in
somewhere. Plain sum() and the - operator either propagate
NA everywhere or silently coerce. These three helpers solve
the same problem at three different scales and are what
correct_outliers(), impute_outlier_ma() and
the cascade reconciliation paths call internally. They’re equally useful
in your own scripts.
safe_sum() - sum a vector without surprises
Returns NA only when every value is missing; otherwise
sums the non-missing values and never silently coerces character input
to zero.
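The contrast with base R is easiest to see side by side. The following is a minimal sketch of the documented behaviour, not the package source: safe_sum_sketch() is a hypothetical stand-in, and rejecting non-numeric input with an error is an assumption about what "never silently coerces" means in practice.

```r
# Hypothetical stand-in for the documented behaviour of safe_sum().
safe_sum_sketch <- function(x) {
  # Assumption: non-numeric input is rejected rather than coerced.
  if (!is.numeric(x)) {
    stop("`x` must be numeric; convert character columns explicitly first.")
  }
  if (all(is.na(x))) {
    return(NA_real_)  # nothing observed, so the total is unknown
  }
  sum(x, na.rm = TRUE)
}

sum(c(NA, NA), na.rm = TRUE)      # 0: base R turns "all missing" into zero
safe_sum_sketch(c(NA_real_, NA))  # NA
safe_sum_sketch(c(1, 2, NA, 4))   # 7
```

Keeping an all-missing month as NA is what lets downstream reporting-rate and outlier checks distinguish "no report" from "zero cases".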
fallback_row_sum() - sum a set of columns row-wise
Sums any number of columns row-wise with an explicit
min_present floor: a row needs at least that many
non-missing columns or its total stays NA rather than
masquerading as zero.
dplyr::tibble(
  conf_u5   = c(10, NA,  0, NA),
  conf_5_14 = c( 5, 12, NA, NA),
  conf_ov15 = c(20, 18, NA, NA)
) |>
  dplyr::mutate(
    conf_total = fallback_row_sum(conf_u5, conf_5_14, conf_ov15,
                                  min_present = 2)
  )
#> # A tibble: 4 × 4
#>   conf_u5 conf_5_14 conf_ov15 conf_total
#>     <dbl>     <dbl>     <dbl>      <dbl>
#> 1      10         5        20         35
#> 2      NA        12        18         30  # 2 of 3 present - OK
#> 3       0        NA        NA         NA  # only 1 present - flagged
#> 4      NA        NA        NA         NA
Use .keep_zero_as_zero = FALSE if you want a row of
explicit zeros to also count as informative; the default treats
c(0, 0, 0) as a valid zero total.
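The min_present rule itself is only a few lines of base R. This sketch illustrates the rule as described above and is not the package implementation: row_sum_sketch() is a hypothetical name, and it deliberately ignores the .keep_zero_as_zero option.

```r
# Hypothetical stand-in: row-wise sum that stays NA unless enough
# columns are observed in that row.
row_sum_sketch <- function(..., min_present = 1) {
  m <- cbind(...)                     # one column per input vector
  n_present <- rowSums(!is.na(m))     # non-missing values per row
  total <- rowSums(m, na.rm = TRUE)   # sum of what is there
  total[n_present < min_present] <- NA_real_
  total
}

totals <- row_sum_sketch(
  c(10, NA,  0, NA),
  c( 5, 12, NA, NA),
  c(20, 18, NA, NA),
  min_present = 2
)
totals
# 35 30 NA NA
```

Counting the observed columns separately from summing them is the key step: rowSums(..., na.rm = TRUE) on its own would report 0 for the last two rows.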
fallback_diff() - subtraction that won’t go negative
conf - maltreat should be non-negative in a clean
cascade. When it’s not, you want either NA (mark for
review) or a clamped value. fallback_diff() does the latter by default,
with a configurable floor, and the former when minimum = NA.
fallback_diff(col1 = c(100, 80, 50), col2 = c(40, 100, NA))
#> [1] 60 0 NA # clamped at the `minimum = 0` floor
# stricter: NA out any negative or invalid subtraction
fallback_diff(c(100, 80, 50), c(40, 100, NA), minimum = NA)
#> [1] 60 NA NA
The same function powers the cascade-reconciliation step in
correct_outliers(): when an outlier is replaced, the
downstream cascade variable is recomputed via
fallback_diff() so the resulting row stays internally
consistent.
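The two modes shown above, clamping to a floor and blanking out negatives, can be sketched in base R as follows. This is an illustration of the behaviour in the examples, not the package source, and clamped_diff() is a hypothetical name.

```r
# Hypothetical stand-in for the two documented modes of fallback_diff().
clamped_diff <- function(x, y, minimum = 0) {
  d <- x - y
  if (is.na(minimum)) {
    # Strict mode: a negative difference is marked for review.
    d[!is.na(d) & d < 0] <- NA_real_
  } else {
    # Default mode: raise anything below the floor up to the floor.
    # pmax() keeps NA where either input was missing.
    d <- pmax(d, minimum)
  }
  d
}

clamped <- clamped_diff(c(100, 80, 50), c(40, 100, NA))
strict  <- clamped_diff(c(100, 80, 50), c(40, 100, NA), minimum = NA)
clamped  # 60  0 NA
strict   # 60 NA NA
```

Clamping keeps totals usable for plots, while the strict mode is the safer choice when the negative value itself is the signal worth investigating.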
Plot helpers
A handful of internal building blocks are exported because country teams reuse them in custom one-off charts:
- get_palette() / list_palettes() - AHADI-branded discrete and gradient palettes (ahadi_main, ahadi_warm, ahadi_cool, …).
- auto_bin() - choose bin edges for an indicator using Fisher-Jenks or quantile breaks, returning a labelled factor.
- detect_factors(), detect_time_pattern(), extract_time_components() - utilities that power the auto-parsers but are exposed for ad-hoc use.
- get_model(), generate_ir_plot(), run_resistance_trend() - fit and render incidence-rate / resistance-trend plots used in insecticide-resistance reports.
- prepare_plot_data(), get_pathway_vars() - internal data-shaping helpers that show up in custom reporting pipelines.
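For readers unfamiliar with quantile binning, the idea behind the quantile option of auto_bin() can be sketched with base R's quantile() and cut(). This is not auto_bin() itself, whose arguments are not shown in this article; the incidence values are made up for illustration, and Fisher-Jenks breaks (which need an extra package) are not covered.

```r
# Made-up indicator values, for illustration only.
incidence <- c(2, 5, 8, 12, 20, 35, 50, 80)

# Quartile edges; unique() guards against repeated edges in skewed data.
edges <- unique(quantile(incidence, probs = seq(0, 1, by = 0.25)))

# include.lowest = TRUE keeps the minimum value inside the first bin.
bins <- cut(incidence, breaks = edges, include.lowest = TRUE)

table(bins)  # two observations fall in each of the four bins
```

Quantile breaks put roughly equal numbers of units in each class, which suits choropleth maps where a few extreme districts would otherwise flatten the colour scale.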
Putting it together
A typical script header at AHADI ends up looking like this:
library(sntutils)
# 1. resolve where everything lives
paths <- setup_project_paths()
# 2. read with audit trail
sl_dhis2 <- read_snt_data(paths$dhis2, "sl_dhis2_clean")
# 3. mint stable IDs
sl_dhis2 <- sl_dhis2 |>
  dplyr::mutate(
    hf_uid = vdigest(paste0(adm1, adm2, hf), algo = "xxhash32"),
    record_id = vdigest(paste(hf_uid, year_mon), algo = "xxhash32")
  )
# 4. produce country-team-language outputs
reporting_rate_plot(
  data = sl_dhis2,
  vars_of_interest = "conf",
  x_var = "year_mon",
  y_var = "adm2",
  hf_col = "hf_uid",
  key_indicators = c("allout", "test", "treat", "conf", "pres"),
  target_language = "fr",
  compress_image = TRUE,
  plot_path = paths$final_fig
)
That’s a handful of calls for a country-month reporting plot with audit trail, stable IDs, French labels and a compressed PNG ready to drop into a PDF. The plumbing in this article is what keeps the analysis code that follows a header like this so short.
