Spatial work is where SNT analyses fail most quietly. Geometry might
be invalid, CRS might disagree, district names might have been renamed
between the shapefile and the DHIS2 export, or facility coordinates
might land in the ocean. sntutils provides a stack of
focused functions to find and fix each of these issues, then a small
mapping layer to plot the result.
For the methodology and conceptual background behind the steps in this article, please check the SNT Code Library:
- Working with shapefiles - source choice, vintages, projection notes.
- Shapefile management - the day-to-day workflow.
- Merging shapefiles with non-spatial data.
- Health-facility coordinates - validation, common errors.
- Master facility lists - matching DHIS2 to the MFL.
Downloading boundaries
For most SNT work the WHO geohub boundaries are the
canonical input. download_shapefile() pulls them by ISO3
code and admin level and returns an sf object ready to
use.
library(sntutils)
# pull adm2 boundaries for Sierra Leone and Togo
boundaries <- download_shapefile(
country_codes = c("SLE", "TGO"),
admin_level = "adm2",
latest = TRUE,
dest_path = "01_data/1.1_foundational/1.1a_admin_boundaries"
)
class(boundaries)
#> [1] "sf" "tbl_df" "tbl" "data.frame"
When dest_path is NULL, the file is downloaded to a session-scoped
cache and is not persisted across runs.
Validating admin geometries
Once a shapefile is in hand, validate_process_spatial()
runs a battery of checks - invalid geometry, mixed Z/M dimensions, wrong
CRS, missing admin codes, duplicated admin names - and (with
fix_issues = TRUE) attempts safe automatic repairs. It
always returns the same structure: a list with a cleaned sf
object (shp), a tibble of issues found (issues) and a summary.
boundaries_clean <- validate_process_spatial(
shp = boundaries,
name = "WHO geohub SLE and TGO adm2",
adm0_col = "adm0_name",
adm1_col = "adm1_name",
adm2_col = "adm2_name",
fix_issues = TRUE,
geometry_crs = 4326,
drop_z = TRUE
)
names(boundaries_clean)
#> [1] "shp" "issues" "summary"
The issues tibble is the audit trail - keep it next to
the shapefile on disk so reviewers can see what was changed.
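To make the checks above concrete, here is a minimal sketch of comparable checks done by hand with sf on a toy layer. It is not the implementation of validate_process_spatial(); the toy polygons and the objects toy, issues, toy_valid and toy_clean are made up for illustration, and the toy layer is assumed to arrive in UTM zone 29N so that the CRS check has something to flag.

```r
library(sf)

# toy layer (hypothetical), delivered in UTM zone 29N rather than WGS84
square <- st_polygon(list(rbind(
  c(500000, 900000), c(501000, 900000), c(501000, 901000),
  c(500000, 901000), c(500000, 900000)
)))
# self-intersecting "bowtie" ring: a classic invalid geometry
bowtie <- st_polygon(list(rbind(
  c(502000, 900000), c(503000, 901000), c(503000, 900000),
  c(502000, 901000), c(502000, 900000)
)))
toy <- st_sf(
  adm2_name = c("A", "B"),
  geometry = st_sfc(square, bowtie, crs = 32629)
)

# log what is wrong before changing anything
issues <- data.frame(
  adm2_name = toy$adm2_name,
  invalid_geometry = !st_is_valid(toy),
  wrong_crs = st_crs(toy) != st_crs(4326)
)

# repair in planar coordinates, then move to the target CRS
toy_valid <- toy |>
  st_zm(drop = TRUE) |> # drop Z/M dimensions if present
  st_make_valid()       # rebuild self-intersecting rings
toy_clean <- st_transform(toy_valid, 4326)

issues
```

Building the issues table before any repair is the design choice that matters here: once st_make_valid() has run, there is no record of which units were changed unless it was captured first.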
Validating facility coordinates
For facility data, validate_process_coordinates() checks
that lat/lon columns parse, sit within plausible bounds, have enough
decimal precision to be real readings (default
min_decimals = 3), and fall inside the country polygon if
you pass adm0_sf.
hf_clean <- validate_process_coordinates(
data = hf_raw,
name = "SLE master facility list",
lon_col = "longitude",
lat_col = "latitude",
adm0_sf = boundaries_clean$shp |>
dplyr::filter(adm0_name == "Sierra Leone"),
geometry_crs = 4326,
min_decimals = 3,
id_col = "facility_uid",
fix_issues = TRUE
)
hf_clean$summary
#> ● 4,213 input rows
#> ● 4,189 valid coordinates retained
#> ● 8 rows dropped (low precision)
#> ● 16 rows dropped (outside adm0)
The call returns an sf object of valid POINT geometry plus an
issues tibble of dropped or flagged rows.
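The checks described above (parsing, bounds, decimal precision, point-in-polygon) can be sketched by hand as follows. This is an illustration with made-up data, not the internals of validate_process_coordinates(); the adm0 box, the hf table, the n_decimals() helper and the status labels are all hypothetical.

```r
library(sf)

# hypothetical country polygon: a 2 x 2 degree box
adm0 <- st_sf(
  adm0_name = "Toyland",
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)))),
    crs = 4326
  )
)

# hypothetical facility list with typical problems
hf <- data.frame(
  facility_uid = c("f1", "f2", "f3", "f4"),
  longitude = c(0.51234, 1.5, 5.12345, NA),
  latitude = c(0.54321, 1.2, 0.54321, 1.11111)
)

# count decimal places of a numeric value; trailing zeros are already lost
# once a value is numeric, so counting on the raw text is safer when available
n_decimals <- function(x) {
  txt <- as.character(x)
  ifelse(grepl(".", txt, fixed = TRUE), nchar(sub("^[^.]*\\.", "", txt)), 0L)
}

parses <- !is.na(hf$longitude) & !is.na(hf$latitude)
in_range <- parses & abs(hf$longitude) <= 180 & abs(hf$latitude) <= 90
precise <- in_range &
  n_decimals(hf$longitude) >= 3 & n_decimals(hf$latitude) >= 3

# point-in-polygon test for the rows that survived the earlier checks
pts <- st_as_sf(hf[precise, ], coords = c("longitude", "latitude"), crs = 4326)
inside <- lengths(st_intersects(pts, adm0)) > 0
hf_valid <- pts[inside, ]

# one status per input row, so dropped rows stay auditable
hf$status <- "valid"
hf$status[!parses] <- "missing or unparseable"
hf$status[parses & !in_range] <- "out of bounds"
hf$status[in_range & !precise] <- "low precision"
hf$status[which(precise)[!inside]] <- "outside adm0"

hf[, c("facility_uid", "status")]
```

Running the cheap checks first means the point-in-polygon step only sees rows that could plausibly be real readings.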
Crosswalking shapefile vintages
When a country redistricts (Sierra Leone 2017, Togo 2019, several
recent DRC changes), historical surveillance data is keyed to the old
boundaries and new analyses to the new.
crosswalk_shapefiles_sf() computes the area-weighted
overlap between two sf layers so we can reallocate
indicators from old units to new ones, or the other way round.
xwalk <- crosswalk_shapefiles_sf(
old_sf = sle_adm2_2014,
new_sf = sle_adm2_2022,
level = "adm2",
old_suffix = "_old",
min_weight = 0.01, # drop slivers <1%
area_crs = 32629, # UTM zone 29N: metric, but conformal rather than equal-area
verbose = TRUE
)
xwalk |> dplyr::select(adm2_old, adm2_name, weight, primary) |> head()
#> # A tibble: 6 × 4
#> adm2_old adm2_name weight primary
#> <chr> <chr> <dbl> <lgl>
#> 1 Bo District Bo District 0.998 TRUE
#> 2 Bo District Bo City 0.002 FALSE
#> 3 Bonthe Bonthe District 1.000 TRUE
#> 4 …
To push aggregate counts from old units to new, multiply each count by
weight, then dplyr::group_by(adm2_name) and sum with dplyr::summarise().
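The two steps described above, computing area weights and then applying them to counts, can be sketched with sf and dplyr on toy rectangles. This is an illustration under stated assumptions, not the code behind crosswalk_shapefiles_sf(): the rect() helper, the toy layers and the cases_old table are hypothetical, and the weight is taken to be the share of each old unit's area that falls in each new unit.

```r
library(sf)

# helper (hypothetical) to build an axis-aligned rectangle in metres
rect <- function(xmin, xmax, ymin = 0, ymax = 1000) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# old vintage: two districts; new vintage: "Old A" has been split in two
old_sf <- st_sf(
  adm2_old = c("Old A", "Old B"),
  geometry = st_sfc(rect(0, 4000), rect(4000, 6000), crs = 32629),
  agr = "constant"
)
new_sf <- st_sf(
  adm2_name = c("New A1", "New A2", "New B"),
  geometry = st_sfc(rect(0, 3000), rect(3000, 4000), rect(4000, 6000),
                    crs = 32629),
  agr = "constant"
)

# overlap pieces; units that only share a border yield zero-area pieces
pieces <- st_intersection(old_sf, new_sf)
pieces$area <- as.numeric(st_area(pieces))
old_area <- as.numeric(st_area(old_sf))
pieces$weight <- pieces$area / old_area[match(pieces$adm2_old, old_sf$adm2_old)]

# drop slivers below 1% and keep the crosswalk columns
xwalk <- st_drop_geometry(pieces)
xwalk <- xwalk[xwalk$weight >= 0.01, c("adm2_old", "adm2_name", "weight")]

# apply the weights to counts keyed to the old units
cases_old <- data.frame(adm2_old = c("Old A", "Old B"), cases = c(400, 100))
cases_new <- cases_old |>
  dplyr::inner_join(xwalk, by = "adm2_old") |>
  dplyr::mutate(cases = cases * weight) |>
  dplyr::group_by(adm2_name) |>
  dplyr::summarise(cases = sum(cases), .groups = "drop")

cases_new
```

Area weighting assumes events are spread evenly within each old unit, so it suits counts. Rates are better recomputed from reallocated numerators and denominators than weighted directly.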
Fuzzy-matching facilities
DHIS2 facility names rarely line up perfectly with the master
facility list. fuzzy_match_facilities() runs a staged
matching pipeline:
- exact match,
- match after standardize_names() normalisation,
- string-distance match using one or more methods (jw, lv, osa, …),
- optional interactive picker for unresolved rows,
and returns a tibble of best matches plus diagnostics by stage.
matches <- fuzzy_match_facilities(
target_df = dhis2_facilities, # what we're cleaning
lookup_df = mfl_facilities, # the reference list
admin_cols = c("adm1", "adm2", "adm3"),
hf_col_name = "hf",
uid_col = "hf_uid",
fuzzy_methods = c("jw", "osa"),
fuzzy_threshold = 95,
match_interactivity = TRUE,
save_path = "01_data/1.1_foundational/1.1b_health_facilities/processed"
)
matches$results |>
dplyr::count(match_status)
#> # A tibble: 3 × 2
#> match_status n
#> <chr> <int>
#> 1 high 4012
#> 2 medium 147
#> 3 low 54
calculate_match_stats() summarises the same results by
method so we can compare matching strategies side by side.
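The staged idea (exact, then normalised, then string distance) can be sketched in base R as below. This is an illustration only, not how fuzzy_match_facilities() is implemented. The facility names are invented, normalise() is a hypothetical stand-in for standardize_names(), and similarity is Levenshtein-based via utils::adist() rather than the jw or osa methods. The sketch also skips matching within admin units and the interactive picker.

```r
# hypothetical inputs
target <- c("Bo Govt Hospital", "KENEMA GOVT. HOSPITAL",
            "Makeni Goverment Hospital", "Unknown Post")
lookup <- c("Bo Govt Hospital", "Kenema Govt Hospital",
            "Makeni Government Hospital")
threshold <- 95 # minimum similarity (0-100) accepted at the fuzzy stage

# stand-in normaliser: upper case, punctuation to spaces, squeeze whitespace
normalise <- function(x) {
  x <- gsub("[[:punct:]]", " ", toupper(x))
  trimws(gsub("\\s+", " ", x))
}

res <- data.frame(target = target, matched = NA_character_,
                  stage = NA_character_, score = NA_real_)
record <- function(res, rows, matched, stage, score) {
  res$matched[rows] <- matched
  res$stage[rows] <- stage
  res$score[rows] <- score
  res
}

# stage 1: exact match
idx <- match(target, lookup)
rows <- which(!is.na(idx))
res <- record(res, rows, lookup[idx[rows]], "exact", 100)

# stage 2: match after normalisation, only for rows still unresolved
todo <- which(is.na(res$matched))
idx <- match(normalise(target[todo]), normalise(lookup))
hit <- !is.na(idx)
res <- record(res, todo[hit], lookup[idx[hit]], "normalised", 100)

# stage 3: string distance, scaled to a 0-100 similarity
todo <- which(is.na(res$matched))
a <- normalise(target[todo])
b <- normalise(lookup)
sim <- 100 * (1 - adist(a, b) / outer(nchar(a), nchar(b), pmax))
best <- max.col(sim, ties.method = "first")
best_sim <- sim[cbind(seq_along(todo), best)]
keep <- best_sim >= threshold
res <- record(res, todo[keep], lookup[best[keep]], "fuzzy",
              round(best_sim[keep], 1))

res$stage[is.na(res$matched)] <- "unmatched"
res
```

Each stage only looks at rows the previous stage left unresolved, so a cheap exact hit is never overwritten by a fuzzier one.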
Renaming DHIS2 columns: dhis2_map()
dhis2_map() renames a DHIS2 export’s columns using a
name-mapping dictionary, so we can keep the upstream column labels
intact on disk and only remap to SNT names when we load the data. It’s a
small but critical step in every DHIS2 import workflow.
Used in: the DHIS2 import and preprocessing chapter of the SNT Code Library, which walks through where this step fits in the end-to-end DHIS2 cleaning pipeline.
dhis2_map() is the helper that performs the rename step described there.
mapped <- dhis2_map(
data = sl_dhis2_raw,
dict = dhis2_label_lookup,
new_col = "snt_name",
old_col = "dhis2_label",
drop_unmatched = FALSE
)
Pair it with check_snt_var() after the rename to confirm
the resulting column names match the canonical SNT schema (see Read
& clean).
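The idea of a dictionary-driven rename can be sketched in base R as follows. This is only an illustration of the technique, not the implementation of dhis2_map(); the raw table, the dictionary contents and the SNT-style names (hf, conf_u5) are invented for the example.

```r
# hypothetical raw export with upstream DHIS2 labels as column names
raw <- data.frame(
  "Organisation unit" = c("Bo Govt Hospital", "Kenema Govt Hospital"),
  "Malaria confirmed <5" = c(12, 7),
  "Comments" = c("", "late report"),
  check.names = FALSE
)

# hypothetical dictionary: one row per label we know how to rename
dict <- data.frame(
  dhis2_label = c("Organisation unit", "Malaria confirmed <5"),
  snt_name = c("hf", "conf_u5")
)

# look up every column name in the dictionary
idx <- match(names(raw), dict$dhis2_label)
unmatched <- names(raw)[is.na(idx)]

# rename matched columns; unmatched ones are kept under their original label
mapped <- raw
names(mapped)[!is.na(idx)] <- dict$snt_name[idx[!is.na(idx)]]

names(mapped)
unmatched
```

Reporting the unmatched labels is worth the extra line: a DHIS2 data element renamed upstream otherwise passes through silently under its old label.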
Drawing maps
Categorical fills
For administrative reference maps and any other categorical layer:
plot_admin_map_distinct(
sf_data = boundaries_clean$shp,
fill_col = "adm1_name",
title = "Sierra Leone - districts coloured by region",
palette = "ahadi_main"
)
Available palettes are listed via list_palettes(); pull
a specific palette with get_palette().
Faceted choropleths
When the map needs to be faceted by year, indicator, scenario or intervention, these two helpers produce consistent small multiples with the SNT plotting defaults baked in:
# bins - discrete legend, good for incidence categories
facetted_map_bins(
data = incidence_long,
sf_data = boundaries_clean$shp,
facet_col = "year",
fill_col = "incidence_per_1000",
bins = c(0, 10, 50, 100, 250, 500, Inf),
title = "Annual malaria incidence per 1,000 - Sierra Leone"
)
# gradient - continuous legend, good for reporting completeness
facetted_map_gradient(
data = reporting_long,
sf_data = boundaries_clean$shp,
facet_col = "year_mon",
fill_col = "reprate",
limits = c(0, 1),
title = "Monthly reporting completeness"
)
Both functions accept target_language so legend titles
and labels can be translated automatically (see the Project setup and utilities
article).
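To see how a bins vector like the one above turns continuous values into legend categories, here is a small base R sketch using cut(). Whether facetted_map_bins() treats intervals as left-closed or right-closed is not stated here, so the left-closed choice below (right = FALSE) is an assumption, and the incidence values are made up.

```r
# hypothetical incidence values per 1,000
incidence <- c(0, 4.2, 10, 75.5, 250, 1200)
bins <- c(0, 10, 50, 100, 250, 500, Inf)

# left-closed intervals: a value equal to a break falls in the upper bin
incidence_bin <- cut(
  incidence,
  breaks = bins,
  right = FALSE,
  include.lowest = TRUE
)

data.frame(incidence, incidence_bin)
table(incidence_bin)
```

It is worth checking where boundary values such as 10 or 250 land before comparing maps across years, since the closed side of each interval decides their category.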
A spatial pipeline, end to end
# 1. pull boundaries
sle_adm2 <- download_shapefile(
country_codes = "SLE",
admin_level = "adm2",
latest = TRUE
)
# 2. validate them
sle_adm2_clean <- validate_process_spatial(
shp = sle_adm2, name = "SLE adm2",
adm0_col = "adm0_name", adm1_col = "adm1_name", adm2_col = "adm2_name",
fix_issues = TRUE
)$shp
# 3. validate facility coordinates against the polygon
hf_geo <- validate_process_coordinates(
data = hf_raw,
lon_col = "longitude", lat_col = "latitude",
adm0_sf = sle_adm2_clean,
id_col = "facility_uid",
fix_issues = TRUE
)$shp
# 4. fuzzy-match DHIS2 facility names to the master list
match_results <- fuzzy_match_facilities(
target_df = sl_dhis2 |> dplyr::distinct(adm1, adm2, adm3, hf, hf_uid),
lookup_df = hf_geo,
admin_cols = c("adm1", "adm2", "adm3"),
hf_col_name = "hf",
uid_col = "hf_uid"
)$results
# 5. plot
plot_admin_map_distinct(
sf_data = sle_adm2_clean,
fill_col = "adm1_name",
title = "Sierra Leone - adm2 by region"
)
By the end of this pipeline we have a single, validated
sf boundary file, a validated sf
facility-points file, a high-confidence link between DHIS2 facility
names and the MFL, and a baseline map. Everything that follows -
reporting rates, climate extraction, population weighting - assumes this
foundation.
