
Facility name matching across datasets (DHIS2 vs MFL)
Source: R/fuzzy_match_facilities.R
Orchestrates a multi-step matching pipeline between a target dataset
(e.g., DHIS2 facilities) and a lookup dataset (e.g., MFL).
Steps include exact match, standardized-name match, fuzzy match,
and interactive stratified match (using sntutils::prep_geonames()).
Returns consolidated results and QA summaries.
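The first two steps (exact match, then standardized-name match) can be sketched in base R. This is an illustration only, not the package's implementation: the toy data, the helper `standardize_name()`, the key builders, and the `matched_hf` column are hypothetical, and the standardization rules actually applied by fuzzy_match_facilities() may differ.

```r
# Hypothetical toy data; "hf" and "hf_uid" follow the defaults shown in Usage.
target_df <- data.frame(
  adm1 = c("North", "North", "South"),
  hf = c("Kabala Hospital", "St. Mary's  CHC", "Bo Clinic"),
  hf_uid = c("t1", "t2", "t3"),
  stringsAsFactors = FALSE
)
lookup_df <- data.frame(
  adm1 = c("North", "North", "South"),
  hf = c("Kabala Hospital", "ST MARYS CHC", "Bo Government Clinic"),
  stringsAsFactors = FALSE
)

# Hypothetical standardizer (an assumption, not the package's rules):
# lower-case, drop punctuation, collapse repeated whitespace.
standardize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Join keys combine the admin unit with the facility name.
key_exact <- function(df) paste(df$adm1, df$hf, sep = "|")
key_std <- function(df) {
  paste(standardize_name(df$adm1), standardize_name(df$hf), sep = "|")
}

# Step 1: exact match on the raw keys.
exact_idx <- match(key_exact(target_df), key_exact(lookup_df))
# Step 2: standardized-name match, used only where step 1 found nothing.
std_idx <- match(key_std(target_df), key_std(lookup_df))

target_df$match_method <- ifelse(
  !is.na(exact_idx), "exact",
  ifelse(!is.na(std_idx), "standardization", NA_character_)
)
target_df$matched_hf <- lookup_df$hf[ifelse(is.na(exact_idx), std_idx, exact_idx)]

print(target_df)
```

Rows still unmatched after these two steps (here, "Bo Clinic") are the ones a fuzzy or interactive step would then need to handle.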
Usage
fuzzy_match_facilities(
target_df,
lookup_df,
admin_cols = c("adm0", "adm1", "adm2"),
hf_col_name = "hf",
uid_col = "hf_uid",
steps = c(exact = TRUE, standardization = TRUE, fuzzy = TRUE, interactive = TRUE),
lookup_cols = character(),
match_interactivity = TRUE,
fuzzy_methods = c("jw", "lv"),
fuzzy_threshold = 95L,
score_exact = 100L,
score_standardization = 100L,
score_interactive = 95L,
status_cuts = c(high = 95, medium = 80, low = 70),
include_missing_name_rows = TRUE,
save_path = NULL,
matching_cache_path = NULL,
save_stem = "facility_matching",
summary_language = "en",
verbose = TRUE
)
Arguments
- target_df
Tibble/data.frame with facility names and admin columns.
- lookup_df
Tibble/data.frame with facility names and admin columns.
- admin_cols
Character vector of admin columns ordered high to low. Default c("adm0", "adm1", "adm2").
- hf_col_name
Character. Facility name column used in both data frames. Default "hf".
- uid_col
Character. Column in target_df uniquely identifying each facility. Default "hf_uid".
- steps
Named logical vector toggling steps. Default, as shown in Usage: c(exact = TRUE, standardization = TRUE, fuzzy = TRUE, interactive = TRUE).
- lookup_cols
Character vector of lookup_df columns to append to the match results and returned target table. Default character().
- match_interactivity
Logical flag forwarded to the interactive matching helper. Default TRUE.
- fuzzy_methods
Character vector of string distance methods to combine. Supported: c("jw", "lv"). Default c("jw", "lv").
- fuzzy_threshold
Integer in 0-100. Score threshold for accepting a fuzzy match. Default 95.
- score_exact
Integer score for exact matches. Default 100.
- score_standardization
Integer score for standardized matches. Default 100.
- score_interactive
Integer score for interactive matches. Default 95.
- status_cuts
Named numeric thresholds for status classification: c(high = 95, medium = 80, low = 70).
- include_missing_name_rows
Logical. If TRUE, includes rows with a missing facility name, labelled match_method = "missing_name". Default TRUE.
- save_path
Character. Directory path root for outputs. Default NULL; when NULL, results are not written to disk.
- matching_cache_path
Character. Directory for interactive cache and unmatched files. Default here::here(paths$cache).
- save_stem
Character. Base filename stem for saved outputs. Default "facility_matching".
- summary_language
Character vector choosing summary language columns. Supported codes: "en", "fr", "pt". Default "en".
- verbose
Logical. If TRUE, prints a CLI summary box preview of matching results. Default TRUE.
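To make the 0-100 score scale, fuzzy_threshold, and status_cuts concrete, here is a small base R sketch. It rests on several assumptions: it uses only a Levenshtein-based similarity via utils::adist() (in the spirit of the "lv" method; Jaro-Winkler is not available in base R), and the helper names `lv_score()` and `classify_status()`, the `>=` boundary rule, and the "unmatched" label are hypothetical rather than taken from the package.

```r
# Hypothetical helper: Levenshtein similarity rescaled to 0-100.
# Identical strings score 100; the score falls as edit distance grows.
lv_score <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  100 * (1 - d / pmax(nchar(a), nchar(b)))
}

# Hypothetical classifier: assumes a score at or above a cut earns that label,
# and that scores below the lowest cut are labelled "unmatched".
classify_status <- function(score, cuts = c(high = 95, medium = 80, low = 70)) {
  cuts <- sort(cuts, decreasing = TRUE)
  status <- rep("unmatched", length(score))
  # Assign from the lowest cut upward so higher labels overwrite lower ones.
  for (nm in rev(names(cuts))) {
    status[which(score >= cuts[[nm]])] <- nm
  }
  status
}

target_names <- c("kabala hospital", "kabala hospital", "bo clinic")
lookup_names <- c("kabala hospital", "kabala hospitl", "bo government clinic")

scores <- lv_score(target_names, lookup_names)
fuzzy_threshold <- 95L

result <- data.frame(
  target = target_names,
  lookup = lookup_names,
  score = round(scores, 1),
  accepted = scores >= fuzzy_threshold,
  status = classify_status(scores),
  stringsAsFactors = FALSE
)
print(result)
```

Under these assumptions, a one-character typo in a 15-character name scores about 93, which falls below the default fuzzy_threshold of 95 but would still be classified as "medium" by the default status_cuts.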