flowchart TD
Start[DHIS2 Health Facilities] --> Step3[Step 3: Initial Matching<br/>Diagnostics]
Step3 --> Step4[Step 4: Interactive Stratified<br/>Geographic Matching]
Step3 --> Step5[Step 5: Process & Prepare<br/>Unmatched Data]:::altPath
Step4 --> Decision1{Matched?}
Decision1 -->|Yes | Matched1[High-Confidence<br/>Matched Records]
Decision1 -->|No | Step5
Step5 --> Step6[Step 6: Fuzzy Matching<br/>on Remaining Unmatched]
Step6 --> Step7[Step 7: Evaluate Fuzzy<br/>Match Quality]
Step7 --> Step8[Step 8: Finalize Fuzzy<br/>Match Selection]
Step8 --> Decision2{Matched?}
Decision2 -->|Yes| Matched2[Fuzzy Matched<br/>Records]
Decision2 -->|No| Unmatched[Final Unmatched<br/>for Manual Review]
Matched1 --> Step9[Step 9: Combine All<br/>Matched Results]
Matched2 --> Step9
Unmatched --> Step9
Step9 --> Step10[Step 10: Final<br/>Checks]
Step10 --> Step11[Step 11: Save<br/>Final Datasets]
classDef altPath fill:#d1c4e9,stroke:#673ab7,stroke-width:2px;
style Start fill:#e1f5fe
style Step3 fill:#e8f5e9
style Step4 fill:#fff3e0
style Step5 fill:#f3e5f5
style Step6 fill:#f3e5f5
style Step7 fill:#fff9c4
style Step8 fill:#f3e5f5
style Step9 fill:#e8f5e9
style Step10 fill:#fff9c4
style Step11 fill:#c8e6c9
style Matched1 fill:#a5d6a7
style Matched2 fill:#ffcc80
style Unmatched fill:#ffab91
Fuzzy matching of names across datasets
Intermediate
Overview
On this page, we demonstrate how to apply fuzzy matching between DHIS2 and the master health facility list (MFL) in a structured, reviewable way. While the worked example uses DHIS2 and the MFL, we can apply the same fuzzy matching approach to any dataset with facility names, or any string field used as a join key across multiple datasets. For example, if the goal is to join a shapefile with tabular data and adm2 names are not joining properly, consider applying fuzzy name matching on the adm2 names across the two datasets.
The MFL is the national reference list of health facilities. Maintained by the Ministry of Health or national statistics office, it provides standardized names, unique facility IDs, coordinates, type, and administrative units, and can also provide other information such as operational status and availability of particular services. The MFL is often the only dataset of health facilities that includes their geolocation.
Joining other datasets, such as routine surveillance, with the MFL therefore makes possible many kinds of subnational analysis. If unique health facility IDs are not available across both the MFL and the dataset to be joined or if unique IDs are unreliable, then the joining will rely on matching health facility names. However, variations in the spelling, abbreviations, word order, or formatting of health facility names can make this process challenging.
One possible solution that is reasonably scalable and automatable is fuzzy name matching, a technique that uses string similarity algorithms to identify close matches between text fields even when they differ slightly. While not perfect, this approach can improve the matching rate between datasets where facility names differ due to inconsistent formatting, spelling, or abbreviations, leaving fewer unmatched facilities to resolve manually and through consultation with the SNT team. Fuzzy matches should still be reviewed, especially in important applications.
- Run diagnostics to check exact matches, admin alignment, and duplicates
- Use stratified geographic matching with human validation for high-confidence matches
- Standardize names and abbreviations to prepare unmatched data
- Apply fuzzy matching on remaining facilities with candidate grids and similarity scores
- Assess fuzzy match quality with diagnostics, visuals, and thresholds
- Select a matching strategy (single, composite, weighted, or fallback) based on data needs
- Merge stratified and fuzzy matches into one dataset
- Save final matched output for analysis and integration
Understanding Fuzzy Matching
Fuzzy matching helps us reconcile text fields that don’t exactly align but are highly likely to be referring to the same entity, something that can happen when working with health facility names in real-world data. In this section, we explain what fuzzy matching is, why it’s needed, how it works, and what we need to consider when applying it in the context of SNT.
Why Exact Matching Isn’t Sufficient
In theory, if the same facility appears in two datasets, we should be able to join them directly by name. In practice, facility names may sometimes be spelled differently, use inconsistent abbreviations, or contain formatting differences that prevent a clean join.
For example, a facility might appear in DHIS2 as “Makeni Gov. Hosp” and in the MFL as “Makeni Government Hospital”. A direct match won’t work, even though the two clearly refer to the same place. Small differences in spelling, punctuation, or word order require a more flexible matching approach.
What Is Fuzzy String Matching?
Fuzzy string matching is a method for finding approximate matches between text strings (a text string is the data format that coding languages use for handling text). Instead of requiring an exact match, fuzzy string matching calculates how similar two strings are, usually with a score between 0 and 100. A higher score means the strings are more alike.
This is useful when working with real-world facility names. Fuzzy matching allows us to say: “these two names are close enough to be considered a likely match,” and lets us decide where to draw the line between automatic match, manual review, or no match.
In the SNT process, health facility data can come from multiple sources with varying naming conventions. For example, the MFL might serve as the authoritative source, while the DHIS2 could contain operational variations of the same facility name. Fuzzy matching helps reconcile these differences systematically.
Before performing fuzzy matching, review data considerations with the SNT team. These considerations may include:
- Facility name completeness (no missing values in key columns)
- Local naming conventions and common abbreviations, so that fuzzy matching can be set up to account for them
- Protocol for questionable matches
Below is an example output from fuzzy matching. The algorithm compares facility names from DHIS2 to entries in the MFL, assigning a similarity score to each match. Higher scores (closer to 100) indicate stronger similarity; lower scores may need manual review to determine acceptability or exclusion.
| Facility Name (DHIS2) | Best Match in MFL | Score | Decision |
|---|---|---|---|
| Makeni Govt. Hospital | Makeni Government Hospital | 96 | High confidence |
| Makeni Goverment Hospital | Makeni Government Hospital | 93 | High confidence |
| Police CHC | Police Community Health Center | 90 | High confidence |
| Loreto Clinic | Clinic Loreto | 88 | High confidence |
| Centre Medical | Centre Médical | 85 | High confidence |
| An-Nour Hospital | An-Noor Hospital | 82 | Accept with caution |
| Rahma Clinic | Rahmah Clinic | 87 | High confidence |
| Bo MCHP | Bo Maternal Child Health Post | 84 | Accept with caution |
| Clinic A | Kenema Government Hospital | 41 | Needs manual review |
| ABC Health Post | Makeni Government Hospital | 29 | Needs manual review |
The table above illustrates fuzzy matching scenarios that may arise. Below are some examples of the types of variations we are addressing:
- Abbreviations:
Bo MCHP → Bo Maternal Child Health Post - Spelling inconsistencies or typos:
Makeni Goverment Hospital → Makeni Government Hospital - Formatting differences:
Police CHC → Police Community Health Center - Word order variations:
Loreto Clinic → Clinic Loreto - Inconsistent accents:
Centre Medical → Centre Médical - Inconsistent transliterations:
Rahma Clinic → Rahmah Clinic
Choosing a Matching Strategy
Fuzzy name matching involves defining what counts as “similar” and deciding which kinds of differences (like spelling, order, or abbreviations) are acceptable. Here we discuss how to assess the quality of matches and when to use different types of string similarity algorithms. Understanding these concepts helps ensure the methods we apply are appropriate for the kinds of mismatches we expect in the data.
How Do We Know if a Match Is “Good”?
Fuzzy matching algorithms return similarity scores that estimate how alike two names are. But a high similarity score does not always guarantee a correct match. For example, facilities with shared components like “community health post” may receive high scores, even if the match is incorrect. This highlights the value of looking beyond raw scores.
To improve confidence in matches, we can use additional diagnostics that assess match quality from different angles. These include:
- Prefix Match: Names start with similar words or characters
- Suffix Match: Names end in similar ways, such as shared final words
- Token Difference: Similar number of words in each name
- Character Difference: Overall lengths of the names are close to each other
These diagnostics are meant to complement similarity scores, not replace them. The diagnostics help identify cases where a high score may be misleading and guide how we filter, rank, or combine results across multiple algorithms.
Consider this example:
Let’s say Makeni Govt Hospital matches with Makeni Government Hospital. But how good is this match?
We can assess the quality using a few intuitive features:
The prefix matches: both start with Makeni. The suffix matches: both end with Hospital. The token difference is 0, both contain three words. The character difference is small, only 6 characters apart. The difference mainly comes from Govt vs Government.
These features suggest a strong match, even if the raw string similarity score is imperfect.
Choosing Among String Matching Algorithms
Different algorithms are better suited to different mismatch types, such as typos, abbreviations, or word order. Choosing the right algorithm depends on the specific inconsistencies in the data. We may want to try multiple algorithms to match across different types of inconsistencies. Below, we highlight the main algorithms, their strengths, and their limitations.
Levenshtein distance (edit distance)
Counts the number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another.
- Best for: Typos and misspellings
- Example:
Makeni Goverment Hospital → Makeni Government Hospital - Limitations: Does not handle word swaps or phonetic differences
Jaro-Winkler similarity
Measures character overlap and transpositions, with extra weight for common prefixes.
- Best for: Prefix alignment and small rearrangements
- Examples:
Loreto Clinic → Clinic Loreto, orMakeni Govt Hospital → Makeni Government Hospital - Limitations: Weak with missing words or abbreviations
Q-gram distance
Compares overlapping sequences (e.g., 2- or 3-character chunks) between strings.
- Best for: Substring overlap and fuzzy substring detection
- Example:
Kenema Town MCHP → Kenema MCH Post - Limitations: Can over-penalize short names or spacing issues
Longest common subsequence (LCS)
Finds the longest sequence of characters that appear left-to-right in both strings (not necessarily contiguously).
- Best for: Partial matches with insertions
- Example:
St Mary Hosp → Saint Mary Hospital - Limitations: Ignores spacing and order changes beyond sequence alignment
Soundex and other phonetic algorithms
Convert words into phonetic codes based on how they sound.
- Best for: Names with different spellings but similar pronunciation
- Examples:
Rahma Clinic → Rahmah Clinic, orAn-Nour Hospital → An-Noor Hospital - Limitations: Fails with structural or abbreviation-based mismatches
Best Practices for an Effective Fuzzy Matching Workflow
Fuzzy matching is only as effective as the preparation and decision rules that support it. This section outlines key steps and good practices that improve the accuracy, transparency, and efficiency of the matching workflow. Below are some tips to improve fuzzy matching:
Tip 1: Work from a cleaned copy of the original name column
Always preserve the original facility name column and perform all cleaning steps on a new copy. This ensures:
- We can always refer back to the original, untouched names
- All joins and fuzzy matching occur on the cleaned and standardized version
- We retain the ability to inspect and manually resolve low-confidence or unmatched records later using the original, unformatted names
This separation is especially useful for manual matching workflows, where human judgment relies on seeing names as they originally appeared.
Tip 2: Preprocess text to reduce noise
Before running any matching algorithm, clean and normalize the text. The list below provides several suggestions; confirm whether each makes sense in the context of the project:
- Remove extra spaces to prevent false mismatches
- Trim whitespace: Remove extra spaces at the beginning or end of text
- Collapse multiple spaces: Replace repeated spaces with a single space
- Standardize text characters to ensure consistency
- Convert all text to lowercase: Ensure consistency regardless of original casing
- Normalize accented characters: Convert characters like “é” to “e”
- Standardize non-text characters
- Standardize punctuation: Replace or remove punctuation to reduce irrelevant variation
- Convert Roman numerals to numbers: e.g., “Koinadugu II CHC” → “Koinadugu 2 CHC”
- Standardize word use, if applicable
- Sort words alphabetically within each name: Handle word order differences by splitting names into words, sorting them, and rejoining. For example, both “Port Loko CHC” and “CHC Port Loko” become “CHC Loko Port”, making them directly comparable despite naming convention differences.
This step reduces irrelevant variation and reduces matching burden on the algorithms.
Tip 3: Handle abbreviations and domain-specific terms
Abbreviations and local terms in facility names can lead to mismatches if not standardized. Depending on the context, we can either expand abbreviations (e.g., CHC → Community Health Center) or contract long forms into standard short forms. Each approach may yield different matching outcomes. What matters is applying the transformation consistently to both datasets.
Examples of variations:
CHCvsCommunity Health CenterPHUvsPeripheral Health UnitMCHPvsMaternal and Child Health Post
Recommended practices:
- Build a simple abbreviation dictionary for the project context
- Apply substitution rules (for example, replace all
CHCwithCommunity Health Center) before running matching algorithms - Use domain-specific expansions or contractions where appropriate
We can scan facility name fields for potential abbreviations by looking for short, all-uppercase tokens that appear within longer strings. A common heuristic is:
Look for 2–5 letter sequences in all capital letters that appear standalone among lowercase or mixed-case words.
This often finds abbreviations like CHC, PHU, or MCHP embedded in names such as Bo MCHP or Kailahun PHU. This check can be done by eye, or with code.
Tip 4: Limit matching scope using geographic information
Fuzzy matching is more accurate and efficient when restricted to a plausible geographic scope. Instead of comparing every DHIS2 facility name to every MFL name in the entire country, limit comparisons to within the same administrative unit, such as adm1 (region) or adm3 (chiefdom), depending on the available data quality.
- For example, only compare names where
adm1andadm3are the same for both facilities - This speeds up processing and ensures that identically named facilities in different geographic areas are not mistakenly matched to each other
This approach assumes that:
- The administrative fields (
adm1,adm2,adm3) are already cleaned and harmonized across both the DHIS2 and MFL datasets - Facilities are correctly assigned to their administrative units, with no major misplacements or missing assignments
- Remaining unmatched facilities are indeed name mismatches, not geographic mismatches needing spatial correction
The fuzzy matching approaches outlined on this page can be used to clean and match admin unit names.
By grounding fuzzy matching in geographic logic, this step improves both the relevance and reliability of the matching results.
Tip 5: Apply multiple similarity algorithms
Rather than relying on a single algorithm, select the best match based on the highest score across multiple matching algorithms, all while considering supporting quality checks. This match-first approach ensures flexibility and avoids locking decisions to one metric.
This is important because each fuzzy matching algorithm captures a different aspect of similarity. For example:
- Levenshtein: Good for typos and deletions
- Jaro-Winkler: Sensitive to character transpositions and prefix agreement
- Qgram / LCS: Detect reordered or partial word overlaps
- Rank-Ensemble: Prioritizes agreement across methods by comparing rank positions
- Composite Scores: Integrates multiple metrics for better consensus
By comparing and combining outputs from these methods, we can better handle the diverse inconsistencies found in real-world datasets.
Tip 6: Review and validate with the SNT team
Automated matching can produce strong results, but human review is required. Always check:
- High-confidence matches (85+): Very likely a true match and can be quickly verified by eye
- Mid-confidence matches (70–84): May require correction
- Low scores (below 70): Likely to be false matches
- Similar-looking names in different districts: May indicate over-matching
- Unmatched facilities: May need manual review or database updates
Always conduct a structured review of final match results with the SNT team. Validation ensures the SNT team applies local knowledge and prevents misclassification, especially when matches inform planning, facility mapping, or coverage analysis.
Matching Workflow Overview
This overview explains the end-to-end reconciliation used in Sierra Leone to match DHIS2 facility names to the national MFL. The workflow is auditable, prioritizes high-confidence matches first, and produces a single review-ready table with the selected match, similarity diagnostics, decision flags (accept/review), and stable IDs for downstream use.
Two-Phase Approach
- Phase 1: Stratified geographic matching (
adm2/adm3): Anchor comparisons within administrative boundaries to capture most matches with high confidence and expose misassigned admins early - Phase 2: Fuzzy name matching on remainders: Standardize names, generate candidates without geographic constraints, score with multiple algorithms, and use diagnostics to select matches for review/acceptance
11-Step Process Summary
Follow these 11 steps; detailed, runnable code appears in the next section.
- Install and load required libraries for data manipulation, fuzzy string matching, and file handling.
- Load data from DHIS2 and Master Facility List sources with initial cleaning.
- Conduct initial matching diagnostics to assess exact match rates, administrative alignment, and duplicate name patterns.
- Apply interactive stratified geographic matching to resolve high-confidence matches using geographic constraints and human validation.
- Process and prepare unmatched data by standardizing names and handling abbreviations to reduce variation.
- Perform fuzzy matching on remaining unmatched by constructing candidate grids, computing similarity scores, creating composite scores, and extracting best matches.
- Evaluate fuzzy match quality using diagnostics, visualizations, and weighted thresholding based on structural quality.
- Finalize fuzzy match selection using one of four strategies: single method, composite score, weighted composite, or progressive fallback loop.
- Combine all matched results from stratified matching and fuzzy matching approaches into a comprehensive dataset.
- Perform final checks to validate the matching results and identify any remaining issues.
- Save final datasets including matched results, unmatched facilities, and summary statistics for downstream analysis.
Together, the two phases minimize manual reconciliation while keeping ambiguous cases transparent for SNT team review.
Workflow Diagram
The diagram below summarizes the phased process from diagnostics to final output.
- For direct fuzzy matching without geographic stratification, follow the purple path from Step 3 → Step 5 (bypassing Step 4).
- For the complete matching workflow with geographic stratification, follow all steps sequentially from Step 3 through Step 11.
This provides flexibility: skip stratification if only name-based matching is needed, or include it when location context is important.
After stratified geographic matching, a smaller set of facilities remain unresolved. These are usually due to misassigned administrative units, newly opened or closed facilities, or name variations too large for geographic matching to resolve. These unmatched records then move into Step 5, where standardization and fuzzy matching provide another opportunity to recover valid matches.
Step-by-Step
Follow the 11-step workflow outlined in Matching Workflow Overview. The sections below provide runnable, Sierra Leone-specific code for each step and notes on how to adapt it to the project context.
To skip the step-by-step explanation, jump to the full code at the end of this page.
Step 1: Install and Load Required Libraries
Install and load the necessary packages for data manipulation, fuzzy string matching, and file handling.
# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# load all required packages using pacman
pacman::p_load(
readxl, # for reading Excel files
dplyr, # for data manipulation
stringdist, # for calculating string distances (fuzzy matching)
tibble, # for working with modern data frames
knitr, # for creating formatted tables
openxlsx, # for writing Excel files
httr, # for HTTP requests to download files (optional)
here # for cross-platform file paths
)Install packages in your terminal, if not already installed. If you need help installing packages, please refer to the Getting Started page.
from pathlib import Path
import re
import unicodedata
import numpy as np
import pandas as pd
import pyreadr
import xxhash
import matplotlib.pyplot as plt
from pyprojroot import here
from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz
def read_rds(path):
"""Read a single-object RDS file as a pandas DataFrame."""
result = pyreadr.read_r(str(path))
return next(iter(result.values()))
def vdigest(x, algo="xxhash32"):
"""Vectorised xxhash32 digest. Mirrors sntutils::vdigest()."""
return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())
def cli_header(message):
print(f"\n{message}")
def cli_info(message):
print(f"INFO: {message}")
def cli_success(message):
print(f"SUCCESS: {message}")
def cli_warning(message):
print(f"WARNING: {message}")
def cli_danger(message):
print(f"ERROR: {message}")
def anti_join(left, right, on):
"""Return rows in left with no matching key in right."""
right_keys = right[on].drop_duplicates()
return (
left.merge(right_keys, on=on, how="left", indicator=True)
.loc[lambda x: x["_merge"] == "left_only"]
.drop(columns="_merge")
)
def show_table(df, n=10, caption=None):
"""Render a compact scrollable HTML table with the .out-table style.
Chunks calling this must set #| results: asis."""
from IPython.display import display, HTML
rows = df.head(n)
cap_html = f"<caption>{caption}</caption>" if caption else ""
table_html = rows.to_html(
index=False,
classes="out-table",
border=0,
na_rep="",
)
# inject caption before the table header
if cap_html:
table_html = table_html.replace(
"<thead>", cap_html + "<thead>", 1
)
display(HTML(f'<div class="out-scroll">{table_html}</div>'))To adapt the code:
- Keep these imports and helpers at the top of the Python workflow. Later Python chunks use them for fuzzy matching, data loading, and diagnostic output.
Step 2: Load Data
This step imports the DHIS2 and MFL health facility datasets. It then creates dhis2_hf_df, which contains the unique administrative units and facility names for fuzzy matching. Finally, it outputs samples from each dataset to review the columns and preview the first rows, ensuring the outputs look as expected
Show the code
# set up path to dhis2 hf data
dhis2_path <- here::here(
"01_data",
"1.2_epidemiology",
"1.2a_routine_surveillance",
"processed"
)
hf_path <- here::here(
"01_data",
"1.1_foundational",
"1.1c_health_facilities",
"processed"
)
# read DHIS2 health facilities data
dhis2_df <- readRDS(
here::here(dhis2_path, "sle_dhis2_with_clean_adm3.rds")
) |>
# rename facility name columns for clarity and consistency
dplyr::rename(hf_dhis2_raw = hf)
# get distinct admin and hf cols
dhis2_hf_df <- dhis2_df |>
dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw)
# read MFL health facilities data
master_hf_df <- read.csv(
here::here(hf_path, "hf_final_clean_data.csv")
) |>
dplyr::distinct(
adm0, adm1, adm2, adm3, hf, lat, long, .keep_all = TRUE
) |>
# rename facility name columns for clarity and consistency
dplyr::mutate(hf_mfl_raw = hf)
# attach a stable DHIS2 facility ID for consistent counting across
# steps
# distinct by geography + name to avoid collisions across admins
dhis2_map <- dhis2_df |>
dplyr::distinct(adm0, adm1, adm2, adm3, hf) |>
dplyr::mutate(
hf_uid_new = sntutils::vdigest(
paste(
tolower(stringr::str_squish(adm0)),
tolower(stringr::str_squish(adm1)),
tolower(stringr::str_squish(adm2)),
tolower(stringr::str_squish(adm3)),
tolower(stringr::str_squish(hf)),
sep = "|"
),
algo = "xxhash32"
)
)
# print head of data
cli::cli_h3("Sample of DHIS2 data:")
head(dhis2_hf_df)
cli::cli_h3("Sample of MFL data:")
head(master_hf_df)To adapt the code:
- Lines 2–7: Update dhis2 file paths to match the location of the data files.
- Lines 9–14: Update MFL file paths to match the location of the data files.
- Lines 18, 29: Change file names to match the specific data files.
- Lines 21, 35: Update
hfand the renamed columns (hf_mfl_raw,hf_dhis2_raw) to match the actual health facility name columns in the datasets.
Show the code
# set up path to dhis2 hf data
dhis2_path = Path(here("01_data/1.2_epidemiology/1.2a_routine_surveillance/processed"))
hf_path = Path(here("01_data/1.1_foundational/1.1c_health_facilities/processed"))
# read DHIS2 health facilities data
dhis2_df = read_rds(dhis2_path / "sle_dhis2_with_clean_adm3.rds")
dhis2_df = dhis2_df.assign(hf_dhis2_raw=dhis2_df["hf"])
# get distinct admin and hf cols
dhis2_hf_df = dhis2_df[
["adm0", "adm1", "adm2", "adm3", "hf", "hf_dhis2_raw"]
].drop_duplicates()
# read MFL health facilities data
master_hf_df = pd.read_csv(hf_path / "hf_final_clean_data.csv")
master_hf_df = (
master_hf_df
.drop_duplicates(subset=["adm0", "adm1", "adm2", "adm3", "hf", "lat", "long"])
.assign(hf_mfl_raw=lambda d: d["hf"])
)
# attach a stable DHIS2 facility ID
dhis2_map = (
dhis2_df[["adm0", "adm1", "adm2", "adm3", "hf"]]
.drop_duplicates()
.assign(
hf_uid_new=lambda d: vdigest(
d["adm0"].str.lower().str.strip() + "|" +
d["adm1"].str.lower().str.strip() + "|" +
d["adm2"].str.lower().str.strip() + "|" +
d["adm3"].str.lower().str.strip() + "|" +
d["hf"].str.lower().str.strip()
)
)
)
# print head of data
cli_header("Sample of DHIS2 data:")
dhis2_hf_df.head()
cli_header("Sample of MFL data:")
master_hf_df.head()To adapt the code:
- Lines 2–3: Update dhis2 and MFL file paths to match the location of the data files.
- Lines 6, 14: Change file names to match the specific data files.
- Lines 8, 19: Update
hfand the renamed columns (hf_mfl_raw,hf_dhis2_raw) to match the actual health facility name columns in the datasets.
Step 3: Initial Matching Diagnostics
Before attempting any matching, we assess the data systematically to understand the matching challenge. We break this into three focused checks: overall exact matches, administrative-level matching patterns, and duplicate name issues.
Step 3.1: Overall exact match check
Next, we check overall exact matches without administrative constraints and calculate the total matching potential.
Show the code
# check for exact matches on raw names (no admin constraint)
exact_matches_all <- dhis2_hf_df |>
dplyr::inner_join(
master_hf_df,
by = c("hf_dhis2_raw" = "hf_mfl_raw"),
relationship = "many-to-many"
)
# calculate matching potential
total_dhis2 <- nrow(dhis2_hf_df)
total_mfl <- nrow(master_hf_df)
unmatched_dhis2 <- total_dhis2 - nrow(exact_matches_all)
cli::cli_h3("Overall Matching Summary")
cli::cli_alert_info("Total DHIS2 facilities: {total_dhis2}")
cli::cli_alert_info("Total MFL facilities: {total_mfl}")
cli::cli_alert_success(
paste0(
"Exact matches found: {nrow(exact_matches_all)} (",
"{round(nrow(exact_matches_all)/total_dhis2*100, 1)}%)
)
)
cli::cli_alert_warning("Remaining to match: {unmatched_dhis2}")To adapt the code:
- Line 5: Update join columns to match the facility name fields.
Show the code
# check for exact matches on raw names (no admin constraint)
exact_matches_all = dhis2_hf_df.merge(
master_hf_df[["hf_mfl_raw"]],
left_on="hf_dhis2_raw",
right_on="hf_mfl_raw",
how="inner",
)
# calculate matching potential
total_dhis2 = len(dhis2_hf_df)
total_mfl = len(master_hf_df)
unmatched_dhis2 = total_dhis2 - len(exact_matches_all)
cli_header("Overall Matching Summary")
cli_info(f"Total DHIS2 facilities: {total_dhis2}")
cli_info(f"Total MFL facilities: {total_mfl}")
cli_success(
f"Exact matches found: {len(exact_matches_all)} "
f"({round(len(exact_matches_all) / total_dhis2 * 100, 1)}%)"
)
cli_warning(f"Remaining to match: {unmatched_dhis2}")To adapt the code:
- Line 5: Update join columns to match the facility name fields.
A majority of facilities match exactly by name as shown in the output above. This baseline highlights the scope before applying geographic or fuzzy matching approaches.
Step 3.2: Administrative level match check
First, we examine how well facilities match when we consider administrative boundaries, checking matches at both adm2 (district) and adm3 (chiefdom) levels.
Show the code
# check matches at adm2 level (district)
dhis2_by_adm2 <- dhis2_hf_df |>
dplyr::group_by(adm2) |>
dplyr::summarise(total_dhis2 = dplyr::n())
matches_by_adm2 <- dhis2_hf_df |>
dplyr::inner_join(
master_hf_df,
by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2")
) |>
dplyr::group_by(adm2) |>
dplyr::summarise(exact_matches = dplyr::n()) |>
dplyr::left_join(dhis2_by_adm2, by = "adm2") |>
dplyr::mutate(
match_rate = round(exact_matches / total_dhis2 * 100, 1)
) |>
dplyr::select(adm2, exact_matches, total_dhis2, match_rate) |>
dplyr::arrange(dplyr::desc(match_rate))
cli::cli_h3("Exact matches by district (adm2))
matches_by_adm2
# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 <- dhis2_hf_df |>
dplyr::group_by(adm2, adm3) |>
dplyr::summarise(total_dhis2 = dplyr::n(), .groups = "drop")
matches_by_adm3 <- dhis2_hf_df |>
dplyr::inner_join(
master_hf_df,
by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2", "adm3")
) |>
dplyr::group_by(adm2, adm3) |>
dplyr::summarise(exact_matches = dplyr::n(), .groups = "drop") |>
dplyr::left_join(dhis2_by_adm3, by = c("adm2", "adm3")) |>
dplyr::mutate(
match_rate = round(exact_matches / total_dhis2 * 100, 1)
) |>
dplyr::filter(total_dhis2 >= 5) |> # only show areas with 5+ facilities
dplyr::arrange(dplyr::desc(match_rate)) |>
dplyr::slice_head(n = 10) # show top 10 performing adm3 areas
cli::cli_h3("Top exact matches by chiefdom (adm3))
matches_by_adm3To adapt the code:
- Line 9: Update join columns to match the facility name fields.
- Line 11: Modify the administrative unit column (
adm2) to match the district level. - Lines 25, 31: Update
adm2andadm3columns to match the sub-district level.
Show the code
# check matches at adm2 level (district)
dhis2_by_adm2 = (
dhis2_hf_df.groupby("adm2", as_index=False)
.agg(total_dhis2=("hf_dhis2_raw", "count"))
)
matches_by_adm2 = (
dhis2_hf_df
.merge(
master_hf_df[["hf_mfl_raw", "adm2"]],
left_on=["hf_dhis2_raw", "adm2"],
right_on=["hf_mfl_raw", "adm2"],
how="inner",
)
.groupby("adm2", as_index=False)
.agg(exact_matches=("hf_dhis2_raw", "count"))
.merge(dhis2_by_adm2, on="adm2", how="left")
.assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
[["adm2", "exact_matches", "total_dhis2", "match_rate"]]
.sort_values("match_rate", ascending=False)
)
cli_header("Exact matches by district (adm2)")
matches_by_adm2
# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 = (
dhis2_hf_df.groupby(["adm2", "adm3"], as_index=False)
.agg(total_dhis2=("hf_dhis2_raw", "count"))
)
matches_by_adm3 = (
dhis2_hf_df
.merge(
master_hf_df[["hf_mfl_raw", "adm2", "adm3"]],
left_on=["hf_dhis2_raw", "adm2", "adm3"],
right_on=["hf_mfl_raw", "adm2", "adm3"],
how="inner",
)
.groupby(["adm2", "adm3"], as_index=False)
.agg(exact_matches=("hf_dhis2_raw", "count"))
.merge(dhis2_by_adm3, on=["adm2", "adm3"], how="left")
.assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
.loc[lambda d: d["total_dhis2"] >= 5]
.sort_values("match_rate", ascending=False)
.head(10)
)
cli_header("Top exact matches by chiefdom (adm3)")
matches_by_adm3To adapt the code:
- Line 9: Update join columns to match the facility name fields.
- Line 11: Modify the administrative unit column (
adm2) to match the district level. - Lines 31, 39: Update
adm2andadm3columns to match the sub-district level.
District match rates vary widely, from 86% (Karene) to 39% (Western Urban). Some chiefdoms achieve 100% matches while others struggle. This shows which areas have clean, standardized names versus those needing intensive matching work.
Step 3.3: Duplicate name check
Finally, we check duplicate facility names to understand potential matching complications.
Show the code
# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 <- dhis2_hf_df |>
dplyr::group_by(adm2, hf_dhis2_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, hf_dhis2_raw)
mfl_dups_adm2 <- master_hf_df |>
dplyr::group_by(adm2, hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, hf_mfl_raw)
cli::cli_h3("Duplicates within same district (adm2)")
cli::cli_alert_warning(
paste0(
"DHIS2 duplicates within districts: ",
"{length(unique(dhis2_dups_adm2$hf_dhis2_raw))}"
)
)
cli::cli_alert_warning(
paste0(
"MFL duplicates within districts: ",
"{length(unique(mfl_dups_adm2$hf_mfl_raw))}"
)
)
# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 <- dhis2_hf_df |>
dplyr::group_by(adm2, adm3, hf_dhis2_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, adm3, hf_dhis2_raw)
mfl_dups_adm3 <- master_hf_df |>
dplyr::group_by(adm2, adm3, hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, adm3, hf_mfl_raw)
cli::cli_h3("Duplicates within same chiefdom (adm3)")
cli::cli_alert_danger(
paste0(
"DHIS2 duplicates within chiefdoms: ",
"{length(unique(dhis2_dups_adm3$hf_dhis2_raw))}"
)
)
cli::cli_alert_danger(
paste0(
"MFL duplicates within chiefdoms: ",
"{length(unique(mfl_dups_adm3$hf_mfl_raw))}"
)
)
# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall <- dhis2_hf_df |>
dplyr::group_by(hf_dhis2_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(hf_dhis2_raw)
mfl_dups_overall <- master_hf_df |>
dplyr::group_by(hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(hf_mfl_raw)
cli::cli_h3("Overall duplicate names (across all areas)")
cli::cli_alert_info(
"DHIS2 overall duplicates: {length(
unique(dhis2_dups_overall$hf_dhis2_raw)
)}"
)
cli::cli_alert_info(
"MFL overall duplicates: {length(
unique(mfl_dups_overall$hf_mfl_raw)
)}"
)To adapt the code:
- Lines 3–4, 8–9: Update
adm2and facility name columns. - Lines 28–29, 33–34: Update
adm2,adm3and facility name columns. - Lines 54, 56, 59, 61: Update facility name columns for overall duplicates.
Show the code
# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 = (
dhis2_hf_df
.groupby(["adm2", "hf_dhis2_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "hf_dhis2_raw"])
)
mfl_dups_adm2 = (
master_hf_df
.groupby(["adm2", "hf_mfl_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "hf_mfl_raw"])
)
cli_header("Duplicates within same district (adm2)")
cli_warning(
f"DHIS2 duplicates within districts: {dhis2_dups_adm2['hf_dhis2_raw'].nunique()}"
)
cli_warning(
f"MFL duplicates within districts: {mfl_dups_adm2['hf_mfl_raw'].nunique()}"
)
# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 = (
dhis2_hf_df
.groupby(["adm2", "adm3", "hf_dhis2_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "adm3", "hf_dhis2_raw"])
)
mfl_dups_adm3 = (
master_hf_df
.groupby(["adm2", "adm3", "hf_mfl_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "adm3", "hf_mfl_raw"])
)
cli_header("Duplicates within same chiefdom (adm3)")
cli_danger(
f"DHIS2 duplicates within chiefdoms: {dhis2_dups_adm3['hf_dhis2_raw'].nunique()}"
)
cli_danger(
f"MFL duplicates within chiefdoms: {mfl_dups_adm3['hf_mfl_raw'].nunique()}"
)
# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall = (
dhis2_hf_df
.groupby("hf_dhis2_raw")
.filter(lambda x: len(x) > 1)
.sort_values("hf_dhis2_raw")
)
mfl_dups_overall = (
master_hf_df
.groupby("hf_mfl_raw")
.filter(lambda x: len(x) > 1)
.sort_values("hf_mfl_raw")
)
cli_header("Overall duplicate names (across all areas)")
cli_info(f"DHIS2 overall duplicates: {dhis2_dups_overall['hf_dhis2_raw'].nunique()}")
cli_info(f"MFL overall duplicates: {mfl_dups_overall['hf_mfl_raw'].nunique()}")To adapt the code:
- Lines 3–4, 8–9: Update
adm2and facility name columns. - Lines 26–27, 31–32: Update
adm2,adm3and facility name columns. - Lines 49, 54: Update facility name columns for overall duplicates.
Duplicates are few at chiefdom level but more noticeable within districts, where DHIS2 lists twice as many as the MFL. Nationally, the MFL has many more duplicate names (49 vs 19). These patterns highlight the need to use geography and unique IDs alongside names to avoid mismatches.
Step 4: Interactive Stratified Geographic Matching
Before standardizing names or performing fuzzy matching, geographic stratification with user judgment can be used to resolve most facility matches through interactive matching. This approach combines the precision of human review with the efficiency of automated suggestions, using the approach described in the Merging shapefiles with tabular data but applied here at the health facility level.
prep_geonames() (available as sntutils::prep_geonames() in R and sntutils.geo.prep_geonames() in Python) handles this entire process, including internal name standardization, geographic stratification, and interactive or automated matching based on user needs.
Show the code
# inner join (keep only matched polygons)
# set up location to save cache
cache_loc <- "01_data/1.1_foundational/1.1f_cache_files/processed"
# interactive stratified matching with automatic standardization
# this function handles name standardization internally
dhis2_df_cleaned <-
sntutils::prep_geonames(
target_df = dhis2_hf_df, # dataset to be cleaned
lookup_df = master_hf_df, # reference dataset with correct admin
level0 = "adm0",
level1 = "adm1",
level2 = "adm2",
level3 = "adm3",
level4 = "hf",
cache_path = here::here(cache_loc, "geoname_cache.rds"),
unmatched_export_path = here::here(cache_loc, "dhis2_hf_unmatched.rds")
)
# load the unmatched facilities for further processing (step 5–8 fuzzy
# matching)
dhis2_hf_to_process <- readRDS(
here::here(cache_loc, "dhis2_hf_unmatched.rds")
) |>
dplyr::select(adm0, adm1, adm2, adm3, hf_dhis2_raw = hf)
# summary statistics
n_original <- nrow(dhis2_hf_df)
n_matched <- n_original - nrow(dhis2_hf_to_process)
match_rate <- (n_matched / n_original) * 100
cli::cli_alert_success(
paste0(
"Stratified matching completed: ",
"{format(n_matched, big.mark = ',')}/{format(n_original, big.mark = ',')}",
" facilities matched ({round(match_rate, 1)}%)"
)
)
cli::cli_alert_info(
"Remaining unmatched: {nrow(dhis2_hf_to_process)} facilities"
)To adapt the code:
- Line 3: Update
cache_locto point to the folder for saving both the cache file and the unmatched facilities export. This should be a stable location that can be reused in later steps (e.g.,"01_data/1.1_foundational/1.1f_cache_files/processed"). - Lines 9–10: Replace
dhis2_hf_dfandmaster_hf_dfwith the dataset names. - Lines 11–15: Update column names to match the geographic hierarchy (
adm0,adm1,adm2,adm3) and facility name column (hf). - Lines 16–17: Update file paths for both the cache (
cache_path) and the unmatched facilities export (unmatched_export_path), which will be used as the input for the fuzzy matching steps (5–8). - Line 23: Use the same path as
unmatched_export_pathto load the unmatched facilities data for the subsequent cleaning and fuzzy matching steps.
Show the code
from sntutils.geo import prep_geonames
# set up location to save cache
cache_loc = "01_data/1.1_foundational/1.1f_cache_files/processed"
# interactive stratified matching with automatic standardization
dhis2_df_cleaned = prep_geonames(
target_df=dhis2_hf_df,
lookup_df=master_hf_df,
level0="adm0",
level1="adm1",
level2="adm2",
level3="adm3",
level4="hf",
cache_path=here(cache_loc, "geoname_cache.csv"),
unmatched_export_path=here(cache_loc, "dhis2_hf_unmatched.csv"),
)
# load the unmatched facilities for further processing (steps 5–8)
dhis2_hf_to_process = pd.read_csv(
here(cache_loc, "dhis2_hf_unmatched.csv")
).rename(columns={"hf": "hf_dhis2_raw"})
# summary statistics
n_original = len(dhis2_hf_df)
n_matched = n_original - len(dhis2_hf_to_process)
match_rate = n_matched / n_original * 100
cli_success(
f"Stratified matching completed: {n_matched:,}/{n_original:,} "
f"facilities matched ({round(match_rate, 1)}%)"
)
cli_info(f"Remaining unmatched: {len(dhis2_hf_to_process)} facilities")To adapt the code:
- Line 3: Update
cache_locto point to the folder for saving both the cache file and the unmatched facilities export. - Lines 8–9: Replace
dhis2_hf_dfandmaster_hf_dfwith the dataset names. - Lines 10–14: Update column names to match the geographic hierarchy (
adm0,adm1,adm2,adm3) and facility name column (hf). - Lines 15–16: Update file paths for both the cache and the unmatched facilities export.
- Line 21: Use the same path as
unmatched_export_pathto load unmatched facilities.
With our interactive stratified matching, we matched 1,203 of 1,771 facilities (67.9%). All administrative levels matched perfectly, but some facilities remain unmatched at the facility name level. The results suggest that combining geographic constraints with human validation is effective, leaving a smaller subset of facilities still to be addressed.
The cached results should always be validated with the SNT team before integration. This review step helps to:
- Confirm matches using local knowledge, especially for common or similar names
- Resolve ambiguities where facilities are misassigned or have multiple candidates
- Ensure consistency with national naming standards and classifications
- Document decisions to support future harmonization and reduce repeated effort
The cache is both a validation record and a resource for improving data quality over time. The table below shows the contents of the saved cache from the stratified matching process (cache_path).
The cache can also be reused in future runs to avoid repeating interactive matching. It preserves past decisions, ensures consistency, and provides an efficient starting point for updating facility names with the SNT team.
The next step is to run these unmatched facilities through the standardization and fuzzy matching pipeline to recover any remaining valid matches.
Step 5: Process and Prepare Unmatched Data for Fuzzy Matching
After stratified geographic matching, a smaller set of facilities remain unresolved. These are typically due to misassigned administrative units, newly opened or closed facilities, or name variations too large for geographic matching to resolve. To address these cases, Step 5 focuses on preparing the unmatched data by standardizing names and handling abbreviations to reduce variation before applying fuzzy matching.
Step 5.1: Standardize health facility names
Now we apply standardization to the unmatched facilities. This step helps resolve matches that failed due to formatting differences:
As outlined in the Best Practices for an Effective Fuzzy Matching Workflow section, we begin by standardizing the health facility name columns. To do this, we first apply common text cleaning operations such as converting to lowercase, trimming whitespace, removing punctuation, collapsing multiple spaces, normalizing accented characters and converting Roman numerals (e.g., II, III) into standard numeric form (e.g., 2, 3). Each name is then split into individual words, sorted alphabetically, and rejoined, ensuring that names with the same words in different orders are treated consistently. These preprocessing steps reduce irrelevant variation and improve the reliability of downstream fuzzy matching.
In this illustrative example, we apply every standardization option so that code is available for all. However, some of these steps may be unnecessary or lead to unanticipated mismatches for a given project. Review carefully which standardization options to include in the workflow. Creating multiple versions with more and fewer standardization steps can help inform the final decision.
Show the code
# create function to standardize hf names
standardize_names <- function(name_vec) {
# validate input
if (!rlang::is_atomic(name_vec)) {
cli::cli_abort("`name_vec` must be an atomic vector.")
}
name_vec |>
# ensure character type
as.character() |>
# convert to lowercase
stringr::str_to_lower() |>
# replace punctuation with space
stringr::str_replace_all("[[:punct:]]", " ") |>
# remove extra spaces and trim
stringr::str_squish() |>
# normalize accents
stringi::stri_trans_general("Latin-ASCII") |>
# normalize all space-like characters
stringi::stri_replace_all_regex("\\p{Zs}+", " ") |>
# convert roman numerals to arabic numerals
stringr::str_replace_all(
c(
" ix\\b" = " 9",
" viii\\b" = " 8",
" vii\\b" = " 7",
" vi\\b" = " 6",
" v\\b" = " 5",
" iv\\b" = " 4",
" iii\\b" = " 3",
" ii\\b" = " 2",
" i\\b" = " 1"
)
) |>
# sort tokens with letters first, numbers last; within groups
# alphabetical
purrr::map_chr(\(.x) {
# split on one-or-more spaces
tokens <- strsplit(.x, " +")[[1]]
# detect pure numeric tokens
is_num <- stringr::str_detect(tokens, "^[0-9]+$")
# order alphabetic first, then numeric; sort within each group
ordered <- c(sort(tokens[!is_num]), sort(tokens[is_num]))
# rejoin
paste(ordered, collapse = " ")
})
}
# set up example with messy formatting
example_word <- factor("Clínica! Rahmâ IV ( New clinic) East")
# print original structure
cat("\nExample before standardization:\n")
str(example_word)
# apply standardization
example_word_st <- standardize_names(example_word)
# print cleaned example
cat("\nExample after standardization:\n")
str(example_word_st)To adapt the code:
- Lines 1–50: To keep the full suite of standardizations, leave the
standardize_namesfunction as-is. Otherwise, remove the standardizations that are not needed.
Show the code
import unicodedata
def standardize_names(series):
"""Standardize health facility name strings for fuzzy matching.
Applies lowercase conversion, punctuation removal, accent normalization,
Roman numeral conversion, and alphabetical token sorting.
"""
roman_map = {
r"\bix\b": "9", r"\bviii\b": "8", r"\bvii\b": "7",
r"\bvi\b": "6", r"\bv\b": "5", r"\biv\b": "4",
r"\biii\b": "3", r"\bii\b": "2", r"\bi\b": "1",
}
def _clean(text):
if pd.isna(text):
return text
text = str(text).lower()
# replace punctuation with space
text = re.sub(r"[^\w\s]", " ", text)
# collapse multiple spaces
text = re.sub(r"\s+", " ", text).strip()
# normalize accents
text = unicodedata.normalize("NFD", text)
text = "".join(c for c in text if unicodedata.category(c) != "Mn")
# convert roman numerals
for pattern, replacement in roman_map.items():
text = re.sub(pattern, replacement, text)
# sort tokens: letters first (alphabetical), numbers last
tokens = text.split()
alpha = sorted(t for t in tokens if not t.isdigit())
numeric = sorted(t for t in tokens if t.isdigit())
return " ".join(alpha + numeric)
return series.apply(_clean)
# set up example with messy formatting
example_word = "Clínica! Rahmâ IV ( New clinic) East"
# print original
print(f"\nExample before standardization:\n{example_word}")
# apply standardization
example_word_st = standardize_names(pd.Series([example_word]))[0]
# print cleaned example
print(f"\nExample after standardization:\n{example_word_st}")To adapt the code:
- Lines 1–42: To keep the full suite of standardizations, leave the
standardize_namesfunction as-is. Otherwise, remove the standardizations that are not needed.
The function worked as expected: it turned the original factor Clínica! Rahmâ IV ( New clinic) East into a clean character string clinic clinica east new rahma 4. This confirms that the standardization steps worked as intended.
- The original name had extra spaces, accents, punctuation, and Roman numerals.
- All text was converted to lowercase.
- Punctuation was replaced with spaces, and multiple spaces were collapsed into one.
- Accents were removed, and Roman numerals were converted to numbers.
- Words were split, sorted alphabetically, and any numbers placed at the end.
We will now apply the standardization function to both the DHIS2 and MFL facility name columns to ensure consistent formatting. This helps ensure that string similarity scores reflect true differences, not artifacts of inconsistent text formatting.
Show the code
# remove duplicates from MFL and format hf col
master_hf_df <- master_hf_df |>
dplyr::distinct(hf_mfl_raw, .keep_all = TRUE) |>
dplyr::mutate(hf_mfl = standardize_names(hf_mfl_raw))
# important: preserve original full dataset before processing unmatched
# (this is our base N)
dhis2_hf_df_original <- dhis2_hf_df |>
dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw))
# build geo-aware stable ID once and attach to originals and unmatched
hf_uid_new_map <- dhis2_hf_df |>
dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
dplyr::mutate(
hf_uid_new = sntutils::vdigest(
paste(
tolower(stringr::str_squish(adm0)),
tolower(stringr::str_squish(adm1)),
tolower(stringr::str_squish(adm2)),
tolower(stringr::str_squish(adm3)),
tolower(stringr::str_squish(hf_dhis2_raw)),
sep = "|"
),
algo = "xxhash32"
)
)
dhis2_hf_df_original <- dhis2_hf_df_original |>
dplyr::left_join(
hf_uid_new_map,
by = c(
"adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
)
)
# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched <- dhis2_hf_to_process |>
dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw)) |>
dplyr::left_join(
hf_uid_new_map |>
dplyr::mutate(hf_dhis2_raw = toupper(hf_dhis2_raw)),
by = c(
"adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
)
)
knitr::kable(
# check if it worked
dhis2_hf_unmatched |>
dplyr::distinct(hf_dhis2_raw, hf_dhis2) |>
dplyr::slice_head(n = 10)
)To adapt the code:
- Lines 2–4: Update column names to match the data (
hf_mfl_raw,hf_dhis2_raw,hf_mfl,hf_dhis2). - Line 3: Modify the
distinct()function to use different duplicate removal criteria if needed. - Lines 6–9: Create
dhis2_hf_df_originalto preserve all original facilities (this is our base N that will be used in Step 9.2). - Lines 37–45: Create
dhis2_hf_unmatchedfor fuzzy matching steps - subsequent steps should use this instead ofdhis2_hf_df.
Show the code
# remove duplicates from MFL and create standardized column
master_hf_df = (
master_hf_df
.drop_duplicates(subset="hf_mfl_raw")
.assign(hf_mfl=lambda d: standardize_names(d["hf_mfl_raw"]))
)
# preserve original full dataset before processing unmatched
dhis2_hf_df_original = dhis2_hf_df.assign(
hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"])
)
# build geo-aware stable ID map
hf_uid_new_map = (
dhis2_hf_df[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"]]
.drop_duplicates()
.assign(
hf_uid_new=lambda d: vdigest(
d["adm0"].str.lower().str.strip() + "|" +
d["adm1"].str.lower().str.strip() + "|" +
d["adm2"].str.lower().str.strip() + "|" +
d["adm3"].str.lower().str.strip() + "|" +
d["hf_dhis2_raw"].str.lower().str.strip()
)
)
)
dhis2_hf_df_original = dhis2_hf_df_original.merge(
hf_uid_new_map,
on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
how="left",
)
# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched = (
dhis2_hf_to_process
.assign(hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"]))
.merge(
hf_uid_new_map.assign(
hf_dhis2_raw=lambda d: d["hf_dhis2_raw"].str.upper()
),
on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
how="left",
)
)
# check if standardization worked
dhis2_hf_unmatched[["hf_dhis2_raw", "hf_dhis2"]].drop_duplicates().head(10)To adapt the code:
- Lines 2–4: Update column names to match the data (
hf_mfl_raw,hf_dhis2_raw,hf_mfl,hf_dhis2). - Line 3: Modify
drop_duplicates()to use different duplicate removal criteria if needed. - Lines 7–9: Create
dhis2_hf_df_originalto preserve all original facilities. - Lines 37–44: Create
dhis2_hf_unmatchedfor fuzzy matching steps; subsequent steps should use this instead ofdhis2_hf_df.
We keep the original columns (hf_mfl_raw, hf_dhis2_raw) unchanged and create standardized versions (hf_mfl, hf_dhis2) for matching. This preserves the raw names for later review or manual matching.
Step 5.2: Handle abbreviations in health facility names
Health facility names in both DHIS2 and the MFL may contain abbreviations such as CHC, PHU, or MCHP. If these abbreviations are used inconsistently across or within datasets, standardization can improve matching accuracy.
To improve consistency and match quality, we begin by identifying frequently used abbreviations across both datasets by analyzing word patterns. Rather than expanding these abbreviations into their full forms, which can introduce unnecessary verbosity and noise, we have chosen to apply a predefined dictionary to standardize both datasets to use abbreviations. This approach simplifies name structures, shortens string lengths, and reduces the influence of repetitive generic terms on similarity scores.
We count the number of times each abbreviation is appearing in the datasets (freq) and define a “real” abbreviation as one that appears more than 2 times.
Show the code
abbrev_dictionary <-
dplyr::bind_rows(
dplyr::select(dhis2_hf_unmatched, hf = hf_dhis2_raw),
dplyr::select( master_hf_df, hf = hf_mfl_raw)) |>
# split into words
tidyr::separate_rows(hf, sep = " ") |>
# remove blanks
dplyr::filter(hf != "") |>
# detect 2-4 uppercase letter patterns
dplyr::filter(stringr::str_detect(hf, "^[A-Z]{2,4}$")) |>
# count frequencies
dplyr::count(hf, sort = TRUE) |>
# rename for clarity
dplyr::rename(word = hf, freq = n) |>
dplyr::filter(freq > 2) |>
as.data.frame()
# check output
abbrev_dictionaryTo adapt the code:
- Line 10: Modify the pattern
{2,4}to change the character length range for detected abbreviations (currently 2-4 characters). - Line 15: Change the frequency threshold (
freq > 2) to suit the project context: lower it to capture more potential abbreviations or raise it to focus on the most common ones. - Handle all-caps facility names: If facility names are entirely in uppercase, this pattern detection may not work as expected. Consider pre-processing to convert names to title case or adjusting the detection pattern.
Show the code
# combine DHIS2 and MFL names for abbreviation detection
all_names = pd.concat([
dhis2_hf_unmatched[["hf_dhis2_raw"]].rename(columns={"hf_dhis2_raw": "hf"}),
master_hf_df[["hf_mfl_raw"]].rename(columns={"hf_mfl_raw": "hf"}),
])
# split into words and filter to 2-4 uppercase letter tokens
abbrev_dictionary = (
all_names["hf"]
.dropna()
.str.split(expand=True)
.stack()
.reset_index(drop=True)
.rename("word")
.loc[lambda s: s.str.match(r"^[A-Z]{2,4}$")]
.value_counts()
.reset_index()
.rename(columns={"count": "freq"})
.loc[lambda d: d["freq"] > 2]
)
# check output
abbrev_dictionaryTo adapt the code:
- Line 15: Modify the pattern
{2,4}to change the character length range for detected abbreviations. - Line 19: Change the frequency threshold (
freq > 2) to suit the project context.
Now that we have identified common abbreviations (e.g., MCHP, CHP, CHC), we use a shared dictionary to standardize health facility names across both datasets. Rather than expanding these terms, we ensure they are used consistently as abbreviations in both DHIS2 and the MFL. Even within the MFL, some names may use abbreviations while others use full terms. Standardization helps align these differences and increases match accuracy.
This abbreviation dictionary approach has limitations when dealing with misspelled or variant terms. For example:
- If “Community Health Center” is misspelled as “Comunity Health Center”, the dictionary lookup will not match
- Regional variations like “Health Centre” vs “Health Center” may not be captured
- Typos in abbreviations themselves (e.g.,
CHCvsCHS) will not be standardized
Mitigation strategies:
- Use fuzzy matching algorithms in Step 4 to handle remaining mismatches after abbreviation standardization
- Consider adding common misspellings to the dictionary if they appear frequently
- Implement additional text cleaning steps (spell checking, variant standardization) before applying the abbreviation dictionary
Show the code
# define abbreviation dictionary (all lowercase, because we use the
# standardize column)
abbrev_dict <- c(
"maternal child health post" = "mchp",
"community health post" = "chp",
"community health center" = "chc",
"urban maternal clinic" = "umi",
"expanded programme on immunization" = "epi"
)
# apply replacements to MFL names
master_hf_df <- master_hf_df |>
dplyr::mutate(
hf_mfl = stringr::str_replace_all(hf_mfl, abbrev_dict)
)
# apply replacements to DHIS2 names
dhis2_hf_unmatched <- dhis2_hf_unmatched |>
dplyr::mutate(
hf_dhis2 = stringr::str_replace_all(hf_dhis2, abbrev_dict)
)
# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched |>
dplyr::filter(
stringr::str_detect(
hf_dhis2_raw,
paste0(
"(?i)Community Health Center|Maternal Child Health Post|",
"Community Health Post"
)
) |
stringr::str_detect(hf_dhis2_raw, "CHC|MCHP|CHP")
) |>
dplyr::select(hf_dhis2_raw, hf_dhis2) |>
head()To adapt the code:
- Lines 3–9: Modify the
abbrev_dictif the data includes other frequent abbreviations. - Lines 14, 20: Use the standardized lowercase columns (
hf_mflandhf_dhis2) as created in Step 5.1. Only apply this step after confirming abbreviations through frequency checks (see Step 5.2).
Show the code
# define abbreviation dictionary (all lowercase, using standardized column)
abbrev_dict = {
"maternal child health post": "mchp",
"community health post": "chp",
"community health center": "chc",
"urban maternal clinic": "umi",
"expanded programme on immunization": "epi",
}
# apply replacements to MFL names
for long_form, short_form in abbrev_dict.items():
master_hf_df["hf_mfl"] = master_hf_df["hf_mfl"].str.replace(
long_form, short_form, regex=False
)
# apply replacements to DHIS2 names
for long_form, short_form in abbrev_dict.items():
dhis2_hf_unmatched["hf_dhis2"] = dhis2_hf_unmatched["hf_dhis2"].str.replace(
long_form, short_form, regex=False
)
# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched.loc[
dhis2_hf_unmatched["hf_dhis2_raw"].str.contains(
r"(?i)Community Health Center|Maternal Child Health Post|Community Health Post|CHC|MCHP|CHP",
regex=True,
na=False,
),
["hf_dhis2_raw", "hf_dhis2"],
].drop_duplicates().head()To adapt the code:
- Lines 3–9: Modify
abbrev_dictif the data includes other frequent abbreviations. - Lines 13, 20: Use the standardized lowercase columns (
hf_mflandhf_dhis2) as created in Step 5.1.
Step 6: Perform Fuzzy Matching on Remaining Unmatched
We now apply fuzzy matching algorithms to the facilities that remain unmatched after stratified geographic matching in Step 4. These represent the most challenging cases where geographic alignment or name similarity could not establish a clear match.
This step focuses on facilities that could not be matched through: - Step 4: Interactive stratified geographic matching - Step 5: Name standardization and exact matching
These remaining facilities often have geographic misalignments or extreme name variations that require more flexible matching approaches.
In our example, we aim to find an MFL match for each remaining unmatched DHIS2 facility. To match in the other direction (identifying a DHIS2 match for each MFL facility), or in both directions, modify and/or repeat the code accordingly.
Step 6.1: Identify matched and unmatched facilities
We first identify exact matches between DHIS2 and MFL facility names, both before and after text standardization. These exact matches are set aside while we focus on unmatched facilities for fuzzy matching.
We use many-to-one matching, meaning multiple DHIS2 records may link to the same MFL facility if their names are sufficiently similar. This approach helps assess linkage reliability and flags potential naming inconsistencies. It is also possible to see many-to-many detections, for example when the same health facility appears in DHIS2 under different spellings or formats and matches against multiple candidate facilities in the MFL. These cases require closer review to confirm the correct linkage.
The output from this step is unmatched DHIS2 names that will be processed through fuzzy matching algorithms.
Show the code
# flag for one-to-one enforcement
enforce_one_to_one <- FALSE
# exact matches using raw names
matched_dhis2_raw <- dhis2_hf_unmatched |>
dplyr::select(
adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_uid_new
) |>
dplyr::inner_join(
master_hf_df |>
dplyr::select(hf_mfl_raw),
by = c("hf_dhis2_raw" = "hf_mfl_raw")
)
# exact matches using standardized names
matched_dhis2 <- dhis2_hf_unmatched |>
dplyr::select(
adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_dhis2, hf_uid_new
) |>
dplyr::inner_join(
master_hf_df |>
dplyr::select(hf_mfl_raw, hf_mfl),
by = c("hf_dhis2" = "hf_mfl"),
keep = TRUE
) |>
# exclude facilities already matched via raw names
dplyr::anti_join(matched_dhis2_raw, by = c("hf_dhis2_raw")) |>
# label exact matches
dplyr::mutate(
final_method = paste0(
"Matched Without Fuzzy Matching (standardization)"
),
score = 100
)
# only include facilities that were actually matched via geographic
# standardization
# these should be facilities that had geographic corrections applied
matched_dhis2_prepgeoname <-
dhis2_df_cleaned |>
dplyr::anti_join(
dhis2_hf_unmatched,
by = c("adm0", "adm1", "adm2", "adm3", "hf" = "hf_dhis2_raw")
) |>
dplyr::left_join(
master_hf_df |>
dplyr::mutate(hf = toupper(hf)) |>
dplyr::select(adm0, adm1, adm2, adm3, hf, hf_mfl_raw, hf_mfl),
by = c("adm0", "adm1", "adm2", "adm3", "hf")
) |>
# attach hf_uid_new from dhis2_map (by admin + hf)
dplyr::left_join(
dhis2_map |> dplyr::select(adm0, adm1, adm2, adm3, hf, hf_uid_new),
by = c("adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw" = "hf")
) |>
# label matches
dplyr::mutate(
final_method = "Interactive Stratified Geographic Matching",
score = 100
)
# compute unmatched after standardization
unmatched_dhis2 <- dhis2_hf_unmatched |>
dplyr::select(hf_dhis2) |>
dplyr::anti_join(
master_hf_df |>
dplyr::select(hf_mfl),
by = c("hf_dhis2" = "hf_mfl")
) |>
dplyr::distinct(hf_dhis2)
# collect already matched mfl to enforce one-to-one in candidate pool
used_mfl_stand <- matched_dhis2 |>
dplyr::pull(hf_mfl) |>
unique()
use_mfl_prepgeoname <- matched_dhis2_prepgeoname |>
dplyr::pull(hf_mfl) |>
unique()
used_mfl <- c(used_mfl_stand, use_mfl_prepgeoname)
# build candidate MFL pool for fuzzy step
candidate_mfl_df <- master_hf_df |>
dplyr::select(hf_mfl)
# if enforcing one-to-one, drop MFLs already used by exact matches
if (enforce_one_to_one) {
candidate_mfl_df <- candidate_mfl_df |>
dplyr::filter(!hf_mfl %in% used_mfl)
}
# summary counts
total_dhis2_hf <- dplyr::n_distinct(dhis2_hf_unmatched$hf_dhis2_raw)
raw_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2_raw$hf_dhis2_raw)
raw_unmatch_dhis2_hf <- total_dhis2_hf - raw_match_dhis2_hf
standardized_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2$hf_dhis2)
standardized_unmatch_dhis2_hf <- total_dhis2_hf - standardized_match_dhis2_hf
total_mfl_hf <- dplyr::n_distinct(master_hf_df$hf_mfl)
candidate_mfl_count <- dplyr::n_distinct(candidate_mfl_df$hf_mfl)
# display summary
cli::cli_h2("Summary of Matching Status")
cli::cli_alert_info(
"Total DHIS2 facilities: {format(total_dhis2_hf, big.mark = ',')}"
)
cli::cli_alert_success(
paste0(
"Matched after standardization: ",
"{format(standardized_match_dhis2_hf, big.mark = ',')}"
)
)
cli::cli_alert_danger(
paste0(
"Unmatched using raw names: ",
"{format(raw_unmatch_dhis2_hf, big.mark = ',')}"
)
)
cli::cli_alert_danger(
paste0(
"Unmatched after standardization: ",
"{format(standardized_unmatch_dhis2_hf, big.mark = ',')}"
)
)
cli::cli_alert_info(
paste0(
"MFL facilities in candidate pool for matching: ",
"{format(candidate_mfl_count, big.mark = ',')} of ",
"{format(total_mfl_hf, big.mark = ',')}"
)
)To adapt the code:
- Line 2: Set
enforce_one_to_onetoTRUEto prevent reusing an hf_mfl already matched exactly. Set toFALSEto allow one-to-many. - Lines 15, 18: Replace column names:
hf_dhis2andhf_mfl→ standardized name fields;hf_dhis2_rawandhf_mfl_raw→ raw name fields. - Lines 14, 17: Replace dataset names:
dhis2_hf_unmatched→ the unmatched DHIS2 dataset;master_hf_df→ the MFL dataset. - Lines 61–68:
unmatched_dhis2contains DHIS2 facilities not matched to the MFL and will be passed to the fuzzy matching step. - Lines 88–91: One-to-many filter. Keep as-is to drop already matched hf_mfl when
enforce_one_to_one == FALSE. Remove these lines to always allow reuse.
Show the code
# flag for one-to-one enforcement
enforce_one_to_one = False
# exact matches using raw names
matched_dhis2_raw = (
dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
.merge(
master_hf_df[["hf_mfl_raw"]],
left_on="hf_dhis2_raw",
right_on="hf_mfl_raw",
how="inner",
)
)
# exact matches using standardized names
matched_dhis2 = (
dhis2_hf_unmatched[
["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_dhis2", "hf_uid_new"]
]
.merge(
master_hf_df[["hf_mfl_raw", "hf_mfl"]],
left_on="hf_dhis2",
right_on="hf_mfl",
how="inner",
)
)
# exclude facilities already matched via raw names
matched_dhis2 = anti_join(matched_dhis2, matched_dhis2_raw, on=["hf_dhis2_raw"])
matched_dhis2 = matched_dhis2.assign(
final_method="Matched Without Fuzzy Matching (standardization)",
score=100,
)
# compute unmatched after standardization
# cross-named anti-join: dhis2_hf_unmatched.hf_dhis2 vs master_hf_df.hf_mfl
unmatched_dhis2 = (
dhis2_hf_unmatched.loc[
~dhis2_hf_unmatched["hf_dhis2"].isin(master_hf_df["hf_mfl"]),
["hf_dhis2"],
]
.drop_duplicates(subset="hf_dhis2")
)
used_mfl = list(matched_dhis2["hf_mfl"].unique())
# build candidate MFL pool for fuzzy step
candidate_mfl_df = master_hf_df[["hf_mfl"]].copy()
if enforce_one_to_one:
candidate_mfl_df = candidate_mfl_df.loc[~candidate_mfl_df["hf_mfl"].isin(used_mfl)]
# summary counts
total_dhis2_hf = dhis2_hf_unmatched["hf_dhis2_raw"].nunique()
raw_match_dhis2_hf = matched_dhis2_raw["hf_dhis2_raw"].nunique()
raw_unmatch_dhis2_hf = total_dhis2_hf - raw_match_dhis2_hf
standardized_match_dhis2_hf = matched_dhis2["hf_dhis2"].nunique()
standardized_unmatch_dhis2_hf = total_dhis2_hf - standardized_match_dhis2_hf
total_mfl_hf = master_hf_df["hf_mfl"].nunique()
candidate_mfl_count = candidate_mfl_df["hf_mfl"].nunique()
cli_header("Summary of Matching Status")
cli_info(f"Total DHIS2 facilities: {total_dhis2_hf:,}")
cli_success(f"Matched after standardization: {standardized_match_dhis2_hf:,}")
cli_danger(f"Unmatched using raw names: {raw_unmatch_dhis2_hf:,}")
cli_danger(f"Unmatched after standardization: {standardized_unmatch_dhis2_hf:,}")
cli_info(
f"MFL facilities in candidate pool for matching: "
f"{candidate_mfl_count:,} of {total_mfl_hf:,}"
)To adapt the code:
- Line 2: Set
enforce_one_to_onetoTrueto prevent reusing an hf_mfl already matched exactly. Set toFalseto allow one-to-many. - Lines 15, 18: Replace column names:
hf_dhis2andhf_mfl→ standardized name fields;hf_dhis2_rawandhf_mfl_raw→ raw name fields. - Lines 14, 17: Replace dataset names:
dhis2_hf_unmatched→ the unmatched DHIS2 dataset;master_hf_df→ the MFL dataset. - Lines 37–39:
unmatched_dhis2contains DHIS2 facilities not matched to the MFL and will be passed to the fuzzy matching step.
Our standardization process resolved some of the initially unmatched DHIS2 facilities as shown in the output above. This shows the value of text cleaning before applying fuzzy matching. The remaining unmatched DHIS2 facilities are the focus of our fuzzy matching efforts in the next steps.
The output above shows the total DHIS2 facilities that were unmatched at the raw-name stage, and the MFL facilities that remain in the candidate pool for fuzzy matching after enforcing one-to-many alignment with DHIS2 matches.
We will use unmatched_dhis2 as the input for fuzzy matching. These are the standardized DHIS2 facility names that could not be matched through exact string comparison. The matched_dhis2 dataset contains facility names that matched exactly after standardization. Once fuzzy matching is complete, we will append these exact matches to the final results to produce a complete, joined dataset.
Step 6.1: Create the matching grid
Once we identify the unmatched_dhis2 names, the next step is to create a matching grid. This step sets up the full set of possible name pairings that we will later evaluate using fuzzy string comparison. The purpose of this grid is to define the set of candidate pairs that we want to compare in the next step, rather than checking every name against every other name in the full dataset, which would be computationally expensive and potentially introduce false matches.
Show the code
# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities.
# create simple candidate match pool
candidate_match_pool <- tidyr::crossing(
unmatched_dhis2,
candidate_mfl_df
)
# create match grid with dhis2 and mfl names
match_grid <- candidate_match_pool
# preview sample - top 10 matches
match_grid |>
dplyr::slice_head(n = 10)To adapt the code:
- Lines 7–8: Replace
unmatched_dhis2andcandidate_mfl_dfwith the dataset names if different.
Show the code
# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities
candidate_match_pool = unmatched_dhis2.merge(candidate_mfl_df, how="cross")
# create match grid with dhis2 and mfl names
match_grid = candidate_match_pool.copy()
# preview sample - top 10 matches
match_grid.head(10)To adapt the code:
- Lines 2–3: Replace
unmatched_dhis2andcandidate_mfl_dfwith the dataset names if different.
The match_grid is a flat list of candidate name pairs. Each row links one unmatched DHIS2 name to a possible MFL match. This is not a fuzzy match result, but the full set of pairings that will be evaluated in Step 6.3 using the string distance algorithms.
At this stage, the grid includes only facility names (hf_dhis2 and hf_mfl) to keep it lightweight. IDs, coordinates, and other metadata are excluded for now and will be reattached after scoring. If identical names appear across locations, the grid can be extended with extra fields for disambiguation. The preview shows a random sample of candidate pairs. It may not include obvious matches, but helps confirm that the pairing logic and geographic filters are working. This grid is the input for fuzzy matching, where each row will be scored and ranked to identify likely matches.
Step 6.2: Compute similarity scores
We compute name similarity between each unmatched DHIS2 facility and all candidates in the master list using multiple algorithms: Jaro-Winkler, Levenshtein, Qgram, and LCS. Each algorithm captures different types of variation, such as typos, truncation, or word reordering, that may be found in real-world datasets.
This step outputs individual similarity scores from each method, allowing us to explore, compare, or prioritize specific algorithms depending on the matching goals.
Show the code
# calculate fuzzy scores
match_grid <- match_grid |>
dplyr::mutate(
len_max = pmax(nchar(hf_dhis2), nchar(hf_mfl)),
score_jw = 1 - stringdist::stringdist(hf_dhis2, hf_mfl, method = "jw"),
score_lv = 1 -
stringdist::stringdist(hf_dhis2, hf_mfl, method = "lv") / len_max,
score_qg = 1 -
stringdist::stringdist(hf_dhis2, hf_mfl, method = "qgram") / len_max,
score_lcs = 1 -
stringdist::stringdist(hf_dhis2, hf_mfl, method = "lcs") / len_max
) |>
dplyr::mutate(
dplyr::across(
.cols = dplyr::starts_with("score_"),
.fns = ~ ifelse(is.nan(.x) | .x < 0, 0, .x)
),
dplyr::across(
.cols = dplyr::contains("score_"),
.fns = ~ .x * 100
)
)To adapt the code:
- Lines 4–10: Swap in or out any similarity algorithm based on the matching needs. See
stringdistmanual, p.23 for available methods.
Show the code
from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz
def _score_lv(a, b):
"""Normalized Levenshtein similarity (0–100)."""
len_max = max(len(a), len(b))
if len_max == 0:
return 100.0
return max(0.0, (1 - Levenshtein.distance(a, b) / len_max) * 100)
def _score_jw(a, b):
"""Jaro-Winkler similarity (0–100)."""
return JaroWinkler.similarity(a, b) * 100
def _score_qg(a, b):
"""Q-gram similarity (0–100) using token sort ratio."""
return float(fuzz.token_sort_ratio(a, b))
def _score_lcs(a, b):
"""LCS-based similarity (0–100) using partial ratio."""
return float(fuzz.partial_ratio(a, b))
# calculate fuzzy scores
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]
match_grid = match_grid.assign(
score_jw=lambda d: d.apply(
lambda r: _score_jw(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
score_lv=lambda d: d.apply(
lambda r: _score_lv(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
score_qg=lambda d: d.apply(
lambda r: _score_qg(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
score_lcs=lambda d: d.apply(
lambda r: _score_lcs(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
)
# clip negatives and NaN to zero
for col in score_cols:
match_grid[col] = match_grid[col].clip(lower=0).fillna(0)To adapt the code:
- Lines 14–32: Swap in or out any similarity function based on the matching needs.
rapidfuzzprovides Levenshtein, Jaro-Winkler, token sort ratio, and partial ratio out of the box.
Above is a sample of ten DHIS2–MFL name pairs from the match grid. The first five are strong matches with near-perfect scores across all algorithms, reflecting nearly identical string matches. The last five are weak matches with low or zero similarity, often due to unrelated names or different facility types. Mid-range cases like kpetema mchp vs clinic gbonkorlenken show partial overlap but remain well below matching thresholds. These patterns confirm that the scoring works as intended and set the stage for building composite similarity scores in the next step, which help reduce dependence on any single metric.
Step 6.3: Create composite scores
To improve fuzzy match reliability, we generate composite similarity metrics that combine multiple string distance algorithms. This helps minimize over-reliance on any single metric and supports a more reliable match decision. However, if we have a preferred single metric from the previous step, we may skip generating a composite score and proceed directly to Step 4.5.
We generate composite metrics in two ways:
Composite Similarity Score (
composite_score): We calculate a simple average across the six similarity algorithms from the previous step (Jaro-Winkler, Levenshtein, Qgram and LCS) to create a composite_score. This value reflects the overall string similarity between DHIS2 and MFL names.Composite Rank Score (
rank_avg): For each DHIS2 facility, we rank the match quality of each candidate MFL name, with a ranking done for each of the six similarity algorithms. Then we average these ranks into arank_avg. A low rank_avg (e.g., 1) indicates the same MFL name consistently scores as an excellent match across methods, signaling strong consensus and high match confidence. This ensemble ranking helps balance the strengths of different algorithms.
Show the code
# define similarity score columns
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")
# compute composite_score dynamically using score_cols
match_grid <- match_grid |>
dplyr::mutate(
composite_score = rowMeans(
dplyr::across(dplyr::all_of(score_cols))
)
)
# compute average rank across all similarity methods
ranked_grid <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::mutate(
dplyr::across(
dplyr::all_of(score_cols),
~ dplyr::min_rank(dplyr::desc(.)),
.names = "rank_{.col}"
)
) |>
dplyr::ungroup() |>
dplyr::mutate(
rank_avg = rowSums(dplyr::across(
dplyr::all_of(paste0("rank_", score_cols))
)),
rank_avg = round(rank_avg / length(score_cols))
)
# preview ranked results to show multiple candidates per DHIS2 facility
# select a few facilities and show their top 3 candidates each
sample_facilities <- c(
"charity clinic kamba of",
"arab clinic shad",
"al arab clinic sheefa"
)
ranked_grid |>
dplyr::filter(hf_dhis2 %in% sample_facilities) |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_min(rank_avg, n = 4, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::select(
hf_dhis2,
hf_mfl,
dplyr::all_of(score_cols),
composite_score,
rank_avg
) |>
dplyr::arrange(hf_dhis2, rank_avg)To adapt the code:
- Line 2: If the similarity algorithms change (e.g., add or remove metrics), update the
score_colsvector to reflect the new set.
Show the code
# define similarity score columns
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]
# compute composite_score (simple average)
match_grid["composite_score"] = match_grid[score_cols].mean(axis=1)
# compute average rank across all similarity methods
for col in score_cols:
match_grid[f"rank_{col}"] = (
match_grid.groupby("hf_dhis2")[col]
.rank(method="min", ascending=False)
)
rank_cols = [f"rank_{c}" for c in score_cols]
match_grid["rank_avg"] = (
match_grid[rank_cols].sum(axis=1) / len(score_cols)
).round()
ranked_grid = match_grid.copy()
# preview ranked results for a few sample facilities
sample_facilities = [
"charity clinic kamba of",
"arab clinic shad",
"al arab clinic sheefa",
]
ranked_grid.loc[ranked_grid["hf_dhis2"].isin(sample_facilities)].groupby(
"hf_dhis2"
).apply(lambda g: g.nsmallest(4, "rank_avg")).reset_index(drop=True)[
["hf_dhis2", "hf_mfl"] + score_cols + ["composite_score", "rank_avg"]
].sort_values(["hf_dhis2", "rank_avg"])To adapt the code:
- Line 2: If the similarity algorithms change, update
score_colsto reflect the new set.
Step 6.4: Extract best match
To finalize matching, we select the best candidate for each DHIS2 facility. This can be based on any one similarity algorithm or use a blended approach such as a composite score or an average rank across all methods.
The output includes one top match per DHIS2 name for each method, allowing side-by-side comparison to evaluate performance.
Show the code
# for each fuzzy matching method, we select the top-ranked MFL name
# per DHIS2 hf.
best_jw <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_jw, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Jaro-Winkler") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_jw, method
)
best_lv <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_lv, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Levenshtein") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_lv, method
)
best_qg <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_qg, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Qgram") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_qg, method
)
best_lcs <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_lcs, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "LCS") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_lcs, method
)
best_comp<- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(composite_score, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Composite-Score") |>
dplyr::select(
hf_dhis2, hf_mfl, score = composite_score, method
)
best_ranked_match <- ranked_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_min(rank_avg, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Rank-Ensemble") |>
dplyr::select(
hf_dhis2, hf_mfl, score = rank_avg, method
)
# combine top matches from all methods for comparison across approaches
all_best <- dplyr::bind_rows(
best_jw, best_lv, best_qg, best_lcs,
best_comp, best_ranked_match)
# preview best matches to show extraction results
all_best |>
dplyr::slice_head(n = 10) |>
dplyr::select(hf_dhis2, hf_mfl, score, method)To adapt the code:
- Lines 3–56: For each method block, change the score column and
methodlabel to match the algorithm used. - Lines 3–56: Remove blocks for methods that were not computed.
- Lines 58–60: Include only the
best_*objects kept inbind_rows().
Show the code
# for each fuzzy matching method, select the top-ranked MFL name per DHIS2 hf
best_jw = (
match_grid.sort_values("score_jw", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_jw"]]
.rename(columns={"score_jw": "score"})
.assign(method="Jaro-Winkler")
)
best_lv = (
match_grid.sort_values("score_lv", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_lv"]]
.rename(columns={"score_lv": "score"})
.assign(method="Levenshtein")
)
best_qg = (
match_grid.sort_values("score_qg", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_qg"]]
.rename(columns={"score_qg": "score"})
.assign(method="Qgram")
)
best_lcs = (
match_grid.sort_values("score_lcs", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_lcs"]]
.rename(columns={"score_lcs": "score"})
.assign(method="LCS")
)
best_comp = (
match_grid.sort_values("composite_score", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "composite_score"]]
.rename(columns={"composite_score": "score"})
.assign(method="Composite-Score")
)
best_ranked_match = (
ranked_grid.sort_values("rank_avg")
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "rank_avg"]]
.rename(columns={"rank_avg": "score"})
.assign(method="Rank-Ensemble")
)
# combine top matches from all methods for comparison across approaches
all_best = pd.concat(
[best_jw, best_lv, best_qg, best_lcs, best_comp, best_ranked_match],
ignore_index=True,
)
# preview best matches
all_best.head(10)[["hf_dhis2", "hf_mfl", "score", "method"]]To adapt the code:
- Lines 3–47: For each method block, change the score column and
methodlabel to match the algorithm used. - Lines 3–47: Remove blocks for methods that were not computed.
- Lines 50–55: Include only the
best_*objects inpd.concat().
Step 7: Evaluate Fuzzy Match Quality
High similarity scores can sometimes mask poor-quality matches, particularly when facility names differ in structure. To ensure we’re not misled by high numerical scores alone, we combine similarity metrics with diagnostic checks that examine how names align at the token level.
Step 7.1: Visualize score distributions across methods
Before getting into structural evaluations, it’s useful to inspect how the distribution of match scores varies across algorithms. This helps assess whether some methods consistently assign higher scores, and whether their matches tend to cluster tightly or are more dispersed.
This is particularly important when applying score thresholds, or when we want to understand how lenient or strict a method is in assigning high-confidence scores.
Show the code
# plot score dist across methods
score_dist_plot <- all_best |>
dplyr::filter(method != "Rank-Ensemble") |>
ggplot2::ggplot(ggplot2::aes(x = score)) +
ggplot2::geom_density(fill = "steelblue", alpha = 0.6) +
ggplot2::facet_wrap(~method) +
ggplot2::geom_density(color = "steelblue", linewidth = 1) +
ggplot2::labs(
title = "Distribution of Fuzzy Matching Scores",
x = "\nMatch Score (%)",
y = "Density\n"
) +
ggplot2::theme_minimal(base_size = 14)
# save plot
ggplot2::ggsave(
plot = score_dist_plot,
filename = here::here("03_output/3a_figures/u5mr_sle_adm2.png"),
width = 12,
height = 9,
dpi = 300
)To adapt the code:
- Do not modify anything in this section.
Show the code
import matplotlib.pyplot as plt
# plot score distribution across methods (exclude rank-based method)
methods_for_plot = all_best.loc[all_best["method"] != "Rank-Ensemble"]
method_names = sorted(methods_for_plot["method"].unique())
fig, axes = plt.subplots(1, len(method_names), figsize=(12, 6), sharey=True)
for ax, name in zip(axes, method_names):
grp = methods_for_plot.loc[methods_for_plot["method"] == name, "score"].dropna()
grp.plot.kde(ax=ax, color="steelblue", linewidth=1)
ax.fill_between(
ax.lines[0].get_xdata(),
ax.lines[0].get_ydata(),
alpha=0.6,
color="steelblue",
)
ax.set_title(name, fontsize=11)
ax.set_xlabel("\nMatch Score (%)")
if ax == axes[0]:
ax.set_ylabel("Density\n")
fig.suptitle("Distribution of Fuzzy Matching Scores", fontsize=14)
plt.tight_layout()
# save plot
output_dir = Path(here("03_output/3a_figures"))
output_dir.mkdir(parents=True, exist_ok=True)
fig.savefig(output_dir / "fuzzy_score_distributions.png", dpi=300, bbox_inches="tight")To adapt the code:
- Do not modify anything in this section.
Looking at the plots, a tight, high peak near 100 means a method produces many strong matches; a broader or flatter distribution may indicate more uncertainty; and bimodal patterns may suggest a mix of clearly good and ambiguous matches. These patterns help guide our matching strategy: for our final selection in Step 6, we will use an 85% threshold as a starting point and build a fallback system that leans on the most reliable methods first.
Step 7.2: Define match quality diagnostic function
To evaluate the structural quality of a match, we define a simple diagnostic function that compares two names at the token level. It splits each name into individual words (which we will call tokens, since not everything in the name is a word), then does four things:
- checks whether the first words align (prefix match)
- checks whether the last words align (suffix match)
- calculates the difference in word count (token difference)
- calculates the difference in character length (character difference) These simple heuristics can catch mismatches that score high but are structurally off.
assess_match_quality <- function(name1, name2) {
purrr::map2_dfr(name1, name2, function(a, b) {
tokens1 <- strsplit(a, "\\s+")[[1]]
tokens2 <- strsplit(b, "\\s+")[[1]]
tibble::tibble(
prefix_match = tolower(tokens1[1]) == tolower(tokens2[1]),
suffix_match = tolower(tail(tokens1, 1)) == tolower(tail(tokens2, 1)),
token_diff = abs(length(tokens1) - length(tokens2)),
char_diff = abs(nchar(a) - nchar(b))
)
})
}
assess_match_quality("Makeni Govt Hospital", "Makeni Government Hospital")To adapt the code:
- Do not modify anything in this section.
def assess_match_quality(names1, names2):
"""Compute structural quality diagnostics for name pairs.
Returns a DataFrame with prefix_match, suffix_match,
token_diff, and char_diff for each name pair.
"""
rows = []
for a, b in zip(names1, names2):
tokens1 = str(a).split()
tokens2 = str(b).split()
rows.append({
"prefix_match": tokens1[0].lower() == tokens2[0].lower()
if tokens1 and tokens2 else False,
"suffix_match": tokens1[-1].lower() == tokens2[-1].lower()
if tokens1 and tokens2 else False,
"token_diff": abs(len(tokens1) - len(tokens2)),
"char_diff": abs(len(str(a)) - len(str(b))),
})
return pd.DataFrame(rows)
assess_match_quality(["Makeni Govt Hospital"], ["Makeni Government Hospital"])To adapt the code:
- Do not modify anything in this section.
In this example, using our assess_match_quality function, the first and last words match, so both prefix_match and suffix_match are TRUE. There is no difference in the number of tokens (token_diff = 0), but the second name is six characters longer (char_diff = 6) due to “Govt” vs “Government”.
These diagnostics help flag cases where a fuzzy score may be high but important name elements differ, such as CHC vs MCHP, Clinic vs Hospital, or Primary vs Secondary, which can signal a mismatch worth reviewing.
Step 7.3: Evaluate quality of top fuzzy matches
We now run this diagnostic on our best matches. The goal is to highlight cases where, despite high similarity scores, names may differ significantly in format or structure: for example, mismatched prefixes or suffixes. This step supports manual review or adds a layer of filtering to improve final match accuracy.
We also combine the diagnostics into a structure_score (0–100), giving 30% weight to token difference, 20% to character difference, and 25% each to prefix and suffix matches. This provides a quick summary of overall structural alignment.
Show the code
# assess structural match quality
diagnostics_df <- dplyr::bind_cols(
all_best,
assess_match_quality(all_best$hf_dhis2, all_best$hf_mfl)
)
# compare method diagnostics
summary_stats <- diagnostics_df |>
dplyr::group_by(method) |>
dplyr::summarise(
score = mean(score),
avg_token_diff = mean(token_diff) |> round(2),
avg_char_diff = mean(char_diff) |> round(2),
pct_prefix_match = (mean(prefix_match) * 100) |> round(2),
pct_suffix_match = (mean(suffix_match) * 100) |> round(2),
total = dplyr::n(),
.groups = "drop"
)
# create a a final overall score
summary_stats <- summary_stats |>
dplyr::mutate(
# rescale negative of average token difference (smaller is better)
token_score = scales::rescale(-avg_token_diff, to = c(0, 100)),
# rescale negative of average character difference (smaller is better)
char_score = scales::rescale(-avg_char_diff, to = c(0, 100)),
# rescale prefix match percentage (higher is better)
prefix_score = scales::rescale(pct_prefix_match, to = c(0, 100)),
# rescale suffix match percentage (higher is better)
suffix_score = scales::rescale(pct_suffix_match, to = c(0, 100)),
# combine all four metrics into a weighted structural quality score
structure_score = round(
0.3 * token_score + # emphasize fewer token differences
0.2 * char_score + # moderate weight on character similarity
0.25 * prefix_score + # give weight to matching initial words
0.25 * suffix_score, # give equal weight to matching final words
1
)
) |>
dplyr::arrange(desc(structure_score)) |>
# assign rank based on descending structure score
dplyr::mutate(rank = dplyr::row_number()) |>
dplyr::select(
method, avg_token_diff, avg_char_diff,
pct_prefix_match, pct_suffix_match,
total, structure_score, rank
)
# check results
summary_stats |>
dplyr::select(
Method = method,
`Avg. Token Difference` = avg_token_diff,
`Avg. Character Difference` = avg_char_diff,
`% Prefix Match` = pct_prefix_match,
`% Suffix Match` = pct_suffix_match,
`Structural Score` = structure_score,
Rank = rank
) |> as.data.frame()To adapt the code:
- Lines 34–37: Adjust the weights used to calculate the
structure_scoreif the matching context values certain components (e.g., prefix matches or token length) more than others.
Show the code
def _rescale(series, new_min, new_max):
"""Rescale a numeric series to [new_min, new_max]."""
s_min, s_max = series.min(), series.max()
if s_max == s_min:
return pd.Series([new_min] * len(series), index=series.index)
return new_min + (series - s_min) / (s_max - s_min) * (new_max - new_min)
# assess structural match quality
diagnostics_df = pd.concat(
[all_best.reset_index(drop=True),
assess_match_quality(all_best["hf_dhis2"], all_best["hf_mfl"])],
axis=1,
)
# compare method diagnostics
summary_stats = (
diagnostics_df
.groupby("method")
.agg(
avg_token_diff=("token_diff", "mean"),
avg_char_diff=("char_diff", "mean"),
pct_prefix_match=("prefix_match", "mean"),
pct_suffix_match=("suffix_match", "mean"),
total=("hf_dhis2", "count"),
)
.reset_index()
.assign(
avg_token_diff=lambda d: d["avg_token_diff"].round(2),
avg_char_diff=lambda d: d["avg_char_diff"].round(2),
pct_prefix_match=lambda d: (d["pct_prefix_match"] * 100).round(2),
pct_suffix_match=lambda d: (d["pct_suffix_match"] * 100).round(2),
)
)
# create a final overall score
summary_stats = summary_stats.assign(
# rescale negative of average token difference (smaller is better)
token_score=lambda d: _rescale(-d["avg_token_diff"], 0, 100),
# rescale negative of average character difference (smaller is better)
char_score=lambda d: _rescale(-d["avg_char_diff"], 0, 100),
# rescale prefix match percentage (higher is better)
prefix_score=lambda d: _rescale(d["pct_prefix_match"], 0, 100),
# rescale suffix match percentage (higher is better)
suffix_score=lambda d: _rescale(d["pct_suffix_match"], 0, 100),
).assign(
# combine all four metrics into a weighted structural quality score
structure_score=lambda d: (
0.3 * d["token_score"] + # emphasize fewer token differences
0.2 * d["char_score"] + # moderate weight on character similarity
0.25 * d["prefix_score"] + # give weight to matching initial words
0.25 * d["suffix_score"] # give equal weight to matching final words
).round(1)
).sort_values("structure_score", ascending=False)
summary_stats["rank"] = range(1, len(summary_stats) + 1)
summary_stats.rename(columns={
"method": "Method",
"avg_token_diff": "Avg. Token Difference",
"avg_char_diff": "Avg. Character Difference",
"pct_prefix_match": "% Prefix Match",
"pct_suffix_match": "% Suffix Match",
"structure_score": "Structural Score",
"rank": "Rank",
})[["Method", "Avg. Token Difference", "Avg. Character Difference",
"% Prefix Match", "% Suffix Match", "Structural Score", "Rank"]]To adapt the code:
- Lines 47–50: Adjust the weights used to calculate the
structure_scoreif the matching context values certain components (e.g., prefix matches or token length) more than others.
These diagnostics show how well each method preserves the structure of matched names in our example health facility name matching exercise. This helps guide which method to trust most and how to build fallback rules when multiple methods are combined. A different project context may result in a different fuzzy matching algorithm being preferred.
Method Performance Summaries
Levenshtein: Lowest token and character differences, with strong prefix and suffix alignment. Highest structural score (84.1, Rank 1) – best overall performer.
Rank-Ensemble: Balanced across metrics, particularly strong on prefix alignment. Score 71.0 (Rank 2) – reliable fallback option.
Composite-Score: Consistent across metrics with solid suffix alignment. Score 68.5 (Rank 3) – flexible, general-purpose option.
Jaro-Winkler: Strongest prefix alignment but weaker on suffixes. Score 67.1 (Rank 4) – useful when beginnings of names align closely.
Qgram: Good at substring similarity, but weaker structurally overall. Score 50.0 (Rank 5) – more suitable for partial overlaps.
LCS Larger token and character gaps despite reasonable suffix matches. Lowest score (43.9, Rank 6) – penalized for structural differences.
Levenshtein was the strongest method, showing low structural differences and good prefix/suffix alignment. Rank-Ensemble and Jaro-Winkler followed as dependable options, with different strengths on prefixes. Composite-Score was steady but slightly weaker. Qgram and LCS were less effective overall, penalized for larger structural gaps despite occasional suffix strengths.
Step 7.4: Weighted thresholding using structural quality and false-positive risk
Now that the structure score is in place, it can be used to calculate additional metrics that strengthen the matching approach. These metrics can feed into a weighted thresholding process, where scores are adjusted by method quality and thresholds are tailored to the false-positive risk of each method. This shifts the threshold setting from a single arbitrary cutoff to an evidence-based process that reflects the strengths and weaknesses of each algorithm, improving precision without sacrificing matching quality.
The outputs from this step serve different purposes depending on the chosen matching strategy in Step 8. When using a single method or simple composite approach (Options 1-2), only the structure score rankings from Step 7.3 are needed to identify the best-performing method. When using weighted composite scoring or the fallback loop approach (Options 3-4), both the method weights and thresholds calculated in this step are required.
This step calculates two key metrics for each fuzzy matching method:
1. Method weights: Proportional to structure scores from Step 7.3, rescaled so all weights sum to 1. Methods with better structural quality get higher influence when combining multiple methods in Options 3-4.
2. Method-specific thresholds: Use inverse scaling based on structural quality. A threshold determines the minimum similarity score required for a match (e.g., threshold 85 means only candidates scoring ≥85 are accepted). High-performing methods get lower thresholds (easier matching) because they are reliable, while low-performing methods get higher thresholds (stricter matching) to prevent false positives. For example, Levenshtein might get threshold 75 (trusted method), while LCS gets 95 (requires high confidence).
Note In the code below, the thresholds are created by rescaling each method’s structure_score (0–100) into a new range defined by the 70th and 95th percentiles of all_best$score. The rev() function flips this mapping so that higher scores are pushed toward the lower end of the range and lower scores toward the higher end. This means stronger methods receive stricter thresholds, while weaker methods are assigned looser thresholds.
# calculate method-specific weights and thresholds
method_threshold <- summary_stats |>
dplyr::mutate(
score = structure_score / 100,
weight = score / sum(score),
threshold = scales::rescale(
structure_score,
to = rev(unname(
stats::quantile(all_best$score, c(0.70, 0.95), na.rm = TRUE)
))
) |>
round()
) |>
dplyr::select(method, weight, threshold)
# display results
method_thresholdTo adapt the code:
- Line 2: Keep
summary_statsfrom Step 7.3 with thestructure_scorecolumn. - Lines 6–11: Adjust the quantile range (default: 0.70 to 0.95) to control threshold spread.
- Line 14: Modify column selection if additional metrics are needed in the output.
# calculate method-specific weights and thresholds
q70 = all_best["score"].quantile(0.70)
q95 = all_best["score"].quantile(0.95)
method_threshold = summary_stats[["method", "structure_score"]].copy()
method_threshold["score_frac"] = method_threshold["structure_score"] / 100
method_threshold["weight"] = (
method_threshold["score_frac"] / method_threshold["score_frac"].sum()
)
# inverse scale: higher structure_score → lower threshold (easier matching)
method_threshold["threshold"] = (
_rescale(method_threshold["structure_score"], q95, q70)
).round()
method_threshold = method_threshold[["method", "weight", "threshold"]]
# display results
method_thresholdTo adapt the code:
- Line 2: Keep
summary_statsfrom Step 7.3 with thestructure_scorecolumn. - Lines 4–5: Adjust the quantile range (default: 0.70 to 0.95) to control threshold spread.
- Line 14: Modify column selection if additional metrics are needed in the output.
This table shows each method’s relative weight, scaled so all weights sum to one, and its method-specific score threshold derived from observed performance. The relationship between structural quality, weights, and thresholds follows this pattern:
| Method | Weight | Threshold | Interpretation |
|---|---|---|---|
| Levenshtein | 0.219 | 75 | Highest-performing method: gets most influence in composite scoring and lowest threshold for matching |
| Composite-Score | 0.185 | 80 | Strong performer: high influence and low threshold |
| Rank-Ensemble | 0.178 | 81 | Good performer: moderate-high influence and low-medium threshold |
| Jaro-Winkler | 0.174 | 82 | Balanced method: moderate influence and medium threshold |
| Qgram | 0.130 | 88 | Lower performer: reduced influence and higher threshold to prevent false positives |
| LCS | 0.114 | 92 | Lowest performer: minimal influence and strictest threshold required |
Key insight: Methods with better structural alignment get rewarded with higher weights (more influence) and lower thresholds (easier matching), while unreliable methods are constrained with lower weights and stricter thresholds.
Step 8: Finalize Fuzzy Match Selection
This step assigns a final match to each DHIS2 facility by choosing the most appropriate candidate from the fuzzy matching results. At this stage, every unmatched facility from earlier steps should have a set of one or more candidate matches with similarity scores and any supporting diagnostics (e.g., prefix/suffix matches, token differences). The goal is to select one match per facility in a way that balances accuracy, efficiency, and coverage.
The choice of final selection strategy depends on:
- Confidence in the matching methods: If the algorithms and thresholds are already well-tuned to the data, a simpler approach may be sufficient.
- Tolerance for false positives: In some contexts, a slightly lower match rate is acceptable if it reduces the risk of incorrect matches.
- Analytical and operational priorities: For example, whether a quick, explainable method is needed for an urgent report, or a more comprehensive, iterative process for a master dataset update.
We offer four structured options, each with its own strengths, weaknesses, and level of complexity. In practice, we may start with a simpler option and progress to more advanced approaches if too many facilities remain unmatched or if match quality is inconsistent. These options are also compatible with an SNT team review step, allowing human validation before matches are finalized in the dataset.
Summary of Match Selection Options
| Option | Description | Use Case | Code Complexity |
|---|---|---|---|
| Option 1: Single Method | Use one preferred method (e.g., Levenshtein) with a threshold | When confident in a single algorithm | Low |
| Option 2: Composite/Rank | Use composite score or rank average to select best match | When using blended match signals | Moderate |
| Option 3: Weighted Composite Score | Use performance-weighted composite score that accounts for method reliability and false-positive rates | When using blended match signals with known method performance differences | Moderate |
| Option 4: Fallback Loop | Stepwise fallback through all methods based on structural quality scores | When coverage is most important | High |
Step 8.1: Match selection options
Option 1: Single Preferred Method
This approach uses one similarity algorithm, such as Levenshtein or Jaro-Winkler, to select the highest-scoring match for each facility. We apply a fixed threshold so that scores above the cut-off are accepted as matches, while lower scores are flagged for review. It is fast, easy to explain, and works well when the chosen method aligns with the types of name differences in the data.
Show the code
# apply single-method threshold and flag matches
best_lv_final <- best_lv |>
dplyr::mutate(
match_flag = dplyr::if_else(score >= 85, "match", "review"),
final_method = "Levenshtein"
) |>
dplyr::left_join(
dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
by = "hf_mfl"
) |>
dplyr::select(hf_dhis2, hf_mfl, hf_mfl_raw, score, final_method, match_flag)
# show top 5 and bottom 5 matches by score
top_5 <- best_lv_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- best_lv_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Levenshtein Score")
# preview results
top_bottomTo adapt the code:
- Line 2: Replace
best_lvwith another object from Step 4.5 (e.g.,best_jw,best_qg) depending on the preferred method. - Line 4: Adjust the score threshold (default = 85) to define what is considered a
match. - Line 11: Keep column names
hf_dhis2andhf_mfl, or replace them with the project-specific names.
Show the code
# apply single-method threshold and flag matches
best_lv_final = (
best_lv
.assign(
match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"),
final_method="Levenshtein",
)
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
[["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)
# show top 5 and bottom 5 matches by score
top_5 = best_lv_final.sort_values("score", ascending=False).head(5)
bottom_5 = best_lv_final.sort_values("score").head(5)
# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)
cli_header("Top 5 and Bottom 5 Matches by Levenshtein Score")
# preview results
top_bottomTo adapt the code:
- Line 2: Replace
best_lvwith another object from Step 6.4 (e.g.,best_jw,best_qg) depending on the preferred method. - Line 4: Adjust the score threshold (default = 85) to define what is considered a
match. - Line 9: Keep column names
hf_dhis2andhf_mfl, or replace them with the project-specific names.
In our example dataset, the Levenshtein method with an 85% threshold achieved the match rate shown in the output above. This conservative approach prioritizes precision over recall, making it suitable when false positives are costly and manual review resources are available.
Option 2: Composite or Rank-Based Match
This approach selects the best match using a combined measure from multiple algorithms, such as a composite score or average rank. It avoids relying on a single method while keeping the process simpler than a full fallback loop. This works well when different algorithms capture different types of variation, and a single combined score can be used with a threshold to separate confident matches from those needing review.
Show the code
# select best match using rank average
composite_final <- ranked_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::arrange(rank_avg) |>
dplyr::slice_head(n = 1) |>
dplyr::ungroup() |>
dplyr::mutate(
match_flag = dplyr::if_else(composite_score >= 85, "match", "review")
) |>
dplyr::select(
adm1,
adm2,
hf_dhis2,
hf_mfl,
composite_score,
rank_avg,
match_flag
)
# show top 5 and bottom 5 matches by composite score
top_5 <- composite_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- composite_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Composite Score")
# preview results
top_bottomTo adapt the code:
- Line 2: Keep
ranked_gridfrom Step 6.4, or replace with the combined results dataframe. - Line 8: Adjust the composite score threshold (default = 85) to define what is considered a
match. - Lines 10–18: Keep score columns (
composite_score,rank_avg) or replace with the relevant scoring metrics.
Show the code
# select best match using rank average
composite_final = (
ranked_grid.sort_values("rank_avg")
.groupby("hf_dhis2", as_index=False)
.first()
.assign(
match_flag=lambda d: np.where(d["composite_score"] >= 85, "match", "review"),
score=lambda d: d["composite_score"],
final_method="Composite-Score",
)
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
[["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)
# show top 5 and bottom 5 matches by composite score
top_5 = composite_final.sort_values("score", ascending=False).head(5)
bottom_5 = composite_final.sort_values("score").head(5)
# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)
cli_header("Top 5 and Bottom 5 Matches by Composite Score")
# preview results
top_bottomTo adapt the code:
- Line 2: Keep
ranked_gridfrom Step 6.4, or replace with the combined results dataframe. - Line 6: Adjust the composite score threshold (default = 85) to define what is considered a
match. - Lines 8–13: Keep score columns (
composite_score,rank_avg) or replace with the relevant scoring metrics.
The composite score approach results in the match rate shown in the output above. This reflects the averaging effect where strong matches in one method may be diluted by weaker scores in others. Best suited for datasets with consistent naming patterns across methods.
Option 3: Weighted Composite Score
This approach selects the best match using a performance-weighted composite score that accounts for method reliability and false-positive rates. Instead of treating all methods equally, it uses the weights calculated in Step 7.4 to emphasize methods that demonstrated better structural alignment. This provides a different approach from simple averaging while being less complex than a full fallback system.
The weighted composite score uses a weighted average that maintains the 0-100 scale: Methods with higher structural quality scores get more weight in the final composite. For example, if Levenshtein has weight 0.4 and Qgram has weight 0.1, the Levenshtein score contributes 4x more to the final weighted score.
Show the code
# define score columns and extract weights
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")
weights <- method_threshold |>
dplyr::filter(
method %in% c("Jaro-Winkler", "Levenshtein", "Qgram", "LCS")
) |>
dplyr::pull(weight)
# calculate weighted composite score for each facility-match pair
weighted_final <- ranked_grid |>
dplyr::rowwise() |>
dplyr::mutate(
# weighted average maintains 0-100 scale
weighted_composite_score = stats::weighted.mean(
c(score_jw, score_lv, score_qg, score_lcs),
weights,
na.rm = TRUE
)
) |>
dplyr::ungroup() |>
dplyr::group_by(hf_dhis2) |>
dplyr::arrange(dplyr::desc(weighted_composite_score)) |>
dplyr::slice_head(n = 1) |>
dplyr::ungroup() |>
dplyr::mutate(
match_flag = dplyr::if_else(
weighted_composite_score >= 85,
"match",
"review"
),
score = weighted_composite_score, # standardize column name for Step 7
final_method = "Weighted-Composite"
) |>
dplyr::left_join(
dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
by = "hf_mfl"
) |>
dplyr::select(
hf_dhis2,
hf_mfl,
hf_mfl_raw,
score,
final_method,
match_flag
)
# show top 5 and bottom 5 matches by weighted composite score
top_5 <- weighted_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- weighted_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Weighted Composite Score")
# preview results
bottom_5To adapt the code:
- Line 2: Adjust
score_colsif the dataset uses different similarity score column names. - Lines 4–8: Ensure
method_thresholdfrom Step 7.4 contains weights for the methods being used. - Lines 15–18: The
weighted.mean()function calculates a weighted average that maintains the 0-100 scale. - Lines 27–31: Adjust the weighted composite score threshold (default = 85) to define what is considered a
match. - Lines 39–46: Select columns to include in the final output.
Show the code
# define score columns and extract weights
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]
weights_df = method_threshold.loc[
method_threshold["method"].isin(["Jaro-Winkler", "Levenshtein", "Qgram", "LCS"])
]
weights = weights_df.set_index("method")["weight"]
# extract individual weights for each method
w_jw = float(weights.get("Jaro-Winkler", 0))
w_lv = float(weights.get("Levenshtein", 0))
w_qg = float(weights.get("Qgram", 0))
w_lcs = float(weights.get("LCS", 0))
# calculate weighted composite score for each facility-match pair
# weighted average maintains 0-100 scale
weighted_final = (
ranked_grid.assign(
weighted_composite_score=lambda d: (
d["score_jw"] * w_jw +
d["score_lv"] * w_lv +
d["score_qg"] * w_qg +
d["score_lcs"] * w_lcs
)
)
.sort_values("weighted_composite_score", ascending=False)
.groupby("hf_dhis2", as_index=False)
.first()
.assign(
match_flag=lambda d: np.where(
d["weighted_composite_score"] >= 85, "match", "review"
),
score=lambda d: d["weighted_composite_score"],
final_method="Weighted-Composite",
)
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
[["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)
# show top 5 and bottom 5 matches by weighted composite score
top_5 = weighted_final.sort_values("score", ascending=False).head(5)
bottom_5 = weighted_final.sort_values("score").head(5)
cli_header("Top 5 and Bottom 5 Matches by Weighted Composite Score")
# preview results
bottom_5To adapt the code:
- Line 2: Adjust
score_colsif the dataset uses different similarity score column names. - Lines 4–7: Ensure
method_thresholdfrom Step 7.4 contains weights for the methods being used. - Lines 14–19: The weighted sum calculates a weighted average that maintains the 0-100 scale.
- Lines 25–27: Adjust the weighted composite score threshold (default = 85) to define what is considered a
match. - Line 33: Select columns to include in the final output.
The weighted composite approach achieved the improved match rate shown in the output above. This improvement over the simple composite shows the benefit of emphasizing high-quality methods. The weighting balances the strengths of multiple models while prioritizing the most reliable algorithms.
Option 4: Fallback Loop Across Methods (Recommended for Coverage)
This approach maximizes match coverage through a sophisticated tiered fallback system that dynamically uses the method ordering and thresholds calculated in Steps 7.3 and 7.4. Methods are applied in descending order of their structure scores, ensuring high-precision algorithms get the first chance to match.
A key strength is the use of method-specific thresholds from Step 7.4. Each method uses its own calculated threshold based on its structural quality and false-positive risk profile. This evidence-based approach is more reliable than arbitrary cutoffs.
The method order and thresholds are dynamically determined from the structural quality analysis in Steps 7.3-7.4. Methods with higher structure scores are prioritized, and each uses its calculated threshold based on false-positive risk:
- High-quality methods (e.g., Levenshtein): Lower thresholds due to reliable matching
- Medium-quality methods (e.g., Jaro-Winkler): Moderate thresholds for balanced precision
- Lower-quality methods (e.g., Qgram, LCS): Higher thresholds to prevent false positives
The fallback process works as follows:
- Method ordering: Methods are ordered by their structure scores from Step 7.3 (highest to lowest)
- Tie-breaking: Within each method, candidates are pre-sorted by:
- Similarity score (descending) - highest scores first
- Average rank from ensemble methods (ascending) - best ranks as secondary criterion This ensures that when multiple candidates exceed the threshold, the best match is selected
- Progressive matching: Each method attempts to match remaining facilities using its calculated threshold from Step 7.4
- Threshold relaxation: If facilities remain unmatched after a full pass, thresholds are reduced by 1 point and the cycle repeats
- Quality floor: Process stops when all facilities are matched or thresholds drop below 50
This dynamic approach with reliable tie-breaking ensures the best possible match for each facility while maintaining quality standards based on empirical performance metrics.
Show the code
# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats <- summary_stats |>
dplyr::arrange(dplyr::desc(structure_score)) |>
dplyr::filter(method != "Rank-Ensemble") # exclude rank-based method
method_order <- method_stats$method
# use thresholds calculated in step 7.4
method_thresholds_df <- method_threshold |>
dplyr::filter(method != "Rank-Ensemble")
# convert to named vector for easy lookup
method_thresholds <- setNames(
method_thresholds_df$threshold,
method_thresholds_df$method
)
# build combined table of top matches from real methods only
all_best_nonrank <- dplyr::bind_rows(
best_lv,
best_jw,
best_qg,
best_lcs,
best_comp
) |>
# normalize method labels
dplyr::mutate(method = trimws(method))
# add rank once
rank_key <- ranked_grid |>
dplyr::select(hf_dhis2, hf_mfl, rank_avg)
all_best_ranked <- all_best_nonrank |>
dplyr::left_join(rank_key, by = c("hf_dhis2", "hf_mfl")) |>
dplyr::mutate(
# worst for missing ranks
rank_avg = dplyr::if_else(is.na(rank_avg), Inf, rank_avg)
)
# split by method and pre-arrange for stable tiebreak
by_method <- split(all_best_ranked, all_best_ranked$method) |>
purrr::map(\(x) {
x |>
dplyr::arrange(dplyr::desc(score), rank_avg)
})
# keep copy for pass-rate check
initial_thr <- method_thresholds
# init
fallback_chunks <- list()
unmatched <- unique(all_best_ranked$hf_dhis2)
# fast fallback loop (no joins, no grouping inside)
repeat {
for (m in method_order) {
if (!m %in% names(by_method)) {
next
}
# current threshold from step 7.4
thr <- method_thresholds[[m]]
if (is.null(thr)) {
thr <- 85 # fallback if method not found
}
# filter by score and still-unmatched
cand <- by_method[[m]] |>
dplyr::filter(score >= thr, hf_dhis2 %in% unmatched)
if (nrow(cand) > 0) {
# pick first row per hf_dhis2 (pre-arranged by score desc, rank asc)
best_per_hf <- cand |>
dplyr::distinct(hf_dhis2, .keep_all = TRUE) |>
dplyr::mutate(
final_method = glue::glue("Fuzzy-matched using: {m}")
)
# store and update unmatched
fallback_chunks[[length(fallback_chunks) + 1L]] <- best_per_hf
unmatched <- setdiff(unmatched, best_per_hf$hf_dhis2)
}
if (length(unmatched) == 0) break
}
if (length(unmatched) == 0) {
break
}
# relax thresholds gradually
method_thresholds <- method_thresholds - 1
if (any(method_thresholds <= 50)) break # stop if too low
}
# bind once
fallback_matched <- dplyr::bind_rows(fallback_chunks)
# build a 1:1 map from standardized DHIS2 name to admin + raw +
# hf_uid_new
hf_uid_new_map_by_stand <- dhis2_hf_unmatched |>
dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
dplyr::group_by(hf_dhis2) |>
dplyr::summarise(
adm0 = dplyr::first(adm0),
adm1 = dplyr::first(adm1),
adm2 = dplyr::first(adm2),
adm3 = dplyr::first(adm3),
hf_dhis2_raw = dplyr::first(hf_dhis2_raw),
hf_uid_new = dplyr::first(hf_uid_new),
.groups = "drop"
)
# final flags and attach admin + ids
fallback_final <- fallback_matched |>
dplyr::mutate(
match_flag = dplyr::if_else(score >= 85, "match", "review")
) |>
dplyr::left_join(
dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
by = "hf_mfl"
) |>
dplyr::left_join(hf_uid_new_map_by_stand, by = "hf_dhis2") |>
dplyr::select(
adm0, adm1, adm2, adm3,
hf_uid_new,
hf_dhis2_raw,
hf_dhis2,
hf_mfl_raw,
hf_mfl,
score,
final_method,
match_flag
)
top_5 <- fallback_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- fallback_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Fallback")
bottom_5 <- dplyr::bind_rows(top_5, bottom_5)
# preview results
bottom_5To adapt the code:
- Lines 2–8: The method order is automatically determined from
summary_statsstructure scores. - Lines 10–18: Thresholds are extracted from Step 7.4’s
method_thresholddataframe. - Line 66: The fallback threshold (default = 85) is used only if a method lacks a calculated threshold.
- Line 95: The minimum threshold (default = 50) prevents accepting very low-quality matches.
- Lines 33, 36, throughout: Replace
hf_dhis2andhf_mflwith the dataset’s column names if they differ.
Show the code
# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats = summary_stats.loc[
summary_stats["method"] != "Rank-Ensemble"
].sort_values("structure_score", ascending=False)
method_order = list(method_stats["method"])
# use thresholds calculated in step 7.4
method_thresholds_df = method_threshold.loc[method_threshold["method"] != "Rank-Ensemble"]
method_thresholds = dict(zip(method_thresholds_df["method"], method_thresholds_df["threshold"]))
# build combined table of top matches from real methods only
all_best_nonrank = pd.concat(
[best_lv, best_jw, best_qg, best_lcs, best_comp], ignore_index=True
)
all_best_nonrank["method"] = all_best_nonrank["method"].str.strip()
# add rank once
rank_key = ranked_grid[["hf_dhis2", "hf_mfl", "rank_avg"]].copy()
all_best_ranked = (
all_best_nonrank
.merge(rank_key, on=["hf_dhis2", "hf_mfl"], how="left")
.assign(rank_avg=lambda d: d["rank_avg"].fillna(float("inf")))
)
# split by method and pre-arrange for stable tiebreak
by_method = {
m: grp.sort_values(["score", "rank_avg"], ascending=[False, True])
for m, grp in all_best_ranked.groupby("method")
}
fallback_chunks = []
unmatched_set = set(all_best_ranked["hf_dhis2"].unique())
current_thresholds = dict(method_thresholds)
# fast fallback loop
while True:
for m in method_order:
if m not in by_method:
continue
thr = current_thresholds.get(m, 85) # fallback if method not found
cand = by_method[m].loc[
(by_method[m]["score"] >= thr) &
(by_method[m]["hf_dhis2"].isin(unmatched_set))
]
if len(cand) > 0:
best_per_hf = (
cand.drop_duplicates(subset="hf_dhis2", keep="first")
.assign(final_method=f"Fuzzy-matched using: {m}")
)
fallback_chunks.append(best_per_hf)
unmatched_set -= set(best_per_hf["hf_dhis2"])
if not unmatched_set:
break
if not unmatched_set:
break
# relax thresholds gradually
current_thresholds = {k: v - 1 for k, v in current_thresholds.items()}
if any(v <= 50 for v in current_thresholds.values()): # stop if too low
break
# bind once
fallback_matched = (
pd.concat(fallback_chunks, ignore_index=True)
if fallback_chunks else pd.DataFrame()
)
# build a 1:1 map from standardized DHIS2 name to admin + raw + hf_uid_new
hf_uid_new_map_by_stand = (
dhis2_hf_unmatched
.sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
.groupby("hf_dhis2", as_index=False)
.first()[["hf_dhis2", "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
)
# final flags and attach admin + ids
fallback_final = (
fallback_matched
.assign(match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"))
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
.merge(hf_uid_new_map_by_stand, on="hf_dhis2", how="left")
[["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
"score", "final_method", "match_flag"]]
)
top_5 = fallback_final.sort_values("score", ascending=False).head(5)
bottom_5 = fallback_final.sort_values("score").head(5)
cli_header("Top 5 and Bottom 5 Matches by Fallback")
bottom_5 = pd.concat([top_5, bottom_5], ignore_index=True)
# preview results
bottom_5To adapt the code:
- Lines 2–7: The method order is automatically determined from
summary_statsstructure scores. - Lines 9–12: Thresholds are extracted from Step 7.4’s
method_thresholddataframe. - Line 46: The fallback threshold (default = 85) is used only if a method lacks a calculated threshold.
- Line 58: The minimum threshold (default = 50) prevents accepting very low-quality matches.
- Lines 22, 76, throughout: Replace
hf_dhis2andhf_mflwith the dataset’s column names if they differ.
The fallback loop achieved the highest match rate as shown in the output above. This substantial improvement over other methods demonstrates the effectiveness of method-specific thresholds and progressive fallback. The approach maximizes coverage while maintaining quality through data-driven thresholds and intelligent tie-breaking.
This is the approach that we will use going forward, as it balances coverage and match quality, reduces the manual review workload, and adapts thresholds dynamically to the strengths and weaknesses of each method.
Step 8.2: Select fuzzy matching approach
After reviewing the different fuzzy matching options from Step 8.1, select which approach to use. This step creates the fuzzy_matches dataset that will be combined with other matching methods in Step 9.
Select one approach based on the evaluation in Step 8.1: - Use single algorithm (Levenshtein, Jaro-Winkler, or Q-gram) for simplicity and interpretability - Use composite scores to balance multiple algorithms - Use weighted approach when domain expertise suggests certain algorithms are more reliable for the data - Use fallback approach for the most comprehensive matching with hierarchical scoring
Show the code
# create fuzzy_matches from step 8 results (use whichever option was
# chosen)
# step 8 produces different result datasets depending on which option
# was
# chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final
fuzzy_matches <- if (exists("fallback_final")) {
dplyr::filter(fallback_final, match_flag == "match")
} else if (exists("composite_final")) {
dplyr::filter(composite_final, match_flag == "match")
} else if (exists("weighted_final")) {
dplyr::filter(weighted_final, match_flag == "match")
} else if (exists("best_lv_final")) {
dplyr::filter(best_lv_final, match_flag == "match")
} else if (exists("best_jw_final")) {
dplyr::filter(best_jw_final, match_flag == "match")
} else if (exists("best_qg_final")) {
dplyr::filter(best_qg_final, match_flag == "match")
} else {
# empty dataframe with expected structure if no step 8 results exist
data.frame(
hf_dhis2_raw = character(0),
hf_dhis2 = character(0),
hf_mfl_raw = character(0),
hf_mfl = character(0),
score = numeric(0),
final_method = character(0),
stringsAsFactors = FALSE
)
} |>
# remove any internal duplicates (keep highest scoring match per
# facility)
dplyr::group_by(hf_dhis2_raw) |>
dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
dplyr::ungroup() |>
# attach admin + hf_uid_new from unmatched (collapse to 1:1 per
# hf_dhis2_raw)
{
hf_uid_new_map_unmatched <- dhis2_hf_unmatched |>
dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
dplyr::group_by(hf_dhis2_raw) |>
dplyr::summarise(
adm0 = dplyr::first(adm0),
adm1 = dplyr::first(adm1),
adm2 = dplyr::first(adm2),
adm3 = dplyr::first(adm3),
hf_uid_new = dplyr::first(hf_uid_new),
.groups = "drop"
)
dplyr::left_join(., hf_uid_new_map_unmatched, by = "hf_dhis2_raw")
}
# display summary of selected approach
cli::cli_h2("Selected Fuzzy Matching Results")
cli::cli_alert_success("Fuzzy matches found: {nrow(fuzzy_matches)}")
if (nrow(fuzzy_matches) > 0) {
score_summary <- fuzzy_matches |>
dplyr::summarise(
avg_score = mean(score, na.rm = TRUE),
min_score = min(score, na.rm = TRUE),
max_score = max(score, na.rm = TRUE),
.groups = "drop"
)
cli::cli_alert_info("Score range: {round(score_summary$min_score, 1)} - {round(score_summary$max_score, 1)}")
cli::cli_alert_info("Average score: {round(score_summary$avg_score, 1)}")
}To adapt the code:
- Lines 12–24: The conditional logic checks for different Step 8 result datasets in priority order. Modify the order if a different approach should take precedence.
- Lines 25–34: Create empty dataframe structure if no Step 8 results exist. Ensure column names match the expected structure.
- Lines 38–40: Deduplication logic removes internal duplicates within the selected approach, keeping the highest scoring match per DHIS2 facility. Modify if different deduplication rules are needed.
- Alternative approach: Instead of conditional logic, directly assign the chosen dataset. For example, to select the fallback approach directly:
fuzzy_matches <- dplyr::filter(fallback_final, match_flag == "match") |> dplyr::group_by(hf_dhis2_raw) |> dplyr::slice_max(score, n = 1, with_ties = FALSE) |> dplyr::ungroup().
Show the code
# create fuzzy_matches from step 8 results (use whichever option was chosen)
# step 8 produces different result datasets depending on which option was chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final
if "fallback_final" in dir():
fuzzy_matches_raw = fallback_final.loc[fallback_final["match_flag"] == "match"]
elif "composite_final" in dir():
fuzzy_matches_raw = composite_final.loc[composite_final["match_flag"] == "match"]
elif "weighted_final" in dir():
fuzzy_matches_raw = weighted_final.loc[weighted_final["match_flag"] == "match"]
elif "best_lv_final" in dir():
fuzzy_matches_raw = best_lv_final.loc[best_lv_final["match_flag"] == "match"]
elif "best_jw_final" in dir():
fuzzy_matches_raw = best_jw_final.loc[best_jw_final["match_flag"] == "match"]
elif "best_qg_final" in dir():
fuzzy_matches_raw = best_qg_final.loc[best_qg_final["match_flag"] == "match"]
else:
# empty dataframe with expected structure if no step 8 results exist
fuzzy_matches_raw = pd.DataFrame(columns=[
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
"score", "final_method",
])
# deduplicate: keep highest scoring match per DHIS2 facility
fuzzy_matches = (
fuzzy_matches_raw
.sort_values("score", ascending=False)
.drop_duplicates(subset="hf_dhis2_raw", keep="first")
.merge(
dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
.sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
.drop_duplicates(subset="hf_dhis2_raw", keep="first"),
on="hf_dhis2_raw",
how="left",
)
)
# display summary of selected approach
cli_header("Selected Fuzzy Matching Results")
cli_success(f"Fuzzy matches found: {len(fuzzy_matches)}")
if len(fuzzy_matches) > 0:
cli_info(
f"Score range: {fuzzy_matches['score'].min():.1f} - "
f"{fuzzy_matches['score'].max():.1f}"
)
cli_info(f"Average score: {fuzzy_matches['score'].mean():.1f}")To adapt the code:
- Lines 12–24: The conditional logic checks for different Step 8 result datasets in priority order. Modify the order if a different approach should take precedence.
- Lines 25–28: Create empty dataframe structure if no Step 8 results exist. Ensure column names match the expected structure.
- Lines 31–41: Deduplication logic removes internal duplicates, keeping the highest scoring match per DHIS2 facility. Modify if different deduplication rules are needed.
Step 8.3: Manual review of flagged matches
Before finalizing matches and joining back to the main datasets, all records flagged as review must be manually validated or rejected by the SNT team or designated reviewers. This step ensures data quality and prevents false positives from entering the analysis.
Manual review is required for all review matches before proceeding to Step 9. The SNT team should confirm or reject uncertain matches and document patterns in false positives or missed matches to refine future matching workflows.
Preparing Data for Manual Review
Below, we export review matches to a structured file for systematic checking:
Show the code
# prepare review dataset (using fallback results as example)
review_matches <- fallback_final |>
dplyr::filter(match_flag == "review") |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::mutate(
reviewer_decision = NA_character_, # accept/reject/uncertain
reviewer_notes = NA_character_,
review_date = NA_character_,
reviewer_name = NA_character_
) |>
dplyr::select(
adm0, adm1, adm2, adm3, hf_uid_new,
hf_dhis2_raw,
hf_dhis2,
hf_mfl_raw,
hf_mfl,
score,
final_method,
reviewer_decision,
reviewer_notes,
review_date,
reviewer_name
)
# export for review
readr::write_csv(
review_matches,
here::here("03_outputs", "tables", "facility_matches_for_review.csv")
)
# show summary
cli::cli_h2("Manual Review Required")
cli::cli_alert_info(
"Exported {nrow(review_matches)} matches flagged for review"
)
cli::cli_alert_info(
"File location: outputs/facility_matches_for_review.csv"
)To adapt the code:
- Lines 6–9: Add additional columns for the review process (e.g., facility_type, district).
- Lines 26–29: Adjust the output file path and name as needed.
Show the code
# prepare review dataset (using fallback results as example)
review_matches = (
fallback_final
.loc[fallback_final["match_flag"] == "review"]
.sort_values("score", ascending=False)
.assign(
reviewer_decision=pd.NA, # accept/reject/uncertain
reviewer_notes=pd.NA,
review_date=pd.NA,
reviewer_name=pd.NA,
)
[["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
"score", "final_method",
"reviewer_decision", "reviewer_notes", "review_date", "reviewer_name"]]
)
# export for review
output_dir = Path(here("03_outputs/tables"))
output_dir.mkdir(parents=True, exist_ok=True)
review_matches.to_csv(output_dir / "facility_matches_for_review.csv", index=False)
# show summary
cli_header("Manual Review Required")
cli_info(f"Exported {len(review_matches)} matches flagged for review")
cli_info("File location: outputs/tables/facility_matches_for_review.csv")To adapt the code:
- Lines 6–9: Add additional columns for the review process (e.g., facility_type, district).
- Lines 24–25: Adjust the output file path and name as needed.
Review Process Guidelines
The manual review step is more than a simple yes/no decision. It is an opportunity to strengthen the matching workflow by confirming accurate links, rejecting false positives, and documenting recurring data quality issues. Reviewers play a key role in ensuring the integrity of the final dataset and in improving the process for future runs.
For Reviewers:
- Accept – Confirm that both names refer to the same facility. Look for strong evidence of equivalence, such as:
- Minor spelling variations:
"Makeni Gov Hospital"↔︎"Makeni Government Hospital" - Common or consistent abbreviations:
Bo CHC↔︎Bo Community Health Center - Small formatting differences that do not alter meaning
- Minor spelling variations:
- Reject – Flag cases where facilities differ or where the match is too uncertain. Common examples include:
- Different facility types in the same area:
"Makeni Hospital"↔︎"Makeni Clinic" - Generic names without geographic qualifiers:
Community Health Postwithout a location - Multiple possible matches in the same vicinity without clear identifiers
- Different facility types in the same area:
- Document patterns – Use the review to capture recurring issues. These insights help refine thresholds, weights, and matching rules. Examples:
- Frequent abbreviation mismatches (e.g.,
CHCvsCH Centre) - Regional or language-specific naming conventions
- Missing or inconsistent administrative unit data in source systems
- Frequent abbreviation mismatches (e.g.,
A systematic approach to review ensures that accepted matches are trustworthy, rejected matches are well-justified, and lessons learned feed directly into improving the next matching cycle.
Simulated Review Decisions
For demonstration purposes, we simulate reviewer decisions on the flagged matches. In practice, this would be done by domain experts reviewing the exported CSV file.
After manual review is complete, we import the data that has all the decisions.
Show the code
# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches <- readr::read_csv(
here::here("03_outputs", "tables", "facility_matches_reviewed.csv"),
col_types = readr::cols(
reviewer_decision = readr::col_character(),
reviewer_notes = readr::col_character(),
review_date = readr::col_character(),
reviewer_name = readr::col_character()
)
)To adapt the code:
- Line 4: Update the file path to the project’s output directory.
- Lines 6–9: Add additional column types if the review file includes extra fields.
Show the code
# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches = pd.read_csv(
here("03_outputs/tables/facility_matches_reviewed.csv"),
dtype={
"reviewer_decision": "object",
"reviewer_notes": "object",
"review_date": "object",
"reviewer_name": "object",
}
)To adapt the code:
- Line 4: Update the file path to the project’s output directory.
- Lines 5–9: Add additional column dtype specifications if the review file includes extra fields.
Incorporating Review Results
Now we integrate the reviewer decisions back into the matching workflow:
Show the code
final_incorp <- reviewed_matches |>
dplyr::mutate(
final_match_flag = dplyr::case_when(
reviewer_decision == "accept" ~ "match",
reviewer_decision == "reject" ~ "no_match",
is.na(reviewer_decision) ~ "pending_review",
TRUE ~ "no_match"
)
)
# get the final matched ones
final_matches <- final_incorp |>
dplyr::filter(final_match_flag == "match") |>
dplyr::select(-reviewer_decision)
# summary of all results (including rejected/pending)
all_results_summary <- final_incorp |>
dplyr::count(final_match_flag, name = "n") |>
dplyr::mutate(percentage = round(100 * n / sum(n), 1))
# summary of final accepted matches only
final_summary <- final_matches |>
dplyr::count(final_match_flag, name = "n")
cli::cli_h2("Final Matching Results After Manual Review")
cli::cli_alert_success(
"Accepted matches: {format(sum(final_summary$n), big.mark = ',')}"
)
# breakdown by decision type
rejected_count <- all_results_summary |>
dplyr::filter(final_match_flag == "no_match") |>
dplyr::pull(n)
if (length(rejected_count) == 0) {
rejected_count <- 0
}
pending_count <- all_results_summary |>
dplyr::filter(final_match_flag == "pending_review") |>
dplyr::pull(n)
if (length(pending_count) == 0) {
pending_count <- 0
}
if (rejected_count > 0) {
cli::cli_alert_danger(
"Rejected matches: {format(rejected_count, big.mark = ',')} (will have NA MFL values in final dataset)"
)
}
if (pending_count > 0) {
cli::cli_alert_warning(
"Pending review: {format(pending_count, big.mark = ',')} (require additional review)"
)
}
# show breakdown table
all_results_summaryTo adapt the code:
- [Code removed]: Update the file path from
"03_outputs"to the project’s output directory (no file path in this chunk; see the simulate reviewer decisions chunk above). - [Code removed]: Add additional column types if the review file includes extra fields (no column types in this chunk; see the simulate reviewer decisions chunk above).
- Lines 3–9: Adjust logic for handling different reviewer decision categories based on the project workflow.
Show the code
final_incorp = reviewed_matches.assign(
final_match_flag=lambda d: np.select(
[
d["reviewer_decision"] == "accept",
d["reviewer_decision"] == "reject",
d["reviewer_decision"].isna(),
],
["match", "no_match", "pending_review"],
default="no_match",
)
)
# get the final matched ones
final_matches = (
final_incorp
.loc[final_incorp["final_match_flag"] == "match"]
.drop(columns="reviewer_decision", errors="ignore")
)
# summary of all results (including rejected/pending)
all_results_summary = (
final_incorp.groupby("final_match_flag", as_index=False)
.size()
.rename(columns={"size": "n"})
.assign(percentage=lambda d: (d["n"] / d["n"].sum() * 100).round(1))
)
cli_header("Final Matching Results After Manual Review")
cli_success(f"Accepted matches: {len(final_matches):,}")
# breakdown by decision type
rejected_count = int(
all_results_summary.loc[all_results_summary["final_match_flag"] == "no_match", "n"].sum()
)
pending_count = int(
all_results_summary.loc[
all_results_summary["final_match_flag"] == "pending_review", "n"
].sum()
)
if rejected_count > 0:
cli_danger(
f"Rejected matches: {rejected_count:,} "
"(will have NA MFL values in final dataset)"
)
if pending_count > 0:
cli_warning(f"Pending review: {pending_count:,} (require additional review)")
# show breakdown table
all_results_summaryTo adapt the code:
- Lines 3–9: Adjust logic for handling different reviewer decision categories based on the project workflow.
Out of the 159 proposed matches, none were accepted, as all appeared to be different. This may suggest underlying data quality issues or the presence of new health facility names.
DHIS2 facilities with rejected matches will appear in the final dataset (Step 9) with NA values for MFL columns, preserving the facility record while clearly indicating no reliable match was found. Facilities with pending review should be resolved before proceeding to production use.
Step 9: Combine All Matched Results
In this step, we combine results from all matching approaches and finalize the dataset by joining back to the original DHIS2 health facility list. This creates a comprehensive dataset that includes:
- High-confidence matches from Step 4 (Interactive Stratified Geographic Matching)
- Fuzzy matches from Step 6-8 (Various fuzzy matching strategies on remaining unmatched)
- Unmatched facilities (Preserved with NA values for MFL columns)
Step 9.1: Combine results from all matching approaches
When using the enhanced workflow with stratified matching, combine results from all stages:
Show the code
# create manual_matches from manual review results (both matched and
# unmatched)
manual_matches <- final_incorp |>
dplyr::mutate(
# for unmatched, set appropriate values
hf_mfl = dplyr::if_else(
reviewer_decision == "accept",
hf_mfl,
NA_character_
),
hf_mfl_raw = dplyr::if_else(
reviewer_decision == "accept",
hf_mfl_raw,
NA_character_
),
final_method = dplyr::if_else(
reviewer_decision == "accept",
"Manual Review - Accepted",
"Manual Review - Rejected"
),
score = dplyr::if_else(
reviewer_decision == "accept",
100,
0
)
) |>
# manual reviewed results already carry adm0-3 + hf_uid_new from
# fallback_final → review export → reimport
dplyr::select(
adm0,
adm1,
adm2,
adm3,
hf_uid_new,
hf_dhis2_raw,
hf_dhis2,
hf_mfl_raw,
hf_mfl,
score,
final_method
)
# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and
# rejected
final_facilities_all <- dplyr::bind_rows(
# step 4: stratified geographic matches (if performed)
matched_dhis2_prepgeoname,
# step 5: exact matches after standardization
matched_dhis2,
# step 6-8: complete matches from fuzzy matching
fuzzy_matches,
# manual matching (if performed)
manual_matches
)
# persist for the python render-state to load (Rule 6.4 parity)
write.csv(
final_facilities_all,
here::here("01_data/1.1_foundational/1.1c_health_facilities/processed/final_facilities_all.csv"),
row.names = FALSE
)
# collapse to one record per DHIS2 facility for downstream integration
final_facilities_one_per_hf <- final_facilities_all |>
dplyr::select(-match_flag, -hf)
cli::cli_alert_info(
"All facilities processed (rows): {nrow(final_facilities_all)}"
)
cli::cli_alert_success(
"All facilities processed (distinct DHIS2 by adm0/1/2/3+hf): {dplyr::n_distinct(final_facilities_all$hf_uid_new)}"
)
# generate summary by matching method
matching_summary <- final_facilities_one_per_hf |>
dplyr::group_by(final_method) |>
dplyr::summarise(
n_matched = n(),
avg_score = mean(score, na.rm = TRUE),
min_score = min(score, na.rm = TRUE),
max_score = max(score, na.rm = TRUE),
.groups = "drop"
)
# display summary
cli::cli_h2("Matching Results by Method")
matching_summaryTo adapt the code:
- [Code removed]: The
fuzzy_matcheslogic automatically detects which Step 8 approach was used. Modify the order or add additional Step 8 result variables if needed (this logic lives in the select fuzzy approach chunk, Step 8.2). - Lines 3–41: The
manual_matcheslogic processes both matched and rejected facilities fromfinal_incorp. Replacefinal_incorpwith the manual review results dataset name. - Lines 16–25: Facilities that were rejected in manual review get
final_method = "Manual Review - Rejected"andscore = 0. Adjust these labels if needed. - Lines 46–55: The
bind_rowscombines all matching approaches. Add or remove rows based on which steps were used in the workflow. - [Code removed]: Deduplication logic removes duplicates by keeping the highest scoring match for each DHIS2 facility (grouped by
hf_dhis2_rawonly). Modify if different deduplication rules are needed (deduplication lives in the select fuzzy approach chunk, Step 8.2). - [Code removed]: Validation checks detect internal duplicates within each dataset and verify the final combined dataset has no duplicates. Remove or modify these checks if not needed (not present in this chunk).
Show the code
# create manual_matches from manual review results (both matched and unmatched)
manual_matches = final_incorp.assign(
# for unmatched, set appropriate values
hf_mfl=lambda d: np.where(d["reviewer_decision"] == "accept", d["hf_mfl"], pd.NA),
hf_mfl_raw=lambda d: np.where(
d["reviewer_decision"] == "accept", d["hf_mfl_raw"], pd.NA
),
final_method=lambda d: np.where(
d["reviewer_decision"] == "accept",
"Manual Review - Accepted",
"Manual Review - Rejected",
),
score=lambda d: np.where(d["reviewer_decision"] == "accept", 100, 0),
)
cols_needed = [
"adm0", "adm1", "adm2", "adm3", "hf_uid_new",
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl", "score", "final_method",
]
manual_matches = manual_matches[[c for c in cols_needed if c in manual_matches.columns]]
# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and rejected
final_facilities_all = pd.concat(
[
# step 4: stratified geographic matches (if performed)
matched_dhis2_prepgeoname[[c for c in cols_needed if c in matched_dhis2_prepgeoname.columns]],
# step 5: exact matches after standardization
matched_dhis2[[c for c in cols_needed if c in matched_dhis2.columns]],
# step 6-8: complete matches from fuzzy matching
fuzzy_matches[[c for c in cols_needed if c in fuzzy_matches.columns]],
# manual matching (if performed)
manual_matches,
],
ignore_index=True,
)
# collapse to one record per DHIS2 facility for downstream integration
cols_to_drop = [c for c in ["match_flag", "hf"] if c in final_facilities_all.columns]
final_facilities_one_per_hf = final_facilities_all.drop(columns=cols_to_drop, errors="ignore")
cli_info(f"All facilities processed (rows): {len(final_facilities_all)}")
cli_success(
f"All facilities processed (distinct DHIS2 by hf_uid_new): "
f"{final_facilities_all['hf_uid_new'].nunique()}"
)
# generate summary by matching method
matching_summary = (
final_facilities_one_per_hf
.groupby("final_method", as_index=False)
.agg(
n_matched=("hf_dhis2_raw", "count"),
avg_score=("score", "mean"),
min_score=("score", "min"),
max_score=("score", "max"),
)
)
# display summary
cli_header("Matching Results by Method")
matching_summaryTo adapt the code:
- Lines 3–22: The
manual_matcheslogic processes both matched and rejected facilities fromfinal_incorp. Replacefinal_incorpwith the manual review results dataset name. - Lines 12–15: Facilities rejected in manual review get
final_method = "Manual Review - Rejected"andscore = 0. Adjust these labels if needed. - Lines 28–37:
pd.concatcombines all matching approaches. Add or remove dataframes based on which steps were used.
Step 9.2: Create final integrated DHIS2-MFL dataset
Create the final analytical dataset by starting with DHIS2 as the base and enriching it with MFL data where matches exist. This ensures no DHIS2 facilities are lost while adding valuable MFL attributes.
Show the code
# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final <- dhis2_df |>
dplyr::left_join(
dhis2_map,
by = c("adm0", "adm1", "adm2", "adm3", "hf")
)
# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf <- final_facilities_one_per_hf |>
dplyr::select(hf_uid_new, hf_mfl_raw)
# create the integrated dataset
final_dhis2_mfl_df <- dhis2_df_final |>
dplyr::left_join(final_match_per_hf, by = "hf_uid_new") |>
# join MFL attributes for matched facilities
dplyr::left_join(
dplyr::select(master_hf_df, -hf_mfl, -hf, -adm0, -adm1, -adm2, -adm3),
by = "hf_mfl_raw"
)
# counts on distinct facilities via hf_uid_new (robust to row-level
# duplication)
total <- dhis2_df_final |> dplyr::distinct(hf_uid_new) |> nrow()
with_mfl <- final_match_per_hf |>
dplyr::filter(!is.na(hf_mfl_raw)) |>
nrow()
without_mfl <- total - with_mfl
# validation summary
cli::cli_h2("DHIS2-MFL Integration Complete")
cli::cli_alert_success(
"Total DHIS2 facilities preserved: {total}"
)
cli::cli_alert_info(
"Facilities with MFL data: {with_mfl}"
)
cli::cli_alert_warning(
"Facilities without MFL match: {without_mfl}"
)
cli::cli_alert_info("Total rows (facility-months): {nrow(final_dhis2_mfl_df)}")To adapt the code:
- Line 5: Change
hfto match the DHIS2 facility name column. - Line 17: Adjust which MFL columns to exclude (we remove geographic columns to avoid conflicts with DHIS2’s geography).
- Line 18: Ensure
hf_mfl_rawmatches the MFL facility name column from the matching results.
Show the code
# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final = dhis2_df.merge(
dhis2_map,
on=["adm0", "adm1", "adm2", "adm3", "hf"],
how="left",
)
# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf = final_facilities_one_per_hf[["hf_uid_new", "hf_mfl_raw"]].copy()
# create the integrated dataset
final_dhis2_mfl_df = (
dhis2_df_final
.merge(final_match_per_hf, on="hf_uid_new", how="left")
# join MFL attributes for matched facilities
.merge(
master_hf_df.drop(
columns=["hf_mfl", "hf", "adm0", "adm1", "adm2", "adm3"],
errors="ignore",
),
on="hf_mfl_raw",
how="left",
)
)
# counts on distinct facilities via hf_uid_new
total = dhis2_df_final["hf_uid_new"].nunique()
with_mfl = final_match_per_hf.dropna(subset=["hf_mfl_raw"])["hf_uid_new"].nunique()
without_mfl = total - with_mfl
# validation summary
cli_header("DHIS2-MFL Integration Complete")
cli_success(f"Total DHIS2 facilities preserved: {total}")
cli_info(f"Facilities with MFL data: {with_mfl}")
cli_warning(f"Facilities without MFL match: {without_mfl}")
cli_info(f"Total rows (facility-months): {len(final_dhis2_mfl_df)}")To adapt the code:
- Line 5: Change
hfto match the DHIS2 facility name column. - Lines 19–22: Adjust which MFL columns to exclude (we remove geographic columns to avoid conflicts with DHIS2’s geography).
- Line 23: Ensure
hf_mfl_rawmatches the MFL facility name column from the matching results.
Step 10: Final Checks
Perform critical quality checks before finalizing the matching workflow. This step identifies potential data quality issues that require SNT team review: one-to-many matches (which may indicate duplicate facilities or generic names) and unmatched facilities (which could represent new facilities, closed facilities, or data quality issues). These checks ensure match quality and create actionable review lists for the SNT team.
Step 10.1: Detecting one-to-many matches for review
After finalization, check for MFL facilities matched to multiple DHIS2 facilities, which may indicate data quality issues requiring review:
Show the code
# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many <- final_facilities_all |>
dplyr::filter(!is.na(hf_mfl_raw)) |> # only matched facilities
dplyr::group_by(hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |> # MFL matched to >1 DHIS2
dplyr::ungroup() |>
dplyr::arrange(hf_mfl_raw, hf_dhis2_raw)
if (nrow(one_to_many) > 0) {
cli::cli_alert_warning(
paste0(
"One-to-many matches detected: {nrow(one_to_many)} DHIS2 facilities ",
"share {dplyr::n_distinct(one_to_many$hf_mfl_raw)} MFL facilities"
)
)
# show sample
knitr::kable(
one_to_many |>
dplyr::select(hf_dhis2_raw, hf_mfl_raw, score, final_method) |>
dplyr::slice_head(n = 6),
caption = "Sample one-to-many matches requiring review"
)
} else {
cli::cli_alert_success("No one-to-many matches detected")
}To adapt the code:
- [Code removed]: Modify output path for the review file as needed (no file export in this chunk).
- Lines 19–23: Adjust which columns to include in the review file.
Show the code
# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many = (
final_facilities_all
.dropna(subset=["hf_mfl_raw"]) # only matched facilities
.groupby("hf_mfl_raw")
.filter(lambda g: len(g) > 1) # MFL matched to >1 DHIS2
.sort_values(["hf_mfl_raw", "hf_dhis2_raw"])
)
if len(one_to_many) > 0:
cli_warning(
f"One-to-many matches detected: {len(one_to_many)} DHIS2 facilities "
f"share {one_to_many['hf_mfl_raw'].nunique()} MFL facilities"
)
# show sample
print(
one_to_many[["hf_dhis2_raw", "hf_mfl_raw", "score", "final_method"]]
.head(6)
.to_string(index=False)
)
else:
cli_success("No one-to-many matches detected")To adapt the code:
- Lines 15–19: Adjust which columns to include in the review output.
One-to-many matches often signal underlying issues in the data. They may arise from duplicate or near-duplicate facility records in DHIS2, or from generic names in the MFL that align with multiple, more specific DHIS2 entries. In other cases, they reflect inconsistent naming conventions that need further standardization before integration.
Step 10.2: Unmatched facilities analysis
Analyze facilities that remain unmatched to identify potential reasons and create actionable review lists for the SNT team. Unmatched facilities could represent new facilities, closed facilities, or facilities with significant data quality issues.
Show the code
# analyze unmatched facilities by reason
unmatched_analysis <- final_facilities_all |>
dplyr::filter(is.na(hf_mfl_raw)) |>
dplyr::group_by(final_method) |>
dplyr::summarise(
count = n(),
.groups = "drop"
) |>
dplyr::arrange(desc(count))
# display unmatched analysis
cli::cli_h2("Unmatched Facilities Analysis")
cli::cli_alert_info(
"Total unmatched facilities: {sum(unmatched_analysis$count)}"
)
# show breakdown by reason
purrr::pwalk(
unmatched_analysis,
~ cli::cli_alert_warning("{.x}: {.y} facilities")
)
# create detailed unmatched facilities export
unmatched_detailed <- final_facilities_one_per_hf |>
# unmatched are those without an MFL linkage
dplyr::filter(is.na(hf_mfl_raw)) |>
# select required context and scoring/method fields
dplyr::select(
hf_dhis2_raw,
adm1,
adm2,
adm3,
final_method,
score
) |>
# derive a clear potential reason for review
dplyr::mutate(
potential_reason = dplyr::case_when(
is.na(final_method) ~ "Never matched - possible new facility",
final_method == "Manual Review - Rejected" ~
"Rejected in review - possible different facility",
score > 0 & score < 50 ~ "Low similarity - possible data quality issue",
TRUE ~ "Unmatched - needs investigation"
)
) |>
dplyr::arrange(adm1, adm2, adm3, hf_dhis2_raw)
# export for SNT team review
readr::write_csv(
unmatched_detailed,
here::here("03_outputs", "unmatched_facilities_for_snt_review.csv")
)
cli::cli_alert_success(
"Exported {nrow(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli::cli_text("File: outputs/unmatched_facilities_for_snt_review.csv")To adapt the code:
- Lines 3–9: Analysis groups unmatched facilities by reason (final_method). Customize grouping as needed.
- Lines 23–46: Detailed export includes geographic context and potential reasons. Modify columns as needed.
- Lines 49–52: Export path for SNT team review can be customized.
- Lines 38–44: The potential reasons logic can be adjusted based on the specific workflow patterns.
Show the code
# analyze unmatched facilities by reason
unmatched_analysis = (
final_facilities_all
.loc[final_facilities_all["hf_mfl_raw"].isna()]
.groupby("final_method", as_index=False)
.size()
.rename(columns={"size": "count"})
.sort_values("count", ascending=False)
)
# display unmatched analysis
cli_header("Unmatched Facilities Analysis")
cli_info(f"Total unmatched facilities: {unmatched_analysis['count'].sum()}")
# show breakdown by reason
for _, row in unmatched_analysis.iterrows():
cli_warning(f"{row['final_method']}: {row['count']} facilities")
# create detailed unmatched facilities export
unmatched_detailed = (
final_facilities_one_per_hf
.loc[final_facilities_one_per_hf["hf_mfl_raw"].isna()]
[["hf_dhis2_raw", "adm1", "adm2", "adm3", "final_method", "score"]]
.assign(
potential_reason=lambda d: np.select(
[
d["final_method"].isna(),
d["final_method"] == "Manual Review - Rejected",
(d["score"] > 0) & (d["score"] < 50),
],
[
"Never matched - possible new facility",
"Rejected in review - possible different facility",
"Low similarity - possible data quality issue",
],
default="Unmatched - needs investigation",
)
)
.sort_values(["adm1", "adm2", "adm3", "hf_dhis2_raw"])
)
# export for SNT team review
unmatched_export_dir = Path(here("03_outputs"))
unmatched_export_dir.mkdir(parents=True, exist_ok=True)
unmatched_detailed.to_csv(
unmatched_export_dir / "unmatched_facilities_for_snt_review.csv", index=False
)
cli_success(
f"Exported {len(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli_info("File: outputs/unmatched_facilities_for_snt_review.csv")To adapt the code:
- Lines 3–9: Analysis groups unmatched facilities by reason (
final_method). Customize grouping as needed. - Lines 22–37: Detailed export includes geographic context and potential reasons. Modify columns as needed.
- Lines 41–43: Export path for SNT team review can be customized.
- Lines 29–35: The potential reasons logic can be adjusted based on the specific workflow patterns.
This check helps the SNT team focus their review on the most important cases. Some unmatched facilities may represent new sites that should be added to the MFL, while others point to data quality issues where name variations need correction. A portion may be closed facilities that need to be marked inactive. The results can also highlight geographic patterns, showing areas where unmatched facilities are more common and may require targeted review.
Step 11: Save Final Datasets
Save all final datasets for analysis, reporting, and SNT team review. This step consolidates all export operations and creates the outputs needed for downstream use.
Show the code
# save main integrated DHIS2-MFL dataset
rio::export(
final_dhis2_mfl_df,
here::here("03_outputs", "final_dhis2_mfl_integrated.xlsx")
)
# save matching results summary
rio::export(
final_facilities_all,
here::here("03_outputs", "facility_matching_results.xlsx")
)
if (nrow(unmatched_facilities) > 0) {
readr::write_csv(
unmatched_facilities,
here::here("03_outputs", "unmatched_dhis2_facilities.csv")
)
}
if (nrow(one_to_many) > 0) {
readr::write_csv(
one_to_many,
here::here("03_outputs", "one_to_many_matches_for_review.csv")
)
}
# summary of saved files
cli::cli_h2("Final Datasets Saved")
cli::cli_alert_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli::cli_alert_success("Matching results: facility_matching_results.xlsx")
cli::cli_alert_info("Unmatched facilities: {nrow(unmatched_facilities)} exported")
cli::cli_alert_info("One-to-many matches: {nrow(one_to_many)} exported")To adapt the code:
- Lines 3–7: Main integrated dataset path can be customized for the project structure.
- Lines 9–13: Matching results summary export path.
- Lines 14–19: Unmatched facilities export - modify path as needed.
- Lines 21–26: One-to-many matches export - automatic detection and export.
- Lines 29–33: Summary messages can be customized.
Show the code
# save main integrated DHIS2-MFL dataset
output_dir = Path(here("03_outputs"))
output_dir.mkdir(parents=True, exist_ok=True)
final_dhis2_mfl_df.to_excel(
output_dir / "final_dhis2_mfl_integrated.xlsx", index=False
)
# save matching results summary
final_facilities_all.to_excel(
output_dir / "facility_matching_results.xlsx", index=False
)
if len(unmatched_detailed) > 0:
unmatched_detailed.to_csv(
output_dir / "unmatched_dhis2_facilities.csv", index=False
)
if len(one_to_many) > 0:
one_to_many.to_csv(
output_dir / "one_to_many_matches_for_review.csv", index=False
)
# summary of saved files
cli_header("Final Datasets Saved")
cli_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli_success("Matching results: facility_matching_results.xlsx")
cli_info(f"Unmatched facilities: {len(unmatched_detailed)} exported")
cli_info(f"One-to-many matches: {len(one_to_many)} exported")To adapt the code:
- Lines 3–6: Main integrated dataset path can be customized for the project structure.
- Lines 8–10: Matching results summary export path.
- Lines 12–14: Unmatched facilities export — modify path as needed.
- Lines 16–18: One-to-many matches export — automatic detection and export.
- Lines 21–24: Summary messages can be customized.
Summary
This guide outlines a structured approach to matching health facility names between DHIS2 datasets and the Master Facility List (MFL). It features a two-stage workflow designed for efficiency and accuracy. Users may implement the full workflow or apply fuzzy matching directly if only basic string similarity is required.
Stage 1: Interactive Stratified Geographic Matching uses prep_geonames() (available as sntutils::prep_geonames() in R and sntutils.geo.prep_geonames() in Python) to combine human validation with geographic constraints. It typically resolves 60–70% of matches with high confidence, exposes misaligned facilities early, and provides opportunities for immediate review of suggested matches.
Stage 2: Comprehensive Fuzzy Matching addresses the remaining unmatched facilities through cleaning and standardization of names, applying multiple string similarity algorithms (Jaro-Winkler, Levenshtein, Qgram, LCS), and checking structural quality. This helps capture spelling variations, abbreviations, and geographic misplacements that prevent exact matching.
The two-stage approach resolves most matches through stratified geographic matching, reducing manual review and leaving only difficult cases for fuzzy matching. It detects misalignments early, improves confidence through human validation, and remains flexible for both interactive and automated use. By combining multiple strategies, the approach achieves broad coverage while maintaining quality.
For facilities that require fuzzy matching, four strategies are available: single-method matching, composite scoring, weighted composites, and fallback loops with progressive thresholds. Each is suited to different data contexts and confidence needs. Transparency and comparability are built into the process, but validation with the SNT team remains essential for low-confidence matches and critical applications.
Full Code
Find the full code script for fuzzy matching of health facility names below.
Show full code
################################################################################
############ ~ Fuzzy matching of names across datasets full code ~ #############
################################################################################
### Step 1: Install and Load Required Libraries --------------------------------
# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# load all required packages using pacman
pacman::p_load(
readxl, # for reading Excel files
dplyr, # for data manipulation
stringdist, # for calculating string distances (fuzzy matching)
tibble, # for working with modern data frames
knitr, # for creating formatted tables
openxlsx, # for writing Excel files
httr, # for HTTP requests to download files (optional)
here # for cross-platform file paths
)
### Step 2: Load Data ----------------------------------------------------------
# set up path to dhis2 hf data
dhis2_path <- here::here(
"01_data",
"1.2_epidemiology",
"1.2a_routine_surveillance",
"processed"
)
hf_path <- here::here(
"01_data",
"1.1_foundational",
"1.1c_health_facilities",
"processed"
)
# read DHIS2 health facilities data
dhis2_df <- readRDS(
here::here(dhis2_path, "sle_dhis2_with_clean_adm3.rds")
) |>
# rename facility name columns for clarity and consistency
dplyr::rename(hf_dhis2_raw = hf)
# get distinct admin and hf cols
dhis2_hf_df <- dhis2_df |>
dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw)
# read MFL health facilities data
master_hf_df <- read.csv(
here::here(hf_path, "hf_final_clean_data.csv")
) |>
dplyr::distinct(
adm0, adm1, adm2, adm3, hf, lat, long, .keep_all = TRUE
) |>
# rename facility name columns for clarity and consistency
dplyr::mutate(hf_mfl_raw = hf)
# attach a stable DHIS2 facility ID for consistent counting across
# steps
# distinct by geography + name to avoid collisions across admins
dhis2_map <- dhis2_df |>
dplyr::distinct(adm0, adm1, adm2, adm3, hf) |>
dplyr::mutate(
hf_uid_new = sntutils::vdigest(
paste(
tolower(stringr::str_squish(adm0)),
tolower(stringr::str_squish(adm1)),
tolower(stringr::str_squish(adm2)),
tolower(stringr::str_squish(adm3)),
tolower(stringr::str_squish(hf)),
sep = "|"
),
algo = "xxhash32"
)
)
# print head of data
cli::cli_h3("Sample of DHIS2 data:")
head(dhis2_hf_df)
cli::cli_h3("Sample of MFL data:")
head(master_hf_df)
### Step 3: Initial Matching Diagnostics ---------------------------------------
#### Step 3.1: Overall exact match check ---------------------------------------
# check for exact matches on raw names (no admin constraint)
exact_matches_all <- dhis2_hf_df |>
dplyr::inner_join(
master_hf_df,
by = c("hf_dhis2_raw" = "hf_mfl_raw"),
relationship = "many-to-many"
)
# calculate matching potential
total_dhis2 <- nrow(dhis2_hf_df)
total_mfl <- nrow(master_hf_df)
unmatched_dhis2 <- total_dhis2 - nrow(exact_matches_all)
cli::cli_h3("Overall Matching Summary")
cli::cli_alert_info("Total DHIS2 facilities: {total_dhis2}")
cli::cli_alert_info("Total MFL facilities: {total_mfl}")
cli::cli_alert_success(
paste0(
"Exact matches found: {nrow(exact_matches_all)} (",
"{round(nrow(exact_matches_all)/total_dhis2*100, 1)}%)
)
)
cli::cli_alert_warning("Remaining to match: {unmatched_dhis2}")
#### Step 3.2: Administrative level match check --------------------------------
# check matches at adm2 level (district)
dhis2_by_adm2 <- dhis2_hf_df |>
dplyr::group_by(adm2) |>
dplyr::summarise(total_dhis2 = dplyr::n())
matches_by_adm2 <- dhis2_hf_df |>
dplyr::inner_join(
master_hf_df,
by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2")
) |>
dplyr::group_by(adm2) |>
dplyr::summarise(exact_matches = dplyr::n()) |>
dplyr::left_join(dhis2_by_adm2, by = "adm2") |>
dplyr::mutate(
match_rate = round(exact_matches / total_dhis2 * 100, 1)
) |>
dplyr::select(adm2, exact_matches, total_dhis2, match_rate) |>
dplyr::arrange(dplyr::desc(match_rate))
cli::cli_h3("Exact matches by district (adm2))
matches_by_adm2
# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 <- dhis2_hf_df |>
dplyr::group_by(adm2, adm3) |>
dplyr::summarise(total_dhis2 = dplyr::n(), .groups = "drop")
matches_by_adm3 <- dhis2_hf_df |>
dplyr::inner_join(
master_hf_df,
by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2", "adm3")
) |>
dplyr::group_by(adm2, adm3) |>
dplyr::summarise(exact_matches = dplyr::n(), .groups = "drop") |>
dplyr::left_join(dhis2_by_adm3, by = c("adm2", "adm3")) |>
dplyr::mutate(
match_rate = round(exact_matches / total_dhis2 * 100, 1)
) |>
dplyr::filter(total_dhis2 >= 5) |> # only show areas with 5+ facilities
dplyr::arrange(dplyr::desc(match_rate)) |>
dplyr::slice_head(n = 10) # show top 10 performing adm3 areas
cli::cli_h3("Top exact matches by chiefdom (adm3))
matches_by_adm3
#### Step 3.3: Duplicate name check --------------------------------------------
# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 <- dhis2_hf_df |>
dplyr::group_by(adm2, hf_dhis2_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, hf_dhis2_raw)
mfl_dups_adm2 <- master_hf_df |>
dplyr::group_by(adm2, hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, hf_mfl_raw)
cli::cli_h3("Duplicates within same district (adm2)")
cli::cli_alert_warning(
paste0(
"DHIS2 duplicates within districts: ",
"{length(unique(dhis2_dups_adm2$hf_dhis2_raw))}"
)
)
cli::cli_alert_warning(
paste0(
"MFL duplicates within districts: ",
"{length(unique(mfl_dups_adm2$hf_mfl_raw))}"
)
)
# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 <- dhis2_hf_df |>
dplyr::group_by(adm2, adm3, hf_dhis2_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, adm3, hf_dhis2_raw)
mfl_dups_adm3 <- master_hf_df |>
dplyr::group_by(adm2, adm3, hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(adm2, adm3, hf_mfl_raw)
cli::cli_h3("Duplicates within same chiefdom (adm3)")
cli::cli_alert_danger(
paste0(
"DHIS2 duplicates within chiefdoms: ",
"{length(unique(dhis2_dups_adm3$hf_dhis2_raw))}"
)
)
cli::cli_alert_danger(
paste0(
"MFL duplicates within chiefdoms: ",
"{length(unique(mfl_dups_adm3$hf_mfl_raw))}"
)
)
# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall <- dhis2_hf_df |>
dplyr::group_by(hf_dhis2_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(hf_dhis2_raw)
mfl_dups_overall <- master_hf_df |>
dplyr::group_by(hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |>
dplyr::arrange(hf_mfl_raw)
cli::cli_h3("Overall duplicate names (across all areas)")
cli::cli_alert_info(
"DHIS2 overall duplicates: {length(
unique(dhis2_dups_overall$hf_dhis2_raw)
)}"
)
cli::cli_alert_info(
"MFL overall duplicates: {length(
unique(mfl_dups_overall$hf_mfl_raw)
)}"
)
### Step 4: Interactive Stratified Geographic Matching -------------------------
# inner join (keep only matched polygons)
# set up location to save cache
cache_loc <- "01_data/1.1_foundational/1.1f_cache_files/processed"
# interactive stratified matching with automatic standardization
# this function handles name standardization internally
dhis2_df_cleaned <-
sntutils::prep_geonames(
target_df = dhis2_hf_df, # dataset to be cleaned
lookup_df = master_hf_df, # reference dataset with correct admin
level0 = "adm0",
level1 = "adm1",
level2 = "adm2",
level3 = "adm3",
level4 = "hf",
cache_path = here::here(cache_loc, "geoname_cache.rds"),
unmatched_export_path = here::here(cache_loc, "dhis2_hf_unmatched.rds")
)
# load the unmatched facilities for further processing (step 5–8 fuzzy
# matching)
dhis2_hf_to_process <- readRDS(
here::here(cache_loc, "dhis2_hf_unmatched.rds")
) |>
dplyr::select(adm0, adm1, adm2, adm3, hf_dhis2_raw = hf)
# summary statistics
n_original <- nrow(dhis2_hf_df)
n_matched <- n_original - nrow(dhis2_hf_to_process)
match_rate <- (n_matched / n_original) * 100
cli::cli_alert_success(
paste0(
"Stratified matching completed: ",
"{format(n_matched, big.mark = ',')}/{format(n_original, big.mark = ',')}",
" facilities matched ({round(match_rate, 1)}%)"
)
)
cli::cli_alert_info(
"Remaining unmatched: {nrow(dhis2_hf_to_process)} facilities"
)
### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------
#### Step 5.1: Standardize health facility names -------------------------------
# create function to standardize hf names
standardize_names <- function(name_vec) {
# validate input
if (!rlang::is_atomic(name_vec)) {
cli::cli_abort("`name_vec` must be an atomic vector.")
}
name_vec |>
# ensure character type
as.character() |>
# convert to lowercase
stringr::str_to_lower() |>
# replace punctuation with space
stringr::str_replace_all("[[:punct:]]", " ") |>
# remove extra spaces and trim
stringr::str_squish() |>
# normalize accents
stringi::stri_trans_general("Latin-ASCII") |>
# normalize all space-like characters
stringi::stri_replace_all_regex("\\p{Zs}+", " ") |>
# convert roman numerals to arabic numerals
stringr::str_replace_all(
c(
" ix\\b" = " 9",
" viii\\b" = " 8",
" vii\\b" = " 7",
" vi\\b" = " 6",
" v\\b" = " 5",
" iv\\b" = " 4",
" iii\\b" = " 3",
" ii\\b" = " 2",
" i\\b" = " 1"
)
) |>
# sort tokens with letters first, numbers last; within groups
# alphabetical
purrr::map_chr(\(.x) {
# split on one-or-more spaces
tokens <- strsplit(.x, " +")[[1]]
# detect pure numeric tokens
is_num <- stringr::str_detect(tokens, "^[0-9]+$")
# order alphabetic first, then numeric; sort within each group
ordered <- c(sort(tokens[!is_num]), sort(tokens[is_num]))
# rejoin
paste(ordered, collapse = " ")
})
}
# set up example with messy formatting
example_word <- factor("Clínica! Rahmâ IV ( New clinic) East")
# print original structure
cat("\nExample before standardization:\n")
str(example_word)
# apply standardization
example_word_st <- standardize_names(example_word)
# print cleaned example
cat("\nExample after standardization:\n")
str(example_word_st)
# remove duplicates from MFL and format hf col
master_hf_df <- master_hf_df |>
dplyr::distinct(hf_mfl_raw, .keep_all = TRUE) |>
dplyr::mutate(hf_mfl = standardize_names(hf_mfl_raw))
# important: preserve original full dataset before processing unmatched
# (this is our base N)
dhis2_hf_df_original <- dhis2_hf_df |>
dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw))
# build geo-aware stable ID once and attach to originals and unmatched
hf_uid_new_map <- dhis2_hf_df |>
dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
dplyr::mutate(
hf_uid_new = sntutils::vdigest(
paste(
tolower(stringr::str_squish(adm0)),
tolower(stringr::str_squish(adm1)),
tolower(stringr::str_squish(adm2)),
tolower(stringr::str_squish(adm3)),
tolower(stringr::str_squish(hf_dhis2_raw)),
sep = "|"
),
algo = "xxhash32"
)
)
dhis2_hf_df_original <- dhis2_hf_df_original |>
dplyr::left_join(
hf_uid_new_map,
by = c(
"adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
)
)
# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched <- dhis2_hf_to_process |>
dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw)) |>
dplyr::left_join(
hf_uid_new_map |>
dplyr::mutate(hf_dhis2_raw = toupper(hf_dhis2_raw)),
by = c(
"adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
)
)
knitr::kable(
# check if it worked
dhis2_hf_unmatched |>
dplyr::distinct(hf_dhis2_raw, hf_dhis2) |>
dplyr::slice_head(n = 10)
)
#### Step 5.2: Handle abbreviations in health facility names -------------------
abbrev_dictionary <-
dplyr::bind_rows(
dplyr::select(dhis2_hf_unmatched, hf = hf_dhis2_raw),
dplyr::select( master_hf_df, hf = hf_mfl_raw)) |>
# split into words
tidyr::separate_rows(hf, sep = " ") |>
# remove blanks
dplyr::filter(hf != "") |>
# detect 2-4 uppercase letter patterns
dplyr::filter(stringr::str_detect(hf, "^[A-Z]{2,4}$")) |>
# count frequencies
dplyr::count(hf, sort = TRUE) |>
# rename for clarity
dplyr::rename(word = hf, freq = n) |>
dplyr::filter(freq > 2) |>
as.data.frame()
# check output
abbrev_dictionary
# define abbreviation dictionary (all lowercase, because we use the
# standardize column)
abbrev_dict <- c(
"maternal child health post" = "mchp",
"community health post" = "chp",
"community health center" = "chc",
"urban maternal clinic" = "umi",
"expanded programme on immunization" = "epi"
)
# apply replacements to MFL names
master_hf_df <- master_hf_df |>
dplyr::mutate(
hf_mfl = stringr::str_replace_all(hf_mfl, abbrev_dict)
)
# apply replacements to DHIS2 names
dhis2_hf_unmatched <- dhis2_hf_unmatched |>
dplyr::mutate(
hf_dhis2 = stringr::str_replace_all(hf_dhis2, abbrev_dict)
)
# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched |>
dplyr::filter(
stringr::str_detect(
hf_dhis2_raw,
paste0(
"(?i)Community Health Center|Maternal Child Health Post|",
"Community Health Post"
)
) |
stringr::str_detect(hf_dhis2_raw, "CHC|MCHP|CHP")
) |>
dplyr::select(hf_dhis2_raw, hf_dhis2) |>
head()
### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------
#### Step 6.1: Identify matched and unmatched facilities -----------------------
# flag for one-to-one enforcement
enforce_one_to_one <- FALSE
# exact matches using raw names
matched_dhis2_raw <- dhis2_hf_unmatched |>
dplyr::select(
adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_uid_new
) |>
dplyr::inner_join(
master_hf_df |>
dplyr::select(hf_mfl_raw),
by = c("hf_dhis2_raw" = "hf_mfl_raw")
)
# exact matches using standardized names
matched_dhis2 <- dhis2_hf_unmatched |>
dplyr::select(
adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_dhis2, hf_uid_new
) |>
dplyr::inner_join(
master_hf_df |>
dplyr::select(hf_mfl_raw, hf_mfl),
by = c("hf_dhis2" = "hf_mfl"),
keep = TRUE
) |>
# exclude facilities already matched via raw names
dplyr::anti_join(matched_dhis2_raw, by = c("hf_dhis2_raw")) |>
# label exact matches
dplyr::mutate(
final_method = paste0(
"Matched Without Fuzzy Matching (standardization)"
),
score = 100
)
# only include facilities that were actually matched via geographic
# standardization
# these should be facilities that had geographic corrections applied
matched_dhis2_prepgeoname <-
dhis2_df_cleaned |>
dplyr::anti_join(
dhis2_hf_unmatched,
by = c("adm0", "adm1", "adm2", "adm3", "hf" = "hf_dhis2_raw")
) |>
dplyr::left_join(
master_hf_df |>
dplyr::mutate(hf = toupper(hf)) |>
dplyr::select(adm0, adm1, adm2, adm3, hf, hf_mfl_raw, hf_mfl),
by = c("adm0", "adm1", "adm2", "adm3", "hf")
) |>
# attach hf_uid_new from dhis2_map (by admin + hf)
dplyr::left_join(
dhis2_map |> dplyr::select(adm0, adm1, adm2, adm3, hf, hf_uid_new),
by = c("adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw" = "hf")
) |>
# label matches
dplyr::mutate(
final_method = "Interactive Stratified Geographic Matching",
score = 100
)
# compute unmatched after standardization
unmatched_dhis2 <- dhis2_hf_unmatched |>
dplyr::select(hf_dhis2) |>
dplyr::anti_join(
master_hf_df |>
dplyr::select(hf_mfl),
by = c("hf_dhis2" = "hf_mfl")
) |>
dplyr::distinct(hf_dhis2)
# collect already matched mfl to enforce one-to-one in candidate pool
used_mfl_stand <- matched_dhis2 |>
dplyr::pull(hf_mfl) |>
unique()
use_mfl_prepgeoname <- matched_dhis2_prepgeoname |>
dplyr::pull(hf_mfl) |>
unique()
used_mfl <- c(used_mfl_stand, use_mfl_prepgeoname)
# build candidate MFL pool for fuzzy step
candidate_mfl_df <- master_hf_df |>
dplyr::select(hf_mfl)
# if enforcing one-to-one, drop MFLs already used by exact matches
if (enforce_one_to_one) {
candidate_mfl_df <- candidate_mfl_df |>
dplyr::filter(!hf_mfl %in% used_mfl)
}
# summary counts
total_dhis2_hf <- dplyr::n_distinct(dhis2_hf_unmatched$hf_dhis2_raw)
raw_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2_raw$hf_dhis2_raw)
raw_unmatch_dhis2_hf <- total_dhis2_hf - raw_match_dhis2_hf
standardized_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2$hf_dhis2)
standardized_unmatch_dhis2_hf <- total_dhis2_hf - standardized_match_dhis2_hf
total_mfl_hf <- dplyr::n_distinct(master_hf_df$hf_mfl)
candidate_mfl_count <- dplyr::n_distinct(candidate_mfl_df$hf_mfl)
# display summary
cli::cli_h2("Summary of Matching Status")
cli::cli_alert_info(
"Total DHIS2 facilities: {format(total_dhis2_hf, big.mark = ',')}"
)
cli::cli_alert_success(
paste0(
"Matched after standardization: ",
"{format(standardized_match_dhis2_hf, big.mark = ',')}"
)
)
cli::cli_alert_danger(
paste0(
"Unmatched using raw names: ",
"{format(raw_unmatch_dhis2_hf, big.mark = ',')}"
)
)
cli::cli_alert_danger(
paste0(
"Unmatched after standardization: ",
"{format(standardized_unmatch_dhis2_hf, big.mark = ',')}"
)
)
cli::cli_alert_info(
paste0(
"MFL facilities in candidate pool for matching: ",
"{format(candidate_mfl_count, big.mark = ',')} of ",
"{format(total_mfl_hf, big.mark = ',')}"
)
)
#### Step 6.1: Create the matching grid ----------------------------------------
# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities.
# create simple candidate match pool
candidate_match_pool <- tidyr::crossing(
unmatched_dhis2,
candidate_mfl_df
)
# create match grid with dhis2 and mfl names
match_grid <- candidate_match_pool
# preview sample - top 10 matches
match_grid |>
dplyr::slice_head(n = 10)
#### Step 6.2: Compute similarity scores ---------------------------------------
# calculate fuzzy scores
match_grid <- match_grid |>
dplyr::mutate(
len_max = pmax(nchar(hf_dhis2), nchar(hf_mfl)),
score_jw = 1 - stringdist::stringdist(hf_dhis2, hf_mfl, method = "jw"),
score_lv = 1 -
stringdist::stringdist(hf_dhis2, hf_mfl, method = "lv") / len_max,
score_qg = 1 -
stringdist::stringdist(hf_dhis2, hf_mfl, method = "qgram") / len_max,
score_lcs = 1 -
stringdist::stringdist(hf_dhis2, hf_mfl, method = "lcs") / len_max
) |>
dplyr::mutate(
dplyr::across(
.cols = dplyr::starts_with("score_"),
.fns = ~ ifelse(is.nan(.x) | .x < 0, 0, .x)
),
dplyr::across(
.cols = dplyr::contains("score_"),
.fns = ~ .x * 100
)
)
#### Step 6.3: Create composite scores -----------------------------------------
# define similarity score columns
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")
# compute composite_score dynamically using score_cols
match_grid <- match_grid |>
dplyr::mutate(
composite_score = rowMeans(
dplyr::across(dplyr::all_of(score_cols))
)
)
# compute average rank across all similarity methods
ranked_grid <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::mutate(
dplyr::across(
dplyr::all_of(score_cols),
~ dplyr::min_rank(dplyr::desc(.)),
.names = "rank_{.col}"
)
) |>
dplyr::ungroup() |>
dplyr::mutate(
rank_avg = rowSums(dplyr::across(
dplyr::all_of(paste0("rank_", score_cols))
)),
rank_avg = round(rank_avg / length(score_cols))
)
# preview ranked results to show multiple candidates per DHIS2 facility
# select a few facilities and show their top 3 candidates each
sample_facilities <- c(
"charity clinic kamba of",
"arab clinic shad",
"al arab clinic sheefa"
)
ranked_grid |>
dplyr::filter(hf_dhis2 %in% sample_facilities) |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_min(rank_avg, n = 4, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::select(
hf_dhis2,
hf_mfl,
dplyr::all_of(score_cols),
composite_score,
rank_avg
) |>
dplyr::arrange(hf_dhis2, rank_avg)
#### Step 6.4: Extract best match ----------------------------------------------
# for each fuzzy matching method, we select the top-ranked MFL name
# per DHIS2 hf.
best_jw <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_jw, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Jaro-Winkler") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_jw, method
)
best_lv <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_lv, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Levenshtein") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_lv, method
)
best_qg <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_qg, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Qgram") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_qg, method
)
best_lcs <- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(score_lcs, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "LCS") |>
dplyr::select(
hf_dhis2, hf_mfl, score = score_lcs, method
)
best_comp<- match_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_max(composite_score, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Composite-Score") |>
dplyr::select(
hf_dhis2, hf_mfl, score = composite_score, method
)
best_ranked_match <- ranked_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::slice_min(rank_avg, with_ties = FALSE) |>
dplyr::ungroup() |>
dplyr::mutate(method = "Rank-Ensemble") |>
dplyr::select(
hf_dhis2, hf_mfl, score = rank_avg, method
)
# combine top matches from all methods for comparison across approaches
all_best <- dplyr::bind_rows(
best_jw, best_lv, best_qg, best_lcs,
best_comp, best_ranked_match)
# preview best matches to show extraction results
all_best |>
dplyr::slice_head(n = 10) |>
dplyr::select(hf_dhis2, hf_mfl, score, method)
### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------
#### Step 7.1: Visualize score distributions across methods --------------------
# plot score dist across methods
score_dist_plot <- all_best |>
dplyr::filter(method != "Rank-Ensemble") |>
ggplot2::ggplot(ggplot2::aes(x = score)) +
ggplot2::geom_density(fill = "steelblue", alpha = 0.6) +
ggplot2::facet_wrap(~method) +
ggplot2::geom_density(color = "steelblue", linewidth = 1) +
ggplot2::labs(
title = "Distribution of Fuzzy Matching Scores",
x = "\nMatch Score (%)",
y = "Density\n"
) +
ggplot2::theme_minimal(base_size = 14)
# save plot
ggplot2::ggsave(
plot = score_dist_plot,
filename = here::here("03_output/3a_figures/u5mr_sle_adm2.png"),
width = 12,
height = 9,
dpi = 300
)
#### Step 7.2: Define match quality diagnostic function ------------------------
assess_match_quality <- function(name1, name2) {
purrr::map2_dfr(name1, name2, function(a, b) {
tokens1 <- strsplit(a, "\\s+")[[1]]
tokens2 <- strsplit(b, "\\s+")[[1]]
tibble::tibble(
prefix_match = tolower(tokens1[1]) == tolower(tokens2[1]),
suffix_match = tolower(tail(tokens1, 1)) == tolower(tail(tokens2, 1)),
token_diff = abs(length(tokens1) - length(tokens2)),
char_diff = abs(nchar(a) - nchar(b))
)
})
}
assess_match_quality("Makeni Govt Hospital", "Makeni Government Hospital")
#### Step 7.3: Evaluate quality of top fuzzy matches ---------------------------
# assess structural match quality
diagnostics_df <- dplyr::bind_cols(
all_best,
assess_match_quality(all_best$hf_dhis2, all_best$hf_mfl)
)
# compare method diagnostics
summary_stats <- diagnostics_df |>
dplyr::group_by(method) |>
dplyr::summarise(
score = mean(score),
avg_token_diff = mean(token_diff) |> round(2),
avg_char_diff = mean(char_diff) |> round(2),
pct_prefix_match = (mean(prefix_match) * 100) |> round(2),
pct_suffix_match = (mean(suffix_match) * 100) |> round(2),
total = dplyr::n(),
.groups = "drop"
)
# create a a final overall score
summary_stats <- summary_stats |>
dplyr::mutate(
# rescale negative of average token difference (smaller is better)
token_score = scales::rescale(-avg_token_diff, to = c(0, 100)),
# rescale negative of average character difference (smaller is better)
char_score = scales::rescale(-avg_char_diff, to = c(0, 100)),
# rescale prefix match percentage (higher is better)
prefix_score = scales::rescale(pct_prefix_match, to = c(0, 100)),
# rescale suffix match percentage (higher is better)
suffix_score = scales::rescale(pct_suffix_match, to = c(0, 100)),
# combine all four metrics into a weighted structural quality score
structure_score = round(
0.3 * token_score + # emphasize fewer token differences
0.2 * char_score + # moderate weight on character similarity
0.25 * prefix_score + # give weight to matching initial words
0.25 * suffix_score, # give equal weight to matching final words
1
)
) |>
dplyr::arrange(desc(structure_score)) |>
# assign rank based on descending structure score
dplyr::mutate(rank = dplyr::row_number()) |>
dplyr::select(
method, avg_token_diff, avg_char_diff,
pct_prefix_match, pct_suffix_match,
total, structure_score, rank
)
# check results
summary_stats |>
dplyr::select(
Method = method,
`Avg. Token Difference` = avg_token_diff,
`Avg. Character Difference` = avg_char_diff,
`% Prefix Match` = pct_prefix_match,
`% Suffix Match` = pct_suffix_match,
`Structural Score` = structure_score,
Rank = rank
) |> as.data.frame()
#### Step 7.4: Weighted thresholding using structural quality and false-positive risk
# calculate method-specific weights and thresholds
method_threshold <- summary_stats |>
dplyr::mutate(
score = structure_score / 100,
weight = score / sum(score),
threshold = scales::rescale(
structure_score,
to = rev(unname(
stats::quantile(all_best$score, c(0.70, 0.95), na.rm = TRUE)
))
) |>
round()
) |>
dplyr::select(method, weight, threshold)
# display results
method_threshold
### Step 8: Finalize Fuzzy Match Selection -------------------------------------
#### Step 8.1: Match selection options -----------------------------------------
# apply single-method threshold and flag matches
best_lv_final <- best_lv |>
dplyr::mutate(
match_flag = dplyr::if_else(score >= 85, "match", "review"),
final_method = "Levenshtein"
) |>
dplyr::left_join(
dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
by = "hf_mfl"
) |>
dplyr::select(hf_dhis2, hf_mfl, hf_mfl_raw, score, final_method, match_flag)
# show top 5 and bottom 5 matches by score
top_5 <- best_lv_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- best_lv_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Levenshtein Score")
# preview results
top_bottom
# select best match using rank average
composite_final <- ranked_grid |>
dplyr::group_by(hf_dhis2) |>
dplyr::arrange(rank_avg) |>
dplyr::slice_head(n = 1) |>
dplyr::ungroup() |>
dplyr::mutate(
match_flag = dplyr::if_else(composite_score >= 85, "match", "review")
) |>
dplyr::select(
adm1,
adm2,
hf_dhis2,
hf_mfl,
composite_score,
rank_avg,
match_flag
)
# show top 5 and bottom 5 matches by composite score
top_5 <- composite_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- composite_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Composite Score")
# preview results
top_bottom
# define score columns and extract weights
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")
weights <- method_threshold |>
dplyr::filter(
method %in% c("Jaro-Winkler", "Levenshtein", "Qgram", "LCS")
) |>
dplyr::pull(weight)
# calculate weighted composite score for each facility-match pair
weighted_final <- ranked_grid |>
dplyr::rowwise() |>
dplyr::mutate(
# weighted average maintains 0-100 scale
weighted_composite_score = stats::weighted.mean(
c(score_jw, score_lv, score_qg, score_lcs),
weights,
na.rm = TRUE
)
) |>
dplyr::ungroup() |>
dplyr::group_by(hf_dhis2) |>
dplyr::arrange(dplyr::desc(weighted_composite_score)) |>
dplyr::slice_head(n = 1) |>
dplyr::ungroup() |>
dplyr::mutate(
match_flag = dplyr::if_else(
weighted_composite_score >= 85,
"match",
"review"
),
score = weighted_composite_score, # standardize column name for Step 7
final_method = "Weighted-Composite"
) |>
dplyr::left_join(
dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
by = "hf_mfl"
) |>
dplyr::select(
hf_dhis2,
hf_mfl,
hf_mfl_raw,
score,
final_method,
match_flag
)
# show top 5 and bottom 5 matches by weighted composite score
top_5 <- weighted_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- weighted_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Weighted Composite Score")
# preview results
bottom_5
# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats <- summary_stats |>
dplyr::arrange(dplyr::desc(structure_score)) |>
dplyr::filter(method != "Rank-Ensemble") # exclude rank-based method
method_order <- method_stats$method
# use thresholds calculated in step 7.4
method_thresholds_df <- method_threshold |>
dplyr::filter(method != "Rank-Ensemble")
# convert to named vector for easy lookup
method_thresholds <- setNames(
method_thresholds_df$threshold,
method_thresholds_df$method
)
# build combined table of top matches from real methods only
all_best_nonrank <- dplyr::bind_rows(
best_lv,
best_jw,
best_qg,
best_lcs,
best_comp
) |>
# normalize method labels
dplyr::mutate(method = trimws(method))
# add rank once
rank_key <- ranked_grid |>
dplyr::select(hf_dhis2, hf_mfl, rank_avg)
all_best_ranked <- all_best_nonrank |>
dplyr::left_join(rank_key, by = c("hf_dhis2", "hf_mfl")) |>
dplyr::mutate(
# worst for missing ranks
rank_avg = dplyr::if_else(is.na(rank_avg), Inf, rank_avg)
)
# split by method and pre-arrange for stable tiebreak
by_method <- split(all_best_ranked, all_best_ranked$method) |>
purrr::map(\(x) {
x |>
dplyr::arrange(dplyr::desc(score), rank_avg)
})
# keep copy for pass-rate check
initial_thr <- method_thresholds
# init
fallback_chunks <- list()
unmatched <- unique(all_best_ranked$hf_dhis2)
# fast fallback loop (no joins, no grouping inside)
repeat {
for (m in method_order) {
if (!m %in% names(by_method)) {
next
}
# current threshold from step 7.4
thr <- method_thresholds[[m]]
if (is.null(thr)) {
thr <- 85 # fallback if method not found
}
# filter by score and still-unmatched
cand <- by_method[[m]] |>
dplyr::filter(score >= thr, hf_dhis2 %in% unmatched)
if (nrow(cand) > 0) {
# pick first row per hf_dhis2 (pre-arranged by score desc, rank asc)
best_per_hf <- cand |>
dplyr::distinct(hf_dhis2, .keep_all = TRUE) |>
dplyr::mutate(
final_method = glue::glue("Fuzzy-matched using: {m}")
)
# store and update unmatched
fallback_chunks[[length(fallback_chunks) + 1L]] <- best_per_hf
unmatched <- setdiff(unmatched, best_per_hf$hf_dhis2)
}
if (length(unmatched) == 0) break
}
if (length(unmatched) == 0) {
break
}
# relax thresholds gradually
method_thresholds <- method_thresholds - 1
if (any(method_thresholds <= 50)) break # stop if too low
}
# bind once
fallback_matched <- dplyr::bind_rows(fallback_chunks)
# build a 1:1 map from standardized DHIS2 name to admin + raw +
# hf_uid_new
hf_uid_new_map_by_stand <- dhis2_hf_unmatched |>
dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
dplyr::group_by(hf_dhis2) |>
dplyr::summarise(
adm0 = dplyr::first(adm0),
adm1 = dplyr::first(adm1),
adm2 = dplyr::first(adm2),
adm3 = dplyr::first(adm3),
hf_dhis2_raw = dplyr::first(hf_dhis2_raw),
hf_uid_new = dplyr::first(hf_uid_new),
.groups = "drop"
)
# final flags and attach admin + ids
fallback_final <- fallback_matched |>
dplyr::mutate(
match_flag = dplyr::if_else(score >= 85, "match", "review")
) |>
dplyr::left_join(
dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
by = "hf_mfl"
) |>
dplyr::left_join(hf_uid_new_map_by_stand, by = "hf_dhis2") |>
dplyr::select(
adm0, adm1, adm2, adm3,
hf_uid_new,
hf_dhis2_raw,
hf_dhis2,
hf_mfl_raw,
hf_mfl,
score,
final_method,
match_flag
)
top_5 <- fallback_final |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::slice_head(n = 5)
bottom_5 <- fallback_final |>
dplyr::arrange(score) |>
dplyr::slice_head(n = 5)
cli::cli_h3("Top 5 and Bottom 5 Matches by Fallback")
bottom_5 <- dplyr::bind_rows(top_5, bottom_5)
# preview results
bottom_5
#### Step 8.2: Select fuzzy matching approach ----------------------------------
# create fuzzy_matches from step 8 results (use whichever option was
# chosen)
# step 8 produces different result datasets depending on which option
# was
# chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final
fuzzy_matches <- if (exists("fallback_final")) {
dplyr::filter(fallback_final, match_flag == "match")
} else if (exists("composite_final")) {
dplyr::filter(composite_final, match_flag == "match")
} else if (exists("weighted_final")) {
dplyr::filter(weighted_final, match_flag == "match")
} else if (exists("best_lv_final")) {
dplyr::filter(best_lv_final, match_flag == "match")
} else if (exists("best_jw_final")) {
dplyr::filter(best_jw_final, match_flag == "match")
} else if (exists("best_qg_final")) {
dplyr::filter(best_qg_final, match_flag == "match")
} else {
# empty dataframe with expected structure if no step 8 results exist
data.frame(
hf_dhis2_raw = character(0),
hf_dhis2 = character(0),
hf_mfl_raw = character(0),
hf_mfl = character(0),
score = numeric(0),
final_method = character(0),
stringsAsFactors = FALSE
)
} |>
# remove any internal duplicates (keep highest scoring match per
# facility)
dplyr::group_by(hf_dhis2_raw) |>
dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
dplyr::ungroup() |>
# attach admin + hf_uid_new from unmatched (collapse to 1:1 per
# hf_dhis2_raw)
{
hf_uid_new_map_unmatched <- dhis2_hf_unmatched |>
dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
dplyr::group_by(hf_dhis2_raw) |>
dplyr::summarise(
adm0 = dplyr::first(adm0),
adm1 = dplyr::first(adm1),
adm2 = dplyr::first(adm2),
adm3 = dplyr::first(adm3),
hf_uid_new = dplyr::first(hf_uid_new),
.groups = "drop"
)
dplyr::left_join(., hf_uid_new_map_unmatched, by = "hf_dhis2_raw")
}
# display summary of selected approach
cli::cli_h2("Selected Fuzzy Matching Results")
cli::cli_alert_success("Fuzzy matches found: {nrow(fuzzy_matches)}")
if (nrow(fuzzy_matches) > 0) {
score_summary <- fuzzy_matches |>
dplyr::summarise(
avg_score = mean(score, na.rm = TRUE),
min_score = min(score, na.rm = TRUE),
max_score = max(score, na.rm = TRUE),
.groups = "drop"
)
cli::cli_alert_info("Score range: {round(score_summary$min_score, 1)} - {round(score_summary$max_score, 1)}")
cli::cli_alert_info("Average score: {round(score_summary$avg_score, 1)}")
}
#### Step 8.3: Manual review of flagged matches --------------------------------
# prepare review dataset (using fallback results as example)
review_matches <- fallback_final |>
dplyr::filter(match_flag == "review") |>
dplyr::arrange(dplyr::desc(score)) |>
dplyr::mutate(
reviewer_decision = NA_character_, # accept/reject/uncertain
reviewer_notes = NA_character_,
review_date = NA_character_,
reviewer_name = NA_character_
) |>
dplyr::select(
adm0, adm1, adm2, adm3, hf_uid_new,
hf_dhis2_raw,
hf_dhis2,
hf_mfl_raw,
hf_mfl,
score,
final_method,
reviewer_decision,
reviewer_notes,
review_date,
reviewer_name
)
# export for review
readr::write_csv(
review_matches,
here::here("03_outputs", "tables", "facility_matches_for_review.csv")
)
# show summary
cli::cli_h2("Manual Review Required")
cli::cli_alert_info(
"Exported {nrow(review_matches)} matches flagged for review"
)
cli::cli_alert_info(
"File location: outputs/facility_matches_for_review.csv"
)
# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches <- readr::read_csv(
here::here("03_outputs", "tables", "facility_matches_reviewed.csv"),
col_types = readr::cols(
reviewer_decision = readr::col_character(),
reviewer_notes = readr::col_character(),
review_date = readr::col_character(),
reviewer_name = readr::col_character()
)
)
final_incorp <- reviewed_matches |>
dplyr::mutate(
final_match_flag = dplyr::case_when(
reviewer_decision == "accept" ~ "match",
reviewer_decision == "reject" ~ "no_match",
is.na(reviewer_decision) ~ "pending_review",
TRUE ~ "no_match"
)
)
# get the final matched ones
final_matches <- final_incorp |>
dplyr::filter(final_match_flag == "match") |>
dplyr::select(-reviewer_decision)
# summary of all results (including rejected/pending)
all_results_summary <- final_incorp |>
dplyr::count(final_match_flag, name = "n") |>
dplyr::mutate(percentage = round(100 * n / sum(n), 1))
# summary of final accepted matches only
final_summary <- final_matches |>
dplyr::count(final_match_flag, name = "n")
cli::cli_h2("Final Matching Results After Manual Review")
cli::cli_alert_success(
"Accepted matches: {format(sum(final_summary$n), big.mark = ',')}"
)
# breakdown by decision type
rejected_count <- all_results_summary |>
dplyr::filter(final_match_flag == "no_match") |>
dplyr::pull(n)
if (length(rejected_count) == 0) {
rejected_count <- 0
}
pending_count <- all_results_summary |>
dplyr::filter(final_match_flag == "pending_review") |>
dplyr::pull(n)
if (length(pending_count) == 0) {
pending_count <- 0
}
if (rejected_count > 0) {
cli::cli_alert_danger(
"Rejected matches: {format(rejected_count, big.mark = ',')} (will have NA MFL values in final dataset)"
)
}
if (pending_count > 0) {
cli::cli_alert_warning(
"Pending review: {format(pending_count, big.mark = ',')} (require additional review)"
)
}
# show breakdown table
all_results_summary
### Step 9: Combine All Matched Results ----------------------------------------
#### Step 9.1: Combine results from all matching approaches --------------------
# create manual_matches from manual review results (both matched and
# unmatched)
manual_matches <- final_incorp |>
dplyr::mutate(
# for unmatched, set appropriate values
hf_mfl = dplyr::if_else(
reviewer_decision == "accept",
hf_mfl,
NA_character_
),
hf_mfl_raw = dplyr::if_else(
reviewer_decision == "accept",
hf_mfl_raw,
NA_character_
),
final_method = dplyr::if_else(
reviewer_decision == "accept",
"Manual Review - Accepted",
"Manual Review - Rejected"
),
score = dplyr::if_else(
reviewer_decision == "accept",
100,
0
)
) |>
# manual reviewed results already carry adm0-3 + hf_uid_new from
# fallback_final → review export → reimport
dplyr::select(
adm0,
adm1,
adm2,
adm3,
hf_uid_new,
hf_dhis2_raw,
hf_dhis2,
hf_mfl_raw,
hf_mfl,
score,
final_method
)
# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and
# rejected
final_facilities_all <- dplyr::bind_rows(
# step 4: stratified geographic matches (if performed)
matched_dhis2_prepgeoname,
# step 5: exact matches after standardization
matched_dhis2,
# step 6-8: complete matches from fuzzy matching
fuzzy_matches,
# manual matching (if performed)
manual_matches
)
# persist for the python render-state to load (Rule 6.4 parity)
write.csv(
final_facilities_all,
here::here("01_data/1.1_foundational/1.1c_health_facilities/processed/final_facilities_all.csv"),
row.names = FALSE
)
# collapse to one record per DHIS2 facility for downstream integration
final_facilities_one_per_hf <- final_facilities_all |>
dplyr::select(-match_flag, -hf)
cli::cli_alert_info(
"All facilities processed (rows): {nrow(final_facilities_all)}"
)
cli::cli_alert_success(
"All facilities processed (distinct DHIS2 by adm0/1/2/3+hf): {dplyr::n_distinct(final_facilities_all$hf_uid_new)}"
)
# generate summary by matching method
matching_summary <- final_facilities_one_per_hf |>
dplyr::group_by(final_method) |>
dplyr::summarise(
n_matched = n(),
avg_score = mean(score, na.rm = TRUE),
min_score = min(score, na.rm = TRUE),
max_score = max(score, na.rm = TRUE),
.groups = "drop"
)
# display summary
cli::cli_h2("Matching Results by Method")
matching_summary
#### Step 9.2: Create final integrated DHIS2-MFL dataset -----------------------
# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final <- dhis2_df |>
dplyr::left_join(
dhis2_map,
by = c("adm0", "adm1", "adm2", "adm3", "hf")
)
# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf <- final_facilities_one_per_hf |>
dplyr::select(hf_uid_new, hf_mfl_raw)
# create the integrated dataset
final_dhis2_mfl_df <- dhis2_df_final |>
dplyr::left_join(final_match_per_hf, by = "hf_uid_new") |>
# join MFL attributes for matched facilities
dplyr::left_join(
dplyr::select(master_hf_df, -hf_mfl, -hf, -adm0, -adm1, -adm2, -adm3),
by = "hf_mfl_raw"
)
# counts on distinct facilities via hf_uid_new (robust to row-level
# duplication)
total <- dhis2_df_final |> dplyr::distinct(hf_uid_new) |> nrow()
with_mfl <- final_match_per_hf |>
dplyr::filter(!is.na(hf_mfl_raw)) |>
nrow()
without_mfl <- total - with_mfl
# validation summary
cli::cli_h2("DHIS2-MFL Integration Complete")
cli::cli_alert_success(
"Total DHIS2 facilities preserved: {total}"
)
cli::cli_alert_info(
"Facilities with MFL data: {with_mfl}"
)
cli::cli_alert_warning(
"Facilities without MFL match: {without_mfl}"
)
cli::cli_alert_info("Total rows (facility-months): {nrow(final_dhis2_mfl_df)}")
### Step 10: Final Checks ------------------------------------------------------
#### Step 10.1: Detecting one-to-many matches for review -----------------------
# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many <- final_facilities_all |>
dplyr::filter(!is.na(hf_mfl_raw)) |> # only matched facilities
dplyr::group_by(hf_mfl_raw) |>
dplyr::filter(dplyr::n() > 1) |> # MFL matched to >1 DHIS2
dplyr::ungroup() |>
dplyr::arrange(hf_mfl_raw, hf_dhis2_raw)
if (nrow(one_to_many) > 0) {
cli::cli_alert_warning(
paste0(
"One-to-many matches detected: {nrow(one_to_many)} DHIS2 facilities ",
"share {dplyr::n_distinct(one_to_many$hf_mfl_raw)} MFL facilities"
)
)
# show sample
knitr::kable(
one_to_many |>
dplyr::select(hf_dhis2_raw, hf_mfl_raw, score, final_method) |>
dplyr::slice_head(n = 6),
caption = "Sample one-to-many matches requiring review"
)
} else {
cli::cli_alert_success("No one-to-many matches detected")
}
#### Step 10.2: Unmatched facilities analysis ----------------------------------
# analyze unmatched facilities by reason
unmatched_analysis <- final_facilities_all |>
dplyr::filter(is.na(hf_mfl_raw)) |>
dplyr::group_by(final_method) |>
dplyr::summarise(
count = n(),
.groups = "drop"
) |>
dplyr::arrange(desc(count))
# display unmatched analysis
cli::cli_h2("Unmatched Facilities Analysis")
cli::cli_alert_info(
"Total unmatched facilities: {sum(unmatched_analysis$count)}"
)
# show breakdown by reason
purrr::pwalk(
unmatched_analysis,
~ cli::cli_alert_warning("{.x}: {.y} facilities")
)
# create detailed unmatched facilities export
unmatched_detailed <- final_facilities_one_per_hf |>
# unmatched are those without an MFL linkage
dplyr::filter(is.na(hf_mfl_raw)) |>
# select required context and scoring/method fields
dplyr::select(
hf_dhis2_raw,
adm1,
adm2,
adm3,
final_method,
score
) |>
# derive a clear potential reason for review
dplyr::mutate(
potential_reason = dplyr::case_when(
is.na(final_method) ~ "Never matched - possible new facility",
final_method == "Manual Review - Rejected" ~
"Rejected in review - possible different facility",
score > 0 & score < 50 ~ "Low similarity - possible data quality issue",
TRUE ~ "Unmatched - needs investigation"
)
) |>
dplyr::arrange(adm1, adm2, adm3, hf_dhis2_raw)
# export for SNT team review
readr::write_csv(
unmatched_detailed,
here::here("03_outputs", "unmatched_facilities_for_snt_review.csv")
)
cli::cli_alert_success(
"Exported {nrow(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli::cli_text("File: outputs/unmatched_facilities_for_snt_review.csv")
### Step 11: Save Final Datasets -----------------------------------------------
# save main integrated DHIS2-MFL dataset
rio::export(
final_dhis2_mfl_df,
here::here("03_outputs", "final_dhis2_mfl_integrated.xlsx")
)
# save matching results summary
rio::export(
final_facilities_all,
here::here("03_outputs", "facility_matching_results.xlsx")
)
if (nrow(unmatched_facilities) > 0) {
readr::write_csv(
unmatched_facilities,
here::here("03_outputs", "unmatched_dhis2_facilities.csv")
)
}
if (nrow(one_to_many) > 0) {
readr::write_csv(
one_to_many,
here::here("03_outputs", "one_to_many_matches_for_review.csv")
)
}
# summary of saved files
cli::cli_h2("Final Datasets Saved")
cli::cli_alert_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli::cli_alert_success("Matching results: facility_matching_results.xlsx")
cli::cli_alert_info("Unmatched facilities: {nrow(unmatched_facilities)} exported")
cli::cli_alert_info("One-to-many matches: {nrow(one_to_many)} exported")Show full code
################################################################################
############ ~ Fuzzy matching of names across datasets full code ~ #############
################################################################################
### Step 1: Install and Load Required Libraries --------------------------------
from pathlib import Path
import re
import unicodedata
import numpy as np
import pandas as pd
import pyreadr
import xxhash
import matplotlib.pyplot as plt
from pyprojroot import here
from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz
def read_rds(path):
"""Read a single-object RDS file as a pandas DataFrame."""
result = pyreadr.read_r(str(path))
return next(iter(result.values()))
def vdigest(x, algo="xxhash32"):
"""Vectorised xxhash32 digest. Mirrors sntutils::vdigest()."""
return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())
def cli_header(message):
print(f"\n{message}")
def cli_info(message):
print(f"INFO: {message}")
def cli_success(message):
print(f"SUCCESS: {message}")
def cli_warning(message):
print(f"WARNING: {message}")
def cli_danger(message):
print(f"ERROR: {message}")
def anti_join(left, right, on):
"""Return rows in left with no matching key in right."""
right_keys = right[on].drop_duplicates()
return (
left.merge(right_keys, on=on, how="left", indicator=True)
.loc[lambda x: x["_merge"] == "left_only"]
.drop(columns="_merge")
)
def show_table(df, n=10, caption=None):
"""Render a compact scrollable HTML table with the .out-table style.
Chunks calling this must set #| results: asis."""
from IPython.display import display, HTML
rows = df.head(n)
cap_html = f"<caption>{caption}</caption>" if caption else ""
table_html = rows.to_html(
index=False,
classes="out-table",
border=0,
na_rep="",
)
# inject caption before the table header
if cap_html:
table_html = table_html.replace(
"<thead>", cap_html + "<thead>", 1
)
display(HTML(f'<div class="out-scroll">{table_html}</div>'))
### Step 2: Load Data ----------------------------------------------------------
# set up path to dhis2 hf data
dhis2_path = Path(here("01_data/1.2_epidemiology/1.2a_routine_surveillance/processed"))
hf_path = Path(here("01_data/1.1_foundational/1.1c_health_facilities/processed"))
# read DHIS2 health facilities data
dhis2_df = read_rds(dhis2_path / "sle_dhis2_with_clean_adm3.rds")
dhis2_df = dhis2_df.assign(hf_dhis2_raw=dhis2_df["hf"])
# get distinct admin and hf cols
dhis2_hf_df = dhis2_df[
["adm0", "adm1", "adm2", "adm3", "hf", "hf_dhis2_raw"]
].drop_duplicates()
# read MFL health facilities data
master_hf_df = pd.read_csv(hf_path / "hf_final_clean_data.csv")
master_hf_df = (
master_hf_df
.drop_duplicates(subset=["adm0", "adm1", "adm2", "adm3", "hf", "lat", "long"])
.assign(hf_mfl_raw=lambda d: d["hf"])
)
# attach a stable DHIS2 facility ID
dhis2_map = (
dhis2_df[["adm0", "adm1", "adm2", "adm3", "hf"]]
.drop_duplicates()
.assign(
hf_uid_new=lambda d: vdigest(
d["adm0"].str.lower().str.strip() + "|" +
d["adm1"].str.lower().str.strip() + "|" +
d["adm2"].str.lower().str.strip() + "|" +
d["adm3"].str.lower().str.strip() + "|" +
d["hf"].str.lower().str.strip()
)
)
)
# print head of data
cli_header("Sample of DHIS2 data:")
dhis2_hf_df.head()
cli_header("Sample of MFL data:")
master_hf_df.head()
### Step 3: Initial Matching Diagnostics ---------------------------------------
#### Step 3.1: Overall exact match check ---------------------------------------
# check for exact matches on raw names (no admin constraint)
exact_matches_all = dhis2_hf_df.merge(
master_hf_df[["hf_mfl_raw"]],
left_on="hf_dhis2_raw",
right_on="hf_mfl_raw",
how="inner",
)
# calculate matching potential
total_dhis2 = len(dhis2_hf_df)
total_mfl = len(master_hf_df)
unmatched_dhis2 = total_dhis2 - len(exact_matches_all)
cli_header("Overall Matching Summary")
cli_info(f"Total DHIS2 facilities: {total_dhis2}")
cli_info(f"Total MFL facilities: {total_mfl}")
cli_success(
f"Exact matches found: {len(exact_matches_all)} "
f"({round(len(exact_matches_all) / total_dhis2 * 100, 1)}%)"
)
cli_warning(f"Remaining to match: {unmatched_dhis2}")
#### Step 3.2: Administrative level match check --------------------------------
# check matches at adm2 level (district)
dhis2_by_adm2 = (
dhis2_hf_df.groupby("adm2", as_index=False)
.agg(total_dhis2=("hf_dhis2_raw", "count"))
)
matches_by_adm2 = (
dhis2_hf_df
.merge(
master_hf_df[["hf_mfl_raw", "adm2"]],
left_on=["hf_dhis2_raw", "adm2"],
right_on=["hf_mfl_raw", "adm2"],
how="inner",
)
.groupby("adm2", as_index=False)
.agg(exact_matches=("hf_dhis2_raw", "count"))
.merge(dhis2_by_adm2, on="adm2", how="left")
.assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
[["adm2", "exact_matches", "total_dhis2", "match_rate"]]
.sort_values("match_rate", ascending=False)
)
cli_header("Exact matches by district (adm2)")
matches_by_adm2
# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 = (
dhis2_hf_df.groupby(["adm2", "adm3"], as_index=False)
.agg(total_dhis2=("hf_dhis2_raw", "count"))
)
matches_by_adm3 = (
dhis2_hf_df
.merge(
master_hf_df[["hf_mfl_raw", "adm2", "adm3"]],
left_on=["hf_dhis2_raw", "adm2", "adm3"],
right_on=["hf_mfl_raw", "adm2", "adm3"],
how="inner",
)
.groupby(["adm2", "adm3"], as_index=False)
.agg(exact_matches=("hf_dhis2_raw", "count"))
.merge(dhis2_by_adm3, on=["adm2", "adm3"], how="left")
.assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
.loc[lambda d: d["total_dhis2"] >= 5]
.sort_values("match_rate", ascending=False)
.head(10)
)
cli_header("Top exact matches by chiefdom (adm3)")
matches_by_adm3
#### Step 3.3: Duplicate name check --------------------------------------------
# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 = (
dhis2_hf_df
.groupby(["adm2", "hf_dhis2_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "hf_dhis2_raw"])
)
mfl_dups_adm2 = (
master_hf_df
.groupby(["adm2", "hf_mfl_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "hf_mfl_raw"])
)
cli_header("Duplicates within same district (adm2)")
cli_warning(
f"DHIS2 duplicates within districts: {dhis2_dups_adm2['hf_dhis2_raw'].nunique()}"
)
cli_warning(
f"MFL duplicates within districts: {mfl_dups_adm2['hf_mfl_raw'].nunique()}"
)
# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 = (
dhis2_hf_df
.groupby(["adm2", "adm3", "hf_dhis2_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "adm3", "hf_dhis2_raw"])
)
mfl_dups_adm3 = (
master_hf_df
.groupby(["adm2", "adm3", "hf_mfl_raw"])
.filter(lambda x: len(x) > 1)
.sort_values(["adm2", "adm3", "hf_mfl_raw"])
)
cli_header("Duplicates within same chiefdom (adm3)")
cli_danger(
f"DHIS2 duplicates within chiefdoms: {dhis2_dups_adm3['hf_dhis2_raw'].nunique()}"
)
cli_danger(
f"MFL duplicates within chiefdoms: {mfl_dups_adm3['hf_mfl_raw'].nunique()}"
)
# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall = (
dhis2_hf_df
.groupby("hf_dhis2_raw")
.filter(lambda x: len(x) > 1)
.sort_values("hf_dhis2_raw")
)
mfl_dups_overall = (
master_hf_df
.groupby("hf_mfl_raw")
.filter(lambda x: len(x) > 1)
.sort_values("hf_mfl_raw")
)
cli_header("Overall duplicate names (across all areas)")
cli_info(f"DHIS2 overall duplicates: {dhis2_dups_overall['hf_dhis2_raw'].nunique()}")
cli_info(f"MFL overall duplicates: {mfl_dups_overall['hf_mfl_raw'].nunique()}")
### Step 4: Interactive Stratified Geographic Matching -------------------------
from sntutils.geo import prep_geonames
# set up location to save cache
cache_loc = "01_data/1.1_foundational/1.1f_cache_files/processed"
# interactive stratified matching with automatic standardization
dhis2_df_cleaned = prep_geonames(
target_df=dhis2_hf_df,
lookup_df=master_hf_df,
level0="adm0",
level1="adm1",
level2="adm2",
level3="adm3",
level4="hf",
cache_path=here(cache_loc, "geoname_cache.csv"),
unmatched_export_path=here(cache_loc, "dhis2_hf_unmatched.csv"),
)
# load the unmatched facilities for further processing (steps 5–8)
dhis2_hf_to_process = pd.read_csv(
here(cache_loc, "dhis2_hf_unmatched.csv")
).rename(columns={"hf": "hf_dhis2_raw"})
# summary statistics
n_original = len(dhis2_hf_df)
n_matched = n_original - len(dhis2_hf_to_process)
match_rate = n_matched / n_original * 100
cli_success(
f"Stratified matching completed: {n_matched:,}/{n_original:,} "
f"facilities matched ({round(match_rate, 1)}%)"
)
cli_info(f"Remaining unmatched: {len(dhis2_hf_to_process)} facilities")
### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------
#### Step 5.1: Standardize health facility names -------------------------------
import unicodedata
def standardize_names(series):
"""Standardize health facility name strings for fuzzy matching.
Applies lowercase conversion, punctuation removal, accent normalization,
Roman numeral conversion, and alphabetical token sorting.
"""
roman_map = {
r"\bix\b": "9", r"\bviii\b": "8", r"\bvii\b": "7",
r"\bvi\b": "6", r"\bv\b": "5", r"\biv\b": "4",
r"\biii\b": "3", r"\bii\b": "2", r"\bi\b": "1",
}
def _clean(text):
if pd.isna(text):
return text
text = str(text).lower()
# replace punctuation with space
text = re.sub(r"[^\w\s]", " ", text)
# collapse multiple spaces
text = re.sub(r"\s+", " ", text).strip()
# normalize accents
text = unicodedata.normalize("NFD", text)
text = "".join(c for c in text if unicodedata.category(c) != "Mn")
# convert roman numerals
for pattern, replacement in roman_map.items():
text = re.sub(pattern, replacement, text)
# sort tokens: letters first (alphabetical), numbers last
tokens = text.split()
alpha = sorted(t for t in tokens if not t.isdigit())
numeric = sorted(t for t in tokens if t.isdigit())
return " ".join(alpha + numeric)
return series.apply(_clean)
# set up example with messy formatting
example_word = "Clínica! Rahmâ IV ( New clinic) East"
# print original
print(f"\nExample before standardization:\n{example_word}")
# apply standardization
example_word_st = standardize_names(pd.Series([example_word]))[0]
# print cleaned example
print(f"\nExample after standardization:\n{example_word_st}")
# remove duplicates from MFL and create standardized column
master_hf_df = (
master_hf_df
.drop_duplicates(subset="hf_mfl_raw")
.assign(hf_mfl=lambda d: standardize_names(d["hf_mfl_raw"]))
)
# preserve original full dataset before processing unmatched
dhis2_hf_df_original = dhis2_hf_df.assign(
hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"])
)
# build geo-aware stable ID map
hf_uid_new_map = (
dhis2_hf_df[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"]]
.drop_duplicates()
.assign(
hf_uid_new=lambda d: vdigest(
d["adm0"].str.lower().str.strip() + "|" +
d["adm1"].str.lower().str.strip() + "|" +
d["adm2"].str.lower().str.strip() + "|" +
d["adm3"].str.lower().str.strip() + "|" +
d["hf_dhis2_raw"].str.lower().str.strip()
)
)
)
dhis2_hf_df_original = dhis2_hf_df_original.merge(
hf_uid_new_map,
on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
how="left",
)
# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched = (
dhis2_hf_to_process
.assign(hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"]))
.merge(
hf_uid_new_map.assign(
hf_dhis2_raw=lambda d: d["hf_dhis2_raw"].str.upper()
),
on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
how="left",
)
)
# check if standardization worked
dhis2_hf_unmatched[["hf_dhis2_raw", "hf_dhis2"]].drop_duplicates().head(10)
#### Step 5.2: Handle abbreviations in health facility names -------------------
# combine DHIS2 and MFL names for abbreviation detection
all_names = pd.concat([
dhis2_hf_unmatched[["hf_dhis2_raw"]].rename(columns={"hf_dhis2_raw": "hf"}),
master_hf_df[["hf_mfl_raw"]].rename(columns={"hf_mfl_raw": "hf"}),
])
# split into words and filter to 2-4 uppercase letter tokens
abbrev_dictionary = (
all_names["hf"]
.dropna()
.str.split(expand=True)
.stack()
.reset_index(drop=True)
.rename("word")
.loc[lambda s: s.str.match(r"^[A-Z]{2,4}$")]
.value_counts()
.reset_index()
.rename(columns={"count": "freq"})
.loc[lambda d: d["freq"] > 2]
)
# check output
abbrev_dictionary
# define abbreviation dictionary (all lowercase, using standardized column)
abbrev_dict = {
"maternal child health post": "mchp",
"community health post": "chp",
"community health center": "chc",
"urban maternal clinic": "umi",
"expanded programme on immunization": "epi",
}
# apply replacements to MFL names
for long_form, short_form in abbrev_dict.items():
master_hf_df["hf_mfl"] = master_hf_df["hf_mfl"].str.replace(
long_form, short_form, regex=False
)
# apply replacements to DHIS2 names
for long_form, short_form in abbrev_dict.items():
dhis2_hf_unmatched["hf_dhis2"] = dhis2_hf_unmatched["hf_dhis2"].str.replace(
long_form, short_form, regex=False
)
# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched.loc[
dhis2_hf_unmatched["hf_dhis2_raw"].str.contains(
r"(?i)Community Health Center|Maternal Child Health Post|Community Health Post|CHC|MCHP|CHP",
regex=True,
na=False,
),
["hf_dhis2_raw", "hf_dhis2"],
].drop_duplicates().head()
### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------
#### Step 6.1: Identify matched and unmatched facilities -----------------------
# flag for one-to-one enforcement
enforce_one_to_one = False
# exact matches using raw names
matched_dhis2_raw = (
dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
.merge(
master_hf_df[["hf_mfl_raw"]],
left_on="hf_dhis2_raw",
right_on="hf_mfl_raw",
how="inner",
)
)
# exact matches using standardized names
matched_dhis2 = (
dhis2_hf_unmatched[
["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_dhis2", "hf_uid_new"]
]
.merge(
master_hf_df[["hf_mfl_raw", "hf_mfl"]],
left_on="hf_dhis2",
right_on="hf_mfl",
how="inner",
)
)
# exclude facilities already matched via raw names
matched_dhis2 = anti_join(matched_dhis2, matched_dhis2_raw, on=["hf_dhis2_raw"])
matched_dhis2 = matched_dhis2.assign(
final_method="Matched Without Fuzzy Matching (standardization)",
score=100,
)
# compute unmatched after standardization
# cross-named anti-join: dhis2_hf_unmatched.hf_dhis2 vs master_hf_df.hf_mfl
unmatched_dhis2 = (
dhis2_hf_unmatched.loc[
~dhis2_hf_unmatched["hf_dhis2"].isin(master_hf_df["hf_mfl"]),
["hf_dhis2"],
]
.drop_duplicates(subset="hf_dhis2")
)
used_mfl = list(matched_dhis2["hf_mfl"].unique())
# build candidate MFL pool for fuzzy step
candidate_mfl_df = master_hf_df[["hf_mfl"]].copy()
if enforce_one_to_one:
candidate_mfl_df = candidate_mfl_df.loc[~candidate_mfl_df["hf_mfl"].isin(used_mfl)]
# summary counts
total_dhis2_hf = dhis2_hf_unmatched["hf_dhis2_raw"].nunique()
raw_match_dhis2_hf = matched_dhis2_raw["hf_dhis2_raw"].nunique()
raw_unmatch_dhis2_hf = total_dhis2_hf - raw_match_dhis2_hf
standardized_match_dhis2_hf = matched_dhis2["hf_dhis2"].nunique()
standardized_unmatch_dhis2_hf = total_dhis2_hf - standardized_match_dhis2_hf
total_mfl_hf = master_hf_df["hf_mfl"].nunique()
candidate_mfl_count = candidate_mfl_df["hf_mfl"].nunique()
cli_header("Summary of Matching Status")
cli_info(f"Total DHIS2 facilities: {total_dhis2_hf:,}")
cli_success(f"Matched after standardization: {standardized_match_dhis2_hf:,}")
cli_danger(f"Unmatched using raw names: {raw_unmatch_dhis2_hf:,}")
cli_danger(f"Unmatched after standardization: {standardized_unmatch_dhis2_hf:,}")
cli_info(
f"MFL facilities in candidate pool for matching: "
f"{candidate_mfl_count:,} of {total_mfl_hf:,}"
)
#### Step 6.1: Create the matching grid ----------------------------------------
# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities
candidate_match_pool = unmatched_dhis2.merge(candidate_mfl_df, how="cross")
# create match grid with dhis2 and mfl names
match_grid = candidate_match_pool.copy()
# preview sample - top 10 matches
match_grid.head(10)
#### Step 6.2: Compute similarity scores ---------------------------------------
from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz
def _score_lv(a, b):
"""Normalized Levenshtein similarity (0–100)."""
len_max = max(len(a), len(b))
if len_max == 0:
return 100.0
return max(0.0, (1 - Levenshtein.distance(a, b) / len_max) * 100)
def _score_jw(a, b):
"""Jaro-Winkler similarity (0–100)."""
return JaroWinkler.similarity(a, b) * 100
def _score_qg(a, b):
"""Q-gram similarity (0–100) using token sort ratio."""
return float(fuzz.token_sort_ratio(a, b))
def _score_lcs(a, b):
"""LCS-based similarity (0–100) using partial ratio."""
return float(fuzz.partial_ratio(a, b))
# calculate fuzzy scores
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]
match_grid = match_grid.assign(
score_jw=lambda d: d.apply(
lambda r: _score_jw(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
score_lv=lambda d: d.apply(
lambda r: _score_lv(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
score_qg=lambda d: d.apply(
lambda r: _score_qg(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
score_lcs=lambda d: d.apply(
lambda r: _score_lcs(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
),
)
# clip negatives and NaN to zero
for col in score_cols:
match_grid[col] = match_grid[col].clip(lower=0).fillna(0)
#### Step 6.3: Create composite scores -----------------------------------------
# define similarity score columns
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]
# compute composite_score (simple average)
match_grid["composite_score"] = match_grid[score_cols].mean(axis=1)
# compute average rank across all similarity methods
for col in score_cols:
match_grid[f"rank_{col}"] = (
match_grid.groupby("hf_dhis2")[col]
.rank(method="min", ascending=False)
)
rank_cols = [f"rank_{c}" for c in score_cols]
match_grid["rank_avg"] = (
match_grid[rank_cols].sum(axis=1) / len(score_cols)
).round()
ranked_grid = match_grid.copy()
# preview ranked results for a few sample facilities
sample_facilities = [
"charity clinic kamba of",
"arab clinic shad",
"al arab clinic sheefa",
]
ranked_grid.loc[ranked_grid["hf_dhis2"].isin(sample_facilities)].groupby(
"hf_dhis2"
).apply(lambda g: g.nsmallest(4, "rank_avg")).reset_index(drop=True)[
["hf_dhis2", "hf_mfl"] + score_cols + ["composite_score", "rank_avg"]
].sort_values(["hf_dhis2", "rank_avg"])
#### Step 6.4: Extract best match ----------------------------------------------
# for each fuzzy matching method, select the top-ranked MFL name per DHIS2 hf
best_jw = (
match_grid.sort_values("score_jw", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_jw"]]
.rename(columns={"score_jw": "score"})
.assign(method="Jaro-Winkler")
)
best_lv = (
match_grid.sort_values("score_lv", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_lv"]]
.rename(columns={"score_lv": "score"})
.assign(method="Levenshtein")
)
best_qg = (
match_grid.sort_values("score_qg", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_qg"]]
.rename(columns={"score_qg": "score"})
.assign(method="Qgram")
)
best_lcs = (
match_grid.sort_values("score_lcs", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "score_lcs"]]
.rename(columns={"score_lcs": "score"})
.assign(method="LCS")
)
best_comp = (
match_grid.sort_values("composite_score", ascending=False)
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "composite_score"]]
.rename(columns={"composite_score": "score"})
.assign(method="Composite-Score")
)
best_ranked_match = (
ranked_grid.sort_values("rank_avg")
.groupby("hf_dhis2", as_index=False).first()
[["hf_dhis2", "hf_mfl", "rank_avg"]]
.rename(columns={"rank_avg": "score"})
.assign(method="Rank-Ensemble")
)
# combine top matches from all methods for comparison across approaches
all_best = pd.concat(
[best_jw, best_lv, best_qg, best_lcs, best_comp, best_ranked_match],
ignore_index=True,
)
# preview best matches
all_best.head(10)[["hf_dhis2", "hf_mfl", "score", "method"]]
### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------
#### Step 7.1: Visualize score distributions across methods --------------------
import matplotlib.pyplot as plt
# plot score distribution across methods (exclude rank-based method)
methods_for_plot = all_best.loc[all_best["method"] != "Rank-Ensemble"]
method_names = sorted(methods_for_plot["method"].unique())
fig, axes = plt.subplots(1, len(method_names), figsize=(12, 6), sharey=True)
for ax, name in zip(axes, method_names):
grp = methods_for_plot.loc[methods_for_plot["method"] == name, "score"].dropna()
grp.plot.kde(ax=ax, color="steelblue", linewidth=1)
ax.fill_between(
ax.lines[0].get_xdata(),
ax.lines[0].get_ydata(),
alpha=0.6,
color="steelblue",
)
ax.set_title(name, fontsize=11)
ax.set_xlabel("\nMatch Score (%)")
if ax == axes[0]:
ax.set_ylabel("Density\n")
fig.suptitle("Distribution of Fuzzy Matching Scores", fontsize=14)
plt.tight_layout()
# save plot
output_dir = Path(here("03_output/3a_figures"))
output_dir.mkdir(parents=True, exist_ok=True)
fig.savefig(output_dir / "fuzzy_score_distributions.png", dpi=300, bbox_inches="tight")
#### Step 7.2: Define match quality diagnostic function ------------------------
def assess_match_quality(names1, names2):
"""Compute structural quality diagnostics for name pairs.
Returns a DataFrame with prefix_match, suffix_match,
token_diff, and char_diff for each name pair.
"""
rows = []
for a, b in zip(names1, names2):
tokens1 = str(a).split()
tokens2 = str(b).split()
rows.append({
"prefix_match": tokens1[0].lower() == tokens2[0].lower()
if tokens1 and tokens2 else False,
"suffix_match": tokens1[-1].lower() == tokens2[-1].lower()
if tokens1 and tokens2 else False,
"token_diff": abs(len(tokens1) - len(tokens2)),
"char_diff": abs(len(str(a)) - len(str(b))),
})
return pd.DataFrame(rows)
assess_match_quality(["Makeni Govt Hospital"], ["Makeni Government Hospital"])
#### Step 7.3: Evaluate quality of top fuzzy matches ---------------------------
def _rescale(series, new_min, new_max):
"""Rescale a numeric series to [new_min, new_max]."""
s_min, s_max = series.min(), series.max()
if s_max == s_min:
return pd.Series([new_min] * len(series), index=series.index)
return new_min + (series - s_min) / (s_max - s_min) * (new_max - new_min)
# assess structural match quality
diagnostics_df = pd.concat(
[all_best.reset_index(drop=True),
assess_match_quality(all_best["hf_dhis2"], all_best["hf_mfl"])],
axis=1,
)
# compare method diagnostics
summary_stats = (
diagnostics_df
.groupby("method")
.agg(
avg_token_diff=("token_diff", "mean"),
avg_char_diff=("char_diff", "mean"),
pct_prefix_match=("prefix_match", "mean"),
pct_suffix_match=("suffix_match", "mean"),
total=("hf_dhis2", "count"),
)
.reset_index()
.assign(
avg_token_diff=lambda d: d["avg_token_diff"].round(2),
avg_char_diff=lambda d: d["avg_char_diff"].round(2),
pct_prefix_match=lambda d: (d["pct_prefix_match"] * 100).round(2),
pct_suffix_match=lambda d: (d["pct_suffix_match"] * 100).round(2),
)
)
# create a final overall score
summary_stats = summary_stats.assign(
# rescale negative of average token difference (smaller is better)
token_score=lambda d: _rescale(-d["avg_token_diff"], 0, 100),
# rescale negative of average character difference (smaller is better)
char_score=lambda d: _rescale(-d["avg_char_diff"], 0, 100),
# rescale prefix match percentage (higher is better)
prefix_score=lambda d: _rescale(d["pct_prefix_match"], 0, 100),
# rescale suffix match percentage (higher is better)
suffix_score=lambda d: _rescale(d["pct_suffix_match"], 0, 100),
).assign(
# combine all four metrics into a weighted structural quality score
structure_score=lambda d: (
0.3 * d["token_score"] + # emphasize fewer token differences
0.2 * d["char_score"] + # moderate weight on character similarity
0.25 * d["prefix_score"] + # give weight to matching initial words
0.25 * d["suffix_score"] # give equal weight to matching final words
).round(1)
).sort_values("structure_score", ascending=False)
summary_stats["rank"] = range(1, len(summary_stats) + 1)
summary_stats.rename(columns={
"method": "Method",
"avg_token_diff": "Avg. Token Difference",
"avg_char_diff": "Avg. Character Difference",
"pct_prefix_match": "% Prefix Match",
"pct_suffix_match": "% Suffix Match",
"structure_score": "Structural Score",
"rank": "Rank",
})[["Method", "Avg. Token Difference", "Avg. Character Difference",
"% Prefix Match", "% Suffix Match", "Structural Score", "Rank"]]
#### Step 7.4: Weighted thresholding using structural quality and false-positive risk
# calculate method-specific weights and thresholds
q70 = all_best["score"].quantile(0.70)
q95 = all_best["score"].quantile(0.95)
method_threshold = summary_stats[["method", "structure_score"]].copy()
method_threshold["score_frac"] = method_threshold["structure_score"] / 100
method_threshold["weight"] = (
method_threshold["score_frac"] / method_threshold["score_frac"].sum()
)
# inverse scale: higher structure_score → lower threshold (easier matching)
method_threshold["threshold"] = (
_rescale(method_threshold["structure_score"], q95, q70)
).round()
method_threshold = method_threshold[["method", "weight", "threshold"]]
# display results
method_threshold
### Step 8: Finalize Fuzzy Match Selection -------------------------------------
#### Step 8.1: Match selection options -----------------------------------------
# apply single-method threshold and flag matches
best_lv_final = (
best_lv
.assign(
match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"),
final_method="Levenshtein",
)
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
[["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)
# show top 5 and bottom 5 matches by score
top_5 = best_lv_final.sort_values("score", ascending=False).head(5)
bottom_5 = best_lv_final.sort_values("score").head(5)
# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)
cli_header("Top 5 and Bottom 5 Matches by Levenshtein Score")
# preview results
top_bottom
# select best match using rank average
composite_final = (
ranked_grid.sort_values("rank_avg")
.groupby("hf_dhis2", as_index=False)
.first()
.assign(
match_flag=lambda d: np.where(d["composite_score"] >= 85, "match", "review"),
score=lambda d: d["composite_score"],
final_method="Composite-Score",
)
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
[["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)
# show top 5 and bottom 5 matches by composite score
top_5 = composite_final.sort_values("score", ascending=False).head(5)
bottom_5 = composite_final.sort_values("score").head(5)
# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)
cli_header("Top 5 and Bottom 5 Matches by Composite Score")
# preview results
top_bottom
# define score columns and extract weights
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]
weights_df = method_threshold.loc[
method_threshold["method"].isin(["Jaro-Winkler", "Levenshtein", "Qgram", "LCS"])
]
weights = weights_df.set_index("method")["weight"]
# extract individual weights for each method
w_jw = float(weights.get("Jaro-Winkler", 0))
w_lv = float(weights.get("Levenshtein", 0))
w_qg = float(weights.get("Qgram", 0))
w_lcs = float(weights.get("LCS", 0))
# calculate weighted composite score for each facility-match pair
# weighted average maintains 0-100 scale
weighted_final = (
ranked_grid.assign(
weighted_composite_score=lambda d: (
d["score_jw"] * w_jw +
d["score_lv"] * w_lv +
d["score_qg"] * w_qg +
d["score_lcs"] * w_lcs
)
)
.sort_values("weighted_composite_score", ascending=False)
.groupby("hf_dhis2", as_index=False)
.first()
.assign(
match_flag=lambda d: np.where(
d["weighted_composite_score"] >= 85, "match", "review"
),
score=lambda d: d["weighted_composite_score"],
final_method="Weighted-Composite",
)
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
[["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)
# show top 5 and bottom 5 matches by weighted composite score
top_5 = weighted_final.sort_values("score", ascending=False).head(5)
bottom_5 = weighted_final.sort_values("score").head(5)
cli_header("Top 5 and Bottom 5 Matches by Weighted Composite Score")
# preview results
bottom_5
# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats = summary_stats.loc[
summary_stats["method"] != "Rank-Ensemble"
].sort_values("structure_score", ascending=False)
method_order = list(method_stats["method"])
# use thresholds calculated in step 7.4
method_thresholds_df = method_threshold.loc[method_threshold["method"] != "Rank-Ensemble"]
method_thresholds = dict(zip(method_thresholds_df["method"], method_thresholds_df["threshold"]))
# build combined table of top matches from real methods only
all_best_nonrank = pd.concat(
[best_lv, best_jw, best_qg, best_lcs, best_comp], ignore_index=True
)
all_best_nonrank["method"] = all_best_nonrank["method"].str.strip()
# add rank once
rank_key = ranked_grid[["hf_dhis2", "hf_mfl", "rank_avg"]].copy()
all_best_ranked = (
all_best_nonrank
.merge(rank_key, on=["hf_dhis2", "hf_mfl"], how="left")
.assign(rank_avg=lambda d: d["rank_avg"].fillna(float("inf")))
)
# split by method and pre-arrange for stable tiebreak
by_method = {
m: grp.sort_values(["score", "rank_avg"], ascending=[False, True])
for m, grp in all_best_ranked.groupby("method")
}
fallback_chunks = []
unmatched_set = set(all_best_ranked["hf_dhis2"].unique())
current_thresholds = dict(method_thresholds)
# fast fallback loop
while True:
for m in method_order:
if m not in by_method:
continue
thr = current_thresholds.get(m, 85) # fallback if method not found
cand = by_method[m].loc[
(by_method[m]["score"] >= thr) &
(by_method[m]["hf_dhis2"].isin(unmatched_set))
]
if len(cand) > 0:
best_per_hf = (
cand.drop_duplicates(subset="hf_dhis2", keep="first")
.assign(final_method=f"Fuzzy-matched using: {m}")
)
fallback_chunks.append(best_per_hf)
unmatched_set -= set(best_per_hf["hf_dhis2"])
if not unmatched_set:
break
if not unmatched_set:
break
# relax thresholds gradually
current_thresholds = {k: v - 1 for k, v in current_thresholds.items()}
if any(v <= 50 for v in current_thresholds.values()): # stop if too low
break
# bind once
fallback_matched = (
pd.concat(fallback_chunks, ignore_index=True)
if fallback_chunks else pd.DataFrame()
)
# build a 1:1 map from standardized DHIS2 name to admin + raw + hf_uid_new
hf_uid_new_map_by_stand = (
dhis2_hf_unmatched
.sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
.groupby("hf_dhis2", as_index=False)
.first()[["hf_dhis2", "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
)
# final flags and attach admin + ids
fallback_final = (
fallback_matched
.assign(match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"))
.merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
.merge(hf_uid_new_map_by_stand, on="hf_dhis2", how="left")
[["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
"score", "final_method", "match_flag"]]
)
top_5 = fallback_final.sort_values("score", ascending=False).head(5)
bottom_5 = fallback_final.sort_values("score").head(5)
cli_header("Top 5 and Bottom 5 Matches by Fallback")
bottom_5 = pd.concat([top_5, bottom_5], ignore_index=True)
# preview results
bottom_5
#### Step 8.2: Select fuzzy matching approach ----------------------------------
# create fuzzy_matches from step 8 results (use whichever option was chosen)
# step 8 produces different result datasets depending on which option was chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final
if "fallback_final" in dir():
fuzzy_matches_raw = fallback_final.loc[fallback_final["match_flag"] == "match"]
elif "composite_final" in dir():
fuzzy_matches_raw = composite_final.loc[composite_final["match_flag"] == "match"]
elif "weighted_final" in dir():
fuzzy_matches_raw = weighted_final.loc[weighted_final["match_flag"] == "match"]
elif "best_lv_final" in dir():
fuzzy_matches_raw = best_lv_final.loc[best_lv_final["match_flag"] == "match"]
elif "best_jw_final" in dir():
fuzzy_matches_raw = best_jw_final.loc[best_jw_final["match_flag"] == "match"]
elif "best_qg_final" in dir():
fuzzy_matches_raw = best_qg_final.loc[best_qg_final["match_flag"] == "match"]
else:
# empty dataframe with expected structure if no step 8 results exist
fuzzy_matches_raw = pd.DataFrame(columns=[
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
"score", "final_method",
])
# deduplicate: keep highest scoring match per DHIS2 facility
fuzzy_matches = (
fuzzy_matches_raw
.sort_values("score", ascending=False)
.drop_duplicates(subset="hf_dhis2_raw", keep="first")
.merge(
dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
.sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
.drop_duplicates(subset="hf_dhis2_raw", keep="first"),
on="hf_dhis2_raw",
how="left",
)
)
# display summary of selected approach
cli_header("Selected Fuzzy Matching Results")
cli_success(f"Fuzzy matches found: {len(fuzzy_matches)}")
if len(fuzzy_matches) > 0:
cli_info(
f"Score range: {fuzzy_matches['score'].min():.1f} - "
f"{fuzzy_matches['score'].max():.1f}"
)
cli_info(f"Average score: {fuzzy_matches['score'].mean():.1f}")
#### Step 8.3: Manual review of flagged matches --------------------------------
# prepare review dataset (using fallback results as example)
review_matches = (
fallback_final
.loc[fallback_final["match_flag"] == "review"]
.sort_values("score", ascending=False)
.assign(
reviewer_decision=pd.NA, # accept/reject/uncertain
reviewer_notes=pd.NA,
review_date=pd.NA,
reviewer_name=pd.NA,
)
[["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
"score", "final_method",
"reviewer_decision", "reviewer_notes", "review_date", "reviewer_name"]]
)
# export for review
output_dir = Path(here("03_outputs/tables"))
output_dir.mkdir(parents=True, exist_ok=True)
review_matches.to_csv(output_dir / "facility_matches_for_review.csv", index=False)
# show summary
cli_header("Manual Review Required")
cli_info(f"Exported {len(review_matches)} matches flagged for review")
cli_info("File location: outputs/tables/facility_matches_for_review.csv")
# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches = pd.read_csv(
here("03_outputs/tables/facility_matches_reviewed.csv"),
dtype={
"reviewer_decision": "object",
"reviewer_notes": "object",
"review_date": "object",
"reviewer_name": "object",
}
)
final_incorp = reviewed_matches.assign(
final_match_flag=lambda d: np.select(
[
d["reviewer_decision"] == "accept",
d["reviewer_decision"] == "reject",
d["reviewer_decision"].isna(),
],
["match", "no_match", "pending_review"],
default="no_match",
)
)
# get the final matched ones
final_matches = (
final_incorp
.loc[final_incorp["final_match_flag"] == "match"]
.drop(columns="reviewer_decision", errors="ignore")
)
# summary of all results (including rejected/pending)
all_results_summary = (
final_incorp.groupby("final_match_flag", as_index=False)
.size()
.rename(columns={"size": "n"})
.assign(percentage=lambda d: (d["n"] / d["n"].sum() * 100).round(1))
)
cli_header("Final Matching Results After Manual Review")
cli_success(f"Accepted matches: {len(final_matches):,}")
# breakdown by decision type
rejected_count = int(
all_results_summary.loc[all_results_summary["final_match_flag"] == "no_match", "n"].sum()
)
pending_count = int(
all_results_summary.loc[
all_results_summary["final_match_flag"] == "pending_review", "n"
].sum()
)
if rejected_count > 0:
cli_danger(
f"Rejected matches: {rejected_count:,} "
"(will have NA MFL values in final dataset)"
)
if pending_count > 0:
cli_warning(f"Pending review: {pending_count:,} (require additional review)")
# show breakdown table
all_results_summary
### Step 9: Combine All Matched Results ----------------------------------------
#### Step 9.1: Combine results from all matching approaches --------------------
# create manual_matches from manual review results (both matched and unmatched)
manual_matches = final_incorp.assign(
# for unmatched, set appropriate values
hf_mfl=lambda d: np.where(d["reviewer_decision"] == "accept", d["hf_mfl"], pd.NA),
hf_mfl_raw=lambda d: np.where(
d["reviewer_decision"] == "accept", d["hf_mfl_raw"], pd.NA
),
final_method=lambda d: np.where(
d["reviewer_decision"] == "accept",
"Manual Review - Accepted",
"Manual Review - Rejected",
),
score=lambda d: np.where(d["reviewer_decision"] == "accept", 100, 0),
)
cols_needed = [
"adm0", "adm1", "adm2", "adm3", "hf_uid_new",
"hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl", "score", "final_method",
]
manual_matches = manual_matches[[c for c in cols_needed if c in manual_matches.columns]]
# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and rejected
final_facilities_all = pd.concat(
[
# step 4: stratified geographic matches (if performed)
matched_dhis2_prepgeoname[[c for c in cols_needed if c in matched_dhis2_prepgeoname.columns]],
# step 5: exact matches after standardization
matched_dhis2[[c for c in cols_needed if c in matched_dhis2.columns]],
# step 6-8: complete matches from fuzzy matching
fuzzy_matches[[c for c in cols_needed if c in fuzzy_matches.columns]],
# manual matching (if performed)
manual_matches,
],
ignore_index=True,
)
# collapse to one record per DHIS2 facility for downstream integration
cols_to_drop = [c for c in ["match_flag", "hf"] if c in final_facilities_all.columns]
final_facilities_one_per_hf = final_facilities_all.drop(columns=cols_to_drop, errors="ignore")
cli_info(f"All facilities processed (rows): {len(final_facilities_all)}")
cli_success(
f"All facilities processed (distinct DHIS2 by hf_uid_new): "
f"{final_facilities_all['hf_uid_new'].nunique()}"
)
# generate summary by matching method
matching_summary = (
final_facilities_one_per_hf
.groupby("final_method", as_index=False)
.agg(
n_matched=("hf_dhis2_raw", "count"),
avg_score=("score", "mean"),
min_score=("score", "min"),
max_score=("score", "max"),
)
)
# display summary
cli_header("Matching Results by Method")
matching_summary
#### Step 9.2: Create final integrated DHIS2-MFL dataset -----------------------
# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final = dhis2_df.merge(
dhis2_map,
on=["adm0", "adm1", "adm2", "adm3", "hf"],
how="left",
)
# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf = final_facilities_one_per_hf[["hf_uid_new", "hf_mfl_raw"]].copy()
# create the integrated dataset
final_dhis2_mfl_df = (
dhis2_df_final
.merge(final_match_per_hf, on="hf_uid_new", how="left")
# join MFL attributes for matched facilities
.merge(
master_hf_df.drop(
columns=["hf_mfl", "hf", "adm0", "adm1", "adm2", "adm3"],
errors="ignore",
),
on="hf_mfl_raw",
how="left",
)
)
# counts on distinct facilities via hf_uid_new
total = dhis2_df_final["hf_uid_new"].nunique()
with_mfl = final_match_per_hf.dropna(subset=["hf_mfl_raw"])["hf_uid_new"].nunique()
without_mfl = total - with_mfl
# validation summary
cli_header("DHIS2-MFL Integration Complete")
cli_success(f"Total DHIS2 facilities preserved: {total}")
cli_info(f"Facilities with MFL data: {with_mfl}")
cli_warning(f"Facilities without MFL match: {without_mfl}")
cli_info(f"Total rows (facility-months): {len(final_dhis2_mfl_df)}")
### Step 10: Final Checks ------------------------------------------------------
#### Step 10.1: Detecting one-to-many matches for review -----------------------
# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many = (
final_facilities_all
.dropna(subset=["hf_mfl_raw"]) # only matched facilities
.groupby("hf_mfl_raw")
.filter(lambda g: len(g) > 1) # MFL matched to >1 DHIS2
.sort_values(["hf_mfl_raw", "hf_dhis2_raw"])
)
if len(one_to_many) > 0:
cli_warning(
f"One-to-many matches detected: {len(one_to_many)} DHIS2 facilities "
f"share {one_to_many['hf_mfl_raw'].nunique()} MFL facilities"
)
# show sample
print(
one_to_many[["hf_dhis2_raw", "hf_mfl_raw", "score", "final_method"]]
.head(6)
.to_string(index=False)
)
else:
cli_success("No one-to-many matches detected")
#### Step 10.2: Unmatched facilities analysis ----------------------------------
# analyze unmatched facilities by reason
unmatched_analysis = (
final_facilities_all
.loc[final_facilities_all["hf_mfl_raw"].isna()]
.groupby("final_method", as_index=False)
.size()
.rename(columns={"size": "count"})
.sort_values("count", ascending=False)
)
# display unmatched analysis
cli_header("Unmatched Facilities Analysis")
cli_info(f"Total unmatched facilities: {unmatched_analysis['count'].sum()}")
# show breakdown by reason
for _, row in unmatched_analysis.iterrows():
cli_warning(f"{row['final_method']}: {row['count']} facilities")
# create detailed unmatched facilities export
unmatched_detailed = (
final_facilities_one_per_hf
.loc[final_facilities_one_per_hf["hf_mfl_raw"].isna()]
[["hf_dhis2_raw", "adm1", "adm2", "adm3", "final_method", "score"]]
.assign(
potential_reason=lambda d: np.select(
[
d["final_method"].isna(),
d["final_method"] == "Manual Review - Rejected",
(d["score"] > 0) & (d["score"] < 50),
],
[
"Never matched - possible new facility",
"Rejected in review - possible different facility",
"Low similarity - possible data quality issue",
],
default="Unmatched - needs investigation",
)
)
.sort_values(["adm1", "adm2", "adm3", "hf_dhis2_raw"])
)
# export for SNT team review
unmatched_export_dir = Path(here("03_outputs"))
unmatched_export_dir.mkdir(parents=True, exist_ok=True)
unmatched_detailed.to_csv(
unmatched_export_dir / "unmatched_facilities_for_snt_review.csv", index=False
)
cli_success(
f"Exported {len(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli_info("File: outputs/unmatched_facilities_for_snt_review.csv")
### Step 11: Save Final Datasets -----------------------------------------------
# save main integrated DHIS2-MFL dataset
output_dir = Path(here("03_outputs"))
output_dir.mkdir(parents=True, exist_ok=True)
final_dhis2_mfl_df.to_excel(
output_dir / "final_dhis2_mfl_integrated.xlsx", index=False
)
# save matching results summary
final_facilities_all.to_excel(
output_dir / "facility_matching_results.xlsx", index=False
)
if len(unmatched_detailed) > 0:
unmatched_detailed.to_csv(
output_dir / "unmatched_dhis2_facilities.csv", index=False
)
if len(one_to_many) > 0:
one_to_many.to_csv(
output_dir / "one_to_many_matches_for_review.csv", index=False
)
# summary of saved files
cli_header("Final Datasets Saved")
cli_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli_success("Matching results: facility_matching_results.xlsx")
cli_info(f"Unmatched facilities: {len(unmatched_detailed)} exported")
cli_info(f"One-to-many matches: {len(one_to_many)} exported")
