

Fuzzy matching of names across datasets

Intermediate

Overview

On this page, we demonstrate how to apply fuzzy matching between DHIS2 and the master health facility list (MFL) in a structured, reviewable way. While the worked example uses DHIS2 and the MFL, analysts can apply the same fuzzy matching approach to any dataset with facility names, or any string field used as a join key across multiple datasets. For example, if you are trying to join a shapefile with tabular data and admin2 names are not joining properly, you may want to try fuzzy name matching on the admin2 names across the two datasets.

The MFL is the national reference list of health facilities. Maintained by the Ministry of Health or national statistics office, it provides standardized names, unique facility IDs, coordinates, type, and administrative units, and can also provide other information such as operational status and availability of particular services. The MFL is often the only dataset of health facilities that includes their geolocation.

Joining other datasets, such as routine surveillance, with the MFL therefore makes possible many kinds of subnational analysis. If unique health facility IDs are unavailable or unreliable across the MFL and the dataset to be joined, the join must instead rely on matching health facility names. However, variations in the spelling, abbreviations, word order, or formatting of health facility names can make this process challenging.

One possible solution that is reasonably scalable and automatable is fuzzy name matching, a technique that uses string similarity algorithms to identify close matches between text fields even when they differ slightly. While not perfect, this approach can improve the matching rate between datasets where facility names differ due to inconsistent formatting, spelling, or abbreviations, leaving fewer unmatched facilities for the analyst to resolve manually and through consultation with the SNT team. Fuzzy matches should still be reviewed, especially in important applications.

Objectives
  • Run diagnostics to check exact matches, admin alignment, and duplicates
  • Use stratified geographic matching with human validation for high-confidence matches
  • Standardize names and abbreviations to prepare unmatched data
  • Apply fuzzy matching on remaining facilities with candidate grids and similarity scores
  • Assess fuzzy match quality with diagnostics, visuals, and thresholds
  • Select a matching strategy (single, composite, weighted, or fallback) based on data needs
  • Merge stratified and fuzzy matches into one dataset
  • Save final matched output for analysis and integration

Understanding Fuzzy Matching

Fuzzy matching helps us reconcile text fields that don’t exactly align but are highly likely to be referring to the same entity, something that can happen when working with health facility names in real-world data. In this section, we explain what fuzzy matching is, why it’s needed, how it works, and what analysts need to consider when applying it in the context of SNT.

Why Exact Matching Isn’t Sufficient

In theory, if the same facility appears in two datasets, we should be able to join them directly by name. In practice, facility names may sometimes be spelled differently, use inconsistent abbreviations, or contain formatting differences that prevent a clean join.

For example, a facility might appear in DHIS2 as “Makeni Gov. Hosp” and in the MFL as “Makeni Government Hospital”. A direct match won’t work, even though the two clearly refer to the same place. Small differences in spelling, punctuation, or word order require a more flexible matching approach.

What Is Fuzzy String Matching?

Fuzzy string matching is a method for finding approximate matches between text strings (a string is the data type that programming languages use to represent text). Instead of requiring an exact match, fuzzy string matching calculates how similar two strings are, usually as a score between 0 and 100. A higher score means the strings are more alike.

This is useful when working with real-world facility names. Fuzzy matching allows us to say: “these two names are close enough to be considered a likely match,” and lets us decide where to draw the line between automatic match, manual review, or no match.

In the SNT process, health facility data can come from multiple sources with varying naming conventions. For example, the MFL might serve as the authoritative source, while the DHIS2 could contain operational variations of the same facility name. Fuzzy matching helps reconcile these differences systematically.
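To make the scoring idea concrete, here is a minimal sketch using Python's standard-library difflib. This generic character-overlap ratio is not one of the algorithms used in the R workflow later on this page; it just illustrates what a 0-100 similarity score looks like:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Score two strings from 0 (no overlap) to 100 (identical)."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

similarity("Makeni Govt Hospital", "Makeni Government Hospital")  # high score
similarity("Makeni Govt Hospital", "Kenema Clinic")               # low score
```

Pairs that clearly refer to the same facility score high, unrelated pairs score low, and the analyst decides where between those extremes to draw the acceptance threshold.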

Consult with SNT team

Before performing fuzzy matching, review data considerations with the SNT team. These considerations may include:

  • Facility name completeness (no missing values in key columns)
  • Local naming conventions and common abbreviations, so that you can set up fuzzy matching to account for them
  • Protocol for questionable matches

Below is an example output from fuzzy matching. The algorithm compares facility names from DHIS2 to entries in the MFL, assigning a similarity score to each match. Higher scores (closer to 100) indicate stronger similarity; lower scores may need manual review to determine acceptability or exclusion.

Facility Name (DHIS2)        Best Match in MFL               Score   Decision
Makeni Govt. Hospital        Makeni Government Hospital      96      High confidence
Makeni Goverment Hospital    Makeni Government Hospital      93      High confidence
Police CHC                   Police Community Health Center  90      High confidence
Loreto Clinic                Clinic Loreto                   88      High confidence
Centre Medical               Centre Médical                  85      High confidence
An-Nour Hospital             An-Noor Hospital                82      Accept with caution
Rahma Clinic                 Rahmah Clinic                   87      High confidence
Bo MCHP                      Bo Maternal Child Health Post   84      Accept with caution
Clinic A                     Kenema Government Hospital      41      Needs manual review
ABC Health Post              Makeni Government Hospital      29      Needs manual review

The table above illustrates fuzzy matching scenarios you may encounter. Below are some examples of the types of variations we’re addressing:

  • Abbreviations: Bo MCHP → Bo Maternal Child Health Post
  • Spelling inconsistencies or typos: Makeni Goverment Hospital → Makeni Government Hospital
  • Formatting differences: Police CHC → Police Community Health Center
  • Word order variations: Loreto Clinic → Clinic Loreto
  • Inconsistent accents: Centre Medical → Centre Médical
  • Inconsistent transliterations: Rahma Clinic → Rahmah Clinic

Choosing a Matching Strategy

Fuzzy name matching involves defining what counts as “similar” and deciding which kinds of differences (like spelling, order, or abbreviations) are acceptable. Here we discuss how to assess the quality of matches and when to use different types of string similarity algorithms. Understanding these concepts helps ensure the methods we apply are appropriate for the kinds of mismatches we expect in the data.

How Do We Know if a Match Is “Good”?

Fuzzy matching algorithms return similarity scores that estimate how alike two names are. But a high similarity score doesn’t always guarantee a correct match. For example, facilities with shared components like “community health post” may receive high scores, even if the match is incorrect. This highlights the value of looking beyond raw scores.

To improve confidence in matches, we can use additional diagnostics that assess match quality from different angles. These include:

  • Prefix Match: Names start with similar words or characters
  • Suffix Match: Names end in similar ways, such as shared final words
  • Token Difference: Similar number of words in each name
  • Character Difference: Overall lengths of the names are close to each other

These diagnostics are meant to complement similarity scores, not replace them. The diagnostics help identify cases where a high score may be misleading and guide how we filter, rank, or combine results across multiple algorithms.

Consider this example:

Let’s say Makeni Govt Hospital matches with Makeni Government Hospital. But how good is this match?

We can assess the quality using a few intuitive features:

The prefix matches: both start with Makeni. The suffix matches: both end with Hospital. The token difference is 0, both contain three words. The character difference is small, only 6 characters apart. The difference mainly comes from Govt vs Government.

These features suggest a strong match, even if the raw string similarity score is imperfect.
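These checks are simple to compute. A minimal Python sketch of the four diagnostics, applied to the example above:

```python
def match_diagnostics(a: str, b: str) -> dict:
    """Structural quality checks that complement a raw similarity score."""
    ta, tb = a.lower().split(), b.lower().split()
    return {
        "prefix_match": ta[0] == tb[0],        # same first word
        "suffix_match": ta[-1] == tb[-1],      # same last word
        "token_diff": abs(len(ta) - len(tb)),  # word-count gap
        "char_diff": abs(len(a) - len(b)),     # overall length gap
    }

match_diagnostics("Makeni Govt Hospital", "Makeni Government Hospital")
# {'prefix_match': True, 'suffix_match': True, 'token_diff': 0, 'char_diff': 6}
```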

Choosing Among String Matching Algorithms

Different algorithms are better suited to different mismatch types, such as typos, abbreviations, or word order. Choosing the right algorithm depends on the specific inconsistencies in your data, and you may want to try multiple algorithms to match across different types of inconsistencies. Below, we highlight the main algorithms, their strengths, and their limitations.

Levenshtein distance (edit distance)

Counts the number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another.

  • Best for: Typos and misspellings
  • Example: Makeni Goverment Hospital → Makeni Government Hospital
  • Limitations: Does not handle word swaps or phonetic differences
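A pure-Python sketch of the classic dynamic-programming edit distance (the worked example later relies on R's stringdist package rather than this implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

levenshtein("Makeni Goverment Hospital", "Makeni Government Hospital")  # 1
```

The single missing "n" in "Goverment" costs exactly one edit, so the distance is 1.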

Jaro-Winkler similarity

Measures character overlap and transpositions, with extra weight for common prefixes.

  • Best for: Prefix alignment and small rearrangements
  • Examples: Loreto Clinic → Clinic Loreto, or Makeni Govt Hospital → Makeni Government Hospital
  • Limitations: Weak with missing words or abbreviations
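For intuition, here is a compact Python sketch of the Jaro-Winkler computation (match window, transposition count, then a prefix boost); in practice you would use a tested library such as R's stringdist rather than this implementation:

```python
def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity with the Winkler common-prefix boost (0 to 1)."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    window = max(max(la, lb) // 2 - 1, 0)
    matched_a, matched_b = [False] * la, [False] * lb
    m = 0
    for i in range(la):  # find matching characters within the window
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not matched_b[j] and a[i] == b[j]:
                matched_a[i] = matched_b[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    sa = [a[i] for i in range(la) if matched_a[i]]
    sb = [b[j] for j in range(lb) if matched_b[j]]
    t = sum(x != y for x, y in zip(sa, sb)) / 2  # transpositions
    jaro = (m / la + m / lb + (m - t) / m) / 3
    prefix = 0  # shared prefix, capped at 4 characters
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

round(jaro_winkler("MARTHA", "MARHTA"), 4)  # 0.9611 (textbook example)
```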

Q-gram distance

Compares overlapping sequences (e.g., 2- or 3-character chunks) between strings.

  • Best for: Substring overlap and fuzzy substring detection
  • Example: Kenema Town MCHP → Kenema MCH Post
  • Limitations: Can over-penalize short names or spacing issues
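A short sketch of the common q-gram distance definition (the sum of absolute differences between the two strings' q-gram counts), shown with q = 2:

```python
from collections import Counter

def qgrams(s: str, q: int = 2) -> Counter:
    """Count the overlapping q-character chunks of a string."""
    s = s.lower()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the strings' q-gram counts."""
    ca, cb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ca[g] - cb[g]) for g in set(ca) | set(cb))

qgram_distance("abcd", "abce")  # 2: the grams 'cd' and 'ce' are unshared
```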

Longest Common Subsequence (LCS)

Finds the longest sequence of characters that appear left-to-right in both strings (not necessarily contiguously).

  • Best for: Partial matches with insertions
  • Example: St Mary Hosp → Saint Mary Hospital
  • Limitations: Ignores spacing and order changes beyond sequence alignment
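A dynamic-programming sketch of the LCS length in Python, using the example above:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    a, b = a.lower(), b.lower()
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            # extend the subsequence on a match, otherwise carry the best so far
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

lcs_length("St Mary Hosp", "Saint Mary Hospital")  # 12: all of "st mary hosp"
```

Every character of "St Mary Hosp" appears left-to-right inside "Saint Mary Hospital", so the LCS covers the shorter name entirely despite the inserted letters.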

Soundex and other phonetic algorithms

Convert words into phonetic codes based on how they sound.

  • Best for: Names with different spellings but similar pronunciation
  • Examples: Rahma Clinic → Rahmah Clinic, or An-Nour Hospital → An-Noor Hospital
  • Limitations: Fails with structural or abbreviation-based mismatches
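To show how phonetic codes collapse spelling variants, here is a simplified Soundex sketch in Python (it omits the H/W separator rule of the full specification):

```python
def soundex(word: str) -> str:
    """Simplified American Soundex code: first letter plus three digits."""
    mapping = {c: str(d) for d, letters in
               enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
               for c in letters}
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = mapping.get(letters[0])
    for c in letters[1:]:
        digit = mapping.get(c)  # vowels and h/w/y map to None
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]  # pad with zeros to four characters

soundex("Rahma") == soundex("Rahmah")  # True: identical phonetic codes
soundex("Nour") == soundex("Noor")     # True
```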

Best Practices for an Effective Fuzzy Matching Workflow

Fuzzy matching is only as effective as the preparation and decision rules that support it. The tips below improve the accuracy, transparency, and efficiency of your matching workflow:

Tip 1: Work from a cleaned copy of the original name column

Always preserve the original facility name column and perform all cleaning steps on a new copy. This ensures:

  • You can always refer back to the original, untouched names
  • All joins and fuzzy matching occur on the cleaned and standardized version
  • You retain the ability to inspect and manually resolve low-confidence or unmatched records later using the original, unformatted names

This separation is especially useful for manual matching workflows, where human judgment relies on seeing names as they originally appeared.

Tip 2: Preprocess text to reduce noise

Before running any matching algorithm, clean and normalize your text. The list below provides several suggestions, but you will want to confirm whether each makes sense in the context in which you are working:

  • Remove extra spaces to prevent false mismatches
    • Trim whitespace: Remove extra spaces at the beginning or end of text
    • Collapse multiple spaces: Replace repeated spaces with a single space
  • Standardize text characters to ensure consistency
    • Convert all text to lowercase: Ensure consistency regardless of original casing
    • Normalize accented characters: Convert characters like “é” to “e”
  • Standardize non-text characters
    • Standardize punctuation: Replace or remove punctuation to reduce irrelevant variation
    • Convert Roman numerals to numbers: e.g., “Koinadugu II CHC” → “Koinadugu 2 CHC”
  • Standardize word use, if applicable
    • Sort words alphabetically within each name: Handle word order differences by splitting names into words, sorting them, and rejoining. For example, both “Port Loko CHC” and “CHC Port Loko” become “CHC Loko Port”, making them directly comparable despite naming convention differences.

This step reduces irrelevant variation and reduces matching burden on the algorithms.
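Several of these suggestions can be sketched in a single Python cleaning function (Roman-numeral conversion and other context-specific rules are left out; confirm each rule fits your context before applying it):

```python
import re
import unicodedata

def clean_name(name: str, sort_tokens: bool = False) -> str:
    """Normalize a facility name before matching (illustrative sketch)."""
    s = unicodedata.normalize("NFKD", name)
    s = "".join(c for c in s if not unicodedata.combining(c))  # é -> e
    s = s.lower()                        # case-insensitive comparison
    s = re.sub(r"[^\w\s]", " ", s)       # punctuation to spaces
    s = re.sub(r"\s+", " ", s).strip()   # trim and collapse whitespace
    if sort_tokens:
        s = " ".join(sorted(s.split()))  # neutralize word-order differences
    return s

clean_name("  Centre  Médical. ")              # 'centre medical'
clean_name("CHC Port Loko", sort_tokens=True)  # 'chc loko port'
```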

Tip 3: Handle abbreviations and domain-specific terms

Abbreviations and local terms in facility names can lead to mismatches if not standardized. Depending on your context, you can either expand abbreviations (e.g., CHC → Community Health Center) or contract long forms into standard short forms. Each approach may yield different matching outcomes. What matters is that you apply the transformation consistently to both datasets.

Examples of variations:

  • CHC vs Community Health Center
  • PHU vs Peripheral Health Unit
  • MCHP vs Maternal and Child Health Post

Recommended practices:

  • Build a simple abbreviation dictionary for your context
  • Apply substitution rules (for example, replace all CHC with Community Health Center) before running matching algorithms
  • Use domain-specific expansions or contractions where appropriate

Tip: Identifying likely abbreviations

You can scan your facility name fields for potential abbreviations by looking for short, all-uppercase tokens that appear within longer strings. A common heuristic is:

Look for 2–5 letter sequences in all capital letters that appear standalone among lowercase or mixed-case words.

This often finds abbreviations like CHC, PHU, or MCHP embedded in names such as Bo MCHP or Kailahun PHU. You can do this by eye, or with code.
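As a sketch of both ideas, the snippet below applies a small substitution dictionary and implements the all-caps heuristic (the dictionary entries are illustrative; build your own for your context):

```python
import re

# example abbreviation dictionary; adapt to your context
ABBREVIATIONS = {
    "chc": "community health center",
    "phu": "peripheral health unit",
    "mchp": "maternal and child health post",
}

def expand_abbreviations(name: str) -> str:
    """Replace known short forms with long forms (lowercases the name)."""
    return " ".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in name.split())

def find_candidate_abbreviations(names: list[str]) -> set[str]:
    """Heuristic from the tip: standalone 2-5 letter all-caps tokens."""
    pattern = re.compile(r"\b[A-Z]{2,5}\b")
    return {token for name in names for token in pattern.findall(name)}

expand_abbreviations("Bo MCHP")  # 'bo maternal and child health post'
find_candidate_abbreviations(["Bo MCHP", "Kailahun PHU"])  # {'MCHP', 'PHU'}
```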

Tip 4: Limit matching scope using geographic information

Fuzzy matching is more accurate and efficient when restricted to a plausible geographic scope. Instead of comparing every DHIS2 facility name to every MFL name in the entire country, limit comparisons to within the same administrative unit, such as adm1 (region) or adm3 (chiefdom), depending on the available data quality.

  • For example, only compare names where adm1 and adm3 are the same for both facilities
  • This speeds up processing and ensures that identically named facilities in different geographic areas are not mistakenly matched to each other

This approach assumes that:

  • The administrative fields (adm1, adm2, adm3) are already cleaned and harmonized across both the DHIS2 and MFL datasets
  • Facilities are correctly assigned to their administrative units, with no major misplacements or missing assignments
  • Remaining unmatched facilities are indeed name mismatches, not geographic mismatches needing spatial correction

The fuzzy matching approaches outlined on this page can be used to clean and match admin unit names.

By grounding fuzzy matching in geographic logic, this step improves both the relevance and reliability of the matching results.
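The geographic constraint can be sketched in Python as a grouped candidate generator, assuming simple lists of record dictionaries with adm1/adm3 fields (the records below are hypothetical):

```python
from collections import defaultdict

def candidate_pairs(dhis2, mfl, keys=("adm1", "adm3")):
    """Yield (dhis2_name, mfl_name) pairs only within the same admin unit."""
    by_admin = defaultdict(list)
    for rec in mfl:
        by_admin[tuple(rec[k] for k in keys)].append(rec["hf"])
    for rec in dhis2:
        for name in by_admin[tuple(rec[k] for k in keys)]:
            yield rec["hf"], name

dhis2 = [{"adm1": "SOUTHERN", "adm3": "BO TOWN", "hf": "Aethel CHP"}]
mfl = [
    {"adm1": "SOUTHERN", "adm3": "BO TOWN", "hf": "Aethel CHP"},
    {"adm1": "NORTH EAST", "adm3": "MARA", "hf": "Kiampkakolo MCHP"},
]
list(candidate_pairs(dhis2, mfl))  # only the within-BO TOWN pair is compared
```

Grouping the MFL once and comparing only within each admin unit also makes the candidate grid far smaller than an all-against-all comparison.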

Tip 5: Apply multiple similarity algorithms

Rather than relying on a single algorithm, select the best match based on the highest score across multiple matching algorithms, all while considering supporting quality checks. This match-first approach ensures flexibility and avoids locking decisions to one metric.

This is important because each fuzzy matching algorithm captures a different aspect of similarity. For example:

  • Levenshtein: Good for typos and deletions
  • Jaro-Winkler: Sensitive to character transpositions and prefix agreement
  • Qgram / LCS: Detect reordered or partial word overlaps
  • Rank-Ensemble: Prioritizes agreement across methods by comparing rank positions
  • Composite Scores: Integrates multiple metrics for better consensus

By comparing and combining outputs from these methods, you can better handle the diverse inconsistencies found in real-world datasets.
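As a sketch of the composite idea, the snippet below averages two simple stdlib measures, a character ratio and a word-overlap ratio; the actual workflow combines the stringdist metrics listed above:

```python
from difflib import SequenceMatcher

def char_score(a: str, b: str) -> float:
    """Character-level similarity, 0 to 100."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def token_score(a: str, b: str) -> float:
    """Word-overlap (Jaccard) similarity, 0 to 100; ignores word order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 100 * len(ta & tb) / len(ta | tb)

def composite_score(a: str, b: str, w_char: float = 0.5) -> float:
    """Weighted blend of the two measures."""
    return w_char * char_score(a, b) + (1 - w_char) * token_score(a, b)

composite_score("Loreto Clinic", "Clinic Loreto")  # token overlap rescues reordering
```

Here the reordered name scores poorly on characters alone but perfectly on word overlap, so the blend ranks it higher than either reliance on a single metric would.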

Tip 6: Review and validate with the SNT team

Automated matching can produce strong results, but human review is required. Always check:

  • High-confidence matches (85+): Very likely a true match and can be quickly verified by eye
  • Mid-confidence matches (70–84): May require correction
  • Low scores (below 70): Likely to be false matches
  • Similar-looking names in different districts: May indicate over-matching
  • Unmatched facilities: May need manual review or database updates

SNT Team Review Required

Always conduct a structured review of final match results with the SNT team. Validation ensures the SNT team applies local knowledge and prevents misclassification, especially when matches inform planning, facility mapping, or coverage analysis.
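The review bands above can be encoded as a simple triage helper (thresholds taken from this tip; adjust them to your context):

```python
def triage(score: float) -> str:
    """Bucket a similarity score into the review bands described above."""
    if score >= 85:
        return "high confidence"
    if score >= 70:
        return "mid confidence: review"
    return "likely false match"

[triage(s) for s in (96, 78, 41)]
```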

Matching Workflow Overview

This overview explains the end-to-end reconciliation used in Sierra Leone to match DHIS2 facility names to the national MFL. The workflow is auditable, prioritizes high-confidence matches first, and produces a single review-ready table with the selected match, similarity diagnostics, decision flags (accept/review), and stable IDs for downstream use.

Two-Phase Approach

  • Phase 1: Stratified geographic matching (adm2/adm3): Anchor comparisons within administrative boundaries to capture most matches with high confidence and expose misassigned admins early
  • Phase 2: Fuzzy name matching on remainders: Standardize names, generate candidates without geographic constraints, score with multiple algorithms, and use diagnostics to select matches for review/acceptance

11-Step Process Summary

Follow these 11 steps; detailed, runnable code appears in the next section.

  1. Install and load required libraries for data manipulation, fuzzy string matching, and file handling.
  2. Load data from DHIS2 and Master Facility List sources with initial cleaning.
  3. Conduct initial matching diagnostics to assess exact match rates, administrative alignment, and duplicate name patterns.
  4. Apply interactive stratified geographic matching to resolve high-confidence matches using geographic constraints and human validation.
  5. Process and prepare unmatched data by standardizing names and handling abbreviations to reduce variation.
  6. Perform fuzzy matching on remaining unmatched by constructing candidate grids, computing similarity scores, creating composite scores, and extracting best matches.
  7. Evaluate fuzzy match quality using diagnostics, visualizations, and weighted thresholding based on structural quality.
  8. Finalize fuzzy match selection using one of four strategies: single method, composite score, weighted composite, or progressive fallback loop.
  9. Combine all matched results from stratified matching and fuzzy matching approaches into a comprehensive dataset.
  10. Perform final checks to validate the matching results and identify any remaining issues.
  11. Save final datasets including matched results, unmatched facilities, and summary statistics for downstream analysis.

Together, the two phases minimize manual reconciliation while keeping ambiguous cases transparent for SNT team review.

Workflow Diagram

The diagram below summarizes the phased process from diagnostics to final output.

flowchart TD
    Start[DHIS2 Health Facilities] --> Step3[Step 3: Initial Matching<br/>Diagnostics]

    Step3 --> Step4[Step 4: Interactive Stratified<br/>Geographic Matching]
    Step3 --> Step5[Step 5: Process & Prepare<br/>Unmatched Data]:::altPath

    Step4 --> Decision1{Matched?}
    Decision1 -->|Yes | Matched1[High-Confidence<br/>Matched Records]
    Decision1 -->|No | Step5

    Step5 --> Step6[Step 6: Fuzzy Matching<br/>on Remaining Unmatched]
    Step6 --> Step7[Step 7: Evaluate Fuzzy<br/>Match Quality]
    Step7 --> Step8[Step 8: Finalize Fuzzy<br/>Match Selection]

    Step8 --> Decision2{Matched?}
    Decision2 -->|Yes| Matched2[Fuzzy Matched<br/>Records]
    Decision2 -->|No| Unmatched[Final Unmatched<br/>for Manual Review]

    Matched1 --> Step9[Step 9: Combine All<br/>Matched Results]
    Matched2 --> Step9
    Unmatched --> Step9

    Step9 --> Step10[Step 10: Final<br/>Checks]
    Step10 --> Step11[Step 11: Save<br/>Final Datasets]

    classDef altPath fill:#d1c4e9,stroke:#673ab7,stroke-width:2px;

    style Start fill:#e1f5fe
    style Step3 fill:#e8f5e9
    style Step4 fill:#fff3e0
    style Step5 fill:#f3e5f5
    style Step6 fill:#f3e5f5
    style Step7 fill:#fff9c4
    style Step8 fill:#f3e5f5
    style Step9 fill:#e8f5e9
    style Step10 fill:#fff9c4
    style Step11 fill:#c8e6c9
    style Matched1 fill:#a5d6a7
    style Matched2 fill:#ffcc80
    style Unmatched fill:#ffab91

When to follow each path
  • For direct fuzzy matching without geographic stratification, follow the purple path from Step 3 → Step 5 (bypassing Step 4).
  • For the complete matching workflow with geographic stratification, follow all steps sequentially from Step 3 through Step 11.

This gives you flexibility: skip stratification if you only need name-based matching, or include it when location context is important.

After stratified geographic matching, a smaller set of facilities remains unresolved, usually because of misassigned administrative units, newly opened or closed facilities, or name variations too large for geographic matching to resolve. These unmatched records then move into Step 5, where standardization and fuzzy matching provide another opportunity to recover valid matches.

Step-by-Step

Follow the 11-step workflow outlined in Matching Workflow Overview. The sections below provide runnable, Sierra Leone–specific code for each step and notes on how to adapt it to your context.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Install and Load Required Libraries

Install and load the necessary packages for data manipulation, fuzzy string matching, and file handling.

  • R
  • Python
# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load all required packages using pacman
pacman::p_load(
  readxl,     # for reading Excel files
  dplyr,      # for data manipulation
  stringdist, # for calculating string distances (fuzzy matching)
  tibble,     # for working with modern data frames
  knitr,      # for creating formatted tables
  openxlsx,   # for writing Excel files
  httr,       # for HTTP requests to download files (optional)
  here        # for cross-platform file paths
)

Step 2: Load Data

This step imports the DHIS2 and MFL health facility datasets. It then creates dhis2_hf_df, which contains the unique administrative units and facility names for fuzzy matching. Finally, it prints samples from each dataset so you can review the columns, preview the first rows, and confirm the outputs look as expected.

  • R
  • Python
# set up path to dhis2 hf data
dhis2_path <- here::here(
  "01_data",
  "02_epidemiology",
  "2a_routine_surveillance",
  "processed"
)

hf_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1b_health_facilities",
  "processed"
)

# read DHIS2 health facilities data
dhis2_df <- readRDS(
  here::here(dhis2_path, "sle_dhis2_with_clean_adm3.rds")
) |>
  # rename facility name columns for clarity and consistency
  dplyr::rename(hf_dhis2_raw = hf)

# get distinct admin and hf cols
dhis2_hf_df <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw)

# read MFL health facilities data
master_hf_df <- read.csv(
  here::here(hf_path, "hf_final_clean_data.csv")
) |>
   dplyr::distinct(
    adm0, adm1, adm2, adm3, hf, lat, long, .keep_all = TRUE
  ) |>
  # copy the facility name into a standardized column
  dplyr::mutate(hf_mfl_raw = hf)

# attach a stable DHIS2 facility ID for consistent counting across
# steps
# distinct by geography + name to avoid collisions across admins
dhis2_map <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf) |>
  dplyr::mutate(
    hf_uid_new = paste0(
      "hf_uid_new::",
      as.integer(as.factor(paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf)),
        sep = "|"
      )))
    )
  )

# print head of data
cli::cli_h3("Sample of DHIS2 data:")
head(dhis2_hf_df)

cli::cli_h3("Sample of MFL data:")
head(master_hf_df)
Output
Sample of DHIS2 data:
adm0 adm1 adm2 adm3 hf hf_dhis2_raw
SIERRA LEONE SOUTHERN BO BO TOWN Aethel CHP Aethel CHP
SIERRA LEONE SOUTHERN BO BO TOWN Agape Way CHP Agape Way CHP
SIERRA LEONE SOUTHERN BO BO TOWN Anglican Diocese Clinic Anglican Diocese Clinic
SIERRA LEONE SOUTHERN BO KAKUA Bandajuma MCHP Bandajuma MCHP
SIERRA LEONE SOUTHERN BO BOAMA Baoma Station CHP Baoma Station CHP
SIERRA LEONE SOUTHERN BO VALUNIA Baomahun CHC Baomahun CHC
Sample of MFL data:
adm0 adm1 adm2 adm3 community hf type Ownership functional lat long
SIERRA LEONE NORTH EAST BOMBALI MARA Kiampkakolo Kiampkakolo MCHP MCHP Government Functional 8.610318 -12.20295
SIERRA LEONE NORTH EAST BOMBALI MAKENI CITY Station Road/Vincent Kanu Road Loreto Clinic CLINIC Faith Based Functional 8.886622 -12.04387
SIERRA LEONE NORTH EAST BOMBALI MAKENI CITY Makama Road Makeni Govt. Hospital HOSPITAL Government Functional 8.871770 -12.05627
SIERRA LEONE NORTH EAST BOMBALI MAKENI CITY Arabic College Road Tonko CHP CHP Government Functional 8.888880 -12.03255
SIERRA LEONE NORTH EAST BOMBALI MAKENI CITY Fullah Town Community Fullah Town CHP CHP Government Functional 8.903324 -12.04545
SIERRA LEONE NORTH EAST BOMBALI BOMBALI SEBORA Makama Community Makama CHP CHP Government Functional 8.862456 -12.05584

To adapt the code:

  • Lines 2–7: Update dhis2 file paths to match the location of your data files
  • Lines 9–14: Update MFL file paths to match the location of your data files
  • Lines 18, 29: Change file names to match your specific data files
  • Lines 21, 35: Update hf and the renamed columns (hf_mfl_raw, hf_dhis2_raw) to match the actual health facility name columns in your datasets

Step 3: Initial Matching Diagnostics

Before attempting any matching, we assess the data systematically to understand the matching challenge. We break this into three focused checks: overall exact matches, administrative-level matching patterns, and duplicate name issues.

Step 3.1: Overall exact match check

First, we check overall exact matches without administrative constraints and calculate the total matching potential.

  • R
  • Python
Show the code
# check for exact matches on raw names (no admin constraint)
exact_matches_all <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw"),
    relationship = "many-to-many"
  )

# calculate matching potential
total_dhis2 <- nrow(dhis2_hf_df)
total_mfl <- nrow(master_hf_df)
unmatched_dhis2 <- total_dhis2 - nrow(exact_matches_all)

cli::cli_h3("Overall Matching Summary")
cli::cli_alert_info("Total DHIS2 facilities: {total_dhis2}")
cli::cli_alert_info("Total MFL facilities: {total_mfl}")
cli::cli_alert_success(
  paste0(
    "Exact matches found: {nrow(exact_matches_all)} (",
    "{round(nrow(exact_matches_all)/total_dhis2*100, 1)}%)"
  )
)
cli::cli_alert_warning("Remaining to match: {unmatched_dhis2}")
Output
── Overall Matching Summary 
ℹ Total DHIS2 facilities: 1771
ℹ Total MFL facilities: 1554
✔ Exact matches found: 1082 (61.1%)
! Remaining to match: 689

To adapt the code:

  • Line 5: Update join columns to match your facility name fields

Roughly 61% of facilities match exactly by name, as shown in the output above. This baseline establishes the scope of work before geographic or fuzzy matching approaches are applied.

Step 3.2: Administrative level match check

Next, we examine how well facilities match when we consider administrative boundaries, checking matches at both adm2 (district) and adm3 (chiefdom) levels.

  • R
  • Python
Show the code
# check matches at adm2 level (district)
dhis2_by_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(total_dhis2 = dplyr::n())

matches_by_adm2 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2")
  ) |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(exact_matches = dplyr::n()) |>
  dplyr::left_join(dhis2_by_adm2, by = "adm2") |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::select(adm2, exact_matches, total_dhis2, match_rate) |>
  dplyr::arrange(dplyr::desc(match_rate))

cli::cli_h3("Exact matches by district (adm2)")
matches_by_adm2

# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(total_dhis2 = dplyr::n(), .groups = "drop")

matches_by_adm3 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2", "adm3")
  ) |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(exact_matches = dplyr::n(), .groups = "drop") |>
  dplyr::left_join(dhis2_by_adm3, by = c("adm2", "adm3")) |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::filter(total_dhis2 >= 5) |> # Only show areas with 5+ facilities
  dplyr::arrange(dplyr::desc(match_rate)) |>
  dplyr::slice_head(n = 10) # Show top 10 performing adm3 areas

cli::cli_h3("Top exact matches by chiefdom (adm3)")
matches_by_adm3
Output
Exact matches by district (adm2)
adm2 exact_matches total_dhis2 match_rate
KARENE 53 62 85.5
KAMBIA 62 75 82.7
TONKOLILI 95 122 77.9
PORT LOKO 87 122 71.3
BOMBALI 74 106 69.8
PUJEHUN 68 104 65.4
KOINADUGU 45 69 65.2
MOYAMBA 73 115 63.5
KAILAHUN 57 96 59.4
BONTHE 54 95 56.8
BO 95 172 55.2
KENEMA 83 151 55.0
FALABA 28 51 54.9
KONO 72 137 52.6
WESTERN RURAL 40 88 45.5
WESTERN URBAN 81 206 39.3
Top exact matches by chiefdom (adm3)
adm2 adm3 exact_matches total_dhis2 match_rate
BOMBALI KAMARANKA 5 5 100
BOMBALI PAKI MASABONG 7 7 100
BOMBALI SAFROKO LIMBA 10 10 100
KAMBIA GBINLE 5 5 100
KARENE ROMENDE 5 5 100
KARENE SANDA MAGBOLONTOR 5 5 100
MOYAMBA TIMDALE 5 5 100
PORT LOKO KAMASONDO 8 8 100
PUJEHUN SOWA 5 5 100
TONKOLILI KHOLIFA MAMUNTHA/MAYOSSO 6 6 100

To adapt the code:

  • Line 9: Update join columns to match your facility name fields
  • Line 11: Modify the administrative unit column (adm2) to match your district level
  • Lines 25, 31: Update adm2 and adm3 columns to match your sub-district level

District match rates vary widely, from 86% (Karene) to 39% (Western Urban). Some chiefdoms achieve 100% matches while others struggle. This shows which areas have clean, standardized names versus those needing intensive matching work.

Step 3.3: Duplicate name check

Finally, we check duplicate facility names to understand potential matching complications.

  • R
  • Python
Show the code
# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_dhis2_raw)

mfl_dups_adm2 <- master_hf_df |>
  dplyr::group_by(adm2, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_mfl_raw)

cli::cli_h3("Duplicates within same district (adm2)")
cli::cli_alert_warning(
  paste0(
    "DHIS2 duplicates within districts: ",
    "{length(unique(dhis2_dups_adm2$hf_dhis2_raw))}"
  )
)
cli::cli_alert_warning(
  paste0(
    "MFL duplicates within districts: ",
    "{length(unique(mfl_dups_adm2$hf_mfl_raw))}"
  )
)

# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_dhis2_raw)

mfl_dups_adm3 <- master_hf_df |>
  dplyr::group_by(adm2, adm3, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_mfl_raw)

cli::cli_h3("Duplicates within same chiefdom (adm3)")
cli::cli_alert_danger(
  paste0(
    "DHIS2 duplicates within chiefdoms: ",
    "{length(unique(dhis2_dups_adm3$hf_dhis2_raw))}"
  )
)
cli::cli_alert_danger(
  paste0(
    "MFL duplicates within chiefdoms: ",
    "{length(unique(mfl_dups_adm3$hf_mfl_raw))}"
  )
)

# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall <- dhis2_hf_df |>
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_dhis2_raw)

mfl_dups_overall <- master_hf_df |>
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_mfl_raw)

cli::cli_h3("Overall duplicate names (across all areas)")
cli::cli_alert_info(
  "DHIS2 overall duplicates: {length(
    unique(dhis2_dups_overall$hf_dhis2_raw)
  )}"
)
cli::cli_alert_info(
  "MFL overall duplicates: {length(
    unique(mfl_dups_overall$hf_mfl_raw)
  )}"
)
Output
── Duplicates within same district (adm2) 
! DHIS2 duplicates within districts: 18
! MFL duplicates within districts: 7
── Duplicates within same chiefdom (adm3) 
✖ DHIS2 duplicates within chiefdoms: 0
✖ MFL duplicates within chiefdoms: 1
── Overall duplicate names (across all areas) 
ℹ DHIS2 overall duplicates: 19
ℹ MFL overall duplicates: 49

To adapt the code:

  • Lines 3–4, 8–9: Update adm2 and facility name columns
  • Lines 28–29, 33–34: Update adm2, adm3 and facility name columns
  • Lines 54, 56, 59, 61: Update facility name columns for overall duplicates

Duplicates are rare at the chiefdom level but more noticeable within districts, where DHIS2 lists more than twice as many as the MFL (18 vs 7). Nationally, the MFL has far more duplicate names (49 vs 19). These patterns highlight the need to use geography and unique IDs alongside names to avoid mismatches.

Step 4: Interactive Stratified Geographic Matching

Before standardizing names or performing fuzzy matching, geographic stratification with user judgment can be used to resolve most facility matches through interactive matching. This approach combines the precision of human review with the efficiency of automated suggestions, using the approach described in the Merging shapefiles with tabular data but applied here at the health facility level.

The sntutils::prep_geonames() function handles this entire process, including internal name standardization, geographic stratification, and interactive or automated matching based on user needs.

  • R
  • Python
Show the code
# inner join (keep only matched polygons)
# set up location to save cache
cache_loc <- "1.1_foundational/1d_cache_files"

# interactive stratified matching with automatic standardization
# this function handles name standardization internally
dhis2_df_cleaned <-
  sntutils::prep_geonames(
    target_df = dhis2_hf_df, # dataset to be cleaned
    lookup_df = master_hf_df, # reference dataset with correct admin
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    level4 = "hf",
    cache_path = here::here(cache_loc, "geoname_cache.rds"),
    unmatched_export_path = here::here(cache_loc, "dhis2_hf_unmatched.rds")
  )

# load the unmatched facilities for further processing (step 5–8 fuzzy
# matching)
dhis2_hf_to_process <- readRDS(
  here::here(cache_loc, "dhis2_hf_unmatched.rds")
) |>
  dplyr::select(adm0, adm1, adm2, adm3, hf_dhis2_raw = hf)

# summary statistics
n_original <- nrow(dhis2_hf_df)
n_matched <- n_original - nrow(dhis2_hf_to_process)
match_rate <- (n_matched / n_original) * 100

cli::cli_alert_success(
  paste0(
    "Stratified matching completed: ",
    "{format(n_matched, big.mark = ',')}/{format(n_original, big.mark = ',')}",
    " facilities matched ({round(match_rate, 1)}%)"
  )
)
cli::cli_alert_info(
  "Remaining unmatched: {nrow(dhis2_hf_to_process)} facilities"
)
Output
ℹ Match Summary:
• adm0 (level 0): 1 out of 1 matched
• adm1 (level 1): 5 out of 5 matched
• adm2 (level 2): 16 out of 16 matched
• adm3 (level 3): 208 out of 208 matched
• hf (level 4): 1,440 out of 1,711 matched
✔ In non-interactive mode. Exiting after matching with cache...
✔ Unmatched data exported to: /Users/mohamedyusuf/ahadi-analytics/GitHub/snt-code-library/english/data_r/shapefiles/dhis2_hf_unmatched.rds
ℹ Exported 271 unique unmatched rows for column 'hf'
✔ Stratified matching completed: 1,500/1,771 facilities matched (84.7%)
ℹ Remaining unmatched: 271 facilities

To adapt the code:

  • Line 3: Update cache_loc to point to the folder where you want to save both the cache file and the unmatched facilities export. This should be a stable location that can be reused in later steps (e.g., "1.1_foundational/1d_cache_files").
  • Lines 9-10: Replace dhis2_hf_df and master_hf_df with your dataset names
  • Lines 11-15: Update column names to match your geographic hierarchy (adm0, adm1, adm2, adm3) and facility name column (hf)
  • Lines 16-17: Update file paths for both the cache (cache_path) and the unmatched facilities export (unmatched_export_path), which will be used as the input for the fuzzy matching steps (5–8)
  • Line 23: Use the same path as unmatched_export_path to load your unmatched facilities data for the subsequent cleaning and fuzzy matching steps.

With our interactive stratified matching, we achieved a high success rate as shown in the output above. All administrative levels matched perfectly, but some facilities remain unmatched at the facility name level. The results suggest that combining geographic constraints with human validation is effective, leaving a smaller subset of facilities still to be addressed.

Review with SNT team

The cached results should always be validated with the SNT team before integration. This review step helps to:

  • Confirm matches using local knowledge, especially for common or similar names
  • Resolve ambiguities where facilities are misassigned or have multiple candidates
  • Ensure consistency with national naming standards and classifications
  • Document decisions to support future harmonization and reduce repeated effort

The cache is both a validation record and a resource for improving data quality over time. The table below shows the contents of the saved cache from the stratified matching process (cache_path).
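
To inspect those contents, the saved cache can be loaded directly. A minimal sketch, assuming the cache_loc folder from Step 4 (file.path stands in for here::here to keep the example self-contained):

```r
# load and preview the stratified-matching cache for SNT team review
# (assumes sntutils::prep_geonames() has already written this file)
cache_file <- file.path(
  "1.1_foundational/1d_cache_files", "geoname_cache.rds"
)

if (file.exists(cache_file)) {
  geoname_cache <- readRDS(cache_file)
  print(head(geoname_cache, 10)) # first ten recorded match decisions
}
```

If the file is missing, nothing is printed; run the Step 4 matching first so the cache exists at this path.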

The cache can also be reused in future runs to avoid repeating interactive matching. It preserves past decisions, ensures consistency, and provides an efficient starting point for updating facility names with the SNT team.

The next step is to run these unmatched facilities through the standardization and fuzzy matching pipeline to recover any remaining valid matches.

Step 5: Process and Prepare Unmatched Data for Fuzzy Matching

After stratified geographic matching, a smaller set of facilities remain unresolved. These are typically due to misassigned administrative units, newly opened or closed facilities, or name variations too large for geographic matching to resolve. To address these cases, Step 5 focuses on preparing the unmatched data by standardizing names and handling abbreviations to reduce variation before applying fuzzy matching.

Step 5.1: Standardize health facility names

Now we apply standardization to the unmatched facilities. This step helps resolve matches that failed due to formatting differences:

As outlined in the Best Practices for an Effective Fuzzy Matching Workflow section, we begin by standardizing the health facility name columns. To do this, we first apply common text cleaning operations such as converting to lowercase, trimming whitespace, removing punctuation, collapsing multiple spaces, normalizing accented characters and converting Roman numerals (e.g., II, III) into standard numeric form (e.g., 2, 3). Each name is then split into individual words, sorted alphabetically, and rejoined, ensuring that names with the same words in different orders are treated consistently. These preprocessing steps reduce irrelevant variation and improve the reliability of downstream fuzzy matching.

In this illustrative example, we apply every standardization option so that code is available for all. However, for your use case, you may decide that some of these are unnecessary or lead to unanticipated mismatches. Review carefully which standardization options you would like to include in your workflow. You may want to create multiple versions with more and fewer standardization steps included, then review them all together to make your decision.

  • R
  • Python
# create function to standardize hf names
standardize_names <- function(name_vec) {
  # validate input
  if (!rlang::is_atomic(name_vec)) {
    cli::cli_abort("`name_vec` must be an atomic vector.")
  }

  name_vec |>
    # ensure character type
    as.character() |>
    # convert to lowercase
    stringr::str_to_lower() |>
    # replace punctuation with space
    stringr::str_replace_all("[[:punct:]]", " ") |>
    # remove extra spaces and trim
    stringr::str_squish() |>
    # normalize accents
    stringi::stri_trans_general("Latin-ASCII") |>
    # normalize all space-like characters
    stringi::stri_replace_all_regex("\\p{Zs}+", " ") |>
    # convert roman numerals to arabic numerals
    stringr::str_replace_all(
      c(
        " ix\\b" = " 9",
        " viii\\b" = " 8",
        " vii\\b" = " 7",
        " vi\\b" = " 6",
        " v\\b" = " 5",
        " iv\\b" = " 4",
        " iii\\b" = " 3",
        " ii\\b" = " 2",
        " i\\b" = " 1"
      )
    ) |>
    # sort tokens with letters first, numbers last; within groups
    # alphabetical
    purrr::map_chr(\(.x) {
      # split on one-or-more spaces
      tokens <- strsplit(.x, " +")[[1]]

      # detect pure numeric tokens
      is_num <- stringr::str_detect(tokens, "^[0-9]+$")

      # order alphabetic first, then numeric; sort within each group
      ordered <- c(sort(tokens[!is_num]), sort(tokens[is_num]))

      # rejoin
      paste(ordered, collapse = " ")
    })
}

# set up example with messy formatting
example_word <- factor("Clínica! Rahmâ    IV (  New clinic) East")

# print original structure
cat("\nExample before standardization:\n")
str(example_word)

# apply standardization
example_word_st <- standardize_names(example_word)

# print cleaned example
cat("\nExample after standardization:\n")
str(example_word_st)
Output

Example before standardization:
 Factor w/ 1 level "Clínica! Rahmâ    IV (  New clinic) East": 1

Example after standardization:
 chr "clinic clinica east new rahma 4"

To adapt the code:

  • Lines 1–50: If you would like to keep the full suite of standardizations, there is no need to modify the standardize_names function. Otherwise, remove the standardizations you wish to reject

You can see that our function worked: it turned the original factor Clínica! Rahmâ IV ( New clinic) East into a clean character string clinic clinica east new rahma 4. This confirms that the standardization steps worked as intended.

  • The original name had extra spaces, accents, punctuation, and Roman numerals.
  • All text was converted to lowercase.
  • Punctuation was replaced with spaces, and multiple spaces were collapsed into one.
  • Accents were removed, and Roman numerals were converted to numbers.
  • Words were split, sorted alphabetically, and any numbers placed at the end.

We’ll now apply the standardization function to both the DHIS2 and MFL facility name columns to ensure consistent formatting. This helps ensure that string similarity scores reflect true differences, not artifacts of inconsistent text formatting.

  • R
  • Python
# remove duplicates from MFL and format hf col
master_hf_df <- master_hf_df |>
  dplyr::distinct(hf_mfl_raw, .keep_all = TRUE) |>
  dplyr::mutate(hf_mfl = standardize_names(hf_mfl_raw))

# important: preserve original full dataset before processing unmatched
# (this is our base N)
dhis2_hf_df_original <- dhis2_hf_df |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw))

# build geo-aware stable ID once and attach to originals and unmatched
hf_uid_new_map <- dhis2_hf_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::mutate(
    hf_uid_new = paste0(
      "hf_uid_new::",
      as.integer(as.factor(paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf_dhis2_raw)),
        sep = "|"
      )))
    )
  )

dhis2_hf_df_original <- dhis2_hf_df_original |>
  dplyr::left_join(
    hf_uid_new_map,
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched <- dhis2_hf_to_process |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw)) |>
  dplyr::left_join(
    hf_uid_new_map |>
      dplyr::mutate(hf_dhis2_raw = toupper(hf_dhis2_raw)),
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

knitr::kable(
  # check if it worked
  dhis2_hf_unmatched |>
    dplyr::distinct(hf_dhis2_raw, hf_dhis2) |>
    dplyr::slice_head(n = 10)
)
Output
hf_dhis2_raw hf_dhis2
ANGLICAN DIOCESE CLINIC anglican clinic diocese
BREAKTHROUGH MCHP breakthrough mchp
GBANJA TOWN MCHP gbanja mchp town
GENDA MCHP genda mchp
HAIKEL CHP chp haikel
KAKUA STATIC CHC chc kakua static
KEINDEYELLA MCHP keindeyella mchp
KPAMAJAMA MCHP kpamajama mchp
KPEWAMA MCHP kpewama mchp
LEWAIBU CHP chp lewaibu

To adapt the code:

  • Lines 2–4: Update column names to match your data (hf_mfl_raw, hf_dhis2_raw, hf_mfl, hf_dhis2)
  • Line 3: Modify the distinct() function to use different duplicate removal criteria if needed
  • Lines 6–9: Create dhis2_hf_df_original to preserve all original facilities (this is our base N that will be used in Step 9.2)
  • Lines 33–34: Create dhis2_hf_unmatched for fuzzy matching steps - subsequent steps should use this instead of dhis2_hf_df

We keep the original columns (hf_mfl_raw, hf_dhis2_raw) unchanged and create standardized versions (hf_mfl, hf_dhis2) for matching. This preserves the raw names for later review or manual matching.

Step 5.2: Handle abbreviations in health facility names

Health facility names in both DHIS2 and the MFL may contain abbreviations such as CHC, PHU, or MCHP. If these abbreviations are used inconsistently across or within datasets, standardization can improve matching accuracy.

To improve consistency and match quality, we begin by identifying frequently used abbreviations across both datasets by analyzing word patterns. Rather than expanding these abbreviations into their full forms, which can introduce unnecessary verbosity and noise, we have chosen to apply a predefined dictionary to standardize both datasets to use abbreviations. This approach simplifies name structures, shortens string lengths, and reduces the influence of repetitive generic terms on similarity scores.

We count the number of times each abbreviation appears in the datasets (freq) and define a “real” abbreviation as one that appears more than 2 times.

  • R
  • Python
abbrev_dictionary <-
  dplyr::bind_rows(
    dplyr::select(dhis2_hf_unmatched, hf = hf_dhis2_raw),
    dplyr::select(master_hf_df, hf = hf_mfl_raw)
  ) |>
  tidyr::separate_rows(hf, sep = " ") |>                      # Split into words
  dplyr::filter(hf != "") |>                                  # Remove blanks
  # detect 2-4 uppercase letter patterns
  dplyr::filter(stringr::str_detect(hf, "^[A-Z]{2,4}$")) |>
  dplyr::count(hf, sort = TRUE) |>                            # Count frequencies
  dplyr::rename(word = hf, freq = n) |>                       # Rename for clarity
  dplyr::filter(freq > 2) |>
  as.data.frame()

# check output
abbrev_dictionary
Output
word freq
MCHP 708
CHP 449
CHC 276
DR 15
ARAB 8
AND 6
HOPE 4
NEW 4
OF 4
RED 4

To adapt the code:

  • Line 8: Modify the pattern {2,4} to change the character length range for detected abbreviations (currently 2-4 characters)
  • Line 11: Change the frequency threshold (freq > 2) to match your context: lower it to capture more potential abbreviations or raise it to focus on the most common ones
  • Handle all-caps facility names: If your facility names are entirely in uppercase, this pattern detection may not work as expected. Consider pre-processing to convert names to title case or adjusting the detection pattern

Now that we’ve identified common abbreviations (e.g., MCHP, CHP, CHC), we use a shared dictionary to standardize health facility names across both datasets. Rather than expanding these terms, we ensure they’re used consistently as abbreviations in both DHIS2 and the MFL. Even within the MFL, some names may use abbreviations while others use full terms. Standardization helps align these differences and increases match accuracy.

Limitation: Handling Misspellings and Variations

This abbreviation dictionary approach has limitations when dealing with misspelled or variant terms. For example:

  • If “Community Health Center” is misspelled as “Comunity Health Center”, the dictionary lookup won’t match
  • Regional variations like “Health Centre” vs “Health Center” may not be captured
  • Typos in abbreviations themselves (e.g., CHC vs CHS) won’t be standardized

Mitigation strategies:

  • Use fuzzy matching algorithms in Step 6 to handle remaining mismatches after abbreviation standardization
  • Consider adding common misspellings to your dictionary if they appear frequently
  • Implement additional text cleaning steps (spell checking, variant standardization) before applying the abbreviation dictionary
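
One mitigation can be sketched with base R's approximate matching: flag names that come close to a dictionary phrase but do not contain it exactly, then review those before the replacement step. The helper name flag_misspelled is hypothetical:

```r
# flag likely misspellings of a dictionary phrase (base-R sketch;
# flag_misspelled is a hypothetical helper, not part of the workflow above)
flag_misspelled <- function(names_vec, term, max_dist = 2) {
  near  <- agrepl(term, names_vec, max.distance = max_dist) # approximate hit
  exact <- grepl(term, names_vec, fixed = TRUE)             # exact hit
  names_vec[near & !exact]
}

hfs <- c(
  "comunity health center kissy",  # misspelled: flagged
  "community health center kissy", # exact: not flagged
  "kissy mchp"                     # unrelated: not flagged
)

flag_misspelled(hfs, "community health center")
# expected: "comunity health center kissy"
```

Flagged names can then be corrected or added to the dictionary before applying the exact replacements below.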
  • R
  • Python
# define abbreviation dictionary (all lowercase, because we use the
# standardize column)
abbrev_dict <- c(
  "maternal child health post" = "mchp",
  "community health post" = "chp",
  "community health center" = "chc",
  "urban maternal clinic" = "umi",
  "expanded programme on immunization" = "epi"
)

# apply replacements to MFL names
master_hf_df <- master_hf_df |>
  dplyr::mutate(
    hf_mfl = stringr::str_replace_all(hf_mfl, abbrev_dict)
  )

# apply replacements to DHIS2 names
dhis2_hf_unmatched <- dhis2_hf_unmatched |>
  dplyr::mutate(
    hf_dhis2 = stringr::str_replace_all(hf_dhis2, abbrev_dict)
  )

# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched |>
  dplyr::filter(
    stringr::str_detect(
      hf_dhis2_raw,
      paste0(
        "(?i)Community Health Center|Maternal Child Health Post|",
        "Community Health Post"
      )
    ) |
      stringr::str_detect(hf_dhis2_raw, "CHC|MCHP|CHP")
  ) |>
  dplyr::select(hf_dhis2_raw, hf_dhis2) |>
  head(10)
Output
hf_dhis2_raw hf_dhis2
BREAKTHROUGH MATERNAL CHILD HEALTH POST breakthrough mchp
GBANJA TOWN MCHP gbanja mchp town
GENDA MCHP genda mchp
HAIKEL CHP chp haikel
KAKUA STATIC COMMUNITY HEALTH CENTER chc kakua static
KEINDEYELLA MCHP keindeyella mchp
KPAMAJAMA MCHP kpamajama mchp
KPEWAMA MCHP kpewama mchp
LEWAIBU COMMUNITY HEALTH POST chp lewaibu
LYN MATERNITY MCHP lyn maternity mchp

To adapt the code:

  • Lines 3–9: Modify the abbrev_dict if your data includes other frequent abbreviations
  • Lines 14, 20: Make sure you are using the standardized lowercase columns (hf_mfl and hf_dhis2) created in Step 5.1. Only apply this step after confirming your abbreviations through the frequency checks above

Step 6: Perform Fuzzy Matching on Remaining Unmatched

We now apply fuzzy matching algorithms to the facilities that remain unmatched after stratified geographic matching in Step 4. These represent the most challenging cases where geographic alignment or name similarity couldn’t establish a clear match.

Processing Remaining Unmatched

This step focuses on facilities that couldn't be matched through:

  • Step 4: Interactive stratified geographic matching
  • Step 5: Name standardization and exact matching

These remaining facilities often have geographic misalignments or extreme name variations that require more flexible matching approaches.

In our example, we are aiming to find an MFL match for each remaining unmatched DHIS2 facility. If you are interested in matching in the other direction (identifying a DHIS2 match for each MFL facility), or matching in both directions, you will need to modify and/or repeat the code to match the other way.

Step 6.1: Identify matched and unmatched facilities

We first identify exact matches between DHIS2 and MFL facility names, both before and after text standardization. These exact matches are set aside while we focus on unmatched facilities for fuzzy matching.

We use many-to-one matching, meaning multiple DHIS2 records may link to the same MFL facility if their names are sufficiently similar. This approach helps assess linkage reliability and flags potential naming inconsistencies. It is also possible to see many-to-many detections, for example when the same health facility appears in DHIS2 under different spellings or formats and matches against multiple candidate facilities in the MFL. These cases require closer review to confirm the correct linkage.
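
A toy base-R example (hypothetical facility names) illustrates why many-to-many pairs need review: a name duplicated on both sides multiplies into every combination when joined:

```r
# toy illustration: the same name twice on each side yields 2 x 2 = 4
# candidate rows after a join, each of which needs review
dhis2_toy <- data.frame(
  hf = c("kissy chc", "kissy chc"),
  dhis2_id = c("d1", "d2")
)
mfl_toy <- data.frame(
  hf = c("kissy chc", "kissy chc"),
  mfl_id = c("m1", "m2")
)

pairs <- merge(dhis2_toy, mfl_toy, by = "hf")
nrow(pairs)
# expected: 4
```

This is the same duplication behavior that relationship = "many-to-many" permits in the dplyr joins used below.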

The output from this step is unmatched DHIS2 names that will be processed through fuzzy matching algorithms.

  • R
  • Python
Show the code
# flag for one-to-one enforcement
enforce_one_to_one <- FALSE

# exact matches using raw names
matched_dhis2_raw <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw),
    by = c("hf_dhis2_raw" = "hf_mfl_raw")
  )

# exact matches using standardized names
matched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_dhis2, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw, hf_mfl),
    by = c("hf_dhis2" = "hf_mfl"),
    keep = TRUE
  ) |>
  # exclude facilities already matched via raw names
  dplyr::anti_join(matched_dhis2_raw, by = c("hf_dhis2_raw")) |>
  # label exact matches
  dplyr::mutate(
    final_method = "Matched Without Fuzzy Matching (standardization)",
    score = 100
  )

# only include facilities that were actually matched via geographic
# standardization
# these should be facilities that had geographic corrections applied
matched_dhis2_prepgeoname <-
  dhis2_df_cleaned |>
  dplyr::anti_join(
    dhis2_hf_unmatched,
    by = c("adm0", "adm1", "adm2", "adm3", "hf" = "hf_dhis2_raw")
  ) |>
  dplyr::left_join(
    master_hf_df |>
      dplyr::mutate(hf = toupper(hf)) |>
      dplyr::select(adm0, adm1, adm2, adm3, hf, hf_mfl_raw, hf_mfl),
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  ) |>
  # attach hf_uid_new from the geo-aware id map built earlier
  dplyr::left_join(
    hf_uid_new_map |>
      dplyr::select(adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_uid_new),
    by = c("adm0", "adm1", "adm2", "adm3", "hf" = "hf_dhis2_raw")
  ) |>
  # label matches
  dplyr::mutate(
    final_method = "Interactive Stratified Geographic Matching",
    score = 100
  )

# compute unmatched after standardization
unmatched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(hf_dhis2) |>
  dplyr::anti_join(
    master_hf_df |>
      dplyr::select(hf_mfl),
    by = c("hf_dhis2" = "hf_mfl")
  ) |>
  dplyr::distinct(hf_dhis2)

# collect already matched mfl to enforce one-to-one in candidate pool
used_mfl_stand <- matched_dhis2 |>
  dplyr::pull(hf_mfl) |>
  unique()

use_mfl_prepgeoname <- matched_dhis2_prepgeoname |>
  dplyr::pull(hf_mfl) |>
  unique()

used_mfl <- c(used_mfl_stand, use_mfl_prepgeoname)

# build candidate MFL pool for fuzzy step
candidate_mfl_df <- master_hf_df |>
  dplyr::select(hf_mfl)

# if enforcing one-to-one, drop MFLs already used by exact matches
if (enforce_one_to_one) {
  candidate_mfl_df <- candidate_mfl_df |>
    dplyr::filter(!hf_mfl %in% used_mfl)
}

# summary counts
total_dhis2_hf <- dplyr::n_distinct(dhis2_hf_unmatched$hf_dhis2_raw)
raw_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2_raw$hf_dhis2_raw)
raw_unmatch_dhis2_hf <- total_dhis2_hf - raw_match_dhis2_hf

standardized_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2$hf_dhis2)
standardized_unmatch_dhis2_hf <- total_dhis2_hf - standardized_match_dhis2_hf

total_mfl_hf <- dplyr::n_distinct(master_hf_df$hf_mfl)
candidate_mfl_count <- dplyr::n_distinct(candidate_mfl_df$hf_mfl)

# display summary
cli::cli_h2("Summary of Matching Status")

cli::cli_alert_info(
  "Total DHIS2 facilities: {format(total_dhis2_hf, big.mark = ',')}"
)

cli::cli_alert_success(
  paste0(
    "Matched after standardization: ",
    "{format(standardized_match_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched using raw names: ",
    "{format(raw_unmatch_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched after standardization: ",
    "{format(standardized_unmatch_dhis2_hf, big.mark = ',')}"
  )
)

cli::cli_alert_info(
  paste0(
    "MFL facilities in candidate pool for matching: ",
    "{format(candidate_mfl_count, big.mark = ',')} of ",
    "{format(total_mfl_hf, big.mark = ',')}"
  )
)
Output
── Summary of Matching Status ──
ℹ Total DHIS2 facilities: 268
✔ Matched after standardization: 64
✖ Unmatched using raw names: 268
✖ Unmatched after standardization: 204
ℹ MFL facilities in candidate pool for matching: 1,492 of 1,492

To adapt the code:

  • Line 2: Set enforce_one_to_one to TRUE to prevent reusing an hf_mfl already matched exactly. Set to FALSE to allow one-to-many
  • Lines 15, 18: Replace column names: hf_dhis2 and hf_mfl → standardized name fields; hf_dhis2_raw and hf_mfl_raw → raw name fields
  • Lines 14, 17: Replace dataset names: dhis2_hf_unmatched → your unmatched DHIS2 dataset; master_hf_df → your MFL dataset
  • Lines 61–68: unmatched_dhis2 contains DHIS2 facilities not matched to the MFL and will be passed to the fuzzy matching step
  • Lines 81-83: One-to-one filter. Keep as-is to drop already matched hf_mfl when enforce_one_to_one == TRUE. Remove these lines if you want to always allow reuse

Our standardization process resolved some of the initially unmatched DHIS2 facilities as shown in the output above. This shows the value of text cleaning before applying fuzzy matching. The remaining unmatched DHIS2 facilities are the focus of our fuzzy matching efforts in the next steps.

The output above also reports the DHIS2 facilities that were unmatched at the raw-name stage and the MFL facilities that remain in the candidate pool for fuzzy matching; because enforce_one_to_one is FALSE here, the full MFL pool is retained.

We’ll use unmatched_dhis2 as the input for fuzzy matching. These are the standardized DHIS2 facility names that couldn’t be matched through exact string comparison. The matched_dhis2 dataset contains facility names that matched exactly after standardization. Once fuzzy matching is complete, we’ll append these exact matches to the final results to produce a complete, joined dataset.

Step 6.1: Create the matching grid

Once we identify the unmatched_dhis2 names, the next step is to create a matching grid: the full set of candidate name pairings that we will later evaluate using fuzzy string comparison. By pairing only the unmatched DHIS2 names with the MFL candidate pool, rather than comparing every name in both full datasets against every other, we keep the grid computationally manageable and reduce the chance of introducing false matches.

  • R
  • Python
Show the code
# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities.

# create simple candidate match pool
candidate_match_pool <- tidyr::crossing(
  unmatched_dhis2,
  candidate_mfl_df
)

# create match grid with dhis2 and mfl names
match_grid <- candidate_match_pool

# preview the first 10 candidate pairs
match_grid |>
  dplyr::slice_head(n = 10)
Output
hf_dhis2 hf_mfl
a clinic dr edwin aapdep clinic wind
a clinic dr edwin aberdeen centre hospital women
a clinic dr edwin aberdeen chc coach sea
a clinic dr edwin abernita hospital
a clinic dr edwin acre chc mange
a clinic dr edwin ad bangs chp quarry
a clinic dr edwin adama chc marth memorial
a clinic dr edwin adama mabaima mchp
a clinic dr edwin ado clinic dr wright
a clinic dr edwin adolescent and clinic happy kid

To adapt the code:

  • Lines 5-7: Replace unmatched_dhis2 and candidate_mfl_df with your dataset names if different

The match_grid is a flat list of candidate name pairs. Each row links one unmatched DHIS2 name to a possible MFL match. This is not a fuzzy match result, but the full set of pairings that will be evaluated in Step 6.2 using the string distance algorithms.

At this stage, the grid includes only facility names (hf_dhis2 and hf_mfl) to keep it lightweight. IDs, coordinates, and other metadata are excluded for now and will be reattached after scoring. If identical names appear across locations, the grid can be extended with extra fields for disambiguation. The preview shows the first ten candidate pairs, so it may not include obvious matches, but it helps confirm that the pairing logic is working as intended. This grid is the input for fuzzy matching, where each row will be scored and ranked to identify likely matches.
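If both datasets carry an administrative-area column, one way to extend the grid is to pair names only within the same area. The sketch below is hedged: the adm1 column and toy facility names are illustrative, not part of the datasets used in this chapter.

```r
# toy inputs carrying a district column (adm1 is a hypothetical field)
unmatched_dhis2 <- data.frame(
  hf_dhis2 = c("a clinic dr edwin", "chp haikal"),
  adm1 = c("Western Area", "Bo")
)
candidate_mfl_df <- data.frame(
  hf_mfl = c("ado clinic dr wright", "abernita hospital"),
  adm1 = c("Western Area", "Western Area")
)

# joining on adm1 keeps only within-district candidate pairs,
# shrinking the grid and ruling out cross-district false matches
match_grid_geo <- dplyr::inner_join(unmatched_dhis2, candidate_mfl_df, by = "adm1")
```

Here the Western Area facility pairs with both Western Area MFL candidates, while the Bo facility gets no candidates at all, so a fallback without the geographic restriction may still be needed.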

Step 6.2: Compute similarity scores

We compute name similarity between each unmatched DHIS2 facility and all candidates in the master list using multiple algorithms: Jaro-Winkler, Levenshtein, Qgram, and LCS. Each algorithm captures different types of variation, such as typos, truncation, or word reordering, that may be found in real-world datasets.

This step outputs individual similarity scores from each method, allowing you to flexibly explore, compare, or prioritize specific algorithms depending on your matching goals.
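As a quick sanity check of the normalization used below, base R's utils::adist() (a Levenshtein edit distance) can reproduce the score_lv formula without the stringdist dependency. The pair is taken from this section's sample output.

```r
a <- "chc manjiama"
b <- "chc manjama"

# edit distance: one deletion ("manjiama" -> "manjama")
d <- drop(utils::adist(a, b))

# normalize by the longer string and convert to a 0-100 similarity
len_max <- max(nchar(a), nchar(b))
score_lv <- (1 - d / len_max) * 100

round(score_lv, 1)  # 91.7, matching the score_lv column in the sample output
```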

  • R
  • Python
Show the code
# calculate fuzzy scores
match_grid <- match_grid |>
  dplyr::mutate(
    len_max = pmax(nchar(hf_dhis2), nchar(hf_mfl)),
    score_jw = 1 - stringdist::stringdist(hf_dhis2, hf_mfl, method = "jw"),
    score_lv = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lv") / len_max,
    score_qg = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "qgram") / len_max,
    score_lcs = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lcs") / len_max
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("score_"),
      .fns = ~ ifelse(is.nan(.x) | .x < 0, 0, .x)
    ),
    dplyr::across(
      .cols = dplyr::contains("score_"),
      .fns = ~ .x * 100
    )
  )
Output
hf_dhis2 hf_mfl score_jw score_lv score_qg score_lcs
branda centre hospital medical branda center hospital medical 98.9 93.3 100.0 93.3
clinic health redeemer clinic health redeemers 98.6 95.7 95.7 95.7
kayimba mchp vawahun kayimba mchp vaawahun 96.7 95.2 95.2 95.2
mchp momorimaria mchp momorimaia 97.9 93.8 93.8 93.8
chc manjiama chc manjama 97.2 91.7 91.7 91.7
chp haikal gbongongor mchp 0.0 0.0 0.0 0.0
chp kulia gbongongor mchp 0.0 0.0 0.0 0.0
keindeyella mchp chc mottuo 0.0 0.0 0.0 0.0
keindeyella mchp chc potoru 0.0 0.0 0.0 0.0
keindeyella mchp chp fogbo 0.0 0.0 0.0 0.0

To adapt the code:

  • Lines 4–10: Swap in or out any similarity algorithm based on your matching needs. See stringdist manual, p.23 for available methods

Above is a sample of ten DHIS2–MFL name pairs from the match grid. The first five are strong matches with near-perfect scores across all algorithms, reflecting nearly identical strings. The last five are weak matches with low or zero similarity, often due to unrelated names or different facility types. Mid-range cases, such as kpetema mchp vs clinic gbonkorlenken (not shown above), show partial overlap but remain well below matching thresholds. These patterns confirm that the scoring works as intended and set the stage for building composite similarity scores in the next step, which help reduce dependence on any single metric.

Step 6.3: Create composite scores

To improve fuzzy match reliability, we generate composite similarity metrics that combine multiple string distance algorithms. This helps minimize over-reliance on any single metric and supports a more robust match decision. However, if you have a preferred single metric from the previous step, you may skip generating a composite score and proceed directly to Step 6.4.

We generate composite metrics in two ways:

  • Composite Similarity Score (composite_score): We calculate a simple average across the four similarity algorithms from the previous step (Jaro-Winkler, Levenshtein, Qgram, and LCS) to create a composite_score. This value reflects the overall string similarity between DHIS2 and MFL names.

  • Composite Rank Score (rank_avg): For each DHIS2 facility, we rank the match quality of each candidate MFL name, with a ranking done for each of the four similarity algorithms. Then we average these ranks into a rank_avg. A low rank_avg (e.g., 1) indicates the same MFL name consistently scores as an excellent match across methods, signaling strong consensus and high match confidence. This ensemble ranking helps balance the strengths of different algorithms.

  • R
  • Python
Show the code
# define similarity score columns
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

# compute composite_score dynamically using score_cols
match_grid <- match_grid |>
  dplyr::mutate(
    composite_score = rowMeans(
      dplyr::across(dplyr::all_of(score_cols))
    )
  )

# compute average rank across all similarity methods
ranked_grid <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::mutate(
    dplyr::across(
      dplyr::all_of(score_cols),
      ~ dplyr::min_rank(dplyr::desc(.)),
      .names = "rank_{.col}"
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    rank_avg = rowSums(dplyr::across(
      dplyr::all_of(paste0("rank_", score_cols))
    )),
    rank_avg = round(rank_avg / length(score_cols))
  )

# preview ranked results to show multiple candidates per DHIS2 facility
# select a few facilities and show their top 4 candidates each
sample_facilities <- c(
  "charity clinic kamba of",
  "arab clinic shad",
  "al arab clinic sheefa"
)

ranked_grid |>
  dplyr::filter(hf_dhis2 %in% sample_facilities) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, n = 4, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    dplyr::all_of(score_cols),
    composite_score,
    rank_avg
  ) |>
  dplyr::arrange(hf_dhis2, rank_avg)
Output
hf_dhis2 hf_mfl score_jw score_lv score_qg score_lcs composite_score rank_avg
al arab clinic sheefa arab clinic shell 75.5 71.4 71.4 61.9 70.1 1
al arab clinic sheefa affordable clinic health 77.1 54.2 62.5 37.5 57.8 3
al arab clinic sheefa alami arab clinic 71.1 52.4 61.9 52.4 59.4 4
al arab clinic sheefa arab clinic lunsar 72.7 57.1 57.1 47.6 58.7 4
arab clinic shad arab clinic shell 90.0 82.4 70.6 70.6 78.4 1
arab clinic shad arab clinic 89.6 68.8 68.8 68.8 74.0 2
arab clinic shad arab clinic lunsar 88.4 72.2 66.7 66.7 73.5 3
arab clinic shad arab clinic dwarzak 84.7 68.4 63.2 52.6 67.2 4
charity clinic kamba of city clinic garden 76.2 56.5 43.5 34.8 52.8 10
charity clinic kamba of care clinic life 67.7 47.8 43.5 34.8 48.5 15
charity clinic kamba of clinic hamanda 66.8 43.5 43.5 26.1 45.0 21
charity clinic kamba of caring clinic hands 68.6 56.5 39.1 30.4 48.7 25

To adapt the code:

  • Line 2: If you change the similarity algorithms used (e.g., add or remove metrics), update the score_cols vector to reflect the new set

Step 6.4: Extract best match

To finalize matching, we select the best candidate for each DHIS2 facility. This can be based on any one similarity algorithm or use a blended approach such as a composite score or an average rank across all methods.

The output includes one top match per DHIS2 name for each method, allowing side-by-side comparison to evaluate performance.

  • R
  • Python
Show the code
# for each fuzzy matching method, we select the top-ranked MFL name
# per DHIS2 hf.
best_jw <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_jw, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Jaro-Winkler") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_jw, method
  )

best_lv <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lv, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Levenshtein") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lv, method
  )

best_qg <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_qg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Qgram") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_qg, method
  )

best_lcs <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lcs, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "LCS") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lcs, method
  )

best_comp <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(composite_score, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Composite-Score") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = composite_score, method
  )

best_ranked_match <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Rank-Ensemble") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = rank_avg, method
  )

# combine top matches from all methods for comparison across approaches
all_best <- dplyr::bind_rows(
  best_jw, best_lv, best_qg, best_lcs,
  best_comp, best_ranked_match)

# preview best matches to show extraction results
all_best |>
  dplyr::slice_head(n = 10) |>
  dplyr::select(hf_dhis2, hf_mfl, score, method)
Output
hf_dhis2 hf_mfl score method
a clinic dr edwin aapdep clinic wind 76.07843 Jaro-Winkler
aapdep clinic development in national women aapdep clinic wind 76.03233 Jaro-Winkler
abdulai clinic dr jalloh clinic daru field hospital 71.75214 Jaro-Winkler
adama baimba m mchp adama mabaima mchp 88.75702 Jaro-Winkler
adolescence and clinic east happy kid zone 1 adolescent and clinic happy kid 85.85044 Jaro-Winkler
adra clinic mobile arab clinic 82.15488 Jaro-Winkler
afro arab clinic arab clinic 85.03788 Jaro-Winkler
ahmadiyya calaba hospital muslim town ahmadiyya hospital muslim 80.52252 Jaro-Winkler
ahmadiyya clinic mission ahmadiyya hospital muslim 76.77778 Jaro-Winkler
ahmadiyya hospital mamala muslim yoni ahmadiyya hospital muslim 85.85586 Jaro-Winkler

To adapt the code:

  • Lines 3–56: For each method block, change the score column and method label to match your algorithm
  • Lines 3–56: Remove blocks for methods you did not compute
  • Lines 58–60: Include only the best_* objects you kept in bind_rows()

Step 7: Evaluate Fuzzy Match Quality

High similarity scores can sometimes mask poor-quality matches, particularly when facility names differ in structure. To ensure we’re not misled by high numerical scores alone, we combine similarity metrics with diagnostic checks that examine how names align at the token level.

Step 7.1: Visualize score distributions across methods

Before getting into structural evaluations, it’s useful to inspect how the distribution of match scores varies across algorithms. This helps assess whether some methods consistently assign higher scores, and whether their matches tend to cluster tightly or are more dispersed.

This can be particularly important if you intend to apply score thresholds, or want to understand how lenient or strict a method is in assigning high-confidence scores.

  • R
  • Python
Show the code
# plot score dist across methods
score_dist_plot <- all_best |>
  dplyr::filter(method != "Rank-Ensemble") |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_density(
    fill = "steelblue", color = "steelblue",
    alpha = 0.6, linewidth = 1
  ) +
  ggplot2::facet_wrap(~method) +
  ggplot2::labs(
    title = "Distribution of Fuzzy Matching Scores",
    x = "\nMatch Score (%)",
    y = "Density\n"
  ) +
  ggplot2::theme_minimal(base_size = 14)

score_dist_plot

# save plot
ggplot2::ggsave(
  plot = score_dist_plot,
  filename = here::here("03_output/3a_figures/fuzzy_match_score_distributions.png"),
  width = 12,
  height = 9,
  dpi = 300
)
Output

(Figure: faceted density plots of fuzzy match scores, one panel per method)

To adapt the code:

  • Do not modify anything in this section

Looking at the plots, a tight, high peak near 100 means a method produces many strong matches; a broader or flatter distribution may indicate more uncertainty; and a bimodal pattern may suggest a mix of clearly good and ambiguous matches. These patterns help guide our matching strategy: for our final selection in Step 8, we’ll use an 85% threshold as a starting point and build a fallback system that leans on the most reliable methods first.

Step 7.2: Define match quality diagnostic function

To evaluate the structural quality of a match, we define a simple diagnostic function that compares two names at the token level. It splits each name into individual words (which we will call tokens, since not everything in the name is a word), then does four things:

  1. checks whether the first words align (prefix match)
  2. checks whether the last words align (suffix match)
  3. calculates the difference in word count (token difference)
  4. calculates the difference in character length (character difference)

These simple heuristics can catch mismatches that score high but are structurally off.

  • R
  • Python
assess_match_quality <- function(name1, name2) {
  purrr::map2_dfr(name1, name2, function(a, b) {
    tokens1 <- strsplit(a, "\\s+")[[1]]
    tokens2 <- strsplit(b, "\\s+")[[1]]
    tibble::tibble(
      prefix_match = tolower(tokens1[1]) == tolower(tokens2[1]),
      suffix_match = tolower(tail(tokens1, 1)) == tolower(tail(tokens2, 1)),
      token_diff = abs(length(tokens1) - length(tokens2)),
      char_diff = abs(nchar(a) - nchar(b))
    )
  })
}

assess_match_quality("Makeni Govt Hospital", "Makeni Government Hospital")
Output
# A tibble: 1 × 4
  prefix_match suffix_match token_diff char_diff
  <lgl>        <lgl>             <int>     <int>
1 TRUE         TRUE                  0         6

To adapt the code:

  • Do not modify anything in this section

In this example, using our assess_match_quality function, the first and last words match, so both prefix_match and suffix_match are TRUE. There is no difference in the number of tokens (token_diff = 0), but the second name is six characters longer (char_diff = 6) due to “Govt” vs “Government”.

These diagnostics help flag cases where a fuzzy score may be high but important name elements differ, such as CHC vs MCHP, Clinic vs Hospital, or Primary vs Secondary, which can signal a mismatch worth reviewing.

Step 7.3: Evaluate quality of top fuzzy matches

We now run this diagnostic on our best matches. The goal is to highlight cases where, despite high similarity scores, names may differ significantly in format or structure: for example, mismatched prefixes or suffixes. This step supports manual review or adds a layer of filtering to improve final match accuracy.

We also combine the diagnostics into a structure_score (0–100), giving 30% weight to token difference, 20% to character difference, and 25% each to prefix and suffix matches. This provides a quick summary of overall structural alignment.

  • R
  • Python
Show the code
# assess structural match quality
diagnostics_df <- dplyr::bind_cols(
  all_best,
  assess_match_quality(all_best$hf_dhis2, all_best$hf_mfl)
)

# compare method diagnostics
summary_stats <- diagnostics_df |>
  dplyr::group_by(method) |>
  dplyr::summarise(
    score = mean(score),
    avg_token_diff = mean(token_diff) |> round(2),
    avg_char_diff = mean(char_diff) |> round(2),
    pct_prefix_match = (mean(prefix_match) * 100) |> round(2),
    pct_suffix_match = (mean(suffix_match) * 100) |> round(2),
    total = dplyr::n(),
    .groups = "drop"
  )

# create a final overall score
summary_stats <- summary_stats |>
  dplyr::mutate(
    # rescale negative of average token difference (smaller is better)
    token_score = scales::rescale(-avg_token_diff, to = c(0, 100)),
    # rescale negative of average character difference (smaller is better)
    char_score = scales::rescale(-avg_char_diff, to = c(0, 100)),
    # rescale prefix match percentage (higher is better)
    prefix_score = scales::rescale(pct_prefix_match, to = c(0, 100)),
    # rescale suffix match percentage (higher is better)
    suffix_score = scales::rescale(pct_suffix_match, to = c(0, 100)),

    # combine all four metrics into a weighted structural quality score
    structure_score = round(
      0.3 * token_score +         # Emphasize fewer token differences
        0.2 * char_score +        # Moderate weight on character similarity
        0.25 * prefix_score +     # Give weight to matching initial words
        0.25 * suffix_score,      # Give equal weight to matching final words
      1
    )
  ) |>
  dplyr::arrange(desc(structure_score)) |>
  # assign rank based on descending structure score
  dplyr::mutate(rank = dplyr::row_number()) |>
  dplyr::select(
    method, avg_token_diff, avg_char_diff,
    pct_prefix_match, pct_suffix_match,
    total, structure_score, rank
  )


# check results
summary_stats |>
  dplyr::select(
    Method = method,
    `Avg. Token Difference` = avg_token_diff,
    `Avg. Character Difference` = avg_char_diff,
    `% Prefix Match` = pct_prefix_match,
    `% Suffix Match` = pct_suffix_match,
    `Structural Score` = structure_score,
      Rank = rank
  ) |> as.data.frame()
Output
Method Avg. Token Difference Avg. Character Difference % Prefix Match % Suffix Match Structural Score Rank
Levenshtein 0.43 3.67 50.54 29.35 84.1 1
Rank-Ensemble 0.52 4.24 54.89 27.17 71.0 2
Composite-Score 0.55 4.48 53.80 29.35 68.5 3
Jaro-Winkler 0.54 4.55 58.70 25.54 67.1 4
Qgram 0.36 3.46 32.07 14.13 50.0 5
LCS 0.70 5.48 52.17 29.35 43.9 6
Structure Metric Guide:
    - Avg. Token Difference: Lower is better – penalizes mismatches in word count
    - Avg. Character Difference: Lower is better – penalizes large length gaps
    - % Prefix Match: Higher is better – rewards same starting word
    - % Suffix Match: Higher is better – rewards same ending word
    - Structural Score: Higher is better – summary of structural alignment

To adapt the code:

  • Lines 32–36: You may adjust the weights used to calculate the structure_score if your matching context values certain components (e.g., prefix matches or token length) more than others

These diagnostics show how well each method preserves the structure of matched names in our example health facility name matching exercise. This helps guide which method to trust most and how to build fallback rules when multiple methods are combined. Your context may result in a different fuzzy matching algorithm being preferred.

Method Performance Summaries

  • Levenshtein: Lowest token and character differences, with strong prefix and suffix alignment. Highest structural score (84.1, Rank 1) – best overall performer.

  • Rank-Ensemble: Balanced across metrics, particularly strong on prefix alignment. Score 71.0 (Rank 2) – reliable fallback option.

  • Composite-Score: Consistent across metrics with solid suffix alignment. Score 68.5 (Rank 3) – flexible, general-purpose option.

  • Jaro-Winkler: Strongest prefix alignment but weaker on suffixes. Score 67.1 (Rank 4) – useful when beginnings of names align closely.

  • Qgram: Good at substring similarity, but weaker structurally overall. Score 50.0 (Rank 5) – more suitable for partial overlaps.

  • LCS: Larger token and character gaps despite reasonable suffix matches. Lowest score (43.9, Rank 6) – penalized for structural differences.

Levenshtein was the strongest method, showing low structural differences and good prefix/suffix alignment. Rank-Ensemble and Composite-Score followed as dependable blended options, with Jaro-Winkler close behind on the strength of its prefix alignment. Qgram and LCS were less effective overall, penalized for larger structural gaps despite occasional suffix strengths.

Step 7.4: Weighted thresholding using structural quality and false-positive risk

Now that the structure score is in place, it can be used to calculate additional metrics that strengthen the matching approach. These metrics can feed into a weighted thresholding process, where scores are adjusted by method quality and thresholds are tailored to the false-positive risk of each method. This shifts the threshold setting from a single arbitrary cutoff to an evidence-based process that reflects the strengths and weaknesses of each algorithm, improving precision without sacrificing matching quality.

The outputs from this step serve different purposes depending on your chosen matching strategy in Step 8. If you plan to use a single method or simple composite approach (Options 1-2), you only need the structure score rankings from Step 7.3 to identify the best-performing method. However, if you want to use weighted composite scoring or the fallback loop approach (Options 3-4), you’ll need both the method weights and thresholds calculated in this step.

This step calculates two key metrics for each fuzzy matching method:

1. Method weights: Proportional to structure scores from Step 7.3, rescaled so all weights sum to 1. Methods with better structural quality get higher influence when combining multiple methods in Options 3-4.

2. Method-specific thresholds: Use inverse scaling based on structural quality. A threshold determines the minimum similarity score required for a match (e.g., threshold 85 means only candidates scoring ≥85 are accepted). High-performing methods get lower thresholds (easier matching) because they are reliable, while low-performing methods get higher thresholds (stricter matching) to prevent false positives. For example, Levenshtein might get threshold 75 (trusted method), while LCS gets 95 (requires high confidence).
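The inverse scaling can be illustrated with a small base-R sketch. The structure scores and the fixed 75–95 range below are illustrative; the actual code in this step derives the range from quantiles of the observed match scores.

```r
# structure scores from Step 7.3 (higher = structurally better method)
structure_scores <- c(Levenshtein = 84.1, Qgram = 50.0, LCS = 43.9)

# target threshold range: best method -> 75 (permissive), worst -> 95 (strict)
lo <- 75
hi <- 95

# rescale scores to 0-1, then flip so a higher score maps to a lower threshold
scaled <- (structure_scores - min(structure_scores)) / diff(range(structure_scores))
thresholds <- round(hi - scaled * (hi - lo))

thresholds  # Levenshtein 75, Qgram 92, LCS 95
```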

  • R
  • Python

Note In the code below, the thresholds are created by rescaling each method’s structure_score (0–100) into a new range defined by the 70th and 95th percentiles of all_best$score. The rev() function flips this mapping so that higher structure scores are pushed toward the lower end of the range and lower scores toward the higher end. This means stronger methods receive lower, more permissive thresholds, while weaker methods are assigned higher, stricter thresholds.

# calculate method-specific weights and thresholds
method_threshold <- summary_stats |>
  dplyr::mutate(
    score = structure_score / 100,
    weight = score / sum(score),
    threshold = scales::rescale(
      structure_score,
      to = rev(unname(
        stats::quantile(all_best$score, c(0.70, 0.95), na.rm = TRUE)
      ))
    ) |>
      round()
  ) |>
  dplyr::select(method, weight, threshold)

# display results
method_threshold
Output
Method-Specific Weights and Thresholds
Method Weight Threshold
Levenshtein 0.219 75
Rank-Ensemble 0.185 80
Composite-Score 0.178 81
Jaro-Winkler 0.174 82
Qgram 0.130 88
LCS 0.114 91

To adapt the code:

  • Line 1: Keep summary_stats from Step 7.3 with the structure_score column
  • Lines 5–7: Adjust the quantile range (default: 0.70 to 0.95) to control threshold spread
  • Line 10: Modify column selection if you need additional metrics in the output

This table shows each method’s relative weight, scaled so all weights sum to one, and its method-specific score threshold derived from observed performance. The relationship between structural quality, weights, and thresholds follows this pattern:

Method Weight Threshold Interpretation
Levenshtein 0.219 75 Highest-performing method: gets most influence in composite scoring and lowest threshold for matching
Rank-Ensemble 0.185 80 Strong performer: high influence and low threshold
Composite-Score 0.178 81 Good performer: moderate-high influence and low-medium threshold
Jaro-Winkler 0.174 82 Balanced method: moderate influence and medium threshold
Qgram 0.130 88 Lower performer: reduced influence and higher threshold to prevent false positives
LCS 0.114 91 Lowest performer: minimal influence and strictest threshold required

Key insight: Methods with better structural alignment get rewarded with higher weights (more influence) and lower thresholds (easier matching), while unreliable methods are constrained with lower weights and stricter thresholds.

Step 8: Finalize Fuzzy Match Selection

This step assigns a final match to each DHIS2 facility by choosing the most appropriate candidate from the fuzzy matching results. At this stage, every unmatched facility from earlier steps should have a set of one or more candidate matches with similarity scores and any supporting diagnostics (e.g., prefix/suffix matches, token differences). The goal is to select one match per facility in a way that balances accuracy, efficiency, and coverage.

The choice of final selection strategy depends on:

  • Confidence in your matching methods: If your algorithms and thresholds are already well-tuned to the data, you can use a simpler approach.
  • Tolerance for false positives: In some contexts, a slightly lower match rate is acceptable if it reduces the risk of incorrect matches.
  • Analytical and operational priorities: For example, whether you need a quick, explainable method for an urgent report, or a more comprehensive, iterative process for a master dataset update.

We offer four structured options, each with its own strengths, weaknesses, and level of complexity. In practice, you may start with a simpler option and progress to more advanced approaches if too many facilities remain unmatched or if match quality is inconsistent. These options are also compatible with an SNT team review step, allowing human validation before matches are finalized in the dataset.

Summary of Match Selection Options
Option Description Use Case Code Complexity
Option 1: Single Method Use one preferred method (e.g., Levenshtein) with a threshold When confident in a single algorithm Low
Option 2: Composite/Rank Use composite score or rank average to select best match When using blended match signals Moderate
Option 3: Weighted Composite Score Use performance-weighted composite score that accounts for method reliability and false-positive rates When using blended match signals with known method performance differences Moderate
Option 4: Fallback Loop Stepwise fallback through all methods based on structural quality scores When coverage is most important High

Step 8.1: Match selection options

Option 1: Single Preferred Method

This approach uses one similarity algorithm, such as Levenshtein or Jaro-Winkler, to select the highest-scoring match for each facility. The analyst applies a fixed threshold so that scores above the cut-off are accepted as matches, while the system flags lower scores for review. It is fast, easy to explain, and works well when the chosen method aligns with the types of name differences in your data.

  • R
  • Python
Show the code
# apply single-method threshold and flag matches
best_lv_final <- best_lv |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review"),
    final_method = "Levenshtein"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(hf_dhis2, hf_mfl, hf_mfl_raw, score, final_method, match_flag)

# show top 5 and bottom 5 matches by score
top_5 <- best_lv_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- best_lv_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Levenshtein Score")

# preview results
top_bottom
Output
Top 5 and Bottom 5 Matches by Levenshtein Score: 13 out of 204 facilities matched (191 need manual review)
hf_dhis2 hf_mfl hf_mfl_raw score final_method match_flag
borongoh makarankay mchp borongor makarankay mchp Borongor Makarankay MCHP 95.83333 Levenshtein match
clinic health redeemer clinic health redeemers Health Redeemers Clinic 95.65217 Levenshtein match
kayimba mchp vawahun kayimba mchp vaawahun Vaawahun Kayimba MCHP 95.23810 Levenshtein match
mchp momorimaria mchp momorimaia Momorimaia MCHP 93.75000 Levenshtein match
branda centre hospital medical branda center hospital medical Branda Medical Center Hospital 93.33333 Levenshtein match
clinic college evangelical of theology center clinic evangelical health lutheran mark st St. Mark Evangelical Lutheran Health Center Clinic 40.81633 Levenshtein review
aid city clinic direct kenema orphanage and care clinic hand heart Heart And Hand Care Clinic 41.02564 Levenshtein review
aapdep clinic development in national women centre clinic correctional makeni Makeni Correctional Centre Clinic 41.86047 Levenshtein review
community hospital rural waterloo government hospital macauley street Macauley Street Government Hospital 42.85714 Levenshtein review
clinic college margai milton mmcet clinic countries mano river Mano River Countries Clinic 44.11765 Levenshtein review

To adapt the code:

  • Line 2: Replace best_lv with another object from Step 6.4 (e.g., best_jw, best_qg) depending on your preferred method
  • Line 4: Adjust the score threshold (default = 85) to define what is considered a match
  • Line 6: Keep column names hf_dhis2 and hf_mfl, or replace them with your own

In our example dataset, the Levenshtein method with an 85% threshold achieved the match rate shown in the output above. This conservative approach prioritizes precision over recall, making it suitable when false positives are costly and manual review resources are available.


Option 2: Composite or Rank-Based Match

This approach selects the best match using a combined measure from multiple algorithms, such as a composite score or average rank. It avoids relying on a single method while keeping the process simpler than a full fallback loop. This works well when different algorithms capture different types of variation, and a single combined score can be used with a threshold to separate confident matches from those needing review.

  • R
  • Python
Show the code
# select best match using rank average
composite_final <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(rank_avg) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(composite_score >= 85, "match", "review")
  ) |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    composite_score,
    rank_avg,
    match_flag
  )

# show top 5 and bottom 5 matches by composite score
top_5 <- composite_final |>
  dplyr::arrange(dplyr::desc(composite_score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- composite_final |>
  dplyr::arrange(composite_score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Composite Score")

# preview results
top_bottom
Output
Top 5 and Bottom 5 Matches by Composite Score: 14 out of 204 facilities matched (190 need manual review)
hf_dhis2 hf_mfl hf_mfl_raw score final_method match_flag
branda centre hospital medical branda center hospital medical Branda Medical Center Hospital 96.38889 Composite-Score match
clinic health redeemer clinic health redeemers Health Redeemers Clinic 96.37681 Composite-Score match
kayimba mchp vawahun kayimba mchp vaawahun Vaawahun Kayimba MCHP 95.61508 Composite-Score match
mchp momorimaria mchp momorimaia Momorimaia MCHP 94.79167 Composite-Score match
chc manjiama chc manjama Manjama CHC 93.05556 Composite-Score match
chp city cross kenema red chc kpetema Kpetema CHC 44.43939 Composite-Score review
clinic college evangelical of theology clinic global tasley Tasley Global Clinic 44.90253 Composite-Score review
community foundation health hospital mile 91 choithram hospital memorial Choithram Memorial Hospital 44.95847 Composite-Score review
and chc family musa bay chc s susan Susan’s Bay CHC 45.97222 Composite-Score review
bai clinic ebk ernest koroma pl university borah clinic maternity Borah Maternity Clinic 46.18687 Composite-Score review

To adapt the code:

  • Line 2: Keep ranked_grid from Step 6.4, or replace with your combined results dataframe
  • Line 7: Adjust the composite score threshold (default = 85) to define what is considered a match
  • Lines 9–14: Keep score columns (composite_score, rank_avg) or replace with your scoring metrics

The composite score approach results in the match rate shown in the output above, reflecting the averaging effect whereby strong matches in one method can be diluted by weaker scores in others. It is best suited to datasets with consistent naming patterns across methods.


Option 3: Weighted Composite Score

This approach selects the best match using a performance-weighted composite score that accounts for method reliability and false-positive rates. Instead of treating all methods equally, it uses the weights calculated in Step 7.4 to emphasize methods that demonstrated better structural alignment. This provides a more sophisticated alternative to simple averaging while being less complex than a full fallback system.

The weighted composite score uses a weighted average that maintains the 0-100 scale: Methods with higher structural quality scores get more weight in the final composite. For example, if Levenshtein has weight 0.4 and Qgram has weight 0.1, the Levenshtein score contributes 4x more to the final weighted score.
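To see why the scale is preserved, here is a quick Python sketch with made-up scores and weights (the actual weights come from Step 7.4):

```python
# hypothetical per-method scores and weights; with Levenshtein at 0.4
# and Qgram at 0.1, Levenshtein has 4x the influence on the result
scores = [90.0, 70.0, 60.0, 80.0]   # e.g. lv, qg, jw, lcs (made-up values)
weights = [0.4, 0.1, 0.2, 0.3]      # sum to 1, so the result stays on 0-100

weighted = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(round(weighted, 2))  # 79.0 — still between min and max of the scores
```

Because the division by the weight total normalizes the result, the weighted composite can be compared against the same 85-point threshold used elsewhere.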

  • R
  • Python
Show the code
# define score columns and extract weights
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

weights <- method_threshold |>
  dplyr::filter(
    method %in% c("Jaro-Winkler", "Levenshtein", "Qgram", "LCS")
  ) |>
  dplyr::pull(weight)

# calculate weighted composite score for each facility-match pair
weighted_final <- ranked_grid |>
  dplyr::rowwise() |>
  dplyr::mutate(
    # weighted average maintains 0-100 scale
    weighted_composite_score = stats::weighted.mean(
      c(score_jw, score_lv, score_qg, score_lcs),
      weights,
      na.rm = TRUE
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(dplyr::desc(weighted_composite_score)) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(
      weighted_composite_score >= 85,
      "match",
      "review"
    ),
    score = weighted_composite_score, # standardize column name for Step 7
    final_method = "Weighted-Composite"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    hf_mfl_raw,
    score,
    final_method,
    match_flag
  )

# show top 5 and bottom 5 matches by weighted composite score
top_5 <- weighted_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- weighted_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Weighted Composite Score")

# preview results
top_bottom
Output
Top 5 and Bottom 5 Matches by Weighted Composite Score: 19 out of 204 facilities matched (185 need manual review)
hf_dhis2 hf_mfl hf_mfl_raw score final_method match_flag
clinic health redeemer clinic health redeemers Health Redeemers Clinic 96.64674 Weighted-Composite match
branda centre hospital medical branda center hospital medical Branda Medical Center Hospital 96.59957 Weighted-Composite match
kayimba mchp vawahun kayimba mchp vaawahun Vaawahun Kayimba MCHP 95.75551 Weighted-Composite match
mchp momorimaria mchp momorimaia Momorimaia MCHP 95.17969 Weighted-Composite match
chc manjiama chc manjama Manjama CHC 93.57292 Weighted-Composite match
clinic college evangelical of theology clinic global tasley Tasley Global Clinic 48.71675 Weighted-Composite review
community hospital rural waterloo choithram hospital memorial Choithram Memorial Hospital 49.77243 Weighted-Composite review
aapdep clinic development in national women aapdep clinic wind WIND AAPDEP Clinic 50.64622 Weighted-Composite review
clinic college margai milton mmcet clinic general modia Modia General Clinic 50.89632 Weighted-Composite review
community foundation health hospital mile 91 clinic community health methodist Methodist Community Health Clinic 51.16951 Weighted-Composite review

To adapt the code:

  • Line 2: Adjust score_cols if you have different similarity score columns in your dataset
  • Lines 4–8: Ensure method_threshold from Step 7.4 contains weights for the methods you’re using
  • Lines 15–18: The weighted.mean() function calculates a weighted average that maintains the 0-100 scale
  • Lines 24–27: Adjust the weighted composite score threshold (default = 85) to define what is considered a match
  • Lines 36–41: Select columns to include in the final output

The weighted composite approach achieved the improved match rate shown in the output above. This improvement over the simple composite demonstrates the value of emphasizing high-quality methods. The weighting balances the strengths of multiple models while prioritizing the most reliable algorithms.


Option 4: Fallback Loop Across Methods (Recommended for Coverage)

This approach maximizes match coverage through a sophisticated tiered fallback system that dynamically uses the method ordering and thresholds calculated in Steps 7.3 and 7.4. Methods are applied in descending order of their structure scores, ensuring high-precision algorithms get the first chance to match.

A key strength is the use of method-specific thresholds from Step 7.4. Each method uses its own calculated threshold based on its structural quality and false-positive risk profile. This evidence-based approach is more reliable than arbitrary cutoffs.

The method order and thresholds are dynamically determined from the structural quality analysis in Steps 7.3-7.4. Methods with higher structure scores are prioritized, and each uses its calculated threshold based on false-positive risk:

  • High-quality methods (e.g., Levenshtein): Lower thresholds due to reliable matching
  • Medium-quality methods (e.g., Jaro-Winkler): Moderate thresholds for balanced precision
  • Lower-quality methods (e.g., Qgram, LCS): Higher thresholds to prevent false positives

The fallback process works as follows:

  1. Method ordering: Methods are ordered by their structure scores from Step 7.3 (highest to lowest)
  2. Tie-breaking: Within each method, candidates are pre-sorted by:
    • Similarity score (descending) - highest scores first
    • Average rank from ensemble methods (ascending) - best ranks as the secondary criterion. This ensures that when multiple candidates exceed the threshold, the best match is selected
  3. Progressive matching: Each method attempts to match remaining facilities using its calculated threshold from Step 7.4
  4. Threshold relaxation: If facilities remain unmatched after a full pass, thresholds are reduced by 1 point and the cycle repeats
  5. Quality floor: Process stops when all facilities are matched or thresholds drop below 50

This dynamic approach with robust tie-breaking ensures the best possible match for each facility while maintaining quality standards based on empirical performance metrics.
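Stripped of the dplyr bookkeeping, the loop can be distilled into a short Python sketch (toy candidates, thresholds, and facility names — purely illustrative, not the workflow's actual objects):

```python
# method -> candidate triples (facility, candidate, score),
# pre-sorted by score descending within each method (hypothetical values)
candidates = {
    "Levenshtein":  [("hf1", "a", 90.0), ("hf2", "b", 70.0)],
    "Jaro-Winkler": [("hf2", "c", 82.0), ("hf3", "d", 60.0)],
}
method_order = ["Levenshtein", "Jaro-Winkler"]        # by structure score
thresholds = {"Levenshtein": 85.0, "Jaro-Winkler": 80.0}  # method-specific

matched = {}
unmatched = {"hf1", "hf2", "hf3"}

while unmatched:
    # progressive matching: each method gets a pass at remaining facilities
    for m in method_order:
        for fac, cand, score in candidates[m]:
            if fac in unmatched and score >= thresholds[m]:
                matched[fac] = (cand, score, m)
                unmatched.discard(fac)
    if not unmatched:
        break
    # threshold relaxation: reduce by 1 point per full unmatched pass
    thresholds = {m: t - 1 for m, t in thresholds.items()}
    if any(t <= 50 for t in thresholds.values()):
        break  # quality floor reached

print(matched)
```

Here `hf1` and `hf2` match on the first pass, while `hf3` only matches after the Jaro-Winkler threshold has relaxed down to its score of 60 — exactly the coverage-versus-quality trade the full R implementation below makes.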

  • R
  • Python
Show the code
# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats <- summary_stats |>
  dplyr::arrange(dplyr::desc(structure_score)) |>
  dplyr::filter(method != "Rank-Ensemble") # exclude rank-based method

method_order <- method_stats$method

# use thresholds calculated in step 7.4
method_thresholds_df <- method_threshold |>
  dplyr::filter(method != "Rank-Ensemble")

# convert to named vector for easy lookup
method_thresholds <- setNames(
  method_thresholds_df$threshold,
  method_thresholds_df$method
)

# build combined table of top matches from real methods only
all_best_nonrank <- dplyr::bind_rows(
  best_lv,
  best_jw,
  best_qg,
  best_lcs,
  best_comp
) |>
  # normalize method labels
  dplyr::mutate(method = trimws(method))

# add rank once
rank_key <- ranked_grid |>
  dplyr::select(hf_dhis2, hf_mfl, rank_avg)

all_best_ranked <- all_best_nonrank |>
  dplyr::left_join(rank_key, by = c("hf_dhis2", "hf_mfl")) |>
  dplyr::mutate(
    # worst for missing ranks
    rank_avg = dplyr::if_else(is.na(rank_avg), Inf, rank_avg)
  )

# split by method and pre-arrange for stable tiebreak
by_method <- split(all_best_ranked, all_best_ranked$method) |>
  purrr::map(\(x) {
    x |>
      dplyr::arrange(dplyr::desc(score), rank_avg)
  })

# keep copy for pass-rate check
initial_thr <- method_thresholds

# init
fallback_chunks <- list()
unmatched <- unique(all_best_ranked$hf_dhis2)

# fast fallback loop (no joins, no grouping inside)
repeat {
  for (m in method_order) {
    if (!m %in% names(by_method)) {
      next
    }

    # current threshold from step 7.4
    thr <- method_thresholds[[m]]
    if (is.null(thr)) {
      thr <- 85 # fallback if method not found
    }

    # filter by score and still-unmatched
    cand <- by_method[[m]] |>
      dplyr::filter(score >= thr, hf_dhis2 %in% unmatched)

    if (nrow(cand) > 0) {
      # pick first row per hf_dhis2 (pre-arranged by score desc, rank asc)
      best_per_hf <- cand |>
        dplyr::distinct(hf_dhis2, .keep_all = TRUE) |>
        dplyr::mutate(
          final_method = glue::glue("Fuzzy-matched using: {m}")
        )

      # store and update unmatched
      fallback_chunks[[length(fallback_chunks) + 1L]] <- best_per_hf
      unmatched <- setdiff(unmatched, best_per_hf$hf_dhis2)
    }

    if (length(unmatched) == 0) break
  }

  if (length(unmatched) == 0) {
    break
  }

  # relax thresholds gradually
  method_thresholds <- method_thresholds - 1
  if (any(method_thresholds <= 50)) break # stop if too low
}

# bind once
fallback_matched <- dplyr::bind_rows(fallback_chunks)

# build a 1:1 map from standardized DHIS2 name to admin + raw +
# hf_uid_new
hf_uid_new_map_by_stand <- dhis2_hf_unmatched |>
  dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::summarise(
    adm0 = dplyr::first(adm0),
    adm1 = dplyr::first(adm1),
    adm2 = dplyr::first(adm2),
    adm3 = dplyr::first(adm3),
    hf_dhis2_raw = dplyr::first(hf_dhis2_raw),
    hf_uid_new = dplyr::first(hf_uid_new),
    .groups = "drop"
  )

# final flags and attach admin + ids
fallback_final <- fallback_matched |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review")
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::left_join(hf_uid_new_map_by_stand, by = "hf_dhis2") |>
  dplyr::select(
    adm0, adm1, adm2, adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    match_flag
  )

top_5 <- fallback_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- fallback_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Fallback")

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

# preview results
top_bottom
Output
Top 5 and Bottom 5 Matches by Fallback: 47 out of 204 facilities matched (157 need manual review)
adm0 adm1 adm2 adm3 hf_uid_new hf_dhis2_raw hf_dhis2 hf_mfl_raw hf_mfl score final_method match_flag
SIERRA LEONE NORTH WEST KARENE GBANTI_KARENE hf_uid_new::817 BORONGOH MAKARANKAY MCHP borongoh makarankay mchp Borongor Makarankay MCHP borongor makarankay mchp 95.83333 Fuzzy-matched using: Levenshtein match
SIERRA LEONE WESTERN WESTERN URBAN WEST II hf_uid_new::1723 REDEEMER HEALTH CLINIC clinic health redeemer Health Redeemers Clinic clinic health redeemers 95.65217 Fuzzy-matched using: Levenshtein match
SIERRA LEONE SOUTHERN PUJEHUN PANGA hf_uid_new::1440 VAWAHUN KAYIMBA MCHP kayimba mchp vawahun Vaawahun Kayimba MCHP kayimba mchp vaawahun 95.23810 Fuzzy-matched using: Levenshtein match
SIERRA LEONE WESTERN WESTERN URBAN EAST I hf_uid_new::1618 KARGBO DOCKYARD CHP chp dockyard kargbo Kargbo Dockyard MCHP dockyard kargbo mchp 95.00000 Fuzzy-matched using: Qgram match
SIERRA LEONE NORTH WEST KARENE GBANTI_KARENE hf_uid_new::820 GBINTI MARIA MCHP gbinti maria mchp Gbinti Maria CHP chp gbinti maria 94.11765 Fuzzy-matched using: Qgram match
SIERRA LEONE SOUTHERN BONTHE IMPERRI hf_uid_new::1196 MOUNT HOPE HOSPITAL hope hospital mount Makeni Govt. Hospital govt hospital makeni 65.00000 Fuzzy-matched using: Levenshtein review
SIERRA LEONE NORTH EAST TONKOLILI YONI MAMAILA hf_uid_new::713 COMMUNITY HEALTH FOUNDATION (MILE 91) HOSPITAL community foundation health hospital mile 91 Methodist Community Health Clinic clinic community health methodist 70.25253 Fuzzy-matched using: Jaro-Winkler review
SIERRA LEONE EASTERN KENEMA KENEMA CITY hf_uid_new::140 DIRECT AID ORPHANAGE (KENEMA CITY) CLINIC aid city clinic direct kenema orphanage Heart And Hand Care Clinic and care clinic hand heart 70.72650 Fuzzy-matched using: Jaro-Winkler review
SIERRA LEONE WESTERN WESTERN URBAN WEST II hf_uid_new::1706 AMI EXPEDITIONARY HEALTHCARE CLINIC ami clinic expeditionary healthcare Sierra Leone-China Teaching Hospital china hospital leone sierra teaching 71.04497 Fuzzy-matched using: Jaro-Winkler review
SIERRA LEONE WESTERN WESTERN URBAN WEST I hf_uid_new::1696 DR CLAUDIUS COLE CLINIC claudius clinic cole dr Our Lady Of Guadalupe Clinic clinic guadalupe lady of our 71.20791 Fuzzy-matched using: Jaro-Winkler review

To adapt the code:

  • Lines 2–8: The method order is automatically determined from summary_stats structure scores
  • Lines 9–16: Thresholds are extracted from Step 7.4’s method_threshold dataframe
  • Line 65: The fallback threshold (default = 85) is used only if a method lacks a calculated threshold
  • Line 94: The minimum threshold (default = 50) prevents accepting very low-quality matches
  • Lines 33 and 36 (and similar lines throughout): Replace hf_dhis2 and hf_mfl with your dataset’s column names if they differ

The fallback loop achieved the highest match rate as shown in the output above. This substantial improvement over other methods demonstrates the effectiveness of method-specific thresholds and progressive fallback. The approach maximizes coverage while maintaining quality through data-driven thresholds and intelligent tie-breaking.

This is the approach that we will use going forward, as it offers the best balance between coverage and match quality, reduces the manual review workload, and adapts thresholds dynamically to the strengths and weaknesses of each method.


Step 8.2: Select fuzzy matching approach

After reviewing the different fuzzy matching options from Step 8.1, select which approach to use in your final workflow. This step creates the fuzzy_matches dataset that will be combined with other matching methods in Step 9.

Choosing Your Approach

Select one approach based on your evaluation in Step 8.1:

  • Use a single algorithm (Levenshtein, Jaro-Winkler, or Q-gram) for simplicity and interpretability
  • Use composite scores when you want to balance multiple algorithms
  • Use the weighted approach when domain expertise suggests certain algorithms are more reliable for your data
  • Use the fallback approach for the most comprehensive matching with hierarchical scoring

  • R
  • Python
Show the code
# create fuzzy_matches from step 8 results (use whichever option was chosen)
# step 8 produces different result datasets depending on which option was chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final

fuzzy_matches <- (if (exists("fallback_final")) {
  dplyr::filter(fallback_final, match_flag == "match")
} else if (exists("composite_final")) {
  dplyr::filter(composite_final, match_flag == "match")
} else if (exists("weighted_final")) {
  dplyr::filter(weighted_final, match_flag == "match")
} else if (exists("best_lv_final")) {
  dplyr::filter(best_lv_final, match_flag == "match")
} else if (exists("best_jw_final")) {
  dplyr::filter(best_jw_final, match_flag == "match")
} else if (exists("best_qg_final")) {
  dplyr::filter(best_qg_final, match_flag == "match")
} else {
  # empty dataframe with expected structure if no step 8 results exist
  data.frame(
    hf_dhis2_raw = character(0),
    hf_dhis2 = character(0),
    hf_mfl_raw = character(0),
    hf_mfl = character(0),
    score = numeric(0),
    final_method = character(0),
    stringsAsFactors = FALSE
  )
}) |>
  # note: the parentheses around the if-chain make the pipe apply to
  # whichever branch is selected, not just the final else branch
  # remove any internal duplicates (keep highest scoring match per
  # facility)
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  # attach admin + hf_uid_new from unmatched (collapse to 1:1 per
  # hf_dhis2_raw); the native pipe requires an anonymous function here
  (\(matches) {
    hf_uid_new_map_unmatched <- dhis2_hf_unmatched |>
      dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
      dplyr::group_by(hf_dhis2_raw) |>
      dplyr::summarise(
        adm0 = dplyr::first(adm0),
        adm1 = dplyr::first(adm1),
        adm2 = dplyr::first(adm2),
        adm3 = dplyr::first(adm3),
        hf_uid_new = dplyr::first(hf_uid_new),
        .groups = "drop"
      )
    dplyr::left_join(matches, hf_uid_new_map_unmatched, by = "hf_dhis2_raw")
  })()

# display summary of selected approach
cli::cli_h2("Selected Fuzzy Matching Results")
cli::cli_alert_success("Fuzzy matches found: {nrow(fuzzy_matches)}")

if (nrow(fuzzy_matches) > 0) {
  score_summary <- fuzzy_matches |>
    dplyr::summarise(
      avg_score = mean(score, na.rm = TRUE),
      min_score = min(score, na.rm = TRUE),
      max_score = max(score, na.rm = TRUE),
      .groups = "drop"
    )

  cli::cli_alert_info("Score range: {round(score_summary$min_score, 1)} - {round(score_summary$max_score, 1)}")
  cli::cli_alert_info("Average score: {round(score_summary$avg_score, 1)}")
}
Output
── Selected Fuzzy Matching Results ──
✔ Fuzzy matches found: 47
ℹ Score range: 85 - 95.8
ℹ Average score: 88.6

To adapt the code:

  • The conditional logic checks for the different Step 8 result datasets in priority order. Modify the order if you prefer a different approach to take precedence
  • An empty dataframe with the expected structure is created if no Step 8 results exist. Ensure its column names match your expected structure
  • Deduplication removes internal duplicates within the selected approach, keeping the highest scoring match per DHIS2 facility. Modify if you need different deduplication rules
  • Alternative approach: Instead of the conditional logic, you can directly assign your chosen dataset, e.g. fuzzy_matches <- dplyr::filter(fallback_final, match_flag == "match")

Step 8.3: Manual review of flagged matches

Before finalizing matches and joining back to your main datasets, all records flagged as review must be manually validated or rejected by the SNT team or designated reviewers. This step ensures data quality and prevents false positives from entering the analysis.

Consult with SNT team

Manual review is required for all review matches before proceeding to Step 9. The SNT team should confirm or reject uncertain matches and document patterns in false positives or missed matches to refine future matching workflows.

Preparing Data for Manual Review

Below, we export review matches to a structured file for systematic checking:

  • R
  • Python
Show the code
# prepare review dataset (using fallback results as example)
review_matches <- fallback_final |>
  dplyr::filter(match_flag == "review") |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::mutate(
    reviewer_decision = NA_character_, # accept/reject/uncertain
    reviewer_notes = NA_character_,
    review_date = NA_character_,
    reviewer_name = NA_character_
  ) |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    reviewer_decision,
    reviewer_notes,
    review_date,
    reviewer_name
  )

# export for review
readr::write_csv(
  review_matches,
  here::here("03_outputs", "tables", "facility_matches_for_review.csv")
)

# show summary
cli::cli_h2("Manual Review Required")
cli::cli_alert_info(
  "Exported {nrow(review_matches)} matches flagged for review"
)
cli::cli_alert_info(
  "File location: 03_outputs/tables/facility_matches_for_review.csv"
)
Output
── Manual Review Required ──
ℹ Exported 157 matches flagged for review
Sample matches requiring manual review (ordered by score)
hf_dhis2 hf_mfl score final_method
arab clinic waterloo arab clinic 85.00000 Fuzzy-matched using: Jaro-Winkler
bandama galliness mchp bandama mchp 84.84848 Fuzzy-matched using: Jaro-Winkler
mchp veronical chp venima 84.76190 Fuzzy-matched using: Jaro-Winkler
clinic marie stopes street waterloo clinic marie stopes 84.76190 Fuzzy-matched using: Jaro-Winkler
chp tuakan mchp tawahun 84.62963 Fuzzy-matched using: Jaro-Winkler

To adapt the code:

  • Lines 6–12: Add additional columns for your review process (e.g., facility_type, district)
  • Line 15: Adjust the output file path and name as needed
Review Process Guidelines

The manual review step is more than a simple yes/no decision. It is an opportunity to strengthen the matching workflow by confirming accurate links, rejecting false positives, and documenting recurring data quality issues. Reviewers play a key role in ensuring the integrity of the final dataset and in improving the process for future runs.

For Reviewers:

  1. Accept – Confirm that both names refer to the same facility. Look for strong evidence of equivalence, such as:
    • Minor spelling variations: "Makeni Gov Hospital" ↔︎ "Makeni Government Hospital"
    • Common or consistent abbreviations: Bo CHC ↔︎ Bo Community Health Center
    • Small formatting differences that do not alter meaning
  2. Reject – Flag cases where facilities differ or where the match is too uncertain. Common examples include:
    • Different facility types in the same area: "Makeni Hospital" ↔︎ "Makeni Clinic"
    • Generic names without geographic qualifiers: Community Health Post without a location
    • Multiple possible matches in the same vicinity without clear identifiers
  3. Document patterns – Use the review to capture recurring issues. These insights help refine thresholds, weights, and matching rules. Examples:
    • Frequent abbreviation mismatches (e.g., CHC vs CH Centre)
    • Regional or language-specific naming conventions
    • Missing or inconsistent administrative unit data in source systems

A systematic approach to review ensures that accepted matches are trustworthy, rejected matches are well-justified, and lessons learned feed directly into improving the next matching cycle.

Simulated Review Decisions

For demonstration purposes, we simulate reviewer decisions on the flagged matches. In practice, this would be done by domain experts reviewing the exported CSV file.

After manual review is complete, we import the data that has all the decisions.
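The simulation itself is not shown in the import code below; as a rough illustration of what it could produce, here is a Python sketch on toy rows (facility names are taken from the review sample above, decisions are random and purely for demonstration):

```python
import numpy as np
import pandas as pd

# toy stand-in for the exported review file (hypothetical rows)
review = pd.DataFrame({
    "hf_dhis2": ["arab clinic waterloo", "chp tuakan"],
    "hf_mfl": ["arab clinic", "mchp tawahun"],
    "score": [85.0, 84.6],
})

# simulate reviewer decisions: accept a small fraction, reject the rest
rng = np.random.default_rng(42)
review["reviewer_decision"] = rng.choice(
    ["accept", "reject"], size=len(review), p=[0.05, 0.95]
)
review["reviewer_notes"] = "simulated decision"
review["review_date"] = "2024-01-01"
review["reviewer_name"] = "simulated"

# in the real workflow, this table would be written out as
# 03_outputs/tables/facility_matches_reviewed.csv for the import step
```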

  • R
  • Python
Show the code
# read back reviewed matches (here the file contains simulated
# reviewer decisions; replace with your actual reviewed data)
reviewed_matches <- readr::read_csv(
  here::here("03_outputs", "tables", "facility_matches_reviewed.csv"),
  col_types = readr::cols(
    reviewer_decision = readr::col_character(),
    reviewer_notes = readr::col_character(),
    review_date = readr::col_character(),
    reviewer_name = readr::col_character()
  )
)

To adapt the code:

  • Line 3: Update the file path to your project’s output directory
  • Lines 4–8: Add additional column types if your review file includes extra fields
Incorporating Review Results

Now we integrate the reviewer decisions back into your matching workflow:

  • R
  • Python
Show the code
final_incorp <- reviewed_matches |>
  dplyr::mutate(
    final_match_flag = dplyr::case_when(
      reviewer_decision == "accept" ~ "match",
      reviewer_decision == "reject" ~ "no_match",
      is.na(reviewer_decision) ~ "pending_review",
      TRUE ~ "no_match"
    )
  )

# get the final matched ones
final_matches <- final_incorp |>
  dplyr::filter(final_match_flag == "match") |>
  dplyr::select(-reviewer_decision)

# summary of all results (including rejected/pending)
all_results_summary <- final_incorp |>
  dplyr::count(final_match_flag, name = "n") |>
  dplyr::mutate(percentage = round(100 * n / sum(n), 1))

# summary of final accepted matches only
final_summary <- final_matches |>
  dplyr::count(final_match_flag, name = "n")

cli::cli_h2("Final Matching Results After Manual Review")
cli::cli_alert_success(
  "Accepted matches: {format(sum(final_summary$n), big.mark = ',')}"
)

# breakdown by decision type
rejected_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "no_match") |>
  dplyr::pull(n)
if (length(rejected_count) == 0) {
  rejected_count <- 0
}

pending_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "pending_review") |>
  dplyr::pull(n)
if (length(pending_count) == 0) {
  pending_count <- 0
}

if (rejected_count > 0) {
  cli::cli_alert_danger(
    "Rejected matches: {format(rejected_count, big.mark = ',')} (will have NA MFL values in final dataset)"
  )
}

if (pending_count > 0) {
  cli::cli_alert_warning(
    "Pending review: {format(pending_count, big.mark = ',')} (require additional review)"
  )
}

# show breakdown table
all_results_summary
Output
── Final Matching Results After Manual Review ──
✔ Accepted matches: 8
✖ Rejected matches: 149 (will have NA MFL values in final dataset)
Complete breakdown of reviewer decisions
Decision Count Percentage
match 8 5.1
no_match 149 94.9

To adapt the code:

  • Lines 3–8: Adjust the case_when() logic for handling different reviewer decision categories based on your workflow
  • Lines 12–14: The filter keeps only accepted matches; retain rejected or pending records here if your workflow needs them

Out of the 157 proposed matches, only 8 were accepted; the remaining 149 appeared to refer to different facilities. This may suggest underlying data quality issues or the presence of new health facility names.

DHIS2 facilities with rejected matches will appear in the final dataset (Step 9) with NA values for MFL columns, preserving the facility record while clearly indicating no reliable match was found. Facilities with pending review should be resolved before proceeding to production use.

Step 9: Combine All Matched Results

In this step, we combine results from all matching approaches and finalize the dataset by joining back to the original DHIS2 health facility list. This creates a comprehensive dataset that includes:

  1. High-confidence matches from Step 4 (Interactive Stratified Geographic Matching)
  2. Fuzzy matches from Step 6-8 (Various fuzzy matching strategies on remaining unmatched)
  3. Unmatched facilities (Preserved with NA values for MFL columns)

Step 9.1: Combine results from all matching approaches

When using the enhanced workflow with stratified matching, combine results from all stages:

  • R
  • Python
Show the code
# create manual_matches from manual review results (both matched and
# unmatched)
manual_matches <- final_incorp |>
  dplyr::mutate(
    # for unmatched, set appropriate values
    hf_mfl = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl,
      NA_character_
    ),
    hf_mfl_raw = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl_raw,
      NA_character_
    ),
    final_method = dplyr::if_else(
      reviewer_decision == "accept",
      "Manual Review - Accepted",
      "Manual Review - Rejected"
    ),
    score = dplyr::if_else(
      reviewer_decision == "accept",
      100,
      0
    )
  ) |>
  # manual reviewed results already carry adm0-3 + hf_uid_new from
  # fallback_final → review export → reimport
  dplyr::select(
    adm0,
    adm1,
    adm2,
    adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method
  )

# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and
# rejected
final_facilities_all <- dplyr::bind_rows(
  matched_dhis2_prepgeoname, # Step 4: Stratified geographic matches (if performed)
  matched_dhis2, # Step 5: Exact matches after standardization
  fuzzy_matches, # Step 6-8: Complete matches from fuzzy matching
  manual_matches # Manual matching (if performed)
)

# collapse to one record per DHIS2 facility for downstream integration
final_facilities_one_per_hf <- final_facilities_all |>
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(-dplyr::any_of(c("match_flag", "hf")))

cli::cli_alert_info(
  "All facilities processed (rows): {nrow(final_facilities_all)}"
)
cli::cli_alert_success(
  "All facilities processed (distinct DHIS2 by adm0/1/2/3+hf): {dplyr::n_distinct(final_facilities_all$hf_uid_new)}"
)

# generate summary by matching method
matching_summary <- final_facilities_one_per_hf |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    n_matched = dplyr::n(),
    avg_score = mean(score, na.rm = TRUE),
    min_score = min(score, na.rm = TRUE),
    max_score = max(score, na.rm = TRUE),
    .groups = "drop"
  )

# display summary
cli::cli_h2("Matching Results by Method")
matching_summary
Output
ℹ All facilities processed (rows): 1768
✔ All facilities processed (distinct DHIS2 by adm0/1/2/3+hf): 1768
Matching Results by Method
Method Count Avg Score Min Score Max Score
Fuzzy-matched using: Jaro-Winkler 32 87.0 85.5 89.4
Fuzzy-matched using: Levenshtein 13 91.8 85.0 95.8
Fuzzy-matched using: Qgram 2 94.6 94.1 95.0
Interactive Stratified Geographic Matching 1500 100.0 100.0 100.0
Manual Review - Accepted 8 100.0 100.0 100.0
Manual Review - Rejected 149 0.0 0.0 0.0
Matched Without Fuzzy Matching (standardization) 64 100.0 100.0 100.0

To adapt the code:

  • The manual_matches logic processes both accepted and rejected facilities from final_incorp. Replace final_incorp with your manual review results dataset name
  • Facilities that were rejected in manual review get final_method = "Manual Review - Rejected" and score = 0. Adjust these labels if needed
  • The bind_rows() call combines all matching approaches. Add or remove inputs based on which steps you used in your workflow
  • The final collapse keeps one record per DHIS2 facility for downstream integration. Modify if you need different deduplication rules

Step 9.2: Create final integrated DHIS2-MFL dataset

Create the final analytical dataset by starting with DHIS2 as the base and enriching it with MFL data where matches exist. This ensures no DHIS2 facilities are lost while adding valuable MFL attributes.

  • R
  • Python
# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final <- dhis2_df |>
  dplyr::left_join(
    dhis2_map,
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  )

# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf <- final_facilities_one_per_hf |>
  dplyr::select(hf_uid_new, hf_mfl_raw)

# create the integrated dataset
final_dhis2_mfl_df <- dhis2_df_final |>
  dplyr::left_join(final_match_per_hf, by = "hf_uid_new") |>
  # join MFL attributes for matched facilities
  dplyr::left_join(
    dplyr::select(master_hf_df, -hf_mfl, -hf, -adm0, -adm1, -adm2, -adm3),
    by = "hf_mfl_raw"
  )

# counts on distinct facilities via hf_uid_new (robust to row-level
# duplication)
total <- dhis2_df_final |> dplyr::distinct(hf_uid_new) |> nrow()
with_mfl <- final_match_per_hf |>
  dplyr::filter(!is.na(hf_mfl_raw)) |>
  nrow()
without_mfl <- total - with_mfl

# validation summary
cli::cli_h2("DHIS2-MFL Integration Complete")
cli::cli_alert_success(
  "Total DHIS2 facilities preserved: {total}"
)
cli::cli_alert_info(
  "Facilities with MFL data: {with_mfl}"
)
cli::cli_alert_warning(
  "Facilities without MFL match: {without_mfl}"
)
cli::cli_alert_info("Total rows (facility-months): {nrow(final_dhis2_mfl_df)}")
Output
── DHIS2-MFL Integration Complete ──
✔ Total DHIS2 facilities preserved: 1771
ℹ Facilities with MFL data: 1611
! Facilities without MFL match: 160
ℹ Total rows (facility-months): 184056

To adapt the code:

  • Line 5: Change hf to match your DHIS2 facility name column
  • Line 17: Adjust which MFL columns to exclude (we remove geographic columns to avoid conflicts with DHIS2’s geography)
  • Line 18: Ensure hf_mfl_raw matches the MFL facility name column from your matching results

Step 10: Final Checks

Perform critical quality checks before finalizing the matching workflow. This step identifies potential data quality issues that require SNT team review: one-to-many matches (which may indicate duplicate facilities or generic names) and unmatched facilities (which could represent new facilities, closed facilities, or data quality issues). These checks ensure match quality and create actionable review lists for the SNT team.

Step 10.1: Detecting one-to-many matches for review

After finalization, check for MFL facilities matched to multiple DHIS2 facilities, which may indicate data quality issues requiring review:

  • R
  • Python
# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many <- final_facilities_all |>
  dplyr::filter(!is.na(hf_mfl_raw)) |> # only matched facilities
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |> # MFL matched to >1 DHIS2
  dplyr::ungroup() |>
  dplyr::arrange(hf_mfl_raw, hf_dhis2_raw)

if (nrow(one_to_many) > 0) {
  cli::cli_alert_warning(
    paste0(
      "One-to-many matches detected: {nrow(one_to_many)} DHIS2 facilities ",
      "share {dplyr::n_distinct(one_to_many$hf_mfl_raw)} MFL facilities"
    )
  )

  # show sample
  knitr::kable(
    one_to_many |>
      dplyr::select(hf_dhis2_raw, hf_mfl_raw, score, final_method) |>
      dplyr::slice_head(n = 6),
    caption = "Sample one-to-many matches requiring review"
  )
} else {
  cli::cli_alert_success("No one-to-many matches detected")
}
Output
! One-to-many matches detected: 234 DHIS2 facilities share 112 MFL facilities
Sample one-to-many matches requiring review
hf_dhis2_raw hf_mfl_raw score final_method
Arab (Calaba Town) Clinic ARAB CLINIC 100.00000 Interactive Stratified Geographic Matching
Arab (Sengbeh) Clinic ARAB CLINIC 100.00000 Interactive Stratified Geographic Matching
ADAMA MARTH MEMORIAL CHC Adama Marth Memorial CHC 100.00000 Matched Without Fuzzy Matching (standardization)
Adama Martha Memorial CHC Adama Marth Memorial CHC 100.00000 Interactive Stratified Geographic Matching
AHMADIYYA MUSLIM (YONI MAMALA) HOSPITAL Ahmadiyya Muslim Hospital 85.85586 Fuzzy-matched using: Jaro-Winkler
Ahmadiyya Muslim (Nongowa) Hospital Ahmadiyya Muslim Hospital 100.00000 Interactive Stratified Geographic Matching

To adapt the code:

  • Line 4: Change hf_mfl_raw to your MFL facility name column
  • Lines 18-23: Adjust which columns and how many sample rows to show in the review table

One-to-many matches often signal underlying issues in the data. They may arise from duplicate or near-duplicate facility records in DHIS2, or from generic names in the MFL that align with multiple, more specific DHIS2 entries. In other cases, they reflect inconsistent naming conventions that need further standardization before integration.
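As a quick triage, the generic-name cases can be surfaced by counting how many distinct DHIS2 facilities each MFL name absorbed. Below is a minimal base-R sketch using toy data; in the workflow above, the real `one_to_many` table from Step 10.1 would take its place:

```r
# toy stand-in for the one_to_many table (illustrative rows only)
one_to_many <- data.frame(
  hf_dhis2_raw = c(
    "Arab (Calaba Town) Clinic",
    "Arab (Sengbeh) Clinic",
    "Ahmadiyya Muslim (Nongowa) Hospital"
  ),
  hf_mfl_raw = c("ARAB CLINIC", "ARAB CLINIC", "Ahmadiyya Muslim Hospital")
)

# count distinct DHIS2 facilities per MFL name; names absorbing several
# facilities are the strongest "generic name" candidates for review
absorbed <- aggregate(
  hf_dhis2_raw ~ hf_mfl_raw,
  data = one_to_many,
  FUN = function(x) length(unique(x))
)
names(absorbed)[2] <- "n_dhis2"
absorbed[order(-absorbed$n_dhis2), ]
```

MFL names at the top of this ranking are worth checking first, since a single generic record swallowing many specific DHIS2 entries distorts any facility-level aggregation downstream.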

Step 10.2: Unmatched facilities analysis

Analyze facilities that remain unmatched to identify potential reasons and create actionable review lists for the SNT team. Unmatched facilities could represent new facilities, closed facilities, or facilities with significant data quality issues.

  • R
  • Python
# analyze unmatched facilities by reason
unmatched_analysis <- final_facilities_all |>
  dplyr::filter(is.na(hf_mfl_raw)) |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    count = dplyr::n(),
    .groups = "drop"
  ) |>
  dplyr::arrange(desc(count))

# display unmatched analysis
cli::cli_h2("Unmatched Facilities Analysis")
cli::cli_alert_info(
  "Total unmatched facilities: {sum(unmatched_analysis$count)}"
)

# show breakdown by reason
purrr::pwalk(
  unmatched_analysis,
  ~ cli::cli_alert_warning("{.x}: {.y} facilities")
)

# create detailed unmatched facilities export
unmatched_detailed <- final_facilities_one_per_hf |>
  # unmatched are those without an MFL linkage
  dplyr::filter(is.na(hf_mfl_raw)) |>
  # select required context and scoring/method fields
  dplyr::select(
    hf_dhis2_raw,
    adm1,
    adm2,
    adm3,
    final_method,
    score
  ) |>
  # derive a clear potential reason for review
  dplyr::mutate(
    potential_reason = dplyr::case_when(
      is.na(final_method) ~ "Never matched - possible new facility",
      final_method == "Manual Review - Rejected" ~
        "Rejected in review - possible different facility",
      score > 0 & score < 50 ~ "Low similarity - possible data quality issue",
      TRUE ~ "Unmatched - needs investigation"
    )
  ) |>
  dplyr::arrange(adm1, adm2, adm3, hf_dhis2_raw)

# export for SNT team review
readr::write_csv(
  unmatched_detailed,
  here::here("03_outputs", "unmatched_facilities_for_snt_review.csv")
)

cli::cli_alert_success(
  "Exported {nrow(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli::cli_text("File: outputs/unmatched_facilities_for_snt_review.csv")
Output
── Unmatched Facilities Analysis ──
ℹ Total unmatched facilities: 149
These unmatched facilities may represent new facilities not yet in the MFL, closed facilities, or records with data quality issues.

To adapt the code:

  • Lines 2-10: Analysis groups unmatched facilities by final_method. Customize grouping as needed
  • Lines 24-35: Detailed export includes geographic context and scoring/method fields. Modify columns based on your needs
  • Lines 37-45: The potential reasons logic can be adjusted based on your specific workflow patterns
  • Lines 49-52: Export path for SNT team review can be customized

This check helps the SNT team focus their review on the most important cases. Some unmatched facilities may represent new sites that should be added to the MFL, while others point to data quality issues where name variations need correction. A portion may be closed facilities that need to be marked inactive. The results can also highlight geographic patterns, showing areas where unmatched facilities are more common and may require targeted review.
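One way to surface those geographic patterns is a simple tabulation of unmatched facilities per district. A base-R sketch with toy data follows; in practice, `unmatched_detailed` and its `adm1` column from Step 10.2 would be used:

```r
# toy stand-in for unmatched_detailed (illustrative districts only)
unmatched_detailed <- data.frame(
  hf_dhis2_raw = paste("facility", 1:6),
  adm1 = c("Northern", "Northern", "Northern", "Eastern", "Eastern", "Southern")
)

# unmatched count per district, highest first; districts at the top are
# where a targeted review is most likely to pay off
sort(table(unmatched_detailed$adm1), decreasing = TRUE)
```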

Step 11: Save Final Datasets

Save all final datasets for analysis, reporting, and SNT team review. This step consolidates all export operations and creates the outputs needed for downstream use.

  • R
  • Python
# save main integrated DHIS2-MFL dataset
rio::export(
  final_dhis2_mfl_df,
  here::here("03_outputs", "final_dhis2_mfl_integrated.xlsx")
)

# save matching results summary
rio::export(
  final_facilities_all,
  here::here("03_outputs", "facility_matching_results.xlsx")
)

if (nrow(unmatched_detailed) > 0) {
  readr::write_csv(
    unmatched_detailed,
    here::here("03_outputs", "unmatched_dhis2_facilities.csv")
  )
}

if (nrow(one_to_many) > 0) {
  readr::write_csv(
    one_to_many,
    here::here("03_outputs", "one_to_many_matches_for_review.csv")
  )
}

# summary of saved files
cli::cli_h2("Final Datasets Saved")
cli::cli_alert_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli::cli_alert_success("Matching results: facility_matching_results.xlsx")
cli::cli_alert_info("Unmatched facilities: {nrow(unmatched_detailed)} exported")
cli::cli_alert_info("One-to-many matches: {nrow(one_to_many)} exported")
Output

To adapt the code:

  • Lines 2-5: Main integrated dataset path can be customized for your project structure
  • Lines 8-11: Matching results summary export path
  • Lines 13-18: Unmatched facilities export - modify path as needed
  • Lines 20-25: One-to-many matches export - automatic detection and export
  • Lines 27-32: Summary messages can be customized

Summary

This guide outlines a structured approach to matching health facility names between DHIS2 datasets and the Master Facility List (MFL). It features a two-stage workflow designed for efficiency and accuracy. Users may implement the full workflow or apply fuzzy matching directly if only basic string similarity is required.

Stage 1: Interactive Stratified Geographic Matching uses sntutils::prep_geonames() to combine human validation with geographic constraints. It typically resolves 60–70% of matches with high confidence, exposes misaligned facilities early, and provides opportunities for immediate review of suggested matches.

Stage 2: Comprehensive Fuzzy Matching addresses the remaining unmatched facilities through cleaning and standardization of names, applying multiple string similarity algorithms (Jaro-Winkler, Levenshtein, Qgram, LCS), and checking structural quality. This helps capture spelling variations, abbreviations, and geographic misplacements that prevent exact matching.
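All of these algorithms produce scores normalized to a 0-100 scale before comparison. As an illustration of the normalization used in Step 6.2, a Levenshtein similarity can be computed with base R's `adist()` (Jaro-Winkler and Qgram require the `stringdist` package used in the workflow):

```r
# normalized levenshtein similarity on a 0-100 scale, as in step 6.2:
# 1 - edit_distance / max_length, clamped at zero
lev_similarity <- function(a, b) {
  d <- drop(adist(a, b)) # base R edit distance (utils::adist)
  len_max <- max(nchar(a), nchar(b))
  round(max(0, 1 - d / len_max) * 100, 1)
}

# one inserted character over 25 characters -> high similarity
lev_similarity("adama marth memorial chc", "adama martha memorial chc") # 96
```

This is why near-duplicate spellings such as the "Adama Marth/Martha" pair above score well above typical acceptance thresholds.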

The two-stage approach resolves most matches through stratified geographic matching, reducing manual review and leaving only difficult cases for fuzzy matching. It detects misalignments early, improves confidence through human validation, and remains flexible for both interactive and automated use. By combining multiple strategies, the approach achieves broad coverage while maintaining quality.

For facilities that require fuzzy matching, four strategies are available: single-method matching, composite scoring, weighted composites, and fallback loops with progressive thresholds. Each is suited to different data contexts and confidence needs. Transparency and comparability are built into the process, but validation with the SNT team remains essential for low-confidence matches and critical applications.
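For instance, the weighted-composite strategy replaces the plain row mean of Step 6.3 with a weighted mean of the per-method scores. A minimal sketch follows; the weights are hypothetical and should be tuned to your data:

```r
# hypothetical method weights (must sum to 1); here jaro-winkler is
# trusted most and lcs least
weights <- c(score_jw = 0.4, score_lv = 0.3, score_qg = 0.2, score_lcs = 0.1)

# per-method similarity scores (0-100) for one candidate pair,
# as produced in step 6.2
scores <- c(score_jw = 88, score_lv = 80, score_qg = 90, score_lcs = 76)

# weighted composite: sum of weight_i * score_i
composite <- sum(weights * scores)
composite # 84.8
```

In the workflow above, this would replace the `rowMeans()` call in Step 6.3, with one weight per column in `score_cols`.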

Full code

Find the full code script for fuzzy matching health facility names between datasets below.

  • R
  • Python
################################################################################
############ ~ Fuzzy matching of names across datasets full code ~ #############
################################################################################

### Step 1: Install and Load Required Libraries --------------------------------

# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load all required packages using pacman
pacman::p_load(
  readxl,     # for reading Excel files
  dplyr,      # for data manipulation
  stringdist, # for calculating string distances (fuzzy matching)
  stringr,    # for string cleaning and replacement
  stringi,    # for accent and whitespace normalization
  tibble,     # for working with modern data frames
  tidyr,      # for building the matching grid (crossing, separate_rows)
  purrr,      # for functional iteration
  cli,        # for formatted console messages
  knitr,      # for creating formatted tables
  openxlsx,   # for writing Excel files
  readr,      # for writing CSV files
  rio,        # for exporting datasets
  httr,       # for HTTP requests to download files (optional)
  here        # for cross-platform file paths
)

### Step 2: Load Data ----------------------------------------------------------

# set up path to dhis2 hf data
dhis2_path <- here::here(
  "01_data",
  "02_epidemiology",
  "2a_routine_surveillance",
  "processed"
)

hf_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1b_health_facilities",
  "processed"
)

# read DHIS2 health facilities data
dhis2_df <- readRDS(
  file.path(dhis2_path, "sle_dhis2_with_clean_adm3.rds")
) |>
  # rename facility name columns for clarity and consistency
  dplyr::rename(hf_dhis2_raw = hf)

# get distinct admin and hf cols
dhis2_hf_df <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw)

# read MFL health facilities data
master_hf_df <- read.csv(
  file.path(hf_path, "hf_final_clean_data.csv")
) |>
  dplyr::distinct(
    adm0, adm1, adm2, adm3, hf, lat, long, .keep_all = TRUE
  ) |>
  # rename facility name columns for clarity and consistency
  dplyr::mutate(hf_mfl_raw = hf)

# attach a stable DHIS2 facility ID for consistent counting across
# steps
# distinct by geography + name to avoid collisions across admins
dhis2_map <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf) |>
  dplyr::mutate(
    hf_uid_new = paste0(
      "hf_uid_new::",
      as.integer(as.factor(paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf)),
        sep = "|"
      )))
    )
  )

# print head of data
cli::cli_h3("Sample of DHIS2 data:")
head(dhis2_hf_df)

cli::cli_h3("Sample of MFL data:")
head(master_hf_df)

### Step 3: Initial Matching Diagnostics ---------------------------------------

#### Step 3.1: Overall exact match check ---------------------------------------

# check for exact matches on raw names (no admin constraint)
exact_matches_all <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw"),
    relationship = "many-to-many"
  )

# calculate matching potential
total_dhis2 <- nrow(dhis2_hf_df)
total_mfl <- nrow(master_hf_df)
unmatched_dhis2 <- total_dhis2 - nrow(exact_matches_all)

cli::cli_h3("Overall Matching Summary")
cli::cli_alert_info("Total DHIS2 facilities: {total_dhis2}")
cli::cli_alert_info("Total MFL facilities: {total_mfl}")
cli::cli_alert_success(
  paste0(
    "Exact matches found: {nrow(exact_matches_all)} (",
    "{round(nrow(exact_matches_all)/total_dhis2*100, 1)}%)"
  )
)
cli::cli_alert_warning("Remaining to match: {unmatched_dhis2}")

### Step 3: Initial Matching Diagnostics ---------------------------------------

#### Step 3.2: Administrative level match check --------------------------------

# check matches at adm2 level (district)
dhis2_by_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(total_dhis2 = dplyr::n())

matches_by_adm2 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2")
  ) |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(exact_matches = dplyr::n()) |>
  dplyr::left_join(dhis2_by_adm2, by = "adm2") |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::select(adm2, exact_matches, total_dhis2, match_rate) |>
  dplyr::arrange(dplyr::desc(match_rate))

cli::cli_h3("Exact matches by district (adm2)")
matches_by_adm2

# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(total_dhis2 = dplyr::n(), .groups = "drop")

matches_by_adm3 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2", "adm3")
  ) |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(exact_matches = dplyr::n(), .groups = "drop") |>
  dplyr::left_join(dhis2_by_adm3, by = c("adm2", "adm3")) |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::filter(total_dhis2 >= 5) |> # Only show areas with 5+ facilities
  dplyr::arrange(dplyr::desc(match_rate)) |>
  dplyr::slice_head(n = 10) # Show top 10 performing adm3 areas

cli::cli_h3("Top exact matches by chiefdom (adm3)")
matches_by_adm3

### Step 3: Initial Matching Diagnostics ---------------------------------------

#### Step 3.3: Duplicate name check --------------------------------------------

# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_dhis2_raw)

mfl_dups_adm2 <- master_hf_df |>
  dplyr::group_by(adm2, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_mfl_raw)

cli::cli_h3("Duplicates within same district (adm2)")
cli::cli_alert_warning(
  paste0(
    "DHIS2 duplicates within districts: ",
    "{length(unique(dhis2_dups_adm2$hf_dhis2_raw))}"
  )
)
cli::cli_alert_warning(
  paste0(
    "MFL duplicates within districts: ",
    "{length(unique(mfl_dups_adm2$hf_mfl_raw))}"
  )
)

# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_dhis2_raw)

mfl_dups_adm3 <- master_hf_df |>
  dplyr::group_by(adm2, adm3, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_mfl_raw)

cli::cli_h3("Duplicates within same chiefdom (adm3)")
cli::cli_alert_danger(
  paste0(
    "DHIS2 duplicates within chiefdoms: ",
    "{length(unique(dhis2_dups_adm3$hf_dhis2_raw))}"
  )
)
cli::cli_alert_danger(
  paste0(
    "MFL duplicates within chiefdoms: ",
    "{length(unique(mfl_dups_adm3$hf_mfl_raw))}"
  )
)

# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall <- dhis2_hf_df |>
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_dhis2_raw)

mfl_dups_overall <- master_hf_df |>
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_mfl_raw)

cli::cli_h3("Overall duplicate names (across all areas)")
cli::cli_alert_info(
  "DHIS2 overall duplicates: {length(
    unique(dhis2_dups_overall$hf_dhis2_raw)
  )}"
)
cli::cli_alert_info(
  "MFL overall duplicates: {length(
    unique(mfl_dups_overall$hf_mfl_raw)
  )}"
)

### Step 4: Interactive Stratified Geographic Matching -------------------------

# set up location to save cache
cache_loc <- "1.1_foundational/1d_cache_files"

# interactive stratified matching with automatic standardization
# this function handles name standardization internally
dhis2_df_cleaned <-
  sntutils::prep_geonames(
    target_df = dhis2_hf_df, # dataset to be cleaned
    lookup_df = master_hf_df, # reference dataset with correct admin
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    level4 = "hf",
    cache_path = here::here(cache_loc, "geoname_cache.rds"),
    unmatched_export_path = here::here(cache_loc, "dhis2_hf_unmatched.rds")
  )

# load the unmatched facilities for further processing (step 5–8 fuzzy
# matching)
dhis2_hf_to_process <- readRDS(
  here::here(cache_loc, "dhis2_hf_unmatched.rds")
) |>
  dplyr::select(adm0, adm1, adm2, adm3, hf_dhis2_raw = hf)

# summary statistics
n_original <- nrow(dhis2_hf_df)
n_matched <- n_original - nrow(dhis2_hf_to_process)
match_rate <- (n_matched / n_original) * 100

cli::cli_alert_success(
  paste0(
    "Stratified matching completed: ",
    "{format(n_matched, big.mark = ',')}/{format(n_original, big.mark = ',')}",
    " facilities matched ({round(match_rate, 1)}%)"
  )
)
cli::cli_alert_info(
  "Remaining unmatched: {nrow(dhis2_hf_to_process)} facilities"
)

### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------

#### Step 5.1: Standardize health facility names -------------------------------

# create function to standardize hf names
standardize_names <- function(name_vec) {
  # validate input
  if (!rlang::is_atomic(name_vec)) {
    cli::cli_abort("`name_vec` must be an atomic vector.")
  }

  name_vec |>
    # ensure character type
    as.character() |>
    # convert to lowercase
    stringr::str_to_lower() |>
    # replace punctuation with space
    stringr::str_replace_all("[[:punct:]]", " ") |>
    # remove extra spaces and trim
    stringr::str_squish() |>
    # normalize accents
    stringi::stri_trans_general("Latin-ASCII") |>
    # normalize all space-like characters
    stringi::stri_replace_all_regex("\\p{Zs}+", " ") |>
    # convert roman numerals to arabic numerals
    stringr::str_replace_all(
      c(
        " ix\\b" = " 9",
        " viii\\b" = " 8",
        " vii\\b" = " 7",
        " vi\\b" = " 6",
        " v\\b" = " 5",
        " iv\\b" = " 4",
        " iii\\b" = " 3",
        " ii\\b" = " 2",
        " i\\b" = " 1"
      )
    ) |>
    # sort tokens with letters first, numbers last; within groups
    # alphabetical
    purrr::map_chr(\(.x) {
      # split on one-or-more spaces
      tokens <- strsplit(.x, " +")[[1]]

      # detect pure numeric tokens
      is_num <- stringr::str_detect(tokens, "^[0-9]+$")

      # order alphabetic first, then numeric; sort within each group
      ordered <- c(sort(tokens[!is_num]), sort(tokens[is_num]))

      # rejoin
      paste(ordered, collapse = " ")
    })
}

# set up example with messy formatting
example_word <- factor("Clínica! Rahmâ    IV (  New clinic) East")

# print original structure
cat("\nExample before standardization:\n")
str(example_word)

# apply standardization
example_word_st <- standardize_names(example_word)

# print cleaned example
cat("\nExample after standardization:\n")
str(example_word_st)

### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------

#### Step 5.1: Standardize health facility names -------------------------------

# remove duplicates from MFL and format hf col
master_hf_df <- master_hf_df |>
  dplyr::distinct(hf_mfl_raw, .keep_all = TRUE) |>
  dplyr::mutate(hf_mfl = standardize_names(hf_mfl_raw))

# important: preserve original full dataset before processing unmatched
# (this is our base N)
dhis2_hf_df_original <- dhis2_hf_df |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw))

# build geo-aware stable ID once and attach to originals and unmatched
hf_uid_new_map <- dhis2_hf_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::mutate(
    hf_uid_new = paste0(
      "hf_uid_new::",
      as.integer(as.factor(paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf_dhis2_raw)),
        sep = "|"
      )))
    )
  )

dhis2_hf_df_original <- dhis2_hf_df_original |>
  dplyr::left_join(
    hf_uid_new_map,
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched <- dhis2_hf_to_process |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw)) |>
  dplyr::left_join(
    hf_uid_new_map |>
      dplyr::mutate(hf_dhis2_raw = toupper(hf_dhis2_raw)),
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

knitr::kable(
  # check if it worked
  dhis2_hf_unmatched |>
    dplyr::distinct(hf_dhis2_raw, hf_dhis2) |>
    dplyr::slice_head(n = 10)
)

### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------

#### Step 5.2: Handle abbreviations in health facility names -------------------

abbrev_dictionary <-
  dplyr::bind_rows(
    dplyr::select(dhis2_hf_unmatched, hf = hf_dhis2_raw),
    dplyr::select(master_hf_df, hf = hf_mfl_raw)
  ) |>
  tidyr::separate_rows(hf, sep = " ") |> # split into words
  dplyr::filter(hf != "") |> # remove blanks
  # detect 2-4 uppercase letter patterns
  dplyr::filter(stringr::str_detect(hf, "^[A-Z]{2,4}$")) |>
  dplyr::count(hf, sort = TRUE) |> # count frequencies
  dplyr::rename(word = hf, freq = n) |> # rename for clarity
  dplyr::filter(freq > 2) |> # keep abbreviations seen more than twice
  as.data.frame()

# check output
abbrev_dictionary

### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------

#### Step 5.2: Handle abbreviations in health facility names -------------------

# define abbreviation dictionary (all lowercase, because we use the
# standardize column)
abbrev_dict <- c(
  "maternal child health post" = "mchp",
  "community health post" = "chp",
  "community health center" = "chc",
  "urban maternal clinic" = "umi",
  "expanded programme on immunization" = "epi"
)

# apply replacements to MFL names
master_hf_df <- master_hf_df |>
  dplyr::mutate(
    hf_mfl = stringr::str_replace_all(hf_mfl, abbrev_dict)
  )

# apply replacements to DHIS2 names
dhis2_hf_unmatched <- dhis2_hf_unmatched |>
  dplyr::mutate(
    hf_dhis2 = stringr::str_replace_all(hf_dhis2, abbrev_dict)
  )

# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched |>
  dplyr::filter(
    stringr::str_detect(
      hf_dhis2_raw,
      paste0(
        "(?i)Community Health Center|Maternal Child Health Post|",
        "Community Health Post"
      )
    ) |
      stringr::str_detect(hf_dhis2_raw, "CHC|MCHP|CHP")
  ) |>
  dplyr::select(hf_dhis2_raw, hf_dhis2) |>
  head()

### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------

#### Step 6.1: Identify matched and unmatched facilities -----------------------

# flag for one-to-one enforcement
enforce_one_to_one <- FALSE

# exact matches using raw names
matched_dhis2_raw <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw),
    by = c("hf_dhis2_raw" = "hf_mfl_raw")
  )

# exact matches using standardized names
matched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_dhis2, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw, hf_mfl),
    by = c("hf_dhis2" = "hf_mfl"),
    keep = TRUE
  ) |>
  # exclude facilities already matched via raw names
  dplyr::anti_join(matched_dhis2_raw, by = c("hf_dhis2_raw")) |>
  # label exact matches
  dplyr::mutate(
    final_method = paste0(
      "Matched Without Fuzzy Matching (standardization)"
    ),
    score = 100
  )

# only include facilities that were actually matched via geographic
# standardization
# these should be facilities that had geographic corrections applied
matched_dhis2_prepgeoname <-
  dhis2_df_cleaned |>
  dplyr::anti_join(
    dhis2_hf_unmatched,
    by = c("adm0", "adm1", "adm2", "adm3", "hf" = "hf_dhis2_raw")
  ) |>
  dplyr::left_join(
    master_hf_df |>
      dplyr::mutate(hf = toupper(hf)) |>
      dplyr::select(adm0, adm1, adm2, adm3, hf, hf_mfl_raw, hf_mfl),
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  ) |>
  # attach hf_uid_new from dhis2_map (by admin + hf)
  dplyr::left_join(
    dhis2_map |> dplyr::select(adm0, adm1, adm2, adm3, hf, hf_uid_new),
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  ) |>
  # label matches
  dplyr::mutate(
    final_method = "Interactive Stratified Geographic Matching",
    score = 100
  )

# compute unmatched after standardization
unmatched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(hf_dhis2) |>
  dplyr::anti_join(
    master_hf_df |>
      dplyr::select(hf_mfl),
    by = c("hf_dhis2" = "hf_mfl")
  ) |>
  dplyr::distinct(hf_dhis2)

# collect already matched mfl to enforce one-to-one in candidate pool
used_mfl_stand <- matched_dhis2 |>
  dplyr::pull(hf_mfl) |>
  unique()

used_mfl_prepgeoname <- matched_dhis2_prepgeoname |>
  dplyr::pull(hf_mfl) |>
  unique()

used_mfl <- c(used_mfl_stand, used_mfl_prepgeoname)

# build candidate MFL pool for fuzzy step
candidate_mfl_df <- master_hf_df |>
  dplyr::select(hf_mfl)

# if enforcing one-to-one, drop MFLs already used by exact matches
if (enforce_one_to_one) {
  candidate_mfl_df <- candidate_mfl_df |>
    dplyr::filter(!hf_mfl %in% used_mfl)
}

# summary counts
total_dhis2_hf <- dplyr::n_distinct(dhis2_hf_unmatched$hf_dhis2_raw)
raw_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2_raw$hf_dhis2_raw)
raw_unmatch_dhis2_hf <- total_dhis2_hf - raw_match_dhis2_hf

standardized_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2$hf_dhis2)
standardized_unmatch_dhis2_hf <- total_dhis2_hf - standardized_match_dhis2_hf

total_mfl_hf <- dplyr::n_distinct(master_hf_df$hf_mfl)
candidate_mfl_count <- dplyr::n_distinct(candidate_mfl_df$hf_mfl)

# display summary
cli::cli_h2("Summary of Matching Status")

cli::cli_alert_info(
  "Total DHIS2 facilities: {format(total_dhis2_hf, big.mark = ',')}"
)

cli::cli_alert_success(
  paste0(
    "Matched after standardization: ",
    "{format(standardized_match_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched using raw names: ",
    "{format(raw_unmatch_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched after standardization: ",
    "{format(standardized_unmatch_dhis2_hf, big.mark = ',')}"
  )
)

cli::cli_alert_info(
  paste0(
    "MFL facilities in candidate pool for matching: ",
    "{format(candidate_mfl_count, big.mark = ',')} of ",
    "{format(total_mfl_hf, big.mark = ',')}"
  )
)

### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------

#### Step 6.1: Create the matching grid ----------------------------------------

# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities.

# create simple candidate match pool
candidate_match_pool <- tidyr::crossing(
  unmatched_dhis2,
  candidate_mfl_df
)

# create match grid with dhis2 and mfl names
match_grid <- candidate_match_pool

# preview sample - top 10 matches
match_grid |>
  dplyr::slice_head(n = 10)

### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------

#### Step 6.2: Compute similarity scores ---------------------------------------

# calculate fuzzy scores
match_grid <- match_grid |>
  dplyr::mutate(
    len_max = pmax(nchar(hf_dhis2), nchar(hf_mfl)),
    score_jw = 1 - stringdist::stringdist(hf_dhis2, hf_mfl, method = "jw"),
    score_lv = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lv") / len_max,
    score_qg = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "qgram") / len_max,
    score_lcs = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lcs") / len_max
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("score_"),
      .fns = ~ ifelse(is.nan(.x) | .x < 0, 0, .x)
    ),
    dplyr::across(
      .cols = dplyr::contains("score_"),
      .fns = ~ .x * 100
    )
  )

### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------

#### Step 6.3: Create composite scores -----------------------------------------

# define similarity score columns
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

# compute composite_score dynamically using score_cols
match_grid <- match_grid |>
  dplyr::mutate(
    composite_score = rowMeans(
      dplyr::across(dplyr::all_of(score_cols))
    )
  )

# compute average rank across all similarity methods
ranked_grid <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::mutate(
    dplyr::across(
      dplyr::all_of(score_cols),
      ~ dplyr::min_rank(dplyr::desc(.)),
      .names = "rank_{.col}"
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    rank_avg = rowSums(dplyr::across(
      dplyr::all_of(paste0("rank_", score_cols))
    )),
    rank_avg = round(rank_avg / length(score_cols))
  )

# preview ranked results to show multiple candidates per DHIS2 facility
# select a few facilities and show their top 4 candidates each
sample_facilities <- c(
  "charity clinic kamba of",
  "arab clinic shad",
  "al arab clinic sheefa"
)

ranked_grid |>
  dplyr::filter(hf_dhis2 %in% sample_facilities) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, n = 4, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    dplyr::all_of(score_cols),
    composite_score,
    rank_avg
  ) |>
  dplyr::arrange(hf_dhis2, rank_avg)

### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------

#### Step 6.4: Extract best match ----------------------------------------------

# for each fuzzy matching method, we select the top-ranked MFL name
# per DHIS2 hf.
best_jw <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_jw, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Jaro-Winkler") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_jw, method
  )

best_lv <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lv, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Levenshtein") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lv, method
  )

best_qg <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_qg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Qgram") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_qg, method
  )

best_lcs <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lcs, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "LCS") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lcs, method
  )

best_comp <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(composite_score, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Composite-Score") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = composite_score, method
  )

best_ranked_match <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Rank-Ensemble") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = rank_avg, method
  )

# combine top matches from all methods for comparison across approaches
all_best <- dplyr::bind_rows(
  best_jw, best_lv, best_qg, best_lcs,
  best_comp, best_ranked_match)

# preview best matches to show extraction results
all_best |>
  dplyr::slice_head(n = 10) |>
  dplyr::select(hf_dhis2, hf_mfl, score, method)

### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------

#### Step 7.1: Visualize score distributions across methods --------------------

# plot score distributions across methods
score_dist_plot <- all_best |>
  dplyr::filter(method != "Rank-Ensemble") |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_density(
    fill = "steelblue", color = "steelblue",
    alpha = 0.6, linewidth = 1
  ) +
  ggplot2::facet_wrap(~method) +
  ggplot2::labs(
    title = "Distribution of Fuzzy Matching Scores",
    x = "\nMatch Score (%)",
    y = "Density\n"
  ) +
  ggplot2::theme_minimal(base_size = 14)

score_dist_plot

# save plot
ggplot2::ggsave(
  plot = score_dist_plot,
  filename = here::here("03_outputs", "fuzzy_match_score_distributions.png"),
  width = 12,
  height = 9,
  dpi = 300
)

### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------

#### Step 7.2: Define match quality diagnostic function ------------------------

assess_match_quality <- function(name1, name2) {
  purrr::map2_dfr(name1, name2, function(a, b) {
    tokens1 <- strsplit(a, "\\s+")[[1]]
    tokens2 <- strsplit(b, "\\s+")[[1]]
    tibble::tibble(
      prefix_match = tolower(tokens1[1]) == tolower(tokens2[1]),
      suffix_match = tolower(tail(tokens1, 1)) == tolower(tail(tokens2, 1)),
      token_diff = abs(length(tokens1) - length(tokens2)),
      char_diff = abs(nchar(a) - nchar(b))
    )
  })
}

assess_match_quality("Makeni Govt Hospital", "Makeni Government Hospital")

### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------

#### Step 7.3: Evaluate quality of top fuzzy matches ---------------------------

# assess structural match quality
diagnostics_df <- dplyr::bind_cols(
  all_best,
  assess_match_quality(all_best$hf_dhis2, all_best$hf_mfl)
)

# compare method diagnostics
summary_stats <- diagnostics_df |>
  dplyr::group_by(method) |>
  dplyr::summarise(
    score = mean(score),
    avg_token_diff = mean(token_diff) |> round(2),
    avg_char_diff = mean(char_diff) |> round(2),
    pct_prefix_match = (mean(prefix_match) * 100) |> round(2),
    pct_suffix_match = (mean(suffix_match) * 100) |> round(2),
    total = dplyr::n(),
    .groups = "drop"
  )

# create a final overall score
summary_stats <- summary_stats |>
  dplyr::mutate(
    # rescale negative of average token difference (smaller is better)
    token_score = scales::rescale(-avg_token_diff, to = c(0, 100)),
    # rescale negative of average character difference (smaller is better)
    char_score = scales::rescale(-avg_char_diff, to = c(0, 100)),
    # rescale prefix match percentage (higher is better)
    prefix_score = scales::rescale(pct_prefix_match, to = c(0, 100)),
    # rescale suffix match percentage (higher is better)
    suffix_score = scales::rescale(pct_suffix_match, to = c(0, 100)),

    # combine all four metrics into a weighted structural quality score
    structure_score = round(
      0.3 * token_score +         # Emphasize fewer token differences
        0.2 * char_score +        # Moderate weight on character similarity
        0.25 * prefix_score +     # Give weight to matching initial words
        0.25 * suffix_score,      # Give equal weight to matching final words
      1
    )
  ) |>
  dplyr::arrange(dplyr::desc(structure_score)) |>
  # assign rank based on descending structure score
  dplyr::mutate(rank = dplyr::row_number()) |>
  dplyr::select(
    method, avg_token_diff, avg_char_diff,
    pct_prefix_match, pct_suffix_match,
    total, structure_score, rank
  )

# check results
summary_stats |>
  dplyr::select(
    Method = method,
    `Avg. Token Difference` = avg_token_diff,
    `Avg. Character Difference` = avg_char_diff,
    `% Prefix Match` = pct_prefix_match,
    `% Suffix Match` = pct_suffix_match,
    `Structural Score` = structure_score,
    Rank = rank
  ) |> as.data.frame()

### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------

#### Step 7.4: Weighted thresholding using structural quality and false-positive risk ----

# calculate method-specific weights and thresholds
method_threshold <- summary_stats |>
  dplyr::mutate(
    score = structure_score / 100,
    weight = score / sum(score),
    threshold = scales::rescale(
      structure_score,
      to = rev(unname(
        stats::quantile(all_best$score, c(0.70, 0.95), na.rm = TRUE)
      ))
    ) |>
      round()
  ) |>
  dplyr::select(method, weight, threshold)

# display results
method_threshold
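
# aside: the rescale() call above inverts the mapping because `to` is
# reversed. A toy example with to = c(95, 70): the lowest structure score
# maps to the strictest threshold (95) and the highest to the most
# permissive (70)
scales::rescale(c(0, 50, 100), to = c(95, 70)) # 95.0 82.5 70.0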

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.1: Match selection options -----------------------------------------

# apply single-method threshold and flag matches
best_lv_final <- best_lv |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review"),
    final_method = "Levenshtein"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(hf_dhis2, hf_mfl, hf_mfl_raw, score, final_method, match_flag)

# show top 5 and bottom 5 matches by score
top_5 <- best_lv_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- best_lv_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Levenshtein Score")

# preview results
top_bottom

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.1: Match selection options -----------------------------------------

# select best match using rank average
composite_final <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(rank_avg) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(composite_score >= 85, "match", "review")
  ) |>
  dplyr::select(
    adm1,
    adm2,
    hf_dhis2,
    hf_mfl,
    composite_score,
    rank_avg,
    match_flag
  )

# show top 5 and bottom 5 matches by composite score
top_5 <- composite_final |>
  dplyr::arrange(dplyr::desc(composite_score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- composite_final |>
  dplyr::arrange(composite_score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Composite Score")

# preview results
top_bottom

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.1: Match selection options -----------------------------------------

# define score columns and extract weights
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

weights <- method_threshold |>
  dplyr::filter(
    method %in% c("Jaro-Winkler", "Levenshtein", "Qgram", "LCS")
  ) |>
  dplyr::pull(weight)

# calculate weighted composite score for each facility-match pair
weighted_final <- ranked_grid |>
  dplyr::rowwise() |>
  dplyr::mutate(
    # weighted average maintains 0-100 scale
    weighted_composite_score = stats::weighted.mean(
      c(score_jw, score_lv, score_qg, score_lcs),
      weights,
      na.rm = TRUE
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(dplyr::desc(weighted_composite_score)) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(
      weighted_composite_score >= 85,
      "match",
      "review"
    ),
    score = weighted_composite_score, # standardize column name for downstream steps
    final_method = "Weighted-Composite"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    hf_mfl_raw,
    score,
    final_method,
    match_flag
  )

# show top 5 and bottom 5 matches by weighted composite score
top_5 <- weighted_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- weighted_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Weighted Composite Score")

# preview results
top_bottom

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.1: Match selection options -----------------------------------------

# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats <- summary_stats |>
  dplyr::arrange(dplyr::desc(structure_score)) |>
  dplyr::filter(method != "Rank-Ensemble") # exclude rank-based method

method_order <- method_stats$method

# use thresholds calculated in step 7.4
method_thresholds_df <- method_threshold |>
  dplyr::filter(method != "Rank-Ensemble")

# convert to named vector for easy lookup
method_thresholds <- setNames(
  method_thresholds_df$threshold,
  method_thresholds_df$method
)

# build combined table of top matches from real methods only
all_best_nonrank <- dplyr::bind_rows(
  best_lv,
  best_jw,
  best_qg,
  best_lcs,
  best_comp
) |>
  # normalize method labels
  dplyr::mutate(method = trimws(method))

# add rank once
rank_key <- ranked_grid |>
  dplyr::select(hf_dhis2, hf_mfl, rank_avg)

all_best_ranked <- all_best_nonrank |>
  dplyr::left_join(rank_key, by = c("hf_dhis2", "hf_mfl")) |>
  dplyr::mutate(
    # worst for missing ranks
    rank_avg = dplyr::if_else(is.na(rank_avg), Inf, rank_avg)
  )

# split by method and pre-arrange for stable tiebreak
by_method <- split(all_best_ranked, all_best_ranked$method) |>
  purrr::map(\(x) {
    x |>
      dplyr::arrange(dplyr::desc(score), rank_avg)
  })

# keep copy for pass-rate check
initial_thr <- method_thresholds

# init
fallback_chunks <- list()
unmatched <- unique(all_best_ranked$hf_dhis2)

# fast fallback loop (no joins, no grouping inside)
repeat {
  for (m in method_order) {
    if (!m %in% names(by_method)) {
      next
    }

    # current threshold from step 7.4
    thr <- method_thresholds[[m]]
    if (is.null(thr)) {
      thr <- 85 # fallback if method not found
    }

    # filter by score and still-unmatched
    cand <- by_method[[m]] |>
      dplyr::filter(score >= thr, hf_dhis2 %in% unmatched)

    if (nrow(cand) > 0) {
      # pick first row per hf_dhis2 (pre-arranged by score desc, rank asc)
      best_per_hf <- cand |>
        dplyr::distinct(hf_dhis2, .keep_all = TRUE) |>
        dplyr::mutate(
          final_method = glue::glue("Fuzzy-matched using: {m}")
        )

      # store and update unmatched
      fallback_chunks[[length(fallback_chunks) + 1L]] <- best_per_hf
      unmatched <- setdiff(unmatched, best_per_hf$hf_dhis2)
    }

    if (length(unmatched) == 0) break
  }

  if (length(unmatched) == 0) {
    break
  }

  # relax thresholds gradually
  method_thresholds <- method_thresholds - 1
  if (any(method_thresholds <= 50)) break # stop if too low
}

# bind once
fallback_matched <- dplyr::bind_rows(fallback_chunks)

# build a 1:1 map from standardized DHIS2 name to admin + raw +
# hf_uid_new
hf_uid_new_map_by_stand <- dhis2_hf_unmatched |>
  dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::summarise(
    adm0 = dplyr::first(adm0),
    adm1 = dplyr::first(adm1),
    adm2 = dplyr::first(adm2),
    adm3 = dplyr::first(adm3),
    hf_dhis2_raw = dplyr::first(hf_dhis2_raw),
    hf_uid_new = dplyr::first(hf_uid_new),
    .groups = "drop"
  )

# final flags and attach admin + ids
fallback_final <- fallback_matched |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review")
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::left_join(hf_uid_new_map_by_stand, by = "hf_dhis2") |>
  dplyr::select(
    adm0, adm1, adm2, adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    match_flag
  )

top_5 <- fallback_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- fallback_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Fallback")

# preview results
top_bottom

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.2: Select fuzzy matching approach ----------------------------------

# create fuzzy_matches from step 8 results (use whichever option was chosen)
# step 8 produces a different result dataset depending on the option:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final

fuzzy_matches <- if (exists("fallback_final")) {
  dplyr::filter(fallback_final, match_flag == "match")
} else if (exists("composite_final")) {
  dplyr::filter(composite_final, match_flag == "match")
} else if (exists("weighted_final")) {
  dplyr::filter(weighted_final, match_flag == "match")
} else if (exists("best_lv_final")) {
  dplyr::filter(best_lv_final, match_flag == "match")
} else if (exists("best_jw_final")) {
  dplyr::filter(best_jw_final, match_flag == "match")
} else if (exists("best_qg_final")) {
  dplyr::filter(best_qg_final, match_flag == "match")
} else {
  # empty dataframe with expected structure if no step 8 results exist
  data.frame(
    hf_dhis2_raw = character(0),
    hf_dhis2 = character(0),
    hf_mfl_raw = character(0),
    hf_mfl = character(0),
    score = numeric(0),
    final_method = character(0),
    stringsAsFactors = FALSE
  )
}

# remove any internal duplicates (keep the highest-scoring match per
# facility)
fuzzy_matches <- fuzzy_matches |>
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
  dplyr::ungroup()

# attach admin + hf_uid_new from the unmatched DHIS2 list (collapsed to
# 1:1 per hf_dhis2_raw); the base pipe has no dot placeholder, so the
# lookup table is built first and then joined
hf_uid_new_map_unmatched <- dhis2_hf_unmatched |>
  dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::summarise(
    adm0 = dplyr::first(adm0),
    adm1 = dplyr::first(adm1),
    adm2 = dplyr::first(adm2),
    adm3 = dplyr::first(adm3),
    hf_uid_new = dplyr::first(hf_uid_new),
    .groups = "drop"
  )

fuzzy_matches <- dplyr::left_join(
  fuzzy_matches,
  hf_uid_new_map_unmatched,
  by = "hf_dhis2_raw"
)

# display summary of selected approach
cli::cli_h2("Selected Fuzzy Matching Results")
cli::cli_alert_success("Fuzzy matches found: {nrow(fuzzy_matches)}")

if (nrow(fuzzy_matches) > 0) {
  score_summary <- fuzzy_matches |>
    dplyr::summarise(
      avg_score = mean(score, na.rm = TRUE),
      min_score = min(score, na.rm = TRUE),
      max_score = max(score, na.rm = TRUE),
      .groups = "drop"
    )

  cli::cli_alert_info("Score range: {round(score_summary$min_score, 1)} - {round(score_summary$max_score, 1)}")
  cli::cli_alert_info("Average score: {round(score_summary$avg_score, 1)}")
}

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.3: Manual review of flagged matches --------------------------------

# prepare review dataset (using fallback results as example)
review_matches <- fallback_final |>
  dplyr::filter(match_flag == "review") |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::mutate(
    reviewer_decision = NA_character_, # accept/reject/uncertain
    reviewer_notes = NA_character_,
    review_date = NA_character_,
    reviewer_name = NA_character_
  ) |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    reviewer_decision,
    reviewer_notes,
    review_date,
    reviewer_name
  )

# export for review
readr::write_csv(
  review_matches,
  here::here("03_outputs", "tables", "facility_matches_for_review.csv")
)

# show summary
cli::cli_h2("Manual Review Required")
cli::cli_alert_info(
  "Exported {nrow(review_matches)} matches flagged for review"
)
cli::cli_alert_info(
  "File location: 03_outputs/tables/facility_matches_for_review.csv"
)

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.3: Manual review of flagged matches --------------------------------

# read back reviewed matches (replace the path with the actual reviewed
# file once manual review is complete)
reviewed_matches <- readr::read_csv(
  here::here("03_outputs", "tables", "facility_matches_reviewed.csv"),
  col_types = readr::cols(
    reviewer_decision = readr::col_character(),
    reviewer_notes = readr::col_character(),
    review_date = readr::col_character(),
    reviewer_name = readr::col_character()
  )
)

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.3: Manual review of flagged matches --------------------------------

final_incorp <- reviewed_matches |>
  dplyr::mutate(
    final_match_flag = dplyr::case_when(
      reviewer_decision == "accept" ~ "match",
      reviewer_decision == "reject" ~ "no_match",
      is.na(reviewer_decision) ~ "pending_review",
      TRUE ~ "no_match"
    )
  )

# get the final matched ones
final_matches <- final_incorp |>
  dplyr::filter(final_match_flag == "match") |>
  dplyr::select(-reviewer_decision)

# summary of all results (including rejected/pending)
all_results_summary <- final_incorp |>
  dplyr::count(final_match_flag, name = "n") |>
  dplyr::mutate(percentage = round(100 * n / sum(n), 1))

# summary of final accepted matches only
final_summary <- final_matches |>
  dplyr::count(final_match_flag, name = "n")

cli::cli_h2("Final Matching Results After Manual Review")
cli::cli_alert_success(
  "Accepted matches: {format(sum(final_summary$n), big.mark = ',')}"
)

# breakdown by decision type
rejected_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "no_match") |>
  dplyr::pull(n)
if (length(rejected_count) == 0) {
  rejected_count <- 0
}

pending_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "pending_review") |>
  dplyr::pull(n)
if (length(pending_count) == 0) {
  pending_count <- 0
}

if (rejected_count > 0) {
  cli::cli_alert_danger(
    "Rejected matches: {format(rejected_count, big.mark = ',')} (will have NA MFL values in final dataset)"
  )
}

if (pending_count > 0) {
  cli::cli_alert_warning(
    "Pending review: {format(pending_count, big.mark = ',')} (require additional review)"
  )
}

# show breakdown table
all_results_summary

### Step 9: Combine All Matched Results ----------------------------------------

#### Step 9.1: Combine results from all matching approaches --------------------

# create manual_matches from manual review results (both matched and
# unmatched)
manual_matches <- final_incorp |>
  dplyr::mutate(
    # for unmatched, set appropriate values
    hf_mfl = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl,
      NA_character_
    ),
    hf_mfl_raw = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl_raw,
      NA_character_
    ),
    final_method = dplyr::if_else(
      reviewer_decision == "accept",
      "Manual Review - Accepted",
      "Manual Review - Rejected"
    ),
    score = dplyr::if_else(
      reviewer_decision == "accept",
      100,
      0
    )
  ) |>
  # manual reviewed results already carry adm0-3 + hf_uid_new from
  # fallback_final → review export → reimport
  dplyr::select(
    adm0,
    adm1,
    adm2,
    adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method
  )

# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and
# rejected
final_facilities_all <- dplyr::bind_rows(
  matched_dhis2_prepgeoname, # Step 4: Stratified geographic matches (if performed)
  matched_dhis2, # Step 5: Exact matches after standardization
  fuzzy_matches, # Step 6-8: Complete matches from fuzzy matching
  manual_matches # Manual matching (if performed)
)

# collapse to one record per DHIS2 facility for downstream integration
final_facilities_one_per_hf <- final_facilities_all |>
  dplyr::group_by(hf_uid_new) |>
  dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(-dplyr::any_of(c("match_flag", "hf")))

cli::cli_alert_info(
  "All facilities processed (rows): {nrow(final_facilities_all)}"
)
cli::cli_alert_success(
  "All facilities processed (distinct DHIS2 facilities by hf_uid_new): {dplyr::n_distinct(final_facilities_all$hf_uid_new)}"
)

# generate summary by matching method
matching_summary <- final_facilities_one_per_hf |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    n_matched = dplyr::n(),
    avg_score = mean(score, na.rm = TRUE),
    min_score = min(score, na.rm = TRUE),
    max_score = max(score, na.rm = TRUE),
    .groups = "drop"
  )

# display summary
cli::cli_h2("Matching Results by Method")
matching_summary

### Step 9: Combine All Matched Results ----------------------------------------

#### Step 9.2: Create final integrated DHIS2-MFL dataset -----------------------

# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final <- dhis2_df |>
  dplyr::left_join(
    dhis2_map,
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  )

# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf <- final_facilities_one_per_hf |>
  dplyr::select(hf_uid_new, hf_mfl_raw)

# create the integrated dataset
final_dhis2_mfl_df <- dhis2_df_final |>
  dplyr::left_join(final_match_per_hf, by = "hf_uid_new") |>
  # join MFL attributes for matched facilities
  dplyr::left_join(
    dplyr::select(master_hf_df, -hf_mfl, -hf, -adm0, -adm1, -adm2, -adm3),
    by = "hf_mfl_raw"
  )

# counts on distinct facilities via hf_uid_new (robust to row-level
# duplication)
total <- dhis2_df_final |> dplyr::distinct(hf_uid_new) |> nrow()
with_mfl <- final_match_per_hf |>
  dplyr::filter(!is.na(hf_mfl_raw)) |>
  nrow()
without_mfl <- total - with_mfl

# validation summary
cli::cli_h2("DHIS2-MFL Integration Complete")
cli::cli_alert_success(
  "Total DHIS2 facilities preserved: {total}"
)
cli::cli_alert_info(
  "Facilities with MFL data: {with_mfl}"
)
cli::cli_alert_warning(
  "Facilities without MFL match: {without_mfl}"
)
cli::cli_alert_info("Total rows (facility-months): {nrow(final_dhis2_mfl_df)}")

### Step 10: Final Checks ------------------------------------------------------

#### Step 10.1: Detecting one-to-many matches for review -----------------------

# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many <- final_facilities_all |>
  dplyr::filter(!is.na(hf_mfl_raw)) |> # only matched facilities
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |> # MFL matched to >1 DHIS2
  dplyr::ungroup() |>
  dplyr::arrange(hf_mfl_raw, hf_dhis2_raw)

if (nrow(one_to_many) > 0) {
  cli::cli_alert_warning(
    paste0(
      "One-to-many matches detected: {nrow(one_to_many)} DHIS2 facilities ",
      "share {dplyr::n_distinct(one_to_many$hf_mfl_raw)} MFL facilities"
    )
  )

  # show sample
  knitr::kable(
    one_to_many |>
      dplyr::select(hf_dhis2_raw, hf_mfl_raw, score, final_method) |>
      dplyr::slice_head(n = 6),
    caption = "Sample one-to-many matches requiring review"
  )
} else {
  cli::cli_alert_success("No one-to-many matches detected")
}

### Step 10: Final Checks ------------------------------------------------------

#### Step 10.2: Unmatched facilities analysis ----------------------------------

# analyze unmatched facilities by reason
unmatched_analysis <- final_facilities_all |>
  dplyr::filter(is.na(hf_mfl_raw)) |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    count = dplyr::n(),
    .groups = "drop"
  ) |>
  dplyr::arrange(dplyr::desc(count))

# display unmatched analysis
cli::cli_h2("Unmatched Facilities Analysis")
cli::cli_alert_info(
  "Total unmatched facilities: {sum(unmatched_analysis$count)}"
)

# show breakdown by reason
purrr::pwalk(
  unmatched_analysis,
  \(final_method, count) {
    cli::cli_alert_warning("{final_method}: {count} facilities")
  }
)

# create detailed unmatched facilities export
unmatched_detailed <- final_facilities_one_per_hf |>
  # unmatched are those without an MFL linkage
  dplyr::filter(is.na(hf_mfl_raw)) |>
  # select required context and scoring/method fields
  dplyr::select(
    hf_dhis2_raw,
    adm1,
    adm2,
    adm3,
    final_method,
    score
  ) |>
  # derive a clear potential reason for review
  dplyr::mutate(
    potential_reason = dplyr::case_when(
      is.na(final_method) ~ "Never matched - possible new facility",
      final_method == "Manual Review - Rejected" ~
        "Rejected in review - possible different facility",
      score > 0 & score < 50 ~ "Low similarity - possible data quality issue",
      TRUE ~ "Unmatched - needs investigation"
    )
  ) |>
  dplyr::arrange(adm1, adm2, adm3, hf_dhis2_raw)

# export for SNT team review
readr::write_csv(
  unmatched_detailed,
  here::here("03_outputs", "unmatched_facilities_for_snt_review.csv")
)

cli::cli_alert_success(
  "Exported {nrow(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli::cli_text("File: 03_outputs/unmatched_facilities_for_snt_review.csv")

### Step 11: Save Final Datasets -----------------------------------------------

# save main integrated DHIS2-MFL dataset
rio::export(
  final_dhis2_mfl_df,
  here::here("03_outputs", "final_dhis2_mfl_integrated.xlsx")
)

# save matching results summary
rio::export(
  final_facilities_all,
  here::here("03_outputs", "facility_matching_results.xlsx")
)

# save unmatched facilities for follow-up
if (nrow(unmatched_detailed) > 0) {
  readr::write_csv(
    unmatched_detailed,
    here::here("03_outputs", "unmatched_dhis2_facilities.csv")
  )
}

if (nrow(one_to_many) > 0) {
  readr::write_csv(
    one_to_many,
    here::here("03_outputs", "one_to_many_matches_for_review.csv")
  )
}

# summary of saved files
cli::cli_h2("Final Datasets Saved")
cli::cli_alert_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli::cli_alert_success("Matching results: facility_matching_results.xlsx")
cli::cli_alert_info("Unmatched facilities: {nrow(unmatched_detailed)} exported")
cli::cli_alert_info("One-to-many matches: {nrow(one_to_many)} exported")
 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved