Fuzzy matching of names across datasets

Intermediate

Overview

On this page, we demonstrate how to apply fuzzy matching between DHIS2 and the master health facility list (MFL) in a structured, reviewable way. While the worked example uses DHIS2 and the MFL, we can apply the same fuzzy matching approach to any dataset with facility names, or any string field used as a join key across multiple datasets. For example, if the goal is to join a shapefile with tabular data and adm2 names are not joining properly, consider applying fuzzy name matching on the adm2 names across the two datasets.

The MFL is the national reference list of health facilities. Maintained by the Ministry of Health or national statistics office, it provides standardized names, unique facility IDs, coordinates, type, and administrative units, and can also provide other information such as operational status and availability of particular services. The MFL is often the only dataset of health facilities that includes their geolocation.

Joining other datasets, such as routine surveillance, with the MFL therefore makes possible many kinds of subnational analysis. If unique health facility IDs are not available across both the MFL and the dataset to be joined or if unique IDs are unreliable, then the joining will rely on matching health facility names. However, variations in the spelling, abbreviations, word order, or formatting of health facility names can make this process challenging.

One possible solution that is reasonably scalable and automatable is fuzzy name matching, a technique that uses string similarity algorithms to identify close matches between text fields even when they differ slightly. While not perfect, this approach can improve the matching rate between datasets where facility names differ due to inconsistent formatting, spelling, or abbreviations, leaving fewer unmatched facilities to resolve manually and through consultation with the SNT team. Fuzzy matches should still be reviewed, especially in important applications.

Objectives

Run diagnostics to check exact matches, admin alignment, and duplicates
Use stratified geographic matching with human validation for high-confidence matches
Standardize names and abbreviations to prepare unmatched data
Apply fuzzy matching on remaining facilities with candidate grids and similarity scores
Assess fuzzy match quality with diagnostics, visuals, and thresholds
Select a matching strategy (single, composite, weighted, or fallback) based on data needs
Merge stratified and fuzzy matches into one dataset
Save final matched output for analysis and integration

Understanding Fuzzy Matching

Fuzzy matching helps us reconcile text fields that don’t exactly align but are highly likely to be referring to the same entity, something that can happen when working with health facility names in real-world data. In this section, we explain what fuzzy matching is, why it’s needed, how it works, and what we need to consider when applying it in the context of SNT.

Why Exact Matching Isn’t Sufficient

In theory, if the same facility appears in two datasets, we should be able to join them directly by name. In practice, facility names may sometimes be spelled differently, use inconsistent abbreviations, or contain formatting differences that prevent a clean join.

For example, a facility might appear in DHIS2 as “Makeni Gov. Hosp” and in the MFL as “Makeni Government Hospital”. A direct match won’t work, even though the two clearly refer to the same place. Small differences in spelling, punctuation, or word order require a more flexible matching approach.

What Is Fuzzy String Matching?

Fuzzy string matching is a method for finding approximate matches between text strings (a text string is the data format that coding languages use for handling text). Instead of requiring an exact match, fuzzy string matching calculates how similar two strings are, usually with a score between 0 and 100. A higher score means the strings are more alike.

This is useful when working with real-world facility names. Fuzzy matching allows us to say: “these two names are close enough to be considered a likely match,” and lets us decide where to draw the line between automatic match, manual review, or no match.

In the SNT process, health facility data can come from multiple sources with varying naming conventions. For example, the MFL might serve as the authoritative source, while the DHIS2 could contain operational variations of the same facility name. Fuzzy matching helps reconcile these differences systematically.

Consult with SNT team

Before performing fuzzy matching, review data considerations with the SNT team. These considerations may include:

Facility name completeness (no missing values in key columns)
Local naming conventions and common abbreviations, so that fuzzy matching can be set up to account for them
Protocol for questionable matches

Below is an example output from fuzzy matching. The algorithm compares facility names from DHIS2 to entries in the MFL, assigning a similarity score to each match. Higher scores (closer to 100) indicate stronger similarity; lower scores may need manual review to determine acceptability or exclusion.

Facility Name (DHIS2)	Best Match in MFL	Score	Decision
Makeni Govt. Hospital	Makeni Government Hospital	96	High confidence
Makeni Goverment Hospital	Makeni Government Hospital	93	High confidence
Police CHC	Police Community Health Center	90	High confidence
Loreto Clinic	Clinic Loreto	88	High confidence
Centre Medical	Centre Médical	85	High confidence
An-Nour Hospital	An-Noor Hospital	82	Accept with caution
Rahma Clinic	Rahmah Clinic	87	High confidence
Bo MCHP	Bo Maternal Child Health Post	84	Accept with caution
Clinic A	Kenema Government Hospital	41	Needs manual review
ABC Health Post	Makeni Government Hospital	29	Needs manual review

The table above illustrates fuzzy matching scenarios that may arise. Below are some examples of the types of variations we are addressing:

Abbreviations: Bo MCHP → Bo Maternal Child Health Post
Spelling inconsistencies or typos: Makeni Goverment Hospital → Makeni Government Hospital
Formatting differences: Police CHC → Police Community Health Center
Word order variations: Loreto Clinic → Clinic Loreto
Inconsistent accents: Centre Medical → Centre Médical
Inconsistent transliterations: Rahma Clinic → Rahmah Clinic

Choosing a Matching Strategy

Fuzzy name matching involves defining what counts as “similar” and deciding which kinds of differences (like spelling, order, or abbreviations) are acceptable. Here we discuss how to assess the quality of matches and when to use different types of string similarity algorithms. Understanding these concepts helps ensure the methods we apply are appropriate for the kinds of mismatches we expect in the data.

How Do We Know if a Match Is “Good”?

Fuzzy matching algorithms return similarity scores that estimate how alike two names are. But a high similarity score does not always guarantee a correct match. For example, facilities with shared components like “community health post” may receive high scores, even if the match is incorrect. This highlights the value of looking beyond raw scores.

To improve confidence in matches, we can use additional diagnostics that assess match quality from different angles. These include:

Prefix Match: Names start with similar words or characters
Suffix Match: Names end in similar ways, such as shared final words
Token Difference: Similar number of words in each name
Character Difference: Overall lengths of the names are close to each other

These diagnostics are meant to complement similarity scores, not replace them. The diagnostics help identify cases where a high score may be misleading and guide how we filter, rank, or combine results across multiple algorithms.

Consider this example:

Let’s say Makeni Govt Hospital matches with Makeni Government Hospital. But how good is this match?

We can assess the quality using a few intuitive features:

The prefix matches: both start with Makeni. The suffix matches: both end with Hospital. The token difference is 0, both contain three words. The character difference is small, only 6 characters apart. The difference mainly comes from Govt vs Government.

These features suggest a strong match, even if the raw string similarity score is imperfect.

Choosing Among String Matching Algorithms

Different algorithms are better suited to different mismatch types, such as typos, abbreviations, or word order. Choosing the right algorithm depends on the specific inconsistencies in the data. We may want to try multiple algorithms to match across different types of inconsistencies. Below, we highlight the main algorithms, their strengths, and their limitations.

Levenshtein distance (edit distance)

Counts the number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another.

Best for: Typos and misspellings
Example: Makeni Goverment Hospital → Makeni Government Hospital
Limitations: Does not handle word swaps or phonetic differences

Jaro-Winkler similarity

Measures character overlap and transpositions, with extra weight for common prefixes.

Best for: Prefix alignment and small rearrangements
Examples: Loreto Clinic → Clinic Loreto, or Makeni Govt Hospital → Makeni Government Hospital
Limitations: Weak with missing words or abbreviations

Q-gram distance

Compares overlapping sequences (e.g., 2- or 3-character chunks) between strings.

Best for: Substring overlap and fuzzy substring detection
Example: Kenema Town MCHP → Kenema MCH Post
Limitations: Can over-penalize short names or spacing issues

Longest common subsequence (LCS)

Finds the longest sequence of characters that appear left-to-right in both strings (not necessarily contiguously).

Best for: Partial matches with insertions
Example: St Mary Hosp → Saint Mary Hospital
Limitations: Ignores spacing and order changes beyond sequence alignment

Soundex and other phonetic algorithms

Convert words into phonetic codes based on how they sound.

Best for: Names with different spellings but similar pronunciation
Examples: Rahma Clinic → Rahmah Clinic, or An-Nour Hospital → An-Noor Hospital
Limitations: Fails with structural or abbreviation-based mismatches

Best Practices for an Effective Fuzzy Matching Workflow

Fuzzy matching is only as effective as the preparation and decision rules that support it. This section outlines key steps and good practices that improve the accuracy, transparency, and efficiency of the matching workflow. Below are some tips to improve fuzzy matching:

Tip 1: Work from a cleaned copy of the original name column

Always preserve the original facility name column and perform all cleaning steps on a new copy. This ensures:

We can always refer back to the original, untouched names
All joins and fuzzy matching occur on the cleaned and standardized version
We retain the ability to inspect and manually resolve low-confidence or unmatched records later using the original, unformatted names

This separation is especially useful for manual matching workflows, where human judgment relies on seeing names as they originally appeared.

Tip 2: Preprocess text to reduce noise

Before running any matching algorithm, clean and normalize the text. The list below provides several suggestions; confirm whether each makes sense in the context of the project:

Remove extra spaces to prevent false mismatches
- Trim whitespace: Remove extra spaces at the beginning or end of text
- Collapse multiple spaces: Replace repeated spaces with a single space
Standardize text characters to ensure consistency
- Convert all text to lowercase: Ensure consistency regardless of original casing
- Normalize accented characters: Convert characters like “é” to “e”
Standardize non-text characters
- Standardize punctuation: Replace or remove punctuation to reduce irrelevant variation
- Convert Roman numerals to numbers: e.g., “Koinadugu II CHC” → “Koinadugu 2 CHC”
Standardize word use, if applicable
- Sort words alphabetically within each name: Handle word order differences by splitting names into words, sorting them, and rejoining. For example, both “Port Loko CHC” and “CHC Port Loko” become “CHC Loko Port”, making them directly comparable despite naming convention differences.

This step reduces irrelevant variation and reduces matching burden on the algorithms.

Tip 3: Handle abbreviations and domain-specific terms

Abbreviations and local terms in facility names can lead to mismatches if not standardized. Depending on the context, we can either expand abbreviations (e.g., CHC → Community Health Center) or contract long forms into standard short forms. Each approach may yield different matching outcomes. What matters is applying the transformation consistently to both datasets.

Examples of variations:

CHC vs Community Health Center
PHU vs Peripheral Health Unit
MCHP vs Maternal and Child Health Post

Recommended practices:

Build a simple abbreviation dictionary for the project context
Apply substitution rules (for example, replace all CHC with Community Health Center) before running matching algorithms
Use domain-specific expansions or contractions where appropriate

Tip: identifying likely abbreviations

We can scan facility name fields for potential abbreviations by looking for short, all-uppercase tokens that appear within longer strings. A common heuristic is:

Look for 2–5 letter sequences in all capital letters that appear standalone among lowercase or mixed-case words.

This often finds abbreviations like CHC, PHU, or MCHP embedded in names such as Bo MCHP or Kailahun PHU. This check can be done by eye, or with code.

Tip 4: Limit matching scope using geographic information

Fuzzy matching is more accurate and efficient when restricted to a plausible geographic scope. Instead of comparing every DHIS2 facility name to every MFL name in the entire country, limit comparisons to within the same administrative unit, such as adm1 (region) or adm3 (chiefdom), depending on the available data quality.

For example, only compare names where adm1 and adm3 are the same for both facilities
This speeds up processing and ensures that identically named facilities in different geographic areas are not mistakenly matched to each other

This approach assumes that:

The administrative fields (adm1, adm2, adm3) are already cleaned and harmonized across both the DHIS2 and MFL datasets
Facilities are correctly assigned to their administrative units, with no major misplacements or missing assignments
Remaining unmatched facilities are indeed name mismatches, not geographic mismatches needing spatial correction

The fuzzy matching approaches outlined on this page can be used to clean and match admin unit names.

By grounding fuzzy matching in geographic logic, this step improves both the relevance and reliability of the matching results.

Tip 5: Apply multiple similarity algorithms

Rather than relying on a single algorithm, select the best match based on the highest score across multiple matching algorithms, all while considering supporting quality checks. This match-first approach ensures flexibility and avoids locking decisions to one metric.

This is important because each fuzzy matching algorithm captures a different aspect of similarity. For example:

Levenshtein: Good for typos and deletions
Jaro-Winkler: Sensitive to character transpositions and prefix agreement
Qgram / LCS: Detect reordered or partial word overlaps
Rank-Ensemble: Prioritizes agreement across methods by comparing rank positions
Composite Scores: Integrates multiple metrics for better consensus

By comparing and combining outputs from these methods, we can better handle the diverse inconsistencies found in real-world datasets.

Tip 6: Review and validate with the SNT team

Automated matching can produce strong results, but human review is required. Always check:

High-confidence matches (85+): Very likely a true match and can be quickly verified by eye
Mid-confidence matches (70–84): May require correction
Low scores (below 70): Likely to be false matches
Similar-looking names in different districts: May indicate over-matching
Unmatched facilities: May need manual review or database updates

Validate with SNT team

Always conduct a structured review of final match results with the SNT team. Validation ensures the SNT team applies local knowledge and prevents misclassification, especially when matches inform planning, facility mapping, or coverage analysis.

Matching Workflow Overview

This overview explains the end-to-end reconciliation used in Sierra Leone to match DHIS2 facility names to the national MFL. The workflow is auditable, prioritizes high-confidence matches first, and produces a single review-ready table with the selected match, similarity diagnostics, decision flags (accept/review), and stable IDs for downstream use.

Two-Phase Approach

Phase 1: Stratified geographic matching (adm2/adm3): Anchor comparisons within administrative boundaries to capture most matches with high confidence and expose misassigned admins early
Phase 2: Fuzzy name matching on remainders: Standardize names, generate candidates without geographic constraints, score with multiple algorithms, and use diagnostics to select matches for review/acceptance

11-Step Process Summary

Follow these 11 steps; detailed, runnable code appears in the next section.

Install and load required libraries for data manipulation, fuzzy string matching, and file handling.
Load data from DHIS2 and Master Facility List sources with initial cleaning.
Conduct initial matching diagnostics to assess exact match rates, administrative alignment, and duplicate name patterns.
Apply interactive stratified geographic matching to resolve high-confidence matches using geographic constraints and human validation.
Process and prepare unmatched data by standardizing names and handling abbreviations to reduce variation.
Perform fuzzy matching on remaining unmatched by constructing candidate grids, computing similarity scores, creating composite scores, and extracting best matches.
Evaluate fuzzy match quality using diagnostics, visualizations, and weighted thresholding based on structural quality.
Finalize fuzzy match selection using one of four strategies: single method, composite score, weighted composite, or progressive fallback loop.
Combine all matched results from stratified matching and fuzzy matching approaches into a comprehensive dataset.
Perform final checks to validate the matching results and identify any remaining issues.
Save final datasets including matched results, unmatched facilities, and summary statistics for downstream analysis.

Together, the two phases minimize manual reconciliation while keeping ambiguous cases transparent for SNT team review.

Workflow Diagram

The diagram below summarizes the phased process from diagnostics to final output.

flowchart TD
    Start[DHIS2 Health Facilities] --> Step3[Step 3: Initial Matching<br/>Diagnostics]

    Step3 --> Step4[Step 4: Interactive Stratified<br/>Geographic Matching]
    Step3 --> Step5[Step 5: Process & Prepare<br/>Unmatched Data]:::altPath

    Step4 --> Decision1{Matched?}
    Decision1 -->|Yes | Matched1[High-Confidence<br/>Matched Records]
    Decision1 -->|No | Step5

    Step5 --> Step6[Step 6: Fuzzy Matching<br/>on Remaining Unmatched]
    Step6 --> Step7[Step 7: Evaluate Fuzzy<br/>Match Quality]
    Step7 --> Step8[Step 8: Finalize Fuzzy<br/>Match Selection]

    Step8 --> Decision2{Matched?}
    Decision2 -->|Yes| Matched2[Fuzzy Matched<br/>Records]
    Decision2 -->|No| Unmatched[Final Unmatched<br/>for Manual Review]

    Matched1 --> Step9[Step 9: Combine All<br/>Matched Results]
    Matched2 --> Step9
    Unmatched --> Step9

    Step9 --> Step10[Step 10: Final<br/>Checks]
    Step10 --> Step11[Step 11: Save<br/>Final Datasets]

    classDef altPath fill:#d1c4e9,stroke:#673ab7,stroke-width:2px;

    style Start fill:#e1f5fe
    style Step3 fill:#e8f5e9
    style Step4 fill:#fff3e0
    style Step5 fill:#f3e5f5
    style Step6 fill:#f3e5f5
    style Step7 fill:#fff9c4
    style Step8 fill:#f3e5f5
    style Step9 fill:#e8f5e9
    style Step10 fill:#fff9c4
    style Step11 fill:#c8e6c9
    style Matched1 fill:#a5d6a7
    style Matched2 fill:#ffcc80
    style Unmatched fill:#ffab91

When to follow each path

For direct fuzzy matching without geographic stratification, follow the purple path from Step 3 → Step 5 (bypassing Step 4).
For the complete matching workflow with geographic stratification, follow all steps sequentially from Step 3 through Step 11.

This provides flexibility: skip stratification if only name-based matching is needed, or include it when location context is important.

After stratified geographic matching, a smaller set of facilities remain unresolved. These are usually due to misassigned administrative units, newly opened or closed facilities, or name variations too large for geographic matching to resolve. These unmatched records then move into Step 5, where standardization and fuzzy matching provide another opportunity to recover valid matches.

Step-by-Step

Follow the 11-step workflow outlined in Matching Workflow Overview. The sections below provide runnable, Sierra Leone-specific code for each step and notes on how to adapt it to the project context.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Install and Load Required Libraries

Install and load the necessary packages for data manipulation, fuzzy string matching, and file handling.

# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load all required packages using pacman
pacman::p_load(
  readxl,     # for reading Excel files
  dplyr,      # for data manipulation
  stringdist, # for calculating string distances (fuzzy matching)
  tibble,     # for working with modern data frames
  knitr,      # for creating formatted tables
  openxlsx,   # for writing Excel files
  httr,       # for HTTP requests to download files (optional)
  here        # for cross-platform file paths
)

Terminal installation required

Install packages in your terminal, if not already installed. If you need help installing packages, please refer to the Getting Started page.

pip install pandas numpy matplotlib rapidfuzz pyreadr openpyxl pyprojroot

from pathlib import Path
import re
import unicodedata

import numpy as np
import pandas as pd
import pyreadr
import xxhash
import matplotlib.pyplot as plt
from pyprojroot import here
from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz


def read_rds(path):
    """Read a single-object RDS file as a pandas DataFrame."""
    result = pyreadr.read_r(str(path))
    return next(iter(result.values()))


def vdigest(x, algo="xxhash32"):
    """Vectorised xxhash32 digest. Mirrors sntutils::vdigest()."""
    return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())


def cli_header(message):
    print(f"\n{message}")


def cli_info(message):
    print(f"INFO: {message}")


def cli_success(message):
    print(f"SUCCESS: {message}")


def cli_warning(message):
    print(f"WARNING: {message}")


def cli_danger(message):
    print(f"ERROR: {message}")


def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )


def show_table(df, n=10, caption=None):
    """Render a compact scrollable HTML table with the .out-table style.
    Chunks calling this must set #| results: asis."""
    from IPython.display import display, HTML
    rows = df.head(n)
    cap_html = f"<caption>{caption}</caption>" if caption else ""
    table_html = rows.to_html(
        index=False,
        classes="out-table",
        border=0,
        na_rep="",
    )
    # inject caption before the table header
    if cap_html:
        table_html = table_html.replace(
            "<thead>", cap_html + "<thead>", 1
        )
    display(HTML(f'<div class="out-scroll">{table_html}</div>'))

To adapt the code:

Keep these imports and helpers at the top of the Python workflow. Later Python chunks use them for fuzzy matching, data loading, and diagnostic output.

Step 2: Load Data

This step imports the DHIS2 and MFL health facility datasets. It then creates dhis2_hf_df, which contains the unique administrative units and facility names for fuzzy matching. Finally, it outputs samples from each dataset to review the columns and preview the first rows, ensuring the outputs look as expected

Show the code

# set up path to dhis2 hf data
dhis2_path <- here::here(
  "01_data",
  "1.2_epidemiology",
  "1.2a_routine_surveillance",
  "processed"
)

hf_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed"
)

# read DHIS2 health facilities data
dhis2_df <- readRDS(
  here::here(dhis2_path, "sle_dhis2_with_clean_adm3.rds")
) |>
  # rename facility name columns for clarity and consistency
  dplyr::rename(hf_dhis2_raw = hf)

# get distinct admin and hf cols
dhis2_hf_df <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw)

# read MFL health facilities data
master_hf_df <- read.csv(
  here::here(hf_path, "hf_final_clean_data.csv")
) |>
   dplyr::distinct(
    adm0, adm1, adm2, adm3, hf, lat, long, .keep_all = TRUE
  ) |>
  # rename facility name columns for clarity and consistency
  dplyr::mutate(hf_mfl_raw = hf)

# attach a stable DHIS2 facility ID for consistent counting across
# steps
# distinct by geography + name to avoid collisions across admins
dhis2_map <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf) |>
  dplyr::mutate(
    hf_uid_new = sntutils::vdigest(
      paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf)),
        sep = "|"
      ),
      algo = "xxhash32"
    )
  )

# print head of data
cli::cli_h3("Sample of DHIS2 data:")
head(dhis2_hf_df)

cli::cli_h3("Sample of MFL data:")
head(master_hf_df)

Output

Sample of DHIS2 data:
adm0	adm1	adm2	adm3	hf	hf_dhis2_raw
SIERRA LEONE	SOUTHERN	BO	BO TOWN	Aethel CHP	Aethel CHP
SIERRA LEONE	SOUTHERN	BO	BO TOWN	Agape Way CHP	Agape Way CHP
SIERRA LEONE	SOUTHERN	BO	BO TOWN	Anglican Diocese Clinic	Anglican Diocese Clinic
SIERRA LEONE	SOUTHERN	BO	KAKUA	Bandajuma MCHP	Bandajuma MCHP
SIERRA LEONE	SOUTHERN	BO	BOAMA	Baoma Station CHP	Baoma Station CHP
SIERRA LEONE	SOUTHERN	BO	VALUNIA	Baomahun CHC	Baomahun CHC

Sample of MFL data:
adm0	adm1	adm2	adm3	community	hf	type	Ownership	functional	lat	long
SIERRA LEONE	NORTH EAST	BOMBALI	MARA	Kiampkakolo	Kiampkakolo MCHP	MCHP	Government	Functional	8.610318	-12.20295
SIERRA LEONE	NORTH EAST	BOMBALI	MAKENI CITY	Station Road/Vincent Kanu Road	Loreto Clinic	CLINIC	Faith Based	Functional	8.886622	-12.04387
SIERRA LEONE	NORTH EAST	BOMBALI	MAKENI CITY	Makama Road	Makeni Govt. Hospital	HOSPITAL	Government	Functional	8.871770	-12.05627
SIERRA LEONE	NORTH EAST	BOMBALI	MAKENI CITY	Arabic College Road	Tonko CHP	CHP	Government	Functional	8.888880	-12.03255
SIERRA LEONE	NORTH EAST	BOMBALI	MAKENI CITY	Fullah Town Community	Fullah Town CHP	CHP	Government	Functional	8.903324	-12.04545
SIERRA LEONE	NORTH EAST	BOMBALI	BOMBALI SEBORA	Makama Community	Makama CHP	CHP	Government	Functional	8.862456	-12.05584

To adapt the code:

Lines 2–7: Update dhis2 file paths to match the location of the data files.
Lines 9–14: Update MFL file paths to match the location of the data files.
Lines 18, 29: Change file names to match the specific data files.
Lines 21, 35: Update hf and the renamed columns (hf_mfl_raw, hf_dhis2_raw) to match the actual health facility name columns in the datasets.

Show the code

# set up path to dhis2 hf data
dhis2_path = Path(here("01_data/1.2_epidemiology/1.2a_routine_surveillance/processed"))
hf_path = Path(here("01_data/1.1_foundational/1.1c_health_facilities/processed"))

# read DHIS2 health facilities data
dhis2_df = read_rds(dhis2_path / "sle_dhis2_with_clean_adm3.rds")
dhis2_df = dhis2_df.assign(hf_dhis2_raw=dhis2_df["hf"])

# get distinct admin and hf cols
dhis2_hf_df = dhis2_df[
    ["adm0", "adm1", "adm2", "adm3", "hf", "hf_dhis2_raw"]
].drop_duplicates()

# read MFL health facilities data
master_hf_df = pd.read_csv(hf_path / "hf_final_clean_data.csv")
master_hf_df = (
    master_hf_df
    .drop_duplicates(subset=["adm0", "adm1", "adm2", "adm3", "hf", "lat", "long"])
    .assign(hf_mfl_raw=lambda d: d["hf"])
)

# attach a stable DHIS2 facility ID
dhis2_map = (
    dhis2_df[["adm0", "adm1", "adm2", "adm3", "hf"]]
    .drop_duplicates()
    .assign(
        hf_uid_new=lambda d: vdigest(
            d["adm0"].str.lower().str.strip() + "|" +
            d["adm1"].str.lower().str.strip() + "|" +
            d["adm2"].str.lower().str.strip() + "|" +
            d["adm3"].str.lower().str.strip() + "|" +
            d["hf"].str.lower().str.strip()
        )
    )
)

# print head of data
cli_header("Sample of DHIS2 data:")
dhis2_hf_df.head()

cli_header("Sample of MFL data:")
master_hf_df.head()

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Lines 2–3: Update dhis2 and MFL file paths to match the location of the data files.
Lines 6, 14: Change file names to match the specific data files.
Lines 8, 19: Update hf and the renamed columns (hf_mfl_raw, hf_dhis2_raw) to match the actual health facility name columns in the datasets.

Step 3: Initial Matching Diagnostics

Before attempting any matching, we assess the data systematically to understand the matching challenge. We break this into three focused checks: overall exact matches, administrative-level matching patterns, and duplicate name issues.

Step 3.1: Overall exact match check

Next, we check overall exact matches without administrative constraints and calculate the total matching potential.

Show the code

# check for exact matches on raw names (no admin constraint)
exact_matches_all <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw"),
    relationship = "many-to-many"
  )

# calculate matching potential
total_dhis2 <- nrow(dhis2_hf_df)
total_mfl <- nrow(master_hf_df)
unmatched_dhis2 <- total_dhis2 - nrow(exact_matches_all)

cli::cli_h3("Overall Matching Summary")
cli::cli_alert_info("Total DHIS2 facilities: {total_dhis2}")
cli::cli_alert_info("Total MFL facilities: {total_mfl}")
cli::cli_alert_success(
  paste0(
    "Exact matches found: {nrow(exact_matches_all)} (",
    "{round(nrow(exact_matches_all)/total_dhis2*100, 1)}%)
  )
)
cli::cli_alert_warning("Remaining to match: {unmatched_dhis2}")

Output

── Overall Matching Summary

ℹ Total DHIS2 facilities: 1771

ℹ Total MFL facilities: 1554

✔ Exact matches found: 1082 (61.1%

! Remaining to match: 689

To adapt the code:

Line 5: Update join columns to match the facility name fields.

Show the code

# check for exact matches on raw names (no admin constraint)
exact_matches_all = dhis2_hf_df.merge(
    master_hf_df[["hf_mfl_raw"]],
    left_on="hf_dhis2_raw",
    right_on="hf_mfl_raw",
    how="inner",
)

# calculate matching potential
total_dhis2 = len(dhis2_hf_df)
total_mfl = len(master_hf_df)
unmatched_dhis2 = total_dhis2 - len(exact_matches_all)

cli_header("Overall Matching Summary")
cli_info(f"Total DHIS2 facilities: {total_dhis2}")
cli_info(f"Total MFL facilities: {total_mfl}")
cli_success(
    f"Exact matches found: {len(exact_matches_all)} "
    f"({round(len(exact_matches_all) / total_dhis2 * 100, 1)}%)"
)
cli_warning(f"Remaining to match: {unmatched_dhis2}")

Output


Overall Matching Summary

INFO: Total DHIS2 facilities: 1771

INFO: Total MFL facilities: 1554

SUCCESS: Exact matches found: 1082 (61.1%)

WARNING: Remaining to match: 689

To adapt the code:

Line 5: Update join columns to match the facility name fields.

A majority of facilities match exactly by name as shown in the output above. This baseline highlights the scope before applying geographic or fuzzy matching approaches.

Step 3.2: Administrative level match check

First, we examine how well facilities match when we consider administrative boundaries, checking matches at both adm2 (district) and adm3 (chiefdom) levels.

Show the code

# check matches at adm2 level (district)
dhis2_by_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(total_dhis2 = dplyr::n())

matches_by_adm2 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2")
  ) |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(exact_matches = dplyr::n()) |>
  dplyr::left_join(dhis2_by_adm2, by = "adm2") |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::select(adm2, exact_matches, total_dhis2, match_rate) |>
  dplyr::arrange(dplyr::desc(match_rate))

cli::cli_h3("Exact matches by district (adm2))
matches_by_adm2

# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(total_dhis2 = dplyr::n(), .groups = "drop")

matches_by_adm3 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2", "adm3")
  ) |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(exact_matches = dplyr::n(), .groups = "drop") |>
  dplyr::left_join(dhis2_by_adm3, by = c("adm2", "adm3")) |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::filter(total_dhis2 >= 5) |> # only show areas with 5+ facilities
  dplyr::arrange(dplyr::desc(match_rate)) |>
  dplyr::slice_head(n = 10) # show top 10 performing adm3 areas

cli::cli_h3("Top exact matches by chiefdom (adm3))
matches_by_adm3

Output

Exact matches by district (adm2)
adm2	exact_matches	total_dhis2	match_rate
KARENE	53	62	85.5
KAMBIA	62	75	82.7
TONKOLILI	95	122	77.9
PORT LOKO	87	122	71.3
BOMBALI	74	106	69.8
PUJEHUN	68	104	65.4
KOINADUGU	45	69	65.2
MOYAMBA	73	115	63.5
KAILAHUN	57	96	59.4
BONTHE	54	95	56.8
BO	95	172	55.2
KENEMA	83	151	55.0
FALABA	28	51	54.9
KONO	72	137	52.6
WESTERN RURAL	40	88	45.5
WESTERN URBAN	81	206	39.3

Top exact matches by chiefdom (adm3)
adm2	adm3	exact_matches	total_dhis2	match_rate
BOMBALI	KAMARANKA	5	5	100
BOMBALI	PAKI MASABONG	7	7	100
BOMBALI	SAFROKO LIMBA	10	10	100
KAMBIA	GBINLE	5	5	100
KARENE	ROMENDE	5	5	100
KARENE	SANDA MAGBOLONTOR	5	5	100
MOYAMBA	TIMDALE	5	5	100
PORT LOKO	KAMASONDO	8	8	100
PUJEHUN	SOWA	5	5	100
TONKOLILI	KHOLIFA MAMUNTHA/MAYOSSO	6	6	100

To adapt the code:

Line 9: Update join columns to match the facility name fields.
Line 11: Modify the administrative unit column (adm2) to match the district level.
Lines 25, 31: Update adm2 and adm3 columns to match the sub-district level.

Show the code

# check matches at adm2 level (district)
dhis2_by_adm2 = (
    dhis2_hf_df.groupby("adm2", as_index=False)
    .agg(total_dhis2=("hf_dhis2_raw", "count"))
)

matches_by_adm2 = (
    dhis2_hf_df
    .merge(
        master_hf_df[["hf_mfl_raw", "adm2"]],
        left_on=["hf_dhis2_raw", "adm2"],
        right_on=["hf_mfl_raw", "adm2"],
        how="inner",
    )
    .groupby("adm2", as_index=False)
    .agg(exact_matches=("hf_dhis2_raw", "count"))
    .merge(dhis2_by_adm2, on="adm2", how="left")
    .assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
    [["adm2", "exact_matches", "total_dhis2", "match_rate"]]
    .sort_values("match_rate", ascending=False)
)

cli_header("Exact matches by district (adm2)")
matches_by_adm2

# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 = (
    dhis2_hf_df.groupby(["adm2", "adm3"], as_index=False)
    .agg(total_dhis2=("hf_dhis2_raw", "count"))
)

matches_by_adm3 = (
    dhis2_hf_df
    .merge(
        master_hf_df[["hf_mfl_raw", "adm2", "adm3"]],
        left_on=["hf_dhis2_raw", "adm2", "adm3"],
        right_on=["hf_mfl_raw", "adm2", "adm3"],
        how="inner",
    )
    .groupby(["adm2", "adm3"], as_index=False)
    .agg(exact_matches=("hf_dhis2_raw", "count"))
    .merge(dhis2_by_adm3, on=["adm2", "adm3"], how="left")
    .assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
    .loc[lambda d: d["total_dhis2"] >= 5]
    .sort_values("match_rate", ascending=False)
    .head(10)
)

cli_header("Top exact matches by chiefdom (adm3)")
matches_by_adm3

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Line 9: Update join columns to match the facility name fields.
Line 11: Modify the administrative unit column (adm2) to match the district level.
Lines 31, 39: Update adm2 and adm3 columns to match the sub-district level.

District match rates vary widely, from 86% (Karene) to 39% (Western Urban). Some chiefdoms achieve 100% matches while others struggle. This shows which areas have clean, standardized names versus those needing intensive matching work.

Step 3.3: Duplicate name check

Finally, we check duplicate facility names to understand potential matching complications.

Show the code

# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_dhis2_raw)

mfl_dups_adm2 <- master_hf_df |>
  dplyr::group_by(adm2, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_mfl_raw)

cli::cli_h3("Duplicates within same district (adm2)")
cli::cli_alert_warning(
  paste0(
    "DHIS2 duplicates within districts: ",
    "{length(unique(dhis2_dups_adm2$hf_dhis2_raw))}"
  )
)
cli::cli_alert_warning(
  paste0(
    "MFL duplicates within districts: ",
    "{length(unique(mfl_dups_adm2$hf_mfl_raw))}"
  )
)

# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_dhis2_raw)

mfl_dups_adm3 <- master_hf_df |>
  dplyr::group_by(adm2, adm3, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_mfl_raw)

cli::cli_h3("Duplicates within same chiefdom (adm3)")
cli::cli_alert_danger(
  paste0(
    "DHIS2 duplicates within chiefdoms: ",
    "{length(unique(dhis2_dups_adm3$hf_dhis2_raw))}"
  )
)
cli::cli_alert_danger(
  paste0(
    "MFL duplicates within chiefdoms: ",
    "{length(unique(mfl_dups_adm3$hf_mfl_raw))}"
  )
)

# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall <- dhis2_hf_df |>
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_dhis2_raw)

mfl_dups_overall <- master_hf_df |>
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_mfl_raw)

cli::cli_h3("Overall duplicate names (across all areas)")
cli::cli_alert_info(
  "DHIS2 overall duplicates: {length(
    unique(dhis2_dups_overall$hf_dhis2_raw)
  )}"
)
cli::cli_alert_info(
  "MFL overall duplicates: {length(
    unique(mfl_dups_overall$hf_mfl_raw)
  )}"
)

Output

── Duplicates within same district (adm2)

! DHIS2 duplicates within districts: 18

! MFL duplicates within districts: 7

── Duplicates within same chiefdom (adm3)

✖ DHIS2 duplicates within chiefdoms: 0

✖ MFL duplicates within chiefdoms: 1

── Overall duplicate names (across all areas)

ℹ DHIS2 overall duplicates: 19

ℹ MFL overall duplicates: 49

To adapt the code:

Lines 3–4, 8–9: Update adm2 and facility name columns.
Lines 28–29, 33–34: Update adm2, adm3 and facility name columns.
Lines 54, 56, 59, 61: Update facility name columns for overall duplicates.

Show the code

# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 = (
    dhis2_hf_df
    .groupby(["adm2", "hf_dhis2_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "hf_dhis2_raw"])
)

mfl_dups_adm2 = (
    master_hf_df
    .groupby(["adm2", "hf_mfl_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "hf_mfl_raw"])
)

cli_header("Duplicates within same district (adm2)")
cli_warning(
    f"DHIS2 duplicates within districts: {dhis2_dups_adm2['hf_dhis2_raw'].nunique()}"
)
cli_warning(
    f"MFL duplicates within districts: {mfl_dups_adm2['hf_mfl_raw'].nunique()}"
)

# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 = (
    dhis2_hf_df
    .groupby(["adm2", "adm3", "hf_dhis2_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "adm3", "hf_dhis2_raw"])
)

mfl_dups_adm3 = (
    master_hf_df
    .groupby(["adm2", "adm3", "hf_mfl_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "adm3", "hf_mfl_raw"])
)

cli_header("Duplicates within same chiefdom (adm3)")
cli_danger(
    f"DHIS2 duplicates within chiefdoms: {dhis2_dups_adm3['hf_dhis2_raw'].nunique()}"
)
cli_danger(
    f"MFL duplicates within chiefdoms: {mfl_dups_adm3['hf_mfl_raw'].nunique()}"
)

# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall = (
    dhis2_hf_df
    .groupby("hf_dhis2_raw")
    .filter(lambda x: len(x) > 1)
    .sort_values("hf_dhis2_raw")
)

mfl_dups_overall = (
    master_hf_df
    .groupby("hf_mfl_raw")
    .filter(lambda x: len(x) > 1)
    .sort_values("hf_mfl_raw")
)

cli_header("Overall duplicate names (across all areas)")
cli_info(f"DHIS2 overall duplicates: {dhis2_dups_overall['hf_dhis2_raw'].nunique()}")
cli_info(f"MFL overall duplicates: {mfl_dups_overall['hf_mfl_raw'].nunique()}")

Output


Duplicates within same district (adm2)

WARNING: DHIS2 duplicates within districts: 18

WARNING: MFL duplicates within districts: 7


Duplicates within same chiefdom (adm3)

ERROR: DHIS2 duplicates within chiefdoms: 0

ERROR: MFL duplicates within chiefdoms: 1


Overall duplicate names (across all areas)

INFO: DHIS2 overall duplicates: 19

INFO: MFL overall duplicates: 49

To adapt the code:

Lines 3–4, 8–9: Update adm2 and facility name columns.
Lines 26–27, 31–32: Update adm2, adm3 and facility name columns.
Lines 49, 54: Update facility name columns for overall duplicates.

Duplicates are few at chiefdom level but more noticeable within districts, where DHIS2 lists twice as many as the MFL. Nationally, the MFL has many more duplicate names (49 vs 19). These patterns highlight the need to use geography and unique IDs alongside names to avoid mismatches.

Step 4: Interactive Stratified Geographic Matching

Before standardizing names or performing fuzzy matching, geographic stratification with user judgment can be used to resolve most facility matches through interactive matching. This approach combines the precision of human review with the efficiency of automated suggestions, using the approach described in the Merging shapefiles with tabular data but applied here at the health facility level.

prep_geonames() (available as sntutils::prep_geonames() in R and sntutils.geo.prep_geonames() in Python) handles this entire process, including internal name standardization, geographic stratification, and interactive or automated matching based on user needs.

Show the code

# inner join (keep only matched polygons)
# set up location to save cache
cache_loc <- "01_data/1.1_foundational/1.1f_cache_files/processed"

# interactive stratified matching with automatic standardization
# this function handles name standardization internally
dhis2_df_cleaned <-
  sntutils::prep_geonames(
    target_df = dhis2_hf_df, # dataset to be cleaned
    lookup_df = master_hf_df, # reference dataset with correct admin
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    level4 = "hf",
    cache_path = here::here(cache_loc, "geoname_cache.rds"),
    unmatched_export_path = here::here(cache_loc, "dhis2_hf_unmatched.rds")
  )

# load the unmatched facilities for further processing (step 5–8 fuzzy
# matching)
dhis2_hf_to_process <- readRDS(
  here::here(cache_loc, "dhis2_hf_unmatched.rds")
) |>
  dplyr::select(adm0, adm1, adm2, adm3, hf_dhis2_raw = hf)

# summary statistics
n_original <- nrow(dhis2_hf_df)
n_matched <- n_original - nrow(dhis2_hf_to_process)
match_rate <- (n_matched / n_original) * 100

cli::cli_alert_success(
  paste0(
    "Stratified matching completed: ",
    "{format(n_matched, big.mark = ',')}/{format(n_original, big.mark = ',')}",
    " facilities matched ({round(match_rate, 1)}%)"
  )
)
cli::cli_alert_info(
  "Remaining unmatched: {nrow(dhis2_hf_to_process)} facilities"
)

Output

── ℹ Match Summary ─────────────────────────────────────────────────────────────

! Both sides have unmatched names; see per-level lines below.


Target data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 16 matched                                       
• adm3 (level3): 204 out of 208 matched                                     
• hf (level4): 1,164 out of 1,732 matched                                   
Lookup data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 18 matched                                       
• adm3 (level3): 204 out of 229 matched                                     
• hf (level4): 1,164 out of 1,552 matched

✔ In non-interactive mode. Exiting after matching with cache...

✔ Unmatched data exported to: '/Users/mohamedyusuf/ahadi-analytics/code/GitHub/snt-code-library/01_data/1.1_foundational/1.1f_cache_files/processed/dhis2_hf_unmatched.rds'

ℹ Exported 568 unmatched rows for column 'hf'

✔ Stratified matching completed: 1,203/1,771 facilities matched (67.9%)

ℹ Remaining unmatched: 568 facilities

To adapt the code:

Line 3: Update cache_loc to point to the folder for saving both the cache file and the unmatched facilities export. This should be a stable location that can be reused in later steps (e.g., "01_data/1.1_foundational/1.1f_cache_files/processed").
Lines 9–10: Replace dhis2_hf_df and master_hf_df with the dataset names.
Lines 11–15: Update column names to match the geographic hierarchy (adm0, adm1, adm2, adm3) and facility name column (hf).
Lines 16–17: Update file paths for both the cache (cache_path) and the unmatched facilities export (unmatched_export_path), which will be used as the input for the fuzzy matching steps (5–8).
Line 23: Use the same path as unmatched_export_path to load the unmatched facilities data for the subsequent cleaning and fuzzy matching steps.

Terminal installation required

If not already installed, install the Python version of sntutils in your terminal:

pip install git+https://github.com/ahadi-analytics/sntutils-py.git

Show the code

from sntutils.geo import prep_geonames

# set up location to save cache
cache_loc = "01_data/1.1_foundational/1.1f_cache_files/processed"

# interactive stratified matching with automatic standardization
dhis2_df_cleaned = prep_geonames(
    target_df=dhis2_hf_df,
    lookup_df=master_hf_df,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    level4="hf",
    cache_path=here(cache_loc, "geoname_cache.csv"),
    unmatched_export_path=here(cache_loc, "dhis2_hf_unmatched.csv"),
)

# load the unmatched facilities for further processing (steps 5–8)
dhis2_hf_to_process = pd.read_csv(
    here(cache_loc, "dhis2_hf_unmatched.csv")
).rename(columns={"hf": "hf_dhis2_raw"})

# summary statistics
n_original = len(dhis2_hf_df)
n_matched = n_original - len(dhis2_hf_to_process)
match_rate = n_matched / n_original * 100

cli_success(
    f"Stratified matching completed: {n_matched:,}/{n_original:,} "
    f"facilities matched ({round(match_rate, 1)}%)"
)
cli_info(f"Remaining unmatched: {len(dhis2_hf_to_process)} facilities")

Output

SUCCESS: Stratified matching completed: 1,203/1,771 facilities matched (67.9%)

INFO: Remaining unmatched: 568 facilities

To adapt the code:

Line 3: Update cache_loc to point to the folder for saving both the cache file and the unmatched facilities export.
Lines 8–9: Replace dhis2_hf_df and master_hf_df with the dataset names.
Lines 10–14: Update column names to match the geographic hierarchy (adm0, adm1, adm2, adm3) and facility name column (hf).
Lines 15–16: Update file paths for both the cache and the unmatched facilities export.
Line 21: Use the same path as unmatched_export_path to load unmatched facilities.

With our interactive stratified matching, we matched 1,203 of 1,771 facilities (67.9%). All administrative levels matched perfectly, but some facilities remain unmatched at the facility name level. The results suggest that combining geographic constraints with human validation is effective, leaving a smaller subset of facilities still to be addressed.

Validate with SNT team

The cached results should always be validated with the SNT team before integration. This review step helps to:

Confirm matches using local knowledge, especially for common or similar names
Resolve ambiguities where facilities are misassigned or have multiple candidates
Ensure consistency with national naming standards and classifications
Document decisions to support future harmonization and reduce repeated effort

The cache is both a validation record and a resource for improving data quality over time. The table below shows the contents of the saved cache from the stratified matching process (cache_path).

Cached geoname matches (click to expand)

Table: Cached geoname matches saved during stratified matching (showing first 20 rows)
level	name_to_match	replacement	level0_prepped	level1_prepped	level2_prepped	level3_prepped	level4_prepped
level4	SHUMAN (KROO BAY) HOSPITAL	DR. SHUMAN HOSPITAL	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	DR. SHUMAN HOSPITAL
level4	RED CROSS (PULTNEY STREET) CLINIC	RED CROSS CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	RED CROSS CLINIC
level4	DR VR WILLOUGHBY CLINIC	DR VICTOR WILLOUGHBY MEMORIAL HOSPITAL	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	DR VICTOR WILLOUGHBY MEMORIAL HOSPITAL
level4	DR SHUMAN MEDICAL CLINIC AND LABORATORY	DR. SHUMAN HOSPITAL	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	DR. SHUMAN HOSPITAL
level4	DR KELVIN NICOLLS CLINIC	DR. KELIVIN CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	DR. KELIVIN CLINIC
level4	DR J RUSSEL CLINIC	DR. J RUSSEL CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	DR. J RUSSEL CLINIC
level4	DR ADO WRIGHT CLINIC	DR. ADO WRIGHT CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	DR. ADO WRIGHT CLINIC
level4	CONNAUGHT CHEST CLINIC	CONNAUGHT HOSPITAL	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL II	CONNAUGHT HOSPITAL
level4	MARINA HOUSE BIRTH CENTRE CLINIC	MARINA HOUSE BIRTH CENTER CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	CENTRAL I	MARINA HOUSE BIRTH CENTER CLINIC
level4	DR SONGO WILLIAMS CLINIC	DR. SONGO WILLIAMS CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	EAST II	DR. SONGO WILLIAMS CLINIC
level4	PRINCIPAL MEDICAL OFFICE (CLINE TOWN) CHP	PRINCIPAL MEDICAL OFFICE CHP	SIERRA LEONE	WESTERN	WESTERN URBAN	EAST I	PRINCIPAL MEDICAL OFFICE CHP
level4	OLA DURING UNDER FIVES CHP	OLA DURING CHILDREN’S HOSPITAL	SIERRA LEONE	WESTERN	WESTERN URBAN	EAST I	OLA DURING CHILDREN’S HOSPITAL
level4	GUOJI (CLINE TOWN) CLINIC	GUOJI CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	EAST I	GUOJI CLINIC
level4	TREASURE HEALTH HOSPITAL	TRESURE HEALTH SPECIALIST HOSPITAL	SIERRA LEONE	WESTERN	WESTERN URBAN	WEST II	TRESURE HEALTH SPECIALIST HOSPITAL
level4	KINGHARMAN ROAD UNDER FIVES CHP	KINGHARMAN ROAD HOSPITAL	SIERRA LEONE	WESTERN	WESTERN URBAN	WEST II	KINGHARMAN ROAD HOSPITAL
level4	EPI HEADQUARTER (NEW ENGLAND) CHP	EPI HEADQUARTER CHP	SIERRA LEONE	WESTERN	WESTERN URBAN	WEST II	EPI HEADQUARTER CHP
level4	ECOMED MEDICAL CENTRE CLINIC	ECONMED MEDICAL CENTER CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	WEST II	ECONMED MEDICAL CENTER CLINIC
level4	ARAB (DWARZAK) CLINIC	ARAB DWARZAK CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	WEST II	ARAB DWARZAK CLINIC
level4	WELLNESS (WEST 3) CLINIC	ST. MARK EVANGELICAL LUTHERAN HEALTH CENTER CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	WEST III	ST. MARK EVANGELICAL LUTHERAN HEALTH CENTER CLINIC
level4	STELLA MARIS CLINIC	STELLA MARIES CLINIC	SIERRA LEONE	WESTERN	WESTERN URBAN	WEST III	STELLA MARIES CLINIC

The cache can also be reused in future runs to avoid repeating interactive matching. It preserves past decisions, ensures consistency, and provides an efficient starting point for updating facility names with the SNT team.

The next step is to run these unmatched facilities through the standardization and fuzzy matching pipeline to recover any remaining valid matches.

Step 5: Process and Prepare Unmatched Data for Fuzzy Matching

After stratified geographic matching, a smaller set of facilities remain unresolved. These are typically due to misassigned administrative units, newly opened or closed facilities, or name variations too large for geographic matching to resolve. To address these cases, Step 5 focuses on preparing the unmatched data by standardizing names and handling abbreviations to reduce variation before applying fuzzy matching.

Step 5.1: Standardize health facility names

Now we apply standardization to the unmatched facilities. This step helps resolve matches that failed due to formatting differences:

As outlined in the Best Practices for an Effective Fuzzy Matching Workflow section, we begin by standardizing the health facility name columns. To do this, we first apply common text cleaning operations such as converting to lowercase, trimming whitespace, removing punctuation, collapsing multiple spaces, normalizing accented characters and converting Roman numerals (e.g., II, III) into standard numeric form (e.g., 2, 3). Each name is then split into individual words, sorted alphabetically, and rejoined, ensuring that names with the same words in different orders are treated consistently. These preprocessing steps reduce irrelevant variation and improve the reliability of downstream fuzzy matching.

In this illustrative example, we apply every standardization option so that code is available for all. However, some of these steps may be unnecessary or lead to unanticipated mismatches for a given project. Review carefully which standardization options to include in the workflow. Creating multiple versions with more and fewer standardization steps can help inform the final decision.

Show the code

# create function to standardize hf names
standardize_names <- function(name_vec) {
  # validate input
  if (!rlang::is_atomic(name_vec)) {
    cli::cli_abort("`name_vec` must be an atomic vector.")
  }

  name_vec |>
    # ensure character type
    as.character() |>
    # convert to lowercase
    stringr::str_to_lower() |>
    # replace punctuation with space
    stringr::str_replace_all("[[:punct:]]", " ") |>
    # remove extra spaces and trim
    stringr::str_squish() |>
    # normalize accents
    stringi::stri_trans_general("Latin-ASCII") |>
    # normalize all space-like characters
    stringi::stri_replace_all_regex("\\p{Zs}+", " ") |>
    # convert roman numerals to arabic numerals
    stringr::str_replace_all(
      c(
        " ix\\b" = " 9",
        " viii\\b" = " 8",
        " vii\\b" = " 7",
        " vi\\b" = " 6",
        " v\\b" = " 5",
        " iv\\b" = " 4",
        " iii\\b" = " 3",
        " ii\\b" = " 2",
        " i\\b" = " 1"
      )
    ) |>
    # sort tokens with letters first, numbers last; within groups
    # alphabetical
    purrr::map_chr(\(.x) {
      # split on one-or-more spaces
      tokens <- strsplit(.x, " +")[[1]]

      # detect pure numeric tokens
      is_num <- stringr::str_detect(tokens, "^[0-9]+$")

      # order alphabetic first, then numeric; sort within each group
      ordered <- c(sort(tokens[!is_num]), sort(tokens[is_num]))

      # rejoin
      paste(ordered, collapse = " ")
    })
}

# set up example with messy formatting
example_word <- factor("Clínica! Rahmâ    IV (  New clinic) East")

# print original structure
cat("\nExample before standardization:\n")
str(example_word)

# apply standardization
example_word_st <- standardize_names(example_word)

# print cleaned example
cat("\nExample after standardization:\n")
str(example_word_st)

Output


Example before standardization:

 Factor w/ 1 level "Clínica! Rahmâ    IV (  New clinic) East": 1


Example after standardization:

 chr "clinic clinica east new rahma 4"

To adapt the code:

Lines 1–50: To keep the full suite of standardizations, leave the standardize_names function as-is. Otherwise, remove the standardizations that are not needed.

Show the code

import unicodedata


def standardize_names(series):
    """Standardize health facility name strings for fuzzy matching.

    Applies lowercase conversion, punctuation removal, accent normalization,
    Roman numeral conversion, and alphabetical token sorting.
    """
    roman_map = {
        r"\bix\b": "9", r"\bviii\b": "8", r"\bvii\b": "7",
        r"\bvi\b": "6", r"\bv\b": "5", r"\biv\b": "4",
        r"\biii\b": "3", r"\bii\b": "2", r"\bi\b": "1",
    }

    def _clean(text):
        if pd.isna(text):
            return text
        text = str(text).lower()
        # replace punctuation with space
        text = re.sub(r"[^\w\s]", " ", text)
        # collapse multiple spaces
        text = re.sub(r"\s+", " ", text).strip()
        # normalize accents
        text = unicodedata.normalize("NFD", text)
        text = "".join(c for c in text if unicodedata.category(c) != "Mn")
        # convert roman numerals
        for pattern, replacement in roman_map.items():
            text = re.sub(pattern, replacement, text)
        # sort tokens: letters first (alphabetical), numbers last
        tokens = text.split()
        alpha = sorted(t for t in tokens if not t.isdigit())
        numeric = sorted(t for t in tokens if t.isdigit())
        return " ".join(alpha + numeric)

    return series.apply(_clean)


# set up example with messy formatting
example_word = "Clínica! Rahmâ    IV (  New clinic) East"

# print original
print(f"\nExample before standardization:\n{example_word}")

# apply standardization
example_word_st = standardize_names(pd.Series([example_word]))[0]

# print cleaned example
print(f"\nExample after standardization:\n{example_word_st}")

Output


Example before standardization:
Clínica! Rahmâ    IV (  New clinic) East


Example after standardization:
clinic clinica east new rahma 4

To adapt the code:

Lines 1–42: To keep the full suite of standardizations, leave the standardize_names function as-is. Otherwise, remove the standardizations that are not needed.

The function worked as expected: it turned the original factor Clínica! Rahmâ IV ( New clinic) East into a clean character string clinic clinica east new rahma 4. This confirms that the standardization steps worked as intended.

The original name had extra spaces, accents, punctuation, and Roman numerals.
All text was converted to lowercase.
Punctuation was replaced with spaces, and multiple spaces were collapsed into one.
Accents were removed, and Roman numerals were converted to numbers.
Words were split, sorted alphabetically, and any numbers placed at the end.

We will now apply the standardization function to both the DHIS2 and MFL facility name columns to ensure consistent formatting. This helps ensure that string similarity scores reflect true differences, not artifacts of inconsistent text formatting.

Show the code

# remove duplicates from MFL and format hf col
master_hf_df <- master_hf_df |>
  dplyr::distinct(hf_mfl_raw, .keep_all = TRUE) |>
  dplyr::mutate(hf_mfl = standardize_names(hf_mfl_raw))

# important: preserve original full dataset before processing unmatched
# (this is our base N)
dhis2_hf_df_original <- dhis2_hf_df |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw))

# build geo-aware stable ID once and attach to originals and unmatched
hf_uid_new_map <- dhis2_hf_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::mutate(
    hf_uid_new = sntutils::vdigest(
      paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf_dhis2_raw)),
        sep = "|"
      ),
      algo = "xxhash32"
    )
  )

dhis2_hf_df_original <- dhis2_hf_df_original |>
  dplyr::left_join(
    hf_uid_new_map,
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched <- dhis2_hf_to_process |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw)) |>
  dplyr::left_join(
    hf_uid_new_map |>
      dplyr::mutate(hf_dhis2_raw = toupper(hf_dhis2_raw)),
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

knitr::kable(
  # check if it worked
  dhis2_hf_unmatched |>
    dplyr::distinct(hf_dhis2_raw, hf_dhis2) |>
    dplyr::slice_head(n = 10)
)

Output

hf_dhis2_raw	hf_dhis2
ANGLICAN DIOCESE CLINIC	anglican clinic diocese
BENDUMA BAGBWE MCHP	bagbwe benduma mchp
BONGOR BUMPE NGAO MCHP	bongor bumpe mchp ngao
BREAKTHROUGH MCHP	breakthrough mchp
EGYPTIAN BO CLINIC	bo clinic egyptian
FANIMA WONDE MCHP	fanima mchp wonde
FOINDU BAOMA MCHP	baoma foindu mchp
FOYA VALUNIA CHP	chp foya valunia
GBAHAMA BAOMA CHP	baoma chp gbahama
GBANJA TOWN MCHP	gbanja mchp town

To adapt the code:

Lines 2–4: Update column names to match the data (hf_mfl_raw, hf_dhis2_raw, hf_mfl, hf_dhis2).
Line 3: Modify the distinct() function to use different duplicate removal criteria if needed.
Lines 6–9: Create dhis2_hf_df_original to preserve all original facilities (this is our base N that will be used in Step 9.2).
Lines 37–45: Create dhis2_hf_unmatched for fuzzy matching steps - subsequent steps should use this instead of dhis2_hf_df.

Show the code

# remove duplicates from MFL and create standardized column
master_hf_df = (
    master_hf_df
    .drop_duplicates(subset="hf_mfl_raw")
    .assign(hf_mfl=lambda d: standardize_names(d["hf_mfl_raw"]))
)

# preserve original full dataset before processing unmatched
dhis2_hf_df_original = dhis2_hf_df.assign(
    hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"])
)

# build geo-aware stable ID map
hf_uid_new_map = (
    dhis2_hf_df[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"]]
    .drop_duplicates()
    .assign(
        hf_uid_new=lambda d: vdigest(
            d["adm0"].str.lower().str.strip() + "|" +
            d["adm1"].str.lower().str.strip() + "|" +
            d["adm2"].str.lower().str.strip() + "|" +
            d["adm3"].str.lower().str.strip() + "|" +
            d["hf_dhis2_raw"].str.lower().str.strip()
        )
    )
)

dhis2_hf_df_original = dhis2_hf_df_original.merge(
    hf_uid_new_map,
    on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
    how="left",
)

# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched = (
    dhis2_hf_to_process
    .assign(hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"]))
    .merge(
        hf_uid_new_map.assign(
            hf_dhis2_raw=lambda d: d["hf_dhis2_raw"].str.upper()
        ),
        on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
        how="left",
    )
)

# check if standardization worked
dhis2_hf_unmatched[["hf_dhis2_raw", "hf_dhis2"]].drop_duplicates().head(10)

Output

<IPython.core.display.HTML object>

To adapt the code:

Lines 2–4: Update column names to match the data (hf_mfl_raw, hf_dhis2_raw, hf_mfl, hf_dhis2).
Line 3: Modify drop_duplicates() to use different duplicate removal criteria if needed.
Lines 7–9: Create dhis2_hf_df_original to preserve all original facilities.
Lines 37–44: Create dhis2_hf_unmatched for fuzzy matching steps; subsequent steps should use this instead of dhis2_hf_df.

We keep the original columns (hf_mfl_raw, hf_dhis2_raw) unchanged and create standardized versions (hf_mfl, hf_dhis2) for matching. This preserves the raw names for later review or manual matching.

Step 5.2: Handle abbreviations in health facility names

Health facility names in both DHIS2 and the MFL may contain abbreviations such as CHC, PHU, or MCHP. If these abbreviations are used inconsistently across or within datasets, standardization can improve matching accuracy.

To improve consistency and match quality, we begin by identifying frequently used abbreviations across both datasets by analyzing word patterns. Rather than expanding these abbreviations into their full forms, which can introduce unnecessary verbosity and noise, we have chosen to apply a predefined dictionary to standardize both datasets to use abbreviations. This approach simplifies name structures, shortens string lengths, and reduces the influence of repetitive generic terms on similarity scores.

We count the number of times each abbreviation is appearing in the datasets (freq) and define a “real” abbreviation as one that appears more than 2 times.

Show the code

abbrev_dictionary <-
  dplyr::bind_rows(
    dplyr::select(dhis2_hf_unmatched, hf = hf_dhis2_raw),
    dplyr::select( master_hf_df, hf = hf_mfl_raw)) |>
  # split into words
  tidyr::separate_rows(hf, sep = " ") |>
  # remove blanks
  dplyr::filter(hf != "") |>
  # detect 2-4 uppercase letter patterns
  dplyr::filter(stringr::str_detect(hf, "^[A-Z]{2,4}$")) |>
  # count frequencies
  dplyr::count(hf, sort = TRUE) |>
  # rename for clarity
  dplyr::rename(word = hf, freq = n) |>
  dplyr::filter(freq > 2) |>
  as.data.frame()

# check output
abbrev_dictionary

Output

word	freq
MCHP	812
CHP	539
CHC	334
ARAB	16
DR	15
KOYA	13
TOWN	12
CITY	10
RED	9
NGAO	8

To adapt the code:

Line 10: Modify the pattern {2,4} to change the character length range for detected abbreviations (currently 2-4 characters).
Line 15: Change the frequency threshold (freq > 2) to suit the project context: lower it to capture more potential abbreviations or raise it to focus on the most common ones.
Handle all-caps facility names: If facility names are entirely in uppercase, this pattern detection may not work as expected. Consider pre-processing to convert names to title case or adjusting the detection pattern.

Show the code

# combine DHIS2 and MFL names for abbreviation detection
all_names = pd.concat([
    dhis2_hf_unmatched[["hf_dhis2_raw"]].rename(columns={"hf_dhis2_raw": "hf"}),
    master_hf_df[["hf_mfl_raw"]].rename(columns={"hf_mfl_raw": "hf"}),
])

# split into words and filter to 2-4 uppercase letter tokens
abbrev_dictionary = (
    all_names["hf"]
    .dropna()
    .str.split(expand=True)
    .stack()
    .reset_index(drop=True)
    .rename("word")
    .loc[lambda s: s.str.match(r"^[A-Z]{2,4}$")]
    .value_counts()
    .reset_index()
    .rename(columns={"count": "freq"})
    .loc[lambda d: d["freq"] > 2]
)

# check output
abbrev_dictionary

Output

<IPython.core.display.HTML object>

To adapt the code:

Line 15: Modify the pattern {2,4} to change the character length range for detected abbreviations.
Line 19: Change the frequency threshold (freq > 2) to suit the project context.

Now that we have identified common abbreviations (e.g., MCHP, CHP, CHC), we use a shared dictionary to standardize health facility names across both datasets. Rather than expanding these terms, we ensure they are used consistently as abbreviations in both DHIS2 and the MFL. Even within the MFL, some names may use abbreviations while others use full terms. Standardization helps align these differences and increases match accuracy.

Limitation: handling misspellings and variations

This abbreviation dictionary approach has limitations when dealing with misspelled or variant terms. For example:

If “Community Health Center” is misspelled as “Comunity Health Center”, the dictionary lookup will not match
Regional variations like “Health Centre” vs “Health Center” may not be captured
Typos in abbreviations themselves (e.g., CHC vs CHS) will not be standardized

Mitigation strategies:

Use fuzzy matching algorithms in Step 4 to handle remaining mismatches after abbreviation standardization
Consider adding common misspellings to the dictionary if they appear frequently
Implement additional text cleaning steps (spell checking, variant standardization) before applying the abbreviation dictionary

Show the code

# define abbreviation dictionary (all lowercase, because we use the
# standardize column)
abbrev_dict <- c(
  "maternal child health post" = "mchp",
  "community health post" = "chp",
  "community health center" = "chc",
  "urban maternal clinic" = "umi",
  "expanded programme on immunization" = "epi"
)

# apply replacements to MFL names
master_hf_df <- master_hf_df |>
  dplyr::mutate(
    hf_mfl = stringr::str_replace_all(hf_mfl, abbrev_dict)
  )

# apply replacements to DHIS2 names
dhis2_hf_unmatched <- dhis2_hf_unmatched |>
  dplyr::mutate(
    hf_dhis2 = stringr::str_replace_all(hf_dhis2, abbrev_dict)
  )

# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched |>
  dplyr::filter(
    stringr::str_detect(
      hf_dhis2_raw,
      paste0(
        "(?i)Community Health Center|Maternal Child Health Post|",
        "Community Health Post"
      )
    ) |
      stringr::str_detect(hf_dhis2_raw, "CHC|MCHP|CHP")
  ) |>
  dplyr::select(hf_dhis2_raw, hf_dhis2) |>
  head()

Output

hf_dhis2_raw	hf_dhis2
BENDUMA BAGBWE MCHP	bagbwe benduma mchp
BONGOR BUMPE NGAO MCHP	bongor bumpe mchp ngao
BREAKTHROUGH MATERNAL CHILD HEALTH POST	breakthrough mchp
FANIMA WONDE MCHP	fanima mchp wonde
FOINDU BAOMA MCHP	baoma foindu mchp
FOYA VALUNIA CHP	chp foya valunia
GBAHAMA BAOMA CHP	baoma chp gbahama
GBANJA TOWN MCHP	gbanja mchp town
GENDA MCHP	genda mchp
GONDAMA TIKONKO CHC	chc gondama tikonko

To adapt the code:

Lines 3–9: Modify the abbrev_dict if the data includes other frequent abbreviations.
Lines 14, 20: Use the standardized lowercase columns (hf_mfl and hf_dhis2) as created in Step 5.1. Only apply this step after confirming abbreviations through frequency checks (see Step 5.2).

Show the code

# define abbreviation dictionary (all lowercase, using standardized column)
abbrev_dict = {
    "maternal child health post": "mchp",
    "community health post": "chp",
    "community health center": "chc",
    "urban maternal clinic": "umi",
    "expanded programme on immunization": "epi",
}

# apply replacements to MFL names
for long_form, short_form in abbrev_dict.items():
    master_hf_df["hf_mfl"] = master_hf_df["hf_mfl"].str.replace(
        long_form, short_form, regex=False
    )

# apply replacements to DHIS2 names
for long_form, short_form in abbrev_dict.items():
    dhis2_hf_unmatched["hf_dhis2"] = dhis2_hf_unmatched["hf_dhis2"].str.replace(
        long_form, short_form, regex=False
    )

# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched.loc[
    dhis2_hf_unmatched["hf_dhis2_raw"].str.contains(
        r"(?i)Community Health Center|Maternal Child Health Post|Community Health Post|CHC|MCHP|CHP",
        regex=True,
        na=False,
    ),
    ["hf_dhis2_raw", "hf_dhis2"],
].drop_duplicates().head()

Output

<IPython.core.display.HTML object>

To adapt the code:

Lines 3–9: Modify abbrev_dict if the data includes other frequent abbreviations.
Lines 13, 20: Use the standardized lowercase columns (hf_mfl and hf_dhis2) as created in Step 5.1.

Step 6: Perform Fuzzy Matching on Remaining Unmatched

We now apply fuzzy matching algorithms to the facilities that remain unmatched after stratified geographic matching in Step 4. These represent the most challenging cases where geographic alignment or name similarity could not establish a clear match.

Processing remaining unmatched

This step focuses on facilities that could not be matched through: - Step 4: Interactive stratified geographic matching - Step 5: Name standardization and exact matching

These remaining facilities often have geographic misalignments or extreme name variations that require more flexible matching approaches.

In our example, we aim to find an MFL match for each remaining unmatched DHIS2 facility. To match in the other direction (identifying a DHIS2 match for each MFL facility), or in both directions, modify and/or repeat the code accordingly.

Step 6.1: Identify matched and unmatched facilities

We first identify exact matches between DHIS2 and MFL facility names, both before and after text standardization. These exact matches are set aside while we focus on unmatched facilities for fuzzy matching.

We use many-to-one matching, meaning multiple DHIS2 records may link to the same MFL facility if their names are sufficiently similar. This approach helps assess linkage reliability and flags potential naming inconsistencies. It is also possible to see many-to-many detections, for example when the same health facility appears in DHIS2 under different spellings or formats and matches against multiple candidate facilities in the MFL. These cases require closer review to confirm the correct linkage.

The output from this step is unmatched DHIS2 names that will be processed through fuzzy matching algorithms.

Show the code

# flag for one-to-one enforcement
enforce_one_to_one <- FALSE

# exact matches using raw names
matched_dhis2_raw <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw),
    by = c("hf_dhis2_raw" = "hf_mfl_raw")
  )

# exact matches using standardized names
matched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_dhis2, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw, hf_mfl),
    by = c("hf_dhis2" = "hf_mfl"),
    keep = TRUE
  ) |>
  # exclude facilities already matched via raw names
  dplyr::anti_join(matched_dhis2_raw, by = c("hf_dhis2_raw")) |>
  # label exact matches
  dplyr::mutate(
    final_method = paste0(
      "Matched Without Fuzzy Matching (standardization)"
    ),
    score = 100
  )

# only include facilities that were actually matched via geographic
# standardization
# these should be facilities that had geographic corrections applied
matched_dhis2_prepgeoname <-
  dhis2_df_cleaned |>
  dplyr::anti_join(
    dhis2_hf_unmatched,
    by = c("adm0", "adm1", "adm2", "adm3", "hf" = "hf_dhis2_raw")
  ) |>
  dplyr::left_join(
    master_hf_df |>
      dplyr::mutate(hf = toupper(hf)) |>
      dplyr::select(adm0, adm1, adm2, adm3, hf, hf_mfl_raw, hf_mfl),
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  ) |>
  # attach hf_uid_new from dhis2_map (by admin + hf)
  dplyr::left_join(
    dhis2_map |> dplyr::select(adm0, adm1, adm2, adm3, hf, hf_uid_new),
    by = c("adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw" = "hf")
  ) |>
  # label matches
  dplyr::mutate(
    final_method = "Interactive Stratified Geographic Matching",
    score = 100
  )

# compute unmatched after standardization
unmatched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(hf_dhis2) |>
  dplyr::anti_join(
    master_hf_df |>
      dplyr::select(hf_mfl),
    by = c("hf_dhis2" = "hf_mfl")
  ) |>
  dplyr::distinct(hf_dhis2)

# collect already matched mfl to enforce one-to-one in candidate pool
used_mfl_stand <- matched_dhis2 |>
  dplyr::pull(hf_mfl) |>
  unique()

use_mfl_prepgeoname <- matched_dhis2_prepgeoname |>
  dplyr::pull(hf_mfl) |>
  unique()

used_mfl <- c(used_mfl_stand, use_mfl_prepgeoname)

# build candidate MFL pool for fuzzy step
candidate_mfl_df <- master_hf_df |>
  dplyr::select(hf_mfl)

# if enforcing one-to-one, drop MFLs already used by exact matches
if (enforce_one_to_one) {
  candidate_mfl_df <- candidate_mfl_df |>
    dplyr::filter(!hf_mfl %in% used_mfl)
}

# summary counts
total_dhis2_hf <- dplyr::n_distinct(dhis2_hf_unmatched$hf_dhis2_raw)
raw_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2_raw$hf_dhis2_raw)
raw_unmatch_dhis2_hf <- total_dhis2_hf - raw_match_dhis2_hf

standardized_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2$hf_dhis2)
standardized_unmatch_dhis2_hf <- total_dhis2_hf - standardized_match_dhis2_hf

total_mfl_hf <- dplyr::n_distinct(master_hf_df$hf_mfl)
candidate_mfl_count <- dplyr::n_distinct(candidate_mfl_df$hf_mfl)

# display summary
cli::cli_h2("Summary of Matching Status")

cli::cli_alert_info(
  "Total DHIS2 facilities: {format(total_dhis2_hf, big.mark = ',')}"
)

cli::cli_alert_success(
  paste0(
    "Matched after standardization: ",
    "{format(standardized_match_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched using raw names: ",
    "{format(raw_unmatch_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched after standardization: ",
    "{format(standardized_unmatch_dhis2_hf, big.mark = ',')}"
  )
)

cli::cli_alert_info(
  paste0(
    "MFL facilities in candidate pool for matching: ",
    "{format(candidate_mfl_count, big.mark = ',')} of ",
    "{format(total_mfl_hf, big.mark = ',')}"
  )
)

Output

── Summary of Matching Status ──

ℹ Total DHIS2 facilities: 565

✔ Matched after standardization: 98

✖ Unmatched using raw names: 565

✖ Unmatched after standardization: 467

ℹ MFL facilities in candidate pool for matching: 1,492 of 1,492

To adapt the code:

Line 2: Set enforce_one_to_one to TRUE to prevent reusing an hf_mfl already matched exactly. Set to FALSE to allow one-to-many.
Lines 15, 18: Replace column names: hf_dhis2 and hf_mfl → standardized name fields; hf_dhis2_raw and hf_mfl_raw → raw name fields.
Lines 14, 17: Replace dataset names: dhis2_hf_unmatched → the unmatched DHIS2 dataset; master_hf_df → the MFL dataset.
Lines 61–68: unmatched_dhis2 contains DHIS2 facilities not matched to the MFL and will be passed to the fuzzy matching step.
Lines 88–91: One-to-many filter. Keep as-is to drop already matched hf_mfl when enforce_one_to_one == FALSE. Remove these lines to always allow reuse.

Show the code

# flag for one-to-one enforcement
enforce_one_to_one = False

# exact matches using raw names
matched_dhis2_raw = (
    dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
    .merge(
        master_hf_df[["hf_mfl_raw"]],
        left_on="hf_dhis2_raw",
        right_on="hf_mfl_raw",
        how="inner",
    )
)

# exact matches using standardized names
matched_dhis2 = (
    dhis2_hf_unmatched[
        ["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_dhis2", "hf_uid_new"]
    ]
    .merge(
        master_hf_df[["hf_mfl_raw", "hf_mfl"]],
        left_on="hf_dhis2",
        right_on="hf_mfl",
        how="inner",
    )
)
# exclude facilities already matched via raw names
matched_dhis2 = anti_join(matched_dhis2, matched_dhis2_raw, on=["hf_dhis2_raw"])
matched_dhis2 = matched_dhis2.assign(
    final_method="Matched Without Fuzzy Matching (standardization)",
    score=100,
)

# compute unmatched after standardization
# cross-named anti-join: dhis2_hf_unmatched.hf_dhis2 vs master_hf_df.hf_mfl
unmatched_dhis2 = (
    dhis2_hf_unmatched.loc[
        ~dhis2_hf_unmatched["hf_dhis2"].isin(master_hf_df["hf_mfl"]),
        ["hf_dhis2"],
    ]
    .drop_duplicates(subset="hf_dhis2")
)

used_mfl = list(matched_dhis2["hf_mfl"].unique())

# build candidate MFL pool for fuzzy step
candidate_mfl_df = master_hf_df[["hf_mfl"]].copy()
if enforce_one_to_one:
    candidate_mfl_df = candidate_mfl_df.loc[~candidate_mfl_df["hf_mfl"].isin(used_mfl)]

# summary counts
total_dhis2_hf = dhis2_hf_unmatched["hf_dhis2_raw"].nunique()
raw_match_dhis2_hf = matched_dhis2_raw["hf_dhis2_raw"].nunique()
raw_unmatch_dhis2_hf = total_dhis2_hf - raw_match_dhis2_hf
standardized_match_dhis2_hf = matched_dhis2["hf_dhis2"].nunique()
standardized_unmatch_dhis2_hf = total_dhis2_hf - standardized_match_dhis2_hf
total_mfl_hf = master_hf_df["hf_mfl"].nunique()
candidate_mfl_count = candidate_mfl_df["hf_mfl"].nunique()

cli_header("Summary of Matching Status")
cli_info(f"Total DHIS2 facilities: {total_dhis2_hf:,}")
cli_success(f"Matched after standardization: {standardized_match_dhis2_hf:,}")
cli_danger(f"Unmatched using raw names: {raw_unmatch_dhis2_hf:,}")
cli_danger(f"Unmatched after standardization: {standardized_unmatch_dhis2_hf:,}")
cli_info(
    f"MFL facilities in candidate pool for matching: "
    f"{candidate_mfl_count:,} of {total_mfl_hf:,}"
)

Output


Summary of Matching Status

INFO: Total DHIS2 facilities: 565

SUCCESS: Matched after standardization: 98

ERROR: Unmatched using raw names: 565

ERROR: Unmatched after standardization: 467

INFO: MFL facilities in candidate pool for matching: 1,492 of 1,492

To adapt the code:

Line 2: Set enforce_one_to_one to True to prevent reusing an hf_mfl already matched exactly. Set to False to allow one-to-many.
Lines 15, 18: Replace column names: hf_dhis2 and hf_mfl → standardized name fields; hf_dhis2_raw and hf_mfl_raw → raw name fields.
Lines 14, 17: Replace dataset names: dhis2_hf_unmatched → the unmatched DHIS2 dataset; master_hf_df → the MFL dataset.
Lines 37–39: unmatched_dhis2 contains DHIS2 facilities not matched to the MFL and will be passed to the fuzzy matching step.

Our standardization process resolved some of the initially unmatched DHIS2 facilities as shown in the output above. This shows the value of text cleaning before applying fuzzy matching. The remaining unmatched DHIS2 facilities are the focus of our fuzzy matching efforts in the next steps.

The output above shows the total DHIS2 facilities that were unmatched at the raw-name stage, and the MFL facilities that remain in the candidate pool for fuzzy matching after enforcing one-to-many alignment with DHIS2 matches.

We will use unmatched_dhis2 as the input for fuzzy matching. These are the standardized DHIS2 facility names that could not be matched through exact string comparison. The matched_dhis2 dataset contains facility names that matched exactly after standardization. Once fuzzy matching is complete, we will append these exact matches to the final results to produce a complete, joined dataset.

Step 6.1: Create the matching grid

Once we identify the unmatched_dhis2 names, the next step is to create a matching grid. This step sets up the full set of possible name pairings that we will later evaluate using fuzzy string comparison. The purpose of this grid is to define the set of candidate pairs that we want to compare in the next step, rather than checking every name against every other name in the full dataset, which would be computationally expensive and potentially introduce false matches.

Show the code

# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities.

# create simple candidate match pool
candidate_match_pool <- tidyr::crossing(
  unmatched_dhis2,
  candidate_mfl_df
)

# create match grid with dhis2 and mfl names
match_grid <- candidate_match_pool

# preview sample - top 10 matches
match_grid |>
  dplyr::slice_head(n = 10)

Output

hf_dhis2	hf_mfl
a clinic dr edwin	aapdep clinic wind
a clinic dr edwin	aberdeen centre hospital women
a clinic dr edwin	aberdeen chc coach sea
a clinic dr edwin	abernita hospital
a clinic dr edwin	acre chc mange
a clinic dr edwin	ad bangs chp quarry
a clinic dr edwin	adama chc marth memorial
a clinic dr edwin	adama mabaima mchp
a clinic dr edwin	ado clinic dr wright
a clinic dr edwin	adolescent and clinic happy kid

To adapt the code:

Lines 7–8: Replace unmatched_dhis2 and candidate_mfl_df with the dataset names if different.

Show the code

# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities
candidate_match_pool = unmatched_dhis2.merge(candidate_mfl_df, how="cross")

# create match grid with dhis2 and mfl names
match_grid = candidate_match_pool.copy()

# preview sample - top 10 matches
match_grid.head(10)

Output

<IPython.core.display.HTML object>

To adapt the code:

Lines 2–3: Replace unmatched_dhis2 and candidate_mfl_df with the dataset names if different.

The match_grid is a flat list of candidate name pairs. Each row links one unmatched DHIS2 name to a possible MFL match. This is not a fuzzy match result, but the full set of pairings that will be evaluated in Step 6.3 using the string distance algorithms.

At this stage, the grid includes only facility names (hf_dhis2 and hf_mfl) to keep it lightweight. IDs, coordinates, and other metadata are excluded for now and will be reattached after scoring. If identical names appear across locations, the grid can be extended with extra fields for disambiguation. The preview shows a random sample of candidate pairs. It may not include obvious matches, but helps confirm that the pairing logic and geographic filters are working. This grid is the input for fuzzy matching, where each row will be scored and ranked to identify likely matches.

Step 6.2: Compute similarity scores

We compute name similarity between each unmatched DHIS2 facility and all candidates in the master list using multiple algorithms: Jaro-Winkler, Levenshtein, Qgram, and LCS. Each algorithm captures different types of variation, such as typos, truncation, or word reordering, that may be found in real-world datasets.

This step outputs individual similarity scores from each method, allowing us to explore, compare, or prioritize specific algorithms depending on the matching goals.

Show the code

# calculate fuzzy scores
match_grid <- match_grid |>
  dplyr::mutate(
    len_max = pmax(nchar(hf_dhis2), nchar(hf_mfl)),
    score_jw = 1 - stringdist::stringdist(hf_dhis2, hf_mfl, method = "jw"),
    score_lv = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lv") / len_max,
    score_qg = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "qgram") / len_max,
    score_lcs = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lcs") / len_max
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("score_"),
      .fns = ~ ifelse(is.nan(.x) | .x < 0, 0, .x)
    ),
    dplyr::across(
      .cols = dplyr::contains("score_"),
      .fns = ~ .x * 100
    )
  )

Output

hf_dhis2	hf_mfl	score_jw	score_lv	score_qg	score_lcs
branda centre hospital medical	branda center hospital medical	98.9	93.3	100.0	93.3
clinic health redeemer	clinic health redeemers	98.6	95.7	95.7	95.7
kayimba mchp vawahun	kayimba mchp vaawahun	96.7	95.2	95.2	95.2
chc farm lakkaogoo	chc farm lakka ogoo	98.2	94.7	94.7	94.7
mchp momorimaria	mchp momorimaia	97.9	93.8	93.8	93.8
chp haikal	gbongongor mchp	0.0	0.0	0.0	0.0
chp kulia	gbongongor mchp	0.0	0.0	0.0	0.0
keindeyella mchp	chc mottuo	0.0	0.0	0.0	0.0
keindeyella mchp	chc potoru	0.0	0.0	0.0	0.0
keindeyella mchp	chp fogbo	0.0	0.0	0.0	0.0

To adapt the code:

Lines 4–10: Swap in or out any similarity algorithm based on the matching needs. See stringdist manual, p.23 for available methods.

Show the code

from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz


def _score_lv(a, b):
    """Normalized Levenshtein similarity (0–100)."""
    len_max = max(len(a), len(b))
    if len_max == 0:
        return 100.0
    return max(0.0, (1 - Levenshtein.distance(a, b) / len_max) * 100)


def _score_jw(a, b):
    """Jaro-Winkler similarity (0–100)."""
    return JaroWinkler.similarity(a, b) * 100


def _score_qg(a, b):
    """Q-gram similarity (0–100) using token sort ratio."""
    return float(fuzz.token_sort_ratio(a, b))


def _score_lcs(a, b):
    """LCS-based similarity (0–100) using partial ratio."""
    return float(fuzz.partial_ratio(a, b))


# calculate fuzzy scores
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]

match_grid = match_grid.assign(
    score_jw=lambda d: d.apply(
        lambda r: _score_jw(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
    score_lv=lambda d: d.apply(
        lambda r: _score_lv(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
    score_qg=lambda d: d.apply(
        lambda r: _score_qg(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
    score_lcs=lambda d: d.apply(
        lambda r: _score_lcs(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
)

# clip negatives and NaN to zero
for col in score_cols:
    match_grid[col] = match_grid[col].clip(lower=0).fillna(0)

Output

<IPython.core.display.HTML object>

To adapt the code:

Lines 14–32: Swap in or out any similarity function based on the matching needs. rapidfuzz provides Levenshtein, Jaro-Winkler, token sort ratio, and partial ratio out of the box.

Above is a sample of ten DHIS2–MFL name pairs from the match grid. The first five are strong matches with near-perfect scores across all algorithms, reflecting nearly identical string matches. The last five are weak matches with low or zero similarity, often due to unrelated names or different facility types. Mid-range cases like kpetema mchp vs clinic gbonkorlenken show partial overlap but remain well below matching thresholds. These patterns confirm that the scoring works as intended and set the stage for building composite similarity scores in the next step, which help reduce dependence on any single metric.

Step 6.3: Create composite scores

To improve fuzzy match reliability, we generate composite similarity metrics that combine multiple string distance algorithms. This helps minimize over-reliance on any single metric and supports a more reliable match decision. However, if we have a preferred single metric from the previous step, we may skip generating a composite score and proceed directly to Step 4.5.

We generate composite metrics in two ways:

Composite Similarity Score (composite_score): We calculate a simple average across the six similarity algorithms from the previous step (Jaro-Winkler, Levenshtein, Qgram and LCS) to create a composite_score. This value reflects the overall string similarity between DHIS2 and MFL names.
Composite Rank Score (rank_avg): For each DHIS2 facility, we rank the match quality of each candidate MFL name, with a ranking done for each of the six similarity algorithms. Then we average these ranks into a rank_avg. A low rank_avg (e.g., 1) indicates the same MFL name consistently scores as an excellent match across methods, signaling strong consensus and high match confidence. This ensemble ranking helps balance the strengths of different algorithms.

Show the code

# define similarity score columns
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

# compute composite_score dynamically using score_cols
match_grid <- match_grid |>
  dplyr::mutate(
    composite_score = rowMeans(
      dplyr::across(dplyr::all_of(score_cols))
    )
  )

# compute average rank across all similarity methods
ranked_grid <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::mutate(
    dplyr::across(
      dplyr::all_of(score_cols),
      ~ dplyr::min_rank(dplyr::desc(.)),
      .names = "rank_{.col}"
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    rank_avg = rowSums(dplyr::across(
      dplyr::all_of(paste0("rank_", score_cols))
    )),
    rank_avg = round(rank_avg / length(score_cols))
  )

# preview ranked results to show multiple candidates per DHIS2 facility
# select a few facilities and show their top 3 candidates each
sample_facilities <- c(
  "charity clinic kamba of",
  "arab clinic shad",
  "al arab clinic sheefa"
)

ranked_grid |>
  dplyr::filter(hf_dhis2 %in% sample_facilities) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, n = 4, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    dplyr::all_of(score_cols),
    composite_score,
    rank_avg
  ) |>
  dplyr::arrange(hf_dhis2, rank_avg)

Output

hf_dhis2	hf_mfl	score_jw	score_lv	score_qg	score_lcs	composite_score	rank_avg
al arab clinic sheefa	arab clinic shell	75.5	71.4	71.4	61.9	70.1	1
al arab clinic sheefa	affordable clinic health	77.1	54.2	62.5	37.5	57.8	3
al arab clinic sheefa	alami arab clinic	71.1	52.4	61.9	52.4	59.4	4
al arab clinic sheefa	arab clinic lunsar	72.7	57.1	57.1	47.6	58.7	4
arab clinic shad	arab clinic shell	90.0	82.4	70.6	70.6	78.4	1
arab clinic shad	arab clinic	89.6	68.8	68.8	68.8	74.0	2
arab clinic shad	arab clinic lunsar	88.4	72.2	66.7	66.7	73.5	3
arab clinic shad	arab clinic dwarzak	84.7	68.4	63.2	52.6	67.2	4
charity clinic kamba of	city clinic garden	76.2	56.5	43.5	34.8	52.8	10
charity clinic kamba of	care clinic life	67.7	47.8	43.5	34.8	48.5	15
charity clinic kamba of	clinic hamanda	66.8	43.5	43.5	26.1	45.0	21
charity clinic kamba of	caring clinic hands	68.6	56.5	39.1	30.4	48.7	25

To adapt the code:

Line 2: If the similarity algorithms change (e.g., add or remove metrics), update the score_cols vector to reflect the new set.

Show the code

# define similarity score columns
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]

# compute composite_score (simple average)
match_grid["composite_score"] = match_grid[score_cols].mean(axis=1)

# compute average rank across all similarity methods
for col in score_cols:
    match_grid[f"rank_{col}"] = (
        match_grid.groupby("hf_dhis2")[col]
        .rank(method="min", ascending=False)
    )

rank_cols = [f"rank_{c}" for c in score_cols]
match_grid["rank_avg"] = (
    match_grid[rank_cols].sum(axis=1) / len(score_cols)
).round()
ranked_grid = match_grid.copy()

# preview ranked results for a few sample facilities
sample_facilities = [
    "charity clinic kamba of",
    "arab clinic shad",
    "al arab clinic sheefa",
]

ranked_grid.loc[ranked_grid["hf_dhis2"].isin(sample_facilities)].groupby(
    "hf_dhis2"
).apply(lambda g: g.nsmallest(4, "rank_avg")).reset_index(drop=True)[
    ["hf_dhis2", "hf_mfl"] + score_cols + ["composite_score", "rank_avg"]
].sort_values(["hf_dhis2", "rank_avg"])

Output

<IPython.core.display.HTML object>

To adapt the code:

Line 2: If the similarity algorithms change, update score_cols to reflect the new set.

Step 6.4: Extract best match

To finalize matching, we select the best candidate for each DHIS2 facility. This can be based on any one similarity algorithm or use a blended approach such as a composite score or an average rank across all methods.

The output includes one top match per DHIS2 name for each method, allowing side-by-side comparison to evaluate performance.

Show the code

# for each fuzzy matching method, we select the top-ranked MFL name
# per DHIS2 hf.
best_jw <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_jw, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Jaro-Winkler") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_jw, method
  )

best_lv <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lv, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Levenshtein") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lv, method
  )

best_qg <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_qg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Qgram") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_qg, method
  )

best_lcs <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lcs, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "LCS") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lcs, method
  )

best_comp<- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(composite_score, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Composite-Score") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = composite_score, method
  )

best_ranked_match <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Rank-Ensemble")  |>
    dplyr::select(
    hf_dhis2, hf_mfl, score = rank_avg, method
  )

# combine top matches from all methods for comparison across approaches
all_best <- dplyr::bind_rows(
  best_jw, best_lv, best_qg, best_lcs,
  best_comp, best_ranked_match)

# preview best matches to show extraction results
all_best |>
  dplyr::slice_head(n = 10) |>
  dplyr::select(hf_dhis2, hf_mfl, score, method)

Output

hf_dhis2	hf_mfl	score	method
a clinic dr edwin	aapdep clinic wind	76.07843	Jaro-Winkler
aapdep clinic development in national women	aapdep clinic wind	76.03233	Jaro-Winkler
abdulai clinic dr jalloh	clinic daru field hospital	71.75214	Jaro-Winkler
aberdeen centre clinic fistula mercy ships	birth center clinic house marina	76.78742	Jaro-Winkler
aberdeen clinic marie rd stopes	aberdeen centre hospital women	76.77867	Jaro-Winkler
acf african christian clinic fellowship	african christian clinic fellowship	86.10501	Jaro-Winkler
adama baimba m mchp	adama mabaima mchp	88.75702	Jaro-Winkler
adolescence and clinic east happy kid zone 1	adolescent and clinic happy kid	85.85044	Jaro-Winkler
adolescent and city clinic happy kid makeni	adolescent and clinic happy kid	83.17079	Jaro-Winkler
adra clinic mobile	arab clinic	82.15488	Jaro-Winkler

To adapt the code:

Lines 3–56: For each method block, change the score column and method label to match the algorithm used.
Lines 3–56: Remove blocks for methods that were not computed.
Lines 58–60: Include only the best_* objects kept in bind_rows().

Show the code

# for each fuzzy matching method, select the top-ranked MFL name per DHIS2 hf
best_jw = (
    match_grid.sort_values("score_jw", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_jw"]]
    .rename(columns={"score_jw": "score"})
    .assign(method="Jaro-Winkler")
)

best_lv = (
    match_grid.sort_values("score_lv", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_lv"]]
    .rename(columns={"score_lv": "score"})
    .assign(method="Levenshtein")
)

best_qg = (
    match_grid.sort_values("score_qg", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_qg"]]
    .rename(columns={"score_qg": "score"})
    .assign(method="Qgram")
)

best_lcs = (
    match_grid.sort_values("score_lcs", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_lcs"]]
    .rename(columns={"score_lcs": "score"})
    .assign(method="LCS")
)

best_comp = (
    match_grid.sort_values("composite_score", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "composite_score"]]
    .rename(columns={"composite_score": "score"})
    .assign(method="Composite-Score")
)

best_ranked_match = (
    ranked_grid.sort_values("rank_avg")
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "rank_avg"]]
    .rename(columns={"rank_avg": "score"})
    .assign(method="Rank-Ensemble")
)

# combine top matches from all methods for comparison across approaches
all_best = pd.concat(
    [best_jw, best_lv, best_qg, best_lcs, best_comp, best_ranked_match],
    ignore_index=True,
)

# preview best matches
all_best.head(10)[["hf_dhis2", "hf_mfl", "score", "method"]]

Output

<IPython.core.display.HTML object>

To adapt the code:

Lines 3–47: For each method block, change the score column and method label to match the algorithm used.
Lines 3–47: Remove blocks for methods that were not computed.
Lines 50–55: Include only the best_* objects in pd.concat().

Step 7: Evaluate Fuzzy Match Quality

High similarity scores can sometimes mask poor-quality matches, particularly when facility names differ in structure. To ensure we’re not misled by high numerical scores alone, we combine similarity metrics with diagnostic checks that examine how names align at the token level.

Step 7.1: Visualize score distributions across methods

Before getting into structural evaluations, it’s useful to inspect how the distribution of match scores varies across algorithms. This helps assess whether some methods consistently assign higher scores, and whether their matches tend to cluster tightly or are more dispersed.

This is particularly important when applying score thresholds, or when we want to understand how lenient or strict a method is in assigning high-confidence scores.

Show the code

# plot score dist across methods
score_dist_plot <- all_best |>
  dplyr::filter(method != "Rank-Ensemble") |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_density(fill = "steelblue", alpha = 0.6) +

  ggplot2::facet_wrap(~method) +
  ggplot2::geom_density(color = "steelblue", linewidth = 1) +
  ggplot2::labs(
    title = "Distribution of Fuzzy Matching Scores",
    x = "\nMatch Score (%)",
    y = "Density\n"
  ) +
  ggplot2::theme_minimal(base_size = 14)

# save plot
ggplot2::ggsave(
  plot = score_dist_plot,
  filename = here::here("03_output/3a_figures/u5mr_sle_adm2.png"),
  width = 12,
  height = 9,
  dpi = 300
)

Output

To adapt the code:

Do not modify anything in this section.

Show the code

import matplotlib.pyplot as plt

# plot score distribution across methods (exclude rank-based method)
methods_for_plot = all_best.loc[all_best["method"] != "Rank-Ensemble"]
method_names = sorted(methods_for_plot["method"].unique())

fig, axes = plt.subplots(1, len(method_names), figsize=(12, 6), sharey=True)
for ax, name in zip(axes, method_names):
    grp = methods_for_plot.loc[methods_for_plot["method"] == name, "score"].dropna()
    grp.plot.kde(ax=ax, color="steelblue", linewidth=1)
    ax.fill_between(
        ax.lines[0].get_xdata(),
        ax.lines[0].get_ydata(),
        alpha=0.6,
        color="steelblue",
    )
    ax.set_title(name, fontsize=11)
    ax.set_xlabel("\nMatch Score (%)")
    if ax == axes[0]:
        ax.set_ylabel("Density\n")

fig.suptitle("Distribution of Fuzzy Matching Scores", fontsize=14)
plt.tight_layout()

# save plot
output_dir = Path(here("03_output/3a_figures"))
output_dir.mkdir(parents=True, exist_ok=True)
fig.savefig(output_dir / "fuzzy_score_distributions.png", dpi=300, bbox_inches="tight")

Output

To adapt the code:

Do not modify anything in this section.

Looking at the plots, a tight, high peak near 100 means a method produces many strong matches; a broader or flatter distribution may indicate more uncertainty; and bimodal patterns may suggest a mix of clearly good and ambiguous matches. These patterns help guide our matching strategy: for our final selection in Step 6, we will use an 85% threshold as a starting point and build a fallback system that leans on the most reliable methods first.

Step 7.2: Define match quality diagnostic function

To evaluate the structural quality of a match, we define a simple diagnostic function that compares two names at the token level. It splits each name into individual words (which we will call tokens, since not everything in the name is a word), then does four things:

checks whether the first words align (prefix match)
checks whether the last words align (suffix match)
calculates the difference in word count (token difference)
calculates the difference in character length (character difference) These simple heuristics can catch mismatches that score high but are structurally off.

assess_match_quality <- function(name1, name2) {
  purrr::map2_dfr(name1, name2, function(a, b) {
    tokens1 <- strsplit(a, "\\s+")[[1]]
    tokens2 <- strsplit(b, "\\s+")[[1]]
    tibble::tibble(
      prefix_match = tolower(tokens1[1]) == tolower(tokens2[1]),
      suffix_match = tolower(tail(tokens1, 1)) == tolower(tail(tokens2, 1)),
      token_diff = abs(length(tokens1) - length(tokens2)),
      char_diff = abs(nchar(a) - nchar(b))
    )
  })
}

assess_match_quality("Makeni Govt Hospital", "Makeni Government Hospital")

Output

prefix_match	suffix_match	token_diff	char_diff
TRUE	TRUE	0	6

To adapt the code:

Do not modify anything in this section.

def assess_match_quality(names1, names2):
    """Compute structural quality diagnostics for name pairs.

    Returns a DataFrame with prefix_match, suffix_match,
    token_diff, and char_diff for each name pair.
    """
    rows = []
    for a, b in zip(names1, names2):
        tokens1 = str(a).split()
        tokens2 = str(b).split()
        rows.append({
            "prefix_match": tokens1[0].lower() == tokens2[0].lower()
                if tokens1 and tokens2 else False,
            "suffix_match": tokens1[-1].lower() == tokens2[-1].lower()
                if tokens1 and tokens2 else False,
            "token_diff": abs(len(tokens1) - len(tokens2)),
            "char_diff": abs(len(str(a)) - len(str(b))),
        })
    return pd.DataFrame(rows)

assess_match_quality(["Makeni Govt Hospital"], ["Makeni Government Hospital"])

Output

<IPython.core.display.HTML object>

To adapt the code:

Do not modify anything in this section.

In this example, using our assess_match_quality function, the first and last words match, so both prefix_match and suffix_match are TRUE. There is no difference in the number of tokens (token_diff = 0), but the second name is six characters longer (char_diff = 6) due to “Govt” vs “Government”.

These diagnostics help flag cases where a fuzzy score may be high but important name elements differ, such as CHC vs MCHP, Clinic vs Hospital, or Primary vs Secondary, which can signal a mismatch worth reviewing.

Step 7.3: Evaluate quality of top fuzzy matches

We now run this diagnostic on our best matches. The goal is to highlight cases where, despite high similarity scores, names may differ significantly in format or structure: for example, mismatched prefixes or suffixes. This step supports manual review or adds a layer of filtering to improve final match accuracy.

We also combine the diagnostics into a structure_score (0–100), giving 30% weight to token difference, 20% to character difference, and 25% each to prefix and suffix matches. This provides a quick summary of overall structural alignment.

Show the code

# assess structural match quality
diagnostics_df <- dplyr::bind_cols(
  all_best,
  assess_match_quality(all_best$hf_dhis2, all_best$hf_mfl)
)

# compare method diagnostics
summary_stats <- diagnostics_df |>
  dplyr::group_by(method) |>
  dplyr::summarise(
    score = mean(score),
    avg_token_diff = mean(token_diff) |> round(2),
    avg_char_diff = mean(char_diff) |> round(2),
    pct_prefix_match = (mean(prefix_match) * 100) |> round(2),
    pct_suffix_match = (mean(suffix_match) * 100) |> round(2),
    total = dplyr::n(),
    .groups = "drop"
  )

# create a a final overall score
summary_stats <-  summary_stats |>
  dplyr::mutate(
    # rescale negative of average token difference (smaller is better)
    token_score = scales::rescale(-avg_token_diff, to = c(0, 100)),
    # rescale negative of average character difference (smaller is better)
    char_score = scales::rescale(-avg_char_diff, to = c(0, 100)),
    # rescale prefix match percentage (higher is better)
    prefix_score = scales::rescale(pct_prefix_match, to = c(0, 100)),
    # rescale suffix match percentage (higher is better)
    suffix_score = scales::rescale(pct_suffix_match, to = c(0, 100)),

    # combine all four metrics into a weighted structural quality score
    structure_score = round(
      0.3 * token_score +         # emphasize fewer token differences
        0.2 * char_score +        # moderate weight on character similarity
        0.25 * prefix_score +     # give weight to matching initial words
        0.25 * suffix_score,      # give equal weight to matching final words
      1
    )
  ) |>
  dplyr::arrange(desc(structure_score)) |>
  # assign rank based on descending structure score
  dplyr::mutate(rank = dplyr::row_number()) |>
  dplyr::select(
    method, avg_token_diff, avg_char_diff,
    pct_prefix_match, pct_suffix_match,
    total, structure_score, rank
  )


# check results
summary_stats |>
  dplyr::select(
    Method = method,
    `Avg. Token Difference` = avg_token_diff,
    `Avg. Character Difference` = avg_char_diff,
    `% Prefix Match` = pct_prefix_match,
    `% Suffix Match` = pct_suffix_match,
    `Structural Score` = structure_score,
      Rank = rank
  ) |> as.data.frame()

Output

Method	Avg. Token Difference	Avg. Character Difference	% Prefix Match	% Suffix Match	Structural Score	Rank
Levenshtein	0.65	4.61	52.73	34.68	73.0	1
Composite-Score	0.89	6.12	57.48	39.43	56.2	2
Rank-Ensemble	0.84	5.78	57.72	33.25	55.5	3
Qgram	0.51	4.07	28.74	16.15	50.0	4
Jaro-Winkler	0.86	6.12	65.56	23.99	47.9	5
LCS	1.02	6.90	57.96	40.86	44.8	6

Structure Metric Guide:
    - Avg. Token Difference: Lower is better – penalizes mismatches in word count
    - Avg. Character Difference: Lower is better – penalizes large length gaps
    - % Prefix Match: Higher is better – rewards same starting word
    - % Suffix Match: Higher is better – rewards same ending word
    - Structural Score: Higher is better – summary of structural alignment

To adapt the code:

Lines 34–37: Adjust the weights used to calculate the structure_score if the matching context values certain components (e.g., prefix matches or token length) more than others.

Show the code

def _rescale(series, new_min, new_max):
    """Rescale a numeric series to [new_min, new_max]."""
    s_min, s_max = series.min(), series.max()
    if s_max == s_min:
        return pd.Series([new_min] * len(series), index=series.index)
    return new_min + (series - s_min) / (s_max - s_min) * (new_max - new_min)


# assess structural match quality
diagnostics_df = pd.concat(
    [all_best.reset_index(drop=True),
     assess_match_quality(all_best["hf_dhis2"], all_best["hf_mfl"])],
    axis=1,
)

# compare method diagnostics
summary_stats = (
    diagnostics_df
    .groupby("method")
    .agg(
        avg_token_diff=("token_diff", "mean"),
        avg_char_diff=("char_diff", "mean"),
        pct_prefix_match=("prefix_match", "mean"),
        pct_suffix_match=("suffix_match", "mean"),
        total=("hf_dhis2", "count"),
    )
    .reset_index()
    .assign(
        avg_token_diff=lambda d: d["avg_token_diff"].round(2),
        avg_char_diff=lambda d: d["avg_char_diff"].round(2),
        pct_prefix_match=lambda d: (d["pct_prefix_match"] * 100).round(2),
        pct_suffix_match=lambda d: (d["pct_suffix_match"] * 100).round(2),
    )
)

# create a final overall score
summary_stats = summary_stats.assign(
    # rescale negative of average token difference (smaller is better)
    token_score=lambda d: _rescale(-d["avg_token_diff"], 0, 100),
    # rescale negative of average character difference (smaller is better)
    char_score=lambda d: _rescale(-d["avg_char_diff"], 0, 100),
    # rescale prefix match percentage (higher is better)
    prefix_score=lambda d: _rescale(d["pct_prefix_match"], 0, 100),
    # rescale suffix match percentage (higher is better)
    suffix_score=lambda d: _rescale(d["pct_suffix_match"], 0, 100),
).assign(
    # combine all four metrics into a weighted structural quality score
    structure_score=lambda d: (
        0.3 * d["token_score"] +        # emphasize fewer token differences
        0.2 * d["char_score"] +          # moderate weight on character similarity
        0.25 * d["prefix_score"] +       # give weight to matching initial words
        0.25 * d["suffix_score"]         # give equal weight to matching final words
    ).round(1)
).sort_values("structure_score", ascending=False)

summary_stats["rank"] = range(1, len(summary_stats) + 1)

summary_stats.rename(columns={
    "method": "Method",
    "avg_token_diff": "Avg. Token Difference",
    "avg_char_diff": "Avg. Character Difference",
    "pct_prefix_match": "% Prefix Match",
    "pct_suffix_match": "% Suffix Match",
    "structure_score": "Structural Score",
    "rank": "Rank",
})[["Method", "Avg. Token Difference", "Avg. Character Difference",
    "% Prefix Match", "% Suffix Match", "Structural Score", "Rank"]]

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Lines 47–50: Adjust the weights used to calculate the structure_score if the matching context values certain components (e.g., prefix matches or token length) more than others.

These diagnostics show how well each method preserves the structure of matched names in our example health facility name matching exercise. This helps guide which method to trust most and how to build fallback rules when multiple methods are combined. A different project context may result in a different fuzzy matching algorithm being preferred.

Method Performance Summaries

Levenshtein: Lowest token and character differences, with strong prefix and suffix alignment. Highest structural score (84.1, Rank 1) – best overall performer.
Rank-Ensemble: Balanced across metrics, particularly strong on prefix alignment. Score 71.0 (Rank 2) – reliable fallback option.
Composite-Score: Consistent across metrics with solid suffix alignment. Score 68.5 (Rank 3) – flexible, general-purpose option.
Jaro-Winkler: Strongest prefix alignment but weaker on suffixes. Score 67.1 (Rank 4) – useful when beginnings of names align closely.
Qgram: Good at substring similarity, but weaker structurally overall. Score 50.0 (Rank 5) – more suitable for partial overlaps.
LCS Larger token and character gaps despite reasonable suffix matches. Lowest score (43.9, Rank 6) – penalized for structural differences.

Levenshtein was the strongest method, showing low structural differences and good prefix/suffix alignment. Rank-Ensemble and Jaro-Winkler followed as dependable options, with different strengths on prefixes. Composite-Score was steady but slightly weaker. Qgram and LCS were less effective overall, penalized for larger structural gaps despite occasional suffix strengths.

Step 7.4: Weighted thresholding using structural quality and false-positive risk

Now that the structure score is in place, it can be used to calculate additional metrics that strengthen the matching approach. These metrics can feed into a weighted thresholding process, where scores are adjusted by method quality and thresholds are tailored to the false-positive risk of each method. This shifts the threshold setting from a single arbitrary cutoff to an evidence-based process that reflects the strengths and weaknesses of each algorithm, improving precision without sacrificing matching quality.

The outputs from this step serve different purposes depending on the chosen matching strategy in Step 8. When using a single method or simple composite approach (Options 1-2), only the structure score rankings from Step 7.3 are needed to identify the best-performing method. When using weighted composite scoring or the fallback loop approach (Options 3-4), both the method weights and thresholds calculated in this step are required.

This step calculates two key metrics for each fuzzy matching method:

1. Method weights: Proportional to structure scores from Step 7.3, rescaled so all weights sum to 1. Methods with better structural quality get higher influence when combining multiple methods in Options 3-4.

2. Method-specific thresholds: Use inverse scaling based on structural quality. A threshold determines the minimum similarity score required for a match (e.g., threshold 85 means only candidates scoring ≥85 are accepted). High-performing methods get lower thresholds (easier matching) because they are reliable, while low-performing methods get higher thresholds (stricter matching) to prevent false positives. For example, Levenshtein might get threshold 75 (trusted method), while LCS gets 95 (requires high confidence).

Note In the code below, the thresholds are created by rescaling each method’s structure_score (0–100) into a new range defined by the 70th and 95th percentiles of all_best$score. The rev() function flips this mapping so that higher scores are pushed toward the lower end of the range and lower scores toward the higher end. This means stronger methods receive stricter thresholds, while weaker methods are assigned looser thresholds.

# calculate method-specific weights and thresholds
method_threshold <- summary_stats |>
  dplyr::mutate(
    score = structure_score / 100,
    weight = score / sum(score),
    threshold = scales::rescale(
      structure_score,
      to = rev(unname(
        stats::quantile(all_best$score, c(0.70, 0.95), na.rm = TRUE)
      ))
    ) |>
      round()
  ) |>
  dplyr::select(method, weight, threshold)

# display results
method_threshold

Output

Method-Specific Weights and Thresholds
Method	Weight	Threshold
Levenshtein	0.223	73
Composite-Score	0.172	82
Rank-Ensemble	0.170	82
Qgram	0.153	85
Jaro-Winkler	0.146	86
LCS	0.137	88

To adapt the code:

Line 2: Keep summary_stats from Step 7.3 with the structure_score column.
Lines 6–11: Adjust the quantile range (default: 0.70 to 0.95) to control threshold spread.
Line 14: Modify column selection if additional metrics are needed in the output.

# calculate method-specific weights and thresholds
q70 = all_best["score"].quantile(0.70)
q95 = all_best["score"].quantile(0.95)

method_threshold = summary_stats[["method", "structure_score"]].copy()
method_threshold["score_frac"] = method_threshold["structure_score"] / 100
method_threshold["weight"] = (
    method_threshold["score_frac"] / method_threshold["score_frac"].sum()
)
# inverse scale: higher structure_score → lower threshold (easier matching)
method_threshold["threshold"] = (
    _rescale(method_threshold["structure_score"], q95, q70)
).round()
method_threshold = method_threshold[["method", "weight", "threshold"]]

# display results
method_threshold

Output

<IPython.core.display.HTML object>

To adapt the code:

Line 2: Keep summary_stats from Step 7.3 with the structure_score column.
Lines 4–5: Adjust the quantile range (default: 0.70 to 0.95) to control threshold spread.
Line 14: Modify column selection if additional metrics are needed in the output.

This table shows each method’s relative weight, scaled so all weights sum to one, and its method-specific score threshold derived from observed performance. The relationship between structural quality, weights, and thresholds follows this pattern:

Method	Weight	Threshold	Interpretation
Levenshtein	0.219	75	Highest-performing method: gets most influence in composite scoring and lowest threshold for matching
Composite-Score	0.185	80	Strong performer: high influence and low threshold
Rank-Ensemble	0.178	81	Good performer: moderate-high influence and low-medium threshold
Jaro-Winkler	0.174	82	Balanced method: moderate influence and medium threshold
Qgram	0.130	88	Lower performer: reduced influence and higher threshold to prevent false positives
LCS	0.114	92	Lowest performer: minimal influence and strictest threshold required

Key insight: Methods with better structural alignment get rewarded with higher weights (more influence) and lower thresholds (easier matching), while unreliable methods are constrained with lower weights and stricter thresholds.

Step 8: Finalize Fuzzy Match Selection

This step assigns a final match to each DHIS2 facility by choosing the most appropriate candidate from the fuzzy matching results. At this stage, every unmatched facility from earlier steps should have a set of one or more candidate matches with similarity scores and any supporting diagnostics (e.g., prefix/suffix matches, token differences). The goal is to select one match per facility in a way that balances accuracy, efficiency, and coverage.

The choice of final selection strategy depends on:

Confidence in the matching methods: If the algorithms and thresholds are already well-tuned to the data, a simpler approach may be sufficient.
Tolerance for false positives: In some contexts, a slightly lower match rate is acceptable if it reduces the risk of incorrect matches.
Analytical and operational priorities: For example, whether a quick, explainable method is needed for an urgent report, or a more comprehensive, iterative process for a master dataset update.

We offer four structured options, each with its own strengths, weaknesses, and level of complexity. In practice, we may start with a simpler option and progress to more advanced approaches if too many facilities remain unmatched or if match quality is inconsistent. These options are also compatible with an SNT team review step, allowing human validation before matches are finalized in the dataset.

Summary of Match Selection Options

Option	Description	Use Case	Code Complexity
Option 1: Single Method	Use one preferred method (e.g., Levenshtein) with a threshold	When confident in a single algorithm	Low
Option 2: Composite/Rank	Use composite score or rank average to select best match	When using blended match signals	Moderate
Option 3: Weighted Composite Score	Use performance-weighted composite score that accounts for method reliability and false-positive rates	When using blended match signals with known method performance differences	Moderate
Option 4: Fallback Loop	Stepwise fallback through all methods based on structural quality scores	When coverage is most important	High

Step 8.1: Match selection options

Option 1: Single Preferred Method

This approach uses one similarity algorithm, such as Levenshtein or Jaro-Winkler, to select the highest-scoring match for each facility. We apply a fixed threshold so that scores above the cut-off are accepted as matches, while lower scores are flagged for review. It is fast, easy to explain, and works well when the chosen method aligns with the types of name differences in the data.

Show the code

# apply single-method threshold and flag matches
best_lv_final <- best_lv |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review"),
    final_method = "Levenshtein"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(hf_dhis2, hf_mfl, hf_mfl_raw, score, final_method, match_flag)

# show top 5 and bottom 5 matches by score
top_5 <- best_lv_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- best_lv_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Levenshtein Score")

# preview results
top_bottom

Output

Top 5 and Bottom 5 Matches by Levenshtein Score: 19 out of 467 facilities matched (448 need manual review)
hf_dhis2	hf_mfl	hf_mfl_raw	score	final_method	match_flag
borongoh makarankay mchp	borongor makarankay mchp	Borongor Makarankay MCHP	95.83333	Levenshtein	match
clinic health redeemer	clinic health redeemers	Health Redeemers Clinic	95.65217	Levenshtein	match
kayimba mchp vawahun	kayimba mchp vaawahun	Vaawahun Kayimba MCHP	95.23810	Levenshtein	match
chc farm lakkaogoo	chc farm lakka ogoo	Lakka Ogoo Farm CHC	94.73684	Levenshtein	match
barri chp vaama	barri mchp vaama	Vaama Barri MCHP	93.75000	Levenshtein	match
clinic college evangelical of theology	center clinic evangelical health lutheran mark st	St. Mark Evangelical Lutheran Health Center Clinic	40.81633	Levenshtein	review
aid city clinic direct kenema orphanage	and care clinic hand heart	Heart And Hand Care Clinic	41.02564	Levenshtein	review
aapdep clinic development in national women	centre clinic correctional makeni	Makeni Correctional Centre Clinic	41.86047	Levenshtein	review
bombali chp maforay sebora	adama chc marth memorial	Adama Marth Memorial CHC	42.30769	Levenshtein	review
community hospital rural waterloo	government hospital macauley street	Macauley Street Government Hospital	42.85714	Levenshtein	review

To adapt the code:

Line 2: Replace best_lv with another object from Step 4.5 (e.g., best_jw, best_qg) depending on the preferred method.
Line 4: Adjust the score threshold (default = 85) to define what is considered a match.
Line 11: Keep column names hf_dhis2 and hf_mfl, or replace them with the project-specific names.

Show the code

# apply single-method threshold and flag matches
best_lv_final = (
    best_lv
    .assign(
        match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"),
        final_method="Levenshtein",
    )
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    [["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)

# show top 5 and bottom 5 matches by score
top_5 = best_lv_final.sort_values("score", ascending=False).head(5)
bottom_5 = best_lv_final.sort_values("score").head(5)

# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)

cli_header("Top 5 and Bottom 5 Matches by Levenshtein Score")

# preview results
top_bottom

Output

<IPython.core.display.HTML object>

To adapt the code:

Line 2: Replace best_lv with another object from Step 6.4 (e.g., best_jw, best_qg) depending on the preferred method.
Line 4: Adjust the score threshold (default = 85) to define what is considered a match.
Line 9: Keep column names hf_dhis2 and hf_mfl, or replace them with the project-specific names.

In our example dataset, the Levenshtein method with an 85% threshold achieved the match rate shown in the output above. This conservative approach prioritizes precision over recall, making it suitable when false positives are costly and manual review resources are available.

Option 2: Composite or Rank-Based Match

This approach selects the best match using a combined measure from multiple algorithms, such as a composite score or average rank. It avoids relying on a single method while keeping the process simpler than a full fallback loop. This works well when different algorithms capture different types of variation, and a single combined score can be used with a threshold to separate confident matches from those needing review.

Show the code

# select best match using rank average
composite_final <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(rank_avg) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(composite_score >= 85, "match", "review")
  ) |>
  dplyr::select(
    adm1,
    adm2,
    hf_dhis2,
    hf_mfl,
    composite_score,
    rank_avg,
    match_flag
  )

# show top 5 and bottom 5 matches by composite score
top_5 <- composite_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- composite_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Composite Score")

# preview results
top_bottom

Output

Top 5 and Bottom 5 Matches by Composite Score: 20 out of 467 facilities matched (447 need manual review)
hf_dhis2	hf_mfl	hf_mfl_raw	score	final_method	match_flag
branda centre hospital medical	branda center hospital medical	Branda Medical Center Hospital	96.38889	Composite-Score	match
clinic health redeemer	clinic health redeemers	Health Redeemers Clinic	96.37681	Composite-Score	match
kayimba mchp vawahun	kayimba mchp vaawahun	Vaawahun Kayimba MCHP	95.61508	Composite-Score	match
chc farm lakkaogoo	chc farm lakka ogoo	Lakka Ogoo Farm CHC	95.61404	Composite-Score	match
mchp momorimaria	mchp momorimaia	Momorimaia MCHP	94.79167	Composite-Score	match
chp city cross kenema red	chc kpetema	Kpetema CHC	44.43939	Composite-Score	review
chp city cross makeni red	chp mokassie	Mokassie CHP	44.58333	Composite-Score	review
clinic college evangelical of theology	clinic global tasley	Tasley Global Clinic	44.90253	Composite-Score	review
community foundation health hospital mile 91	choithram hospital memorial	Choithram Memorial Hospital	44.95847	Composite-Score	review
bambara chc foindu lower	mabarr mchp polie	Mabarr Polie MCHP	45.02105	Composite-Score	review

To adapt the code:

Line 2: Keep ranked_grid from Step 6.4, or replace with the combined results dataframe.
Line 8: Adjust the composite score threshold (default = 85) to define what is considered a match.
Lines 10–18: Keep score columns (composite_score, rank_avg) or replace with the relevant scoring metrics.

Show the code

# select best match using rank average
composite_final = (
    ranked_grid.sort_values("rank_avg")
    .groupby("hf_dhis2", as_index=False)
    .first()
    .assign(
        match_flag=lambda d: np.where(d["composite_score"] >= 85, "match", "review"),
        score=lambda d: d["composite_score"],
        final_method="Composite-Score",
    )
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    [["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)

# show top 5 and bottom 5 matches by composite score
top_5 = composite_final.sort_values("score", ascending=False).head(5)
bottom_5 = composite_final.sort_values("score").head(5)

# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)

cli_header("Top 5 and Bottom 5 Matches by Composite Score")

# preview results
top_bottom

Output

<IPython.core.display.HTML object>

To adapt the code:

Line 2: Keep ranked_grid from Step 6.4, or replace with the combined results dataframe.
Line 6: Adjust the composite score threshold (default = 85) to define what is considered a match.
Lines 8–13: Keep score columns (composite_score, rank_avg) or replace with the relevant scoring metrics.

The composite score approach results in the match rate shown in the output above. This reflects the averaging effect where strong matches in one method may be diluted by weaker scores in others. Best suited for datasets with consistent naming patterns across methods.

Option 3: Weighted Composite Score

This approach selects the best match using a performance-weighted composite score that accounts for method reliability and false-positive rates. Instead of treating all methods equally, it uses the weights calculated in Step 7.4 to emphasize methods that demonstrated better structural alignment. This provides a different approach from simple averaging while being less complex than a full fallback system.

The weighted composite score uses a weighted average that maintains the 0-100 scale: Methods with higher structural quality scores get more weight in the final composite. For example, if Levenshtein has weight 0.4 and Qgram has weight 0.1, the Levenshtein score contributes 4x more to the final weighted score.

Show the code

# define score columns and extract weights
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

weights <- method_threshold |>
  dplyr::filter(
    method %in% c("Jaro-Winkler", "Levenshtein", "Qgram", "LCS")
  ) |>
  dplyr::pull(weight)

# calculate weighted composite score for each facility-match pair
weighted_final <- ranked_grid |>
  dplyr::rowwise() |>
  dplyr::mutate(
    # weighted average maintains 0-100 scale
    weighted_composite_score = stats::weighted.mean(
      c(score_jw, score_lv, score_qg, score_lcs),
      weights,
      na.rm = TRUE
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(dplyr::desc(weighted_composite_score)) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(
      weighted_composite_score >= 85,
      "match",
      "review"
    ),
    score = weighted_composite_score, # standardize column name for Step 7
    final_method = "Weighted-Composite"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    hf_mfl_raw,
    score,
    final_method,
    match_flag
  )

# show top 5 and bottom 5 matches by weighted composite score
top_5 <- weighted_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- weighted_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Weighted Composite Score")

# preview results
bottom_5

Output

Top 5 and Bottom 5 Matches by Weighted Composite Score: 24 out of 467 facilities matched (443 need manual review)
hf_dhis2	hf_mfl	hf_mfl_raw	score	final_method	match_flag
branda centre hospital medical	branda center hospital medical	Branda Medical Center Hospital	96.69397	Weighted-Composite	match
clinic health redeemer	clinic health redeemers	Health Redeemers Clinic	96.63314	Weighted-Composite	match
chc farm lakkaogoo	chc farm lakka ogoo	Lakka Ogoo Farm CHC	95.92433	Weighted-Composite	match
kayimba mchp vawahun	kayimba mchp vaawahun	Vaawahun Kayimba MCHP	95.74843	Weighted-Composite	match
mchp momorimaria	mchp momorimaia	Momorimaia MCHP	95.16014	Weighted-Composite	match
clinic college evangelical of theology	clinic global tasley	Tasley Global Clinic	48.23689	Weighted-Composite	review
community hospital rural waterloo	choithram hospital memorial	Choithram Memorial Hospital	49.44549	Weighted-Composite	review
aapdep clinic development in national women	aapdep clinic wind	WIND AAPDEP Clinic	50.41514	Weighted-Composite	review
clinic college margai milton mmcet	clinic general modia	Modia General Clinic	50.46113	Weighted-Composite	review
bombali chp maforay sebora	chp maforay	Maforay CHP	50.70935	Weighted-Composite	review

To adapt the code:

Line 2: Adjust score_cols if the dataset uses different similarity score column names.
Lines 4–8: Ensure method_threshold from Step 7.4 contains weights for the methods being used.
Lines 15–18: The weighted.mean() function calculates a weighted average that maintains the 0-100 scale.
Lines 27–31: Adjust the weighted composite score threshold (default = 85) to define what is considered a match.
Lines 39–46: Select columns to include in the final output.

Show the code

# define score columns and extract weights
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]

weights_df = method_threshold.loc[
    method_threshold["method"].isin(["Jaro-Winkler", "Levenshtein", "Qgram", "LCS"])
]
weights = weights_df.set_index("method")["weight"]

# extract individual weights for each method
w_jw = float(weights.get("Jaro-Winkler", 0))
w_lv = float(weights.get("Levenshtein", 0))
w_qg = float(weights.get("Qgram", 0))
w_lcs = float(weights.get("LCS", 0))

# calculate weighted composite score for each facility-match pair
# weighted average maintains 0-100 scale
weighted_final = (
    ranked_grid.assign(
        weighted_composite_score=lambda d: (
            d["score_jw"] * w_jw +
            d["score_lv"] * w_lv +
            d["score_qg"] * w_qg +
            d["score_lcs"] * w_lcs
        )
    )
    .sort_values("weighted_composite_score", ascending=False)
    .groupby("hf_dhis2", as_index=False)
    .first()
    .assign(
        match_flag=lambda d: np.where(
            d["weighted_composite_score"] >= 85, "match", "review"
        ),
        score=lambda d: d["weighted_composite_score"],
        final_method="Weighted-Composite",
    )
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    [["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)

# show top 5 and bottom 5 matches by weighted composite score
top_5 = weighted_final.sort_values("score", ascending=False).head(5)
bottom_5 = weighted_final.sort_values("score").head(5)

cli_header("Top 5 and Bottom 5 Matches by Weighted Composite Score")

# preview results
bottom_5

Output

<IPython.core.display.HTML object>

To adapt the code:

Line 2: Adjust score_cols if the dataset uses different similarity score column names.
Lines 4–7: Ensure method_threshold from Step 7.4 contains weights for the methods being used.
Lines 14–19: The weighted sum calculates a weighted average that maintains the 0-100 scale.
Lines 25–27: Adjust the weighted composite score threshold (default = 85) to define what is considered a match.
Line 33: Select columns to include in the final output.

The weighted composite approach achieved the improved match rate shown in the output above. This improvement over the simple composite shows the benefit of emphasizing high-quality methods. The weighting balances the strengths of multiple models while prioritizing the most reliable algorithms.

Option 4: Fallback Loop Across Methods (Recommended for Coverage)

This approach maximizes match coverage through a sophisticated tiered fallback system that dynamically uses the method ordering and thresholds calculated in Steps 7.3 and 7.4. Methods are applied in descending order of their structure scores, ensuring high-precision algorithms get the first chance to match.

A key strength is the use of method-specific thresholds from Step 7.4. Each method uses its own calculated threshold based on its structural quality and false-positive risk profile. This evidence-based approach is more reliable than arbitrary cutoffs.

The method order and thresholds are dynamically determined from the structural quality analysis in Steps 7.3-7.4. Methods with higher structure scores are prioritized, and each uses its calculated threshold based on false-positive risk:

High-quality methods (e.g., Levenshtein): Lower thresholds due to reliable matching
Medium-quality methods (e.g., Jaro-Winkler): Moderate thresholds for balanced precision
Lower-quality methods (e.g., Qgram, LCS): Higher thresholds to prevent false positives

The fallback process works as follows:

Method ordering: Methods are ordered by their structure scores from Step 7.3 (highest to lowest)
Tie-breaking: Within each method, candidates are pre-sorted by:
- Similarity score (descending) - highest scores first
- Average rank from ensemble methods (ascending) - best ranks as secondary criterion This ensures that when multiple candidates exceed the threshold, the best match is selected
Progressive matching: Each method attempts to match remaining facilities using its calculated threshold from Step 7.4
Threshold relaxation: If facilities remain unmatched after a full pass, thresholds are reduced by 1 point and the cycle repeats
Quality floor: Process stops when all facilities are matched or thresholds drop below 50

This dynamic approach with reliable tie-breaking ensures the best possible match for each facility while maintaining quality standards based on empirical performance metrics.

Show the code

# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats <- summary_stats |>
  dplyr::arrange(dplyr::desc(structure_score)) |>
  dplyr::filter(method != "Rank-Ensemble") # exclude rank-based method

method_order <- method_stats$method

# use thresholds calculated in step 7.4
method_thresholds_df <- method_threshold |>
  dplyr::filter(method != "Rank-Ensemble")

# convert to named vector for easy lookup
method_thresholds <- setNames(
  method_thresholds_df$threshold,
  method_thresholds_df$method
)

# build combined table of top matches from real methods only
all_best_nonrank <- dplyr::bind_rows(
  best_lv,
  best_jw,
  best_qg,
  best_lcs,
  best_comp
) |>
  # normalize method labels
  dplyr::mutate(method = trimws(method))

# add rank once
rank_key <- ranked_grid |>
  dplyr::select(hf_dhis2, hf_mfl, rank_avg)

all_best_ranked <- all_best_nonrank |>
  dplyr::left_join(rank_key, by = c("hf_dhis2", "hf_mfl")) |>
  dplyr::mutate(
    # worst for missing ranks
    rank_avg = dplyr::if_else(is.na(rank_avg), Inf, rank_avg)
  )

# split by method and pre-arrange for stable tiebreak
by_method <- split(all_best_ranked, all_best_ranked$method) |>
  purrr::map(\(x) {
    x |>
      dplyr::arrange(dplyr::desc(score), rank_avg)
  })

# keep copy for pass-rate check
initial_thr <- method_thresholds

# init
fallback_chunks <- list()
unmatched <- unique(all_best_ranked$hf_dhis2)

# fast fallback loop (no joins, no grouping inside)
repeat {
  for (m in method_order) {
    if (!m %in% names(by_method)) {
      next
    }

    # current threshold from step 7.4
    thr <- method_thresholds[[m]]
    if (is.null(thr)) {
      thr <- 85 # fallback if method not found
    }

    # filter by score and still-unmatched
    cand <- by_method[[m]] |>
      dplyr::filter(score >= thr, hf_dhis2 %in% unmatched)

    if (nrow(cand) > 0) {
      # pick first row per hf_dhis2 (pre-arranged by score desc, rank asc)
      best_per_hf <- cand |>
        dplyr::distinct(hf_dhis2, .keep_all = TRUE) |>
        dplyr::mutate(
          final_method = glue::glue("Fuzzy-matched using: {m}")
        )

      # store and update unmatched
      fallback_chunks[[length(fallback_chunks) + 1L]] <- best_per_hf
      unmatched <- setdiff(unmatched, best_per_hf$hf_dhis2)
    }

    if (length(unmatched) == 0) break
  }

  if (length(unmatched) == 0) {
    break
  }

  # relax thresholds gradually
  method_thresholds <- method_thresholds - 1
  if (any(method_thresholds <= 50)) break # stop if too low
}

# bind once
fallback_matched <- dplyr::bind_rows(fallback_chunks)

# build a 1:1 map from standardized DHIS2 name to admin + raw +
# hf_uid_new
hf_uid_new_map_by_stand <- dhis2_hf_unmatched |>
  dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::summarise(
    adm0 = dplyr::first(adm0),
    adm1 = dplyr::first(adm1),
    adm2 = dplyr::first(adm2),
    adm3 = dplyr::first(adm3),
    hf_dhis2_raw = dplyr::first(hf_dhis2_raw),
    hf_uid_new = dplyr::first(hf_uid_new),
    .groups = "drop"
  )

# final flags and attach admin + ids
fallback_final <- fallback_matched |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review")
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::left_join(hf_uid_new_map_by_stand, by = "hf_dhis2") |>
  dplyr::select(
    adm0, adm1, adm2, adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    match_flag
  )

top_5 <- fallback_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- fallback_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Fallback")

bottom_5 <- dplyr::bind_rows(top_5, bottom_5)

# preview results
bottom_5

Output

Top 5 and Bottom 5 from Fallback: 122 of 467 matched (345 review)
adm0	adm1	adm2	adm3	hf_uid_new	hf_dhis2_raw	hf_dhis2	hf_mfl_raw	hf_mfl	score	final_method	match_flag
SIERRA LEONE	NORTH WEST	KARENE	GBANTI KARENE	171abc43	BORONGOH MAKARANKAY MCHP	borongoh makarankay mchp	Borongor Makarankay MCHP	borongor makarankay mchp	95.83333	Fuzzy-matched using: Levenshtein	match
SIERRA LEONE	WESTERN	WESTERN URBAN	WEST II	b85117a7	REDEEMER HEALTH CLINIC	clinic health redeemer	Health Redeemers Clinic	clinic health redeemers	95.65217	Fuzzy-matched using: Levenshtein	match
SIERRA LEONE	SOUTHERN	PUJEHUN	PANGA	d0219c0c	VAWAHUN KAYIMBA MCHP	kayimba mchp vawahun	Vaawahun Kayimba MCHP	kayimba mchp vaawahun	95.23810	Fuzzy-matched using: Levenshtein	match
SIERRA LEONE	WESTERN	WESTERN URBAN	EAST I	7bf33e78	KARGBO DOCKYARD CHP	chp dockyard kargbo	Kargbo Dockyard MCHP	dockyard kargbo mchp	95.00000	Fuzzy-matched using: Qgram	match
SIERRA LEONE	WESTERN	WESTERN RURAL	YORK RURAL	NA	LAKKAOGOO FARM CHC	chc farm lakkaogoo	Lakka Ogoo Farm CHC	chc farm lakka ogoo	94.73684	Fuzzy-matched using: Levenshtein	match
SIERRA LEONE	NORTH EAST	BOMBALI	MAKENI CITY	b34ad8b4	NEW HOPE HOSPITAL	hope hospital new	Blue Shield Hospital	blue hospital shield	60.00000	Fuzzy-matched using: Levenshtein	review
SIERRA LEONE	WESTERN	WESTERN URBAN	WEST III	NA	MARIE STOPES AHMED DRIVE CLINIC	ahmed clinic drive marie stopes	Marie Stopes Clinic	clinic marie stopes	61.29032	Fuzzy-matched using: Levenshtein	review
SIERRA LEONE	SOUTHERN	PUJEHUN	MONO SAKRIM	NA	NYANDEHUN MANO SAKRIM MCHP	mano mchp nyandehun sakrim	Nyandehun Malen MCHP	malen mchp nyandehun	61.53846	Fuzzy-matched using: Levenshtein	review
SIERRA LEONE	EASTERN	KONO	NIMIKORO	e260d9c1	AIAH MASS CLINIC	aiah clinic mass	Caring Hands Clinic	caring clinic hands	63.15789	Fuzzy-matched using: Levenshtein	review
SIERRA LEONE	SOUTHERN	MOYAMBA	FAKUNYA	NA	NJAGBAHUN FAKUNYA MCHP	fakunya mchp njagbahun	Njagbahun MCHP	mchp njagbahun	63.63636	Fuzzy-matched using: Levenshtein	review

To adapt the code:

Lines 2–8: The method order is automatically determined from summary_stats structure scores.
Lines 10–18: Thresholds are extracted from Step 7.4’s method_threshold dataframe.
Line 66: The fallback threshold (default = 85) is used only if a method lacks a calculated threshold.
Line 95: The minimum threshold (default = 50) prevents accepting very low-quality matches.
Lines 33, 36, throughout: Replace hf_dhis2 and hf_mfl with the dataset’s column names if they differ.

Show the code

# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats = summary_stats.loc[
    summary_stats["method"] != "Rank-Ensemble"
].sort_values("structure_score", ascending=False)

method_order = list(method_stats["method"])

# use thresholds calculated in step 7.4
method_thresholds_df = method_threshold.loc[method_threshold["method"] != "Rank-Ensemble"]
method_thresholds = dict(zip(method_thresholds_df["method"], method_thresholds_df["threshold"]))

# build combined table of top matches from real methods only
all_best_nonrank = pd.concat(
    [best_lv, best_jw, best_qg, best_lcs, best_comp], ignore_index=True
)
all_best_nonrank["method"] = all_best_nonrank["method"].str.strip()

# add rank once
rank_key = ranked_grid[["hf_dhis2", "hf_mfl", "rank_avg"]].copy()
all_best_ranked = (
    all_best_nonrank
    .merge(rank_key, on=["hf_dhis2", "hf_mfl"], how="left")
    .assign(rank_avg=lambda d: d["rank_avg"].fillna(float("inf")))
)

# split by method and pre-arrange for stable tiebreak
by_method = {
    m: grp.sort_values(["score", "rank_avg"], ascending=[False, True])
    for m, grp in all_best_ranked.groupby("method")
}

fallback_chunks = []
unmatched_set = set(all_best_ranked["hf_dhis2"].unique())
current_thresholds = dict(method_thresholds)

# fast fallback loop
while True:
    for m in method_order:
        if m not in by_method:
            continue
        thr = current_thresholds.get(m, 85)  # fallback if method not found
        cand = by_method[m].loc[
            (by_method[m]["score"] >= thr) &
            (by_method[m]["hf_dhis2"].isin(unmatched_set))
        ]
        if len(cand) > 0:
            best_per_hf = (
                cand.drop_duplicates(subset="hf_dhis2", keep="first")
                .assign(final_method=f"Fuzzy-matched using: {m}")
            )
            fallback_chunks.append(best_per_hf)
            unmatched_set -= set(best_per_hf["hf_dhis2"])
        if not unmatched_set:
            break

    if not unmatched_set:
        break

    # relax thresholds gradually
    current_thresholds = {k: v - 1 for k, v in current_thresholds.items()}
    if any(v <= 50 for v in current_thresholds.values()):  # stop if too low
        break

# bind once
fallback_matched = (
    pd.concat(fallback_chunks, ignore_index=True)
    if fallback_chunks else pd.DataFrame()
)

# build a 1:1 map from standardized DHIS2 name to admin + raw + hf_uid_new
hf_uid_new_map_by_stand = (
    dhis2_hf_unmatched
    .sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
    .groupby("hf_dhis2", as_index=False)
    .first()[["hf_dhis2", "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
)

# final flags and attach admin + ids
fallback_final = (
    fallback_matched
    .assign(match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"))
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    .merge(hf_uid_new_map_by_stand, on="hf_dhis2", how="left")
    [["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
      "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
      "score", "final_method", "match_flag"]]
)

top_5 = fallback_final.sort_values("score", ascending=False).head(5)
bottom_5 = fallback_final.sort_values("score").head(5)

cli_header("Top 5 and Bottom 5 Matches by Fallback")

bottom_5 = pd.concat([top_5, bottom_5], ignore_index=True)

# preview results
bottom_5

Output

<IPython.core.display.HTML object>

To adapt the code:

Lines 2–7: The method order is automatically determined from summary_stats structure scores.
Lines 9–12: Thresholds are extracted from Step 7.4’s method_threshold dataframe.
Line 46: The fallback threshold (default = 85) is used only if a method lacks a calculated threshold.
Line 58: The minimum threshold (default = 50) prevents accepting very low-quality matches.
Lines 22, 76, throughout: Replace hf_dhis2 and hf_mfl with the dataset’s column names if they differ.

The fallback loop achieved the highest match rate as shown in the output above. This substantial improvement over other methods demonstrates the effectiveness of method-specific thresholds and progressive fallback. The approach maximizes coverage while maintaining quality through data-driven thresholds and intelligent tie-breaking.

This is the approach that we will use going forward, as it balances coverage and match quality, reduces the manual review workload, and adapts thresholds dynamically to the strengths and weaknesses of each method.

Step 8.2: Select fuzzy matching approach

After reviewing the different fuzzy matching options from Step 8.1, select which approach to use. This step creates the fuzzy_matches dataset that will be combined with other matching methods in Step 9.

Choosing an approach

Select one approach based on the evaluation in Step 8.1: - Use single algorithm (Levenshtein, Jaro-Winkler, or Q-gram) for simplicity and interpretability - Use composite scores to balance multiple algorithms - Use weighted approach when domain expertise suggests certain algorithms are more reliable for the data - Use fallback approach for the most comprehensive matching with hierarchical scoring

Show the code

# create fuzzy_matches from step 8 results (use whichever option was
# chosen)
# step 8 produces different result datasets depending on which option
# was
# chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final

fuzzy_matches <- if (exists("fallback_final")) {
  dplyr::filter(fallback_final, match_flag == "match")
} else if (exists("composite_final")) {
  dplyr::filter(composite_final, match_flag == "match")
} else if (exists("weighted_final")) {
  dplyr::filter(weighted_final, match_flag == "match")
} else if (exists("best_lv_final")) {
  dplyr::filter(best_lv_final, match_flag == "match")
} else if (exists("best_jw_final")) {
  dplyr::filter(best_jw_final, match_flag == "match")
} else if (exists("best_qg_final")) {
  dplyr::filter(best_qg_final, match_flag == "match")
} else {
  # empty dataframe with expected structure if no step 8 results exist
  data.frame(
    hf_dhis2_raw = character(0),
    hf_dhis2 = character(0),
    hf_mfl_raw = character(0),
    hf_mfl = character(0),
    score = numeric(0),
    final_method = character(0),
    stringsAsFactors = FALSE
  )
} |>
# remove any internal duplicates (keep highest scoring match per
# facility)
dplyr::group_by(hf_dhis2_raw) |>
dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
dplyr::ungroup() |>
# attach admin + hf_uid_new from unmatched (collapse to 1:1 per
# hf_dhis2_raw)
{
  hf_uid_new_map_unmatched <- dhis2_hf_unmatched |>
    dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
    dplyr::group_by(hf_dhis2_raw) |>
    dplyr::summarise(
      adm0 = dplyr::first(adm0),
      adm1 = dplyr::first(adm1),
      adm2 = dplyr::first(adm2),
      adm3 = dplyr::first(adm3),
      hf_uid_new = dplyr::first(hf_uid_new),
      .groups = "drop"
    )
  dplyr::left_join(., hf_uid_new_map_unmatched, by = "hf_dhis2_raw")
}

# display summary of selected approach
cli::cli_h2("Selected Fuzzy Matching Results")
cli::cli_alert_success("Fuzzy matches found: {nrow(fuzzy_matches)}")

if (nrow(fuzzy_matches) > 0) {
  score_summary <- fuzzy_matches |>
    dplyr::summarise(
      avg_score = mean(score, na.rm = TRUE),
      min_score = min(score, na.rm = TRUE),
      max_score = max(score, na.rm = TRUE),
      .groups = "drop"
    )

  cli::cli_alert_info("Score range: {round(score_summary$min_score, 1)} - {round(score_summary$max_score, 1)}")
  cli::cli_alert_info("Average score: {round(score_summary$avg_score, 1)}")
}

Output

── Selected Fuzzy Matching Results ──

✔ Fuzzy matches found: 122

ℹ Score range: 85 - 95.8

ℹ Average score: 87.9

To adapt the code:

Lines 12–24: The conditional logic checks for different Step 8 result datasets in priority order. Modify the order if a different approach should take precedence.
Lines 25–34: Create empty dataframe structure if no Step 8 results exist. Ensure column names match the expected structure.
Lines 38–40: Deduplication logic removes internal duplicates within the selected approach, keeping the highest scoring match per DHIS2 facility. Modify if different deduplication rules are needed.
Alternative approach: Instead of conditional logic, directly assign the chosen dataset. For example, to select the fallback approach directly: fuzzy_matches <- dplyr::filter(fallback_final, match_flag == "match") |> dplyr::group_by(hf_dhis2_raw) |> dplyr::slice_max(score, n = 1, with_ties = FALSE) |> dplyr::ungroup().

Show the code

# create fuzzy_matches from step 8 results (use whichever option was chosen)
# step 8 produces different result datasets depending on which option was chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final

if "fallback_final" in dir():
    fuzzy_matches_raw = fallback_final.loc[fallback_final["match_flag"] == "match"]
elif "composite_final" in dir():
    fuzzy_matches_raw = composite_final.loc[composite_final["match_flag"] == "match"]
elif "weighted_final" in dir():
    fuzzy_matches_raw = weighted_final.loc[weighted_final["match_flag"] == "match"]
elif "best_lv_final" in dir():
    fuzzy_matches_raw = best_lv_final.loc[best_lv_final["match_flag"] == "match"]
elif "best_jw_final" in dir():
    fuzzy_matches_raw = best_jw_final.loc[best_jw_final["match_flag"] == "match"]
elif "best_qg_final" in dir():
    fuzzy_matches_raw = best_qg_final.loc[best_qg_final["match_flag"] == "match"]
else:
    # empty dataframe with expected structure if no step 8 results exist
    fuzzy_matches_raw = pd.DataFrame(columns=[
        "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
        "score", "final_method",
    ])

# deduplicate: keep highest scoring match per DHIS2 facility
fuzzy_matches = (
    fuzzy_matches_raw
    .sort_values("score", ascending=False)
    .drop_duplicates(subset="hf_dhis2_raw", keep="first")
    .merge(
        dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
        .sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
        .drop_duplicates(subset="hf_dhis2_raw", keep="first"),
        on="hf_dhis2_raw",
        how="left",
    )
)

# display summary of selected approach
cli_header("Selected Fuzzy Matching Results")
cli_success(f"Fuzzy matches found: {len(fuzzy_matches)}")

if len(fuzzy_matches) > 0:
    cli_info(
        f"Score range: {fuzzy_matches['score'].min():.1f} - "
        f"{fuzzy_matches['score'].max():.1f}"
    )
    cli_info(f"Average score: {fuzzy_matches['score'].mean():.1f}")

Output


Selected Fuzzy Matching Results

SUCCESS: Fuzzy matches found: 122

INFO: Score range: 85.0 - 95.8
INFO: Average score: 87.9

To adapt the code:

Lines 12–24: The conditional logic checks for different Step 8 result datasets in priority order. Modify the order if a different approach should take precedence.
Lines 25–28: Create empty dataframe structure if no Step 8 results exist. Ensure column names match the expected structure.
Lines 31–41: Deduplication logic removes internal duplicates, keeping the highest scoring match per DHIS2 facility. Modify if different deduplication rules are needed.

Step 8.3: Manual review of flagged matches

Before finalizing matches and joining back to the main datasets, all records flagged as review must be manually validated or rejected by the SNT team or designated reviewers. This step ensures data quality and prevents false positives from entering the analysis.

Consult with SNT team

Manual review is required for all review matches before proceeding to Step 9. The SNT team should confirm or reject uncertain matches and document patterns in false positives or missed matches to refine future matching workflows.

Preparing Data for Manual Review

Below, we export review matches to a structured file for systematic checking:

Show the code

# prepare review dataset (using fallback results as example)
review_matches <- fallback_final |>
  dplyr::filter(match_flag == "review") |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::mutate(
    reviewer_decision = NA_character_, # accept/reject/uncertain
    reviewer_notes = NA_character_,
    review_date = NA_character_,
    reviewer_name = NA_character_
  ) |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    reviewer_decision,
    reviewer_notes,
    review_date,
    reviewer_name
  )

# export for review
readr::write_csv(
  review_matches,
  here::here("03_outputs", "tables", "facility_matches_for_review.csv")
)

# show summary
cli::cli_h2("Manual Review Required")
cli::cli_alert_info(
  "Exported {nrow(review_matches)} matches flagged for review"
)
cli::cli_alert_info(
  "File location: outputs/facility_matches_for_review.csv"
)

Output

── Manual Review Required ──

ℹ Exported 157 matches flagged for review

Sample matches requiring manual review (ordered by score)
hf_dhis2	hf_mfl	score	final_method
arab clinic waterloo	arab clinic	85.00000	Fuzzy-matched using: Jaro-Winkler
bandama galliness mchp	bandama mchp	84.84848	Fuzzy-matched using: Jaro-Winkler
mchp veronical	chp venima	84.76190	Fuzzy-matched using: Jaro-Winkler
clinic marie stopes street waterloo	clinic marie stopes	84.76190	Fuzzy-matched using: Jaro-Winkler
chp tuakan	mchp tawahun	84.62963	Fuzzy-matched using: Jaro-Winkler

To adapt the code:

Lines 6–9: Add additional columns for the review process (e.g., facility_type, district).
Lines 26–29: Adjust the output file path and name as needed.

Show the code

# prepare review dataset (using fallback results as example)
review_matches = (
    fallback_final
    .loc[fallback_final["match_flag"] == "review"]
    .sort_values("score", ascending=False)
    .assign(
        reviewer_decision=pd.NA,  # accept/reject/uncertain
        reviewer_notes=pd.NA,
        review_date=pd.NA,
        reviewer_name=pd.NA,
    )
    [["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
      "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
      "score", "final_method",
      "reviewer_decision", "reviewer_notes", "review_date", "reviewer_name"]]
)

# export for review
output_dir = Path(here("03_outputs/tables"))
output_dir.mkdir(parents=True, exist_ok=True)
review_matches.to_csv(output_dir / "facility_matches_for_review.csv", index=False)

# show summary
cli_header("Manual Review Required")
cli_info(f"Exported {len(review_matches)} matches flagged for review")
cli_info("File location: outputs/tables/facility_matches_for_review.csv")

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Lines 6–9: Add additional columns for the review process (e.g., facility_type, district).
Lines 24–25: Adjust the output file path and name as needed.

Review Process Guidelines

The manual review step is more than a simple yes/no decision. It is an opportunity to strengthen the matching workflow by confirming accurate links, rejecting false positives, and documenting recurring data quality issues. Reviewers play a key role in ensuring the integrity of the final dataset and in improving the process for future runs.

For Reviewers:

Accept – Confirm that both names refer to the same facility. Look for strong evidence of equivalence, such as:
- Minor spelling variations: "Makeni Gov Hospital" ↔︎ "Makeni Government Hospital"
- Common or consistent abbreviations: Bo CHC ↔︎ Bo Community Health Center
- Small formatting differences that do not alter meaning
Reject – Flag cases where facilities differ or where the match is too uncertain. Common examples include:
- Different facility types in the same area: "Makeni Hospital" ↔︎ "Makeni Clinic"
- Generic names without geographic qualifiers: Community Health Post without a location
- Multiple possible matches in the same vicinity without clear identifiers
Document patterns – Use the review to capture recurring issues. These insights help refine thresholds, weights, and matching rules. Examples:
- Frequent abbreviation mismatches (e.g., CHC vs CH Centre)
- Regional or language-specific naming conventions
- Missing or inconsistent administrative unit data in source systems

A systematic approach to review ensures that accepted matches are trustworthy, rejected matches are well-justified, and lessons learned feed directly into improving the next matching cycle.

Simulated Review Decisions

For demonstration purposes, we simulate reviewer decisions on the flagged matches. In practice, this would be done by domain experts reviewing the exported CSV file.

After manual review is complete, we import the data that has all the decisions.

Show the code

# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches <- readr::read_csv(
  here::here("03_outputs", "tables", "facility_matches_reviewed.csv"),
  col_types = readr::cols(
    reviewer_decision = readr::col_character(),
    reviewer_notes = readr::col_character(),
    review_date = readr::col_character(),
    reviewer_name = readr::col_character()
  )
)

To adapt the code:

Line 4: Update the file path to the project’s output directory.
Lines 6–9: Add additional column types if the review file includes extra fields.

Show the code

# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches = pd.read_csv(
    here("03_outputs/tables/facility_matches_reviewed.csv"),
    dtype={
        "reviewer_decision": "object",
        "reviewer_notes": "object",
        "review_date": "object",
        "reviewer_name": "object",
    }
)

To adapt the code:

Line 4: Update the file path to the project’s output directory.
Lines 5–9: Add additional column dtype specifications if the review file includes extra fields.

Incorporating Review Results

Now we integrate the reviewer decisions back into the matching workflow:

Show the code

final_incorp <- reviewed_matches |>
  dplyr::mutate(
    final_match_flag = dplyr::case_when(
      reviewer_decision == "accept" ~ "match",
      reviewer_decision == "reject" ~ "no_match",
      is.na(reviewer_decision) ~ "pending_review",
      TRUE ~ "no_match"
    )
  )

# get the final matched ones
final_matches <- final_incorp |>
  dplyr::filter(final_match_flag == "match") |>
  dplyr::select(-reviewer_decision)

# summary of all results (including rejected/pending)
all_results_summary <- final_incorp |>
  dplyr::count(final_match_flag, name = "n") |>
  dplyr::mutate(percentage = round(100 * n / sum(n), 1))

# summary of final accepted matches only
final_summary <- final_matches |>
  dplyr::count(final_match_flag, name = "n")

cli::cli_h2("Final Matching Results After Manual Review")
cli::cli_alert_success(
  "Accepted matches: {format(sum(final_summary$n), big.mark = ',')}"
)

# breakdown by decision type
rejected_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "no_match") |>
  dplyr::pull(n)
if (length(rejected_count) == 0) {
  rejected_count <- 0
}

pending_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "pending_review") |>
  dplyr::pull(n)
if (length(pending_count) == 0) {
  pending_count <- 0
}

if (rejected_count > 0) {
  cli::cli_alert_danger(
    "Rejected matches: {format(rejected_count, big.mark = ',')} (will have NA MFL values in final dataset)"
  )
}

if (pending_count > 0) {
  cli::cli_alert_warning(
    "Pending review: {format(pending_count, big.mark = ',')} (require additional review)"
  )
}

# show breakdown table
all_results_summary

Output

── Final Matching Results After Manual Review ──

✔ Accepted matches: 8

✖ Rejected matches: 149 (will have NA MFL values in final dataset)

Complete breakdown of reviewer decisions
Decision	Count	Percentage
match	8	5.1
no_match	149	94.9

To adapt the code:

[Code removed]: Update the file path from "03_outputs" to the project’s output directory (no file path in this chunk; see the simulate reviewer decisions chunk above).
[Code removed]: Add additional column types if the review file includes extra fields (no column types in this chunk; see the simulate reviewer decisions chunk above).
Lines 3–9: Adjust logic for handling different reviewer decision categories based on the project workflow.

Show the code

final_incorp = reviewed_matches.assign(
    final_match_flag=lambda d: np.select(
        [
            d["reviewer_decision"] == "accept",
            d["reviewer_decision"] == "reject",
            d["reviewer_decision"].isna(),
        ],
        ["match", "no_match", "pending_review"],
        default="no_match",
    )
)

# get the final matched ones
final_matches = (
    final_incorp
    .loc[final_incorp["final_match_flag"] == "match"]
    .drop(columns="reviewer_decision", errors="ignore")
)

# summary of all results (including rejected/pending)
all_results_summary = (
    final_incorp.groupby("final_match_flag", as_index=False)
    .size()
    .rename(columns={"size": "n"})
    .assign(percentage=lambda d: (d["n"] / d["n"].sum() * 100).round(1))
)

cli_header("Final Matching Results After Manual Review")
cli_success(f"Accepted matches: {len(final_matches):,}")

# breakdown by decision type
rejected_count = int(
    all_results_summary.loc[all_results_summary["final_match_flag"] == "no_match", "n"].sum()
)
pending_count = int(
    all_results_summary.loc[
        all_results_summary["final_match_flag"] == "pending_review", "n"
    ].sum()
)

if rejected_count > 0:
    cli_danger(
        f"Rejected matches: {rejected_count:,} "
        "(will have NA MFL values in final dataset)"
    )

if pending_count > 0:
    cli_warning(f"Pending review: {pending_count:,} (require additional review)")

# show breakdown table
all_results_summary

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Lines 3–9: Adjust logic for handling different reviewer decision categories based on the project workflow.

Out of the 159 proposed matches, none were accepted, as all appeared to be different. This may suggest underlying data quality issues or the presence of new health facility names.

DHIS2 facilities with rejected matches will appear in the final dataset (Step 9) with NA values for MFL columns, preserving the facility record while clearly indicating no reliable match was found. Facilities with pending review should be resolved before proceeding to production use.

Step 9: Combine All Matched Results

In this step, we combine results from all matching approaches and finalize the dataset by joining back to the original DHIS2 health facility list. This creates a comprehensive dataset that includes:

High-confidence matches from Step 4 (Interactive Stratified Geographic Matching)
Fuzzy matches from Step 6-8 (Various fuzzy matching strategies on remaining unmatched)
Unmatched facilities (Preserved with NA values for MFL columns)

Step 9.1: Combine results from all matching approaches

When using the enhanced workflow with stratified matching, combine results from all stages:

Show the code

# create manual_matches from manual review results (both matched and
# unmatched)
manual_matches <- final_incorp |>
  dplyr::mutate(
    # for unmatched, set appropriate values
    hf_mfl = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl,
      NA_character_
    ),
    hf_mfl_raw = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl_raw,
      NA_character_
    ),
    final_method = dplyr::if_else(
      reviewer_decision == "accept",
      "Manual Review - Accepted",
      "Manual Review - Rejected"
    ),
    score = dplyr::if_else(
      reviewer_decision == "accept",
      100,
      0
    )
  ) |>
  # manual reviewed results already carry adm0-3 + hf_uid_new from
  # fallback_final → review export → reimport
  dplyr::select(
    adm0,
    adm1,
    adm2,
    adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method
  )

# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and
# rejected
final_facilities_all <- dplyr::bind_rows(
  # step 4: stratified geographic matches (if performed)
  matched_dhis2_prepgeoname,
  # step 5: exact matches after standardization
  matched_dhis2,
  # step 6-8: complete matches from fuzzy matching
  fuzzy_matches,
  # manual matching (if performed)
  manual_matches
)

# persist for the python render-state to load (Rule 6.4 parity)
write.csv(
  final_facilities_all,
  here::here("01_data/1.1_foundational/1.1c_health_facilities/processed/final_facilities_all.csv"),
  row.names = FALSE
)

# collapse to one record per DHIS2 facility for downstream integration
final_facilities_one_per_hf <- final_facilities_all |>
  dplyr::select(-match_flag, -hf)

cli::cli_alert_info(
  "All facilities processed (rows): {nrow(final_facilities_all)}"
)
cli::cli_alert_success(
  "All facilities processed (distinct DHIS2 by adm0/1/2/3+hf): {dplyr::n_distinct(final_facilities_all$hf_uid_new)}"
)

# generate summary by matching method
matching_summary <- final_facilities_one_per_hf |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    n_matched = n(),
    avg_score = mean(score, na.rm = TRUE),
    min_score = min(score, na.rm = TRUE),
    max_score = max(score, na.rm = TRUE),
    .groups = "drop"
  )

# display summary
cli::cli_h2("Matching Results by Method")
matching_summary

Output

ℹ All facilities processed (rows): 1579

✔ All facilities processed (distinct DHIS2 by adm0/1/2/3+hf): 1460

Matching Results by Method
Method	Count	Avg Score	Min Score	Max Score
Fuzzy-matched using: Jaro-Winkler	99	87.1	85.1	90.7
Fuzzy-matched using: Levenshtein	19	91.6	85.0	95.8
Fuzzy-matched using: Qgram	4	91.0	87.5	95.0
Interactive Stratified Geographic Matching	1202	100.0	100.0	100.0
Manual Review - Accepted	8	100.0	100.0	100.0
Manual Review - Rejected	149	0.0	0.0	0.0
Matched Without Fuzzy Matching (standardization)	98	100.0	100.0	100.0

To adapt the code:

[Code removed]: The fuzzy_matches logic automatically detects which Step 8 approach was used. Modify the order or add additional Step 8 result variables if needed (this logic lives in the select fuzzy approach chunk, Step 8.2).
Lines 3–41: The manual_matches logic processes both matched and rejected facilities from final_incorp. Replace final_incorp with the manual review results dataset name.
Lines 16–25: Facilities that were rejected in manual review get final_method = "Manual Review - Rejected" and score = 0. Adjust these labels if needed.
Lines 46–55: The bind_rows combines all matching approaches. Add or remove rows based on which steps were used in the workflow.
[Code removed]: Deduplication logic removes duplicates by keeping the highest scoring match for each DHIS2 facility (grouped by hf_dhis2_raw only). Modify if different deduplication rules are needed (deduplication lives in the select fuzzy approach chunk, Step 8.2).
[Code removed]: Validation checks detect internal duplicates within each dataset and verify the final combined dataset has no duplicates. Remove or modify these checks if not needed (not present in this chunk).

Show the code

# create manual_matches from manual review results (both matched and unmatched)
manual_matches = final_incorp.assign(
    # for unmatched, set appropriate values
    hf_mfl=lambda d: np.where(d["reviewer_decision"] == "accept", d["hf_mfl"], pd.NA),
    hf_mfl_raw=lambda d: np.where(
        d["reviewer_decision"] == "accept", d["hf_mfl_raw"], pd.NA
    ),
    final_method=lambda d: np.where(
        d["reviewer_decision"] == "accept",
        "Manual Review - Accepted",
        "Manual Review - Rejected",
    ),
    score=lambda d: np.where(d["reviewer_decision"] == "accept", 100, 0),
)

cols_needed = [
    "adm0", "adm1", "adm2", "adm3", "hf_uid_new",
    "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl", "score", "final_method",
]
manual_matches = manual_matches[[c for c in cols_needed if c in manual_matches.columns]]

# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and rejected
final_facilities_all = pd.concat(
    [
        # step 4: stratified geographic matches (if performed)
        matched_dhis2_prepgeoname[[c for c in cols_needed if c in matched_dhis2_prepgeoname.columns]],
        # step 5: exact matches after standardization
        matched_dhis2[[c for c in cols_needed if c in matched_dhis2.columns]],
        # step 6-8: complete matches from fuzzy matching
        fuzzy_matches[[c for c in cols_needed if c in fuzzy_matches.columns]],
        # manual matching (if performed)
        manual_matches,
    ],
    ignore_index=True,
)

# collapse to one record per DHIS2 facility for downstream integration
cols_to_drop = [c for c in ["match_flag", "hf"] if c in final_facilities_all.columns]
final_facilities_one_per_hf = final_facilities_all.drop(columns=cols_to_drop, errors="ignore")

cli_info(f"All facilities processed (rows): {len(final_facilities_all)}")
cli_success(
    f"All facilities processed (distinct DHIS2 by hf_uid_new): "
    f"{final_facilities_all['hf_uid_new'].nunique()}"
)

# generate summary by matching method
matching_summary = (
    final_facilities_one_per_hf
    .groupby("final_method", as_index=False)
    .agg(
        n_matched=("hf_dhis2_raw", "count"),
        avg_score=("score", "mean"),
        min_score=("score", "min"),
        max_score=("score", "max"),
    )
)

# display summary
cli_header("Matching Results by Method")
matching_summary

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Lines 3–22: The manual_matches logic processes both matched and rejected facilities from final_incorp. Replace final_incorp with the manual review results dataset name.
Lines 12–15: Facilities rejected in manual review get final_method = "Manual Review - Rejected" and score = 0. Adjust these labels if needed.
Lines 28–37: pd.concat combines all matching approaches. Add or remove dataframes based on which steps were used.

Step 9.2: Create final integrated DHIS2-MFL dataset

Create the final analytical dataset by starting with DHIS2 as the base and enriching it with MFL data where matches exist. This ensures no DHIS2 facilities are lost while adding valuable MFL attributes.

Show the code

# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final <- dhis2_df |>
  dplyr::left_join(
    dhis2_map,
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  )

# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf <- final_facilities_one_per_hf |>
  dplyr::select(hf_uid_new, hf_mfl_raw)

# create the integrated dataset
final_dhis2_mfl_df <- dhis2_df_final |>
  dplyr::left_join(final_match_per_hf, by = "hf_uid_new") |>
  # join MFL attributes for matched facilities
  dplyr::left_join(
    dplyr::select(master_hf_df, -hf_mfl, -hf, -adm0, -adm1, -adm2, -adm3),
    by = "hf_mfl_raw"
  )

# counts on distinct facilities via hf_uid_new (robust to row-level
# duplication)
total <- dhis2_df_final |> dplyr::distinct(hf_uid_new) |> nrow()
with_mfl <- final_match_per_hf |>
  dplyr::filter(!is.na(hf_mfl_raw)) |>
  nrow()
without_mfl <- total - with_mfl

# validation summary
cli::cli_h2("DHIS2-MFL Integration Complete")
cli::cli_alert_success(
  "Total DHIS2 facilities preserved: {total}"
)
cli::cli_alert_info(
  "Facilities with MFL data: {with_mfl}"
)
cli::cli_alert_warning(
  "Facilities without MFL match: {without_mfl}"
)
cli::cli_alert_info("Total rows (facility-months): {nrow(final_dhis2_mfl_df)}")

Output

── DHIS2-MFL Integration Complete ──

✔ Total DHIS2 facilities preserved: 1771

ℹ Facilities with MFL data: 1302

! Facilities without MFL match: 469

ℹ Total rows (facility-months): 184056

To adapt the code:

Line 5: Change hf to match the DHIS2 facility name column.
Line 17: Adjust which MFL columns to exclude (we remove geographic columns to avoid conflicts with DHIS2’s geography).
Line 18: Ensure hf_mfl_raw matches the MFL facility name column from the matching results.

Show the code

# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final = dhis2_df.merge(
    dhis2_map,
    on=["adm0", "adm1", "adm2", "adm3", "hf"],
    how="left",
)

# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf = final_facilities_one_per_hf[["hf_uid_new", "hf_mfl_raw"]].copy()

# create the integrated dataset
final_dhis2_mfl_df = (
    dhis2_df_final
    .merge(final_match_per_hf, on="hf_uid_new", how="left")
    # join MFL attributes for matched facilities
    .merge(
        master_hf_df.drop(
            columns=["hf_mfl", "hf", "adm0", "adm1", "adm2", "adm3"],
            errors="ignore",
        ),
        on="hf_mfl_raw",
        how="left",
    )
)

# counts on distinct facilities via hf_uid_new
total = dhis2_df_final["hf_uid_new"].nunique()
with_mfl = final_match_per_hf.dropna(subset=["hf_mfl_raw"])["hf_uid_new"].nunique()
without_mfl = total - with_mfl

# validation summary
cli_header("DHIS2-MFL Integration Complete")
cli_success(f"Total DHIS2 facilities preserved: {total}")
cli_info(f"Facilities with MFL data: {with_mfl}")
cli_warning(f"Facilities without MFL match: {without_mfl}")
cli_info(f"Total rows (facility-months): {len(final_dhis2_mfl_df)}")

Output


DHIS2-MFL Integration Complete

SUCCESS: Total DHIS2 facilities preserved: 1771

INFO: Facilities with MFL data: 91

WARNING: Facilities without MFL match: 1680

INFO: Total rows (facility-months): 184056

To adapt the code:

Line 5: Change hf to match the DHIS2 facility name column.
Lines 19–22: Adjust which MFL columns to exclude (we remove geographic columns to avoid conflicts with DHIS2’s geography).
Line 23: Ensure hf_mfl_raw matches the MFL facility name column from the matching results.

Step 10: Final Checks

Perform critical quality checks before finalizing the matching workflow. This step identifies potential data quality issues that require SNT team review: one-to-many matches (which may indicate duplicate facilities or generic names) and unmatched facilities (which could represent new facilities, closed facilities, or data quality issues). These checks ensure match quality and create actionable review lists for the SNT team.

Step 10.1: Detecting one-to-many matches for review

After finalization, check for MFL facilities matched to multiple DHIS2 facilities, which may indicate data quality issues requiring review:

Show the code

# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many <- final_facilities_all |>
  dplyr::filter(!is.na(hf_mfl_raw)) |> # only matched facilities
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |> # MFL matched to >1 DHIS2
  dplyr::ungroup() |>
  dplyr::arrange(hf_mfl_raw, hf_dhis2_raw)

if (nrow(one_to_many) > 0) {
  cli::cli_alert_warning(
    paste0(
      "One-to-many matches detected: {nrow(one_to_many)} DHIS2 facilities ",
      "share {dplyr::n_distinct(one_to_many$hf_mfl_raw)} MFL facilities"
    )
  )

  # show sample
  knitr::kable(
    one_to_many |>
      dplyr::select(hf_dhis2_raw, hf_mfl_raw, score, final_method) |>
      dplyr::slice_head(n = 6),
    caption = "Sample one-to-many matches requiring review"
  )
} else {
  cli::cli_alert_success("No one-to-many matches detected")
}

Output

! One-to-many matches detected: 210 DHIS2 facilities share 101 MFL facilities

Sample one-to-many matches requiring review
hf_dhis2_raw	hf_mfl_raw	score	final_method
ADAMA MARTH MEMORIAL CHC	Adama Marth Memorial CHC	100.00000	Matched Without Fuzzy Matching (standardization)
Adama Martha Memorial CHC	Adama Marth Memorial CHC	100.00000	Interactive Stratified Geographic Matching
ARAB PORT LOKO CLINIC	Arab Port Loko Clinic	100.00000	Matched Without Fuzzy Matching (standardization)
ROKEL ARAB WATERLOO CLINIC	Arab Port Loko Clinic	86.55372	Fuzzy-matched using: Jaro-Winkler
ARAB MAGBEMA CLINIC	Arab Clinic	85.96491	Fuzzy-matched using: Jaro-Winkler
ARAB MAKENI CLINIC	Arab Clinic	87.03704	Fuzzy-matched using: Jaro-Winkler

To adapt the code:

[Code removed]: Modify output path for the review file as needed (no file export in this chunk).
Lines 19–23: Adjust which columns to include in the review file.

Show the code

# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many = (
    final_facilities_all
    .dropna(subset=["hf_mfl_raw"])  # only matched facilities
    .groupby("hf_mfl_raw")
    .filter(lambda g: len(g) > 1)  # MFL matched to >1 DHIS2
    .sort_values(["hf_mfl_raw", "hf_dhis2_raw"])
)

if len(one_to_many) > 0:
    cli_warning(
        f"One-to-many matches detected: {len(one_to_many)} DHIS2 facilities "
        f"share {one_to_many['hf_mfl_raw'].nunique()} MFL facilities"
    )

    # show sample
    print(
        one_to_many[["hf_dhis2_raw", "hf_mfl_raw", "score", "final_method"]]
        .head(6)
        .to_string(index=False)
    )
else:
    cli_success("No one-to-many matches detected")

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Lines 15–19: Adjust which columns to include in the review output.

One-to-many matches often signal underlying issues in the data. They may arise from duplicate or near-duplicate facility records in DHIS2, or from generic names in the MFL that align with multiple, more specific DHIS2 entries. In other cases, they reflect inconsistent naming conventions that need further standardization before integration.

Step 10.2: Unmatched facilities analysis

Analyze facilities that remain unmatched to identify potential reasons and create actionable review lists for the SNT team. Unmatched facilities could represent new facilities, closed facilities, or facilities with significant data quality issues.

Show the code

# analyze unmatched facilities by reason
unmatched_analysis <- final_facilities_all |>
  dplyr::filter(is.na(hf_mfl_raw)) |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    count = n(),
    .groups = "drop"
  ) |>
  dplyr::arrange(desc(count))

# display unmatched analysis
cli::cli_h2("Unmatched Facilities Analysis")
cli::cli_alert_info(
  "Total unmatched facilities: {sum(unmatched_analysis$count)}"
)

# show breakdown by reason
purrr::pwalk(
  unmatched_analysis,
  ~ cli::cli_alert_warning("{.x}: {.y} facilities")
)

# create detailed unmatched facilities export
unmatched_detailed <- final_facilities_one_per_hf |>
  # unmatched are those without an MFL linkage
  dplyr::filter(is.na(hf_mfl_raw)) |>
  # select required context and scoring/method fields
  dplyr::select(
    hf_dhis2_raw,
    adm1,
    adm2,
    adm3,
    final_method,
    score
  ) |>
  # derive a clear potential reason for review
  dplyr::mutate(
    potential_reason = dplyr::case_when(
      is.na(final_method) ~ "Never matched - possible new facility",
      final_method == "Manual Review - Rejected" ~
        "Rejected in review - possible different facility",
      score > 0 & score < 50 ~ "Low similarity - possible data quality issue",
      TRUE ~ "Unmatched - needs investigation"
    )
  ) |>
  dplyr::arrange(adm1, adm2, adm3, hf_dhis2_raw)

# export for SNT team review
readr::write_csv(
  unmatched_detailed,
  here::here("03_outputs", "unmatched_facilities_for_snt_review.csv")
)

cli::cli_alert_success(
  "Exported {nrow(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli::cli_text("File: outputs/unmatched_facilities_for_snt_review.csv")

Output

── Unmatched Facilities Analysis ──

ℹ Total unmatched facilities: 149

Breakdown of unmatched facilities by reason
Reason	Count
Manual Review - Rejected	149

These unmatched facilities may represent:

To adapt the code:

Lines 3–9: Analysis groups unmatched facilities by reason (final_method). Customize grouping as needed.
Lines 23–46: Detailed export includes geographic context and potential reasons. Modify columns as needed.
Lines 49–52: Export path for SNT team review can be customized.
Lines 38–44: The potential reasons logic can be adjusted based on the specific workflow patterns.

Show the code

# analyze unmatched facilities by reason
unmatched_analysis = (
    final_facilities_all
    .loc[final_facilities_all["hf_mfl_raw"].isna()]
    .groupby("final_method", as_index=False)
    .size()
    .rename(columns={"size": "count"})
    .sort_values("count", ascending=False)
)

# display unmatched analysis
cli_header("Unmatched Facilities Analysis")
cli_info(f"Total unmatched facilities: {unmatched_analysis['count'].sum()}")

# show breakdown by reason
for _, row in unmatched_analysis.iterrows():
    cli_warning(f"{row['final_method']}: {row['count']} facilities")

# create detailed unmatched facilities export
unmatched_detailed = (
    final_facilities_one_per_hf
    .loc[final_facilities_one_per_hf["hf_mfl_raw"].isna()]
    [["hf_dhis2_raw", "adm1", "adm2", "adm3", "final_method", "score"]]
    .assign(
        potential_reason=lambda d: np.select(
            [
                d["final_method"].isna(),
                d["final_method"] == "Manual Review - Rejected",
                (d["score"] > 0) & (d["score"] < 50),
            ],
            [
                "Never matched - possible new facility",
                "Rejected in review - possible different facility",
                "Low similarity - possible data quality issue",
            ],
            default="Unmatched - needs investigation",
        )
    )
    .sort_values(["adm1", "adm2", "adm3", "hf_dhis2_raw"])
)

# export for SNT team review
unmatched_export_dir = Path(here("03_outputs"))
unmatched_export_dir.mkdir(parents=True, exist_ok=True)
unmatched_detailed.to_csv(
    unmatched_export_dir / "unmatched_facilities_for_snt_review.csv", index=False
)

cli_success(
    f"Exported {len(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli_info("File: outputs/unmatched_facilities_for_snt_review.csv")

Output

<IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object>

To adapt the code:

Lines 3–9: Analysis groups unmatched facilities by reason (final_method). Customize grouping as needed.
Lines 22–37: Detailed export includes geographic context and potential reasons. Modify columns as needed.
Lines 41–43: Export path for SNT team review can be customized.
Lines 29–35: The potential reasons logic can be adjusted based on the specific workflow patterns.

This check helps the SNT team focus their review on the most important cases. Some unmatched facilities may represent new sites that should be added to the MFL, while others point to data quality issues where name variations need correction. A portion may be closed facilities that need to be marked inactive. The results can also highlight geographic patterns, showing areas where unmatched facilities are more common and may require targeted review.

Step 11: Save Final Datasets

Save all final datasets for analysis, reporting, and SNT team review. This step consolidates all export operations and creates the outputs needed for downstream use.

Show the code

# save main integrated DHIS2-MFL dataset
rio::export(
  final_dhis2_mfl_df,
  here::here("03_outputs", "final_dhis2_mfl_integrated.xlsx")
)

# save matching results summary
rio::export(
  final_facilities_all,
  here::here("03_outputs", "facility_matching_results.xlsx")
)

if (nrow(unmatched_facilities) > 0) {
  readr::write_csv(
    unmatched_facilities,
    here::here("03_outputs", "unmatched_dhis2_facilities.csv")
  )
}

if (nrow(one_to_many) > 0) {
  readr::write_csv(
    one_to_many,
    here::here("03_outputs", "one_to_many_matches_for_review.csv")
  )
}

# summary of saved files
cli::cli_h2("Final Datasets Saved")
cli::cli_alert_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli::cli_alert_success("Matching results: facility_matching_results.xlsx")
cli::cli_alert_info("Unmatched facilities: {nrow(unmatched_facilities)} exported")
cli::cli_alert_info("One-to-many matches: {nrow(one_to_many)} exported")

Output

To adapt the code:

Lines 3–7: Main integrated dataset path can be customized for the project structure.
Lines 9–13: Matching results summary export path.
Lines 14–19: Unmatched facilities export - modify path as needed.
Lines 21–26: One-to-many matches export - automatic detection and export.
Lines 29–33: Summary messages can be customized.

Show the code

# save main integrated DHIS2-MFL dataset
output_dir = Path(here("03_outputs"))
output_dir.mkdir(parents=True, exist_ok=True)

final_dhis2_mfl_df.to_excel(
    output_dir / "final_dhis2_mfl_integrated.xlsx", index=False
)

# save matching results summary
final_facilities_all.to_excel(
    output_dir / "facility_matching_results.xlsx", index=False
)

if len(unmatched_detailed) > 0:
    unmatched_detailed.to_csv(
        output_dir / "unmatched_dhis2_facilities.csv", index=False
    )

if len(one_to_many) > 0:
    one_to_many.to_csv(
        output_dir / "one_to_many_matches_for_review.csv", index=False
    )

# summary of saved files
cli_header("Final Datasets Saved")
cli_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli_success("Matching results: facility_matching_results.xlsx")
cli_info(f"Unmatched facilities: {len(unmatched_detailed)} exported")
cli_info(f"One-to-many matches: {len(one_to_many)} exported")

Output

To adapt the code:

Lines 3–6: Main integrated dataset path can be customized for the project structure.
Lines 8–10: Matching results summary export path.
Lines 12–14: Unmatched facilities export — modify path as needed.
Lines 16–18: One-to-many matches export — automatic detection and export.
Lines 21–24: Summary messages can be customized.

Summary

This guide outlines a structured approach to matching health facility names between DHIS2 datasets and the Master Facility List (MFL). It features a two-stage workflow designed for efficiency and accuracy. Users may implement the full workflow or apply fuzzy matching directly if only basic string similarity is required.

Stage 1: Interactive Stratified Geographic Matching uses prep_geonames() (available as sntutils::prep_geonames() in R and sntutils.geo.prep_geonames() in Python) to combine human validation with geographic constraints. It typically resolves 60–70% of matches with high confidence, exposes misaligned facilities early, and provides opportunities for immediate review of suggested matches.

Stage 2: Comprehensive Fuzzy Matching addresses the remaining unmatched facilities through cleaning and standardization of names, applying multiple string similarity algorithms (Jaro-Winkler, Levenshtein, Qgram, LCS), and checking structural quality. This helps capture spelling variations, abbreviations, and geographic misplacements that prevent exact matching.

The two-stage approach resolves most matches through stratified geographic matching, reducing manual review and leaving only difficult cases for fuzzy matching. It detects misalignments early, improves confidence through human validation, and remains flexible for both interactive and automated use. By combining multiple strategies, the approach achieves broad coverage while maintaining quality.

For facilities that require fuzzy matching, four strategies are available: single-method matching, composite scoring, weighted composites, and fallback loops with progressive thresholds. Each is suited to different data contexts and confidence needs. Transparency and comparability are built into the process, but validation with the SNT team remains essential for low-confidence matches and critical applications.

Full Code

Find the full code script for fuzzy matching of health facility names below.

Show full code

################################################################################
############ ~ Fuzzy matching of names across datasets full code ~ #############
################################################################################

### Step 1: Install and Load Required Libraries --------------------------------

# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load all required packages using pacman
pacman::p_load(
  readxl,     # for reading Excel files
  dplyr,      # for data manipulation
  stringdist, # for calculating string distances (fuzzy matching)
  tibble,     # for working with modern data frames
  knitr,      # for creating formatted tables
  openxlsx,   # for writing Excel files
  httr,       # for HTTP requests to download files (optional)
  here        # for cross-platform file paths
)

### Step 2: Load Data ----------------------------------------------------------

# set up path to dhis2 hf data
dhis2_path <- here::here(
  "01_data",
  "1.2_epidemiology",
  "1.2a_routine_surveillance",
  "processed"
)

hf_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed"
)

# read DHIS2 health facilities data
dhis2_df <- readRDS(
  here::here(dhis2_path, "sle_dhis2_with_clean_adm3.rds")
) |>
  # rename facility name columns for clarity and consistency
  dplyr::rename(hf_dhis2_raw = hf)

# get distinct admin and hf cols
dhis2_hf_df <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw)

# read MFL health facilities data
master_hf_df <- read.csv(
  here::here(hf_path, "hf_final_clean_data.csv")
) |>
   dplyr::distinct(
    adm0, adm1, adm2, adm3, hf, lat, long, .keep_all = TRUE
  ) |>
  # rename facility name columns for clarity and consistency
  dplyr::mutate(hf_mfl_raw = hf)

# attach a stable DHIS2 facility ID for consistent counting across
# steps
# distinct by geography + name to avoid collisions across admins
dhis2_map <- dhis2_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf) |>
  dplyr::mutate(
    hf_uid_new = sntutils::vdigest(
      paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf)),
        sep = "|"
      ),
      algo = "xxhash32"
    )
  )

# print head of data
cli::cli_h3("Sample of DHIS2 data:")
head(dhis2_hf_df)

cli::cli_h3("Sample of MFL data:")
head(master_hf_df)

### Step 3: Initial Matching Diagnostics ---------------------------------------

#### Step 3.1: Overall exact match check ---------------------------------------

# check for exact matches on raw names (no admin constraint)
exact_matches_all <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw"),
    relationship = "many-to-many"
  )

# calculate matching potential
total_dhis2 <- nrow(dhis2_hf_df)
total_mfl <- nrow(master_hf_df)
unmatched_dhis2 <- total_dhis2 - nrow(exact_matches_all)

cli::cli_h3("Overall Matching Summary")
cli::cli_alert_info("Total DHIS2 facilities: {total_dhis2}")
cli::cli_alert_info("Total MFL facilities: {total_mfl}")
cli::cli_alert_success(
  paste0(
    "Exact matches found: {nrow(exact_matches_all)} (",
    "{round(nrow(exact_matches_all)/total_dhis2*100, 1)}%)
  )
)
cli::cli_alert_warning("Remaining to match: {unmatched_dhis2}")

#### Step 3.2: Administrative level match check --------------------------------

# check matches at adm2 level (district)
dhis2_by_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(total_dhis2 = dplyr::n())

matches_by_adm2 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2")
  ) |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(exact_matches = dplyr::n()) |>
  dplyr::left_join(dhis2_by_adm2, by = "adm2") |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::select(adm2, exact_matches, total_dhis2, match_rate) |>
  dplyr::arrange(dplyr::desc(match_rate))

cli::cli_h3("Exact matches by district (adm2))
matches_by_adm2

# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(total_dhis2 = dplyr::n(), .groups = "drop")

matches_by_adm3 <- dhis2_hf_df |>
  dplyr::inner_join(
    master_hf_df,
    by = c("hf_dhis2_raw" = "hf_mfl_raw", "adm2", "adm3")
  ) |>
  dplyr::group_by(adm2, adm3) |>
  dplyr::summarise(exact_matches = dplyr::n(), .groups = "drop") |>
  dplyr::left_join(dhis2_by_adm3, by = c("adm2", "adm3")) |>
  dplyr::mutate(
    match_rate = round(exact_matches / total_dhis2 * 100, 1)
  ) |>
  dplyr::filter(total_dhis2 >= 5) |> # only show areas with 5+ facilities
  dplyr::arrange(dplyr::desc(match_rate)) |>
  dplyr::slice_head(n = 10) # show top 10 performing adm3 areas

cli::cli_h3("Top exact matches by chiefdom (adm3))
matches_by_adm3

#### Step 3.3: Duplicate name check --------------------------------------------

# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 <- dhis2_hf_df |>
  dplyr::group_by(adm2, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_dhis2_raw)

mfl_dups_adm2 <- master_hf_df |>
  dplyr::group_by(adm2, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, hf_mfl_raw)

cli::cli_h3("Duplicates within same district (adm2)")
cli::cli_alert_warning(
  paste0(
    "DHIS2 duplicates within districts: ",
    "{length(unique(dhis2_dups_adm2$hf_dhis2_raw))}"
  )
)
cli::cli_alert_warning(
  paste0(
    "MFL duplicates within districts: ",
    "{length(unique(mfl_dups_adm2$hf_mfl_raw))}"
  )
)

# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 <- dhis2_hf_df |>
  dplyr::group_by(adm2, adm3, hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_dhis2_raw)

mfl_dups_adm3 <- master_hf_df |>
  dplyr::group_by(adm2, adm3, hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(adm2, adm3, hf_mfl_raw)

cli::cli_h3("Duplicates within same chiefdom (adm3)")
cli::cli_alert_danger(
  paste0(
    "DHIS2 duplicates within chiefdoms: ",
    "{length(unique(dhis2_dups_adm3$hf_dhis2_raw))}"
  )
)
cli::cli_alert_danger(
  paste0(
    "MFL duplicates within chiefdoms: ",
    "{length(unique(mfl_dups_adm3$hf_mfl_raw))}"
  )
)

# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall <- dhis2_hf_df |>
  dplyr::group_by(hf_dhis2_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_dhis2_raw)

mfl_dups_overall <- master_hf_df |>
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::arrange(hf_mfl_raw)

cli::cli_h3("Overall duplicate names (across all areas)")
cli::cli_alert_info(
  "DHIS2 overall duplicates: {length(
    unique(dhis2_dups_overall$hf_dhis2_raw)
  )}"
)
cli::cli_alert_info(
  "MFL overall duplicates: {length(
    unique(mfl_dups_overall$hf_mfl_raw)
  )}"
)

### Step 4: Interactive Stratified Geographic Matching -------------------------

# inner join (keep only matched polygons)
# set up location to save cache
cache_loc <- "01_data/1.1_foundational/1.1f_cache_files/processed"

# interactive stratified matching with automatic standardization
# this function handles name standardization internally
dhis2_df_cleaned <-
  sntutils::prep_geonames(
    target_df = dhis2_hf_df, # dataset to be cleaned
    lookup_df = master_hf_df, # reference dataset with correct admin
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    level4 = "hf",
    cache_path = here::here(cache_loc, "geoname_cache.rds"),
    unmatched_export_path = here::here(cache_loc, "dhis2_hf_unmatched.rds")
  )

# load the unmatched facilities for further processing (step 5–8 fuzzy
# matching)
dhis2_hf_to_process <- readRDS(
  here::here(cache_loc, "dhis2_hf_unmatched.rds")
) |>
  dplyr::select(adm0, adm1, adm2, adm3, hf_dhis2_raw = hf)

# summary statistics
n_original <- nrow(dhis2_hf_df)
n_matched <- n_original - nrow(dhis2_hf_to_process)
match_rate <- (n_matched / n_original) * 100

cli::cli_alert_success(
  paste0(
    "Stratified matching completed: ",
    "{format(n_matched, big.mark = ',')}/{format(n_original, big.mark = ',')}",
    " facilities matched ({round(match_rate, 1)}%)"
  )
)
cli::cli_alert_info(
  "Remaining unmatched: {nrow(dhis2_hf_to_process)} facilities"
)

### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------

#### Step 5.1: Standardize health facility names -------------------------------

# create function to standardize hf names
standardize_names <- function(name_vec) {
  # validate input
  if (!rlang::is_atomic(name_vec)) {
    cli::cli_abort("`name_vec` must be an atomic vector.")
  }

  name_vec |>
    # ensure character type
    as.character() |>
    # convert to lowercase
    stringr::str_to_lower() |>
    # replace punctuation with space
    stringr::str_replace_all("[[:punct:]]", " ") |>
    # remove extra spaces and trim
    stringr::str_squish() |>
    # normalize accents
    stringi::stri_trans_general("Latin-ASCII") |>
    # normalize all space-like characters
    stringi::stri_replace_all_regex("\\p{Zs}+", " ") |>
    # convert roman numerals to arabic numerals
    stringr::str_replace_all(
      c(
        " ix\\b" = " 9",
        " viii\\b" = " 8",
        " vii\\b" = " 7",
        " vi\\b" = " 6",
        " v\\b" = " 5",
        " iv\\b" = " 4",
        " iii\\b" = " 3",
        " ii\\b" = " 2",
        " i\\b" = " 1"
      )
    ) |>
    # sort tokens with letters first, numbers last; within groups
    # alphabetical
    purrr::map_chr(\(.x) {
      # split on one-or-more spaces
      tokens <- strsplit(.x, " +")[[1]]

      # detect pure numeric tokens
      is_num <- stringr::str_detect(tokens, "^[0-9]+$")

      # order alphabetic first, then numeric; sort within each group
      ordered <- c(sort(tokens[!is_num]), sort(tokens[is_num]))

      # rejoin
      paste(ordered, collapse = " ")
    })
}

# set up example with messy formatting
example_word <- factor("Clínica! Rahmâ    IV (  New clinic) East")

# print original structure
cat("\nExample before standardization:\n")
str(example_word)

# apply standardization
example_word_st <- standardize_names(example_word)

# print cleaned example
cat("\nExample after standardization:\n")
str(example_word_st)

# remove duplicates from MFL and format hf col
master_hf_df <- master_hf_df |>
  dplyr::distinct(hf_mfl_raw, .keep_all = TRUE) |>
  dplyr::mutate(hf_mfl = standardize_names(hf_mfl_raw))

# important: preserve original full dataset before processing unmatched
# (this is our base N)
dhis2_hf_df_original <- dhis2_hf_df |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw))

# build geo-aware stable ID once and attach to originals and unmatched
hf_uid_new_map <- dhis2_hf_df |>
  dplyr::distinct(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::mutate(
    hf_uid_new = sntutils::vdigest(
      paste(
        tolower(stringr::str_squish(adm0)),
        tolower(stringr::str_squish(adm1)),
        tolower(stringr::str_squish(adm2)),
        tolower(stringr::str_squish(adm3)),
        tolower(stringr::str_squish(hf_dhis2_raw)),
        sep = "|"
      ),
      algo = "xxhash32"
    )
  )

dhis2_hf_df_original <- dhis2_hf_df_original |>
  dplyr::left_join(
    hf_uid_new_map,
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched <- dhis2_hf_to_process |>
  dplyr::mutate(hf_dhis2 = standardize_names(hf_dhis2_raw)) |>
  dplyr::left_join(
    hf_uid_new_map |>
      dplyr::mutate(hf_dhis2_raw = toupper(hf_dhis2_raw)),
    by = c(
      "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"
    )
  )

knitr::kable(
  # check if it worked
  dhis2_hf_unmatched |>
    dplyr::distinct(hf_dhis2_raw, hf_dhis2) |>
    dplyr::slice_head(n = 10)
)

#### Step 5.2: Handle abbreviations in health facility names -------------------

abbrev_dictionary <-
  dplyr::bind_rows(
    dplyr::select(dhis2_hf_unmatched, hf = hf_dhis2_raw),
    dplyr::select( master_hf_df, hf = hf_mfl_raw)) |>
  # split into words
  tidyr::separate_rows(hf, sep = " ") |>
  # remove blanks
  dplyr::filter(hf != "") |>
  # detect 2-4 uppercase letter patterns
  dplyr::filter(stringr::str_detect(hf, "^[A-Z]{2,4}$")) |>
  # count frequencies
  dplyr::count(hf, sort = TRUE) |>
  # rename for clarity
  dplyr::rename(word = hf, freq = n) |>
  dplyr::filter(freq > 2) |>
  as.data.frame()

# check output
abbrev_dictionary

# define abbreviation dictionary (all lowercase, because we use the
# standardize column)
abbrev_dict <- c(
  "maternal child health post" = "mchp",
  "community health post" = "chp",
  "community health center" = "chc",
  "urban maternal clinic" = "umi",
  "expanded programme on immunization" = "epi"
)

# apply replacements to MFL names
master_hf_df <- master_hf_df |>
  dplyr::mutate(
    hf_mfl = stringr::str_replace_all(hf_mfl, abbrev_dict)
  )

# apply replacements to DHIS2 names
dhis2_hf_unmatched <- dhis2_hf_unmatched |>
  dplyr::mutate(
    hf_dhis2 = stringr::str_replace_all(hf_dhis2, abbrev_dict)
  )

# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched |>
  dplyr::filter(
    stringr::str_detect(
      hf_dhis2_raw,
      paste0(
        "(?i)Community Health Center|Maternal Child Health Post|",
        "Community Health Post"
      )
    ) |
      stringr::str_detect(hf_dhis2_raw, "CHC|MCHP|CHP")
  ) |>
  dplyr::select(hf_dhis2_raw, hf_dhis2) |>
  head()

### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------

#### Step 6.1: Identify matched and unmatched facilities -----------------------

# flag for one-to-one enforcement
enforce_one_to_one <- FALSE

# exact matches using raw names
matched_dhis2_raw <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw),
    by = c("hf_dhis2_raw" = "hf_mfl_raw")
  )

# exact matches using standardized names
matched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_dhis2_raw, hf_dhis2, hf_uid_new
  ) |>
  dplyr::inner_join(
    master_hf_df |>
      dplyr::select(hf_mfl_raw, hf_mfl),
    by = c("hf_dhis2" = "hf_mfl"),
    keep = TRUE
  ) |>
  # exclude facilities already matched via raw names
  dplyr::anti_join(matched_dhis2_raw, by = c("hf_dhis2_raw")) |>
  # label exact matches
  dplyr::mutate(
    final_method = paste0(
      "Matched Without Fuzzy Matching (standardization)"
    ),
    score = 100
  )

# only include facilities that were actually matched via geographic
# standardization
# these should be facilities that had geographic corrections applied
matched_dhis2_prepgeoname <-
  dhis2_df_cleaned |>
  dplyr::anti_join(
    dhis2_hf_unmatched,
    by = c("adm0", "adm1", "adm2", "adm3", "hf" = "hf_dhis2_raw")
  ) |>
  dplyr::left_join(
    master_hf_df |>
      dplyr::mutate(hf = toupper(hf)) |>
      dplyr::select(adm0, adm1, adm2, adm3, hf, hf_mfl_raw, hf_mfl),
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  ) |>
  # attach hf_uid_new from dhis2_map (by admin + hf)
  dplyr::left_join(
    dhis2_map |> dplyr::select(adm0, adm1, adm2, adm3, hf, hf_uid_new),
    by = c("adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw" = "hf")
  ) |>
  # label matches
  dplyr::mutate(
    final_method = "Interactive Stratified Geographic Matching",
    score = 100
  )

# compute unmatched after standardization
unmatched_dhis2 <- dhis2_hf_unmatched |>
  dplyr::select(hf_dhis2) |>
  dplyr::anti_join(
    master_hf_df |>
      dplyr::select(hf_mfl),
    by = c("hf_dhis2" = "hf_mfl")
  ) |>
  dplyr::distinct(hf_dhis2)

# collect already matched mfl to enforce one-to-one in candidate pool
used_mfl_stand <- matched_dhis2 |>
  dplyr::pull(hf_mfl) |>
  unique()

use_mfl_prepgeoname <- matched_dhis2_prepgeoname |>
  dplyr::pull(hf_mfl) |>
  unique()

used_mfl <- c(used_mfl_stand, use_mfl_prepgeoname)

# build candidate MFL pool for fuzzy step
candidate_mfl_df <- master_hf_df |>
  dplyr::select(hf_mfl)

# if enforcing one-to-one, drop MFLs already used by exact matches
if (enforce_one_to_one) {
  candidate_mfl_df <- candidate_mfl_df |>
    dplyr::filter(!hf_mfl %in% used_mfl)
}

# summary counts
total_dhis2_hf <- dplyr::n_distinct(dhis2_hf_unmatched$hf_dhis2_raw)
raw_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2_raw$hf_dhis2_raw)
raw_unmatch_dhis2_hf <- total_dhis2_hf - raw_match_dhis2_hf

standardized_match_dhis2_hf <- dplyr::n_distinct(matched_dhis2$hf_dhis2)
standardized_unmatch_dhis2_hf <- total_dhis2_hf - standardized_match_dhis2_hf

total_mfl_hf <- dplyr::n_distinct(master_hf_df$hf_mfl)
candidate_mfl_count <- dplyr::n_distinct(candidate_mfl_df$hf_mfl)

# display summary
cli::cli_h2("Summary of Matching Status")

cli::cli_alert_info(
  "Total DHIS2 facilities: {format(total_dhis2_hf, big.mark = ',')}"
)

cli::cli_alert_success(
  paste0(
    "Matched after standardization: ",
    "{format(standardized_match_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched using raw names: ",
    "{format(raw_unmatch_dhis2_hf, big.mark = ',')}"
  )
)
cli::cli_alert_danger(
  paste0(
    "Unmatched after standardization: ",
    "{format(standardized_unmatch_dhis2_hf, big.mark = ',')}"
  )
)

cli::cli_alert_info(
  paste0(
    "MFL facilities in candidate pool for matching: ",
    "{format(candidate_mfl_count, big.mark = ',')} of ",
    "{format(total_mfl_hf, big.mark = ',')}"
  )
)

#### Step 6.1: Create the matching grid ----------------------------------------

# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities.

# create simple candidate match pool
candidate_match_pool <- tidyr::crossing(
  unmatched_dhis2,
  candidate_mfl_df
)

# create match grid with dhis2 and mfl names
match_grid <- candidate_match_pool

# preview sample - top 10 matches
match_grid |>
  dplyr::slice_head(n = 10)

#### Step 6.2: Compute similarity scores ---------------------------------------

# calculate fuzzy scores
match_grid <- match_grid |>
  dplyr::mutate(
    len_max = pmax(nchar(hf_dhis2), nchar(hf_mfl)),
    score_jw = 1 - stringdist::stringdist(hf_dhis2, hf_mfl, method = "jw"),
    score_lv = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lv") / len_max,
    score_qg = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "qgram") / len_max,
    score_lcs = 1 -
      stringdist::stringdist(hf_dhis2, hf_mfl, method = "lcs") / len_max
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("score_"),
      .fns = ~ ifelse(is.nan(.x) | .x < 0, 0, .x)
    ),
    dplyr::across(
      .cols = dplyr::contains("score_"),
      .fns = ~ .x * 100
    )
  )

#### Step 6.3: Create composite scores -----------------------------------------

# define similarity score columns
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

# compute composite_score dynamically using score_cols
match_grid <- match_grid |>
  dplyr::mutate(
    composite_score = rowMeans(
      dplyr::across(dplyr::all_of(score_cols))
    )
  )

# compute average rank across all similarity methods
ranked_grid <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::mutate(
    dplyr::across(
      dplyr::all_of(score_cols),
      ~ dplyr::min_rank(dplyr::desc(.)),
      .names = "rank_{.col}"
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    rank_avg = rowSums(dplyr::across(
      dplyr::all_of(paste0("rank_", score_cols))
    )),
    rank_avg = round(rank_avg / length(score_cols))
  )

# preview ranked results to show multiple candidates per DHIS2 facility
# select a few facilities and show their top 3 candidates each
sample_facilities <- c(
  "charity clinic kamba of",
  "arab clinic shad",
  "al arab clinic sheefa"
)

ranked_grid |>
  dplyr::filter(hf_dhis2 %in% sample_facilities) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, n = 4, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    dplyr::all_of(score_cols),
    composite_score,
    rank_avg
  ) |>
  dplyr::arrange(hf_dhis2, rank_avg)

#### Step 6.4: Extract best match ----------------------------------------------

# for each fuzzy matching method, we select the top-ranked MFL name
# per DHIS2 hf.
best_jw <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_jw, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Jaro-Winkler") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_jw, method
  )

best_lv <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lv, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Levenshtein") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lv, method
  )

best_qg <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_qg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Qgram") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_qg, method
  )

best_lcs <- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(score_lcs, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "LCS") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = score_lcs, method
  )

best_comp<- match_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_max(composite_score, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Composite-Score") |>
  dplyr::select(
    hf_dhis2, hf_mfl, score = composite_score, method
  )

best_ranked_match <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::slice_min(rank_avg, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(method = "Rank-Ensemble")  |>
    dplyr::select(
    hf_dhis2, hf_mfl, score = rank_avg, method
  )

# combine top matches from all methods for comparison across approaches
all_best <- dplyr::bind_rows(
  best_jw, best_lv, best_qg, best_lcs,
  best_comp, best_ranked_match)

# preview best matches to show extraction results
all_best |>
  dplyr::slice_head(n = 10) |>
  dplyr::select(hf_dhis2, hf_mfl, score, method)

### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------

#### Step 7.1: Visualize score distributions across methods --------------------

# plot score dist across methods
score_dist_plot <- all_best |>
  dplyr::filter(method != "Rank-Ensemble") |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_density(fill = "steelblue", alpha = 0.6) +

  ggplot2::facet_wrap(~method) +
  ggplot2::geom_density(color = "steelblue", linewidth = 1) +
  ggplot2::labs(
    title = "Distribution of Fuzzy Matching Scores",
    x = "\nMatch Score (%)",
    y = "Density\n"
  ) +
  ggplot2::theme_minimal(base_size = 14)

# save plot
ggplot2::ggsave(
  plot = score_dist_plot,
  filename = here::here("03_output/3a_figures/u5mr_sle_adm2.png"),
  width = 12,
  height = 9,
  dpi = 300
)

#### Step 7.2: Define match quality diagnostic function ------------------------

assess_match_quality <- function(name1, name2) {
  purrr::map2_dfr(name1, name2, function(a, b) {
    tokens1 <- strsplit(a, "\\s+")[[1]]
    tokens2 <- strsplit(b, "\\s+")[[1]]
    tibble::tibble(
      prefix_match = tolower(tokens1[1]) == tolower(tokens2[1]),
      suffix_match = tolower(tail(tokens1, 1)) == tolower(tail(tokens2, 1)),
      token_diff = abs(length(tokens1) - length(tokens2)),
      char_diff = abs(nchar(a) - nchar(b))
    )
  })
}

assess_match_quality("Makeni Govt Hospital", "Makeni Government Hospital")

#### Step 7.3: Evaluate quality of top fuzzy matches ---------------------------

# assess structural match quality
diagnostics_df <- dplyr::bind_cols(
  all_best,
  assess_match_quality(all_best$hf_dhis2, all_best$hf_mfl)
)

# compare method diagnostics
summary_stats <- diagnostics_df |>
  dplyr::group_by(method) |>
  dplyr::summarise(
    score = mean(score),
    avg_token_diff = mean(token_diff) |> round(2),
    avg_char_diff = mean(char_diff) |> round(2),
    pct_prefix_match = (mean(prefix_match) * 100) |> round(2),
    pct_suffix_match = (mean(suffix_match) * 100) |> round(2),
    total = dplyr::n(),
    .groups = "drop"
  )

# create a a final overall score
summary_stats <-  summary_stats |>
  dplyr::mutate(
    # rescale negative of average token difference (smaller is better)
    token_score = scales::rescale(-avg_token_diff, to = c(0, 100)),
    # rescale negative of average character difference (smaller is better)
    char_score = scales::rescale(-avg_char_diff, to = c(0, 100)),
    # rescale prefix match percentage (higher is better)
    prefix_score = scales::rescale(pct_prefix_match, to = c(0, 100)),
    # rescale suffix match percentage (higher is better)
    suffix_score = scales::rescale(pct_suffix_match, to = c(0, 100)),

    # combine all four metrics into a weighted structural quality score
    structure_score = round(
      0.3 * token_score +         # emphasize fewer token differences
        0.2 * char_score +        # moderate weight on character similarity
        0.25 * prefix_score +     # give weight to matching initial words
        0.25 * suffix_score,      # give equal weight to matching final words
      1
    )
  ) |>
  dplyr::arrange(desc(structure_score)) |>
  # assign rank based on descending structure score
  dplyr::mutate(rank = dplyr::row_number()) |>
  dplyr::select(
    method, avg_token_diff, avg_char_diff,
    pct_prefix_match, pct_suffix_match,
    total, structure_score, rank
  )

# check results
summary_stats |>
  dplyr::select(
    Method = method,
    `Avg. Token Difference` = avg_token_diff,
    `Avg. Character Difference` = avg_char_diff,
    `% Prefix Match` = pct_prefix_match,
    `% Suffix Match` = pct_suffix_match,
    `Structural Score` = structure_score,
      Rank = rank
  ) |> as.data.frame()

#### Step 7.4: Weighted thresholding using structural quality and false-positive risk

# calculate method-specific weights and thresholds
method_threshold <- summary_stats |>
  dplyr::mutate(
    score = structure_score / 100,
    weight = score / sum(score),
    threshold = scales::rescale(
      structure_score,
      to = rev(unname(
        stats::quantile(all_best$score, c(0.70, 0.95), na.rm = TRUE)
      ))
    ) |>
      round()
  ) |>
  dplyr::select(method, weight, threshold)

# display results
method_threshold

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.1: Match selection options -----------------------------------------

# apply single-method threshold and flag matches
best_lv_final <- best_lv |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review"),
    final_method = "Levenshtein"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(hf_dhis2, hf_mfl, hf_mfl_raw, score, final_method, match_flag)

# show top 5 and bottom 5 matches by score
top_5 <- best_lv_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- best_lv_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Levenshtein Score")

# preview results
top_bottom

# select best match using rank average
composite_final <- ranked_grid |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(rank_avg) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(composite_score >= 85, "match", "review")
  ) |>
  dplyr::select(
    adm1,
    adm2,
    hf_dhis2,
    hf_mfl,
    composite_score,
    rank_avg,
    match_flag
  )

# show top 5 and bottom 5 matches by composite score
top_5 <- composite_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- composite_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

# combine into a single table
top_bottom <- dplyr::bind_rows(top_5, bottom_5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Composite Score")

# preview results
top_bottom

# define score columns and extract weights
score_cols <- c("score_jw", "score_lv", "score_qg", "score_lcs")

weights <- method_threshold |>
  dplyr::filter(
    method %in% c("Jaro-Winkler", "Levenshtein", "Qgram", "LCS")
  ) |>
  dplyr::pull(weight)

# calculate weighted composite score for each facility-match pair
weighted_final <- ranked_grid |>
  dplyr::rowwise() |>
  dplyr::mutate(
    # weighted average maintains 0-100 scale
    weighted_composite_score = stats::weighted.mean(
      c(score_jw, score_lv, score_qg, score_lcs),
      weights,
      na.rm = TRUE
    )
  ) |>
  dplyr::ungroup() |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::arrange(dplyr::desc(weighted_composite_score)) |>
  dplyr::slice_head(n = 1) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    match_flag = dplyr::if_else(
      weighted_composite_score >= 85,
      "match",
      "review"
    ),
    score = weighted_composite_score, # standardize column name for Step 7
    final_method = "Weighted-Composite"
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::select(
    hf_dhis2,
    hf_mfl,
    hf_mfl_raw,
    score,
    final_method,
    match_flag
  )

# show top 5 and bottom 5 matches by weighted composite score
top_5 <- weighted_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- weighted_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Weighted Composite Score")

# preview results
bottom_5

# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats <- summary_stats |>
  dplyr::arrange(dplyr::desc(structure_score)) |>
  dplyr::filter(method != "Rank-Ensemble") # exclude rank-based method

method_order <- method_stats$method

# use thresholds calculated in step 7.4
method_thresholds_df <- method_threshold |>
  dplyr::filter(method != "Rank-Ensemble")

# convert to named vector for easy lookup
method_thresholds <- setNames(
  method_thresholds_df$threshold,
  method_thresholds_df$method
)

# build combined table of top matches from real methods only
all_best_nonrank <- dplyr::bind_rows(
  best_lv,
  best_jw,
  best_qg,
  best_lcs,
  best_comp
) |>
  # normalize method labels
  dplyr::mutate(method = trimws(method))

# add rank once
rank_key <- ranked_grid |>
  dplyr::select(hf_dhis2, hf_mfl, rank_avg)

all_best_ranked <- all_best_nonrank |>
  dplyr::left_join(rank_key, by = c("hf_dhis2", "hf_mfl")) |>
  dplyr::mutate(
    # worst for missing ranks
    rank_avg = dplyr::if_else(is.na(rank_avg), Inf, rank_avg)
  )

# split by method and pre-arrange for stable tiebreak
by_method <- split(all_best_ranked, all_best_ranked$method) |>
  purrr::map(\(x) {
    x |>
      dplyr::arrange(dplyr::desc(score), rank_avg)
  })

# keep copy for pass-rate check
initial_thr <- method_thresholds

# init
fallback_chunks <- list()
unmatched <- unique(all_best_ranked$hf_dhis2)

# fast fallback loop (no joins, no grouping inside)
repeat {
  for (m in method_order) {
    if (!m %in% names(by_method)) {
      next
    }

    # current threshold from step 7.4
    thr <- method_thresholds[[m]]
    if (is.null(thr)) {
      thr <- 85 # fallback if method not found
    }

    # filter by score and still-unmatched
    cand <- by_method[[m]] |>
      dplyr::filter(score >= thr, hf_dhis2 %in% unmatched)

    if (nrow(cand) > 0) {
      # pick first row per hf_dhis2 (pre-arranged by score desc, rank asc)
      best_per_hf <- cand |>
        dplyr::distinct(hf_dhis2, .keep_all = TRUE) |>
        dplyr::mutate(
          final_method = glue::glue("Fuzzy-matched using: {m}")
        )

      # store and update unmatched
      fallback_chunks[[length(fallback_chunks) + 1L]] <- best_per_hf
      unmatched <- setdiff(unmatched, best_per_hf$hf_dhis2)
    }

    if (length(unmatched) == 0) break
  }

  if (length(unmatched) == 0) {
    break
  }

  # relax thresholds gradually
  method_thresholds <- method_thresholds - 1
  if (any(method_thresholds <= 50)) break # stop if too low
}

# bind once
fallback_matched <- dplyr::bind_rows(fallback_chunks)

# build a 1:1 map from standardized DHIS2 name to admin + raw +
# hf_uid_new
hf_uid_new_map_by_stand <- dhis2_hf_unmatched |>
  dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
  dplyr::group_by(hf_dhis2) |>
  dplyr::summarise(
    adm0 = dplyr::first(adm0),
    adm1 = dplyr::first(adm1),
    adm2 = dplyr::first(adm2),
    adm3 = dplyr::first(adm3),
    hf_dhis2_raw = dplyr::first(hf_dhis2_raw),
    hf_uid_new = dplyr::first(hf_uid_new),
    .groups = "drop"
  )

# final flags and attach admin + ids
fallback_final <- fallback_matched |>
  dplyr::mutate(
    match_flag = dplyr::if_else(score >= 85, "match", "review")
  ) |>
  dplyr::left_join(
    dplyr::select(master_hf_df, hf_mfl, hf_mfl_raw),
    by = "hf_mfl"
  ) |>
  dplyr::left_join(hf_uid_new_map_by_stand, by = "hf_dhis2") |>
  dplyr::select(
    adm0, adm1, adm2, adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    match_flag
  )

top_5 <- fallback_final |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::slice_head(n = 5)

bottom_5 <- fallback_final |>
  dplyr::arrange(score) |>
  dplyr::slice_head(n = 5)

cli::cli_h3("Top 5 and Bottom 5 Matches by Fallback")

bottom_5 <- dplyr::bind_rows(top_5, bottom_5)

# preview results
bottom_5

#### Step 8.2: Select fuzzy matching approach ----------------------------------

# create fuzzy_matches from step 8 results (use whichever option was
# chosen)
# step 8 produces different result datasets depending on which option
# was
# chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final

fuzzy_matches <- if (exists("fallback_final")) {
  dplyr::filter(fallback_final, match_flag == "match")
} else if (exists("composite_final")) {
  dplyr::filter(composite_final, match_flag == "match")
} else if (exists("weighted_final")) {
  dplyr::filter(weighted_final, match_flag == "match")
} else if (exists("best_lv_final")) {
  dplyr::filter(best_lv_final, match_flag == "match")
} else if (exists("best_jw_final")) {
  dplyr::filter(best_jw_final, match_flag == "match")
} else if (exists("best_qg_final")) {
  dplyr::filter(best_qg_final, match_flag == "match")
} else {
  # empty dataframe with expected structure if no step 8 results exist
  data.frame(
    hf_dhis2_raw = character(0),
    hf_dhis2 = character(0),
    hf_mfl_raw = character(0),
    hf_mfl = character(0),
    score = numeric(0),
    final_method = character(0),
    stringsAsFactors = FALSE
  )
} |>
# remove any internal duplicates (keep highest scoring match per
# facility)
dplyr::group_by(hf_dhis2_raw) |>
dplyr::slice_max(score, n = 1, with_ties = FALSE) |>
dplyr::ungroup() |>
# attach admin + hf_uid_new from unmatched (collapse to 1:1 per
# hf_dhis2_raw)
{
  hf_uid_new_map_unmatched <- dhis2_hf_unmatched |>
    dplyr::arrange(adm0, adm1, adm2, adm3, hf_dhis2_raw) |>
    dplyr::group_by(hf_dhis2_raw) |>
    dplyr::summarise(
      adm0 = dplyr::first(adm0),
      adm1 = dplyr::first(adm1),
      adm2 = dplyr::first(adm2),
      adm3 = dplyr::first(adm3),
      hf_uid_new = dplyr::first(hf_uid_new),
      .groups = "drop"
    )
  dplyr::left_join(., hf_uid_new_map_unmatched, by = "hf_dhis2_raw")
}

# display summary of selected approach
cli::cli_h2("Selected Fuzzy Matching Results")
cli::cli_alert_success("Fuzzy matches found: {nrow(fuzzy_matches)}")

if (nrow(fuzzy_matches) > 0) {
  score_summary <- fuzzy_matches |>
    dplyr::summarise(
      avg_score = mean(score, na.rm = TRUE),
      min_score = min(score, na.rm = TRUE),
      max_score = max(score, na.rm = TRUE),
      .groups = "drop"
    )

  cli::cli_alert_info("Score range: {round(score_summary$min_score, 1)} - {round(score_summary$max_score, 1)}")
  cli::cli_alert_info("Average score: {round(score_summary$avg_score, 1)}")
}

#### Step 8.3: Manual review of flagged matches --------------------------------

# prepare review dataset (using fallback results as example)
review_matches <- fallback_final |>
  dplyr::filter(match_flag == "review") |>
  dplyr::arrange(dplyr::desc(score)) |>
  dplyr::mutate(
    reviewer_decision = NA_character_, # accept/reject/uncertain
    reviewer_notes = NA_character_,
    review_date = NA_character_,
    reviewer_name = NA_character_
  ) |>
  dplyr::select(
    adm0, adm1, adm2, adm3, hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method,
    reviewer_decision,
    reviewer_notes,
    review_date,
    reviewer_name
  )

# export for review
readr::write_csv(
  review_matches,
  here::here("03_outputs", "tables", "facility_matches_for_review.csv")
)

# show summary
cli::cli_h2("Manual Review Required")
cli::cli_alert_info(
  "Exported {nrow(review_matches)} matches flagged for review"
)
cli::cli_alert_info(
  "File location: outputs/facility_matches_for_review.csv"
)

# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches <- readr::read_csv(
  here::here("03_outputs", "tables", "facility_matches_reviewed.csv"),
  col_types = readr::cols(
    reviewer_decision = readr::col_character(),
    reviewer_notes = readr::col_character(),
    review_date = readr::col_character(),
    reviewer_name = readr::col_character()
  )
)

final_incorp <- reviewed_matches |>
  dplyr::mutate(
    final_match_flag = dplyr::case_when(
      reviewer_decision == "accept" ~ "match",
      reviewer_decision == "reject" ~ "no_match",
      is.na(reviewer_decision) ~ "pending_review",
      TRUE ~ "no_match"
    )
  )

# get the final matched ones
final_matches <- final_incorp |>
  dplyr::filter(final_match_flag == "match") |>
  dplyr::select(-reviewer_decision)

# summary of all results (including rejected/pending)
all_results_summary <- final_incorp |>
  dplyr::count(final_match_flag, name = "n") |>
  dplyr::mutate(percentage = round(100 * n / sum(n), 1))

# summary of final accepted matches only
final_summary <- final_matches |>
  dplyr::count(final_match_flag, name = "n")

cli::cli_h2("Final Matching Results After Manual Review")
cli::cli_alert_success(
  "Accepted matches: {format(sum(final_summary$n), big.mark = ',')}"
)

# breakdown by decision type
rejected_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "no_match") |>
  dplyr::pull(n)
if (length(rejected_count) == 0) {
  rejected_count <- 0
}

pending_count <- all_results_summary |>
  dplyr::filter(final_match_flag == "pending_review") |>
  dplyr::pull(n)
if (length(pending_count) == 0) {
  pending_count <- 0
}

if (rejected_count > 0) {
  cli::cli_alert_danger(
    "Rejected matches: {format(rejected_count, big.mark = ',')} (will have NA MFL values in final dataset)"
  )
}

if (pending_count > 0) {
  cli::cli_alert_warning(
    "Pending review: {format(pending_count, big.mark = ',')} (require additional review)"
  )
}

# show breakdown table
all_results_summary

### Step 9: Combine All Matched Results ----------------------------------------

#### Step 9.1: Combine results from all matching approaches --------------------

# create manual_matches from manual review results (both matched and
# unmatched)
manual_matches <- final_incorp |>
  dplyr::mutate(
    # for unmatched, set appropriate values
    hf_mfl = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl,
      NA_character_
    ),
    hf_mfl_raw = dplyr::if_else(
      reviewer_decision == "accept",
      hf_mfl_raw,
      NA_character_
    ),
    final_method = dplyr::if_else(
      reviewer_decision == "accept",
      "Manual Review - Accepted",
      "Manual Review - Rejected"
    ),
    score = dplyr::if_else(
      reviewer_decision == "accept",
      100,
      0
    )
  ) |>
  # manual reviewed results already carry adm0-3 + hf_uid_new from
  # fallback_final → review export → reimport
  dplyr::select(
    adm0,
    adm1,
    adm2,
    adm3,
    hf_uid_new,
    hf_dhis2_raw,
    hf_dhis2,
    hf_mfl_raw,
    hf_mfl,
    score,
    final_method
  )

# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and
# rejected
final_facilities_all <- dplyr::bind_rows(
  # step 4: stratified geographic matches (if performed)
  matched_dhis2_prepgeoname,
  # step 5: exact matches after standardization
  matched_dhis2,
  # step 6-8: complete matches from fuzzy matching
  fuzzy_matches,
  # manual matching (if performed)
  manual_matches
)

# persist for the python render-state to load (Rule 6.4 parity)
write.csv(
  final_facilities_all,
  here::here("01_data/1.1_foundational/1.1c_health_facilities/processed/final_facilities_all.csv"),
  row.names = FALSE
)

# collapse to one record per DHIS2 facility for downstream integration
final_facilities_one_per_hf <- final_facilities_all |>
  dplyr::select(-match_flag, -hf)

cli::cli_alert_info(
  "All facilities processed (rows): {nrow(final_facilities_all)}"
)
cli::cli_alert_success(
  "All facilities processed (distinct DHIS2 by adm0/1/2/3+hf): {dplyr::n_distinct(final_facilities_all$hf_uid_new)}"
)

# generate summary by matching method
matching_summary <- final_facilities_one_per_hf |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    n_matched = n(),
    avg_score = mean(score, na.rm = TRUE),
    min_score = min(score, na.rm = TRUE),
    max_score = max(score, na.rm = TRUE),
    .groups = "drop"
  )

# display summary
cli::cli_h2("Matching Results by Method")
matching_summary

#### Step 9.2: Create final integrated DHIS2-MFL dataset -----------------------

# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final <- dhis2_df |>
  dplyr::left_join(
    dhis2_map,
    by = c("adm0", "adm1", "adm2", "adm3", "hf")
  )

# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf <- final_facilities_one_per_hf |>
  dplyr::select(hf_uid_new, hf_mfl_raw)

# create the integrated dataset
final_dhis2_mfl_df <- dhis2_df_final |>
  dplyr::left_join(final_match_per_hf, by = "hf_uid_new") |>
  # join MFL attributes for matched facilities
  dplyr::left_join(
    dplyr::select(master_hf_df, -hf_mfl, -hf, -adm0, -adm1, -adm2, -adm3),
    by = "hf_mfl_raw"
  )

# counts on distinct facilities via hf_uid_new (robust to row-level
# duplication)
total <- dhis2_df_final |> dplyr::distinct(hf_uid_new) |> nrow()
with_mfl <- final_match_per_hf |>
  dplyr::filter(!is.na(hf_mfl_raw)) |>
  nrow()
without_mfl <- total - with_mfl

# validation summary
cli::cli_h2("DHIS2-MFL Integration Complete")
cli::cli_alert_success(
  "Total DHIS2 facilities preserved: {total}"
)
cli::cli_alert_info(
  "Facilities with MFL data: {with_mfl}"
)
cli::cli_alert_warning(
  "Facilities without MFL match: {without_mfl}"
)
cli::cli_alert_info("Total rows (facility-months): {nrow(final_dhis2_mfl_df)}")

### Step 10: Final Checks ------------------------------------------------------

#### Step 10.1: Detecting one-to-many matches for review -----------------------

# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many <- final_facilities_all |>
  dplyr::filter(!is.na(hf_mfl_raw)) |> # only matched facilities
  dplyr::group_by(hf_mfl_raw) |>
  dplyr::filter(dplyr::n() > 1) |> # MFL matched to >1 DHIS2
  dplyr::ungroup() |>
  dplyr::arrange(hf_mfl_raw, hf_dhis2_raw)

if (nrow(one_to_many) > 0) {
  cli::cli_alert_warning(
    paste0(
      "One-to-many matches detected: {nrow(one_to_many)} DHIS2 facilities ",
      "share {dplyr::n_distinct(one_to_many$hf_mfl_raw)} MFL facilities"
    )
  )

  # show sample
  knitr::kable(
    one_to_many |>
      dplyr::select(hf_dhis2_raw, hf_mfl_raw, score, final_method) |>
      dplyr::slice_head(n = 6),
    caption = "Sample one-to-many matches requiring review"
  )
} else {
  cli::cli_alert_success("No one-to-many matches detected")
}

#### Step 10.2: Unmatched facilities analysis ----------------------------------

# analyze unmatched facilities by reason
unmatched_analysis <- final_facilities_all |>
  dplyr::filter(is.na(hf_mfl_raw)) |>
  dplyr::group_by(final_method) |>
  dplyr::summarise(
    count = n(),
    .groups = "drop"
  ) |>
  dplyr::arrange(desc(count))

# display unmatched analysis
cli::cli_h2("Unmatched Facilities Analysis")
cli::cli_alert_info(
  "Total unmatched facilities: {sum(unmatched_analysis$count)}"
)

# show breakdown by reason
purrr::pwalk(
  unmatched_analysis,
  ~ cli::cli_alert_warning("{.x}: {.y} facilities")
)

# create detailed unmatched facilities export
unmatched_detailed <- final_facilities_one_per_hf |>
  # unmatched are those without an MFL linkage
  dplyr::filter(is.na(hf_mfl_raw)) |>
  # select required context and scoring/method fields
  dplyr::select(
    hf_dhis2_raw,
    adm1,
    adm2,
    adm3,
    final_method,
    score
  ) |>
  # derive a clear potential reason for review
  dplyr::mutate(
    potential_reason = dplyr::case_when(
      is.na(final_method) ~ "Never matched - possible new facility",
      final_method == "Manual Review - Rejected" ~
        "Rejected in review - possible different facility",
      score > 0 & score < 50 ~ "Low similarity - possible data quality issue",
      TRUE ~ "Unmatched - needs investigation"
    )
  ) |>
  dplyr::arrange(adm1, adm2, adm3, hf_dhis2_raw)

# export for SNT team review
readr::write_csv(
  unmatched_detailed,
  here::here("03_outputs", "unmatched_facilities_for_snt_review.csv")
)

cli::cli_alert_success(
  "Exported {nrow(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli::cli_text("File: outputs/unmatched_facilities_for_snt_review.csv")

### Step 11: Save Final Datasets -----------------------------------------------

# save main integrated DHIS2-MFL dataset
rio::export(
  final_dhis2_mfl_df,
  here::here("03_outputs", "final_dhis2_mfl_integrated.xlsx")
)

# save matching results summary
rio::export(
  final_facilities_all,
  here::here("03_outputs", "facility_matching_results.xlsx")
)

if (nrow(unmatched_facilities) > 0) {
  readr::write_csv(
    unmatched_facilities,
    here::here("03_outputs", "unmatched_dhis2_facilities.csv")
  )
}

if (nrow(one_to_many) > 0) {
  readr::write_csv(
    one_to_many,
    here::here("03_outputs", "one_to_many_matches_for_review.csv")
  )
}

# summary of saved files
cli::cli_h2("Final Datasets Saved")
cli::cli_alert_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli::cli_alert_success("Matching results: facility_matching_results.xlsx")
cli::cli_alert_info("Unmatched facilities: {nrow(unmatched_facilities)} exported")
cli::cli_alert_info("One-to-many matches: {nrow(one_to_many)} exported")

Show full code

################################################################################
############ ~ Fuzzy matching of names across datasets full code ~ #############
################################################################################

### Step 1: Install and Load Required Libraries --------------------------------

from pathlib import Path
import re
import unicodedata

import numpy as np
import pandas as pd
import pyreadr
import xxhash
import matplotlib.pyplot as plt
from pyprojroot import here
from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz

def read_rds(path):
    """Read a single-object RDS file as a pandas DataFrame."""
    result = pyreadr.read_r(str(path))
    return next(iter(result.values()))

def vdigest(x, algo="xxhash32"):
    """Vectorised xxhash32 digest. Mirrors sntutils::vdigest()."""
    return x.apply(lambda val: xxhash.xxh32(str(val)).hexdigest())

def cli_header(message):
    print(f"\n{message}")

def cli_info(message):
    print(f"INFO: {message}")

def cli_success(message):
    print(f"SUCCESS: {message}")

def cli_warning(message):
    print(f"WARNING: {message}")

def cli_danger(message):
    print(f"ERROR: {message}")

def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )

def show_table(df, n=10, caption=None):
    """Render a compact scrollable HTML table with the .out-table style.
    Chunks calling this must set #| results: asis."""
    from IPython.display import display, HTML
    rows = df.head(n)
    cap_html = f"<caption>{caption}</caption>" if caption else ""
    table_html = rows.to_html(
        index=False,
        classes="out-table",
        border=0,
        na_rep="",
    )
    # inject caption before the table header
    if cap_html:
        table_html = table_html.replace(
            "<thead>", cap_html + "<thead>", 1
        )
    display(HTML(f'<div class="out-scroll">{table_html}</div>'))

### Step 2: Load Data ----------------------------------------------------------

# set up path to dhis2 hf data
dhis2_path = Path(here("01_data/1.2_epidemiology/1.2a_routine_surveillance/processed"))
hf_path = Path(here("01_data/1.1_foundational/1.1c_health_facilities/processed"))

# read DHIS2 health facilities data
dhis2_df = read_rds(dhis2_path / "sle_dhis2_with_clean_adm3.rds")
dhis2_df = dhis2_df.assign(hf_dhis2_raw=dhis2_df["hf"])

# get distinct admin and hf cols
dhis2_hf_df = dhis2_df[
    ["adm0", "adm1", "adm2", "adm3", "hf", "hf_dhis2_raw"]
].drop_duplicates()

# read MFL health facilities data
master_hf_df = pd.read_csv(hf_path / "hf_final_clean_data.csv")
master_hf_df = (
    master_hf_df
    .drop_duplicates(subset=["adm0", "adm1", "adm2", "adm3", "hf", "lat", "long"])
    .assign(hf_mfl_raw=lambda d: d["hf"])
)

# attach a stable DHIS2 facility ID
dhis2_map = (
    dhis2_df[["adm0", "adm1", "adm2", "adm3", "hf"]]
    .drop_duplicates()
    .assign(
        hf_uid_new=lambda d: vdigest(
            d["adm0"].str.lower().str.strip() + "|" +
            d["adm1"].str.lower().str.strip() + "|" +
            d["adm2"].str.lower().str.strip() + "|" +
            d["adm3"].str.lower().str.strip() + "|" +
            d["hf"].str.lower().str.strip()
        )
    )
)

# print head of data
cli_header("Sample of DHIS2 data:")
dhis2_hf_df.head()

cli_header("Sample of MFL data:")
master_hf_df.head()

### Step 3: Initial Matching Diagnostics ---------------------------------------

#### Step 3.1: Overall exact match check ---------------------------------------

# check for exact matches on raw names (no admin constraint)
exact_matches_all = dhis2_hf_df.merge(
    master_hf_df[["hf_mfl_raw"]],
    left_on="hf_dhis2_raw",
    right_on="hf_mfl_raw",
    how="inner",
)

# calculate matching potential
total_dhis2 = len(dhis2_hf_df)
total_mfl = len(master_hf_df)
unmatched_dhis2 = total_dhis2 - len(exact_matches_all)

cli_header("Overall Matching Summary")
cli_info(f"Total DHIS2 facilities: {total_dhis2}")
cli_info(f"Total MFL facilities: {total_mfl}")
cli_success(
    f"Exact matches found: {len(exact_matches_all)} "
    f"({round(len(exact_matches_all) / total_dhis2 * 100, 1)}%)"
)
cli_warning(f"Remaining to match: {unmatched_dhis2}")

#### Step 3.2: Administrative level match check --------------------------------

# check matches at adm2 level (district)
dhis2_by_adm2 = (
    dhis2_hf_df.groupby("adm2", as_index=False)
    .agg(total_dhis2=("hf_dhis2_raw", "count"))
)

matches_by_adm2 = (
    dhis2_hf_df
    .merge(
        master_hf_df[["hf_mfl_raw", "adm2"]],
        left_on=["hf_dhis2_raw", "adm2"],
        right_on=["hf_mfl_raw", "adm2"],
        how="inner",
    )
    .groupby("adm2", as_index=False)
    .agg(exact_matches=("hf_dhis2_raw", "count"))
    .merge(dhis2_by_adm2, on="adm2", how="left")
    .assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
    [["adm2", "exact_matches", "total_dhis2", "match_rate"]]
    .sort_values("match_rate", ascending=False)
)

cli_header("Exact matches by district (adm2)")
matches_by_adm2

# check matches at adm3 level (chiefdom/sub-district)
dhis2_by_adm3 = (
    dhis2_hf_df.groupby(["adm2", "adm3"], as_index=False)
    .agg(total_dhis2=("hf_dhis2_raw", "count"))
)

matches_by_adm3 = (
    dhis2_hf_df
    .merge(
        master_hf_df[["hf_mfl_raw", "adm2", "adm3"]],
        left_on=["hf_dhis2_raw", "adm2", "adm3"],
        right_on=["hf_mfl_raw", "adm2", "adm3"],
        how="inner",
    )
    .groupby(["adm2", "adm3"], as_index=False)
    .agg(exact_matches=("hf_dhis2_raw", "count"))
    .merge(dhis2_by_adm3, on=["adm2", "adm3"], how="left")
    .assign(match_rate=lambda d: (d["exact_matches"] / d["total_dhis2"] * 100).round(1))
    .loc[lambda d: d["total_dhis2"] >= 5]
    .sort_values("match_rate", ascending=False)
    .head(10)
)

cli_header("Top exact matches by chiefdom (adm3)")
matches_by_adm3

#### Step 3.3: Duplicate name check --------------------------------------------

# check for duplicates within same adm2 (problematic)
dhis2_dups_adm2 = (
    dhis2_hf_df
    .groupby(["adm2", "hf_dhis2_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "hf_dhis2_raw"])
)

mfl_dups_adm2 = (
    master_hf_df
    .groupby(["adm2", "hf_mfl_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "hf_mfl_raw"])
)

cli_header("Duplicates within same district (adm2)")
cli_warning(
    f"DHIS2 duplicates within districts: {dhis2_dups_adm2['hf_dhis2_raw'].nunique()}"
)
cli_warning(
    f"MFL duplicates within districts: {mfl_dups_adm2['hf_mfl_raw'].nunique()}"
)

# check for duplicates within same adm3 (very problematic)
dhis2_dups_adm3 = (
    dhis2_hf_df
    .groupby(["adm2", "adm3", "hf_dhis2_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "adm3", "hf_dhis2_raw"])
)

mfl_dups_adm3 = (
    master_hf_df
    .groupby(["adm2", "adm3", "hf_mfl_raw"])
    .filter(lambda x: len(x) > 1)
    .sort_values(["adm2", "adm3", "hf_mfl_raw"])
)

cli_header("Duplicates within same chiefdom (adm3)")
cli_danger(
    f"DHIS2 duplicates within chiefdoms: {dhis2_dups_adm3['hf_dhis2_raw'].nunique()}"
)
cli_danger(
    f"MFL duplicates within chiefdoms: {mfl_dups_adm3['hf_mfl_raw'].nunique()}"
)

# check for overall duplicates (manageable with geographic context)
dhis2_dups_overall = (
    dhis2_hf_df
    .groupby("hf_dhis2_raw")
    .filter(lambda x: len(x) > 1)
    .sort_values("hf_dhis2_raw")
)

mfl_dups_overall = (
    master_hf_df
    .groupby("hf_mfl_raw")
    .filter(lambda x: len(x) > 1)
    .sort_values("hf_mfl_raw")
)

cli_header("Overall duplicate names (across all areas)")
cli_info(f"DHIS2 overall duplicates: {dhis2_dups_overall['hf_dhis2_raw'].nunique()}")
cli_info(f"MFL overall duplicates: {mfl_dups_overall['hf_mfl_raw'].nunique()}")

### Step 4: Interactive Stratified Geographic Matching -------------------------

from sntutils.geo import prep_geonames

# set up location to save cache
cache_loc = "01_data/1.1_foundational/1.1f_cache_files/processed"

# interactive stratified matching with automatic standardization
dhis2_df_cleaned = prep_geonames(
    target_df=dhis2_hf_df,
    lookup_df=master_hf_df,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    level4="hf",
    cache_path=here(cache_loc, "geoname_cache.csv"),
    unmatched_export_path=here(cache_loc, "dhis2_hf_unmatched.csv"),
)

# load the unmatched facilities for further processing (steps 5–8)
dhis2_hf_to_process = pd.read_csv(
    here(cache_loc, "dhis2_hf_unmatched.csv")
).rename(columns={"hf": "hf_dhis2_raw"})

# summary statistics
n_original = len(dhis2_hf_df)
n_matched = n_original - len(dhis2_hf_to_process)
match_rate = n_matched / n_original * 100

cli_success(
    f"Stratified matching completed: {n_matched:,}/{n_original:,} "
    f"facilities matched ({round(match_rate, 1)}%)"
)
cli_info(f"Remaining unmatched: {len(dhis2_hf_to_process)} facilities")

### Step 5: Process and Prepare Unmatched Data for Fuzzy Matching --------------

#### Step 5.1: Standardize health facility names -------------------------------

import unicodedata

def standardize_names(series):
    """Standardize health facility name strings for fuzzy matching.

    Applies lowercase conversion, punctuation removal, accent normalization,
    Roman numeral conversion, and alphabetical token sorting.
    """
    roman_map = {
        r"\bix\b": "9", r"\bviii\b": "8", r"\bvii\b": "7",
        r"\bvi\b": "6", r"\bv\b": "5", r"\biv\b": "4",
        r"\biii\b": "3", r"\bii\b": "2", r"\bi\b": "1",
    }

    def _clean(text):
        if pd.isna(text):
            return text
        text = str(text).lower()
        # replace punctuation with space
        text = re.sub(r"[^\w\s]", " ", text)
        # collapse multiple spaces
        text = re.sub(r"\s+", " ", text).strip()
        # normalize accents
        text = unicodedata.normalize("NFD", text)
        text = "".join(c for c in text if unicodedata.category(c) != "Mn")
        # convert roman numerals
        for pattern, replacement in roman_map.items():
            text = re.sub(pattern, replacement, text)
        # sort tokens: letters first (alphabetical), numbers last
        tokens = text.split()
        alpha = sorted(t for t in tokens if not t.isdigit())
        numeric = sorted(t for t in tokens if t.isdigit())
        return " ".join(alpha + numeric)

    return series.apply(_clean)

# set up example with messy formatting
example_word = "Clínica! Rahmâ    IV (  New clinic) East"

# print original
print(f"\nExample before standardization:\n{example_word}")

# apply standardization
example_word_st = standardize_names(pd.Series([example_word]))[0]

# print cleaned example
print(f"\nExample after standardization:\n{example_word_st}")

# remove duplicates from MFL and create standardized column
master_hf_df = (
    master_hf_df
    .drop_duplicates(subset="hf_mfl_raw")
    .assign(hf_mfl=lambda d: standardize_names(d["hf_mfl_raw"]))
)

# preserve original full dataset before processing unmatched
dhis2_hf_df_original = dhis2_hf_df.assign(
    hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"])
)

# build geo-aware stable ID map
hf_uid_new_map = (
    dhis2_hf_df[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"]]
    .drop_duplicates()
    .assign(
        hf_uid_new=lambda d: vdigest(
            d["adm0"].str.lower().str.strip() + "|" +
            d["adm1"].str.lower().str.strip() + "|" +
            d["adm2"].str.lower().str.strip() + "|" +
            d["adm3"].str.lower().str.strip() + "|" +
            d["hf_dhis2_raw"].str.lower().str.strip()
        )
    )
)

dhis2_hf_df_original = dhis2_hf_df_original.merge(
    hf_uid_new_map,
    on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
    how="left",
)

# process only unmatched facilities for fuzzy matching steps
dhis2_hf_unmatched = (
    dhis2_hf_to_process
    .assign(hf_dhis2=lambda d: standardize_names(d["hf_dhis2_raw"]))
    .merge(
        hf_uid_new_map.assign(
            hf_dhis2_raw=lambda d: d["hf_dhis2_raw"].str.upper()
        ),
        on=["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"],
        how="left",
    )
)

# check if standardization worked
dhis2_hf_unmatched[["hf_dhis2_raw", "hf_dhis2"]].drop_duplicates().head(10)

#### Step 5.2: Handle abbreviations in health facility names -------------------

# combine DHIS2 and MFL names for abbreviation detection
all_names = pd.concat([
    dhis2_hf_unmatched[["hf_dhis2_raw"]].rename(columns={"hf_dhis2_raw": "hf"}),
    master_hf_df[["hf_mfl_raw"]].rename(columns={"hf_mfl_raw": "hf"}),
])

# split into words and filter to 2-4 uppercase letter tokens
abbrev_dictionary = (
    all_names["hf"]
    .dropna()
    .str.split(expand=True)
    .stack()
    .reset_index(drop=True)
    .rename("word")
    .loc[lambda s: s.str.match(r"^[A-Z]{2,4}$")]
    .value_counts()
    .reset_index()
    .rename(columns={"count": "freq"})
    .loc[lambda d: d["freq"] > 2]
)

# check output
abbrev_dictionary

# define abbreviation dictionary (all lowercase, using standardized column)
abbrev_dict = {
    "maternal child health post": "mchp",
    "community health post": "chp",
    "community health center": "chc",
    "urban maternal clinic": "umi",
    "expanded programme on immunization": "epi",
}

# apply replacements to MFL names
for long_form, short_form in abbrev_dict.items():
    master_hf_df["hf_mfl"] = master_hf_df["hf_mfl"].str.replace(
        long_form, short_form, regex=False
    )

# apply replacements to DHIS2 names
for long_form, short_form in abbrev_dict.items():
    dhis2_hf_unmatched["hf_dhis2"] = dhis2_hf_unmatched["hf_dhis2"].str.replace(
        long_form, short_form, regex=False
    )

# check it worked: demonstrate abbreviation standardization
dhis2_hf_unmatched.loc[
    dhis2_hf_unmatched["hf_dhis2_raw"].str.contains(
        r"(?i)Community Health Center|Maternal Child Health Post|Community Health Post|CHC|MCHP|CHP",
        regex=True,
        na=False,
    ),
    ["hf_dhis2_raw", "hf_dhis2"],
].drop_duplicates().head()

### Step 6: Perform Fuzzy Matching on Remaining Unmatched ----------------------

#### Step 6.1: Identify matched and unmatched facilities -----------------------

# flag for one-to-one enforcement
enforce_one_to_one = False

# exact matches using raw names
matched_dhis2_raw = (
    dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
    .merge(
        master_hf_df[["hf_mfl_raw"]],
        left_on="hf_dhis2_raw",
        right_on="hf_mfl_raw",
        how="inner",
    )
)

# exact matches using standardized names
matched_dhis2 = (
    dhis2_hf_unmatched[
        ["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_dhis2", "hf_uid_new"]
    ]
    .merge(
        master_hf_df[["hf_mfl_raw", "hf_mfl"]],
        left_on="hf_dhis2",
        right_on="hf_mfl",
        how="inner",
    )
)
# exclude facilities already matched via raw names
matched_dhis2 = anti_join(matched_dhis2, matched_dhis2_raw, on=["hf_dhis2_raw"])
matched_dhis2 = matched_dhis2.assign(
    final_method="Matched Without Fuzzy Matching (standardization)",
    score=100,
)

# compute unmatched after standardization
# cross-named anti-join: dhis2_hf_unmatched.hf_dhis2 vs master_hf_df.hf_mfl
unmatched_dhis2 = (
    dhis2_hf_unmatched.loc[
        ~dhis2_hf_unmatched["hf_dhis2"].isin(master_hf_df["hf_mfl"]),
        ["hf_dhis2"],
    ]
    .drop_duplicates(subset="hf_dhis2")
)

used_mfl = list(matched_dhis2["hf_mfl"].unique())

# build candidate MFL pool for fuzzy step
candidate_mfl_df = master_hf_df[["hf_mfl"]].copy()
if enforce_one_to_one:
    candidate_mfl_df = candidate_mfl_df.loc[~candidate_mfl_df["hf_mfl"].isin(used_mfl)]

# summary counts
total_dhis2_hf = dhis2_hf_unmatched["hf_dhis2_raw"].nunique()
raw_match_dhis2_hf = matched_dhis2_raw["hf_dhis2_raw"].nunique()
raw_unmatch_dhis2_hf = total_dhis2_hf - raw_match_dhis2_hf
standardized_match_dhis2_hf = matched_dhis2["hf_dhis2"].nunique()
standardized_unmatch_dhis2_hf = total_dhis2_hf - standardized_match_dhis2_hf
total_mfl_hf = master_hf_df["hf_mfl"].nunique()
candidate_mfl_count = candidate_mfl_df["hf_mfl"].nunique()

cli_header("Summary of Matching Status")
cli_info(f"Total DHIS2 facilities: {total_dhis2_hf:,}")
cli_success(f"Matched after standardization: {standardized_match_dhis2_hf:,}")
cli_danger(f"Unmatched using raw names: {raw_unmatch_dhis2_hf:,}")
cli_danger(f"Unmatched after standardization: {standardized_unmatch_dhis2_hf:,}")
cli_info(
    f"MFL facilities in candidate pool for matching: "
    f"{candidate_mfl_count:,} of {total_mfl_hf:,}"
)

#### Step 6.1: Create the matching grid ----------------------------------------

# create comprehensive matching grid for fuzzy matching by pairing each
# unmatched DHIS2 facility with all available MFL facilities
candidate_match_pool = unmatched_dhis2.merge(candidate_mfl_df, how="cross")

# create match grid with dhis2 and mfl names
match_grid = candidate_match_pool.copy()

# preview sample - top 10 matches
match_grid.head(10)

#### Step 6.2: Compute similarity scores ---------------------------------------

from rapidfuzz.distance import Levenshtein, JaroWinkler
from rapidfuzz import fuzz

def _score_lv(a, b):
    """Normalized Levenshtein similarity (0–100)."""
    len_max = max(len(a), len(b))
    if len_max == 0:
        return 100.0
    return max(0.0, (1 - Levenshtein.distance(a, b) / len_max) * 100)

def _score_jw(a, b):
    """Jaro-Winkler similarity (0–100)."""
    return JaroWinkler.similarity(a, b) * 100

def _score_qg(a, b):
    """Q-gram similarity (0–100) using token sort ratio."""
    return float(fuzz.token_sort_ratio(a, b))

def _score_lcs(a, b):
    """LCS-based similarity (0–100) using partial ratio."""
    return float(fuzz.partial_ratio(a, b))

# calculate fuzzy scores
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]

match_grid = match_grid.assign(
    score_jw=lambda d: d.apply(
        lambda r: _score_jw(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
    score_lv=lambda d: d.apply(
        lambda r: _score_lv(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
    score_qg=lambda d: d.apply(
        lambda r: _score_qg(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
    score_lcs=lambda d: d.apply(
        lambda r: _score_lcs(str(r["hf_dhis2"]), str(r["hf_mfl"])), axis=1
    ),
)

# clip negatives and NaN to zero
for col in score_cols:
    match_grid[col] = match_grid[col].clip(lower=0).fillna(0)

#### Step 6.3: Create composite scores -----------------------------------------

# define similarity score columns
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]

# compute composite_score (simple average)
match_grid["composite_score"] = match_grid[score_cols].mean(axis=1)

# compute average rank across all similarity methods
for col in score_cols:
    match_grid[f"rank_{col}"] = (
        match_grid.groupby("hf_dhis2")[col]
        .rank(method="min", ascending=False)
    )

rank_cols = [f"rank_{c}" for c in score_cols]
match_grid["rank_avg"] = (
    match_grid[rank_cols].sum(axis=1) / len(score_cols)
).round()
ranked_grid = match_grid.copy()

# preview ranked results for a few sample facilities
sample_facilities = [
    "charity clinic kamba of",
    "arab clinic shad",
    "al arab clinic sheefa",
]

ranked_grid.loc[ranked_grid["hf_dhis2"].isin(sample_facilities)].groupby(
    "hf_dhis2"
).apply(lambda g: g.nsmallest(4, "rank_avg")).reset_index(drop=True)[
    ["hf_dhis2", "hf_mfl"] + score_cols + ["composite_score", "rank_avg"]
].sort_values(["hf_dhis2", "rank_avg"])

#### Step 6.4: Extract best match ----------------------------------------------

# for each fuzzy matching method, select the top-ranked MFL name per DHIS2 hf
best_jw = (
    match_grid.sort_values("score_jw", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_jw"]]
    .rename(columns={"score_jw": "score"})
    .assign(method="Jaro-Winkler")
)

best_lv = (
    match_grid.sort_values("score_lv", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_lv"]]
    .rename(columns={"score_lv": "score"})
    .assign(method="Levenshtein")
)

best_qg = (
    match_grid.sort_values("score_qg", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_qg"]]
    .rename(columns={"score_qg": "score"})
    .assign(method="Qgram")
)

best_lcs = (
    match_grid.sort_values("score_lcs", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "score_lcs"]]
    .rename(columns={"score_lcs": "score"})
    .assign(method="LCS")
)

best_comp = (
    match_grid.sort_values("composite_score", ascending=False)
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "composite_score"]]
    .rename(columns={"composite_score": "score"})
    .assign(method="Composite-Score")
)

best_ranked_match = (
    ranked_grid.sort_values("rank_avg")
    .groupby("hf_dhis2", as_index=False).first()
    [["hf_dhis2", "hf_mfl", "rank_avg"]]
    .rename(columns={"rank_avg": "score"})
    .assign(method="Rank-Ensemble")
)

# combine top matches from all methods for comparison across approaches
all_best = pd.concat(
    [best_jw, best_lv, best_qg, best_lcs, best_comp, best_ranked_match],
    ignore_index=True,
)

# preview best matches
all_best.head(10)[["hf_dhis2", "hf_mfl", "score", "method"]]

### Step 7: Evaluate Fuzzy Match Quality ---------------------------------------

#### Step 7.1: Visualize score distributions across methods --------------------

import matplotlib.pyplot as plt

# plot score distribution across methods (exclude rank-based method)
methods_for_plot = all_best.loc[all_best["method"] != "Rank-Ensemble"]
method_names = sorted(methods_for_plot["method"].unique())

fig, axes = plt.subplots(1, len(method_names), figsize=(12, 6), sharey=True)
for ax, name in zip(axes, method_names):
    grp = methods_for_plot.loc[methods_for_plot["method"] == name, "score"].dropna()
    grp.plot.kde(ax=ax, color="steelblue", linewidth=1)
    ax.fill_between(
        ax.lines[0].get_xdata(),
        ax.lines[0].get_ydata(),
        alpha=0.6,
        color="steelblue",
    )
    ax.set_title(name, fontsize=11)
    ax.set_xlabel("\nMatch Score (%)")
    if ax == axes[0]:
        ax.set_ylabel("Density\n")

fig.suptitle("Distribution of Fuzzy Matching Scores", fontsize=14)
plt.tight_layout()

# save plot
output_dir = Path(here("03_output/3a_figures"))
output_dir.mkdir(parents=True, exist_ok=True)
fig.savefig(output_dir / "fuzzy_score_distributions.png", dpi=300, bbox_inches="tight")

#### Step 7.2: Define match quality diagnostic function ------------------------

def assess_match_quality(names1, names2):
    """Compute structural quality diagnostics for name pairs.

    Returns a DataFrame with prefix_match, suffix_match,
    token_diff, and char_diff for each name pair.
    """
    rows = []
    for a, b in zip(names1, names2):
        tokens1 = str(a).split()
        tokens2 = str(b).split()
        rows.append({
            "prefix_match": tokens1[0].lower() == tokens2[0].lower()
                if tokens1 and tokens2 else False,
            "suffix_match": tokens1[-1].lower() == tokens2[-1].lower()
                if tokens1 and tokens2 else False,
            "token_diff": abs(len(tokens1) - len(tokens2)),
            "char_diff": abs(len(str(a)) - len(str(b))),
        })
    return pd.DataFrame(rows)

assess_match_quality(["Makeni Govt Hospital"], ["Makeni Government Hospital"])

#### Step 7.3: Evaluate quality of top fuzzy matches ---------------------------

def _rescale(series, new_min, new_max):
    """Rescale a numeric series to [new_min, new_max]."""
    s_min, s_max = series.min(), series.max()
    if s_max == s_min:
        return pd.Series([new_min] * len(series), index=series.index)
    return new_min + (series - s_min) / (s_max - s_min) * (new_max - new_min)

# assess structural match quality
diagnostics_df = pd.concat(
    [all_best.reset_index(drop=True),
     assess_match_quality(all_best["hf_dhis2"], all_best["hf_mfl"])],
    axis=1,
)

# compare method diagnostics
summary_stats = (
    diagnostics_df
    .groupby("method")
    .agg(
        avg_token_diff=("token_diff", "mean"),
        avg_char_diff=("char_diff", "mean"),
        pct_prefix_match=("prefix_match", "mean"),
        pct_suffix_match=("suffix_match", "mean"),
        total=("hf_dhis2", "count"),
    )
    .reset_index()
    .assign(
        avg_token_diff=lambda d: d["avg_token_diff"].round(2),
        avg_char_diff=lambda d: d["avg_char_diff"].round(2),
        pct_prefix_match=lambda d: (d["pct_prefix_match"] * 100).round(2),
        pct_suffix_match=lambda d: (d["pct_suffix_match"] * 100).round(2),
    )
)

# create a final overall score
summary_stats = summary_stats.assign(
    # rescale negative of average token difference (smaller is better)
    token_score=lambda d: _rescale(-d["avg_token_diff"], 0, 100),
    # rescale negative of average character difference (smaller is better)
    char_score=lambda d: _rescale(-d["avg_char_diff"], 0, 100),
    # rescale prefix match percentage (higher is better)
    prefix_score=lambda d: _rescale(d["pct_prefix_match"], 0, 100),
    # rescale suffix match percentage (higher is better)
    suffix_score=lambda d: _rescale(d["pct_suffix_match"], 0, 100),
).assign(
    # combine all four metrics into a weighted structural quality score
    structure_score=lambda d: (
        0.3 * d["token_score"] +        # emphasize fewer token differences
        0.2 * d["char_score"] +          # moderate weight on character similarity
        0.25 * d["prefix_score"] +       # give weight to matching initial words
        0.25 * d["suffix_score"]         # give equal weight to matching final words
    ).round(1)
).sort_values("structure_score", ascending=False)

summary_stats["rank"] = range(1, len(summary_stats) + 1)

summary_stats.rename(columns={
    "method": "Method",
    "avg_token_diff": "Avg. Token Difference",
    "avg_char_diff": "Avg. Character Difference",
    "pct_prefix_match": "% Prefix Match",
    "pct_suffix_match": "% Suffix Match",
    "structure_score": "Structural Score",
    "rank": "Rank",
})[["Method", "Avg. Token Difference", "Avg. Character Difference",
    "% Prefix Match", "% Suffix Match", "Structural Score", "Rank"]]

#### Step 7.4: Weighted thresholding using structural quality and false-positive risk

# calculate method-specific weights and thresholds
q70 = all_best["score"].quantile(0.70)
q95 = all_best["score"].quantile(0.95)

method_threshold = summary_stats[["method", "structure_score"]].copy()
method_threshold["score_frac"] = method_threshold["structure_score"] / 100
method_threshold["weight"] = (
    method_threshold["score_frac"] / method_threshold["score_frac"].sum()
)
# inverse scale: higher structure_score → lower threshold (easier matching)
method_threshold["threshold"] = (
    _rescale(method_threshold["structure_score"], q95, q70)
).round()
method_threshold = method_threshold[["method", "weight", "threshold"]]

# display results
method_threshold

### Step 8: Finalize Fuzzy Match Selection -------------------------------------

#### Step 8.1: Match selection options -----------------------------------------

# apply single-method threshold and flag matches
best_lv_final = (
    best_lv
    .assign(
        match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"),
        final_method="Levenshtein",
    )
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    [["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)

# show top 5 and bottom 5 matches by score
top_5 = best_lv_final.sort_values("score", ascending=False).head(5)
bottom_5 = best_lv_final.sort_values("score").head(5)

# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)

cli_header("Top 5 and Bottom 5 Matches by Levenshtein Score")

# preview results
top_bottom

# select best match using rank average
composite_final = (
    ranked_grid.sort_values("rank_avg")
    .groupby("hf_dhis2", as_index=False)
    .first()
    .assign(
        match_flag=lambda d: np.where(d["composite_score"] >= 85, "match", "review"),
        score=lambda d: d["composite_score"],
        final_method="Composite-Score",
    )
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    [["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)

# show top 5 and bottom 5 matches by composite score
top_5 = composite_final.sort_values("score", ascending=False).head(5)
bottom_5 = composite_final.sort_values("score").head(5)

# combine into a single table
top_bottom = pd.concat([top_5, bottom_5], ignore_index=True)

cli_header("Top 5 and Bottom 5 Matches by Composite Score")

# preview results
top_bottom

# define score columns and extract weights
score_cols = ["score_jw", "score_lv", "score_qg", "score_lcs"]

weights_df = method_threshold.loc[
    method_threshold["method"].isin(["Jaro-Winkler", "Levenshtein", "Qgram", "LCS"])
]
weights = weights_df.set_index("method")["weight"]

# extract individual weights for each method
w_jw = float(weights.get("Jaro-Winkler", 0))
w_lv = float(weights.get("Levenshtein", 0))
w_qg = float(weights.get("Qgram", 0))
w_lcs = float(weights.get("LCS", 0))

# calculate weighted composite score for each facility-match pair
# weighted average maintains 0-100 scale
weighted_final = (
    ranked_grid.assign(
        weighted_composite_score=lambda d: (
            d["score_jw"] * w_jw +
            d["score_lv"] * w_lv +
            d["score_qg"] * w_qg +
            d["score_lcs"] * w_lcs
        )
    )
    .sort_values("weighted_composite_score", ascending=False)
    .groupby("hf_dhis2", as_index=False)
    .first()
    .assign(
        match_flag=lambda d: np.where(
            d["weighted_composite_score"] >= 85, "match", "review"
        ),
        score=lambda d: d["weighted_composite_score"],
        final_method="Weighted-Composite",
    )
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    [["hf_dhis2", "hf_mfl", "hf_mfl_raw", "score", "final_method", "match_flag"]]
)

# show top 5 and bottom 5 matches by weighted composite score
top_5 = weighted_final.sort_values("score", ascending=False).head(5)
bottom_5 = weighted_final.sort_values("score").head(5)

cli_header("Top 5 and Bottom 5 Matches by Weighted Composite Score")

# preview results
bottom_5

# extract method order and thresholds from step 7.4 results
# order methods by structure score (highest to lowest)
method_stats = summary_stats.loc[
    summary_stats["method"] != "Rank-Ensemble"
].sort_values("structure_score", ascending=False)

method_order = list(method_stats["method"])

# use thresholds calculated in step 7.4
method_thresholds_df = method_threshold.loc[method_threshold["method"] != "Rank-Ensemble"]
method_thresholds = dict(zip(method_thresholds_df["method"], method_thresholds_df["threshold"]))

# build combined table of top matches from real methods only
all_best_nonrank = pd.concat(
    [best_lv, best_jw, best_qg, best_lcs, best_comp], ignore_index=True
)
all_best_nonrank["method"] = all_best_nonrank["method"].str.strip()

# add rank once
rank_key = ranked_grid[["hf_dhis2", "hf_mfl", "rank_avg"]].copy()
all_best_ranked = (
    all_best_nonrank
    .merge(rank_key, on=["hf_dhis2", "hf_mfl"], how="left")
    .assign(rank_avg=lambda d: d["rank_avg"].fillna(float("inf")))
)

# split by method and pre-arrange for stable tiebreak
by_method = {
    m: grp.sort_values(["score", "rank_avg"], ascending=[False, True])
    for m, grp in all_best_ranked.groupby("method")
}

fallback_chunks = []
unmatched_set = set(all_best_ranked["hf_dhis2"].unique())
current_thresholds = dict(method_thresholds)

# fast fallback loop
while True:
    for m in method_order:
        if m not in by_method:
            continue
        thr = current_thresholds.get(m, 85)  # fallback if method not found
        cand = by_method[m].loc[
            (by_method[m]["score"] >= thr) &
            (by_method[m]["hf_dhis2"].isin(unmatched_set))
        ]
        if len(cand) > 0:
            best_per_hf = (
                cand.drop_duplicates(subset="hf_dhis2", keep="first")
                .assign(final_method=f"Fuzzy-matched using: {m}")
            )
            fallback_chunks.append(best_per_hf)
            unmatched_set -= set(best_per_hf["hf_dhis2"])
        if not unmatched_set:
            break

    if not unmatched_set:
        break

    # relax thresholds gradually
    current_thresholds = {k: v - 1 for k, v in current_thresholds.items()}
    if any(v <= 50 for v in current_thresholds.values()):  # stop if too low
        break

# bind once
fallback_matched = (
    pd.concat(fallback_chunks, ignore_index=True)
    if fallback_chunks else pd.DataFrame()
)

# build a 1:1 map from standardized DHIS2 name to admin + raw + hf_uid_new
hf_uid_new_map_by_stand = (
    dhis2_hf_unmatched
    .sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
    .groupby("hf_dhis2", as_index=False)
    .first()[["hf_dhis2", "adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
)

# final flags and attach admin + ids
fallback_final = (
    fallback_matched
    .assign(match_flag=lambda d: np.where(d["score"] >= 85, "match", "review"))
    .merge(master_hf_df[["hf_mfl", "hf_mfl_raw"]], on="hf_mfl", how="left")
    .merge(hf_uid_new_map_by_stand, on="hf_dhis2", how="left")
    [["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
      "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
      "score", "final_method", "match_flag"]]
)

top_5 = fallback_final.sort_values("score", ascending=False).head(5)
bottom_5 = fallback_final.sort_values("score").head(5)

cli_header("Top 5 and Bottom 5 Matches by Fallback")

bottom_5 = pd.concat([top_5, bottom_5], ignore_index=True)

# preview results
bottom_5

#### Step 8.2: Select fuzzy matching approach ----------------------------------

# create fuzzy_matches from step 8 results (use whichever option was chosen)
# step 8 produces different result datasets depending on which option was chosen:
# - option 1: best_lv_final, best_jw_final, etc.
# - option 2: composite_final
# - option 3: weighted_final
# - option 4: fallback_final

if "fallback_final" in dir():
    fuzzy_matches_raw = fallback_final.loc[fallback_final["match_flag"] == "match"]
elif "composite_final" in dir():
    fuzzy_matches_raw = composite_final.loc[composite_final["match_flag"] == "match"]
elif "weighted_final" in dir():
    fuzzy_matches_raw = weighted_final.loc[weighted_final["match_flag"] == "match"]
elif "best_lv_final" in dir():
    fuzzy_matches_raw = best_lv_final.loc[best_lv_final["match_flag"] == "match"]
elif "best_jw_final" in dir():
    fuzzy_matches_raw = best_jw_final.loc[best_jw_final["match_flag"] == "match"]
elif "best_qg_final" in dir():
    fuzzy_matches_raw = best_qg_final.loc[best_qg_final["match_flag"] == "match"]
else:
    # empty dataframe with expected structure if no step 8 results exist
    fuzzy_matches_raw = pd.DataFrame(columns=[
        "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
        "score", "final_method",
    ])

# deduplicate: keep highest scoring match per DHIS2 facility
fuzzy_matches = (
    fuzzy_matches_raw
    .sort_values("score", ascending=False)
    .drop_duplicates(subset="hf_dhis2_raw", keep="first")
    .merge(
        dhis2_hf_unmatched[["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw", "hf_uid_new"]]
        .sort_values(["adm0", "adm1", "adm2", "adm3", "hf_dhis2_raw"])
        .drop_duplicates(subset="hf_dhis2_raw", keep="first"),
        on="hf_dhis2_raw",
        how="left",
    )
)

# display summary of selected approach
cli_header("Selected Fuzzy Matching Results")
cli_success(f"Fuzzy matches found: {len(fuzzy_matches)}")

if len(fuzzy_matches) > 0:
    cli_info(
        f"Score range: {fuzzy_matches['score'].min():.1f} - "
        f"{fuzzy_matches['score'].max():.1f}"
    )
    cli_info(f"Average score: {fuzzy_matches['score'].mean():.1f}")

#### Step 8.3: Manual review of flagged matches --------------------------------

# prepare review dataset (using fallback results as example)
review_matches = (
    fallback_final
    .loc[fallback_final["match_flag"] == "review"]
    .sort_values("score", ascending=False)
    .assign(
        reviewer_decision=pd.NA,  # accept/reject/uncertain
        reviewer_notes=pd.NA,
        review_date=pd.NA,
        reviewer_name=pd.NA,
    )
    [["adm0", "adm1", "adm2", "adm3", "hf_uid_new",
      "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl",
      "score", "final_method",
      "reviewer_decision", "reviewer_notes", "review_date", "reviewer_name"]]
)

# export for review
output_dir = Path(here("03_outputs/tables"))
output_dir.mkdir(parents=True, exist_ok=True)
review_matches.to_csv(output_dir / "facility_matches_for_review.csv", index=False)

# show summary
cli_header("Manual Review Required")
cli_info(f"Exported {len(review_matches)} matches flagged for review")
cli_info("File location: outputs/tables/facility_matches_for_review.csv")

# simulate reviewer decisions (replace with actual reviewed data)
# read back reviewed matches
reviewed_matches = pd.read_csv(
    here("03_outputs/tables/facility_matches_reviewed.csv"),
    dtype={
        "reviewer_decision": "object",
        "reviewer_notes": "object",
        "review_date": "object",
        "reviewer_name": "object",
    }
)

final_incorp = reviewed_matches.assign(
    final_match_flag=lambda d: np.select(
        [
            d["reviewer_decision"] == "accept",
            d["reviewer_decision"] == "reject",
            d["reviewer_decision"].isna(),
        ],
        ["match", "no_match", "pending_review"],
        default="no_match",
    )
)

# get the final matched ones
final_matches = (
    final_incorp
    .loc[final_incorp["final_match_flag"] == "match"]
    .drop(columns="reviewer_decision", errors="ignore")
)

# summary of all results (including rejected/pending)
all_results_summary = (
    final_incorp.groupby("final_match_flag", as_index=False)
    .size()
    .rename(columns={"size": "n"})
    .assign(percentage=lambda d: (d["n"] / d["n"].sum() * 100).round(1))
)

cli_header("Final Matching Results After Manual Review")
cli_success(f"Accepted matches: {len(final_matches):,}")

# breakdown by decision type
rejected_count = int(
    all_results_summary.loc[all_results_summary["final_match_flag"] == "no_match", "n"].sum()
)
pending_count = int(
    all_results_summary.loc[
        all_results_summary["final_match_flag"] == "pending_review", "n"
    ].sum()
)

if rejected_count > 0:
    cli_danger(
        f"Rejected matches: {rejected_count:,} "
        "(will have NA MFL values in final dataset)"
    )

if pending_count > 0:
    cli_warning(f"Pending review: {pending_count:,} (require additional review)")

# show breakdown table
all_results_summary

### Step 9: Combine All Matched Results ----------------------------------------

#### Step 9.1: Combine results from all matching approaches --------------------

# create manual_matches from manual review results (both matched and unmatched)
manual_matches = final_incorp.assign(
    # for unmatched, set appropriate values
    hf_mfl=lambda d: np.where(d["reviewer_decision"] == "accept", d["hf_mfl"], pd.NA),
    hf_mfl_raw=lambda d: np.where(
        d["reviewer_decision"] == "accept", d["hf_mfl_raw"], pd.NA
    ),
    final_method=lambda d: np.where(
        d["reviewer_decision"] == "accept",
        "Manual Review - Accepted",
        "Manual Review - Rejected",
    ),
    score=lambda d: np.where(d["reviewer_decision"] == "accept", 100, 0),
)

cols_needed = [
    "adm0", "adm1", "adm2", "adm3", "hf_uid_new",
    "hf_dhis2_raw", "hf_dhis2", "hf_mfl_raw", "hf_mfl", "score", "final_method",
]
manual_matches = manual_matches[[c for c in cols_needed if c in manual_matches.columns]]

# combine results from all matching approaches
# this dataset contains ALL facilities: matched, unmatched, and rejected
final_facilities_all = pd.concat(
    [
        # step 4: stratified geographic matches (if performed)
        matched_dhis2_prepgeoname[[c for c in cols_needed if c in matched_dhis2_prepgeoname.columns]],
        # step 5: exact matches after standardization
        matched_dhis2[[c for c in cols_needed if c in matched_dhis2.columns]],
        # step 6-8: complete matches from fuzzy matching
        fuzzy_matches[[c for c in cols_needed if c in fuzzy_matches.columns]],
        # manual matching (if performed)
        manual_matches,
    ],
    ignore_index=True,
)

# collapse to one record per DHIS2 facility for downstream integration
cols_to_drop = [c for c in ["match_flag", "hf"] if c in final_facilities_all.columns]
final_facilities_one_per_hf = final_facilities_all.drop(columns=cols_to_drop, errors="ignore")

cli_info(f"All facilities processed (rows): {len(final_facilities_all)}")
cli_success(
    f"All facilities processed (distinct DHIS2 by hf_uid_new): "
    f"{final_facilities_all['hf_uid_new'].nunique()}"
)

# generate summary by matching method
matching_summary = (
    final_facilities_one_per_hf
    .groupby("final_method", as_index=False)
    .agg(
        n_matched=("hf_dhis2_raw", "count"),
        avg_score=("score", "mean"),
        min_score=("score", "min"),
        max_score=("score", "max"),
    )
)

# display summary
cli_header("Matching Results by Method")
matching_summary

#### Step 9.2: Create final integrated DHIS2-MFL dataset -----------------------

# start with DHIS2 as base to preserve all rows; attach hf_uid_new
dhis2_df_final = dhis2_df.merge(
    dhis2_map,
    on=["adm0", "adm1", "adm2", "adm3", "hf"],
    how="left",
)

# one-per-facility matching results: use hf_uid_new as the join key
final_match_per_hf = final_facilities_one_per_hf[["hf_uid_new", "hf_mfl_raw"]].copy()

# create the integrated dataset
final_dhis2_mfl_df = (
    dhis2_df_final
    .merge(final_match_per_hf, on="hf_uid_new", how="left")
    # join MFL attributes for matched facilities
    .merge(
        master_hf_df.drop(
            columns=["hf_mfl", "hf", "adm0", "adm1", "adm2", "adm3"],
            errors="ignore",
        ),
        on="hf_mfl_raw",
        how="left",
    )
)

# counts on distinct facilities via hf_uid_new
total = dhis2_df_final["hf_uid_new"].nunique()
with_mfl = final_match_per_hf.dropna(subset=["hf_mfl_raw"])["hf_uid_new"].nunique()
without_mfl = total - with_mfl

# validation summary
cli_header("DHIS2-MFL Integration Complete")
cli_success(f"Total DHIS2 facilities preserved: {total}")
cli_info(f"Facilities with MFL data: {with_mfl}")
cli_warning(f"Facilities without MFL match: {without_mfl}")
cli_info(f"Total rows (facility-months): {len(final_dhis2_mfl_df)}")

### Step 10: Final Checks ------------------------------------------------------

#### Step 10.1: Detecting one-to-many matches for review -----------------------

# detect MFL facilities matched to multiple DHIS2 facilities
one_to_many = (
    final_facilities_all
    .dropna(subset=["hf_mfl_raw"])  # only matched facilities
    .groupby("hf_mfl_raw")
    .filter(lambda g: len(g) > 1)  # MFL matched to >1 DHIS2
    .sort_values(["hf_mfl_raw", "hf_dhis2_raw"])
)

if len(one_to_many) > 0:
    cli_warning(
        f"One-to-many matches detected: {len(one_to_many)} DHIS2 facilities "
        f"share {one_to_many['hf_mfl_raw'].nunique()} MFL facilities"
    )

    # show sample
    print(
        one_to_many[["hf_dhis2_raw", "hf_mfl_raw", "score", "final_method"]]
        .head(6)
        .to_string(index=False)
    )
else:
    cli_success("No one-to-many matches detected")

#### Step 10.2: Unmatched facilities analysis ----------------------------------

# analyze unmatched facilities by reason
unmatched_analysis = (
    final_facilities_all
    .loc[final_facilities_all["hf_mfl_raw"].isna()]
    .groupby("final_method", as_index=False)
    .size()
    .rename(columns={"size": "count"})
    .sort_values("count", ascending=False)
)

# display unmatched analysis
cli_header("Unmatched Facilities Analysis")
cli_info(f"Total unmatched facilities: {unmatched_analysis['count'].sum()}")

# show breakdown by reason
for _, row in unmatched_analysis.iterrows():
    cli_warning(f"{row['final_method']}: {row['count']} facilities")

# create detailed unmatched facilities export
unmatched_detailed = (
    final_facilities_one_per_hf
    .loc[final_facilities_one_per_hf["hf_mfl_raw"].isna()]
    [["hf_dhis2_raw", "adm1", "adm2", "adm3", "final_method", "score"]]
    .assign(
        potential_reason=lambda d: np.select(
            [
                d["final_method"].isna(),
                d["final_method"] == "Manual Review - Rejected",
                (d["score"] > 0) & (d["score"] < 50),
            ],
            [
                "Never matched - possible new facility",
                "Rejected in review - possible different facility",
                "Low similarity - possible data quality issue",
            ],
            default="Unmatched - needs investigation",
        )
    )
    .sort_values(["adm1", "adm2", "adm3", "hf_dhis2_raw"])
)

# export for SNT team review
unmatched_export_dir = Path(here("03_outputs"))
unmatched_export_dir.mkdir(parents=True, exist_ok=True)
unmatched_detailed.to_csv(
    unmatched_export_dir / "unmatched_facilities_for_snt_review.csv", index=False
)

cli_success(
    f"Exported {len(unmatched_detailed)} unmatched facilities for SNT team review"
)
cli_info("File: outputs/unmatched_facilities_for_snt_review.csv")

### Step 11: Save Final Datasets -----------------------------------------------

# save main integrated DHIS2-MFL dataset
output_dir = Path(here("03_outputs"))
output_dir.mkdir(parents=True, exist_ok=True)

final_dhis2_mfl_df.to_excel(
    output_dir / "final_dhis2_mfl_integrated.xlsx", index=False
)

# save matching results summary
final_facilities_all.to_excel(
    output_dir / "facility_matching_results.xlsx", index=False
)

if len(unmatched_detailed) > 0:
    unmatched_detailed.to_csv(
        output_dir / "unmatched_dhis2_facilities.csv", index=False
    )

if len(one_to_many) > 0:
    one_to_many.to_csv(
        output_dir / "one_to_many_matches_for_review.csv", index=False
    )

# summary of saved files
cli_header("Final Datasets Saved")
cli_success("Main integrated dataset: final_dhis2_mfl_integrated.xlsx")
cli_success("Matching results: facility_matching_results.xlsx")
cli_info(f"Unmatched facilities: {len(unmatched_detailed)} exported")
cli_info(f"One-to-many matches: {len(one_to_many)} exported")