Health facility coordinates and point data

Beginner

Overview

Many of the most important features in the SNT process are best represented as points rather than areas. A point is a single location on the Earth’s surface, described by a pair of coordinates (longitude and latitude). Health facilities, community health workers, villages and settlements, survey clusters, and the locations of individual cases or events can all be stored, mapped, and analyzed as point data.

This page uses health facility coordinate mapping as a worked example of how to work with point coordinate data. The same code can be adapted to any other kind of point data, because the underlying steps (cleaning coordinates, converting to a spatial object, and plotting over administrative boundaries) are the same regardless of what the points represent. Mapping indicators at their point coordinates, rather than aggregating them to a district, also reveals spatial patterns at a much finer scale.

Key Concepts

What Is Point Coordinate Data?

Point coordinate data records the position of individual features as a pair of numbers, usually longitude (x) and latitude (y). Each row in the table is one location, accompanied by attributes that describe it, such as a facility name, a unique identifier, the facility type, or an indicator value.

This differs from the polygon data stored in shapefiles, which describes areas (the outline of a district or chiefdom) rather than single locations. The two are complementary: in this workflow we plot facility points on top of administrative polygons so that each facility can be read in its geographic and administrative context. Point data usually arrives as a table (an Excel or CSV file with coordinate columns), whereas boundaries arrive as shapefiles.

Types of Point Data in SNT

The same workflow applies to any feature that can be located by a single coordinate pair. Common examples in SNT include:

Health facilities, from the master facility list (MFL) or DHIS2
Community health workers, mapped to their assigned area
Villages and settlements, often used for population or accessibility analysis
Survey clusters from DHS or MIS surveys (note that cluster coordinates are deliberately displaced for confidentiality, so they should not be treated as exact)
Case or event locations from surveillance, where each record is a single occurrence

Where Do Coordinates Come From?

Coordinates reach the SNT team from several sources, and knowing the source helps anticipate quality issues:

GPS collection in the field, for example during a facility census, usually the most accurate
Master facility lists and DHIS2 organisation unit registries, where coordinates can vary in age and quality
Gazetteers and geocoding, where a place name is matched to a coordinate, which can be approximate

Consult with SNT team

Coordinate data should be reviewed and confirmed with the SNT team before it is used for analysis. Confirm the source of the coordinates, the column that uniquely identifies each facility, and whether facility type classifications are standardized.

Coordinate Reference Systems (CRS)

The coordinate reference system (CRS) defines how geographic locations are represented numerically. For comprehensive background on coordinate reference systems, projections, and spatial data management principles, see the Spatial data overview.

Two coordinate reference systems matter most for point data:

WGS84 (EPSG:4326) uses longitude and latitude in decimal degrees. It is the standard for mapping and for GPS data, and it is how health facility coordinates are almost always shared in the SNT process.
UTM (EPSG:326XX) uses meters within a single zone. It is more appropriate when distances or areas need to be measured accurately, for example when estimating catchment areas or distance to the nearest facility.

Unlike shapefile work, where WGS84 is usually sufficient throughout, point analysis sometimes calls for UTM. A common pattern is to keep the data in WGS84 for mapping and reproject to the local UTM zone only for distance or area calculations. Whichever CRS we choose, every layer in a map must share the same CRS, or the layers will not align.

In the SNT process, health facility data is typically shared by the SNT team in Excel or CSV format, with coordinates provided as longitude and latitude. The geographic coordinate system (WGS84 / EPSG:4326) is therefore used for mapping unless the SNT team specifies otherwise.

Coordinate Formats

Example coordinates in WGS84:

Latitude: 8.4820° (North of the equator)
Longitude: -13.2334° (West of the Prime Meridian)
Location: This represents a point in Freetown, Sierra Leone
Format: (lat, lon) = (8.48206, -13.23345)

Note

Note: Another coordinate format called DMS (Degrees-Minutes-Seconds) exists but is less often used in health data. DMS format looks like 8°28’55”N, 13°14’00”W and is mainly found in traditional marine navigation, aviation, and land surveying applications. For more details, refer to the documentation at: Merging Spatial Vectors

Coordinate Precision

For coordinates in decimal form, the number of decimal places indicates how precisely a facility is located. More decimal places means a more precise position.

5+ decimal places (~1 meter precision): Ideal for facility-level mapping
4 decimal places (~11 meter precision): Acceptable for most SNT applications
3 decimal places (~111 meter precision): May cause misplacement in dense urban areas
2 or fewer decimal places: Consult SNT team for more precise coordinates

Common Issues with Point Coordinate Data

Coordinate data often arrives with quality problems that must be resolved before mapping. Each issue below is described the same way: what it looks like, why it matters, how to detect it, and how to resolve it. Step 3 shows detection code for these checks.

Missing or empty coordinates

What it looks like: a facility row has a blank or NA longitude or latitude.
Why it matters: the facility cannot be placed on a map and is silently dropped by most plotting functions.
How to detect: count rows where longitude or latitude is NA or empty.
How to resolve: follow up with the SNT team to recover the coordinates. If they cannot be found, filter the rows out before mapping and keep a record of those removed.

Malformed coordinate text

What it looks like: values that are not clean decimals, such as 8,4820 (comma decimal), -13.2334°, 13..26, a stray ? in place of a decimal point, or placeholders like "NA", "-", or "none".
Why it matters: these strings fail to parse as numbers, so the facility is lost or mislocated.
How to detect: attempt numeric conversion and inspect the values that become NA; scan for non-numeric characters.
How to resolve: normalize the text first (trim spaces, convert comma decimals to points, strip stray symbols and unicode minus signs), then convert to numeric.

Degrees-minutes-seconds (DMS) format

What it looks like: 8°28'55"N, 13°14'00"W instead of decimal degrees.
Why it matters: DMS strings are not numeric and will not map without conversion.
How to detect: look for the degree, minute, and second symbols or the N/S/E/W hemisphere markers.
How to resolve: convert to decimal degrees, for example with the parzer package (parzer::parse_lon() / parzer::parse_lat()) or the measurements package.

Coordinates out of range

What it looks like: longitude outside -180 to 180 or latitude outside -90 to 90 (for example 181 or -200).
Why it matters: the point cannot exist on Earth and signals a data entry or unit error.
How to detect: check each value against the valid WGS84 bounds.
How to resolve: set out-of-range values to NA and flag them for the SNT team.

Flipped latitude and longitude

What it looks like: the longitude value sits in the latitude column, or vice versa, placing the facility in the wrong location and sometimes the wrong hemisphere.
Why it matters: the facility maps to the wrong place while still looking like a valid coordinate.
How to detect: test whether the point falls inside the country boundary; if it does not but the swapped point does, the columns are likely reversed.
How to resolve: confirm the column order with the data source, then swap the values where the swap places the point inside the country.

Null-island (0, 0) placeholders

What it looks like: coordinates of exactly (0, 0), a point in the Gulf of Guinea off West Africa.
Why it matters: (0, 0) is a common default written when no real coordinate is available, so it is almost never a true facility location.
How to detect: flag rows where both longitude and latitude equal zero.
How to resolve: treat as missing and re-source from the SNT team.

Duplicate coordinates

What it looks like: the same facility identifier appears more than once with the same coordinates.
Why it matters: duplicate records inflate counts and bias any per-facility summary.
How to detect: look for rows that repeat on the facility identifier and coordinate pair.
How to resolve: keep one record per facility, after confirming the duplicates are genuine repeats rather than distinct facilities that happen to share a name.

Shared locations and centroid placeholders

What it looks like: several different facilities share the exact same longitude and latitude.
Why it matters: when a GPS reading is unavailable, the centroid of the district or chiefdom is often used as a stand-in, collapsing many facilities onto one point and distorting coverage and distance analysis.
How to detect: count facilities per unique coordinate pair; any pair shared by several facilities is suspect, especially when they share an admin unit.
How to resolve: flag shared locations for the SNT team rather than dropping them. Genuinely co-located facilities can be kept; centroid placeholders should be re-sourced.

Points outside the country boundary

What it looks like: a facility plots in the ocean or in a neighbouring country.
Why it matters: the point is clearly wrong and distorts the map extent and any spatial join to administrative units.
How to detect: test whether each point falls within the national (adm0) boundary.
How to resolve: flag points outside the boundary for review; they usually stem from flipped, imprecise, or mistyped coordinates.

Low precision

What it looks like: coordinates with only a few decimal places, such as 8.4, -11.7 instead of 8.412356, -11.736981.
Why it matters: each missing decimal place widens the location’s uncertainty (see Coordinate Precision).
How to detect: count the decimal places in each value and compare against a minimum of 4 or 5.
How to resolve: request coordinates with at least 5 decimal places from the SNT team.

CRS or projection mismatch

What it looks like: the facility points and the boundary shapefile use different coordinate reference systems, so the layers do not overlay.
Why it matters: misaligned layers make the map meaningless even when each layer is individually correct.
How to detect: compare the CRS of each layer (see Coordinate Reference Systems (CRS)).
How to resolve: reproject all layers to a common CRS (WGS84 for mapping). This is handled in Step 2.

Combined coordinate column

Health facility data sometimes stores both coordinates in a single column rather than in separate longitude and latitude columns. The table below shows the same three facilities written both ways for comparison: in practice the data will arrive with either the longitude and latitude pair or the combined coordinates column, not both.

facility_name	type	longitude	latitude	coordinates
Central Hospital	HOSPITAL	-13.2334	8.4820	8.4820, -13.2334
Health Post A	CHC	-13.1456	8.5123	8.5123; -13.1456
District Clinic	CLINIC	-13.0987	8.4567	8.4567 -13.0987

Notice that the combined column also varies in its separator (comma, semicolon, or whitespace), which is what Step 3.1 normalizes before splitting.

How to resolve: split the combined column into separate longitude and latitude columns before analysis. This is handled in Step 3.1.

Step-by-Step

This section walks through working with point coordinate data to create health facility maps with multiple visualization approaches. The example uses Sierra Leone health facility data with administrative boundaries.

Note on demonstration data

This page uses Sierra Leone health facility data: the master facility list (mfl_hfs.xlsx) for facility locations and types, and the merged DHIS2-MFL dataset (final_dhis2_mfl_integrated.rds, produced on the Fuzzy matching of names across datasets page) for the test positivity examples. The administrative boundaries are the processed adm2 and adm3 spatial objects. Replace these with the project’s facility data and boundaries.

Before working with health facilities coordinates data, verify with the SNT team the following:

Coordinate accuracy and completeness (no missing latitude or longitude values)
Facility names and identifiers are consistent and unique
Facility type classifications are standardized

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Install and Load Packages

Install and load the necessary packages for handling spatial data, reading various file formats, data manipulation, and advanced visualization for health facility mapping.

Show the code

# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load all required packages using pacman
pacman::p_load(
  sf,           # for reading and handling spatial data
  readxl,       # for reading Excel files
  readr,        # for reading CSVs
  dplyr,        # for data manipulation
  stringr,      # for string cleaning and formatting
  ggplot2,      # for creating plots
  RColorBrewer, # for color palettes
  scales,       # for formatting scales and labels
  here,         # for cross-platform file paths
  tidyr         # for data reshaping and coordinate splitting
)

# shared map theme so every map on this page has a consistent look
snt_map_theme <- function() {
  ggplot2::theme_void() +
    ggplot2::theme(
      legend.position = "bottom",
      legend.direction = "horizontal",
      legend.title.position = "top",
      legend.text.position = "bottom",
      legend.title = ggplot2::element_text(
        face = "bold", size = 10, hjust = 0.5,
        margin = ggplot2::margin(b = 6)
      ),
      legend.box.margin = ggplot2::margin(t = 8),
      legend.key.width = grid::unit(0.9, "cm"),
      strip.text = ggplot2::element_text(
        face = "bold", size = 10,
        margin = ggplot2::margin(t = 2, b = 6, l = 4, r = 4)
      ),
      panel.spacing = grid::unit(4, "pt"),
      plot.title = ggplot2::element_text(
        face = "bold", size = 14, margin = ggplot2::margin(b = 8)
      ),
      plot.subtitle = ggplot2::element_text(
        size = 11, margin = ggplot2::margin(b = 10)
      ),
      plot.margin = ggplot2::margin(t = 5, r = 5, b = 5, l = 5)
    )
}

To adapt the code:

Do not modify anything in the code above

Terminal Installation Required

Install packages in the terminal, if not already installed. For help installing packages, refer to the Getting Started page.

pip install pandas geopandas pyreadr matplotlib numpy pyprojroot openpyxl pyarrow

Show the code

from pathlib import Path

import geopandas as gpd
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
import pyreadr
from matplotlib.cm import ScalarMappable
from pyprojroot import here


def read_rds(path):
    """Read a single-object RDS file as a pandas object."""
    result = pyreadr.read_r(str(path))
    return next(iter(result.values()))


def cli_header(message):
    print(f"\n{message}")


def cli_info(message):
    print(f"INFO: {message}")


def cli_success(message):
    print(f"SUCCESS: {message}")


def cli_warning(message):
    print(f"WARNING: {message}")


def cli_danger(message):
    print(f"ERROR: {message}")


def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )


def finish_map(ax, title=None, subtitle=None):
    """Apply the shared static map styling used on this page."""
    if title and subtitle:
        ax.set_title(
            f"{title}\n{subtitle}", loc="left", fontsize=14, fontweight="bold"
        )
    elif title:
        ax.set_title(title, loc="left", fontsize=14, fontweight="bold")
    ax.set_axis_off()


def save_figure(fig, filename, width, height, dpi=300):
    """Save a matplotlib figure with dimensions matching the R examples."""
    Path(filename).parent.mkdir(parents=True, exist_ok=True)
    fig.set_size_inches(width, height)
    fig.savefig(filename, dpi=dpi, bbox_inches="tight")


def show_table(df, n=10, caption=None):
    """Render a compact scrollable HTML table matching the R show_table helper.

    Produces a <table class="out-table"> wrapped in <div class="out-scroll">
    so the output matches the R kable-based renderer exactly.
    """
    from IPython.display import display, HTML
    html = df.head(n).to_html(index=False, border=0, classes="out-table")
    if caption:
        html = f"<caption>{caption}</caption>" + html
    display(HTML(f'<div class="out-scroll">{html}</div>'))

To adapt the code:

Do not modify anything in the code above

Step 2: Load and Prepare Input Data

This step loads the processed administrative boundaries and the health facility master list, mirroring the inputs used in Basic shapefile use and visualization. We load the cleaned, processed adm3 (chiefdom) and adm2 (district) spatial objects produced in Shapefile management and customization, so no CRS transformation or renaming is required here. The health facility coordinates are read from the processed master facility list.

If the points data is stored in shapefile format rather than a table, follow Basic shapefile use and visualization to load it, then continue from the visualization steps. Make sure the points data and the administrative boundaries share the same CRS.

Show the code

# set up the path to the processed administrative boundaries
spat_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1a_admin_boundaries"
)

# load processed chiefdom (adm3) spatial object
gdf <- readRDS(
  here::here(spat_path, "processed", "sle_spatial_adm3_2021.rds")
) |>
  sf::st_as_sf()

# load processed district (adm2) spatial object
adm2_gdf <- readRDS(
  here::here(spat_path, "processed", "sle_spatial_adm2_2021.rds")
) |>
  sf::st_as_sf()

# set up the path to the processed health facility data
hf_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed"
)

# load the health facility master list
hf_data <- readxl::read_excel(
  here::here(hf_path, "mfl_hfs.xlsx")
)

# inspect the loaded data
cli::cli_h3("Administrative boundary columns")
print(names(gdf))
cli::cli_h3("Health facility data columns")
print(names(hf_data))
cli::cli_h3("Sample of health facility data")
print(dplyr::slice_head(hf_data, n = 3))

Output

── Administrative boundary columns

[1] "adm0"          "adm1"          "adm2"          "adm3"         
[5] "geometry_hash" "geometry"

── Health facility data columns

[1] "adm1"       "adm3"       "community"  "hf"         "type"      
[6] "Ownership"  "functional" "w_lat"      "w_long"

── Sample of health facility data

adm1	adm3	community	hf	type	Ownership	functional	w_lat	w_long
Bombali	Mara	Kiampkakolo	Kiampkakolo MCHP	MCHP	Government	Functional	8.610318	-12.20295
Bombali	Bombali Sebora	Station Road/Vincent Kanu Road	Loreto Clinic	CLINIC	Faith Based	Functional	8.886622	-12.04387
Bombali	Bombali Sebora	Makama Road	Makeni Govt. Hospital	HOSPITAL	Government	Functional	8.871770	-12.05627

To adapt the code:

Lines 2–6: Update spat_path to the folder where the processed spatial .rds files are stored
Lines 9–12 and 15–18: Replace the .rds filenames with the processed adm3 and adm2 filenames
Lines 21–26 and 29–31: Update hf_path and the master facility list filename to match the data

Show the code

# set up the path to the processed administrative boundaries
spat_path = Path(
    here(
        "01_data/1.1_foundational/"
        "1.1a_admin_boundaries/processed"
    )
)

# load processed chiefdom (adm3) spatial object
gdf = gpd.read_file(spat_path / "sle_spatial_adm3_2021.geojson")

# load processed district (adm2) spatial object
adm2_gdf = gpd.read_file(spat_path / "sle_spatial_adm2_2021.geojson")

# set up the path to the processed health facility data
hf_path = Path(
    here(
        "01_data/1.1_foundational/"
        "1.1c_health_facilities/processed"
    )
)

# load the health facility master list
hf_data = pd.read_excel(hf_path / "mfl_hfs.xlsx")

# inspect the loaded data
cli_header("Administrative boundary columns")
print(list(gdf.columns))
cli_header("Health facility data columns")
print(list(hf_data.columns))
cli_header("Sample of health facility data")
print(hf_data.head(3))

Output


Administrative boundary columns

['adm0', 'adm1', 'adm2', 'adm3', 'geometry_hash', 'geometry']


Health facility data columns

['adm1', 'adm3', 'community', 'hf', 'type', 'Ownership', 'functional', 'w_lat', 'w_long']


Sample of health facility data

<IPython.core.display.HTML object>

To adapt the code:

Lines 2–7: Update spat_path to the folder where the processed spatial files are stored
Lines 10 and 13: Replace the GeoJSON filenames with the processed adm3 and adm2 filenames
Lines 16–21 and 24: Update hf_path and the master facility list filename to match the data

Step 3: Define, Validate, and Clean Coordinates

Step 3.1: Split coordinates column (if needed)

This step is only required if longitude and latitude are stored together in a single column rather than in separate columns (see the Combined Coordinate Column issue above). If the data already has separate longitude and latitude columns, skip ahead to Step 3.2.

When coordinates arrive combined, the separator between the two values is rarely consistent: some rows use a comma, others a semicolon, and others a single space. The code below first normalizes all of these separators to a single comma and then splits the column into two numeric fields named w_lat and w_long. Doing the normalization in one pass avoids brittle row-by-row parsing and makes the splitting predictable.

Show the code

# this step is only required if longitude and latitude are combined
# in a single column. if columns are separate, skip this step.

# normalize separators (; or space) to a comma, then split into two
# numeric columns
hf_data <- hf_data |>
  dplyr::mutate(
    coordinates = stringr::str_replace_all(coordinates, "[; ]+", ",")
  ) |>
  tidyr::separate(
    col = "coordinates",
    into = c("w_lat", "w_long"),
    sep = ",",
    convert = TRUE
  )

cli::cli_alert_success("Coordinates split into separate columns")
cli::cli_h3("Sample of split coordinates")
print(utils::head(dplyr::select(hf_data, w_lat, w_long), 3))

To adapt the code:

Lines 8 and 11: Replace "coordinates" with the combined coordinate column name
Line 12: Update "w_lat" and "w_long" with the preferred column names
Skip this entire step if the data already has separate latitude and longitude columns

Show the code

# this step is only required if longitude and latitude are combined
# in a single column. if columns are separate, skip this step.

# replace separators (; or space) with a comma
hf_data["coordinates"] = (
    hf_data["coordinates"].str.replace(r"[; ]+", ",", regex=True)
)

# split the combined column into two columns
coord_split = hf_data["coordinates"].str.split(",", expand=True)
hf_data = hf_data.assign(
    w_lat=pd.to_numeric(coord_split[0], errors="coerce"),
    w_long=pd.to_numeric(coord_split[1], errors="coerce"),
)

cli_success("Coordinates split into separate columns")
cli_header("Sample of split coordinates")
print(hf_data[["w_lat", "w_long"]].head(3).to_string(index=False))

To adapt the code:

Lines 5 and 10: Replace "coordinates" with the combined coordinate column name
Lines 11–13: Update "w_lat" and "w_long" with the preferred column names
Skip this entire step if the data already has separate latitude and longitude columns

Critical: verify latitude and longitude order

A common error is mixing up latitude and longitude. The standard format is usually (Latitude, Longitude).

Latitude (Y-axis): Range is -90° to 90°. Represents North-South position.
Longitude (X-axis): Range is -180° to 180°. Represents East-West position.

Always confirm the order with the data source or the SNT team. Incorrect order will plot facilities in the wrong hemisphere!

Step 3.2: Detect coordinate quality issues

Before cleaning, run a set of checks to see which of the problems described in Common Issues with Point Coordinate Data are present in the data. The checks below count missing, out-of-range, null-island, low-precision, duplicate, shared-location, out-of-boundary, and flipped coordinates. The flip and boundary checks compare each point against the national (adm0) boundary.

How we detect flipped coordinates

A common error is swapping latitude and longitude, which can place a facility in the wrong country or even the wrong hemisphere. To detect it, we test each point against the national boundary:

Check whether the point, as given, falls inside the country.
Check whether the point with its longitude and latitude swapped falls inside instead.

If a point is outside the boundary as-is but inside once swapped, its columns were most likely reversed. Using the boundary rather than a summary statistic keeps the test reliable for countries with a wide geographic spread.

Show the code

# load the national boundary (adm0) for the spatial checks
adm0 <- readRDS(
  here::here(spat_path, "processed", "sle_spatial_adm0_2021.rds")
) |>
  sf::st_as_sf()

# helper: count decimal places in a coordinate value
count_decimals <- function(value) {
  if (is.na(value)) return(0L)
  text <- format(abs(value), scientific = FALSE, trim = TRUE)
  if (!grepl(".", text, fixed = TRUE)) return(0L)
  nchar(sub("0+$", "", strsplit(text, ".", fixed = TRUE)[[1]][2]))
}

# define longitude (x) and latitude (y), flag the valid in-range rows,
# and pre-compute the decimal precision of each coordinate
hf_data <- hf_data |>
  dplyr::mutate(
    x = w_long,
    y = w_lat,
    valid = !is.na(x) & !is.na(y) &
      dplyr::between(x, -180, 180) &
      dplyr::between(y, -90, 90),
    lon_dp = vapply(x, count_decimals, integer(1)),
    lat_dp = vapply(y, count_decimals, integer(1))
  )

# missing, out-of-range, null-island (0, 0), and low-precision counts
quality_counts <- hf_data |>
  dplyr::summarise(
    n_missing = sum(is.na(x) | is.na(y)),
    n_out_range = sum(!is.na(x) & !is.na(y) & !valid),
    n_zero = sum(x == 0 & y == 0, na.rm = TRUE),
    n_imprecise = sum(valid & (lon_dp < 4 | lat_dp < 4))
  )
n_missing <- quality_counts$n_missing
n_out_range <- quality_counts$n_out_range
n_zero <- quality_counts$n_zero
n_imprecise <- quality_counts$n_imprecise

# duplicate facility + coordinate, and shared locations
valid_data <- hf_data |> dplyr::filter(valid)
n_duplicate <- sum(duplicated(
  paste(valid_data$hf, valid_data$x, valid_data$y)
))
coord_key <- paste(valid_data$x, valid_data$y)
n_shared <- sum(coord_key %in% coord_key[duplicated(coord_key)])

# build an sf of the valid points for the spatial checks
hf_points <- valid_data |>
  sf::st_as_sf(
    coords = c("x", "y"),
    crs = 4326,
    remove = FALSE
  )

# points outside the national boundary
inside <- lengths(suppressWarnings(sf::st_within(hf_points, adm0))) > 0
n_outside <- sum(!inside)

# likely flipped lon/lat: outside as-is, but inside when swapped
swapped <- hf_points |>
  sf::st_drop_geometry() |>
  dplyr::transmute(x = y, y = x) |>
  sf::st_as_sf(coords = c("x", "y"), crs = 4326)
inside_swapped <- lengths(
  suppressWarnings(sf::st_within(swapped, adm0))
) > 0
n_flipped <- sum(!inside & inside_swapped)

# report the issue counts
cli::cli_h2("Coordinate quality summary ({nrow(hf_data)} facilities)")
cli::cli_alert_info("Missing coordinates: {n_missing}")
cli::cli_alert_info("Out-of-range coordinates: {n_out_range}")
cli::cli_alert_info("Null-island (0, 0): {n_zero}")
cli::cli_alert_info("Low precision (< 4 dp): {n_imprecise}")
cli::cli_alert_info("Duplicate facility and coordinate: {n_duplicate}")
cli::cli_alert_info("Records at shared locations: {n_shared}")
cli::cli_alert_info("Outside the national boundary: {n_outside}")
cli::cli_alert_info("Likely flipped lon/lat: {n_flipped}")

Output

── Coordinate quality summary (1556 facilities) ──

ℹ Missing coordinates: 25

ℹ Out-of-range coordinates: 0

ℹ Null-island (0, 0): 0

ℹ Low precision (< 4 dp): 30

ℹ Duplicate facility and coordinate: 0

ℹ Records at shared locations: 21

ℹ Outside the national boundary: 0

ℹ Likely flipped lon/lat: 0

To adapt the code:

Line 3: Update the adm0 filename to the national boundary file
Lines 19–20: Update w_long and w_lat to the longitude and latitude columns
Line 34: Adjust the minimum decimal places used for the precision check

Plotting the points against the national boundary gives a quick visual confirmation that they fall where they should, and makes any outside-boundary or flipped points easy to spot.

Show the code

# flag each valid point by whether it falls inside the national boundary
hf_points <- hf_points |>
  dplyr::mutate(
    location = dplyr::if_else(inside, "Inside boundary", "Outside boundary")
  )

# map the points over the national boundary
coord_check_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = adm0, fill = "white", color = "black") +
  ggplot2::geom_sf(
    data = hf_points,
    ggplot2::aes(color = location),
    size = 0.8, alpha = 0.7
  ) +
  ggplot2::scale_color_manual(
    values = c(
      "Inside boundary" = "#2c7fb8",
      "Outside boundary" = "#e31a1c"
    ),
    name = NULL
  ) +
  snt_map_theme() +
  ggplot2::labs(
    title = "Coordinate check: facilities against the national boundary"
  )

print(coord_check_map)

Output

Show the code

# load the national boundary (adm0) for the spatial checks
adm0 = gpd.read_file(
    Path(
        here(
            "01_data/1.1_foundational/"
            "1.1a_admin_boundaries/processed/"
            "sle_spatial_adm0_2021.geojson"
        )
    )
)

# helper: count decimal places in a coordinate value
def count_decimals(value):
    if pd.isna(value):
        return 0
    text = f"{abs(value):.10f}".rstrip("0")
    if "." not in text:
        return 0
    return len(text.split(".")[1])

# define longitude (x) and latitude (y)
hf_data = hf_data.assign(x=hf_data["w_long"], y=hf_data["w_lat"])

# rows with valid, in-range coordinates (used for the spatial checks)
valid = (
    hf_data["x"].notna() & hf_data["y"].notna()
    & hf_data["x"].between(-180, 180)
    & hf_data["y"].between(-90, 90)
)

# missing, out-of-range, and null-island (0, 0)
n_missing = (hf_data["x"].isna() | hf_data["y"].isna()).sum()
n_out_range = (
    hf_data["x"].notna() & hf_data["y"].notna() & ~valid
).sum()
n_zero = (
    (hf_data["x"] == 0) & (hf_data["y"] == 0)
).sum()

# low precision (fewer than 4 decimal places)
lon_dp = hf_data["x"].apply(count_decimals)
lat_dp = hf_data["y"].apply(count_decimals)
n_imprecise = int((valid & ((lon_dp < 4) | (lat_dp < 4))).sum())

# duplicate facility + coordinate, and shared locations
valid_data = hf_data.loc[valid].copy()
keys = (
    valid_data["hf"].astype(str) + "_"
    + valid_data["x"].astype(str) + "_"
    + valid_data["y"].astype(str)
)
n_duplicate = int(keys.duplicated().sum())
coord_key = (
    valid_data["x"].astype(str) + "_" + valid_data["y"].astype(str)
)
n_shared = int(
    coord_key.isin(coord_key[coord_key.duplicated()]).sum()
)

# build a GeoDataFrame of the valid points for the spatial checks
if adm0.crs is None:
    adm0 = adm0.set_crs("EPSG:4326")
hf_points = gpd.GeoDataFrame(
    valid_data,
    geometry=gpd.points_from_xy(valid_data["x"], valid_data["y"]),
    crs="EPSG:4326",
)
if hf_points.crs != adm0.crs:
    hf_points = hf_points.to_crs(adm0.crs)

# points outside the national boundary
inside = hf_points.within(adm0.union_all())
n_outside = int((~inside).sum())

# likely flipped lon/lat: outside as-is, but inside when swapped
swapped = gpd.GeoDataFrame(
    geometry=gpd.points_from_xy(hf_points["y"], hf_points["x"]),
    crs="EPSG:4326",
)
inside_swapped = swapped.within(adm0.union_all())
n_flipped = int((~inside & inside_swapped).sum())

# report the issue counts
cli_header(
    f"Coordinate quality summary ({len(hf_data)} facilities)"
)
cli_info(f"Missing coordinates: {n_missing}")
cli_info(f"Out-of-range coordinates: {n_out_range}")
cli_info(f"Null-island (0, 0): {n_zero}")
cli_info(f"Low precision (< 4 dp): {n_imprecise}")
cli_info(f"Duplicate facility and coordinate: {n_duplicate}")
cli_info(f"Records at shared locations: {n_shared}")
cli_info(f"Outside the national boundary: {n_outside}")
cli_info(f"Likely flipped lon/lat: {n_flipped}")

Show the code

# flag each valid point by whether it falls inside the national boundary
hf_points = hf_points.assign(
    location=inside.map(
        {True: "Inside boundary", False: "Outside boundary"}
    )
)

# map the points over the national boundary
color_map = {"Inside boundary": "#2c7fb8", "Outside boundary": "#e31a1c"}
fig, ax = plt.subplots(figsize=(10, 8))
adm0.plot(ax=ax, facecolor="white", edgecolor="black", linewidth=0.8,
          aspect="equal")
for label, color in color_map.items():
    subset = hf_points.loc[hf_points["location"] == label]
    subset.plot(
        ax=ax, color=color, markersize=3, alpha=0.7, label=label,
        aspect="equal",
    )
ax.legend(
    loc="lower center", bbox_to_anchor=(0.5, -0.05),
    frameon=False, fontsize=9, ncol=2,
)
finish_map(
    ax,
    title="Coordinate check: facilities against the national boundary",
)

Output


Coordinate quality summary (1556 facilities)

INFO: Missing coordinates: 25

INFO: Out-of-range coordinates: 0

INFO: Null-island (0, 0): 0

INFO: Low precision (< 4 dp): 2

INFO: Duplicate facility and coordinate: 0

INFO: Records at shared locations: 21

INFO: Outside the national boundary: 0

INFO: Likely flipped lon/lat: 0

To adapt the code:

Line 7: Update the adm0 filename to the national boundary file
Line 22: Update w_long and w_lat to the longitude and latitude columns
Line 43: Adjust the minimum decimal places used for the precision check

Output

Step 3.3: Clean coordinates

Step 3.2 surfaced the quality issues; this step acts on the most blocking ones. Facilities with missing coordinates or values outside the valid WGS84 range cannot be mapped at all and are silently dropped by most plotting functions, so we filter them out into a separate review table before producing the clean dataset. The review table is kept intentionally: these facilities are not deleted, they are set aside so the SNT team can re-source the coordinates rather than lose the facility from the master list.

The rest of the issues from Step 3.2 (low precision, duplicates, shared locations, points outside the country boundary, likely flips) are typically resolved in collaboration with the SNT team rather than dropped automatically, because each requires a judgement about whether the underlying facility is real. The output prints the number of rows kept, the number flagged for review, and the longitude and latitude ranges of the clean data to confirm the bounds match what is expected for the country.

Show the code

# define longitude (x) and latitude (y) coordinates
hf_data <- hf_data |>
  dplyr::mutate(
    x = w_long,
    y = w_lat
  )

# identify facilities with invalid coordinates for follow-up
facilities_to_review <- hf_data |>
  dplyr::filter(
    is.na(x) | is.na(y) |
      !dplyr::between(x, -180, 180) |
      !dplyr::between(y, -90, 90)
  )

# filter to keep only valid, in-range coordinates
facilities_clean <- hf_data |>
  dplyr::filter(
    !is.na(x),
    !is.na(y),
    dplyr::between(x, -180, 180),
    dplyr::between(y, -90, 90)
  )

# print cleaning results
cli::cli_h2("Data cleaning results")
cli::cli_alert_info("Original health facilities: {nrow(hf_data)}")
cli::cli_alert_info("Clean health facilities: {nrow(facilities_clean)}")
cli::cli_alert_info(
  "Removed facilities: {nrow(hf_data) - nrow(facilities_clean)}"
)
cli::cli_alert_info(
  "Facilities flagged for review: {nrow(facilities_to_review)}"
)
cli::cli_alert_info(
  "Longitude range: {min(facilities_clean$x)} to {max(facilities_clean$x)}"
)
cli::cli_alert_info(
  "Latitude range: {min(facilities_clean$y)} to {max(facilities_clean$y)}"
)

Output

── Data cleaning results ──

ℹ Original health facilities: 1556

ℹ Clean health facilities: 1531

ℹ Removed facilities: 25

ℹ Longitude range: -13.2920704 to -10.3074397999999

ℹ Latitude range: 6.9679198 to 9.9733458

To adapt the code:

Lines 4–5: Update w_long and w_lat to match the longitude and latitude column names

Show the code

# define longitude (x) and latitude (y) coordinates
hf_data = hf_data.assign(x=hf_data["w_long"], y=hf_data["w_lat"])

# identify facilities with invalid coordinates for follow-up
facilities_to_review = hf_data.loc[
    hf_data["x"].isna()
    | hf_data["y"].isna()
    | (hf_data["x"] < -180) | (hf_data["x"] > 180)
    | (hf_data["y"] < -90) | (hf_data["y"] > 90)
]

# filter to keep only valid, in-range coordinates
facilities_clean = hf_data.loc[
    hf_data["x"].notna()
    & hf_data["y"].notna()
    & hf_data["x"].between(-180, 180)
    & hf_data["y"].between(-90, 90)
]

# print cleaning results
cli_header("Data cleaning results")
cli_info(f"Original health facilities: {len(hf_data)}")
cli_info(f"Clean health facilities: {len(facilities_clean)}")
cli_info(
    f"Removed facilities: {len(hf_data) - len(facilities_clean)}"
)
cli_info(
    f"Facilities flagged for review: {len(facilities_to_review)}"
)
cli_info(
    f"Longitude range: {facilities_clean['x'].min()} to "
    f"{facilities_clean['x'].max()}"
)
cli_info(
    f"Latitude range: {facilities_clean['y'].min()} to "
    f"{facilities_clean['y'].max()}"
)

Output


Data cleaning results

INFO: Original health facilities: 1556

INFO: Clean health facilities: 1531

INFO: Removed facilities: 25

INFO: Longitude range: -13.2920704 to -10.3074397999999

INFO: Latitude range: 6.9679198 to 9.9733458

To adapt the code:

Line 2: Update w_long and w_lat to match the longitude and latitude column names

Step 4: Convert to Spatial Object

Convert the cleaned health facility data to a spatial object for advanced spatial operations and analysis.

Why create a spatial object?

A spatial object is a special type of dataset that understands the points are on a map, not just numbers in a table. This is required for any geographic operation.

Complete this step if planning to:

Map the facilities.
Calculate distances or catchment areas.
Join facility data to administrative boundaries (e.g., which district is each facility in?).
Perform any analysis that uses the geographic location of the facilities.

Show the code

# convert to spatial object
facilities_sf <- sf::st_as_sf(
  facilities_clean,
  coords = c("w_long", "w_lat"),
  crs = 4326
)

cli::cli_alert_success("Converted to spatial object")
cli::cli_h3("Spatial object summary")
print(dplyr::slice_head(facilities_sf, n = 3))

Output

✔ Converted to spatial object

── Spatial object summary

adm1	adm3	community	hf	type	Ownership	functional	x	y	valid	lon_dp	lat_dp	geometry
Bombali	Mara	Kiampkakolo	Kiampkakolo MCHP	MCHP	Government	Functional	-12.20295	8.610318	TRUE	5	6	POINT (-12.20295 8.610318)
Bombali	Bombali Sebora	Station Road/Vincent Kanu Road	Loreto Clinic	CLINIC	Faith Based	Functional	-12.04387	8.886622	TRUE	5	6	POINT (-12.04387 8.886622)
Bombali	Bombali Sebora	Makama Road	Makeni Govt. Hospital	HOSPITAL	Government	Functional	-12.05627	8.871770	TRUE	5	5	POINT (-12.05627 8.87177)

To adapt the code:

Line 4: Update coordinate column names to match the data

Show the code

# convert the cleaned table to a GeoDataFrame using point coordinates
facilities_sf = gpd.GeoDataFrame(
    facilities_clean,
    geometry=gpd.points_from_xy(
        facilities_clean["w_long"], facilities_clean["w_lat"]
    ),
    crs="EPSG:4326",
)

cli_success("Converted to spatial object")
cli_header("Spatial object summary")
print(facilities_sf.head(3).to_string(index=False))

Output

SUCCESS: Converted to spatial object


Spatial object summary

<IPython.core.display.HTML object>

To adapt the code:

Line 5: Update coordinate column names to match the data

Step 5: Assign Facilities to Administrative Units

With the facilities stored as a spatial object, we can determine which administrative unit each facility falls in using a spatial join. This labels every facility with the boundary it sits inside, which is needed for catchment analysis, per-district summaries, and for checking that a facility’s recorded admin unit matches where its coordinates actually land.

R offers several ways to relate points to polygons in a spatial join:

sf::st_within: the point lies completely inside the polygon (used below)
sf::st_contains: the polygon contains the point (the inverse relationship)
sf::st_intersects: the point touches or lies inside the polygon
sf::st_nearest_feature: assigns the closest polygon when a point sits just outside every boundary

Show the code

# keep the boundary admin columns, renamed to avoid clashing with the
# facility data's own admin columns
boundary_lookup <- gdf |>
  dplyr::select(adm2_boundary = adm2, adm3_boundary = adm3)

# assign each facility to the chiefdom (adm3) polygon it falls within
facilities_admin <- suppressWarnings(sf::st_join(
  facilities_sf,
  boundary_lookup,
  join = sf::st_within
))

# facilities that did not fall within any polygon
n_unmatched <- sum(is.na(facilities_admin$adm3_boundary))

cli::cli_h3("Facilities assigned to an administrative unit")
cli::cli_alert_info("Facilities joined: {nrow(facilities_admin)}")
cli::cli_alert_info("Unmatched (outside all polygons): {n_unmatched}")

# compare the recorded chiefdom with the polygon the point falls in
facilities_admin |>
  sf::st_drop_geometry() |>
  dplyr::select(hf, adm3, adm3_boundary) |>
  dplyr::slice_head(n = 5) |>
  print()

Output

── Facilities assigned to an administrative unit

ℹ Facilities joined: 1531

ℹ Unmatched (outside all polygons): 0

hf	adm3	adm3_boundary
Kiampkakolo MCHP	Mara	MARA
Loreto Clinic	Bombali Sebora	MAKENI CITY
Makeni Govt. Hospital	Bombali Sebora	MAKENI CITY
Tonko CHP	Bombali Sebora	MAKENI CITY
Fullah Town CHP	Bombali Sebora	MAKENI CITY

To adapt the code:

Line 4: Replace adm2 and adm3 with the admin columns in the boundary file
Line 10: Change the join method (for example sf::st_intersects) if points sit on boundaries

Points outside boundaries

Some facilities may fall just outside a boundary, especially near borders or where coordinates are imprecise. A strict sf::st_within join leaves these unmatched. To recover them, use a buffered join (match points within a small distance of a polygon) or sf::st_nearest_feature (assign the closest polygon). Suggested buffer distances:

GPS from mobile devices (lower accuracy): 500 m to 1 km
Facilities near administrative borders: 500 m to 2 km
Survey clusters displaced for privacy (such as DHS): 100 m to 500 m

Buffering recovers near-miss points but can assign points to the wrong unit in dense border areas, so review the results with the SNT team.

Show the code

# keep only the admin boundary columns, renamed to avoid clashing
# with the facility data's own admin columns
boundary_lookup = gdf[["adm2", "adm3", "geometry"]].rename(
    columns={"adm2": "adm2_boundary", "adm3": "adm3_boundary"}
)

# assign a CRS fallback if missing before the spatial join
if boundary_lookup.crs is None:
    boundary_lookup = boundary_lookup.set_crs("EPSG:4326")
if facilities_sf.crs is None:
    facilities_sf = facilities_sf.set_crs("EPSG:4326")

# assign each facility to the chiefdom (adm3) polygon it falls within
facilities_admin = gpd.sjoin(
    facilities_sf,
    boundary_lookup,
    how="left",
    predicate="within",
)

# facilities that did not fall within any polygon
n_unmatched = facilities_admin["adm3_boundary"].isna().sum()

cli_header("Facilities assigned to an administrative unit")
cli_info(f"Facilities joined: {len(facilities_admin)}")
cli_info(f"Unmatched (outside all polygons): {n_unmatched}")

# compare the recorded chiefdom with the polygon the point falls in
print(
    facilities_admin[["hf", "adm3", "adm3_boundary"]]
    .head(5)
    .to_string(index=False)
)

Output


Facilities assigned to an administrative unit

INFO: Facilities joined: 1531

INFO: Unmatched (outside all polygons): 0

<IPython.core.display.HTML object>

To adapt the code:

Lines 3–4: Replace adm2 and adm3 with the admin columns in the boundary file
Line 18: Change the predicate (for example "intersects") if points sit on boundaries

Step 6: Save Cleaned and Spatial Data

The cleaned table and the spatial object are the two outputs of this workflow that downstream analyses will consume, so we save both. The cleaned table is written to a CSV file that can be opened in Excel or any database tool, which is the format typically requested by SNT teams for review and sharing. The spatial object is written to a shapefile (R) or GeoPackage (Python), which preserves the CRS, geometry type, and column attributes so the data can be loaded directly into GIS software or into a later script without re-running the cleaning and conversion steps.

Storing the processed outputs under 01_data/.../processed/ keeps a clean separation from the raw master facility list and makes the data ready to plug into the catchment analyses, distance calculations, or facility-level indicator maps shown in Steps 7 and 8.

Show the code

# set up the output directory for processed datasets
out_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed"
)

# save the cleaned facility table as a CSV
readr::write_csv(
  facilities_clean,
  here::here(out_path, "facilities_clean.csv")
)

# save the facility points as a shapefile
sf::st_write(
  facilities_sf,
  here::here(out_path, "facilities_spatial.shp"),
  append = FALSE,
  quiet = TRUE
)

To adapt the code:

Lines 2–7: Update out_path to the folder for saving the processed data
Lines 12 and 18: Update the output filenames as needed

Show the code

# set up the output directory for processed datasets
out_path = Path(
    here(
        "01_data/1.1_foundational/"
        "1.1c_health_facilities/processed"
    )
)
out_path.mkdir(parents=True, exist_ok=True)

# save the cleaned facility table as a CSV
facilities_clean.to_csv(
    out_path / "facilities_clean.csv", index=False
)

# save the facility points as a GeoPackage
facilities_sf.to_file(
    out_path / "facilities_spatial.gpkg",
    layer="facilities",
    driver="GPKG",
)

To adapt the code:

Lines 2–8: Update out_path to the folder for saving the processed data
Lines 12 and 17: Update the output filenames as needed

Step 7: Create Health Facility Maps

With the data cleaned and joined to administrative units, the next step is to put it on a map. A single dataset of facility points can be visualized in several complementary ways, and each view answers a different question about the health system. This step works through three increasingly detailed maps:

A basic overview map (Step 7.1) that shows where every facility sits relative to the administrative boundaries.
A coloured-by-type map (Step 7.2) that adds facility type as a second dimension to reveal the mix of services in each area.
A faceted small-multiples map (Step 7.3) that separates each facility type into its own panel for easier comparison.

All three maps reuse the same boundary layers (gdf for chiefdoms and adm2_gdf for districts) and the same cleaned point data (facilities_clean), so only the styling and aesthetic mappings change between them. Each map is also saved to 03_output/3a_figures/ so it can be embedded in reports.

Step 7.1: Basic health facility map

The basic map is the first visual check on the cleaned dataset. Plotting every facility as a single-colour point over the chiefdom and district boundaries shows the overall geographic coverage of the health system and makes any remaining anomalies (clusters in the wrong place, isolated points outside the country, large gaps between facilities) easy to spot.

We use a light grey fill for the chiefdom polygons so the map is uncluttered, a darker outline for the districts so the higher-level structure stays readable, and a single blue colour for the facility points so the eye can focus on density rather than category. The point size is set generously enough to remain visible against the country outline, with some transparency so dense clusters do not bleed into a single solid blob.

Show the code

# define the map title
title <- "Health facilities in Sierra Leone"

# create the basic map
facilities_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "grey50",
                   size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = facilities_clean,
    ggplot2::aes(x = x, y = y),
    color = "#47B5FF", size = 1.8, alpha = 0.7
  ) +
  ggplot2::coord_sf() +
  snt_map_theme() +
  ggplot2::labs(
    title = title,
    subtitle = "adm3 boundaries with adm2 overlay"
  )

# print the map
print(facilities_map)

# save the map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_locations.png"
  ),
  plot = facilities_map,
  width = 10,
  height = 7,
  dpi = 300
)

Output

To adapt the code:

Line 2: Update the map title
Lines 6–11: Update fill, color, and size for boundary styling
Line 15: Update color, size, and alpha for health facility points

Show the code

# define the map title
title = "Health facilities in Sierra Leone"

# create the basic map
fig, ax = plt.subplots(figsize=(10, 8))
gdf.plot(ax=ax, facecolor="white", edgecolor="grey", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
ax.scatter(
    facilities_clean["x"],
    facilities_clean["y"],
    color="#47B5FF",
    s=15,
    alpha=0.7,
    zorder=5,
)
finish_map(ax, title=title, subtitle="adm3 boundaries with adm2 overlay")

# save the map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_locations.png"),
    width=10,
    height=7,
)

Output

To adapt the code:

Line 2: Update the map title
Lines 6–9: Update facecolor, edgecolor, and linewidth for boundary styling
Lines 13–15: Update color, s (size), and alpha for health facility points

The resulting map gives a country-wide view of facility coverage. Districts (the darker outlines) that contain few or no blue points point to potential service gaps and are usually the first places the SNT team prioritizes for follow-up. Conversely, very dense clusters in a single chiefdom can hint at duplicate records or shared coordinates that escaped Step 3.2 and are worth revisiting.

Step 7.2: Health facilities by type

The basic map shows where facilities are, but it does not show what kind of care each one provides. Mapping facility type as a colour categorical reveals the mix of services across the country: which districts depend mostly on community health posts, which have a hospital nearby, and which combine several tiers of care. This is one of the most useful maps for service-availability planning and for catchment analysis, because each tier of care typically serves a different population size and case mix.

The code is almost identical to Step 7.1, with one change: the color aesthetic in geom_point() (or the per-type scatter() loop in Python) is bound to the type column rather than fixed to one value. A categorical palette (RColorBrewer::brewer.pal("Set1") in R, the matching Set1 colormap in Python) provides distinguishable colours for each facility type, and a single shared legend lists the categories.

Show the code

# define the facility type map title
title_type <- "Health facility types in Sierra Leone"

# create health facility type map
facility_type_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = facilities_clean,
    ggplot2::aes(x = x, y = y, color = type),
    size = 1, alpha = 0.8
  ) +
  ggplot2::coord_sf() +
  ggplot2::scale_color_brewer(
    palette = "Set1", name = "Facility Type"
  ) +
  snt_map_theme() +
  ggplot2::labs(title = title_type)

# print the type map
print(facility_type_map)

# save the type map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_by_type.png"
  ),
  plot = facility_type_map,
  width = 10,
  height = 7,
  dpi = 300
)

Output

To adapt the code:

Line 2: Update the map title
Line 13: Replace type with the facility type column name
Line 18: Update palette for different color schemes

Show the code

# define the facility type map title
title_type = "Health facility types in Sierra Leone"

# get the ordered facility types and assign a Set1 color to each
facility_types = sorted(facilities_clean["type"].dropna().unique())
set1_colors = plt.get_cmap("Set1").colors
type_palette = {
    t: set1_colors[i % len(set1_colors)]
    for i, t in enumerate(facility_types)
}

# create health facility type map
fig, ax = plt.subplots(figsize=(10, 9))
gdf.plot(ax=ax, facecolor="white", edgecolor="gray", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
for ftype, color in type_palette.items():
    subset = facilities_clean.loc[facilities_clean["type"] == ftype]
    ax.scatter(
        subset["x"], subset["y"],
        color=color, s=5, alpha=0.8, label=ftype, zorder=5,
    )
ax.legend(
    title="Facility Type",
    loc="lower center",
    bbox_to_anchor=(0.5, -0.10),
    ncol=3,
    frameon=False,
    fontsize=8,
    title_fontsize=9,
)
finish_map(ax, title=title_type)

# save the type map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_by_type.png"),
    width=10,
    height=7,
)

Output

To adapt the code:

Line 2: Update the map title
Line 5: Replace type with the facility type column name
Lines 6–10: Update set1_colors or swap in a different colormap for different color schemes

Reading the map by colour gives an immediate sense of where higher-level services are concentrated and where community-level care dominates. Districts served only by community health posts and clinics, for example, will appear in one or two colours, while urban districts typically mix several types. This is often the first map shared with the SNT team in a service-mix discussion.

Step 7.3: Specific health facility types

When several facility types overlap on the same map, dense areas can make it hard to read the distribution of any single category. Small multiples (also called facets) solve this by drawing one panel per type, each with the same boundaries and axes so the panels can be compared at a glance. Looking at hospitals alone, for example, makes referral coverage gaps obvious in a way the combined map cannot.

In R this is a single call to ggplot2::facet_wrap(~ type). In Python we recreate the same effect with a grid of matplotlib subplots, one per facility type, sharing a common layout. The point styling is intentionally kept the same as in Step 7.1 so that each facet reads as a clean single-colour distribution map.

Show the code

# define the faceted map title
title_facet <- "Health facility distribution by type in Sierra Leone"

# create faceted map by facility type
faceted_type_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.2) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = facilities_clean,
    ggplot2::aes(x = x, y = y),
    color = "#47B5FF", size = 1, alpha = 0.7
  ) +
  ggplot2::facet_wrap(~ type) +
  ggplot2::coord_sf() +
  snt_map_theme() +
  ggplot2::labs(title = title_facet)

# print the faceted map
print(faceted_type_map)

# save the faceted map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_faceted_by_type.png"
  ),
  plot = faceted_type_map,
  width = 12,
  height = 9,
  dpi = 300
)

Output

To adapt the code:

Line 2: Update the map title
Line 16: Replace type in facet_wrap() with the facility type column name
Line 14: Update point color, size, and alpha as desired

Show the code

# define the faceted map title
title_facet = "Health facility distribution by type in Sierra Leone"

# get the unique facility types for faceting
facility_types = sorted(facilities_clean["type"].dropna().unique())
n_types = len(facility_types)
ncols = 3
nrows = -(-n_types // ncols)  # ceiling division

fig, axes = plt.subplots(nrows, ncols, figsize=(12, 9))
axes_flat = axes.flatten()

for i, ftype in enumerate(facility_types):
    ax = axes_flat[i]
    gdf.plot(ax=ax, facecolor="white", edgecolor="gray",
             linewidth=0.2, aspect="equal")
    adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
                  linewidth=0.5, aspect="equal")
    subset = facilities_clean.loc[facilities_clean["type"] == ftype]
    ax.scatter(
        subset["x"], subset["y"],
        color="#47B5FF", s=5, alpha=0.7, zorder=5,
    )
    ax.set_title(ftype, fontsize=10, fontweight="bold")
    ax.set_axis_off()

# hide any unused axes panels
for j in range(n_types, len(axes_flat)):
    axes_flat[j].set_visible(False)

fig.suptitle(title_facet, fontsize=14, fontweight="bold", x=0.02,
             ha="left")
plt.tight_layout()

# save the faceted map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_faceted_by_type.png"),
    width=12,
    height=9,
)

Output

To adapt the code:

Line 2: Update the map title
Line 5: Replace type with the facility type column name
Line 22: Update point color, s (size), and alpha as desired

Reading each panel in isolation makes it much easier to assess type-specific coverage. Hospitals and referral facilities often appear in only a handful of districts, while community health posts cover most of the country. The faceted view is the version typically used in stratification briefings, because the reader can see each tier of care without the visual clutter of the combined map.

Step 8: Advanced Point Coordinates Visualizations

Point coordinates data can be used to visualize indicators at the facility level in addition to geographic location. The example below uses the merged DHIS2-MFL data obtained through the Fuzzy Matching of Names Across Datasets page of this code library.

Step 8.1: Point coordinates data summary

This example uses the merged DHIS2-MFL data to calculate the test positivity rate for each health facility. Test positivity is the proportion of positive cases out of all cases tested. First we calculate the test positivity rate for each health facility per year.

Show the code

# load the merged DHIS2-MFL dataset
final_dhis2_mfl_integrated <- readRDS(
  here::here(
    "01_data",
    "1.1_foundational",
    "1.1c_health_facilities",
    "processed",
    "final_dhis2_mfl_integrated.rds"
  )
)

# back-fill missing lat/long from the MFL by facility name; many DHIS2
# rows arrive without coordinates because the upstream fuzzy match
# only confirmed a subset of facility identities
mfl_coords <- readxl::read_excel(
  here::here(hf_path, "mfl_hfs.xlsx")
) |>
  dplyr::select(hf, w_lat, w_long) |>
  dplyr::distinct(hf, .keep_all = TRUE)

final_dhis2_mfl_integrated <- final_dhis2_mfl_integrated |>
  dplyr::left_join(mfl_coords, by = "hf") |>
  dplyr::mutate(
    lat = dplyr::coalesce(lat, w_lat),
    long = dplyr::coalesce(long, w_long)
  )

# calculate annual summaries for each health facility
annual_summary <- final_dhis2_mfl_integrated |>
  dplyr::group_by(hf_uid, hf, year, lat, long) |>
  dplyr::summarise(
    # sum tests across all age groups and settings
    total_tests = sum(
      test_u5 + test_5_14 + test_ov15, na.rm = TRUE
    ),
    # sum positive tests across all age groups and settings
    positive_tests = sum(
      conf_u5 + conf_5_14 + conf_ov15, na.rm = TRUE
    ),
    .groups = "drop"
  ) |>
  # calculate test positivity rate (TPR) as a percentage
  dplyr::mutate(
    tpr = dplyr::if_else(
      total_tests > 0,
      (positive_tests / total_tests) * 100,
      NA_real_
    )
  ) |>
  # remove facilities with no tests
  dplyr::filter(total_tests > 0)

# filter for 2023 and assign a testing-volume category
annual_summary_2023 <- annual_summary |>
  dplyr::filter(
    year == 2023, !is.na(tpr), total_tests > 0,
    !is.na(lat), !is.na(long)
  ) |>
  dplyr::mutate(
    size_category = dplyr::case_when(
      total_tests <= 100 ~ "≤100",
      total_tests <= 500 ~ "101-500",
      total_tests <= 1000 ~ "501-1,000",
      total_tests <= 5000 ~ "1,001-5,000",
      TRUE ~ "5,000+"
    ),
    size_category = factor(
      size_category,
      levels = c(
        "≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+"
      )
    )
  )

Show the code

# load the merged DHIS2-MFL dataset
final_dhis2_mfl_integrated = pd.read_parquet(
    Path(
        here(
            "01_data/1.1_foundational/"
            "1.1c_health_facilities/processed/"
            "final_dhis2_mfl_integrated.parquet"
        )
    )
)

# the year column may arrive as a string; coerce it to a nullable
# integer so the year filter below behaves the same way as R's lenient
# comparison ("2023" == 2023 is TRUE in R but False in Python)
final_dhis2_mfl_integrated["year"] = pd.to_numeric(
    final_dhis2_mfl_integrated["year"], errors="coerce"
).astype("Int64")

# back-fill missing lat/long from the MFL by facility name; many DHIS2
# rows arrive without coordinates because the upstream fuzzy match
# only confirmed a subset of facility identities
mfl_coords = (
    pd.read_excel(hf_path / "mfl_hfs.xlsx")
    [["hf", "w_lat", "w_long"]]
    .drop_duplicates("hf")
)
final_dhis2_mfl_integrated = final_dhis2_mfl_integrated.merge(
    mfl_coords, on="hf", how="left"
)
final_dhis2_mfl_integrated["lat"] = (
    final_dhis2_mfl_integrated["lat"]
    .combine_first(final_dhis2_mfl_integrated["w_lat"])
)
final_dhis2_mfl_integrated["long"] = (
    final_dhis2_mfl_integrated["long"]
    .combine_first(final_dhis2_mfl_integrated["w_long"])
)

# calculate annual summaries for each health facility
annual_summary = (
    final_dhis2_mfl_integrated
    .assign(
        total_tests=lambda d: (
            d["test_u5"].fillna(0)
            + d["test_5_14"].fillna(0)
            + d["test_ov15"].fillna(0)
        ),
        positive_tests=lambda d: (
            d["conf_u5"].fillna(0)
            + d["conf_5_14"].fillna(0)
            + d["conf_ov15"].fillna(0)
        ),
    )
    .groupby(["hf_uid", "hf", "year", "lat", "long"], as_index=False)
    .agg(
        total_tests=("total_tests", "sum"),
        positive_tests=("positive_tests", "sum"),
    )
    .assign(
        tpr=lambda d: np.where(
            d["total_tests"] > 0,
            (d["positive_tests"] / d["total_tests"]) * 100,
            np.nan,
        )
    )
    .loc[lambda d: d["total_tests"] > 0]
)

# filter for 2023 and assign a testing-volume category
size_levels = ["≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+"]
annual_summary_2023 = (
    annual_summary
    .loc[lambda d: (
        (d["year"] == 2023)
        & d["tpr"].notna()
        & (d["total_tests"] > 0)
        & d["lat"].notna()
        & d["long"].notna()
    )]
    .assign(
        size_category=lambda d: pd.Categorical(
            np.select(
                [
                    d["total_tests"] <= 100,
                    d["total_tests"] <= 500,
                    d["total_tests"] <= 1000,
                    d["total_tests"] <= 5000,
                ],
                ["≤100", "101-500", "501-1,000", "1,001-5,000"],
                default="5,000+",
            ),
            categories=size_levels,
            ordered=True,
        )
    )
)

Step 8.2: Single indicator points coordinate visualization

This code plots health facilities by their geographic coordinates and uses the coloring of points to indicate the test positivity value for that facility. For simplicity, this plot shows only 2023 test positivity rates by facility.

Show the code

tpr_only_2023 <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = annual_summary_2023,
    ggplot2::aes(x = long, y = lat, fill = tpr),
    shape = 21,
    size = 2,
    stroke = 0.5,
    alpha = 0.8
  ) +
  ggplot2::coord_sf() +
  ggplot2::scale_fill_gradientn(
    colors = rev(RColorBrewer::brewer.pal(7, "RdYlBu")),
    limits = c(0, 100),
    name = "Test Positivity Rate (%)",
  ) +
  snt_map_theme() +
  ggplot2::labs(title = "Health Facility Test Positivity Rate (2023)")

# print the map
print(tpr_only_2023)

# save the map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_tpr_2023.png"
  ),
  plot = tpr_only_2023,
  width = 10,
  height = 7,
  dpi = 300
)

Output

Show the code

# build the color map matching R's rev(brewer.pal(7, "RdYlBu"))
rdylbu_7_rev = [
    "#4575b4", "#91bfdb", "#e0f3f8",
    "#ffffbf", "#fee090", "#fc8d59", "#d73027",
]
cmap_tpr = mcolors.LinearSegmentedColormap.from_list(
    "rdylbu_rev", rdylbu_7_rev
)

fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, facecolor="white", edgecolor="gray", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
norm = mcolors.Normalize(vmin=0, vmax=100)
sc = ax.scatter(
    annual_summary_2023["long"],
    annual_summary_2023["lat"],
    c=annual_summary_2023["tpr"],
    cmap=cmap_tpr,
    norm=norm,
    s=15,
    alpha=0.8,
    edgecolors="black",
    linewidths=0.5,
    zorder=5,
)
sm = ScalarMappable(cmap=cmap_tpr, norm=norm)
sm.set_array([])
cbar = fig.colorbar(
    sm, ax=ax, orientation="horizontal", fraction=0.04, pad=0.04
)
cbar.set_label("Test Positivity Rate (%)", fontweight="bold")
finish_map(ax, title="Health Facility Test Positivity Rate (2023)")

# save the map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_tpr_2023.png"),
    width=10,
    height=7,
)

Output

Step 8.3: Multiple indicator points coordinate visualization

The previous example can be built upon by adding additional plot elements to represent other indicators. For example, the size of each health facilities point can be used to represent testing volume. The code below illustrates this example.

Show the code

size_scale <- c("≤100" = 1.5, "101-500" = 2.0, "501-1,000" = 2.5,
                "1,001-5,000" = 3.0, "5,000+" = 3.5)

combined_2023_categorical <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = annual_summary_2023,
    ggplot2::aes(x = long, y = lat,
                 fill = tpr,
                 size = size_category),
    shape = 21,
    stroke = 0.3,  # thinner stroke
    alpha = 0.7
  ) +
  ggplot2::coord_sf() +
  ggplot2::scale_fill_gradientn(
    colors = rev(RColorBrewer::brewer.pal(7, "RdYlBu")),
    limits = c(0, 100),
    name = "Test Positivity Rate (%)",
    na.value = "grey50"
  ) +
  ggplot2::scale_size_manual(
    name = "Total Tested",
    values = size_scale,
    breaks = c("≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+")
  ) +
  snt_map_theme() +
  ggplot2::theme(
    legend.spacing.x = grid::unit(1.5, "cm"),
    legend.box.spacing = grid::unit(0.6, "cm")
  ) +
  ggplot2::labs(
    title = "Health Facility Test Positivity Rate and Testing Volume (2023)"
  )

# print the categorical version
print(combined_2023_categorical)

# save the map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_tpr_test_2023.png"
  ),
  plot = combined_2023_categorical,
  width = 10,
  height = 7,
  dpi = 300
)

Output

Show the code

# size mapping matching the R size_scale
size_map = {
    "≤100": 20, "101-500": 30, "501-1,000": 50,
    "1,001-5,000": 80, "5,000+": 120,
}

# build the color map matching R's rev(brewer.pal(7, "RdYlBu"))
rdylbu_7_rev = [
    "#4575b4", "#91bfdb", "#e0f3f8",
    "#ffffbf", "#fee090", "#fc8d59", "#d73027",
]
cmap_tpr = mcolors.LinearSegmentedColormap.from_list(
    "rdylbu_rev", rdylbu_7_rev
)

fig, ax = plt.subplots(figsize=(10, 12))
gdf.plot(ax=ax, facecolor="white", edgecolor="gray", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
norm = mcolors.Normalize(vmin=0, vmax=100)

# derive the marker size for every row from the size_category column,
# falling back to the smallest size when the category is missing
size_levels = ["≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+"]
point_sizes = (
    annual_summary_2023["size_category"]
    .astype(str)
    .map(size_map)
    .fillna(size_map["≤100"])
)

# plot all facilities in a single scatter call so every point renders
ax.scatter(
    annual_summary_2023["long"],
    annual_summary_2023["lat"],
    c=annual_summary_2023["tpr"],
    cmap=cmap_tpr,
    norm=norm,
    s=point_sizes,
    alpha=0.7,
    edgecolors="black",
    linewidths=0.3,
    zorder=5,
)

# colorbar for TPR
sm = ScalarMappable(cmap=cmap_tpr, norm=norm)
sm.set_array([])
cbar = fig.colorbar(
    sm, ax=ax, orientation="horizontal", fraction=0.04, pad=0.04
)
cbar.set_label("Test Positivity Rate (%)", fontweight="bold")

# build proxy handles for the size legend so it always renders, even if a
# size category has no facilities
size_handles = [
    plt.scatter(
        [], [], s=size_map[cat], color="lightgray",
        edgecolors="black", linewidths=0.3, alpha=0.7,
    )
    for cat in size_levels
]
ax.legend(
    size_handles, size_levels,
    title="Total Tested",
    loc="upper center",
    bbox_to_anchor=(0.5, -0.18),
    ncol=5,
    frameon=False,
    fontsize=8,
    title_fontsize=9,
)
finish_map(
    ax,
    title=(
        "Health Facility Test Positivity Rate and Testing Volume (2023)"
    ),
)

# save the map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_tpr_test_2023.png"),
    width=10,
    height=7,
)

Output

Summary

In this section, we worked through the full process of mapping health facility coordinates. We loaded the processed administrative boundaries and the master facility list, then checked the coordinates for the common quality problems, including missing, malformed, out-of-range, low-precision, duplicate, shared-location, out-of-boundary, and flipped values, before cleaning them. We converted the cleaned table to a spatial object, assigned each facility to its administrative unit with a spatial join, and produced a series of maps, from basic facility locations and facility types to facility-level test positivity and testing volume.

The same workflow applies to any point data in the SNT process, such as community health workers, villages, or survey clusters. Return to it whenever we need to validate and map georeferenced points against administrative boundaries.

Full Code

Find the full code script for health facility coordinate mapping below.

Show full code

################################################################################
########### ~ Health facility coordinates and point data full code ~ ###########
################################################################################

### Step 1: Install and Load Packages ------------------------------------------

# check if 'pacman' is installed; install it if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load all required packages using pacman
pacman::p_load(
  sf,           # for reading and handling spatial data
  readxl,       # for reading Excel files
  readr,        # for reading CSVs
  dplyr,        # for data manipulation
  stringr,      # for string cleaning and formatting
  ggplot2,      # for creating plots
  RColorBrewer, # for color palettes
  scales,       # for formatting scales and labels
  here,         # for cross-platform file paths
  tidyr         # for data reshaping and coordinate splitting
)

# shared map theme so every map on this page has a consistent look
snt_map_theme <- function() {
  ggplot2::theme_void() +
    ggplot2::theme(
      legend.position = "bottom",
      legend.direction = "horizontal",
      legend.title.position = "top",
      legend.text.position = "bottom",
      legend.title = ggplot2::element_text(
        face = "bold", size = 10, hjust = 0.5,
        margin = ggplot2::margin(b = 6)
      ),
      legend.box.margin = ggplot2::margin(t = 8),
      legend.key.width = grid::unit(0.9, "cm"),
      strip.text = ggplot2::element_text(
        face = "bold", size = 10,
        margin = ggplot2::margin(t = 2, b = 6, l = 4, r = 4)
      ),
      panel.spacing = grid::unit(4, "pt"),
      plot.title = ggplot2::element_text(
        face = "bold", size = 14, margin = ggplot2::margin(b = 8)
      ),
      plot.subtitle = ggplot2::element_text(
        size = 11, margin = ggplot2::margin(b = 10)
      ),
      plot.margin = ggplot2::margin(t = 5, r = 5, b = 5, l = 5)
    )
}

### Step 2: Load and Prepare Input Data ----------------------------------------

# set up the path to the processed administrative boundaries
spat_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1a_admin_boundaries"
)

# load processed chiefdom (adm3) spatial object
gdf <- readRDS(
  here::here(spat_path, "processed", "sle_spatial_adm3_2021.rds")
) |>
  sf::st_as_sf()

# load processed district (adm2) spatial object
adm2_gdf <- readRDS(
  here::here(spat_path, "processed", "sle_spatial_adm2_2021.rds")
) |>
  sf::st_as_sf()

# set up the path to the processed health facility data
hf_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed"
)

# load the health facility master list
hf_data <- readxl::read_excel(
  here::here(hf_path, "mfl_hfs.xlsx")
)

# inspect the loaded data
cli::cli_h3("Administrative boundary columns")
print(names(gdf))
cli::cli_h3("Health facility data columns")
print(names(hf_data))
cli::cli_h3("Sample of health facility data")
print(dplyr::slice_head(hf_data, n = 3))

### Step 3: Define, Validate, and Clean Coordinates ----------------------------

#### Step 3.1: Split coordinates column (if needed) ----------------------------

# this step is only required if longitude and latitude are combined
# in a single column. if columns are separate, skip this step.

# normalize separators (; or space) to a comma, then split into two
# numeric columns
hf_data <- hf_data |>
  dplyr::mutate(
    coordinates = stringr::str_replace_all(coordinates, "[; ]+", ",")
  ) |>
  tidyr::separate(
    col = "coordinates",
    into = c("w_lat", "w_long"),
    sep = ",",
    convert = TRUE
  )

cli::cli_alert_success("Coordinates split into separate columns")
cli::cli_h3("Sample of split coordinates")
print(utils::head(dplyr::select(hf_data, w_lat, w_long), 3))

#### Step 3.2: Detect coordinate quality issues --------------------------------

# load the national boundary (adm0) for the spatial checks
adm0 <- readRDS(
  here::here(spat_path, "processed", "sle_spatial_adm0_2021.rds")
) |>
  sf::st_as_sf()

# helper: count decimal places in a coordinate value
count_decimals <- function(value) {
  if (is.na(value)) return(0L)
  text <- format(abs(value), scientific = FALSE, trim = TRUE)
  if (!grepl(".", text, fixed = TRUE)) return(0L)
  nchar(sub("0+$", "", strsplit(text, ".", fixed = TRUE)[[1]][2]))
}

# define longitude (x) and latitude (y), flag the valid in-range rows,
# and pre-compute the decimal precision of each coordinate
hf_data <- hf_data |>
  dplyr::mutate(
    x = w_long,
    y = w_lat,
    valid = !is.na(x) & !is.na(y) &
      dplyr::between(x, -180, 180) &
      dplyr::between(y, -90, 90),
    lon_dp = vapply(x, count_decimals, integer(1)),
    lat_dp = vapply(y, count_decimals, integer(1))
  )

# missing, out-of-range, null-island (0, 0), and low-precision counts
quality_counts <- hf_data |>
  dplyr::summarise(
    n_missing = sum(is.na(x) | is.na(y)),
    n_out_range = sum(!is.na(x) & !is.na(y) & !valid),
    n_zero = sum(x == 0 & y == 0, na.rm = TRUE),
    n_imprecise = sum(valid & (lon_dp < 4 | lat_dp < 4))
  )
n_missing <- quality_counts$n_missing
n_out_range <- quality_counts$n_out_range
n_zero <- quality_counts$n_zero
n_imprecise <- quality_counts$n_imprecise

# duplicate facility + coordinate, and shared locations
valid_data <- hf_data |> dplyr::filter(valid)
n_duplicate <- sum(duplicated(
  paste(valid_data$hf, valid_data$x, valid_data$y)
))
coord_key <- paste(valid_data$x, valid_data$y)
n_shared <- sum(coord_key %in% coord_key[duplicated(coord_key)])

# build an sf of the valid points for the spatial checks
hf_points <- valid_data |>
  sf::st_as_sf(
    coords = c("x", "y"),
    crs = 4326,
    remove = FALSE
  )

# points outside the national boundary
inside <- lengths(suppressWarnings(sf::st_within(hf_points, adm0))) > 0
n_outside <- sum(!inside)

# likely flipped lon/lat: outside as-is, but inside when swapped
swapped <- hf_points |>
  sf::st_drop_geometry() |>
  dplyr::transmute(x = y, y = x) |>
  sf::st_as_sf(coords = c("x", "y"), crs = 4326)
inside_swapped <- lengths(
  suppressWarnings(sf::st_within(swapped, adm0))
) > 0
n_flipped <- sum(!inside & inside_swapped)

# report the issue counts
cli::cli_h2("Coordinate quality summary ({nrow(hf_data)} facilities)")
cli::cli_alert_info("Missing coordinates: {n_missing}")
cli::cli_alert_info("Out-of-range coordinates: {n_out_range}")
cli::cli_alert_info("Null-island (0, 0): {n_zero}")
cli::cli_alert_info("Low precision (< 4 dp): {n_imprecise}")
cli::cli_alert_info("Duplicate facility and coordinate: {n_duplicate}")
cli::cli_alert_info("Records at shared locations: {n_shared}")
cli::cli_alert_info("Outside the national boundary: {n_outside}")
cli::cli_alert_info("Likely flipped lon/lat: {n_flipped}")

# flag each valid point by whether it falls inside the national boundary
hf_points <- hf_points |>
  dplyr::mutate(
    location = dplyr::if_else(inside, "Inside boundary", "Outside boundary")
  )

# map the points over the national boundary
coord_check_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = adm0, fill = "white", color = "black") +
  ggplot2::geom_sf(
    data = hf_points,
    ggplot2::aes(color = location),
    size = 0.8, alpha = 0.7
  ) +
  ggplot2::scale_color_manual(
    values = c(
      "Inside boundary" = "#2c7fb8",
      "Outside boundary" = "#e31a1c"
    ),
    name = NULL
  ) +
  snt_map_theme() +
  ggplot2::labs(
    title = "Coordinate check: facilities against the national boundary"
  )

print(coord_check_map)

#### Step 3.3: Clean coordinates -----------------------------------------------

# define longitude (x) and latitude (y) coordinates
hf_data <- hf_data |>
  dplyr::mutate(
    x = w_long,
    y = w_lat
  )

# identify facilities with invalid coordinates for follow-up
facilities_to_review <- hf_data |>
  dplyr::filter(
    is.na(x) | is.na(y) |
      !dplyr::between(x, -180, 180) |
      !dplyr::between(y, -90, 90)
  )

# filter to keep only valid, in-range coordinates
facilities_clean <- hf_data |>
  dplyr::filter(
    !is.na(x),
    !is.na(y),
    dplyr::between(x, -180, 180),
    dplyr::between(y, -90, 90)
  )

# print cleaning results
cli::cli_h2("Data cleaning results")
cli::cli_alert_info("Original health facilities: {nrow(hf_data)}")
cli::cli_alert_info("Clean health facilities: {nrow(facilities_clean)}")
cli::cli_alert_info(
  "Removed facilities: {nrow(hf_data) - nrow(facilities_clean)}"
)
cli::cli_alert_info(
  "Facilities flagged for review: {nrow(facilities_to_review)}"
)
cli::cli_alert_info(
  "Longitude range: {min(facilities_clean$x)} to {max(facilities_clean$x)}"
)
cli::cli_alert_info(
  "Latitude range: {min(facilities_clean$y)} to {max(facilities_clean$y)}"
)

### Step 4: Convert to Spatial Object ------------------------------------------

# convert to spatial object
facilities_sf <- sf::st_as_sf(
  facilities_clean,
  coords = c("w_long", "w_lat"),
  crs = 4326
)

cli::cli_alert_success("Converted to spatial object")
cli::cli_h3("Spatial object summary")
print(dplyr::slice_head(facilities_sf, n = 3))

### Step 5: Assign Facilities to Administrative Units --------------------------

# keep the boundary admin columns, renamed to avoid clashing with the
# facility data's own admin columns
boundary_lookup <- gdf |>
  dplyr::select(adm2_boundary = adm2, adm3_boundary = adm3)

# assign each facility to the chiefdom (adm3) polygon it falls within
facilities_admin <- suppressWarnings(sf::st_join(
  facilities_sf,
  boundary_lookup,
  join = sf::st_within
))

# facilities that did not fall within any polygon
n_unmatched <- sum(is.na(facilities_admin$adm3_boundary))

cli::cli_h3("Facilities assigned to an administrative unit")
cli::cli_alert_info("Facilities joined: {nrow(facilities_admin)}")
cli::cli_alert_info("Unmatched (outside all polygons): {n_unmatched}")

# compare the recorded chiefdom with the polygon the point falls in
facilities_admin |>
  sf::st_drop_geometry() |>
  dplyr::select(hf, adm3, adm3_boundary) |>
  dplyr::slice_head(n = 5) |>
  print()

### Step 6: Save Cleaned and Spatial Data --------------------------------------

# set up the output directory for processed datasets
out_path <- here::here(
  "01_data",
  "1.1_foundational",
  "1.1c_health_facilities",
  "processed"
)

# save the cleaned facility table as a CSV
readr::write_csv(
  facilities_clean,
  here::here(out_path, "facilities_clean.csv")
)

# save the facility points as a shapefile
sf::st_write(
  facilities_sf,
  here::here(out_path, "facilities_spatial.shp"),
  append = FALSE,
  quiet = TRUE
)

### Step 7: Create Health Facility Maps ----------------------------------------

#### Step 7.1: Basic health facility map ---------------------------------------

# define the map title
title <- "Health facilities in Sierra Leone"

# create the basic map
facilities_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "grey50",
                   size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = facilities_clean,
    ggplot2::aes(x = x, y = y),
    color = "#47B5FF", size = 1.8, alpha = 0.7
  ) +
  ggplot2::coord_sf() +
  snt_map_theme() +
  ggplot2::labs(
    title = title,
    subtitle = "adm3 boundaries with adm2 overlay"
  )

# print the map
print(facilities_map)

# save the map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_locations.png"
  ),
  plot = facilities_map,
  width = 10,
  height = 7,
  dpi = 300
)

#### Step 7.2: Health facilities by type ---------------------------------------

# define the facility type map title
title_type <- "Health facility types in Sierra Leone"

# create health facility type map
facility_type_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = facilities_clean,
    ggplot2::aes(x = x, y = y, color = type),
    size = 1, alpha = 0.8
  ) +
  ggplot2::coord_sf() +
  ggplot2::scale_color_brewer(
    palette = "Set1", name = "Facility Type"
  ) +
  snt_map_theme() +
  ggplot2::labs(title = title_type)

# print the type map
print(facility_type_map)

# save the type map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_by_type.png"
  ),
  plot = facility_type_map,
  width = 10,
  height = 7,
  dpi = 300
)

#### Step 7.3: Specific health facility types ----------------------------------

# define the faceted map title
title_facet <- "Health facility distribution by type in Sierra Leone"

# create faceted map by facility type
faceted_type_map <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.2) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = facilities_clean,
    ggplot2::aes(x = x, y = y),
    color = "#47B5FF", size = 1, alpha = 0.7
  ) +
  ggplot2::facet_wrap(~ type) +
  ggplot2::coord_sf() +
  snt_map_theme() +
  ggplot2::labs(title = title_facet)

# print the faceted map
print(faceted_type_map)

# save the faceted map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_faceted_by_type.png"
  ),
  plot = faceted_type_map,
  width = 12,
  height = 9,
  dpi = 300
)

### Step 8: Advanced Point Coordinates Visualizations --------------------------

#### Step 8.1: Point coordinates data summary ----------------------------------

# load the merged DHIS2-MFL dataset
final_dhis2_mfl_integrated <- readRDS(
  here::here(
    "01_data",
    "1.1_foundational",
    "1.1c_health_facilities",
    "processed",
    "final_dhis2_mfl_integrated.rds"
  )
)

# back-fill missing lat/long from the MFL by facility name; many DHIS2
# rows arrive without coordinates because the upstream fuzzy match
# only confirmed a subset of facility identities
mfl_coords <- readxl::read_excel(
  here::here(hf_path, "mfl_hfs.xlsx")
) |>
  dplyr::select(hf, w_lat, w_long) |>
  dplyr::distinct(hf, .keep_all = TRUE)

final_dhis2_mfl_integrated <- final_dhis2_mfl_integrated |>
  dplyr::left_join(mfl_coords, by = "hf") |>
  dplyr::mutate(
    lat = dplyr::coalesce(lat, w_lat),
    long = dplyr::coalesce(long, w_long)
  )

# calculate annual summaries for each health facility
annual_summary <- final_dhis2_mfl_integrated |>
  dplyr::group_by(hf_uid, hf, year, lat, long) |>
  dplyr::summarise(
    # sum tests across all age groups and settings
    total_tests = sum(
      test_u5 + test_5_14 + test_ov15, na.rm = TRUE
    ),
    # sum positive tests across all age groups and settings
    positive_tests = sum(
      conf_u5 + conf_5_14 + conf_ov15, na.rm = TRUE
    ),
    .groups = "drop"
  ) |>
  # calculate test positivity rate (TPR) as a percentage
  dplyr::mutate(
    tpr = dplyr::if_else(
      total_tests > 0,
      (positive_tests / total_tests) * 100,
      NA_real_
    )
  ) |>
  # remove facilities with no tests
  dplyr::filter(total_tests > 0)

# filter for 2023 and assign a testing-volume category
annual_summary_2023 <- annual_summary |>
  dplyr::filter(
    year == 2023, !is.na(tpr), total_tests > 0,
    !is.na(lat), !is.na(long)
  ) |>
  dplyr::mutate(
    size_category = dplyr::case_when(
      total_tests <= 100 ~ "≤100",
      total_tests <= 500 ~ "101-500",
      total_tests <= 1000 ~ "501-1,000",
      total_tests <= 5000 ~ "1,001-5,000",
      TRUE ~ "5,000+"
    ),
    size_category = factor(
      size_category,
      levels = c(
        "≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+"
      )
    )
  )

#### Step 8.2: Single indicator points coordinate visualization ----------------

tpr_only_2023 <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = annual_summary_2023,
    ggplot2::aes(x = long, y = lat, fill = tpr),
    shape = 21,
    size = 2,
    stroke = 0.5,
    alpha = 0.8
  ) +
  ggplot2::coord_sf() +
  ggplot2::scale_fill_gradientn(
    colors = rev(RColorBrewer::brewer.pal(7, "RdYlBu")),
    limits = c(0, 100),
    name = "Test Positivity Rate (%)",
  ) +
  snt_map_theme() +
  ggplot2::labs(title = "Health Facility Test Positivity Rate (2023)")

# print the map
print(tpr_only_2023)

# save the map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_tpr_2023.png"
  ),
  plot = tpr_only_2023,
  width = 10,
  height = 7,
  dpi = 300
)

#### Step 8.3: Multiple indicator points coordinate visualization --------------

size_scale <- c("≤100" = 1.5, "101-500" = 2.0, "501-1,000" = 2.5,
                "1,001-5,000" = 3.0, "5,000+" = 3.5)

combined_2023_categorical <- ggplot2::ggplot() +
  ggplot2::geom_sf(data = gdf, fill = "white",
                   color = "gray50", size = 0.3) +
  ggplot2::geom_sf(data = adm2_gdf, fill = NA,
                   color = "black",
                   size = 0.5) +
  ggplot2::geom_point(
    data = annual_summary_2023,
    ggplot2::aes(x = long, y = lat,
                 fill = tpr,
                 size = size_category),
    shape = 21,
    stroke = 0.3,  # thinner stroke
    alpha = 0.7
  ) +
  ggplot2::coord_sf() +
  ggplot2::scale_fill_gradientn(
    colors = rev(RColorBrewer::brewer.pal(7, "RdYlBu")),
    limits = c(0, 100),
    name = "Test Positivity Rate (%)",
    na.value = "grey50"
  ) +
  ggplot2::scale_size_manual(
    name = "Total Tested",
    values = size_scale,
    breaks = c("≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+")
  ) +
  snt_map_theme() +
  ggplot2::theme(
    legend.spacing.x = grid::unit(1.5, "cm"),
    legend.box.spacing = grid::unit(0.6, "cm")
  ) +
  ggplot2::labs(
    title = "Health Facility Test Positivity Rate and Testing Volume (2023)"
  )

# print the categorical version
print(combined_2023_categorical)

# save the map
ggplot2::ggsave(
  filename = here::here(
    "03_output", "3a_figures", "health_facility_tpr_test_2023.png"
  ),
  plot = combined_2023_categorical,
  width = 10,
  height = 7,
  dpi = 300
)

Show full code

################################################################################
########### ~ Health facility coordinates and point data full code ~ ###########
################################################################################

### Step 1: Install and Load Packages ------------------------------------------

from pathlib import Path

import geopandas as gpd
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
import pyreadr
from matplotlib.cm import ScalarMappable
from pyprojroot import here

def read_rds(path):
    """Read a single-object RDS file as a pandas object."""
    result = pyreadr.read_r(str(path))
    return next(iter(result.values()))

def cli_header(message):
    print(f"\n{message}")

def cli_info(message):
    print(f"INFO: {message}")

def cli_success(message):
    print(f"SUCCESS: {message}")

def cli_warning(message):
    print(f"WARNING: {message}")

def cli_danger(message):
    print(f"ERROR: {message}")

def anti_join(left, right, on):
    """Return rows in left with no matching key in right."""
    right_keys = right[on].drop_duplicates()
    return (
        left.merge(right_keys, on=on, how="left", indicator=True)
        .loc[lambda x: x["_merge"] == "left_only"]
        .drop(columns="_merge")
    )

def finish_map(ax, title=None, subtitle=None):
    """Apply the shared static map styling used on this page."""
    if title and subtitle:
        ax.set_title(
            f"{title}\n{subtitle}", loc="left", fontsize=14, fontweight="bold"
        )
    elif title:
        ax.set_title(title, loc="left", fontsize=14, fontweight="bold")
    ax.set_axis_off()

def save_figure(fig, filename, width, height, dpi=300):
    """Save a matplotlib figure with dimensions matching the R examples."""
    Path(filename).parent.mkdir(parents=True, exist_ok=True)
    fig.set_size_inches(width, height)
    fig.savefig(filename, dpi=dpi, bbox_inches="tight")

def show_table(df, n=10, caption=None):
    """Render a compact scrollable HTML table matching the R show_table helper.

    Produces a <table class="out-table"> wrapped in <div class="out-scroll">
    so the output matches the R kable-based renderer exactly.
    """
    from IPython.display import display, HTML
    html = df.head(n).to_html(index=False, border=0, classes="out-table")
    if caption:
        html = f"<caption>{caption}</caption>" + html
    display(HTML(f'<div class="out-scroll">{html}</div>'))

### Step 2: Load and Prepare Input Data ----------------------------------------

# set up the path to the processed administrative boundaries
spat_path = Path(
    here(
        "01_data/1.1_foundational/"
        "1.1a_admin_boundaries/processed"
    )
)

# load processed chiefdom (adm3) spatial object
gdf = gpd.read_file(spat_path / "sle_spatial_adm3_2021.geojson")

# load processed district (adm2) spatial object
adm2_gdf = gpd.read_file(spat_path / "sle_spatial_adm2_2021.geojson")

# set up the path to the processed health facility data
hf_path = Path(
    here(
        "01_data/1.1_foundational/"
        "1.1c_health_facilities/processed"
    )
)

# load the health facility master list
hf_data = pd.read_excel(hf_path / "mfl_hfs.xlsx")

# inspect the loaded data
cli_header("Administrative boundary columns")
print(list(gdf.columns))
cli_header("Health facility data columns")
print(list(hf_data.columns))
cli_header("Sample of health facility data")
print(hf_data.head(3))

### Step 3: Define, Validate, and Clean Coordinates ----------------------------

#### Step 3.1: Split coordinates column (if needed) ----------------------------

# this step is only required if longitude and latitude are combined
# in a single column. if columns are separate, skip this step.

# replace separators (; or space) with a comma
hf_data["coordinates"] = (
    hf_data["coordinates"].str.replace(r"[; ]+", ",", regex=True)
)

# split the combined column into two columns
coord_split = hf_data["coordinates"].str.split(",", expand=True)
hf_data = hf_data.assign(
    w_lat=pd.to_numeric(coord_split[0], errors="coerce"),
    w_long=pd.to_numeric(coord_split[1], errors="coerce"),
)

cli_success("Coordinates split into separate columns")
cli_header("Sample of split coordinates")
print(hf_data[["w_lat", "w_long"]].head(3).to_string(index=False))

#### Step 3.2: Detect coordinate quality issues --------------------------------

# load the national boundary (adm0) for the spatial checks
adm0 = gpd.read_file(
    Path(
        here(
            "01_data/1.1_foundational/"
            "1.1a_admin_boundaries/processed/"
            "sle_spatial_adm0_2021.geojson"
        )
    )
)

# helper: count decimal places in a coordinate value
def count_decimals(value):
    if pd.isna(value):
        return 0
    text = f"{abs(value):.10f}".rstrip("0")
    if "." not in text:
        return 0
    return len(text.split(".")[1])

# define longitude (x) and latitude (y)
hf_data = hf_data.assign(x=hf_data["w_long"], y=hf_data["w_lat"])

# rows with valid, in-range coordinates (used for the spatial checks)
valid = (
    hf_data["x"].notna() & hf_data["y"].notna()
    & hf_data["x"].between(-180, 180)
    & hf_data["y"].between(-90, 90)
)

# missing, out-of-range, and null-island (0, 0)
n_missing = (hf_data["x"].isna() | hf_data["y"].isna()).sum()
n_out_range = (
    hf_data["x"].notna() & hf_data["y"].notna() & ~valid
).sum()
n_zero = (
    (hf_data["x"] == 0) & (hf_data["y"] == 0)
).sum()

# low precision (fewer than 4 decimal places)
lon_dp = hf_data["x"].apply(count_decimals)
lat_dp = hf_data["y"].apply(count_decimals)
n_imprecise = int((valid & ((lon_dp < 4) | (lat_dp < 4))).sum())

# duplicate facility + coordinate, and shared locations
valid_data = hf_data.loc[valid].copy()
keys = (
    valid_data["hf"].astype(str) + "_"
    + valid_data["x"].astype(str) + "_"
    + valid_data["y"].astype(str)
)
n_duplicate = int(keys.duplicated().sum())
coord_key = (
    valid_data["x"].astype(str) + "_" + valid_data["y"].astype(str)
)
n_shared = int(
    coord_key.isin(coord_key[coord_key.duplicated()]).sum()
)

# build a GeoDataFrame of the valid points for the spatial checks
if adm0.crs is None:
    adm0 = adm0.set_crs("EPSG:4326")
hf_points = gpd.GeoDataFrame(
    valid_data,
    geometry=gpd.points_from_xy(valid_data["x"], valid_data["y"]),
    crs="EPSG:4326",
)
if hf_points.crs != adm0.crs:
    hf_points = hf_points.to_crs(adm0.crs)

# points outside the national boundary
inside = hf_points.within(adm0.union_all())
n_outside = int((~inside).sum())

# likely flipped lon/lat: outside as-is, but inside when swapped
swapped = gpd.GeoDataFrame(
    geometry=gpd.points_from_xy(hf_points["y"], hf_points["x"]),
    crs="EPSG:4326",
)
inside_swapped = swapped.within(adm0.union_all())
n_flipped = int((~inside & inside_swapped).sum())

# report the issue counts
cli_header(
    f"Coordinate quality summary ({len(hf_data)} facilities)"
)
cli_info(f"Missing coordinates: {n_missing}")
cli_info(f"Out-of-range coordinates: {n_out_range}")
cli_info(f"Null-island (0, 0): {n_zero}")
cli_info(f"Low precision (< 4 dp): {n_imprecise}")
cli_info(f"Duplicate facility and coordinate: {n_duplicate}")
cli_info(f"Records at shared locations: {n_shared}")
cli_info(f"Outside the national boundary: {n_outside}")
cli_info(f"Likely flipped lon/lat: {n_flipped}")

# flag each valid point by whether it falls inside the national boundary
hf_points = hf_points.assign(
    location=inside.map(
        {True: "Inside boundary", False: "Outside boundary"}
    )
)

# map the points over the national boundary
color_map = {"Inside boundary": "#2c7fb8", "Outside boundary": "#e31a1c"}
fig, ax = plt.subplots(figsize=(10, 8))
adm0.plot(ax=ax, facecolor="white", edgecolor="black", linewidth=0.8,
          aspect="equal")
for label, color in color_map.items():
    subset = hf_points.loc[hf_points["location"] == label]
    subset.plot(
        ax=ax, color=color, markersize=3, alpha=0.7, label=label,
        aspect="equal",
    )
ax.legend(
    loc="lower center", bbox_to_anchor=(0.5, -0.05),
    frameon=False, fontsize=9, ncol=2,
)
finish_map(
    ax,
    title="Coordinate check: facilities against the national boundary",
)

#### Step 3.3: Clean coordinates -----------------------------------------------

# define longitude (x) and latitude (y) coordinates
hf_data = hf_data.assign(x=hf_data["w_long"], y=hf_data["w_lat"])

# identify facilities with invalid coordinates for follow-up
facilities_to_review = hf_data.loc[
    hf_data["x"].isna()
    | hf_data["y"].isna()
    | (hf_data["x"] < -180) | (hf_data["x"] > 180)
    | (hf_data["y"] < -90) | (hf_data["y"] > 90)
]

# filter to keep only valid, in-range coordinates
facilities_clean = hf_data.loc[
    hf_data["x"].notna()
    & hf_data["y"].notna()
    & hf_data["x"].between(-180, 180)
    & hf_data["y"].between(-90, 90)
]

# print cleaning results
cli_header("Data cleaning results")
cli_info(f"Original health facilities: {len(hf_data)}")
cli_info(f"Clean health facilities: {len(facilities_clean)}")
cli_info(
    f"Removed facilities: {len(hf_data) - len(facilities_clean)}"
)
cli_info(
    f"Facilities flagged for review: {len(facilities_to_review)}"
)
cli_info(
    f"Longitude range: {facilities_clean['x'].min()} to "
    f"{facilities_clean['x'].max()}"
)
cli_info(
    f"Latitude range: {facilities_clean['y'].min()} to "
    f"{facilities_clean['y'].max()}"
)

### Step 4: Convert to Spatial Object ------------------------------------------

# convert the cleaned table to a GeoDataFrame using point coordinates
facilities_sf = gpd.GeoDataFrame(
    facilities_clean,
    geometry=gpd.points_from_xy(
        facilities_clean["w_long"], facilities_clean["w_lat"]
    ),
    crs="EPSG:4326",
)

cli_success("Converted to spatial object")
cli_header("Spatial object summary")
print(facilities_sf.head(3).to_string(index=False))

### Step 5: Assign Facilities to Administrative Units --------------------------

# keep only the admin boundary columns, renamed to avoid clashing
# with the facility data's own admin columns
boundary_lookup = gdf[["adm2", "adm3", "geometry"]].rename(
    columns={"adm2": "adm2_boundary", "adm3": "adm3_boundary"}
)

# assign a CRS fallback if missing before the spatial join
if boundary_lookup.crs is None:
    boundary_lookup = boundary_lookup.set_crs("EPSG:4326")
if facilities_sf.crs is None:
    facilities_sf = facilities_sf.set_crs("EPSG:4326")

# assign each facility to the chiefdom (adm3) polygon it falls within
facilities_admin = gpd.sjoin(
    facilities_sf,
    boundary_lookup,
    how="left",
    predicate="within",
)

# facilities that did not fall within any polygon
n_unmatched = facilities_admin["adm3_boundary"].isna().sum()

cli_header("Facilities assigned to an administrative unit")
cli_info(f"Facilities joined: {len(facilities_admin)}")
cli_info(f"Unmatched (outside all polygons): {n_unmatched}")

# compare the recorded chiefdom with the polygon the point falls in
print(
    facilities_admin[["hf", "adm3", "adm3_boundary"]]
    .head(5)
    .to_string(index=False)
)

### Step 6: Save Cleaned and Spatial Data --------------------------------------

# set up the output directory for processed datasets
out_path = Path(
    here(
        "01_data/1.1_foundational/"
        "1.1c_health_facilities/processed"
    )
)
out_path.mkdir(parents=True, exist_ok=True)

# save the cleaned facility table as a CSV
facilities_clean.to_csv(
    out_path / "facilities_clean.csv", index=False
)

# save the facility points as a GeoPackage
facilities_sf.to_file(
    out_path / "facilities_spatial.gpkg",
    layer="facilities",
    driver="GPKG",
)

### Step 7: Create Health Facility Maps ----------------------------------------

#### Step 7.1: Basic health facility map ---------------------------------------

# define the map title
title = "Health facilities in Sierra Leone"

# create the basic map
fig, ax = plt.subplots(figsize=(10, 8))
gdf.plot(ax=ax, facecolor="white", edgecolor="grey", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
ax.scatter(
    facilities_clean["x"],
    facilities_clean["y"],
    color="#47B5FF",
    s=15,
    alpha=0.7,
    zorder=5,
)
finish_map(ax, title=title, subtitle="adm3 boundaries with adm2 overlay")

# save the map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_locations.png"),
    width=10,
    height=7,
)

#### Step 7.2: Health facilities by type ---------------------------------------

# define the facility type map title
title_type = "Health facility types in Sierra Leone"

# get the ordered facility types and assign a Set1 color to each
facility_types = sorted(facilities_clean["type"].dropna().unique())
set1_colors = plt.get_cmap("Set1").colors
type_palette = {
    t: set1_colors[i % len(set1_colors)]
    for i, t in enumerate(facility_types)
}

# create health facility type map
fig, ax = plt.subplots(figsize=(10, 9))
gdf.plot(ax=ax, facecolor="white", edgecolor="gray", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
for ftype, color in type_palette.items():
    subset = facilities_clean.loc[facilities_clean["type"] == ftype]
    ax.scatter(
        subset["x"], subset["y"],
        color=color, s=5, alpha=0.8, label=ftype, zorder=5,
    )
ax.legend(
    title="Facility Type",
    loc="lower center",
    bbox_to_anchor=(0.5, -0.10),
    ncol=3,
    frameon=False,
    fontsize=8,
    title_fontsize=9,
)
finish_map(ax, title=title_type)

# save the type map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_by_type.png"),
    width=10,
    height=7,
)

#### Step 7.3: Specific health facility types ----------------------------------

# define the faceted map title
title_facet = "Health facility distribution by type in Sierra Leone"

# get the unique facility types for faceting
facility_types = sorted(facilities_clean["type"].dropna().unique())
n_types = len(facility_types)
ncols = 3
nrows = -(-n_types // ncols)  # ceiling division

fig, axes = plt.subplots(nrows, ncols, figsize=(12, 9))
axes_flat = axes.flatten()

for i, ftype in enumerate(facility_types):
    ax = axes_flat[i]
    gdf.plot(ax=ax, facecolor="white", edgecolor="gray",
             linewidth=0.2, aspect="equal")
    adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
                  linewidth=0.5, aspect="equal")
    subset = facilities_clean.loc[facilities_clean["type"] == ftype]
    ax.scatter(
        subset["x"], subset["y"],
        color="#47B5FF", s=5, alpha=0.7, zorder=5,
    )
    ax.set_title(ftype, fontsize=10, fontweight="bold")
    ax.set_axis_off()

# hide any unused axes panels
for j in range(n_types, len(axes_flat)):
    axes_flat[j].set_visible(False)

fig.suptitle(title_facet, fontsize=14, fontweight="bold", x=0.02,
             ha="left")
plt.tight_layout()

# save the faceted map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_faceted_by_type.png"),
    width=12,
    height=9,
)

### Step 8: Advanced Point Coordinates Visualizations --------------------------

#### Step 8.1: Point coordinates data summary ----------------------------------

# load the merged DHIS2-MFL dataset
final_dhis2_mfl_integrated = pd.read_parquet(
    Path(
        here(
            "01_data/1.1_foundational/"
            "1.1c_health_facilities/processed/"
            "final_dhis2_mfl_integrated.parquet"
        )
    )
)

# the year column may arrive as a string; coerce it to a nullable
# integer so the year filter below behaves the same way as R's lenient
# comparison ("2023" == 2023 is TRUE in R but False in Python)
final_dhis2_mfl_integrated["year"] = pd.to_numeric(
    final_dhis2_mfl_integrated["year"], errors="coerce"
).astype("Int64")

# back-fill missing lat/long from the MFL by facility name; many DHIS2
# rows arrive without coordinates because the upstream fuzzy match
# only confirmed a subset of facility identities
mfl_coords = (
    pd.read_excel(hf_path / "mfl_hfs.xlsx")
    [["hf", "w_lat", "w_long"]]
    .drop_duplicates("hf")
)
final_dhis2_mfl_integrated = final_dhis2_mfl_integrated.merge(
    mfl_coords, on="hf", how="left"
)
final_dhis2_mfl_integrated["lat"] = (
    final_dhis2_mfl_integrated["lat"]
    .combine_first(final_dhis2_mfl_integrated["w_lat"])
)
final_dhis2_mfl_integrated["long"] = (
    final_dhis2_mfl_integrated["long"]
    .combine_first(final_dhis2_mfl_integrated["w_long"])
)

# calculate annual summaries for each health facility
annual_summary = (
    final_dhis2_mfl_integrated
    .assign(
        total_tests=lambda d: (
            d["test_u5"].fillna(0)
            + d["test_5_14"].fillna(0)
            + d["test_ov15"].fillna(0)
        ),
        positive_tests=lambda d: (
            d["conf_u5"].fillna(0)
            + d["conf_5_14"].fillna(0)
            + d["conf_ov15"].fillna(0)
        ),
    )
    .groupby(["hf_uid", "hf", "year", "lat", "long"], as_index=False)
    .agg(
        total_tests=("total_tests", "sum"),
        positive_tests=("positive_tests", "sum"),
    )
    .assign(
        tpr=lambda d: np.where(
            d["total_tests"] > 0,
            (d["positive_tests"] / d["total_tests"]) * 100,
            np.nan,
        )
    )
    .loc[lambda d: d["total_tests"] > 0]
)

# filter for 2023 and assign a testing-volume category
size_levels = ["≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+"]
annual_summary_2023 = (
    annual_summary
    .loc[lambda d: (
        (d["year"] == 2023)
        & d["tpr"].notna()
        & (d["total_tests"] > 0)
        & d["lat"].notna()
        & d["long"].notna()
    )]
    .assign(
        size_category=lambda d: pd.Categorical(
            np.select(
                [
                    d["total_tests"] <= 100,
                    d["total_tests"] <= 500,
                    d["total_tests"] <= 1000,
                    d["total_tests"] <= 5000,
                ],
                ["≤100", "101-500", "501-1,000", "1,001-5,000"],
                default="5,000+",
            ),
            categories=size_levels,
            ordered=True,
        )
    )
)

#### Step 8.2: Single indicator points coordinate visualization ----------------

# build the color map matching R's rev(brewer.pal(7, "RdYlBu"))
rdylbu_7_rev = [
    "#4575b4", "#91bfdb", "#e0f3f8",
    "#ffffbf", "#fee090", "#fc8d59", "#d73027",
]
cmap_tpr = mcolors.LinearSegmentedColormap.from_list(
    "rdylbu_rev", rdylbu_7_rev
)

fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, facecolor="white", edgecolor="gray", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
norm = mcolors.Normalize(vmin=0, vmax=100)
sc = ax.scatter(
    annual_summary_2023["long"],
    annual_summary_2023["lat"],
    c=annual_summary_2023["tpr"],
    cmap=cmap_tpr,
    norm=norm,
    s=15,
    alpha=0.8,
    edgecolors="black",
    linewidths=0.5,
    zorder=5,
)
sm = ScalarMappable(cmap=cmap_tpr, norm=norm)
sm.set_array([])
cbar = fig.colorbar(
    sm, ax=ax, orientation="horizontal", fraction=0.04, pad=0.04
)
cbar.set_label("Test Positivity Rate (%)", fontweight="bold")
finish_map(ax, title="Health Facility Test Positivity Rate (2023)")

# save the map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_tpr_2023.png"),
    width=10,
    height=7,
)

#### Step 8.3: Multiple indicator points coordinate visualization --------------

# size mapping matching the R size_scale
size_map = {
    "≤100": 20, "101-500": 30, "501-1,000": 50,
    "1,001-5,000": 80, "5,000+": 120,
}

# build the color map matching R's rev(brewer.pal(7, "RdYlBu"))
rdylbu_7_rev = [
    "#4575b4", "#91bfdb", "#e0f3f8",
    "#ffffbf", "#fee090", "#fc8d59", "#d73027",
]
cmap_tpr = mcolors.LinearSegmentedColormap.from_list(
    "rdylbu_rev", rdylbu_7_rev
)

fig, ax = plt.subplots(figsize=(10, 12))
gdf.plot(ax=ax, facecolor="white", edgecolor="gray", linewidth=0.3,
         aspect="equal")
adm2_gdf.plot(ax=ax, facecolor="none", edgecolor="black",
              linewidth=0.5, aspect="equal")
norm = mcolors.Normalize(vmin=0, vmax=100)

# derive the marker size for every row from the size_category column,
# falling back to the smallest size when the category is missing
size_levels = ["≤100", "101-500", "501-1,000", "1,001-5,000", "5,000+"]
point_sizes = (
    annual_summary_2023["size_category"]
    .astype(str)
    .map(size_map)
    .fillna(size_map["≤100"])
)

# plot all facilities in a single scatter call so every point renders
ax.scatter(
    annual_summary_2023["long"],
    annual_summary_2023["lat"],
    c=annual_summary_2023["tpr"],
    cmap=cmap_tpr,
    norm=norm,
    s=point_sizes,
    alpha=0.7,
    edgecolors="black",
    linewidths=0.3,
    zorder=5,
)

# colorbar for TPR
sm = ScalarMappable(cmap=cmap_tpr, norm=norm)
sm.set_array([])
cbar = fig.colorbar(
    sm, ax=ax, orientation="horizontal", fraction=0.04, pad=0.04
)
cbar.set_label("Test Positivity Rate (%)", fontweight="bold")

# build proxy handles for the size legend so it always renders, even if a
# size category has no facilities
size_handles = [
    plt.scatter(
        [], [], s=size_map[cat], color="lightgray",
        edgecolors="black", linewidths=0.3, alpha=0.7,
    )
    for cat in size_levels
]
ax.legend(
    size_handles, size_levels,
    title="Total Tested",
    loc="upper center",
    bbox_to_anchor=(0.5, -0.18),
    ncol=5,
    frameon=False,
    fontsize=8,
    title_fontsize=9,
)
finish_map(
    ax,
    title=(
        "Health Facility Test Positivity Rate and Testing Volume (2023)"
    ),
)

# save the map
save_figure(
    fig,
    here("03_output/3a_figures/health_facility_tpr_test_2023.png"),
    width=10,
    height=7,
)