

Shapefile management and customization

Intermediate

Overview

In the SNT context, having an official, accurate, and up-to-date shapefile is essential. These boundaries determine the scale at which analysts aggregate data, calculate indicators, apply targeting criteria, and visualize outputs. Misaligned or incomplete shapefiles can introduce serious errors, such as dropped areas, misclassified units, or misleading maps, which compromise both the validity of the analysis and its usefulness for decision-making.

More on spatial data

For background information on shapefiles and links to all the spatial data content in the SNT code library, including rasters and point data, please visit Spatial data overview.

Proper shapefile management, including attribute cleaning, name harmonization, geometry validation, and display simplification, is essential for spatial coherence, reproducibility, and alignment with SNT goals. A validated boundary layer underpins all joins, analyses, and outputs in the SNT workflow. This section introduces core shapefile concepts and outlines the key steps for preparing and customizing them for use.

This page introduces the core steps for managing and customizing shapefiles for use in the SNT workflow. The Sierra Leone shapefile prepared in these sections, along with the visualization concepts and spatial knowledge introduced, is used throughout the remainder of the SNT code library to support all analysis and mapping steps.

Foundational Reference Guide

This page serves as a foundational reference for all spatial operations in the SNT Code Library. Many of the subsequent analysis pages, including those that download and process raster data such as population, climate, and modeled estimates, rely heavily on the shapefile management, cleaning, and troubleshooting methods presented here. The techniques for geometry validation, coordinate reference system alignment, boundary aggregation, and duplicate handling are essential prerequisites for successful spatial joins and accurate geographic analysis throughout the SNT workflow.

It is worth running your shapefile through these validation checks at the outset to prevent issues down the line. Keep this page bookmarked as a reference guide to return to when encountering spatial data issues in later analysis steps.

Objectives
  • Import, inspect, and prepare shapefiles and other formats
  • Validate and clean geometries to ensure accurate joins and maps
  • Aggregate or customize administrative boundaries for flexible analysis
  • Generate geometry hashes
  • Save cleaned shapefile files for later use

Considerations for Managing and Customizing Shapefiles

Before a shapefile is used for any purpose, whether for joining data, visualizing maps, aggregating indicators, or running spatial analysis, it must be clean, valid, and properly structured. This section outlines the key challenges and best practices in shapefile preparation that determine the reliability of all downstream steps in the SNT workflow. These steps form the data foundation on which reliable SNT analysis is built.

Valid and Complete Geometries

Shapefiles must be geometrically valid. Possible problems include:

  • spatial artifacts and overlaps from poor digitization
  • empty or zero-area polygons
  • multipart geometries or holes that affect containment logic

Invalid geometries can silently break joins or misplace features. In SNT workflows, this may result in missing areas in coverage maps or unmatched clusters in analysis of survey data. It is always important to identify invalid geometries, flag them, and correct them where possible.

Ensuring Unique Polygons

Each unit in your shapefile should represent a single, distinct geographic area. Duplicate polygons, caused by digitization errors or mistakenly repeated records, can lead to incorrect joins, double-counting in aggregations, and confusion in mapping.

Why this matters: Duplicate features with the same name or ID can:

  • Distort summary statistics
  • Lead to incorrect matches when joining multiple shapefiles
  • Create misleading visualizations or maps

Possible issues to resolve:

  • Duplicate geometries (polygons or other geometry) with identical or repeated IDs

  • Multiple records representing the same unit

Example: If two features both have the name “Bumpe Ngao” and cover the same area, population estimates may be counted twice during zonal extraction or incidence calculations. One of the duplicates should be removed before using the shapefile.

You can use geospatial tools to detect and resolve duplicates—such as flagging repeated names or IDs, comparing geometries for equality, or dissolving features by administrative codes. We will demonstrate several examples in the Step-by-Step section below.

Coordinate Reference Systems (CRS)

For spatial joins to work correctly, both spatial data files must share the same CRS. A mismatch can cause points to fall outside their true polygons, even if coordinates are correct. Always confirm that both datasets use a common, appropriate CRS as mismatches can lead to misalignment and incorrect results. Typically WGS84 (EPSG:4326) is used. For more information on CRS, please visit Spatial data overview.
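As a minimal sketch of this check-and-align pattern (synthetic layers; in practice `pts` and `polys` would be the two files you intend to join):

```r
library(sf)

# two small synthetic layers whose CRS do not match
pts <- sf::st_sf(
  geometry = sf::st_sfc(sf::st_point(c(-13.2, 8.5)), crs = 4326)
)
polys <- sf::st_sf(
  geometry = sf::st_sfc(
    sf::st_polygon(list(rbind(
      c(-14, 8), c(-12, 8), c(-12, 9), c(-14, 9), c(-14, 8)
    ))),
    crs = 4326
  )
) |>
  sf::st_transform(6933)  # deliberately reprojected to create a mismatch

# reproject one layer into the other's CRS before any join
if (sf::st_crs(pts) != sf::st_crs(polys)) {
  pts <- sf::st_transform(pts, sf::st_crs(polys))
}
```

With the CRS aligned, operations such as sf::st_join() or sf::st_intersects() will place points inside their correct polygons.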

Aggregating or Reconstructing Admin Levels

Before starting any analysis, analysts should request all official shapefiles at the required unit of analysis and at all higher administrative levels. For example, if the chosen unit is adm2, you should also obtain the official adm1 shapefile. This is not only important for harmonizing data across levels, but also essential for producing SNT maps that include national and regional context and are easy to interpret.

However, in practice, you may find that the shapefile for your intended unit of analysis is missing or incomplete. For example, you might have data at adm2 but only a complete shapefile for adm3. In such cases, you can aggregate polygons from a more granular shapefile using a shared identifier to reconstruct the higher-level boundaries.

  • Adm3 polygons can be grouped and dissolved by a common adm2 name or code to recreate adm2 boundaries. The same approach applies to adm1 boundaries.
  • This technique is used in SNT when reporting at broader levels (e.g., provinces or regions) or when visualizing national maps with consistent resolution, if dedicated shapefiles at the larger admin unit are not available.

Example: In Sierra Leone, if only adm3 boundaries are available, you can group adm3 units by their district name and merge their geometries. This reconstructs an adm2-level shapefile that matches the data and enables accurate mapping and aggregation.

This process ensures that your shapefile structure aligns with the level of your indicators and supports clean joins, visualizations, and reporting across administrative tiers.

Partial Aggregation for Flexible Resolution

In some cases, it may be beneficial to represent different parts of the country at different administrative levels within the same shapefile. For example, an SNT analysis might require adm3 detail for elimination areas, while the rest of the country can be visualized and analyzed at adm2.

This strategy is particularly useful when interventions or programmatic decisions require more granularity in some areas than others. It provides a practical way to balance detail and usability in maps and analysis while ensuring alignment with country-specific priorities and operational needs. This approach allows you to retain detailed boundaries in certain areas while using larger-unit boundaries elsewhere.

Handling mixed admin levels within a country

In some SNT exercises, countries choose to use different administrative levels in different regions, for instance, adm3 for urban areas and adm2 for rural zones. This reflects operational realities, data availability, or planning needs.

To support this hybrid setup, you can clip each shapefile to its relevant region and then merge them into a single boundary layer. With proper alignment of geometries and attributes, this structure allows for flexible, context-sensitive spatial analysis. The techniques needed to do this are covered on this page.

Tracking Boundary Changes

Administrative boundaries can shift over time, through redistricting, renaming, or reclassification. In SNT, it is important to detect these changes to ensure geographic consistency for time-series or trend analysis, and to make sure historical data is mapped and interpreted correctly.

While visualizing and comparing different versions of a shapefile by eye is a simple way to look for visible differences, it is also possible to detect changes more systematically and rigorously. One way to do this is by generating a geometry hash, a unique identifier based on a polygon’s coordinates. If a shape changes, its hash will change. If only the name or code changes but the geometry stays the same, the hash remains stable, making it easier to detect renames or confirm spatial continuity.

Example: If Kambia district’s geometry is unchanged but renamed across versions, its hash stays the same. If the shape is redrawn, the hash will differ, flagging it for review.

This approach supports version control, longitudinal consistency, and reproducible spatial workflows. Methods for computing and comparing hashes are covered later in this section.
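The underlying idea can be illustrated without any spatial packages (a minimal sketch assuming the `digest` package is installed; the code library itself uses `sntutils::vdigest` for the same purpose):

```r
library(digest)

# two versions of the "same" polygon: identical coordinates, different name
coords_v1 <- matrix(c(0, 0, 1, 0, 1, 1, 0, 1, 0, 0), ncol = 2, byrow = TRUE)
coords_v2 <- coords_v1  # unit renamed between versions, geometry untouched

hash_v1 <- digest::digest(coords_v1)
hash_v2 <- digest::digest(coords_v2)
identical(hash_v1, hash_v2)  # TRUE: rename only, spatial continuity confirmed

# a redrawn boundary produces a different hash, flagging it for review
coords_v3 <- coords_v1
coords_v3[3, ] <- c(1.001, 1)
identical(hash_v1, digest::digest(coords_v3))  # FALSE: geometry changed
```

In a real shapefile you would hash each feature's geometry (for example, its well-known-binary form from sf::st_as_binary()) and compare the hash columns across shapefile versions.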

Step-by-Step

In this section, we guide you through the essential steps to load, manage, troubleshoot, and customize shapefile data.

The example uses administrative boundary shapefiles from Sierra Leone, with a focus on the chiefdom (adm3) and district (adm2) levels. The principles can be applied to shapefiles from any country.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Install and load packages

First, install and/or load the necessary packages for handling spatial data, data manipulation, and visualization. These libraries provide essential functions for working with spatial data.

  • R
  • Python

The sf package is particularly important as it implements simple features standards for handling geographic vector data in R.

# Install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Load required packages using pacman
pacman::p_load(
  sf,         # handling shapefile data
  dplyr,      # data manipulation
  ggplot2,    # plotting
  here,       # file path management
  stringr,    # string manipulation (e.g., str_remove)
  cli,        # clean logging and CLI-style messages
  devtools    # for package management
)

# sntutils has a number of useful helper functions for
# data management and cleaning
if (!requireNamespace("sntutils", quietly = TRUE)) {
  devtools::install_github("ahadi-analytics/sntutils")
}

To adapt the code:

  • Do not modify anything in the code above

Step 2: Load and prepare shapefiles

Step 2.1: Load and examine data

Import the shapefile and examine its structure to confirm the administrative levels and attribute fields. This important step ensures understanding of the data before any cleaning or joins.

We begin by loading the Sierra Leone shapefile and selecting the relevant administrative levels for analysis. A new column for adm0 (country name) is added, and admin unit fields are renamed to standard adm1, adm2, and adm3 labels for consistency across the SNT workflow.

  • R
  • Python
# set up spatial path
spat_path <- here::here("1.1_foundational", "1.1a_administrative_boundaries")

# import shapefile
shp_adm3 <- sf::read_sf(
  here::here(spat_path, "raw", "Chiefdom2021.shp")
) |>
  dplyr::mutate(adm0 = "SIERRA LEONE") |>
  dplyr::select(
    # select relevant columns and rename to standard nomenclature
    adm0,
    adm1 = FIRST_REGI,
    adm2 = FIRST_DNAM,
    adm3 = FIRST_CHIE
  ) |>
  # ensure output remains a valid sf object for downstream usage
  sf::st_sf()

# check output by mapping
plot(shp_adm3)
Output

To adapt the code:

The code below shows how you can modify the loading process when working with other spatial vector formats. All of these examples use sf::read_sf(), which supports multiple formats via GDAL.

GeoJSON: If your boundary file is in GeoJSON format, simply update the file extension

shp_adm3 <- sf::read_sf(
  here::here("folder", "boundary.geojson")  # path to geojson file
)

GeoPackage (.gpkg): For GeoPackage files, you must specify both the file and the layer name:

shp_adm3 <- sf::read_sf(
  here::here("folder", "boundary.gpkg"),    # path to gpkg file
  layer = "admin_boundaries"               # layer name inside gpkg
)

ℹ️ Use sf::st_layers("boundary.gpkg") to list all available layers in the file.

File Geodatabase (.gdb): For Geodatabase files, you must specify both the file and the layer name:

shp_adm3 <- sf::read_sf(
  "path/to/geodatabase.gdb",     # path to gdb
  layer = "admin_boundaries"    # layer name inside gdb
)

Once imported, the rest of the processing pipeline (mutate(), select(), st_set_crs(), etc.) remains the same.

Additional adaptations:

  • Line 2: Change the here::here() path to match your folder structure
  • Line 5: If you are working with a shapefile at a different admin unit level, you may want to change the variable name for the spatial object away from shp_adm3. For example, if working at adm2, shp_adm2 would be more appropriate.
  • Lines 11–14: Update the administrative column names (adm0, adm1, adm2, adm3) to reflect your dataset. For example, if your shapefile does not contain an adm3 level, remove the adm3 line.

Step 2.2: Check and transform CRS

Once loaded, it is important to ensure that your shapefile is in the appropriate coordinate reference system (CRS).

Below, we check the current CRS of the shapefile and, if needed, transform it to a consistent geographic reference system, ensuring all layers align correctly and remain compatible with other spatial datasets we might use.

The Sierra Leone shapefile is already in WGS84 (EPSG:4326), so normally we would not need to change the CRS. However, for demonstration purposes we show how to transform it to an equal-area CRS (EPSG:6933).

  • R
  • Python
cli::cli_h2("Check and assign original CRS for shp_adm3")
shp_adm3 <- sf::st_set_crs(shp_adm3, 4326)
cli::cli_alert_info("Original CRS: {sf::st_crs(shp_adm3)$input}")

# Transform to a new CRS (e.g., equal-area EPSG:6933)
shp_adm3_diff_crs <- sf::st_transform(shp_adm3, 6933)

cli::cli_h2("Check CRS after transformation")
cli::cli_alert_info("New CRS: {sf::st_crs(shp_adm3_diff_crs)$input}")
Output
── Check and assign original CRS for shp_adm3 ──
ℹ Original CRS: EPSG:4326
── Check CRS after transformation ──
ℹ New CRS: EPSG:6933

To adapt the code:

  • Lines 1–2: Don’t change anything unless you’re using a different variable for your shapefile object (e.g., shp_adm2) or need a different CRS.
  • Line 2: Use st_set_crs() only if the shapefile is missing a CRS but you’re certain of what it should be. If you need to reproject the geometry, always use st_transform().
  • Line 6: To use another projection, just update the EPSG code in st_transform(), for example to 4326 for the WGS84 CRS usually used in SNT, if the original shapefile had a different CRS.
  • If your shapefile was already in the correct CRS, lines 6 onwards can be removed.

Step 3: Geometry validation

Step 3.1: Check and correct geometry validity

Invalid geometries, like overlaps, gaps, or broken shapes, can cause joins to fail or maps to misalign. These issues can go unnoticed but ultimately lead to missing areas or incorrect outputs. Before doing any joins, always check that your shapefile is valid.

  • R
  • Python
# check geometry validity
shp_adm3_inv <- shp_adm3 |>
  dplyr::mutate(
    valid = sf::st_is_valid(geometry),
    invalid_reason = sf::st_is_valid(geometry, reason = TRUE),
    invalid_reason = stringr::str_remove(invalid_reason, "\\[.*")
  )

# get list of invalid adm3s
invalid_df <- shp_adm3_inv |>
  dplyr::filter(!valid) |>
  sf::st_drop_geometry() |>
  dplyr::select(adm0, adm1, adm2, adm3, invalid_reason)

# view invalid rows (if any)
invalid_df
Output
# A tibble: 2 × 5
  adm0         adm1       adm2      adm3             invalid_reason             
  <chr>        <chr>      <chr>     <chr>            <chr>                      
1 SIERRA LEONE EASTERN    KENEMA    NIAWA            Edge 936 has duplicate ver…
2 SIERRA LEONE NORTH EAST KOINADUGU WARA WARA YAGALA Edge 1091 has duplicate ve…

To adapt the code:

  • Lines 2, 10, and 13: Replace shp_adm3 with the variable name for your spatial object, such as shp_adm2.

Two chiefdoms, Niawa and Wara Wara Yagala, have invalid geometries: the validity check reports duplicate vertices, where the same point appears more than once along a polygon edge (see this guide for common invalid geometry reasons). These issues tend to arise from digitization errors or overly complex boundaries and can disrupt spatial joins and visualizations. It is important to identify and correct them before further processing.

We will retain invalid_df, which lists the chiefdoms with unresolved issues, for potential manual review later, then proceed to make all geometries in shp_adm3 valid.

  • R
  • Python
# fix known geometry issues
# (e.g. self-intersections, unclosed rings)
shp_adm3 <- shp_adm3 |>
  sf::st_make_valid()

# re-check and report remaining invalid geometries
n_invalid <- shp_adm3 |>
  dplyr::mutate(valid = sf::st_is_valid(geometry)) |>
  dplyr::filter(!valid) |>
  nrow()

cli::cli_alert_info("Remaining invalid geometries: {n_invalid}")
Output
ℹ Remaining invalid geometries: 0

To adapt the code:

  • Lines 3 and 7: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.

All geometries are now valid and safe for future spatial operations.

Step 3.2: Check for duplicate polygons

Each polygon should represent a unique geographic unit. Duplicate shapes, whether exact copies or multiple records for the same area, can cause double-counting, join errors, and misleading summaries. Always check for and remove duplicates before proceeding. Below, we outline simple checks and fixes using standard approaches.

  • R
  • Python
# check for duplicate geometries
shp_adm3_dup <- shp_adm3 |>
  dplyr::mutate(dup_geom = duplicated(sf::st_as_binary(geometry))) |>
  dplyr::filter(dup_geom == TRUE)

cli::cli_alert_info("Number of duplicated geometries: {nrow(shp_adm3_dup)}")
Output
ℹ Number of duplicated geometries: 3

To adapt the code:

  • Lines 2 and 6: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.

The results show 3 duplicated geometries. It is important to investigate and resolve them before proceeding with analysis or integration. Use the check below to identify duplicates based on attributes and geometry. It is helpful to distinguish between:

  • Full row duplicates: Same attributes and same geometry.
  • Geometry-only duplicates: Identical shapes with differing labels.
  • R
  • Python
# Check for full row duplicates
dups_full <- shp_adm3 |>
  dplyr::group_by(across(everything())) |>
  dplyr::filter(dplyr::n() > 1)

# Check for geometry-only duplicates
dups_geom <- shp_adm3 |>
  dplyr::mutate(geom_hash = sntutils::vdigest(geometry)) |>
  dplyr::add_count(geom_hash, name = "n_geom") |>
  dplyr::filter(n_geom > 1)

# View results
cli::cli_h1("Checking for full row duplicates")
dups_full |> sf::st_drop_geometry()

cli::cli_h1("Checking for geometry-only duplicates")
dups_geom |> sf::st_drop_geometry()
Output
── Checking for full row duplicates ────────────────────────────────────────────
# A tibble: 4 × 4
# Groups:   adm0, adm1, adm2, adm3 [2]
  adm0         adm1    adm2     adm3 
* <chr>        <chr>   <chr>    <chr>
1 SIERRA LEONE EASTERN KAILAHUN DEA  
2 SIERRA LEONE EASTERN KAILAHUN JAHN 
3 SIERRA LEONE EASTERN KAILAHUN DEA  
4 SIERRA LEONE EASTERN KAILAHUN JAHN 
── Checking for geometry-only duplicates ───────────────────────────────────────
# A tibble: 6 × 6
  adm0         adm1    adm2     adm3            geom_hash                 n_geom
* <chr>        <chr>   <chr>    <chr>           <chr>                      <int>
1 SIERRA LEONE EASTERN KAILAHUN DEA             1630a6761aecfc9d43ccf339…      2
2 SIERRA LEONE EASTERN KAILAHUN JAHN            3c06ac4770522ec5ac11f0a8…      2
3 SIERRA LEONE EASTERN KAILAHUN MALEMA          d50634a03d2d9968d64d8319…      2
4 SIERRA LEONE EASTERN KAILAHUN DEA             1630a6761aecfc9d43ccf339…      2
5 SIERRA LEONE EASTERN KAILAHUN JAHN            3c06ac4770522ec5ac11f0a8…      2
6 SIERRA LEONE EASTERN KAILAHUN DUPLICATE LABEL d50634a03d2d9968d64d8319…      2

To adapt the code:

  • Lines 2, 7, 14, and 17: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.
  • Ensure the geometry column is named geometry (as it is with sf::read_sf()).

In this example:

  • DEA and JAHN each appear twice with identical attributes and geometry, indicating full row duplicates. These are likely accidental (e.g., created during merging or binding) and can be safely removed using a deduplication step appropriate to your software or language.

  • MALEMA and DUPLICATE LABEL share the same geometry but have different names. This suggests a labeling or naming issue, not a problem with the shapefile itself. These cases may reflect an administrative convention or a mismatched join, rather than spatial error.

Use geometry hashes to detect changes in geometry

We use geometry hashing to assign a unique ID to each shape, making it easier to detect and compare changes across shapefile versions. See Step 6 for more detail.

Step 3.3: Resolve any duplicated polygons

Next we will resolve the duplicated polygons in our shapefile. We will first remove the full-row duplicates. Then for geometry-only duplicates, we will drop one version. In this example, we will drop the polygon with adm3 name DUPLICATE LABEL. Finally, we will check again for geometry duplicates to make sure all duplicates are now resolved.

Consult SNT team

If the duplicates aren’t clearly redundant, or reflect more than just data entry issues, consult the SNT team before making any reconciliation decisions. Some cases may involve overlapping units, legacy naming conventions, or intentional duplicates used in planning. Don’t drop anything unless the purpose of the duplication is understood.

Always document your cleaning steps clearly, so the SNT team is aware of any changes and can trace the process if needed.

  • R
  • Python
# remove full row duplicates
shp_adm3 <- shp_adm3 |>
  dplyr::distinct()

# if needed, remove known geometry-only duplicates by filtering on labels
# drop intentionally duplicated label
shp_adm3 <- shp_adm3 |>
  dplyr::filter(adm3 != "DUPLICATE LABEL")

# check again after cleaning
# check for geometry-only duplicates in shp_adm3
shp_adm3_dup <- shp_adm3 |>
  dplyr::mutate(dup_geom = duplicated(sf::st_as_binary(geometry))) |>
  dplyr::filter(dup_geom == TRUE)

cli::cli_alert_info(
  "Number of duplicated geometries in shp_adm3: {nrow(shp_adm3_dup)}"
)
Output
ℹ Number of duplicated geometries in shp_adm3: 0

To adapt the code:

  • Lines 2, 7, 12, and 16: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.
  • Line 8: Update the column used in filter() (e.g., adm3, facility_name, or district_name) to match your dataset’s naming convention.
  • Line 2: Use dplyr::distinct() to drop exact duplicate rows (same attributes and geometry).
  • Line 8: Use dplyr::filter() to remove known label-based duplicates by passing in the name of the label to be dropped (e.g., rows with DUPLICATE LABEL).

The results show that the operations worked, as the dataset no longer has full row or geometry-only duplicates. The dataset now contains only unique, valid geometries.

Check for historical boundary versions

In some cases, shapefiles or geodatabases may include multiple versions of the same boundaries, each associated with metadata indicating when that boundary was in use (e.g., start_date, end_date).

Always check the attribute table and metadata for date-related fields. If your dataset includes historical versions, you may need to filter to the most recent boundaries, such as:

shp <- shp |>
  dplyr::filter(end_date == "9999-12-31")  # or use the most recent date available

🔎 If you’re unsure how boundaries were versioned or labeled, consult with the SNT team or data owner before proceeding.

Step 4: Aggregate or reconstruct higher-level administrative boundaries

In some cases, shapefiles may only be available at a lower administrative level (e.g., adm3). You can aggregate these adm3 units to construct adm2, adm1, or adm0 boundaries.

  • R
  • Python
# create adm0-level shapefile by
# grouping over the country name
shp_adm0 <- shp_adm3 |>
  dplyr::group_by(adm0) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# create adm1-level shapefile by
# grouping over the region (e.g., province)
shp_adm1 <- shp_adm3 |>
  dplyr::group_by(adm0, adm1) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# create adm2-level shapefile (e.g., districts)
shp_adm2 <- shp_adm3 |>
  dplyr::group_by(adm0, adm1, adm2) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# Check to see the operations worked
# set up a 2x2 plotting area
graphics::par(mfrow = c(2, 2))

# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)

# Plot each administrative level
plot(shp_adm3$geometry, col = plot_col, main = "Original adm3")
plot(shp_adm2$geometry, col = plot_col, main = "Aggregated adm2 (from adm3)")
plot(shp_adm1$geometry, col = plot_col, main = "Aggregated adm1 (from adm3)")
plot(shp_adm0$geometry, col = plot_col, main = "Aggregated adm0 (from adm3)")

# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
Output

To adapt the code:

  • Lines 3, 10, and 16: Replace shp_adm3 with your shapefile object.
  • Lines 4, 11–12, and 17–18: Update the column names (adm0, adm1, adm2) to match those in your data.
  • Lines 4, 11–12, and 17–18: Adjust the grouping level depending on which higher admin level you want to create.

The plots confirm that aggregation to adm0, adm1, and adm2 levels worked as expected. Each map shows progressively coarser boundaries, and the original adm3 structure is preserved.

Step 5: Shapefile customization

Step 5.1: Mixed-level shapefiles

In some cases, you may want to combine different administrative levels within a single shapefile—for example, using adm2 for most of the country while substituting finer adm3 boundaries in selected areas. This approach is useful when certain districts require more granularity for analysis, planning, or visualization, but a full switch to adm3 isn’t necessary.

One use case for this approach is when finer geographic detail is needed only in specific parts of the country. For example, in Sierra Leone, you may want to use adm3 boundaries within Western Area Urban, which includes Freetown, a densely populated area, while retaining adm2 boundaries for the rest of the country. This allows you to maintain national coverage at a manageable resolution, while zooming in where more granularity is required.

  • R
  • Python
# remove target adm2 unit from adm2 shapefile
target_adm2 <- "WESTERN URBAN"

adm2_trimmed <- shp_adm2 |>
  dplyr::filter(adm2 != target_adm2)

# get the equivalent adm3 units
adm3_subset <- shp_adm3 |>
  dplyr::filter(adm2 == target_adm2)

# optional: add a column to indicate source level
adm2_trimmed <- adm2_trimmed |> dplyr::mutate(source = "adm2")
adm3_subset <- adm3_subset |> dplyr::mutate(source = "adm3")

# combine them into a single shapefile
shp_mixed <- dplyr::bind_rows(adm2_trimmed, adm3_subset)

# Check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))

# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)

# Plot WESTERN to show the finer changes
shp_adm2 |>
  dplyr::filter(adm1 == "WESTERN") |>
  sf::st_geometry() |>
  plot(col = plot_col, main = "Original adm2 (zoomed on WESTERN region)")

shp_mixed |>
  dplyr::filter(adm1 == "WESTERN") |>
  sf::st_geometry() |>
  plot(col = plot_col, main = "Mixed adm2 (zoomed on WESTERN region)")


# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
Output

To adapt the code:

  • Line 2: Replace WESTERN URBAN with the name of the admin unit you want to swap.
  • Lines 4–5, 8–9: Update adm2 and adm3 column names if your shapefile uses different labels.
  • Repeat entire code block as needed for all areas of interest.

The results show that the operation worked: in the WESTERN region, Western Rural remains as a single adm2 unit, while Western Urban has been successfully replaced by several finer adm3 units. This zoomed view shows differences not visible at the national scale. While shp_mixed includes all regions, this comparison confirms the targeted substitution.
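The filter-and-bind pattern is not specific to sf. As a minimal sketch with hypothetical toy records (plain Python, no GIS library; the unit names are illustrative only), the same substitution logic looks like this:

```python
# toy records standing in for shapefile rows (hypothetical data)
adm2_rows = [
    {"adm1": "WESTERN", "adm2": "WESTERN URBAN"},
    {"adm1": "WESTERN", "adm2": "WESTERN RURAL"},
    {"adm1": "EASTERN", "adm2": "KAILAHUN"},
]
adm3_rows = [
    {"adm1": "WESTERN", "adm2": "WESTERN URBAN", "adm3": "CENTRAL FREETOWN"},
    {"adm1": "WESTERN", "adm2": "WESTERN URBAN", "adm3": "EAST FREETOWN"},
    {"adm1": "WESTERN", "adm2": "WESTERN RURAL", "adm3": "KOYA RURAL"},
]

target_adm2 = "WESTERN URBAN"

# drop the target unit from the adm2 layer, tagging its source level
adm2_trimmed = [
    {**r, "source": "adm2"} for r in adm2_rows if r["adm2"] != target_adm2
]

# pull in the equivalent finer units, tagged as well
adm3_subset = [
    {**r, "source": "adm3"} for r in adm3_rows if r["adm2"] == target_adm2
]

# bind into a single mixed-level layer
mixed = adm2_trimmed + adm3_subset
```

The `source` column is what lets you tell, after the bind, which rows came from which level.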

Step 5.2: Merging multiple polygons into a single unit

In some cases, it may be necessary to merge two or more adjacent administrative units into a single geographic feature. This could be done for operational convenience, reporting simplicity, or when small neighboring units should be treated as one planning zone.

For example, we show how three southern districts in Sierra Leone (Bonthe, Moyamba, and Pujehun) can be combined into a single unit.

  • R
  • Python
Show the code
# define the adm2 units to combine
combine_names <- c("BONTHE", "MOYAMBA", "PUJEHUN")

# create a new combined unit
combined <- shp_adm2 |>
  dplyr::filter(adm2 %in% combine_names) |>
  dplyr::summarise(
    adm3 = paste(combine_names, collapse = " ~ "),
    geometry = sf::st_union(geometry),
    .groups = "drop"
  )

# drop the originals and bind the new unit
shp_adm2_combined <- shp_adm2 |>
  dplyr::filter(!adm2 %in% combine_names) |>
  dplyr::bind_rows(combined)

# check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))

# set colors
plot_col <- sf::sf.colors(10, categorical = TRUE)

# Plot each administrative level
plot(shp_adm2$geometry, col = plot_col, main = "Original adm2")
plot(
  shp_adm2_combined$geometry,
  col = plot_col,
  main = "Combined adm2: Bonthe + Moyamba + Pujehun"
)

# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
Output

To adapt the code:

  • Line 2: Change the names in combine_names to the units you want to merge.
  • Lines 6, 15: Update the adm2 field if your shapefile uses a different name (e.g., district_name).
  • Line 8: Adjust the output name (adm3 = paste(...)) to reflect how you want the merged unit labeled—e.g., use Southern Cluster instead.
  • Repeat entire code block as needed for all areas of interest.

The plot shows that the selected districts have been successfully merged into a single polygon, forming a custom geographic unit.
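The merge boils down to three moves: select the target units, build a combined label, and replace the originals with the dissolved unit. A plain-Python sketch of that bookkeeping (hypothetical records; the geometry union itself is stubbed out, since dissolving polygons needs a GIS library such as sf or shapely):

```python
# toy adm2 records (hypothetical data, geometry omitted)
rows = [
    {"adm2": "BONTHE"},
    {"adm2": "MOYAMBA"},
    {"adm2": "PUJEHUN"},
    {"adm2": "BO"},
    {"adm2": "KENEMA"},
]

combine_names = ["BONTHE", "MOYAMBA", "PUJEHUN"]

# build the combined unit; a real implementation would also
# union the three geometries here (e.g. sf::st_union)
combined = {"adm2": " ~ ".join(combine_names)}

# drop the originals and append the new unit
merged = [r for r in rows if r["adm2"] not in combine_names] + [combined]
```

The `" ~ ".join(...)` mirrors the R `paste(combine_names, collapse = " ~ ")` labeling used above.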

Step 6: Create geographic IDs using hashes

To track changes in boundary geometry over time or across shapefile versions, we generate a geometry-based hash ID for each polygon. This allows us to detect if a unit has changed spatially, even if its name remains the same.

This step is useful for verifying shapefile updates, aligning geographies across years, and making sure historical comparisons are done correctly. We can also use geometry hashes to check for duplicate polygons in the same shapefile (Step 3 above).

  • R
  • Python
Show the code
# create a hash ID for each polygon using its geometry
shp_adm3 <- shp_adm3 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, adm1, adm2, adm3, geometry_hash)

shp_adm2 <- shp_adm2 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, adm1, adm2, geometry_hash)

shp_adm1 <- shp_adm1 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, adm1, geometry_hash)

shp_adm0 <- shp_adm0 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, geometry_hash)

# Check uniqueness of each shape hash ID
cli::cli_h2("Checking hash ID uniqueness")

adm0_n <- nrow(shp_adm0)
adm0_unique <- dplyr::n_distinct(shp_adm0$geometry_hash)
cli::cli_alert_info("adm0: {adm0_unique} unique hashes across {adm0_n} rows")

adm1_n <- nrow(shp_adm1)
adm1_unique <- dplyr::n_distinct(shp_adm1$geometry_hash)
cli::cli_alert_info("adm1: {adm1_unique} unique hashes across {adm1_n} rows")

adm2_n <- nrow(shp_adm2)
adm2_unique <- dplyr::n_distinct(shp_adm2$geometry_hash)
cli::cli_alert_info("adm2: {adm2_unique} unique hashes across {adm2_n} rows")

adm3_n <- nrow(shp_adm3)
adm3_unique <- dplyr::n_distinct(shp_adm3$geometry_hash)
cli::cli_alert_info("adm3: {adm3_unique} unique hashes across {adm3_n} rows")

# check output
head(shp_adm3)
Output
── Checking hash ID uniqueness ──
ℹ adm0: 1 unique hashes across 1 rows
ℹ adm1: 5 unique hashes across 5 rows
ℹ adm2: 16 unique hashes across 16 rows
ℹ adm3: 208 unique hashes across 208 rows
Simple feature collection with 6 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -11.02524 ymin: 7.758168 xmax: -10.27056 ymax: 8.505881
Geodetic CRS:  WGS 84
# A tibble: 6 × 6
  adm0         adm1    adm2     adm3     geometry_hash                  geometry
  <chr>        <chr>   <chr>    <chr>    <chr>                <MULTIPOLYGON [°]>
1 SIERRA LEONE EASTERN KAILAHUN DEA      bee1a2ea      (((-10.66064 8.029778, -…
2 SIERRA LEONE EASTERN KAILAHUN JAHN     c54ea6d9      (((-10.90291 8.08022, -1…
3 SIERRA LEONE EASTERN KAILAHUN JAWIE    d2a4880e      (((-10.8085 8.024512, -1…
4 SIERRA LEONE EASTERN KAILAHUN KISSI K… ec4698b1      (((-10.35988 8.464595, -…
5 SIERRA LEONE EASTERN KAILAHUN KISSI T… 3c531939      (((-10.31537 8.496933, -…
6 SIERRA LEONE EASTERN KAILAHUN KISSI T… fca31da6      (((-10.27296 8.441019, -…

To adapt the code:

  • Lines 2, 8, 14, and 20: Replace shp_adm3, shp_adm2, etc. with your shapefile objects.
  • Ensure each layer has a geometry column of class sfc.
  • Lines 6, 12, 18, and 24: The field names (adm0, adm1, etc.) can be adjusted to match your naming scheme.

We now have clean shapefiles for each admin level (adm0–adm3), each tagged with a geometry-based hash ID.
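The hash-ID idea is not tied to sntutils: any stable digest of a geometry's binary (WKB) representation works. A minimal Python sketch using hashlib from the standard library in place of xxhash32 (the WKB byte strings below are hypothetical placeholders, not real polygons):

```python
import hashlib

def geometry_hash(wkb: bytes) -> str:
    """Return a short, stable ID derived from a geometry's WKB bytes."""
    return hashlib.sha256(wkb).hexdigest()[:8]

# two copies of the same (toy) geometry hash identically;
# any change to the bytes (e.g. a moved vertex) changes the ID
geom_a = b"\x01\x03\x00\x00\x00 toy polygon"   # placeholder WKB bytes
geom_b = b"\x01\x03\x00\x00\x00 toy polygon"   # identical copy
geom_c = b"\x01\x03\x00\x00\x00 toy polygon!"  # modified geometry
```

This is exactly why the hash can flag a boundary that changed shape between shapefile versions even when its name did not.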

Step 7: Saving vector data

After cleaning and transforming spatial data, it is important to save it in a format that preserves the geometry, structure, and CRS. R supports multiple vector formats, and you do not need to use all of them; simply choose the formats that fit your workflow and how you plan to reuse or share the data.

Below are the main options, with notes on when and how to use them. Use the one(s) that work best for your team and context.

  • R
  • Python
Show the code
# save adm0
saveRDS(
  shp_adm0,
  here::here(
    spat_path, "processed", "sle_spatial_adm0_2021.rds"
  )
)

# save adm1
saveRDS(
  shp_adm1,
  here::here(
    spat_path, "processed", "sle_spatial_adm1_2021.rds"
  )
)

# save adm2
saveRDS(
  shp_adm2,
  here::here(
    spat_path, "processed", "sle_spatial_adm2_2021.rds"
  )
)

# save adm3
saveRDS(
  shp_adm3,
  here::here(
    spat_path, "processed", "sle_spatial_adm3_2021.rds"
  )
)
Deleting source `/Users/mohamedyusuf/ahadi-analytics/code/GitHub/snt-code-library/english/data_r/shapefiles/sle_spatial_adm3_2021.geojson' using driver `GeoJSON'
Writing layer `sle_spatial_adm3_2021' to data source 
  `/Users/mohamedyusuf/ahadi-analytics/code/GitHub/snt-code-library/english/data_r/shapefiles/sle_spatial_adm3_2021.geojson' using driver `GeoJSON'
Writing 208 features with 5 fields and geometry type Multi Polygon.

To adapt the code:

The examples below show how to save spatial layers using both saveRDS() and sf::st_write(). Choose the format that best fits your workflow, whether you are storing layers together or exporting them individually.

If your spatial object was saved using saveRDS(), you can load it with readRDS() and convert it to an sf object:

 readRDS("data/rds/cleaned_adm3.rds") |> sf::st_as_sf()

GeoJSON (.geojson): a plain-text format that is easy to share and inspect:

sf::st_write(
  shp_adm3,                                       # object to write
  here::here("folder", "boundary.geojson"),       # output file path
  delete_dsn = TRUE                               # overwrite file if it exists
)

GeoPackage (.gpkg): a single-file SQLite-based container that can hold multiple layers:

sf::st_write(
  shp_adm3,                                       # object to write
  here::here("folder", "boundary.gpkg"),          # output file path
  layer = "adm3",                                 # layer name inside gpkg
  delete_layer = TRUE                             # overwrite layer if it exists
)

File Geodatabase (.gdb): Esri's geodatabase format (note that the FileGDB driver requires a GDAL build with Esri's File Geodatabase SDK):

sf::st_write(
  shp_adm3,                                       # object to write
  here::here("folder", "boundary.gdb"),           # output geodatabase path
  layer = "adm3",                                 # layer name inside gdb
  driver = "FileGDB",                             # use FileGDB driver
  delete_layer = TRUE                             # overwrite layer if it exists
)

Lines 2–4, 7–10, 13–16, and 19–22: Adapt the file paths and file names to your own folder structure and country-specific data. Not all countries use an adm3 level, so the number of layers you save will vary by context.
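
When scripting exports, the output driver follows from the file extension. A small hypothetical helper (names are illustrative; the mapping mirrors the three formats shown above, using their GDAL driver names):

```python
from pathlib import Path

# hypothetical extension-to-GDAL-driver mapping for the formats above
DRIVERS = {
    ".geojson": "GeoJSON",
    ".gpkg": "GPKG",
    ".gdb": "FileGDB",
}

def driver_for(path: str) -> str:
    """Look up the GDAL driver name implied by a file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in DRIVERS:
        raise ValueError(f"no driver configured for '{suffix}'")
    return DRIVERS[suffix]
```

A helper like this keeps batch-export scripts from silently writing a layer with the wrong driver.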

Summary

This section provides a practical foundation for managing shapefile data in the SNT workflow. It walks through the key steps needed to prepare, validate, and customize boundary layers. It offers solutions for resolving geometry issues, handling duplicates, aligning CRS, and constructing flexible administrative layers tailored to program needs. The page also introduces geometry hashing to support version tracking and shows how to export cleaned spatial data for analysis and mapping. These steps ensure that all spatial operations in SNT are built on a stable and reproducible geographic base.

Analysts can return to specific steps as needed throughout their work, making this page a long-term reference for consistent spatial data preparation across SNT projects.

Full code

Find the full code script for managing and customizing shapefiles below.

  • R
  • Python
Show full code
################################################################################
############# ~ Shapefile management and customization full code ~ #############
################################################################################

### Step 1: Install and load packages ------------------------------------------

# Install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Load required packages using pacman
pacman::p_load(
  sf,         # handling shapefile data
  dplyr,      # data manipulation
  ggplot2,    # plotting
  here,       # file path management
  stringr,    # string manipulation (e.g., str_remove)
  cli,        # clean logging and CLI-style messages
  devtools    # for package management
)

# sntutils has a number of useful helper functions for
# data management and cleaning
if (!requireNamespace("sntutils", quietly = TRUE)) {
  devtools::install_github("ahadi-analytics/sntutils")
}

### Step 2: Load and prepare shapefiles ----------------------------------------

#### Step 2.1: Load and examine data -------------------------------------------

# set up spatial path
spat_path <- here::here("1.1_foundational", "1.1a_administrative_boundaries")

# import shapefile
shp_adm3 <- sf::read_sf(
  here::here(spat_path, "raw", "Chiefdom2021.shp")
) |>
  dplyr::mutate(adm0 = "SIERRA LEONE") |>
  dplyr::select(
    # select relevant columns and rename to standard nomenclature
    adm0,
    adm1 = FIRST_REGI,
    adm2 = FIRST_DNAM,
    adm3 = FIRST_CHIE
  ) |>
  # ensure output remains a valid sf object for downstream usage
  sf::st_sf()

# check output by mapping
plot(shp_adm3)

#### Step 2.2: Check and transform CRS -----------------------------------------

cli::cli_h2("Check and assign original CRS for shp_adm3")
shp_adm3 <- sf::st_set_crs(shp_adm3, 4326)
cli::cli_alert_info("Original CRS: {sf::st_crs(shp_adm3)$input}")

# Transform to new CRS (e.g., Equal Earth - EPSG:6933)
shp_adm3_diff_crs <- sf::st_transform(shp_adm3, 6933)

cli::cli_h2("Check CRS after transformation")
cli::cli_alert_info("New CRS: {sf::st_crs(shp_adm3_diff_crs)$input}")

### Step 3: Geometry validation ------------------------------------------------

#### Step 3.1: Check and correct geometry validity -----------------------------

# check geometry validity
shp_adm3_inv <- shp_adm3 |>
  dplyr::mutate(
    valid = sf::st_is_valid(geometry),
    invalid_reason = sf::st_is_valid(geometry, reason = TRUE),
    invalid_reason = stringr::str_remove(invalid_reason, "\\[.*")
  )

# get list of invalid adm3s
invalid_df <- shp_adm3_inv |>
  dplyr::filter(!valid) |>
  sf::st_drop_geometry() |>
  dplyr::select(adm0, adm1, adm2, adm3, invalid_reason)

# view invalid rows (if any)
invalid_df

# fix known geometry issues
# (e.g. self-intersections, unclosed rings)
shp_adm3 <- shp_adm3 |>
  sf::st_make_valid()

# re-check and report remaining invalid geometries
n_invalid <- shp_adm3 |>
  dplyr::mutate(valid = sf::st_is_valid(geometry)) |>
  dplyr::filter(!valid) |>
  nrow()

cli::cli_alert_info("Remaining invalid geometries: {n_invalid}")

#### Step 3.2: Check for duplicate polygons ------------------------------------

# Simulate duplicates for demonstration purposes

# Pick one row to duplicate exactly (full row duplicate)
dup_full <- shp_adm3 |> dplyr::slice(1:2)

# Pick another row to duplicate geometry only (geometry-only duplicate)
dup_geom <- shp_adm3 |>
  dplyr::slice(10) |>
  dplyr::mutate(adm3 = "DUPLICATE LABEL")

# Combine with the original shapefile
shp_adm3 <- shp_adm3 |>
  dplyr::bind_rows(dup_full, dup_geom)

# check for duplicate geometries
shp_adm3_dup <- shp_adm3 |>
  dplyr::mutate(dup_geom = duplicated(sf::st_as_binary(geometry))) |>
  dplyr::filter(dup_geom == TRUE)

cli::cli_alert_info("Number of duplicated geometries: {nrow(shp_adm3_dup)}")

# Check for full row duplicates
dups_full <- shp_adm3 |>
  dplyr::group_by(across(everything())) |>
  dplyr::filter(dplyr::n() > 1)

# Check for geometry-only duplicates
dups_geom <- shp_adm3 |>
  dplyr::mutate(geom_hash = sntutils::vdigest(geometry)) |>
  dplyr::add_count(geom_hash, name = "n_geom") |>
  dplyr::filter(n_geom > 1)

# View results
cli::cli_h1("Checking for full row duplicates")
dups_full |> sf::st_drop_geometry()

cli::cli_h1("Checking for geometry-only duplicates")
dups_geom |> sf::st_drop_geometry()

#### Step 3.3: Resolve any duplicated polygons ---------------------------------

# remove full row duplicates
shp_adm3 <- shp_adm3 |>
  dplyr::distinct()

# if needed, remove known geometry-only duplicates by filtering on labels
# drop intentionally duplicated label
shp_adm3 <- shp_adm3 |>
  dplyr::filter(adm3 != "DUPLICATE LABEL")

# check again after cleaning
# check for geometry-only duplicates in shp_adm3
shp_adm3_dup <- shp_adm3 |>
  dplyr::mutate(dup_geom = duplicated(sf::st_as_binary(geometry))) |>
  dplyr::filter(dup_geom == TRUE)

cli::cli_alert_info(
  "Number of duplicated geometries in shp_adm3: {nrow(shp_adm3_dup)}"
)

### Step 4: Aggregate or reconstruct higher-level administrative boundaries ----

# create adm0-level shapefile by
# grouping over the country name
shp_adm0 <- shp_adm3 |>
  dplyr::group_by(adm0) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# create adm1-level shapefile by
# grouping over the region (e.g., province)
shp_adm1 <- shp_adm3 |>
  dplyr::group_by(adm0, adm1) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# create adm2-level shapefile (e.g., districts)
shp_adm2 <- shp_adm3 |>
  dplyr::group_by(adm0, adm1, adm2) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# Check to see the operations worked
# set up a 2x2 plotting area
graphics::par(mfrow = c(2, 2))

# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)

# Plot each administrative level
plot(shp_adm3$geometry, col = plot_col, main = "Original adm3")
plot(shp_adm2$geometry, col = plot_col, main = "Aggregated adm2 (from adm3)")
plot(shp_adm1$geometry, col = plot_col, main = "Aggregated adm1 (from adm3)")
plot(shp_adm0$geometry, col = plot_col, main = "Aggregated adm0 (from adm3)")

# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))

### Step 5: Shapefile customization --------------------------------------------

#### Step 5.1: Mixed-level shapefiles ------------------------------------------

# remove target adm2 unit from adm2 shapefile
target_adm2 <- "WESTERN URBAN"

adm2_trimmed <- shp_adm2 |>
  dplyr::filter(adm2 != target_adm2)

# get the equivalent adm3 units
adm3_subset <- shp_adm3 |>
  dplyr::filter(adm2 == target_adm2)

# optional: add a column to indicate source level
adm2_trimmed <- adm2_trimmed |> dplyr::mutate(source = "adm2")
adm3_subset <- adm3_subset |> dplyr::mutate(source = "adm3")

# combine them into a single shapefile
shp_mixed <- dplyr::bind_rows(adm2_trimmed, adm3_subset)

# Check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))

# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)

# Plot WESTERN to show the finer changes
shp_adm2 |>
  dplyr::filter(adm1 == "WESTERN") |>
  sf::st_geometry() |>
  plot(col = plot_col, main = "Original adm2 (zoomed on WESTERN region)")

shp_mixed |>
  dplyr::filter(adm1 == "WESTERN") |>
  sf::st_geometry() |>
  plot(col = plot_col, main = "Mixed adm2 (zoomed on WESTERN region)")

# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))

#### Step 5.2: Merging multiple polygons into a single unit --------------------

# define the adm2 units to combine
combine_names <- c("BONTHE", "MOYAMBA", "PUJEHUN")

# create a new combined unit
combined <- shp_adm2 |>
  dplyr::filter(adm2 %in% combine_names) |>
  dplyr::summarise(
    adm3 = paste(combine_names, collapse = " ~ "),
    geometry = sf::st_union(geometry),
    .groups = "drop"
  )

# drop the originals and bind the new unit
shp_adm2_combined <- shp_adm2 |>
  dplyr::filter(!adm2 %in% combine_names) |>
  dplyr::bind_rows(combined)

# check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))

# set colors
plot_col <- sf::sf.colors(10, categorical = TRUE)

# Plot each administrative level
plot(shp_adm2$geometry, col = plot_col, main = "Original adm2")
plot(
  shp_adm2_combined$geometry,
  col = plot_col,
  main = "Combined adm2: Bonthe + Moyamba + Pujehun"
)

# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))

### Step 6: Create geographic IDs using hashes ---------------------------------

# create a hash ID for each polygon using its geometry
shp_adm3 <- shp_adm3 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, adm1, adm2, adm3, geometry_hash)

shp_adm2 <- shp_adm2 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, adm1, adm2, geometry_hash)

shp_adm1 <- shp_adm1 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, adm1, geometry_hash)

shp_adm0 <- shp_adm0 |>
  dplyr::mutate(
    geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
  ) |>
  dplyr::select(adm0, geometry_hash)

# Check uniqueness of each shape hash ID
cli::cli_h2("Checking hash ID uniqueness")

adm0_n <- nrow(shp_adm0)
adm0_unique <- dplyr::n_distinct(shp_adm0$geometry_hash)
cli::cli_alert_info("adm0: {adm0_unique} unique hashes across {adm0_n} rows")

adm1_n <- nrow(shp_adm1)
adm1_unique <- dplyr::n_distinct(shp_adm1$geometry_hash)
cli::cli_alert_info("adm1: {adm1_unique} unique hashes across {adm1_n} rows")

adm2_n <- nrow(shp_adm2)
adm2_unique <- dplyr::n_distinct(shp_adm2$geometry_hash)
cli::cli_alert_info("adm2: {adm2_unique} unique hashes across {adm2_n} rows")

adm3_n <- nrow(shp_adm3)
adm3_unique <- dplyr::n_distinct(shp_adm3$geometry_hash)
cli::cli_alert_info("adm3: {adm3_unique} unique hashes across {adm3_n} rows")

# check output
head(shp_adm3)

### Step 7: Saving vector data -------------------------------------------------

# save adm0
saveRDS(
  shp_adm0,
  here::here(
    spat_path, "processed", "sle_spatial_adm0_2021.rds"
  )
)

# save adm1
saveRDS(
  shp_adm1,
  here::here(
    spat_path, "processed", "sle_spatial_adm1_2021.rds"
  )
)

# save adm2
saveRDS(
  shp_adm2,
  here::here(
    spat_path, "processed", "sle_spatial_adm2_2021.rds"
  )
)

# save adm3
saveRDS(
  shp_adm3,
  here::here(
    spat_path, "processed", "sle_spatial_adm3_2021.rds"
  )
)
 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved