# Install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# Load required packages using pacman
pacman::p_load(
sf, # handling shapefile data
dplyr, # data manipulation
ggplot2, # plotting
here, # file path management
stringr, # string manipulation (e.g., str_remove)
cli, # clean logging and CLI-style messages
devtools # for package management
)
# sntutils has a number of useful helper functions for
# data management and cleaning
if (!requireNamespace("sntutils", quietly = TRUE)) {
devtools::install_github("ahadi-analytics/sntutils")
}
Shapefile management and customization
Intermediate
Overview
In the SNT context, having an official, accurate, and up-to-date shapefile is essential. These boundaries determine the scale at which analysts aggregate data, calculate indicators, apply targeting criteria, and visualize outputs. Misaligned or incomplete shapefiles can introduce serious errors, such as dropped areas, misclassified units, or misleading maps, which compromise both the validity of the analysis and its usefulness for decision-making.
For background information on shapefiles and links to all the spatial data content in the SNT code library, including rasters and point data, please visit Spatial data overview.
Proper shapefile management, including attribute cleaning, name harmonization, geometry validation, and display simplification, is essential for spatial coherence, reproducibility, and alignment with SNT goals. A validated boundary layer underpins all joins, analyses, and outputs in the SNT workflow.
This page introduces the core steps for managing and customizing shapefiles for use in the SNT workflow. The Sierra Leone shapefile prepared here, along with the visualization concepts and spatial knowledge introduced, is used throughout the remainder of the SNT code library to support all analysis and mapping steps.
It also serves as a foundational reference for all spatial operations in the SNT Code Library. Subsequent analysis pages, including those that download and process raster data such as population, climate, and modeled estimates, rely heavily on the shapefile management, cleaning, and troubleshooting methods presented here. The techniques for geometry validation, coordinate reference system alignment, boundary aggregation, and duplicate handling are essential prerequisites for successful spatial joins and accurate geographic analysis throughout the SNT workflow.
It is worth ensuring early on that your shapefile passes these validation checks, to prevent issues down the line. Keep this page bookmarked as a reference guide to return to when encountering spatial data issues in later analysis steps.
This page covers how to:
- Import, inspect, and prepare shapefiles and other spatial vector formats
- Validate and clean geometries to ensure accurate joins and maps
- Aggregate or customize administrative boundaries for flexible analysis
- Generate geometry hashes to track boundary changes
- Save cleaned shapefiles for later use
Considerations for Managing and Customizing Shapefiles
Before a shapefile is used for any purpose, whether for joining data, visualizing maps, aggregating indicators, or running spatial analysis, it must be clean, valid, and properly structured. This section outlines key challenges and best practices in shapefile preparation that guide the reliability of all downstream steps in the SNT workflow. These steps form the data foundation on which reliable SNT analysis is built.
Valid and Complete Geometries
Shapefiles must be geometrically valid. Possible problems include:
- spatial artifacts and overlaps from poor digitization
- empty or zero-area polygons
- multipart geometries or holes that affect containment logic.
Invalid geometries can silently break joins or misplace features. In SNT workflows, this may result in missing areas in coverage maps or unmatched clusters in analysis of survey data. It is always important to identify invalid geometries, flag them, and correct them where possible.
Ensuring Unique Polygons
Each unit in your shapefile should represent a single, distinct geographic area. Duplicate polygons, caused by digitization errors or mistakenly repeated records, can lead to incorrect joins, double-counting in aggregations, and confusion in mapping.
Why this matters: duplicate features with the same name or ID can:
- Distort summary statistics
- Lead to incorrect matches when joining multiple shapefiles
- Create misleading visualizations or maps
Possible issues to resolve:
- Duplicate geometries (polygons or other geometry types) with identical or repeated IDs
- Multiple records representing the same unit
Example: If two features both have the name “Bumpe Ngao” and cover the same area, population estimates may be counted twice during zonal extraction or incidence calculations. One of the duplicates should be removed before using the shapefile.
You can use geospatial tools to detect and resolve duplicates—such as flagging repeated names or IDs, comparing geometries for equality, or dissolving features by administrative codes. We will demonstrate several examples in the Step-by-Step section below.
Coordinate Reference Systems (CRS)
For spatial joins to work correctly, both spatial data files must share the same CRS. A mismatch can cause points to fall outside their true polygons, even if coordinates are correct. Always confirm that both datasets use a common, appropriate CRS as mismatches can lead to misalignment and incorrect results. Typically WGS84 (EPSG:4326) is used. For more information on CRS, please visit Spatial data overview.
Aggregating or Reconstructing Admin Levels
Before starting any analysis, analysts should request all official shapefiles at the required unit of analysis and at all higher administrative levels. For example, if the chosen unit is admin2, you should also obtain the official admin1 shapefile. This is not only important for harmonizing data across levels, but also essential for producing SNT maps that include national and regional context and are easy to interpret.
However, in practice, you may find that the shapefile for your intended unit of analysis is missing or incomplete. For example, you might have data at admin2 but only a complete shapefile for admin3. In such cases, you can aggregate polygons from a more granular shapefile using a shared identifier to reconstruct the higher-level boundaries.
- Admin3 polygons can be grouped and dissolved by a common admin2 name or code to recreate admin2 boundaries. The same approach applies to admin1 boundaries.
- This technique is used in SNT when reporting at broader levels (e.g., provinces or regions) or when visualizing national maps with consistent resolution, if dedicated shapefiles at the larger admin unit are not available.
Example: In Sierra Leone, if only admin3 boundaries are available, you can group admin3 units by their district name and merge their geometries. This reconstructs an admin2-level shapefile that matches the data and enables accurate mapping and aggregation.
This process ensures that your shapefile structure aligns with the level of your indicators and supports clean joins, visualizations, and reporting across administrative tiers.
Partial Aggregation for Flexible Resolution
In some cases, it may be beneficial to represent different parts of the country at different administrative levels within the same shapefile. For example, an SNT analysis might require admin3 detail for elimination areas, while the rest of the country can be visualized and analyzed at admin2.
This strategy is particularly useful when interventions or programmatic decisions require more granularity in some areas than others. It provides a practical way to balance detail and usability in maps and analysis while ensuring alignment with country-specific priorities and operational needs. This approach allows you to retain detailed boundaries in certain areas while using larger-unit boundaries elsewhere.
In some SNT exercises, countries choose to use different administrative levels in different regions, for instance, admin3 for urban areas and admin2 for rural zones. This reflects operational realities, data availability, or planning needs.
To support this hybrid setup, you can clip each shapefile to its relevant region and then merge them into a single boundary layer. With proper alignment of geometries and attributes, this structure allows for flexible, context-sensitive spatial analysis. The techniques needed to do this are covered on this page.
Tracking Boundary Changes
Administrative boundaries can shift over time, through redistricting, renaming, or reclassification. In SNT, it is important to detect these changes to ensure geographic consistency for time-series or trend analysis, and to make sure historical data is mapped and interpreted correctly.
While visualizing and comparing different versions of a shapefile by eye is a simple way to look for visible differences, it is also possible to detect changes more systematically and rigorously. One way to do this is by generating a geometry hash, a unique identifier based on a polygon’s coordinates. If a shape changes, its hash will change. If only the name or code changes but the geometry stays the same, the hash remains stable, making it easier to detect renames or confirm spatial continuity.
Example: If Kambia district’s geometry is unchanged but renamed across versions, its hash stays the same. If the shape is redrawn, the hash will differ, flagging it for review.
This approach supports version control, longitudinal consistency, and reproducible spatial workflows. Methods for computing and comparing hashes are covered later in this section.
Step-by-Step
In this section, we guide you through the essential steps to load, manage, troubleshoot, and customize shapefile data.
The example uses administrative boundary shapefiles from Sierra Leone, with focus on chiefdom (adm3) and district (adm2) levels. The principles can be applied to shapefiles from any country.
To skip the step-by-step explanation, jump to the full code at the end of this page.
Step 1: Install and load packages
First, install and/or load the necessary packages for handling spatial data, data manipulation, and visualization. These libraries provide essential functions for working with spatial data.
The sf package is particularly important as it implements simple features standards for handling geographic vector data in R.
To adapt the code:
- Do not modify anything in the code above
Step 2: Load and Prepare Shapefiles
Step 2.1: Load and Examine Data
Import the data and examine its structure to identify the appropriate join keys. This step ensures you understand the data before attempting to merge it with other datasets.
We begin by loading the Sierra Leone shapefile and selecting the relevant administrative levels for analysis. A new column for adm0 (country name) is added, and admin unit fields are renamed to standard adm1, adm2, and adm3 labels for consistency across the SNT workflow.
# set up spatial path
spat_path <- here::here("1.1_foundational", "1.1a_administrative_boundaries")
# import shapefile
shp_adm3 <- sf::read_sf(
here::here(spat_path, "raw", "Chiefdom2021.shp")
) |>
dplyr::mutate(adm0 = "SIERRA LEONE") |>
dplyr::select(
# select relevant columns and rename to standard nomenclature
adm0,
adm1 = FIRST_REGI,
adm2 = FIRST_DNAM,
adm3 = FIRST_CHIE
) |>
# ensure output remains a valid sf object for downstream usage
sf::st_sf()
# check output by mapping
plot(shp_adm3)
To adapt the code:
The code below shows how you can modify the loading process when working with other spatial vector formats. All of these examples use sf::read_sf(), which supports multiple formats via GDAL.
GeoJSON: If your boundary file is in GeoJSON format, simply update the file extension
GeoPackage (.gpkg): For GeoPackage files, you must specify both the file and the layer name:
ℹ️ Use sf::st_layers("boundary.gpkg") to list all available layers in the file.
File Geodatabase (.gdb): For Geodatabase files, you must specify both the file and the layer name:
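The original import snippets for these formats are not shown above; the sketches below illustrate how each can be read with sf::read_sf(), using placeholder file and layer names (boundary.geojson, boundary.gpkg, boundary.gdb, admin_layer) that you would replace with your own.

```r
# GeoJSON: same call as for a shapefile, just a different extension
shp_adm3 <- sf::read_sf(here::here(spat_path, "raw", "boundary.geojson"))

# GeoPackage: specify both the file and the layer name
shp_adm3 <- sf::read_sf(
  here::here(spat_path, "raw", "boundary.gpkg"),
  layer = "admin_layer"
)

# File Geodatabase: also requires a layer name
shp_adm3 <- sf::read_sf(
  here::here(spat_path, "raw", "boundary.gdb"),
  layer = "admin_layer"
)
```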
Once imported, the rest of the processing pipeline (mutate(), select(), st_set_crs(), etc.) remains the same.
Additional adaptations:
- Line 2: Change the here::here() path to match your folder structure.
- Line 5: If you are working with a shapefile at a different admin unit level, you may want to change the variable name for the spatial object away from shp_adm3. For example, if working at adm2, shp_adm2 would be more appropriate.
- Lines 11–14: Update the administrative column names (adm0, adm1, adm2, adm3) to reflect your dataset. For example, if your shapefile does not contain an adm3 level, remove the adm3 line.
Step 2.2: Check and Transform CRS
Once loaded, it is important to ensure that your shapefile is in the appropriate coordinate reference system (CRS).
Below, we check the current CRS of the shapefile and, if needed, transform it to a consistent geographic reference system, ensuring all layers align correctly and remain compatible with other spatial datasets we might use.
The Sierra Leone shapefile is already in WGS84 (EPSG:4326), so normally we would not need to change the CRS. However, for demonstration purposes we provide an example of how to transform the CRS to an equal-area projection (EPSG:6933).
cli::cli_h2("Check and assign original CRS for shp_adm3")
shp_adm3 <- sf::st_set_crs(shp_adm3, 4326)
cli::cli_alert_info("Original CRS: {sf::st_crs(shp_adm3)$input}")
# Transform to new CRS (e.g., the equal-area EPSG:6933)
shp_adm3_diff_crs <- sf::st_transform(shp_adm3, 6933)
cli::cli_h2("Check CRS after transformation")
cli::cli_alert_info("New CRS: {sf::st_crs(shp_adm3_diff_crs)$input}")
To adapt the code:
- Lines 1–2: Don’t change anything unless you’re using a different variable for your shapefile object (e.g., shp_adm2) or need a different CRS.
- Line 2: Use st_set_crs() only if the shapefile is missing a CRS but you’re certain of what it should be. If you need to reproject the geometry, always use st_transform().
- Line 6: To use another projection, just update the EPSG code in st_transform(), for example to 4326 for the WGS84 CRS usually used in SNT, if the original shapefile had a different CRS.
- If your shapefile was already in the correct CRS, lines 6 onwards can be removed.
Step 3: Geometry Validation
Step 3.1: Check and Correct Geometry Validity
Invalid geometries, like overlaps, gaps, or broken shapes, can cause joins to fail or maps to misalign. These issues can go unnoticed but ultimately lead to missing areas or incorrect outputs. Before doing any joins, always check that your shapefile is valid.
# check geometry validity
shp_adm3_inv <- shp_adm3 |>
dplyr::mutate(
valid = sf::st_is_valid(geometry),
invalid_reason = sf::st_is_valid(geometry, reason = TRUE),
invalid_reason = stringr::str_remove(invalid_reason, "\\[.*")
)
# get list of invalid adm3s
invalid_df <- shp_adm3_inv |>
dplyr::filter(!valid) |>
sf::st_drop_geometry() |>
dplyr::select(adm0, adm1, adm2, adm3, invalid_reason)
# view invalid rows (if any)
invalid_df
To adapt the code:
- Lines 2, 10, and 13: Replace shp_adm3 with the variable name for your spatial object, such as shp_adm2.
Two chiefdoms, Niawa and Wara Wara Yagala, have invalid geometries due to ring self-intersections, where polygon edges cross or loop back on themselves (see this guide for common invalid geometry reasons). These issues tend to arise from digitization errors or overly complex boundaries and can disrupt spatial joins and visualizations. It is important to identify and correct them before further processing.
We will retain invalid_df for reference, as it contains the chiefdoms with unresolved issues for potential manual review or future reference, then proceed to make all geometries in shp_adm3 valid.
# fix known geometry issues
# (e.g. self-intersections, unclosed rings)
shp_adm3 <- shp_adm3 |>
sf::st_make_valid()
# re-check and report remaining invalid geometries
n_invalid <- shp_adm3 |>
dplyr::mutate(valid = sf::st_is_valid(geometry)) |>
dplyr::filter(!valid) |>
nrow()
cli::cli_alert_info("Remaining invalid geometries: {n_invalid}")
To adapt the code:
- Lines 3 and 7: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.
All geometries are now valid and safe for future spatial operations.
Step 3.2: Check for Duplicate Polygons
Each polygon should represent a unique geographic unit. Duplicate shapes, whether exact copies or multiple records for the same area, can cause double-counting, join errors, and misleading summaries. Always check for and remove duplicates before proceeding. Below, we outline simple checks and fixes using standard approaches.
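The duplicate check itself follows the same pattern used later on this page: compare each polygon's binary (WKB) representation. A minimal sketch, assuming your object is named shp_adm3:

```r
# flag geometries whose WKB representation appears more than once
wkb <- sf::st_as_binary(sf::st_geometry(shp_adm3))
dup_flags <- duplicated(wkb) | duplicated(wkb, fromLast = TRUE)
# report the number of non-unique geometries
cli::cli_alert_info("Non-unique geometries found: {sum(dup_flags)}")
```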
To adapt the code:
- Lines 2 and 6: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.
The results show 5 non-unique geometries. It is important to investigate and resolve them before proceeding with analysis or integration. Use the check below to identify duplicates based on attributes and geometry. It is helpful to distinguish between:
- Full row duplicates: Same attributes and same geometry.
- Geometry-only duplicates: Identical shapes with differing labels.
# Check for full row duplicates
dups_full <- shp_adm3 |>
dplyr::group_by(dplyr::across(dplyr::everything())) |>
dplyr::filter(dplyr::n() > 1)
# Check for geometry-only duplicates
dups_geom <- shp_adm3 |>
dplyr::mutate(geom_hash = sntutils::vdigest(geometry)) |>
dplyr::add_count(geom_hash, name = "n_geom") |>
dplyr::filter(n_geom > 1)
# View results
cli::cli_h1("Checking for full row duplicates")
dups_full |> sf::st_drop_geometry()
cli::cli_h1("Checking for geometry-only duplicates")
dups_geom |> sf::st_drop_geometry()
To adapt the code:
- Lines 2, 7, 14, and 17: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.
- Ensure the geometry column is named geometry (as it is with sf::read_sf()).
In this example:
- DEA and JAHN each appear twice with identical attributes and geometry, indicating full row duplicates. These are likely accidental (e.g., created during merging or binding) and can be safely removed using a deduplication step appropriate to your software or language.
- MALEMA and DUPLICATE LABEL share the same geometry but have different names. This suggests a labeling or naming issue, not a problem with the shapefile itself. These cases may reflect an administrative convention or a mismatched join, rather than spatial error.
We use geometry hashing to assign a unique ID to each shape, making it easier to detect and compare changes across shapefile versions. See Step 6 for more detail.
Step 3.3: Resolve Any Duplicated Polygons
Next we will resolve the duplicated polygons in our shapefile. We will first remove the full-row duplicates. Then for geometry-only duplicates, we will drop one version. In this example, we will drop the polygon with adm3 name DUPLICATE LABEL. Finally, we will check again for geometry duplicates to make sure all duplicates are now resolved.
If the duplicates aren’t clearly redundant, or reflect more than just data entry issues, consult the SNT team before making any reconciliation decisions. Some cases may involve overlapping operational units, legacy naming conventions, or intentional duplicates used in planning that require contextual knowledge. Don’t drop anything unless the purpose of the duplication is understood.
Always document your cleaning steps clearly, so the SNT team is aware of any changes and can trace the process if needed.
# remove full row duplicates
shp_adm3 <- shp_adm3 |>
dplyr::distinct()
# if needed, remove known geometry-only duplicates by filtering on labels
# drop intentionally duplicated label
shp_adm3 <- shp_adm3 |>
dplyr::filter(adm3 != "DUPLICATE LABEL")
# Check again after cleaning
# check for geometry-only duplicates in shp_adm3
shp_adm3_dup <- shp_adm3 |>
dplyr::mutate(dup_geom = duplicated(sf::st_as_binary(geometry))) |>
dplyr::filter(dup_geom == TRUE)
cli::cli_alert_info(
"Number of duplicated geometries in shp_adm3: {nrow(shp_adm3_dup)}"
)
To adapt the code:
- Lines 2, 7, 12, and 16: Replace shp_adm3 with your relevant spatial object, such as shp_adm2.
- Line 8: Update the column used in filter() (e.g., adm3, facility_name, or district_name) to match your dataset’s naming convention.
- Line 2: Use dplyr::distinct() to drop exact duplicate rows (same attributes and geometry).
- Line 8: Use dplyr::filter() to remove known label-based duplicates by passing in the name of the label to be dropped (e.g., rows with DUPLICATE LABEL).
The results show that the operations worked, as the dataset no longer has full row or geometry-only duplicates. The dataset now contains only unique, valid geometries.
In some cases, shapefiles or geodatabases may include multiple versions of the same boundaries, each associated with metadata indicating when that boundary was in use (e.g., start_date, end_date).
Always check the attribute table and metadata for date-related fields. If your dataset includes historical versions, you may need to filter to the most recent boundaries, such as:
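A minimal sketch of such a filter, assuming hypothetical start_date/end_date columns in the attribute table, might keep only boundaries still in force:

```r
# keep boundary versions still in force
# (end_date is an example column name; adjust to your data)
shp_adm3_current <- shp_adm3 |>
  dplyr::filter(is.na(end_date))
```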
🔎 If you’re unsure how boundaries were versioned or labeled, consult with the SNT team or data owner before proceeding.
Step 4: Aggregate or Reconstruct Higher-Level Administrative Boundaries
In some cases, shapefiles may only be available at a lower administrative level (e.g., adm3). You can aggregate these adm3 units to construct adm2, adm1, or adm0 boundaries.
# create adm0-level shapefile by
# grouping over the country name
shp_adm0 <- shp_adm3 |>
dplyr::group_by(adm0) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# create adm1-level shapefile by
# grouping over the region (e.g., province)
shp_adm1 <- shp_adm3 |>
dplyr::group_by(adm0, adm1) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# create adm2-level shapefile (e.g., districts)
shp_adm2 <- shp_adm3 |>
dplyr::group_by(adm0, adm1, adm2) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# Check to see the operations worked
# set up a 2x2 plotting area
graphics::par(mfrow = c(2, 2))
# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)
# Plot each administrative level
plot(shp_adm3$geometry, col = plot_col, main = "Original adm3")
plot(shp_adm2$geometry, col = plot_col, main = "Aggregated adm2 (from adm3)")
plot(shp_adm1$geometry, col = plot_col, main = "Aggregated adm1 (from adm3)")
plot(shp_adm0$geometry, col = plot_col, main = "Aggregated adm0 (from adm3)")
# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
To adapt the code:
- Lines 3, 10, and 16: Replace shp_adm3 with your shapefile object.
- Lines 4, 11–12, and 17–18: Update the column names (adm0, adm1, adm2) to match those in your data.
- Lines 4, 11–12, and 17–18: Adjust the grouping level depending on which higher admin level you want to create.
The plots confirm that aggregation to adm0, adm1, and adm2 levels worked as expected. Each map shows progressively coarser boundaries, and the original adm3 structure is preserved.
Step 5: Shapefile Customization
Step 5.1: Mixed-Level Shapefiles
In some cases, you may want to combine different administrative levels within a single shapefile—for example, using adm2 for most of the country while substituting finer adm3 boundaries in selected areas. This approach is useful when certain districts require more granularity for analysis, planning, or visualization, but a full switch to adm3 isn’t necessary.
One use case for this approach is when finer geographic detail is needed only in specific parts of the country. For example, in Sierra Leone, you may want to use adm3 boundaries within Western Area Urban, which includes Freetown, a densely populated area, while retaining adm2 boundaries for the rest of the country. This allows you to maintain national coverage at a manageable resolution, while zooming in where more granularity is required.
# remove target adm2 unit from adm2 shapefile
target_adm2 <- "WESTERN URBAN"
adm2_trimmed <- shp_adm2 |>
dplyr::filter(adm2 != target_adm2)
# get the equivalent adm3 units
adm3_subset <- shp_adm3 |>
dplyr::filter(adm2 == target_adm2)
# optional: add a column to indicate source level
adm2_trimmed <- adm2_trimmed |> dplyr::mutate(source = "adm2")
adm3_subset <- adm3_subset |> dplyr::mutate(source = "adm3")
# combine them into a single shapefile
shp_mixed <- dplyr::bind_rows(adm2_trimmed, adm3_subset)
# Check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))
# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)
# Plot WESTERN to show the finer changes
shp_adm2 |>
dplyr::filter(adm1 == "WESTERN") |>
sf::st_geometry() |>
plot(col = plot_col, main = "Original adm2 (zoomed on WESTERN region)")
shp_mixed |>
dplyr::filter(adm1 == "WESTERN") |>
sf::st_geometry() |>
plot(col = plot_col, main = "Mixed adm2 (zoomed on WESTERN region)")
# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
To adapt the code:
- Line 2: Replace WESTERN URBAN with the name of the admin unit you want to swap.
- Lines 4–5, 8–9: Update adm2 and adm3 column names if your shapefile uses different labels.
- Repeat the entire code block as needed for all areas of interest.
The results show that the operation worked: in the WESTERN region, Western Rural remains as a single adm2 unit, while Western Urban has been successfully replaced by several finer adm3 units. This zoomed view shows differences not visible at the national scale. While shp_mixed includes all regions, this comparison confirms the targeted substitution.
Step 5.2: Merging Multiple Polygons into a Single Unit
In some cases, it may be necessary to merge two or more adjacent administrative units into a single geographic feature. This could be done for operational convenience, reporting simplicity, or when small neighboring units should be treated as one planning zone.
For example, we show how three southern districts in Sierra Leone (Bonthe, Moyamba, and Pujehun) can be combined into a single unit.
# define the adm2 units to combine
combine_names <- c("BONTHE", "MOYAMBA", "PUJEHUN")
# create a new combined unit
combined <- shp_adm2 |>
dplyr::filter(adm2 %in% combine_names) |>
dplyr::summarise(
adm3 = paste(combine_names, collapse = " ~ "),
geometry = sf::st_union(geometry),
.groups = "drop"
)
# drop the originals and bind the new unit
shp_adm2_combined <- shp_adm2 |>
dplyr::filter(!adm2 %in% combine_names) |>
dplyr::bind_rows(combined)
# check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))
# set colors
plot_col <- sf::sf.colors(10, categorical = TRUE)
# Plot each administrative level
plot(shp_adm2$geometry, col = plot_col, main = "Original adm2")
plot(
shp_adm2_combined$geometry,
col = plot_col,
main = "Combined adm2: Bonthe + Moyamba + Pujehun"
)
# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
To adapt the code:
- Line 2: Change the names in combine_names to the units you want to merge.
- Lines 6, 15: Update the adm2 field if your shapefile uses a different name (e.g., district_name).
- Line 8: Adjust the output name (adm3 = paste(...)) to reflect how you want the merged unit labeled, e.g., use Southern Cluster instead.
- Repeat the entire code block as needed for all areas of interest.
The plot shows that the selected districts have been successfully merged into a single polygon, forming a custom geographic unit.
Step 6: Create Geographic IDs Using Hashes
To track changes in boundary geometry over time or across shapefile versions, we generate a geometry-based hash ID for each polygon. This allows us to detect if a unit has changed spatially, even if its name remains the same.
This step is useful for verifying shapefile updates, aligning geographies across years, and making sure historical comparisons are done correctly. We can also use geometry hashes to check for duplicate polygons in the same shapefile (Step 3 above).
# create a hash ID for each polygon using its geometry
shp_adm3 <- shp_adm3 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, adm1, adm2, adm3, geometry_hash)
shp_adm2 <- shp_adm2 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, adm1, adm2, geometry_hash)
shp_adm1 <- shp_adm1 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, adm1, geometry_hash)
shp_adm0 <- shp_adm0 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, geometry_hash)
# Check uniqueness of each shape hash ID
cli::cli_h2("Checking hash ID uniqueness")
adm0_n <- nrow(shp_adm0)
adm0_unique <- dplyr::n_distinct(shp_adm0$geometry_hash)
cli::cli_alert_info("adm0: {adm0_unique} unique hashes across {adm0_n} rows")
adm1_n <- nrow(shp_adm1)
adm1_unique <- dplyr::n_distinct(shp_adm1$geometry_hash)
cli::cli_alert_info("adm1: {adm1_unique} unique hashes across {adm1_n} rows")
adm2_n <- nrow(shp_adm2)
adm2_unique <- dplyr::n_distinct(shp_adm2$geometry_hash)
cli::cli_alert_info("adm2: {adm2_unique} unique hashes across {adm2_n} rows")
adm3_n <- nrow(shp_adm3)
adm3_unique <- dplyr::n_distinct(shp_adm3$geometry_hash)
cli::cli_alert_info("adm3: {adm3_unique} unique hashes across {adm3_n} rows")
# check output
head(shp_adm3)
To adapt the code:
- Lines 2, 8, 14, and 20: Replace shp_adm3, shp_adm2, etc. with your shapefile objects.
- Ensure each layer has a geometry column of class sfc.
- Lines 6, 12, 18, and 24: The field names (adm0, adm1, etc.) can be adjusted to match your naming scheme.
We now have clean shapefiles for each admin level (adm0–adm3), each tagged with a geometry-based hash ID.
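To actually compare two shapefile versions using these IDs, the hash columns can be anti-joined. A minimal sketch, assuming hypothetical objects shp_v2021 and shp_v2024 that each carry a geometry_hash column:

```r
# units in the 2024 version whose geometry matches no 2021 geometry
changed_units <- dplyr::anti_join(
  sf::st_drop_geometry(shp_v2024),
  sf::st_drop_geometry(shp_v2021),
  by = "geometry_hash"
)
# any rows here were redrawn (or newly added) and should be reviewed
nrow(changed_units)
```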
Step 7: Saving Vector Data
After cleaning and transforming spatial data, it is important to save it in a format that preserves the geometry, structure, and CRS. R supports multiple vector formats, and you do not need to use all of them; simply choose the formats that fit your workflow and how you plan to reuse or share the data.
Below are the main options, with notes on when and how to use them. Use the one(s) that work best for your team and context.
# save adm0
saveRDS(
shp_adm0,
here::here(
spat_path, "processed", "sle_spatial_adm0_2021.rds"
)
)
# save adm1
saveRDS(
shp_adm1,
here::here(
spat_path, "processed", "sle_spatial_adm1_2021.rds"
)
)
# save adm2
saveRDS(
shp_adm2,
here::here(
spat_path, "processed", "sle_spatial_adm2_2021.rds"
)
)
# save adm3
saveRDS(
shp_adm3,
here::here(
spat_path, "processed", "sle_spatial_adm3_2021.rds"
)
)
To adapt the code:
The example below shows how to save shapefiles of multiple administrative levels using both RDS and st_write functions. Choose the format that best fits your workflow, whether you are storing layers together or exporting them individually.
If your spatial object was saved using saveRDS(), you can load it with readRDS() and convert it to an sf object:
GeoJSON (.geojson): For GeoJSON files:
GeoPackage (.gpkg): For GeoPackage files:
File Geodatabase (.gdb): For Geodatabase files:
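The export and re-import snippets are not reproduced above; the sketches below show one way to do this with readRDS() and sf::st_write(). File names are placeholders, and writing to a File Geodatabase depends on GDAL's OpenFileGDB write support (GDAL 3.6 or later), so treat that last call as an assumption about your local GDAL build.

```r
# load an RDS layer and ensure it is an sf object
shp_adm3 <- readRDS(
  here::here(spat_path, "processed", "sle_spatial_adm3_2021.rds")
) |>
  sf::st_as_sf()

# GeoJSON
sf::st_write(
  shp_adm3,
  here::here(spat_path, "processed", "sle_spatial_adm3_2021.geojson"),
  delete_dsn = TRUE
)

# GeoPackage: several layers can share one file
sf::st_write(
  shp_adm3,
  here::here(spat_path, "processed", "sle_spatial_2021.gpkg"),
  layer = "adm3",
  delete_layer = TRUE
)

# File Geodatabase (assumes GDAL >= 3.6 with OpenFileGDB write support)
sf::st_write(
  shp_adm3,
  here::here(spat_path, "processed", "sle_spatial_2021.gdb"),
  layer = "adm3",
  driver = "OpenFileGDB"
)
```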
- Lines 2–4, 7–10, 13–16, and 19–22: Be sure to adapt the file paths and column names (adm0, adm1, adm2, adm3) based on your own folder structure and country-specific data. Not all countries use an adm3 level, so this will vary depending on your context.
Summary
This section provides a practical foundation for managing shapefile data in the SNT workflow. It walks through the key steps needed to prepare, validate, and customize boundary layers. It offers solutions for resolving geometry issues, handling duplicates, aligning CRS, and constructing flexible administrative layers tailored to program needs. The page also introduces geometry hashing to support version tracking and shows how to export cleaned spatial data for analysis and mapping. These steps ensure that all spatial operations in SNT are built on a stable and reproducible geographic base.
Analysts can return to specific steps as needed throughout their work, making this page a long-term reference for consistent spatial data preparation across SNT projects.
Full code
Find the full code script for managing and customizing shapefiles below.
################################################################################
############# ~ Shapefile management and customization full code ~ #############
################################################################################
### Step 1: Install and load packages ------------------------------------------
# Install `pacman` if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")
}
# Load required packages using pacman
pacman::p_load(
sf, # handling shapefile data
dplyr, # data manipulation
ggplot2, # plotting
here, # file path management
stringr, # string manipulation (e.g., str_remove)
cli, # clean logging and CLI-style messages
devtools # for package management
)
# sntutils has a number of useful helper functions for
# data management and cleaning
if (!requireNamespace("sntutils", quietly = TRUE)) {
devtools::install_github("ahadi-analytics/sntutils")
}
### Step 2: Load and Prepare Shapefiles ----------------------------------------
#### Step 2.1: Load and Examine Data -------------------------------------------
# set up spatial path
spat_path <- here::here("1.1_foundational", "1.1a_administrative_boundaries")
# import shapefile
shp_adm3 <- sf::read_sf(
here::here(spat_path, "raw", "Chiefdom2021.shp")
) |>
dplyr::mutate(adm0 = "SIERRA LEONE") |>
dplyr::select(
# select relevant columns and rename to standard nomenclature
adm0,
adm1 = FIRST_REGI,
adm2 = FIRST_DNAM,
adm3 = FIRST_CHIE
) |>
# ensure output remains a valid sf object for downstream usage
sf::st_sf()
# check output by mapping
plot(shp_adm3)
### Step 2: Load and Prepare Shapefiles ----------------------------------------
#### Step 2.2: Check and Transform CRS -----------------------------------------
cli::cli_h2("Check and assign original CRS for shp_adm3")
shp_adm3 <- sf::st_set_crs(shp_adm3, 4326)
cli::cli_alert_info("Original CRS: {sf::st_crs(shp_adm3)$input}")
# Transform to an equal-area CRS
# (e.g., WGS 84 / NSIDC EASE-Grid 2.0 Global - EPSG:6933)
shp_adm3_diff_crs <- sf::st_transform(shp_adm3, 6933)
cli::cli_h2("Check CRS after transformation")
cli::cli_alert_info("New CRS: {sf::st_crs(shp_adm3_diff_crs)$input}")
### Step 3: Geometry Validation ------------------------------------------------
#### Step 3.1: Check and Correct Geometry Validity -----------------------------
# check geometry validity
shp_adm3_inv <- shp_adm3 |>
dplyr::mutate(
valid = sf::st_is_valid(geometry),
invalid_reason = sf::st_is_valid(geometry, reason = TRUE),
invalid_reason = stringr::str_remove(invalid_reason, "\\[.*")
)
# get list of invalid adm3s
invalid_df <- shp_adm3_inv |>
dplyr::filter(!valid) |>
sf::st_drop_geometry() |>
dplyr::select(adm0, adm1, adm2, adm3, invalid_reason)
# view invalid rows (if any)
invalid_df
### Step 3: Geometry Validation ------------------------------------------------
#### Step 3.1: Check and Correct Geometry Validity -----------------------------
# fix known geometry issues
# (e.g. self-intersections, unclosed rings)
shp_adm3 <- shp_adm3 |>
sf::st_make_valid()
# re-check and report remaining invalid geometries
n_invalid <- shp_adm3 |>
dplyr::mutate(valid = sf::st_is_valid(geometry)) |>
dplyr::filter(!valid) |>
nrow()
cli::cli_alert_info("Remaining invalid geometries: {n_invalid}")
### Step 3: Geometry Validation ------------------------------------------------
#### Step 3.2: Check for Duplicate Polygons ------------------------------------
# Simulate duplicates for demonstration purposes
# Take the first two rows to duplicate exactly (full row duplicates)
dup_full <- shp_adm3 |> dplyr::slice(1:2)
# Pick another row to duplicate geometry only (geometry-only duplicate)
dup_geom <- shp_adm3 |>
dplyr::slice(10) |>
dplyr::mutate(adm3 = "DUPLICATE LABEL")
# Combine with the original shapefile
shp_adm3 <- shp_adm3 |>
dplyr::bind_rows(dup_full, dup_geom)
### Step 3: Geometry Validation ------------------------------------------------
#### Step 3.2: Check for Duplicate Polygons ------------------------------------
# check for duplicate geometries
shp_adm3_dup <- shp_adm3 |>
dplyr::mutate(dup_geom = duplicated(sf::st_as_binary(geometry))) |>
dplyr::filter(dup_geom == TRUE)
cli::cli_alert_info("Number of duplicated geometries: {nrow(shp_adm3_dup)}")
### Step 3: Geometry Validation ------------------------------------------------
#### Step 3.2: Check for Duplicate Polygons ------------------------------------
# Check for full row duplicates
dups_full <- shp_adm3 |>
dplyr::group_by(dplyr::across(dplyr::everything())) |>
dplyr::filter(dplyr::n() > 1)
# Check for geometry-only duplicates
dups_geom <- shp_adm3 |>
dplyr::mutate(geom_hash = sntutils::vdigest(geometry)) |>
dplyr::add_count(geom_hash, name = "n_geom") |>
dplyr::filter(n_geom > 1)
# View results
cli::cli_h1("Checking for full row duplicates")
dups_full |> sf::st_drop_geometry()
cli::cli_h1("Checking for geometry-only duplicates")
dups_geom |> sf::st_drop_geometry()
### Step 3: Geometry Validation ------------------------------------------------
#### Step 3.3: Resolve Any Duplicated Polygons ---------------------------------
# remove full row duplicates
shp_adm3 <- shp_adm3 |>
dplyr::distinct()
# if needed, remove known geometry-only duplicates by filtering on labels
# drop intentionally duplicated label
shp_adm3 <- shp_adm3 |>
dplyr::filter(adm3 != "DUPLICATE LABEL")
# Check again after cleaning
# check for geometry-only duplicates in shp_adm3
shp_adm3_dup <- shp_adm3 |>
dplyr::mutate(dup_geom = duplicated(sf::st_as_binary(geometry))) |>
dplyr::filter(dup_geom == TRUE)
cli::cli_alert_info(
"Number of duplicated geometries in shp_adm3: {nrow(shp_adm3_dup)}"
)
### Step 4: Aggregate or Reconstruct Higher-Level Administrative Boundaries ----
# create adm0-level shapefile by
# grouping over the country name
shp_adm0 <- shp_adm3 |>
dplyr::group_by(adm0) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# create adm1-level shapefile by
# grouping over the region (e.g., province)
shp_adm1 <- shp_adm3 |>
dplyr::group_by(adm0, adm1) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# create adm2-level shapefile (e.g., districts)
shp_adm2 <- shp_adm3 |>
dplyr::group_by(adm0, adm1, adm2) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# Check to see the operations worked
# set up a 2x2 plotting area
graphics::par(mfrow = c(2, 2))
# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)
# Plot each administrative level
plot(shp_adm3$geometry, col = plot_col, main = "Original adm3")
plot(shp_adm2$geometry, col = plot_col, main = "Aggregated adm2 (from adm3)")
plot(shp_adm1$geometry, col = plot_col, main = "Aggregated adm1 (from adm3)")
plot(shp_adm0$geometry, col = plot_col, main = "Aggregated adm0 (from adm3)")
# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
### Step: 5 Shapefile Customization --------------------------------------------
#### Step 5.1: Mixed-Level Shapefiles ------------------------------------------
# remove target adm2 unit from adm2 shapefile
target_adm2 <- "WESTERN URBAN"
adm2_trimmed <- shp_adm2 |>
dplyr::filter(adm2 != target_adm2)
# get the equivalent adm3 units
adm3_subset <- shp_adm3 |>
dplyr::filter(adm2 == target_adm2)
# optional: add a column to indicate source level
adm2_trimmed <- adm2_trimmed |> dplyr::mutate(source = "adm2")
adm3_subset <- adm3_subset |> dplyr::mutate(source = "adm3")
# combine them into a single shapefile
shp_mixed <- dplyr::bind_rows(adm2_trimmed, adm3_subset)
# Check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))
# set colors
plot_col <- sf::sf.colors(208, categorical = TRUE)
# Plot WESTERN to show the finer changes
shp_adm2 |>
dplyr::filter(adm1 == "WESTERN") |>
sf::st_geometry() |>
plot(col = plot_col, main = "Original adm2 (zoomed on WESTERN region)")
shp_mixed |>
dplyr::filter(adm1 == "WESTERN") |>
sf::st_geometry() |>
plot(col = plot_col, main = "Mixed adm2 (zoomed on WESTERN region)")
# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
### Step: 5 Shapefile Customization --------------------------------------------
#### Step 5.2: Merging Multiple Polygons into a Single Unit --------------------
# define the adm2 units to combine
combine_names <- c("BONTHE", "MOYAMBA", "PUJEHUN")
# create a new combined unit
combined <- shp_adm2 |>
dplyr::filter(adm2 %in% combine_names) |>
dplyr::summarise(
adm2 = paste(combine_names, collapse = " ~ "),
geometry = sf::st_union(geometry),
.groups = "drop"
)
# drop the originals and bind the new unit
shp_adm2_combined <- shp_adm2 |>
dplyr::filter(!adm2 %in% combine_names) |>
dplyr::bind_rows(combined)
# check to see the operations worked
# set up a 1x2 plotting area
graphics::par(mfrow = c(1, 2))
# set colors
plot_col <- sf::sf.colors(10, categorical = TRUE)
# Plot each administrative level
plot(shp_adm2$geometry, col = plot_col, main = "Original adm2")
plot(
shp_adm2_combined$geometry,
col = plot_col,
main = "Combined adm2: Bonthe + Moyamba + Pujehun"
)
# reset plotting layout (optional)
graphics::par(mfrow = c(1, 1))
### Step 6: Create Geographic IDs Using Hashes ---------------------------------
# create a hash ID for each polygon using its geometry
shp_adm3 <- shp_adm3 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, adm1, adm2, adm3, geometry_hash)
shp_adm2 <- shp_adm2 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, adm1, adm2, geometry_hash)
shp_adm1 <- shp_adm1 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, adm1, geometry_hash)
shp_adm0 <- shp_adm0 |>
dplyr::mutate(
geometry_hash = sntutils::vdigest(geometry, algo = "xxhash32")
) |>
dplyr::select(adm0, geometry_hash)
# Check uniqueness of each shape hash ID
cli::cli_h2("Checking hash ID uniqueness")
adm0_n <- nrow(shp_adm0)
adm0_unique <- dplyr::n_distinct(shp_adm0$geometry_hash)
cli::cli_alert_info("adm0: {adm0_unique} unique hashes across {adm0_n} rows")
adm1_n <- nrow(shp_adm1)
adm1_unique <- dplyr::n_distinct(shp_adm1$geometry_hash)
cli::cli_alert_info("adm1: {adm1_unique} unique hashes across {adm1_n} rows")
adm2_n <- nrow(shp_adm2)
adm2_unique <- dplyr::n_distinct(shp_adm2$geometry_hash)
cli::cli_alert_info("adm2: {adm2_unique} unique hashes across {adm2_n} rows")
adm3_n <- nrow(shp_adm3)
adm3_unique <- dplyr::n_distinct(shp_adm3$geometry_hash)
cli::cli_alert_info("adm3: {adm3_unique} unique hashes across {adm3_n} rows")
# check output
head(shp_adm3)
### Step 7: Saving Vector Data -------------------------------------------------
# save adm0
saveRDS(
shp_adm0,
here::here(
spat_path, "processed", "sle_spatial_adm0_2021.rds"
)
)
# save adm1
saveRDS(
shp_adm1,
here::here(
spat_path, "processed", "sle_spatial_adm1_2021.rds"
)
)
# save adm2
saveRDS(
shp_adm2,
here::here(
spat_path, "processed", "sle_spatial_adm2_2021.rds"
)
)
# save adm3
saveRDS(
shp_adm3,
here::here(
spat_path, "processed", "sle_spatial_adm3_2021.rds"
)
)
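```r
# Optional: also export each layer with sf::st_write() so the boundaries
# open directly in GIS tools. This is a sketch; the GeoPackage filenames
# are assumptions - adapt them to your own structure.
sf::st_write(
  shp_adm0,
  here::here(spat_path, "processed", "sle_spatial_adm0_2021.gpkg"),
  delete_dsn = TRUE
)
# repeat for shp_adm1, shp_adm2, and shp_adm3 as needed
```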