# install or load relevant packages
pacman::p_load(
tidyverse, # core tidy tools (includes dplyr, tidyr, readr, etc.)
readxl, # read Excel files
fs, # handle file paths
purrr, # iterate over lists (map, walk, etc.)
rlang, # for tidy evaluation and error messaging
devtools # install packages from GitHub
)
National population data
Overview
In SNT, population data is foundational. It’s used to calculate indicators like incidence rates, assess population at risk, model burden, and guide intervention targeting. Without reliable and well-structured population data, it’s hard to prioritise effectively or assess coverage. Later in the SNT workflow, the cleaned population dataset—often derived from census sources—will be merged with routine data (e.g. DHIS2) and surveys (e.g. DHS) to support epidemiological stratification and subnational planning.
In many contexts, additional population groups, such as IDPs, refugees, and nomadic or migratory communities, may not be included in national datasets but have major implications for malaria burden and campaign planning. These groups should be considered early in the SNT process. Analysts are encouraged to raise any known gaps with the SNT team to ensure special populations are reflected in the final estimates and that coordination with relevant partners (e.g. WHO Emergencies, UNHCR) can happen in a timely way.
- Use only the population dataset approved by the SNT team
- Reshape official national data into tidy, analysis-ready format
- Harmonise with administrative fields (e.g. ADM3 names, codes)
- Aggregate to relevant admin levels (e.g. district, province)
- Validate population data by mapping it on a shapefile
Population Data Needs
Population data must support the SNT team’s ability to assess risk, stratify intervention needs, and guide subnational planning. This often requires estimates disaggregated by year, administrative level, and, where relevant, by specific population groups (e.g. under-fives, pregnant women). Alignment with the spatial units used in your analysis is essential.
In many contexts, additional special populations must also be considered:
- Internally Displaced Persons (IDPs)
- Refugees, both in and outside camps
- Migratory groups (e.g. mining communities, nomadic populations)
- Urban informal settlements
These groups are often missing or underrepresented in census data. Where relevant, the SNT team will coordinate with agencies such as WHO Emergencies, UNICEF, or UNHCR to source updated estimates. Additionally, population data for SNT must reflect all vulnerable groups relevant to malaria transmission and service delivery. This includes disaggregations by sex, age, geographic context (e.g., conflict zones), and even specific professions, where these are known to influence risk or access.
Such disaggregations are important throughout the transmission continuum but become especially critical in low-transmission or pre-elimination settings, where granular understanding of residual risk and service gaps is essential. If any of these population groups are relevant in your context, ensure they are discussed with the SNT team early so that data collection, estimation, and partner coordination can be adjusted accordingly.
Localised discussions may also be required, especially when IDP data is only reported at higher administrative levels (e.g. regions) and needs to be apportioned to districts. This has important implications for census reviews, malaria risk assessment, and campaign planning (e.g. ITN distribution).
Population data must be sourced from the official dataset designated by the SNT team. This is not an analyst decision.
- In most cases, this will be census-derived data shared by the NMP or national statistics office.
- Modelled data sources (e.g. WorldPop) should not be used unless explicitly approved.
- Special population data (e.g. IDPs, refugees) must be sourced through country partners and validated by the SNT team.
- Never extrapolate or project population counts independently.
- If disaggregation by sex, age, or other key vulnerability factors (e.g. conflict-affected populations, specific occupations) is relevant for the context, raise this early. The SNT team is responsible for determining what population groups must be included based on transmission setting and programmatic need.
Consulting the SNT team ensures that all estimates align with country policy, retain credibility, and allow consistent comparisons across geographies.
Accurate and up-to-date population estimates are essential for reviewing census baselines, informing malaria risk assessments (particularly in areas receiving large influxes from high-transmission zones), and guiding intervention planning such as ITN campaigns. In contexts of internal displacement, this means accounting for both the current location and place of origin of IDPs, as each affects service needs, resource allocation, and planning decisions within the SNT process. Official data often lags behind recent movements, so these adjustments are especially important where displacement is ongoing or widespread.
IDP data is typically available at the admin1 level from sources like IOM, WHO Emergencies, UNHCR, or UNICEF. However, SNT often requires admin2-level granularity. When more detailed data is unavailable, local validation may be needed to reallocate population counts across districts based on known movement patterns.
If IDPs are relevant in your context, raise this early with the SNT team so population estimates can be reviewed and adjusted accordingly.
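As a concrete illustration of the reallocation problem: where IDP counts exist only at admin1, one simple approach (shown here as a toy sketch with made-up numbers, subject to local validation and SNT team approval, not an official method) is to apportion the admin1 total across districts in proportion to their baseline population.

```python
import pandas as pd

# hypothetical admin1-level IDP counts (illustrative numbers only)
idp_adm1 = pd.DataFrame({"adm1": ["NORTHERN"], "idp": [10000]})

# baseline district populations within that admin1 unit
pop_adm2 = pd.DataFrame({
    "adm1": ["NORTHERN", "NORTHERN"],
    "adm2": ["BOMBALI", "KOINADUGU"],
    "pop": [60000, 40000],
})

# apportion the admin1 IDP total by each district's population share
alloc = pop_adm2.merge(idp_adm1, on="adm1")
alloc["share"] = alloc["pop"] / alloc.groupby("adm1")["pop"].transform("sum")
alloc["idp_alloc"] = (alloc["idp"] * alloc["share"]).round()

print(alloc[["adm2", "idp_alloc"]])
```

Known movement patterns, if available, should replace the simple population-share weights above.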
Before using a national dataset, keep in mind the following:
Is the data at the correct geographic level?
SNT typically requires population data at adm2 (district) or below. If the national data only covers higher levels (e.g., adm1), raise this with the SNT team. They will determine whether to request disaggregated data or use an approved alternative.
Are the years of interest available?
Census estimates may not cover the full SNT analysis period. If population counts for recent years are missing, do not apply your own growth rates. The SNT team will handle any projections using validated national assumptions or fallback methods agreed with country stakeholders.
Does it include the relevant target populations?
If the dataset does not include disaggregated populations (e.g., under-fives or women of reproductive age), alert the SNT team. They will coordinate with the national statistics office to determine whether official proportion estimates exist. If none are available, alternate sources (e.g., surveys or modelled data) may be considered—but only with prior validation and approval by the SNT team.
The key takeaway is that national population datasets are the preferred starting point—but only when they meet SNT needs. Your responsibility is to assess structure, year coverage, and completeness. Data selection and population estimation are led by the SNT team, not done independently.
Different sources may be appropriate depending on the level of analysis and specific planning use case. For example:
- Census data – for national or district-level estimates
- WorldPop raster data – for high-resolution access or catchment modelling
- DHIS2 health facility data – for facility-level planning inputs
- ITN campaign microplans – for population adjustments based on past distribution rounds
- Urban GIS datasets – for micro-stratification in dense settings
- Refugee and IDP data – from UNHCR, WHO Emergencies, or national actors
All sources used should be agreed with the SNT team and clearly documented.
This page covers how to clean and prepare national population data—that is, data officially endorsed for use in SNT. It does not cover modelled or raster-based datasets. If you’re working with WorldPop or other spatial rasters, refer to the WorldPop Raster Data section instead.
Step-by-Step
To skip the step-by-step explanation, jump to the full code at the end of this page.
Step 1: Initial Setup
Step 1.1: Define Functions and Get Packages
Before importing our data locally, we first load the required packages.
Install packages in your terminal, if not already installed. If you need help installing packages, please refer to the Getting Started page.
Once packages are installed, load the relevant packages.
# load relevant packages
import pandas as pd # data manipulation
import geopandas as gpd # spatial data handling
import numpy as np # numerical operations
import matplotlib.pyplot as plt # plotting and visualization
import seaborn as sns # statistical plotting
from pyhere import here # handle file paths
import re # string operations
Step 1.2: Import Data
We’ll use Sierra Leone’s population and boundary data as an example. These files can be obtained from the provided repository.
The Sierra Leone Chiefdom shapefiles are available in the shapefiles section. Be sure to download all related components (.dbf, .shp, .shx). The population data is in the Excel files section. If following the examples, download both datasets and save them to your local environment.
We first start with importing the Sierra Leone shapefiles and population data from our local environment.
# import shapefile
shp_adm3 <- sf::read_sf(
here::here(
"1.1_foundational",
"1.1a_administrative_boundaries",
"Chiefdom2021.shp"
)
) |>
dplyr::mutate(adm0 = "SIERRA LEONE") |>
dplyr::select(
adm0,
adm1 = FIRST_REGI,
adm2 = FIRST_DNAM,
adm3 = FIRST_CHIE
) |>
sf::st_sf()
# get the pop data
pop_df <- rio::import(
here::here(
"1.1_foundational",
"1.1c_population",
"1.1ci_national_population",
"raw",
"sle_pop_data.xlsx"
)
)
# check the top 6 rows
head(pop_df)
To adapt the code:
This section uses example data from GitHub. If you have local data or use a different format, follow these steps:
- Line 5-7: Replace the path in the sf::read_sf function with the local path where your shapefile data is stored.
- Line 10-15: These are optional steps for creating the adm0 column and selecting or renaming relevant administrative columns. This will vary depending on your shapefile, so adapt accordingly.
- Line 21-26: Replace the path in the rio::import() function with the local path where your pop data is stored.
Once updated, run the code to load your shapefile and population data.
# import shapefile
shp_adm3 = (gpd.read_file(
here(
"1.1_foundational",
"1.1a_administrative_boundaries",
"Chiefdom2021.shp"
)
)
.assign(adm0="SIERRA LEONE")
.rename(columns={
'FIRST_REGI': 'adm1',
'FIRST_DNAM': 'adm2',
'FIRST_CHIE': 'adm3'
})
[['adm0', 'adm1', 'adm2', 'adm3', 'geometry']]
)
# get the pop data
pop_df = pd.read_excel(
here(
"1.1_foundational",
"1.1c_population",
"1.1ci_national_population",
"raw",
"sle_pop_data.xlsx"
)
)
# check the top 6 rows
print(pop_df.head(6))
To adapt the code:
This section uses example data from GitHub. If you have local data or use a different format, follow these steps:
- Line 4-6: Replace the path in the gpd.read_file function with the local path where your shapefile data is stored.
- Line 9-15: These are optional steps for creating the adm0 column and selecting or renaming relevant administrative columns. This will vary depending on your shapefile, so adapt accordingly.
- Line 21-25: Replace the path in the pd.read_excel function with the local path where your pop data is stored.
Once updated, run the code to load your shapefile and population data.
We’ve now loaded Sierra Leone population data at the adm3 level, covering the years 2015 to 2024. The dataset includes geographic identifiers (adm2 and adm3) and yearly population estimates (pop2015 to pop2024); adm1 is missing and will be added in Step 3. For now, we’ll focus on the population figures and location columns needed for SNT.
Step 2: Transform Data
The dataset is currently in wide format, with one row per adm3 unit and separate columns for each year (e.g. pop2016, pop2017, etc.). To make it easier to work with in the SNT pipeline, we’ll reshape it into tidy format—where each row corresponds to a single admin unit for a given year, and population values are stored in a single column. If your data is already in this long format (see the data structures section), you can skip the pivot_longer() step in the R code and the .melt() step in the Python code.
Some countries may already provide data in the tidy structure, especially if it’s been preprocessed or submitted in a standardised reporting format.
Step 2.1: Reshape Wide Data to Long Format
pop_long <- pop_df |>
# add country column
dplyr::mutate(adm0 = "SIERRA LEONE") |>
# keep only admin and population columns
dplyr::select(adm0,
adm2,
adm3,
dplyr::starts_with("pop")) |>
# reshape to long format: one row per year per adm3
tidyr::pivot_longer(
cols = dplyr::starts_with("pop"),
names_to = "col_name",
values_to = "pop"
) |>
# extract year from column name and drop the original
dplyr::mutate(
year = stringr::str_extract(col_name, "\\d{4}"),
year = as.integer(year),
.keep = "unused"
)
To adapt the code:
- Line 3: Set the country name. Replace "SIERRA LEONE" with your target.
- Lines 5 to 8: Match your admin columns. Rename or drop adm3 if absent. Adjust the population prefix if not “pop” (for example, use dplyr::matches("^wp_?\\d{4}$")).
- Lines 10 to 14: Ensure pivot_longer() targets only year columns. If needed, switch cols to a stricter pattern (for example, dplyr::matches("^pop_?\\d{4}$")).
- Line 17: Update the year regex if your headers differ. Examples: "y\\d{4}", or "(?:^|_)\\d{4}(?:$|_)" to avoid accidental matches.
pop_long = (pop_df
# add country column
.assign(adm0="SIERRA LEONE")
# keep only admin and population columns
.filter(regex=r'^(adm0|adm2|adm3|pop)')
# reshape to long format: one row per year per adm3
.melt(
id_vars=['adm0', 'adm2', 'adm3'],
value_vars=[col for col in pop_df.columns if col.startswith('pop')],
var_name='col_name',
value_name='pop'
)
# extract year from column name and drop the original
.assign(
year=lambda x: x['col_name'].str.extract(r'(\d{4})')[0].astype(int)
)
.drop(columns=['col_name'])
.sort_values(['adm3', 'year']) # this is optional (added to match R output ordering)
.reset_index(drop=True)
)
To adapt the code:
- Line 3: Set the country name. Replace "SIERRA LEONE" with your target.
- Line 5: Match your admin columns. Rename or drop adm3 if absent. Adjust the population prefix if not “pop” (for example, use .filter(regex=r'^(adm0|adm2|adm3|wp)')).
- Lines 7 to 11: Ensure .melt() targets only year columns. If needed, switch value_vars to a stricter pattern (for example, value_vars=[col for col in pop_df.columns if re.match(r'^pop_?\d{4}$', col)]).
- Line 15: Update the year regex if your headers differ. Examples: r'y(\d{4})', or r'(?:^|_)(\d{4})(?:$|_)' to avoid accidental matches.
We add a country column (adm0), keep only the admin and population columns, and then pivot the data into long format so each row is a single adm3 unit per year. We extract the year from the column names and drop the original wide column name. This sets us up for easier joins and indicator calculations later on. The stringr::str_extract(col_name, "\\d{4}") call in R and str.extract(r'(\d{4})') in Python do the work of pulling the 4-digit year (e.g. 2017) from column names like pop2017, using a regex that matches any four-digit number. We then convert this to an integer so it behaves properly in filters or plots.
\d{4}
The expression \d{4} is a regular expression (regex) used to match any sequence of exactly four digits.
- \d means “any digit (0–9)”
- {4} means “repeat the digit pattern exactly four times”
For the R code, the reason we use double backslashes (\\) is because R treats a single backslash as an escape character. So to pass a literal \d to the regex engine, we need to write it as "\\d".
So str_extract(col_name, "\\d{4}") pulls out the 4-digit year from strings like "pop2017", returning "2017".
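To see the same pattern in action outside the pipeline, here is a quick standalone check in Python (note that Python raw strings like r"\d{4}" avoid the double-backslash issue R has):

```python
import re

# example column headers, including one without a 4-digit year
cols = ["pop2015", "pop2016", "pop_2020", "adm3"]

# pull the first 4-digit run from each name; None where absent
years = [int(m.group()) if (m := re.search(r"\d{4}", c)) else None for c in cols]
print(years)  # -> [2015, 2016, 2020, None]
```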
Step 2.2: What Should My Final Data Look Like?
Your population dataset should now follow a tidy, long format: one row per admin unit and year, with a single column holding population values. This structure is essential for analysis in the SNT pipeline, where population data must align cleanly with other spatial and time-based datasets like routine case data or survey indicators.
Working in long format also makes it easier to run group-wise operations, join across sources, and generate time trends. A properly reshaped dataset should resemble the example below:
| adm0 | adm1 | adm2 | adm3 | year | pop |
|---|---|---|---|---|---|
| SIERRA LEONE | SOUTHERN | BO | BADJIA | 2015 | 8135 |
| SIERRA LEONE | SOUTHERN | BO | BADJIA | 2016 | 8369 |
| SIERRA LEONE | SOUTHERN | BO | BADJIA | 2017 | 8602 |
| SIERRA LEONE | SOUTHERN | BO | BADJIA | 2018 | 8834 |
| SIERRA LEONE | SOUTHERN | BO | BADJIA | 2019 | 9063 |
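A quick structural check is that the reshaped table has exactly one row per admin unit per year. A sketch on a toy frame (the frame and its values are illustrative; apply the same .duplicated() test to your own reshaped data):

```python
import pandas as pd

# toy long-format frame standing in for the reshaped data
pop_toy = pd.DataFrame({
    "adm3": ["BADJIA", "BADJIA", "BAGBO"],
    "year": [2015, 2016, 2015],
    "pop": [8135, 8369, 12000],
})

# a tidy pop table has exactly one row per (adm3, year) pair
n_dupes = pop_toy.duplicated(subset=["adm3", "year"]).sum()
print("duplicate adm3-year rows:", n_dupes)
```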
Step 3: Admin Unit Alignment
Before using any population dataset in the SNT workflow, it’s essential to ensure that all administrative levels—such as adm0, adm1, adm2, and/or adm3—are fully aligned with the shapefile being used in the analysis. This means checking that names, structures, and geographic references match exactly between the population data and your spatial boundaries. Inconsistent or missing admin values can cause join errors, distort summaries, or lead to misinterpretation in stratification outputs. This step focuses on aligning and validating the administrative structure across sources to ensure consistency and compatibility.
The population data appears to be missing adm1 (province-level) information. To fill this gap, we merge in adm1 values using a lookup extracted from the shapefile. This highlights the broader need to ensure that population data is fully aligned with the administrative boundaries used elsewhere in the analysis. Consistent naming and structure across datasets is essential for clean joins, accurate aggregation, and meaningful comparisons in the SNT workflow. After merging, we reorder the columns for clarity. The final output should be a cleaned and properly structured population file—aligned with your administrative boundaries and ready for use in epidemiological stratification.
Step 3.1: Align Admin Levels Using Shapefile
To adapt the code:
This section uses example data from GitHub. If you have local data or use a different format, follow these steps:
- Line 4: Update dplyr::distinct() if your shapefile uses different admin levels or column names. For example, use dplyr::distinct(adm0, adm1) if you want distinct admin levels up to adm1.
Once updated, run the code to load your admin reference file.
To adapt the code:
This section uses example data from GitHub. If you have local data or use a different format, follow these steps:
- Line 5: Update .drop_duplicates() if your shapefile uses different admin levels or column names. For example, use .drop_duplicates(subset=['adm0', 'adm1']) if you want distinct admin levels up to adm1.
Once updated, run the code to load your admin reference file.
In our example dataset, the population file includes adm0, adm2, and adm3, but is missing adm1 (province). However, your file may contain different gaps. Some may already include adm1, while others might require different corrections. This step is simply one illustration of how to address alignment issues.
To resolve this, we extract a lookup table from the shapefile and merge the adm1 values back in.
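The lookup table is simply the shapefile's attribute table with the geometry dropped and duplicate admin combinations removed. A toy pandas sketch of the idea (the values are illustrative; the real shp_lookup applies the same step to shp_adm3):

```python
import pandas as pd

# toy stand-in for the shapefile attribute table (geometry column dropped)
attrs_toy = pd.DataFrame({
    "adm0": ["SIERRA LEONE"] * 3,
    "adm1": ["SOUTHERN", "SOUTHERN", "NORTHERN"],
    "adm2": ["BO", "BO", "BOMBALI"],
    "adm3": ["BADJIA", "BAGBO", "BIRIWA"],
})

# unique admin combinations form the lookup used for the join
lookup_toy = attrs_toy.drop_duplicates(subset=["adm0", "adm1", "adm2", "adm3"])
print(lookup_toy.shape)  # one row per unique admin combination
```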
# join the adm1 from shapefile lookup df
pop_long2 <- pop_long |>
dplyr::mutate(
adm2 = toupper(adm2),
adm3 = toupper(adm3)) |>
dplyr::left_join(
dplyr::distinct(shp_lookup, adm1, adm2),
by = "adm2"
) |>
# reorder columns
dplyr::select(
adm0,
adm1,
adm2,
adm3,
year,
pop
)
# check results after adding in adm1
pop_long2 |> head()
To adapt the code:
- Line 8: Adjust the join key to match your data. If your population data aligns at adm1, change by = "adm2" to by = "adm1". If at adm3, change to by = "adm3".
- Make sure the columns you join on actually exist and match in both datasets.
- Column Names: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.
Once updated, run the code to join your pop data to your admin data.
# join the adm1 from shapefile lookup df
pop_long2 = (pop_long
.assign(
adm2=lambda x: x['adm2'].str.upper(),
adm3=lambda x: x['adm3'].str.upper()
)
.merge(
shp_lookup[['adm1', 'adm2']].drop_duplicates(),
on='adm2',
how='left'
)
# reorder columns
[['adm0', 'adm1', 'adm2', 'adm3', 'year', 'pop']]
.sort_values(['adm3', 'year']) # this is optional (added to match R output ordering)
.reset_index(drop=True)
)
# check results after adding in adm1
print(pop_long2.head())
To adapt the code:
- Line 9: Adjust the join key to match your data. If your population data aligns at adm1, change on='adm2' to on='adm1'. If at adm3, change to on='adm3'.
- Make sure the columns you join on actually exist and match in both datasets.
- Column Names: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.
Once updated, run the code to join your pop data to your admin data.
We join adm1 values into the population dataset using a lookup table derived from the shapefile. This ensures the admin structure is consistent with other spatial datasets. After merging, we reorder the columns for clarity. The resulting file should now contain adm0, adm1, adm2, and adm3, in tidy format, and aligned with the administrative units used elsewhere in your analysis. The output above confirms that adm1 is now present in the data.
Step 3.2: Check for Admin Value Mismatches
Now to ensure consistency, we extract the distinct combinations of admin columns from both datasets and compare them. Any mismatch could signal a misspelled district name, missing entry, or a structural issue in the merge.
We extract unique combinations of adm0 to adm3 from both datasets to check alignment. Using !!!rlang::syms(keys) dynamically passes column names stored in keys, and is equivalent to writing shp_lookup |> dplyr::distinct(adm0, adm1, adm2, adm3), but more flexible and reusable. Using anti_join() in both directions helps catch any mismatches across sources.
# extract unique admin combinations
keys <- c("adm0", "adm1", "adm2", "adm3")
lookup_keys <- shp_lookup |> dplyr::distinct(!!!rlang::syms(keys))
pop_keys <- pop_long2 |> dplyr::distinct(!!!rlang::syms(keys))
# identify mismatches (either direction)
mismatches <-
dplyr::anti_join(
pop_keys, lookup_keys, by = keys) |>
dplyr::bind_rows(
dplyr::anti_join(lookup_keys, pop_keys, by = keys)
)
# check if all admin level combinations (adm0–adm3) match between sources
if (nrow(mismatches) == 0) {
cli::cli_alert_success(
"admin levels (adm0 to adm3) match between pop data and shapefile."
)
} else {
cli::cli_alert_danger(
"Mismatch detected in admin levels between pop data and shapefile."
)
}
To adapt the code:
- Line 2: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.
- Make sure the columns you join on actually exist and match in both datasets.
Once updated, run the code to check admin mismatches.
We extract unique combinations of adm0 to adm3 from both datasets to check alignment. Using .merge() with indicator=True and filtering for 'left_only' in both directions helps catch any mismatches across sources.
# extract unique admin combinations
keys = ['adm0', 'adm1', 'adm2', 'adm3']
lookup_keys = shp_lookup[keys].drop_duplicates()
pop_keys = pop_long2[keys].drop_duplicates()
# identify mismatches (either direction)
mismatches = pd.concat([
pop_keys.merge(lookup_keys, on=keys, how='left', indicator=True)
.query('_merge == "left_only"')
.drop(columns=['_merge']),
lookup_keys.merge(pop_keys, on=keys, how='left', indicator=True)
.query('_merge == "left_only"')
.drop(columns=['_merge'])
], ignore_index=True)
# check if all admin level combinations (adm0–adm3) match between sources
if len(mismatches) == 0:
print("✓ admin levels (adm0 to adm3) match between pop data and shapefile.")
else:
print("✗ Mismatch detected in admin levels between pop data and shapefile.")
To adapt the code:
- Line 2: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.
- Make sure the columns you join on actually exist and match in both datasets.
Once updated, run the code to check admin mismatches.
Step 3.3: Resolving Mismatches in Admin Units
If mismatches are found—due to spelling differences, inconsistent naming conventions, or boundary changes—you’ll need to harmonise the administrative names between the population data and the shapefile. One recommended approach is to use the prep_geonames() function from the sntutils package.
This function combines algorithmic matching (e.g., string distance) with user-guided correction to standardise administrative units across datasets. It supports up to six hierarchical levels and can cache decisions for future reuse—so you don’t need to repeat the same matching steps. These cached corrections can also be shared with colleagues, saving time and ensuring consistent naming across team members.
We don’t cover all details here, but below is a basic setup. For a full walkthrough—including both interactive and scripted use—see the Merging shapefiles with tabular data section of the code library. That section explains how to integrate prep_geonames() into your cleaning workflow and how to save and reuse decisions efficiently.
# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")
# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"
# Harmonise admin names between population data and shapefile
pop_harmonised <-
sntutils::prep_geonames(
target_df = pop_long2,
lookup_df = lookup_keys,
level0 = "adm0",
level1 = "adm1",
level2 = "adm2",
level3 = "adm3",
interactive = TRUE,
cache_path = here::here(cache_path, "geoname_decisions.rds")
)
To adapt the code:
- Line 11: Replace pop_long2 with your population dataset.
- Line 12: Replace lookup_keys with your lookup or reference dataset.
- Line 13-16: Adjust level0, level1, level2, and level3 to match your actual admin column names if they are named differently in your dataset (e.g., “country”, “region”, “district”, “ward”).
- Make sure the columns you join on actually exist and match in both datasets
- Line 6 and 18: Update the path if you want to save the geoname decisions to a different location.
Once updated, run the code to harmonise pop data admin names with shapefile ones.
Install the sntutils-py package in your terminal, if not already installed. If you need help installing packages, please refer to the Getting Started page.
# load the prep_geonames function from the sntutils-py package
from sntutils.geo import prep_geonames
# set up location to save cache
cache_path = "1.1_foundational/1d_cache_files"
# Harmonise admin names between population data and shapefile
pop_harmonised = prep_geonames(
target_df=pop_long2,
lookup_df=lookup_keys,
level0="adm0",
level1="adm1",
level2="adm2",
level3="adm3",
method="jw",
cache_path=here(cache_path, "geoname_decisions.csv"),
preserve_case=True
)
To adapt the code:
- Line 9: Replace pop_long2 with your population dataset.
- Line 10: Replace lookup_keys with your lookup or reference dataset.
- Line 11-14: Adjust level0, level1, level2, and level3 to match your actual admin column names if they are named differently in your dataset (e.g., “country”, “region”, “district”, “ward”).
- Make sure the columns you join on actually exist and match in both datasets
- Line 5 and 16: Update the path if you want to save the geoname decisions to a different location.
- Line 15: Replace method="jw" with method="lv" to switch distance metrics:
  - Jaro-Winkler ("jw") - recommended for geographic/administrative names
  - Levenshtein ("lv") - better for general typos/spelling errors
Once updated, run the code to harmonise pop data admin names with shapefile ones.
This confirms that all administrative levels have been successfully aligned between your dataset and the reference. If mismatches remain, the tool will prompt you to resolve them interactively.
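To build intuition for the string-distance matching behind this kind of harmonisation, here is a toy Levenshtein implementation in Python (illustrative only; it is not the sntutils code, which uses its own matching internals):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# a misspelled district is one edit away from its official name
print(levenshtein("KOINADUGU", "KOINADUGO"))  # -> 1
```

Names within a small edit distance of a reference name are good candidates for a match; the interactive step then lets a human confirm or reject each one.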
Step 4: Aggregating Population Data
With the population data now reshaped and harmonized across administrative levels, we can begin exploring it through basic checks and summaries. As a first step, we aggregate the population totals at different administrative units of interest:
# summarize total population by adm0
pop_summary_adm0 <- pop_long2 |>
dplyr::group_by(adm0, year) |>
dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))
# summarize total population by adm1
pop_summary_adm1 <- pop_long2 |>
dplyr::group_by(adm1, year) |>
dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))
# summarize total population by adm2
pop_summary_adm2 <- pop_long2 |>
dplyr::group_by(adm2, year) |>
dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))
# summarize total population by adm3
pop_summary_adm3 <- pop_long2
# check to see if aggregation worked
head(pop_summary_adm1, n = 9)
To adapt the code:
- No major changes are needed here.
- The grouping levels (adm0, adm1, adm2, adm3) follow from the earlier joining and harmonization steps.
# summarize total population by adm0
pop_summary_adm0 = (pop_long2
.groupby(['adm0', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)
# summarize total population by adm1
pop_summary_adm1 = (pop_long2
.groupby(['adm1', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)
# summarize total population by adm2
pop_summary_adm2 = (pop_long2
.groupby(['adm2', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)
# summarize total population by adm3
pop_summary_adm3 = pop_long2.copy()
# check to see if aggregation worked
print(pop_summary_adm1.head(9))
To adapt the code:
- No major changes are needed here.
- The grouping levels (adm0, adm1, adm2, adm3) follow from the earlier joining and harmonization steps.
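A useful sanity check after aggregating is that totals are conserved across levels: the adm1 summaries for a given year should sum to the adm0 total. A sketch with toy numbers (variable names and values are illustrative):

```python
import pandas as pd

# toy long-format data standing in for pop_long2
pop_toy = pd.DataFrame({
    "adm0": ["SIERRA LEONE"] * 4,
    "adm1": ["SOUTHERN", "SOUTHERN", "NORTHERN", "NORTHERN"],
    "year": [2020] * 4,
    "pop": [100, 200, 300, 400],
})

# aggregate at two levels; both must account for the same people
adm0_total = pop_toy.groupby(["adm0", "year"])["pop"].sum().sum()
adm1_total = pop_toy.groupby(["adm1", "year"])["pop"].sum().sum()
print(adm0_total, adm1_total)  # -> 1000 1000
```

A discrepancy between levels usually points to missing admin values or rows dropped during a join.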
Step 5: Visualizing Aggregated Population Data
Step 5.1: Prepare Population Data for Mapping
Once we’ve aggregated the population data and aligned it with the administrative shapefiles, the next step is to prepare it for visualization. This includes merging population totals into the shapefile and assigning population ranges.
To make visual patterns easier to interpret, population values are grouped into bins. Any location with a value of 0 or NA is placed in a special “Missing or 0” category. These will appear in grey on the map, making it easy to spot gaps or misalignments before moving forward with analysis.
# join pop data to shapefile
shp_adm3_pop <- shp_adm3 |>
dplyr::left_join(
pop_summary_adm3,
by = c("adm0", "adm1", "adm2", "adm3"),
relationship = "many-to-many"
) |>
dplyr::filter(year >= 2020) |>
dplyr::mutate(
pop_bin = dplyr::case_when(
is.na(pop) | pop == 0 ~ "Missing or 0",
pop <= 15000 ~ "<15k",
pop <= 30000 ~ "15k–30k",
pop <= 50000 ~ "30k–50k",
pop <= 100000 ~ "50k–100k",
pop > 100000 ~ "100k+"
),
pop_bin = factor(
pop_bin,
levels = c(
"Missing or 0",
"<15k",
"15k–30k",
"30k–50k",
"50k–100k",
"100k+"
)
)
)
To adapt the code:
Line 5: If you used a different shapefile or admin join level earlier, update by = “adm3” to the level you used (e.g., “adm2” or “adm1”).
Line 8: Change the year filter if you are interested in a different range (e.g., year >= 2015).
Lines 11–16: You may adjust the population bins to match meaningful ranges for your context.
Once updated, run the code to set up your population data for mapping.
# join pop data to shapefile
shp_adm3_pop = (shp_adm3
.merge(
pop_summary_adm3,
on=['adm0', 'adm1', 'adm2', 'adm3'],
how='left'
)
.query('year >= 2020')
.assign(
pop_bin=lambda x: pd.cut(
x['pop'].fillna(0),
bins=[-np.inf, 0, 15000, 30000, 50000, 100000, np.inf],
labels=['Missing or 0', '<15k', '15k–30k', '30k–50k', '50k–100k', '100k+'],
include_lowest=True
),
year=lambda x: x['year'].astype(int)
)
)
To adapt the code:
Line 5: If you used a different shapefile or admin join level earlier, update by = “adm3” to the level you used (e.g., “adm2” or “adm1”).
Line 8: Change the year filter if you are interested in a different range (e.g., year >= 2015).
Line 12: You may adjust the population bins to match meaningful ranges for your context.
Once updated, run the code to set up your population data for mapping.
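Before mapping, it can also help to list the units that landed in the “Missing or 0” bin, since these usually flag failed joins or genuinely missing estimates. A sketch on a toy frame (values are illustrative; apply the same filter to shp_adm3_pop):

```python
import pandas as pd

# toy stand-in for the joined map data
map_toy = pd.DataFrame({
    "adm3": ["BADJIA", "BAGBO", "BIRIWA"],
    "year": [2020, 2020, 2020],
    "pop": [8135, 0, None],
})

# flag units with zero or missing population before mapping
flagged = map_toy[map_toy["pop"].fillna(0) == 0]
print(flagged[["adm3", "year"]])
```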
Step 5.2: Visualize Aggregated Population by Region
We now generate faceted maps showing population distribution at the adm3 level for the years 2020 to 2023. Each facet represents a different year, allowing for visual comparison across time.
Grey regions may indicate missing data or failed joins. These should be reviewed before using the dataset in subsequent steps like stratification or targeting.
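One quick way to tell whether grey areas reflect genuinely missing values or a failed join is to repeat the merge with pandas' indicator argument, which labels each row by the table(s) it came from. A minimal sketch with toy stand-ins for the two tables used above:

```python
import pandas as pd

# toy stand-ins for the shapefile attributes and the population summary
shp_adm3 = pd.DataFrame({"adm3": ["BADJIA", "BAGBE", "BAGBO"]})
pop_summary_adm3 = pd.DataFrame({"adm3": ["BADJIA", "BAGBE"], "pop": [14000, 9000]})

# indicator="join_status" marks each row as 'both' or 'left_only'
checked = shp_adm3.merge(
    pop_summary_adm3, on="adm3", how="left", indicator="join_status"
)

# left_only rows are admin units with no population match -> grey on the map
unmatched = checked[checked["join_status"] == "left_only"]
print(unmatched["adm3"].tolist())  # → ['BAGBO']
```

Any names printed here should be traced back to spelling or admin-level mismatches before moving on.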
Show full code
# generate adm0-level shapefile
shp_adm0 <- shp_adm3 |>
dplyr::group_by(adm0) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# generate adm1-level shapefile
shp_adm1 <- shp_adm3 |>
dplyr::group_by(adm1) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# plot pop from 2020 to 2023
adm3_pop_map <- shp_adm3_pop |>
ggplot2::ggplot() +
ggplot2::geom_sf(
ggplot2::aes(fill = pop_bin),
color = "white",
size = 0.2
) +
# add adm1 outline as a black border
ggplot2::geom_sf(
data = shp_adm1,
fill = NA, color = "black", size = 0.5
) +
ggplot2::scale_fill_manual(
name = "Population Range",
values = c("Missing or 0" = "grey80",
"<15k" = "#fee5d9",
"15k–30k" = "#fcae91",
"30k–50k" = "#fb6a4a",
"50k–100k" = "#de2d26",
"100k+" = "#a50f15"),
na.value = "grey80",
guide = ggplot2::guide_legend(
label.position = "bottom",
title.position = "top"
)
) +
ggplot2::facet_wrap(~year, nrow = 1) +
ggplot2::labs(
title = "Total Population by Chiefdom (adm3)",
subtitle = "Between 2020 and 2023",
fill = "Population"
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = ggplot2::element_text(size = 20),
plot.subtitle = ggplot2::element_text(size = 20),
strip.text = ggplot2::element_text(size = 18),
legend.title = ggplot2::element_text(size = 18),
legend.text = ggplot2::element_text(size = 18),
legend.position = "bottom",
legend.title.position = "top",
legend.key.width = grid::unit(1, "cm"),
axis.text.x = ggplot2::element_text(
angle = 75, vjust = 1, hjust = 1
)
  )
To adapt the code:
Lines 14 and 42: If you joined at a different admin level (e.g., adm2), update shp_adm3_pop and the map title accordingly.
Line 23: If your admin1 shapefile is named differently, update data = shp_adm1 to your object name.
Lines 28–33: Adjust color values and ranges if you defined different population bins.
Once updated, run the code to map your aggregated data.
Show full code
# generate adm0-level shapefile
shp_adm0 = (shp_adm3
.dissolve(by='adm0')
.reset_index()
)
# generate adm1-level shapefile
shp_adm1 = (shp_adm3
.dissolve(by='adm1')
.reset_index()
)
# plot pop from 2020 to 2023
years = sorted(shp_adm3_pop['year'].unique())
adm3_pop_map, axes = plt.subplots(1, len(years), figsize=(12, 5), constrained_layout=True)
# handle case where there's only one year
if len(years) == 1:
axes = [axes]
# define color mapping
color_map = {
'Missing or 0': 'lightgrey',
'<15k': '#fee5d9',
'15k–30k': '#fcae91',
'30k–50k': '#fb6a4a',
'50k–100k': '#de2d26',
'100k+': '#a50f15'
}
for i, year in enumerate(years):
ax = axes[i]
# plot population data
year_data = shp_adm3_pop[shp_adm3_pop['year'] == year]
for pop_bin, color in color_map.items():
subset = year_data[year_data['pop_bin'] == pop_bin]
if not subset.empty:
subset.plot(ax=ax,
color = color,
edgecolor = 'white',
linewidth = 0.3)
# add adm1 outline as a black border
shp_adm1.boundary.plot(ax=ax, color='black', linewidth=0.5)
ax.set_title(f'{year}',fontsize = 16)
ax.set_axis_off()
# add overall title and legend
adm3_pop_map.suptitle('Total Population by Chiefdom (adm3)\nBetween 2020 and 2023', fontsize=20)
# create custom legend
handles = [plt.Rectangle((0,0),1,1, color=color) for color in color_map.values()]
labels = list(color_map.keys())
adm3_pop_map.legend(
handles,
labels,
title='Population Range',
loc='lower center',
bbox_to_anchor=(0.5, -0.03),
ncol=6,
fontsize=18,
title_fontsize=18,
frameon=False,
columnspacing=0.8,
handletextpad=0.5)
plt.show()
To adapt the code:
Lines 14, 36, 53: If you joined at a different admin level (e.g., adm2), update shp_adm3_pop and the map title accordingly.
Line 47: If your admin1 shapefile is named differently, update shp_adm1 to your object name.
Lines 23–29: Adjust color values and ranges if you defined different population bins.
Line 16: Adjust the width and height in figsize=(12, 5) if you need a different image size.
Once updated, run the code to map your aggregated data.
All chiefdoms are shaded on the map, indicating that population values were successfully joined: none are missing or zero for the years 2020 to 2023. This suggests that the merge step worked correctly and there are no gaps in coverage at the adm3 level.
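The same conclusion can be checked programmatically by counting how many units fall in the "Missing or 0" bin per year; with a clean join, every count should be zero. A sketch using a toy stand-in for shp_adm3_pop:

```python
import pandas as pd

# toy stand-in for shp_adm3_pop with the pop_bin column built earlier
shp_adm3_pop = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "pop_bin": ["<15k", "15k–30k", "<15k", "Missing or 0"],
})

# count 'Missing or 0' units per year; all zeros means a clean join
missing_by_year = (
    shp_adm3_pop.assign(is_missing=shp_adm3_pop["pop_bin"] == "Missing or 0")
    .groupby("year")["is_missing"]
    .sum()
)
print(missing_by_year.to_dict())  # → {2020: 0, 2021: 1}
```

A non-zero count for any year points to rows that failed the join or genuinely lack population values.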
Step 5.3: Save plots
The map is then saved as a high-resolution PNG for future reference.
To adapt the code:
- Line 4: Update the filepath to match your folder structure by changing 03_output/3a_figures/.
- Line 4: Update the filename by changing pop_check_adm3_2020_2023.png.
- Line 5: Adjust the width, height, and dpi if you need a different image size or quality.
Once updated, run the code to save the plot as a PNG file.
To adapt the code:
- Line 3: Update the filepath to match your folder structure by changing "03_output", "3a_figures".
- Line 3: Update the filename by changing pop_check_adm3_2020_2023.png.
- Line 4: Adjust the dpi if you need a different image quality.
Once updated, run the code to save the plot as a PNG file.
Based on the folded code above in Step 5.2, this plot confirms that our population data is complete and well structured. Each adm3 has a population value for all the years we’re interested in (2020–2023), and joins cleanly to the shapefile. This quick visual check helps ensure that the data is aligned before we move on to deeper analysis.
Step 6: Save Data
Once the census population data is cleaned, reshaped, and aligned with your admin boundaries, it’s a good idea to save it in a ready-to-use format. This avoids repeating the cleaning steps and makes it easier to join with other datasets later on.
We save the processed data as .rds, .csv, and .xlsx in R (or .csv and .xlsx in Python), so it’s easy to reload later or share with others.
# define output folder
save_path <- here::here(
"01_data", "1.1_foundational", "1.1c_population", "1ci_country_sourced"
)
# save as RDS
rio::export(
pop_long,
here::here(save_path, "sierra_leone_pop_census_2020_2023.rds")
)
# save as CSV
rio::export(
pop_long,
here::here(save_path, "sierra_leone_pop_census_2020_2023.csv")
)
# save as XLSX
rio::export(
pop_long,
here::here(save_path, "sierra_leone_pop_census_2020_2023.xlsx")
)
To adapt the code:
Line 3: Update save_path to your output directory.
File names: Change the file names (e.g., sierra_leone_pop_census.csv) to reflect your dataset or time range.
Once updated, run the code to save your processed outputs in all three formats.
# define output folder
save_path = here("01_data", "1.1_foundational", "1.1c_population", "1ci_country_sourced")
save_path.mkdir(parents=True, exist_ok=True)
# save as CSV
pop_long.to_csv(save_path / "sierra_leone_pop_census_2020_2023.csv", index=False)
# save as XLSX
pop_long.to_excel(save_path / "sierra_leone_pop_census_2020_2023.xlsx", index=False)
To adapt the code:
Line 2: Update save_path to your output directory.
File names: Change the file names (e.g., sierra_leone_pop_census.csv) to reflect your dataset or time range.
Once updated, run the code to save your processed outputs in both formats.
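After saving, a quick round-trip check (re-reading the CSV and comparing shapes) confirms the export worked. A sketch using a toy table and a temporary folder in place of save_path:

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy stand-in for the cleaned long-format table; use your pop_long instead
pop_long = pd.DataFrame(
    {"adm3": ["BADJIA", "BAGBE"], "year": [2020, 2020], "pop": [14000, 9000]}
)

# write to a temporary folder here; in the workflow this is save_path
save_path = Path(tempfile.mkdtemp())
out_file = save_path / "sierra_leone_pop_census_2020_2023.csv"
pop_long.to_csv(out_file, index=False)

# round-trip check: the re-imported table should match what we saved
reloaded = pd.read_csv(out_file)
print(reloaded.shape == pop_long.shape)  # → True
```

The same pattern works for the Excel output via pd.read_excel.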
Summary
We’ve now covered how to get census-based population data cleaned, reshaped, and ready for use. We brought in the raw files, reshaped them into a long tidy format, joined admin names using shapefiles, and created aggregated totals across different administrative levels. Everything is now cleaned, structured, and ready to be merged with survey or routine datasets later in the SNT pipeline. You’ll find the full code at the bottom of this section, collapsed for easy reference. It can be reused as-is or tweaked for your own setup—just update the file paths, column names, or admin levels where needed.
Full Code
Find the full code script for processing national population data below.
Show full code
################################################################################
#################### ~ National Population Data full code ~ ####################
################################################################################
### Step 1: Initial Setup ------------------------------------------------------
#### Step 1.1: Define Functions and Get Packages -------------------------------
# install or load relevant packages
pacman::p_load(
tidyverse, # core tidy tools (includes dplyr, tidyr, readr, etc.)
readxl, # read Excel files
fs, # handle file paths
purrr, # iterate over lists (map, walk, etc.)
rlang, # for tidy evaluation and error messaging
devtools # for downloading GitHub packages
)
### Step 1: Initial Setup ------------------------------------------------------
#### Step 1.2: Import Data -----------------------------------------------------
# import shapefile
shp_adm3 <- sf::read_sf(
here::here(
"1.1_foundational",
"1.1a_administrative_boundaries",
"Chiefdom2021.shp"
)
) |>
dplyr::mutate(adm0 = "SIERRA LEONE") |>
dplyr::select(
adm0,
adm1 = FIRST_REGI,
adm2 = FIRST_DNAM,
adm3 = FIRST_CHIE
) |>
sf::st_sf()
# get the pop data
pop_df <- rio::import(
here::here(
"1.1_foundational",
"1.1c_population",
"1.1ci_national_population",
"raw",
"sle_pop_data.xlsx"
)
)
# check the top 6 rows
head(pop_df)
### Step 2: Transform Data -----------------------------------------------------
#### Step 2.1: Reshape Wide Data to Long Format --------------------------------
pop_long <- pop_df |>
# add country column
dplyr::mutate(adm0 = "SIERRA LEONE") |>
# keep only admin and population columns
dplyr::select(adm0,
adm2,
adm3,
dplyr::starts_with("pop")) |>
# reshape to long format: one row per year per adm3
tidyr::pivot_longer(
cols = dplyr::starts_with("pop"),
names_to = "col_name",
values_to = "pop"
) |>
# extract year from column name and drop the original
dplyr::mutate(
year = stringr::str_extract(col_name, "\\d{4}"),
year = as.integer(year),
.keep = "unused"
)
### Step 3: Admin Unit Alignment -----------------------------------------------
#### Step 3.1: Align Admin Levels Using Shapefile ------------------------------
# create admin lookup df
shp_lookup <- shp_adm3 |>
sf::st_drop_geometry() |>
dplyr::distinct()
# check results
pop_long |> head()
### Step 3: Admin Unit Alignment -----------------------------------------------
#### Step 3.1: Align Admin Levels Using Shapefile ------------------------------
# join the adm1 from shapefile lookup df
pop_long2 <- pop_long |>
dplyr::mutate(
adm2 = toupper(adm2),
adm3 = toupper(adm3)) |>
dplyr::left_join(
dplyr::distinct(shp_lookup, adm1, adm2),
by = "adm2"
) |>
# reorder columns
dplyr::select(
adm0,
adm1,
adm2,
adm3,
year,
pop
)
# check results after adding in adm1
pop_long2 |> head()
### Step 3: Admin Unit Alignment -----------------------------------------------
#### Step 3.2: Check for Admin Value Mismatches --------------------------------
# extract unique admin combinations
keys <- c("adm0", "adm1", "adm2", "adm3")
pop_keys <- pop_long2 |> dplyr::distinct(!!!rlang::syms(keys))
lookup_keys <- shp_lookup |> dplyr::distinct(!!!rlang::syms(keys))
# identify mismatches (either direction)
mismatches <-
dplyr::anti_join(
pop_keys, lookup_keys, by = keys) |>
dplyr::bind_rows(
dplyr::anti_join(lookup_keys, pop_keys, by = keys)
)
# check if all admin level combinations (adm0–adm3) match between sources
if (nrow(mismatches) == 0) {
cli::cli_alert_success(
"admin levels (adm0 to adm3) match between pop data and shapefile."
)
} else {
cli::cli_alert_danger(
"Mismatch detected in admin levels between pop data and shapefile."
)
}
### Step 3: Admin Unit Alignment -----------------------------------------------
#### Step 3.3: Resolving Mismatches in Admin Units -----------------------------
# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")
# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"
# Harmonise admin names between population data and shapefile
pop_harmonised <-
sntutils::prep_geonames(
target_df = pop_long2,
lookup_df = lookup_keys,
level0 = "adm0",
level1 = "adm1",
level2 = "adm2",
level3 = "adm3",
interactive = TRUE,
cache_path = here::here(cache_path, "geoname_decisions.rds")
)
### Step 4: Aggregating Population Data ----------------------------------------
# summarize total population by adm0
pop_summary_adm0 <- pop_long2 |>
dplyr::group_by(adm0, year) |>
dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))
# summarize total population by adm1
pop_summary_adm1 <- pop_long2 |>
dplyr::group_by(adm1, year) |>
dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))
# summarize total population by adm2
pop_summary_adm2 <- pop_long2 |>
dplyr::group_by(adm2, year) |>
dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))
# summarize total population by adm3
pop_summary_adm3 <- pop_long2
# check to see if aggregation worked
head(pop_summary_adm1, n = 9)
### Step 5: Visualizing Aggregated Population Data -----------------------------
#### Step 5.1: Prepare Population Data for Mapping -----------------------------
# join pop data to shapefile
shp_adm3_pop <- shp_adm3 |>
dplyr::left_join(
pop_summary_adm3,
by = c("adm0", "adm1", "adm2", "adm3"),
relationship = "many-to-many"
) |>
dplyr::filter(year >= 2020) |>
dplyr::mutate(
pop_bin = dplyr::case_when(
is.na(pop) | pop == 0 ~ "Missing or 0",
pop <= 15000 ~ "<15k",
pop <= 30000 ~ "15k–30k",
pop <= 50000 ~ "30k–50k",
pop <= 100000 ~ "50k–100k",
pop > 100000 ~ "100k+"
),
pop_bin = factor(
pop_bin,
levels = c(
"Missing or 0",
"<15k",
"15k–30k",
"30k–50k",
"50k–100k",
"100k+"
)
)
)
### Step 5: Visualizing Aggregated Population Data -----------------------------
#### Step 5.2: Visualize Aggregated Population by Region -----------------------
# generate adm0-level shapefile
shp_adm0 <- shp_adm3 |>
dplyr::group_by(adm0) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# generate adm1-level shapefile
shp_adm1 <- shp_adm3 |>
dplyr::group_by(adm1) |>
dplyr::summarise(.groups = "drop") |>
sf::st_union(by_feature = TRUE)
# plot pop from 2020 to 2023
adm3_pop_map <- shp_adm3_pop |>
ggplot2::ggplot() +
ggplot2::geom_sf(
ggplot2::aes(fill = pop_bin),
color = "white",
size = 0.2
) +
# add adm1 outline as a black border
ggplot2::geom_sf(
data = shp_adm1,
fill = NA, color = "black", size = 0.5
) +
ggplot2::scale_fill_manual(
name = "Population Range",
values = c("Missing or 0" = "grey80",
"<15k" = "#fee5d9",
"15k–30k" = "#fcae91",
"30k–50k" = "#fb6a4a",
"50k–100k" = "#de2d26",
"100k+" = "#a50f15"),
na.value = "grey80",
guide = ggplot2::guide_legend(
label.position = "bottom",
title.position = "top"
)
) +
ggplot2::facet_wrap(~year, nrow = 1) +
ggplot2::labs(
title = "Total Population by Chiefdom (adm3)",
subtitle = "Between 2020 and 2023",
fill = "Population"
) +
ggplot2::theme_minimal() +
ggplot2::theme(
legend.position = "bottom",
legend.title.position = "top",
legend.key.width = grid::unit(1, "cm"),
axis.text.x = ggplot2::element_text(
angle = 75, vjust = 1, hjust = 1
)
)
### Step 5: Visualizing Aggregated Population Data -----------------------------
#### Step 5.3: Save plots ------------------------------------------------------
# save plot
ggplot2::ggsave(
plot = adm3_pop_map,
here::here("03_output/3a_figures/pop_check_adm3_2020_2023.png"),
width = 12, height = 5, dpi = 300)
### Step 6: Save Data ----------------------------------------------------------
# define output folder
save_path <- here::here(
"01_data", "1.1_foundational", "1.1c_population", "1ci_country_sourced"
)
# save as RDS
rio::export(
pop_long,
here::here(save_path, "sierra_leone_pop_census_2020_2023.rds")
)
# save as CSV
rio::export(
pop_long,
here::here(save_path, "sierra_leone_pop_census_2020_2023.csv")
)
# save as XLSX
rio::export(
pop_long,
here::here(save_path, "sierra_leone_pop_census_2020_2023.xlsx")
)