National population data

Overview

In SNT, population data is foundational. It’s used to calculate indicators like incidence rates, assess population at risk, model burden, and guide intervention targeting. Without reliable and well-structured population data, it’s hard to prioritise effectively or assess coverage. Later in the SNT workflow, the cleaned population dataset—often derived from census sources—will be merged with routine data (e.g. DHIS2) and surveys (e.g. DHS) to support epidemiological stratification and subnational planning.

In many contexts, additional population groups, such as IDPs, refugees, and nomadic or migratory communities, may not be included in national datasets but have major implications for malaria burden and campaign planning. These groups should be considered early in the SNT process. Analysts are encouraged to raise any known gaps with the SNT team to ensure special populations are reflected in the final estimates and that coordination with relevant partners (e.g. WHO Emergencies, UNHCR) can happen in a timely way.

Objectives

Use only the population dataset approved by the SNT team
Reshape official national data into tidy, analysis-ready format
Harmonise with administrative fields (e.g. ADM3 names, codes)
Aggregate to relevant admin levels (e.g. district, province)
Validate population data by mapping it on a shapefile

Population Data Needs

Population data must support the SNT team’s ability to assess risk, stratify intervention needs, and guide subnational planning. This often requires estimates disaggregated by year, administrative level, and, where relevant, by specific population groups (e.g. under-fives, pregnant women). Alignment with the spatial units used in your analysis is essential.

In many contexts, additional special populations must also be considered:

Internally Displaced Persons (IDPs)
Refugees, both in and outside camps
Migratory groups (e.g. mining communities, nomadic populations)
Urban informal settlements

These groups are often missing or underrepresented in census data. Where relevant, the SNT team will coordinate with agencies such as WHO Emergencies, UNICEF, or UNHCR to source updated estimates. Additionally, population data for SNT must reflect all vulnerable groups relevant to malaria transmission and service delivery. This includes disaggregations by sex, age, geographic context (e.g., conflict zones), and even specific professions, where these are known to influence risk or access.

Such disaggregations are important throughout the transmission continuum but become especially critical in low-transmission or pre-elimination settings, where granular understanding of residual risk and service gaps is essential. If any of these population groups are relevant in your context, ensure they are discussed with the SNT team early so that data collection, estimation, and partner coordination can be adjusted accordingly.

Localised discussions may also be required, especially when IDP data is only reported at higher administrative levels (e.g. regions) and needs to be apportioned to districts. This has important implications for census reviews, malaria risk assessment, and campaign planning (e.g. ITN distribution).

Consult SNT Team

Population data must be sourced from the official dataset designated by the SNT team. This is not an analyst decision.

In most cases, this will be census-derived data shared by the NMP or national statistics office.
Modelled data sources (e.g. WorldPop) should not be used unless explicitly approved.
Special population data (e.g. IDPs, refugees) must be sourced through country partners and validated by the SNT team.
Never extrapolate or project population counts independently.
If disaggregation by sex, age, or other key vulnerability factors (e.g. conflict-affected populations, specific occupations) is relevant for the context, raise this early. The SNT team is responsible for determining what population groups must be included based on transmission setting and programmatic need.

Consulting the SNT team ensures that all estimates align with country policy, retain credibility, and allow consistent comparisons across geographies.

Accounting for IDPs in Population Estimates

Accurate and up to date population estimates are essential for reviewing census baselines, informing malaria risk assessments, particularly in areas receiving large influxes from high transmission zones, and guiding intervention planning such as ITN campaigns. In contexts of internal displacement, this means accounting for both the current location and place of origin of IDPs, as each affects service needs, resource allocation, and planning decisions within the SNT process. Official data often lags behind recent movements, so these adjustments are especially important where displacement is ongoing or widespread.

IDP data is typically available at the admin1 level from sources like IOM, WHO Emergencies, UNHCR, or UNICEF. However, SNT often requires admin2-level granularity. When more detailed data is unavailable, local validation may be needed to reallocate population counts across districts based on known movement patterns.

If IDPs are relevant in your context, raise this early with the SNT team so population estimates can be reviewed and adjusted accordingly.

Before using a national dataset, keep in mind the following:

Is the data at the correct geographic level?

SNT typically requires population data at adm2 (district) or below. If the national data only covers higher levels (e.g., adm1), raise this with the SNT team. They will determine whether to request disaggregated data or use an approved alternative.

Are the years of interest available?

Census estimates may not cover the full SNT analysis period. If population counts for recent years are missing, do not apply your own growth rates. The SNT team will handle any projections using validated national assumptions or fallback methods agreed with country stakeholders.

Does it include the relevant target populations?

If the dataset does not include disaggregated populations (e.g., under-fives or women of reproductive age), alert the SNT team. They will coordinate with the national statistics office to determine whether official proportion estimates exist. If none are available, alternate sources (e.g., surveys or modelled data) may be considered—but only with prior validation and approval by the SNT team.

The key takeaway is that national population datasets are the preferred starting point—but only when they meet SNT needs. Your responsibility is to assess structure, year coverage, and completeness. Data selection and population estimation are led by the SNT team, not done independently.

Using Multiple Sources Appropriately

Different sources may be appropriate depending on the level of analysis and specific planning use case. For example:

Census data – for national or district-level estimates
WorldPop raster data – for high-resolution access or catchment modelling
DHIS2 health facility data – for facility-level planning inputs
ITN campaign microplans – for population adjustments based on past distribution rounds
Urban GIS datasets – for micro-stratification in dense settings
Refugee and IDP data – from UNHCR, WHO Emergencies, or national actors

All sources used should be agreed with the SNT team and clearly documented.

This page covers how to clean and prepare national population data—that is, data officially endorsed for use in SNT. It does not cover modelled or raster-based datasets. If you’re working with WorldPop or other spatial rasters, refer to the WorldPop Raster Data section instead.

Step-by-Step

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Initial Setup

Step 1.1: Define Functions and Get Packages

Before importing our data locally, we first load the required packages.

# install or load relevant packages
pacman::p_load(
  tidyverse,   # core tidy tools (includes dplyr, tidyr, readr, etc.)
  readxl,      # read Excel files
  fs,          # handle file paths
  purrr,       # iterate over lists (map, walk, etc.)
  rlang,       # for tidy evaluation and error messaging
  devtools     # for download github packages
)

Terminal Installation Required

Install packages in your terminal, if not already installed. If you need help installing packages, please refer to the Getting Started page.

pip install pandas geopandas numpy matplotlib seaborn pyhere re

Once packages are installed, load the relevant packages.

# load relevant packages
import pandas as pd               # data manipulation
import geopandas as gpd           # spatial data handling 
import numpy as np                # numerical operations 
import matplotlib.pyplot as plt   # plotting and visualization
import seaborn as sns             # statistical plotting
from pyhere import here           # handle file paths
import re                         # string operations

Step 1.2: Import Data

We’ll use Sierra Leone’s population and boundary data as an example. These files can be obtained from the provided repository.

The Sierra Leone Chiefdom shapefiles are available in the shapefiles section. Be sure to download all related components (.dbf, .shp, .shx). The population data is in the Excel files section. If following the examples, download both datasets and save them to your local environment.

We first start with importing the Sierra Leone shapefiles and population data from our local environment.

# import shapefile
shp_adm3 <- sf::read_sf(
  here::here(
    "1.1_foundational",
    "1.1a_administrative_boundaries",
    "Chiefdom2021.shp"
  )
) |>
  dplyr::mutate(adm0 = "SIERRA LEONE") |>
  dplyr::select(
    adm0,
    adm1 = FIRST_REGI,
    adm2 = FIRST_DNAM,
    adm3 = FIRST_CHIE
  ) |>
  sf::st_as_sf()

# get the pop data
pop_df <- rio::import(
  here::here(
    "1.1_foundational",
    "1.1c_population",
    "1.1ci_national_population",
    "raw",
    "sle_pop_data.xlsx"
  )
)

# check the top 6 rows
head(pop_df)

Output

       adm1      adm2          adm3  id x_chiefdom y_chiefdom CCODE FIRST_RE_1
1        Bo        Bo        Badjia 140  -11.42127   8.091023  4101          4
2        Bo        Bo         Bagbo 141  -11.85890   7.597199  4102          4
3        Bo        Bo Bagbwe(Bagbe) 142  -11.53293   8.078750  4103          4
4   Moyamba   Moyamba       Bagruwa 169  -12.47197   7.901437  4301          4
5 Port Loko Port Loko    Bakeh Loko 126  -12.83513   8.814081  3301          3
6   Pujehun   Pujehun         Barri 183  -11.41914   7.459128  4401          4
  FIRST_DCOD pop2015 pop2016 pop2017 pop2018 pop2019 pop2020 pop2021 pop2022
1          1    8135    8369    8602    8834    9063    9291    9518    9743
2          1   25884   26630   27371   28107   28838   29564   30285   31001
3          1   20926   21529   22128   22723   23314   23901   24484   25063
4          3   27623   28419   29210   29995   30775   31550   32319   33084
5          3    7773    7997    8219    8441    8660    8878    9095    9310
6          4   36905   37968   39025   40074   41117   42151   43179   44201
  pop2023 pop2024
1    9967   10289
2   31714   32737
3   25640   26467
4   33845   34938
5    9524    9832
6   45218   46677

To adapt the code:

This section uses example data from GitHub. If you have local data or use a different format, follow these steps:

Line 5-7: Replace the path in the sf::read_sf function with the local path where you shapefile data is stored.
Line 10-15: These are optional steps for creating the adm0 column and selecting or renaming relevant administrative columns. This will vary depending on your shapefile, so adapt accordingly.
Line 21-26: Replace the path in the rio::import() function with the local path where you pop data is stored.

Once updated, run the code to load your shapefile and population data.

# import shapefile
shp_adm3 = (gpd.read_file(
    here(
        "1.1_foundational",
        "1.1a_administrative_boundaries",
        "Chiefdom2021.shp"
    )
)
.assign(adm0="SIERRA LEONE")
.rename(columns={
    'FIRST_REGI': 'adm1',
    'FIRST_DNAM': 'adm2', 
    'FIRST_CHIE': 'adm3'
})
[['adm0', 'adm1', 'adm2', 'adm3', 'geometry']]
)

# get the pop data
pop_df = pd.read_excel(
    here(
        "1.1_foundational",
        "1.1c_population",
        "1.1ci_national_population",
        "raw",
        "sle_pop_data.xlsx"
    )
)

# check the top 6 rows
print(pop_df.head(6))

Output

        adm1       adm2           adm3   id  ...  pop2021  pop2022  pop2023  pop2024
0         Bo         Bo         Badjia  140  ...     9518     9743     9967    10289
1         Bo         Bo          Bagbo  141  ...    30285    31001    31714    32737
2         Bo         Bo  Bagbwe(Bagbe)  142  ...    24484    25063    25640    26467
3    Moyamba    Moyamba        Bagruwa  169  ...    32319    33084    33845    34938
4  Port Loko  Port Loko     Bakeh Loko  126  ...     9095     9310     9524     9832
5    Pujehun    Pujehun          Barri  183  ...    43179    44201    45218    46677

[6 rows x 19 columns]

To adapt the code:

This section uses example data from GitHub. If you have local data or use a different format, follow these steps:

Line 4-6: Replace the path in the gpd.read_file function with the local path where you shapefile data is stored.
Line 9-15: These are optional steps for creating the adm0 column and selecting or renaming relevant administrative columns. This will vary depending on your shapefile, so adapt accordingly.
Line 21-25: Replace the path in the pd.read_excel function with the local path where you pop data is stored.

Once updated, run the code to load your shapefile and population data.

We’ve now loaded Sierra Leone population data at the adm3 level, covering the years 2015 to 2024. The dataset includes geographic identifiers (adm1, adm2, adm3), yearly population estimates (pop2015 to pop2024). For now, we’ll focus on the population figures and location columns needed for SNT.

Step 2: Transform Data

The dataset is currently in wide format, with one row per adm3 unit and separate columns for each year (e.g. pop2016, pop2017, etc.). To make it easier to work with in the SNT pipeline, we’ll reshape it into a tidy data—where each row corresponds to a single admin unit for a given year, and population values are stored in a single column. If your data is already in this long format (see data structures section, you can skip the pivot_longer() step in the R code and the .melt() step in the Python code.

Some countries may already provide data in the tidy structure, especially if it’s been preprocessed or submitted in a standardised reporting format.

Step 2.1: Reshape Wide Data to Long Format

pop_long <- pop_df |>
  # add country column
  dplyr::mutate(adm0 = "SIERRA LEONE") |>
  # keep only admin and population columns
  dplyr::select(adm0,
                adm2,
                adm3,
                dplyr::starts_with("pop")) |>
  # reshape to long format: one row per year per adm3
  tidyr::pivot_longer(
    cols = dplyr::starts_with("pop"),
    names_to = "col_name",
    values_to = "pop"
  ) |>
  # extract year from column name and drop the original
  dplyr::mutate(
    year = stringr::str_extract(col_name, "\\d{4}"),
    year = as.integer(year),
    .keep = "unused"
  )

To adapt the code:

Line 3: Set the country name. Replace "SIERRA LEONE" with your target.
Lines 5 to 8: Match your admin columns. Rename or drop adm3 if absent. Adjust the population prefix if not “pop” (for example use dplyr::matches("^wp_?\\d{4}$")).
Lines 10 to 14: Ensure pivot_longer() targets only year columns. If needed, switch cols to a stricter pattern (for example dplyr::matches("^pop_?\\d{4}$")).
Line 17: Update the year regex if your headers differ. Examples: "y\\d{4}", or "(?:^|_)\\d{4}(?:$|_)" to avoid accidental matches.

pop_long = (pop_df
# add country column
.assign(adm0="SIERRA LEONE")
# keep only admin and population columns
.filter(regex=r'^(adm0|adm2|adm3|pop)')
# reshape to long format: one row per year per adm3
.melt(
    id_vars=['adm0', 'adm2', 'adm3'],
    value_vars=[col for col in pop_df.columns if col.startswith('pop')],
    var_name='col_name',
    value_name='pop'
)
# extract year from column name and drop the original
.assign(
    year=lambda x: x['col_name'].str.extract(r'(\d{4})')[0].astype(int)
)
.drop(columns=['col_name'])
.sort_values(['adm3', 'year']) # this is optional (added to match R output ordering)
.reset_index(drop=True)
)

To adapt the code:

Line 3: Set the country name. Replace "SIERRA LEONE" with your target.
Lines 5: Match your admin columns. Rename or drop adm3 if absent. Adjust the population prefix if not “pop” (for example use .filter(regex=r'^(adm0|adm2|adm3|wp)')).
Lines 7 to 11: Ensure .melt() targets only year columns. If needed, switch cols to a stricter pattern (for example value_vars=[col for col in pop_df.columns if re.match(r'^pop_?\d{4}$', col)]).
Line 15: Update the year regex if your headers differ. Examples: r'y(\d{4})', or r'(?:^|_)(\d{4})(?:$|_)' to avoid accidental matches.

We add a country column (adm0), keep only the admin and population columns, and then pivot the data into long format so each row is a single adm3 unit per year. We extract the year from the column names and drop the original wide column name. This sets us up for easier joins and indicator calculations later on. The stringr::str_extract(col_name, "\\d{4}") line in R and str.extract(r'(\d{4})') line in Python is doing the work of pulling the 4-digit year (e.g. 2017) from column names like pop2017, using a regex that matches any four-digit number. We then convert this to an integer so it behaves properly in filters or plots.

Quick Tip: Understanding the regex \d{4}

The expression \d{4} is a regular expression (regex) used to match any sequence of exactly four digits.

\d means “any digit (0–9)”
{4} means “repeat the digit pattern exactly four times”

For the R code, the reason we use double backslashes (\\) is because R treats a single backslash as an escape character. So to pass a literal \d to the regex engine, we need to write it as "\\d".

So str_extract(col_name, "\\d{4}") pulls out the 4-digit year from strings like "pop2017", returning "2017".

Step 2.2: What Should My Final Data Looklike?

Your population dataset should now follow a tidy, long format: one row per admin unit and year, with a single column holding population values. This structure is essential for analysis in the SNT pipeline, where population data must align cleanly with other spatial and time-based datasets like routine case data or survey indicators.

Working in long format also makes it easier to run group-wise operations, join across sources, and generate time trends. A properly reshaped dataset should resemble the example below:

adm0	adm1	adm2	adm3	pop	year
SIERRA LEONE	SOUTHERN	BO	BADJIA	2015	8135
SIERRA LEONE	SOUTHERN	BO	BADJIA	2016	8369
SIERRA LEONE	SOUTHERN	BO	BADJIA	2017	8602
SIERRA LEONE	SOUTHERN	BO	BADJIA	2018	8834
SIERRA LEONE	SOUTHERN	BO	BADJIA	2019	9063

Step 3: Admin Unit Alignment

Before using any population dataset in the SNT workflow, it’s essential to ensure that all administrative levels—such as adm0, adm1, adm2, and/or adm3—are fully aligned with the shapefile being used in the analysis. This means checking that names, structures, and geographic references match exactly between the population data and your spatial boundaries. Inconsistent or missing admin values can cause join errors, distort summaries, or lead to misinterpretation in stratification outputs. This step focuses on aligning and validating the administrative structure across sources to ensure consistency and compatibility.

The population data appears to be missing adm1 (province-level) information. To fill this gap, we merge in adm1 values using a lookup extracted from the shapefile. This highlights the broader need to ensure that population data is fully aligned with the administrative boundaries used elsewhere in the analysis. Consistent naming and structure across datasets is essential for clean joins, accurate aggregation, and meaningful comparisons in the SNT workflow. After merging, we reorder the columns for clarity. The final output should be a cleaned and properly structured population file—aligned with your administrative boundaries and ready for use in epidemiological stratification.

Step 3.1: Align Admin Levels Using Shapefile

# get shapefile
# create admin lookup df
shp_lookup  <-  shp_adm3 |>
  sf::st_drop_geometry() |>
  dplyr::distinct()

# check results
pop_long |> head()

Output

# A tibble: 6 × 5
  adm0         adm2  adm3     pop  year
  <chr>        <chr> <chr>  <dbl> <int>
1 SIERRA LEONE Bo    Badjia  8135  2015
2 SIERRA LEONE Bo    Badjia  8369  2016
3 SIERRA LEONE Bo    Badjia  8602  2017
4 SIERRA LEONE Bo    Badjia  8834  2018
5 SIERRA LEONE Bo    Badjia  9063  2019
6 SIERRA LEONE Bo    Badjia  9291  2020

To adapt the code:

This section uses example data from GitHub. If you have local data or use a different format, follow these steps:

Line 4: Update dplyr::distinct() if your shapefile uses different admin levels or column names. For example, use dplyr::distinct(adm0, adm1) if we want distinct admin levels up to adm1.

Once updated, run the code to load your admin reference file.

# get shapefile
# create admin lookup df
shp_lookup = (shp_adm3
.drop(columns=['geometry'])
.drop_duplicates()
)

# check results
print(pop_long.head())

Output

           adm0 adm2    adm3   pop  year
0  SIERRA LEONE   Bo  Badjia  8135  2015
1  SIERRA LEONE   Bo  Badjia  8369  2016
2  SIERRA LEONE   Bo  Badjia  8602  2017
3  SIERRA LEONE   Bo  Badjia  8834  2018
4  SIERRA LEONE   Bo  Badjia  9063  2019

To adapt the code:

This section uses example data from GitHub. If you have local data or use a different format, follow these steps:

Line 5: Update .drop_duplicates() if your shapefile uses different admin levels or column names. For example, use .drop_duplicates(subset=['adm0', 'adm1']) if we want distinct admin levels up to adm1.

Once updated, run the code to load your admin reference file.

In our example dataset, the population file includes adm0, adm2, and adm3, but is missing adm1 (province). However, your file may contain different gaps. Some may already include adm1, while others might require different corrections. This step is simply one illustration of how to address alignment issues.

To resolve this, we extract a lookup table from the shapefile and merge the adm1 values back in.

# join the adm1 from shapefile lookup df
pop_long2 <- pop_long |>
  dplyr::mutate(
    adm2 = toupper(adm2),
    adm3 = toupper(adm3)) |>
  dplyr::left_join(
    dplyr::distinct(shp_lookup, adm1, adm2),
    by = "adm2"
  ) |>
  # reorder columns
  dplyr::select(
    adm0,
    adm1,
    adm2,
    adm3,
    year,
    pop
  )

# check results after adding in adm1
pop_long2 |> head()

Output

# A tibble: 6 × 6
  adm0         adm1     adm2  adm3    year   pop
  <chr>        <chr>    <chr> <chr>  <int> <dbl>
1 SIERRA LEONE SOUTHERN BO    BADJIA  2015  8135
2 SIERRA LEONE SOUTHERN BO    BADJIA  2016  8369
3 SIERRA LEONE SOUTHERN BO    BADJIA  2017  8602
4 SIERRA LEONE SOUTHERN BO    BADJIA  2018  8834
5 SIERRA LEONE SOUTHERN BO    BADJIA  2019  9063
6 SIERRA LEONE SOUTHERN BO    BADJIA  2020  9291

To adapt the code:

Line 8: Adjust the join key to match your data. If your population data aligns at adm1, change by = "adm2" to by = "adm1". If at adm3, change to by = "adm3".
- Make sure the columns you join on actually exist and match in both datasets.
Column Names: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.

Once updated, run the code to join your pop data to your admin data.

# join the adm1 from shapefile lookup df
pop_long2 = (pop_long
.assign(
    adm2=lambda x: x['adm2'].str.upper(),
    adm3=lambda x: x['adm3'].str.upper()
)
.merge(
    shp_lookup[['adm1', 'adm2']].drop_duplicates(),
    on='adm2',
    how='left'
)
# reorder columns
[['adm0', 'adm1', 'adm2', 'adm3', 'year', 'pop']]
.sort_values(['adm3', 'year']) # this is optional (added to match R output ordering)
.reset_index(drop=True)
)

# check results after adding in adm1
print(pop_long2.head())

Output

           adm0      adm1 adm2    adm3  year   pop
0  SIERRA LEONE  SOUTHERN   BO  BADJIA  2015  8135
1  SIERRA LEONE  SOUTHERN   BO  BADJIA  2016  8369
2  SIERRA LEONE  SOUTHERN   BO  BADJIA  2017  8602
3  SIERRA LEONE  SOUTHERN   BO  BADJIA  2018  8834
4  SIERRA LEONE  SOUTHERN   BO  BADJIA  2019  9063

To adapt the code:

Line 9: Adjust the join key to match your data. If your population data aligns at adm1, change on = 'adm2' to on = 'adm1'. If at adm3, change to on = 'adm3'.
- Make sure the columns you join on actually exist and match in both datasets.
Column Names: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.

Once updated, run the code to join your pop data to your admin data.

We join adm1 values into the population dataset using a lookup table derived from the shapefile. This ensures the admin structure is consistent with other spatial datasets. After merging, we reorder the columns for clarity. The resulting file should now contain adm0, adm1, adm2, and adm3, in tidy format, and aligned with the administrative units used elsewhere in your analysis. The output above confirms that adm1 is now present in the data.

Step 3.2: Check for Admin Value Mismatches

Now to ensure consistency, we extract the distinct combinations of admin columns from both datasets and compare them. Any mismatch could signal a misspelled district name, missing entry, or a structural issue in the merge.

We extract unique combinations of adm0 to adm3 from both datasets to check alignment. Using !!!rlang::syms(keys) dynamically passes column names stored in keys, and is equivalent to writing shp_lookup |> dplyr::distinct(adm0, adm1, adm2, adm3), but more flexible and reusable. Using anti_join() in both directions helps catch any mismatches across sources.

# extract unique admin combinations
keys = c("adm0", "adm1", "adm2", "adm3")
lookup_keys <- pop_long2 |> dplyr::distinct(!!!rlang::syms(keys))
pop_keys <- shp_lookup |> dplyr::distinct(!!!rlang::syms(keys))

# identify mismatches (either direction)
mismatches <-
  dplyr::anti_join(
    pop_keys, lookup_keys, by = keys) |>
  dplyr::bind_rows(
    dplyr::anti_join(lookup_keys, pop_keys, by = keys)
  )

# check if all admin level combinations (adm0–adm3) match between sources
if (nrow(mismatches) == 0) {
  cli::cli_alert_success(
    "admin levels (adm0 to adm3) match between pop data and shapefile."
  )
} else {
  cli::cli_alert_danger(
    "Mismatch detected in admin levels between pop data and shapefile."
  )
}

Output

✖ Mismatch detected in admin levels between pop data and shapefile.

To adapt the code:

Line 2: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.
- Make sure the columns you join on actually exist and match in both datasets.

Once updated, run the code to check admin mismatches.

We extract unique combinations of adm0 to adm3 from both datasets to check alignment. Using .merge() with indicator=True and filtering for 'left_only' in both directions helps catch any mismatches across sources.

# extract unique admin combinations
keys = ['adm0', 'adm1', 'adm2', 'adm3']
lookup_keys = pop_long2[keys].drop_duplicates()
pop_keys = shp_lookup[keys].drop_duplicates()

# identify mismatches (either direction)
mismatches = pd.concat([
    pop_keys.merge(lookup_keys, on=keys, how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns=['_merge']),
    lookup_keys.merge(pop_keys, on=keys, how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns=['_merge'])
], ignore_index=True)

# check if all admin level combinations (adm0–adm3) match between sources
if len(mismatches) == 0:
    print("✓ admin levels (adm0 to adm3) match between pop data and shapefile.")
else:
    print("✗ Mismatch detected in admin levels between pop data and shapefile.")

Output

✗ Mismatch detected in admin levels between pop data and shapefile.

To adapt the code:

Line 2: Adjust adm0, adm1, adm2, adm3 to reflect your own shapefile structure.
- Make sure the columns you join on actually exist and match in both datasets.

Once updated, run the code to check admin mismatches.

Step 3.3: Resolving Mismatches in Admin Units

If mismatches are found—due to spelling differences, inconsistent naming conventions, or boundary changes—you’ll need to harmonise the administrative names between the population data and the shapefile. One recommended approach is to use the prep_geonames() function from the sntutils package.

This function combines algorithmic matching (e.g., string distance) with user-guided correction to standardise administrative units across datasets. It supports up to six hierarchical levels and can cache decisions for future reuse—so you don’t need to repeat the same matching steps. These cached corrections can also be shared with colleagues, saving time and ensuring consistent naming across team members.

We don’t cover all details here, but below is a basic setup. For a full walkthrough—including both interactive and scripted use—see Merging shapefiles with tabular data section of the code library. That section explains how to integrate prep_geonames() into your cleaning workflow and how to save and reuse decisions efficiently.

# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")

# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"

# Harmonise admin names between population data and shapefile
pop_harmonised <-
  sntutils::prep_geonames(
    target_df = pop_long2,
    lookup_df = lookup_keys,
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    interactive = TRUE,
    cache_path = here::here(cache_path, "geoname_decisions.rds")
  )

Output

── ℹ Match Summary ─────────────────────────────────────────────────────────────

✔ Hierarchies are aligned across data and lookup.


Target data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 16 matched                                       
• adm3 (level3): 208 out of 208 matched                                     
Lookup data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 5 out of 5 matched                                         
• adm2 (level2): 16 out of 16 matched                                       
• adm3 (level3): 208 out of 208 matched

✔ All records matched; process completed. Exiting...

To adapt the code:

Line 11: Replace pop_long2 with your population dataset.
Line 12: place lookup_keys with your lookup or reference dataset.
Line 13-16: Adjust level0, level1, level2, and level3 to match your actual admin column names if they are named differently in your dataset (e.g., “country”, “region”, “district”, “ward”).
- Make sure the columns you join on actually exist and match in both datasets
Line 6 and 18: Update the path if you want to save the geoname decisions to a different location.

Once updated, run the code to harmonise pop data admin names with shapefile ones.

Terminal Installation Required

Install the sntutils-py package in your terminal, if not already installed. If you need help installing packages, please refer to the Getting Started page.

pip install git+https://github.com/ahadi-analytics/sntutils-py.git

# load the prep_geonames function from the sntutils-py package 
from sntutils.geo import prep_geonames

# set up location to save cache
cache_path = "1.1_foundational/1d_cache_files"

# Harmonise admin names between population data and shapefile
pop_harmonised = prep_geonames(
    target_df=pop_long2,
    lookup_df=lookup_keys,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    method="jw",
    cache_path=here(cache_path, "geoname_decisions.csv")
    preserve_case=True
)

Output


ℹ Match Summary

╭───────────────┬─────────┬───────────────┬───────────────╮
│ Level         │ Matched │ Target Data N │ Lookup Data N │
├───────────────┼─────────┼───────────────┼───────────────┤
│ adm0 (level0) │    1    │       1       │       1       │
│ adm1 (level1) │    5    │       5       │       5       │
│ adm2 (level2) │   16    │      16       │      16       │
│ adm3 (level3) │   208   │      208      │      208      │
╰───────────────┴─────────┴───────────────┴───────────────╯
✓ Hierarchies are aligned across data and lookup           

Success: All records matched; process completed. Exiting...

To adapt the code:

Line 9: Replace pop_long2 with your population dataset.
Line 10: place lookup_keys with your lookup or reference dataset.
Line 11-14: Adjust level0, level1, level2, and level3 to match your actual admin column names if they are named differently in your dataset (e.g., “country”, “region”, “district”, “ward”).
- Make sure the columns you join on actually exist and match in both datasets
Line 5 and 16: Update the path if you want to save the geoname decisions to a different location.
Line 15: Replace method="jw" with method="lv" to switch distance metrics:
- Jaro-Winkler ("jw") - Recommended for geographic/administrative names
- Levenshtein ("lv") - Better for general typos/spelling errors

Once updated, run the code to harmonise pop data admin names with shapefile ones.

This confirms that all administrative levels have been successfully aligned between your dataset and the reference. If mismatches remain, the tool will prompt you to resolve them interactively.

Step 4: Aggregating Population Data

With the population data now reshaped and harmonized across administrative levels, we can begin exploring it through basic checks and summaries. As a first step, we aggregate the population totals at different administrative units of interest:

# summarize total population by adm0
pop_summary_adm0 <- pop_long2 |>
  dplyr::group_by(adm0, year) |>
  dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))

# summarize total population by adm1
pop_summary_adm1 <- pop_long2 |>
  dplyr::group_by(adm1, year) |>
  dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))

# summarize total population by adm2
pop_summary_adm2 <- pop_long2 |>
  dplyr::group_by(adm2, year) |>
  dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))

# summarize total population by adm3
pop_summary_adm3 <- pop_long2

# check to see if aggregation worked
head(pop_summary_adm1, n = 9)

Output

# A tibble: 9 × 3
  adm1     year pop_total
  <chr>   <int>     <dbl>
1 EASTERN  2015   1642370
2 EASTERN  2016   1689678
3 EASTERN  2017   1736710
4 EASTERN  2018   1783417
5 EASTERN  2019   1829792
6 EASTERN  2020   1875847
7 EASTERN  2021   1921600
8 EASTERN  2022   1967077
9 EASTERN  2023   2012312

To adapt the code:

No major changes are needed here.
- The grouping levels (adm0, adm1, adm2, adm3) follow from the earlier joining and harmonization steps.

# summarize total population by adm0
pop_summary_adm0 = (pop_long2
.groupby(['adm0', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)

# summarize total population by adm1
pop_summary_adm1 = (pop_long2
.groupby(['adm1', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)

# summarize total population by adm2
pop_summary_adm2 = (pop_long2
.groupby(['adm2', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)

# summarize total population by adm3
pop_summary_adm3 = pop_long2.copy()

# check to see if aggregation worked
print(pop_summary_adm1.head(9))

Output

      adm1  year  pop_total
0  EASTERN  2015    1642370
1  EASTERN  2016    1689678
2  EASTERN  2017    1736710
3  EASTERN  2018    1783417
4  EASTERN  2019    1829792
5  EASTERN  2020    1875847
6  EASTERN  2021    1921600
7  EASTERN  2022    1967077
8  EASTERN  2023    2012312

To adapt the code:

No major changes are needed here.
- The grouping levels (adm0, adm1, adm2, adm3) follow from the earlier joining and harmonization steps.

Step 5: Visualizing Aggregated Population Data

Step 5.1: Prepare Population Data for Mapping

Once we’ve aggregated the population data and aligned it with the administrative shapefiles, the next step is to prepare it for visualization. This includes merging population totals into the shapefile and assigning population ranges.

To make visual patterns easier to interpret, population values are grouped into bins. Any location with a value of 0 or NA is placed in a special “Missing or 0” category. These will appear in grey on the map, making it easy to spot gaps or misalignments before moving forward with analysis.

# join pop data to shapefile
shp_adm3_pop <- shp_adm3 |>
  dplyr::left_join(
    pop_summary_adm3,
    by = c("adm0", "adm1", "adm2", "adm3"),
    relationship = "many-to-many"
  ) |>
  dplyr::filter(year >= 2020) |>
  dplyr::mutate(
    pop_bin = dplyr::case_when(
      is.na(pop) | pop == 0 ~ "Missing or 0",
      pop <= 15000 ~ "<15k",
      pop <= 30000 ~ "15k–30k",
      pop <= 50000 ~ "30k–50k",
      pop <= 100000 ~ "50k–100k",
      pop > 100000 ~ "100k+"
    ),
    pop_bin = factor(
      pop_bin,
      levels = c(
        "Missing or 0",
        "<15k",
        "15k–30k",
        "30k–50k",
        "50k–100k",
        "100k+"
      )
    )
  )

To adapt the code:

Line 5: If you used a different shapefile or admin join level earlier, update by = “adm3” to the level you used (e.g., “adm2” or “adm1”).
Line 8: Change the year filter if you are interested in a different range (e.g., year >= 2015).
Lines 11–16: You may adjust the population bins to match meaningful ranges for your context.

Once updated, run the code to set up your population data for mapping.

# join pop data to shapefile
shp_adm3_pop = (shp_adm3
.merge(
    pop_summary_adm3,
    on=['adm0', 'adm1', 'adm2', 'adm3'],
    how='left'
)
.query('year >= 2020')
.assign(
    pop_bin=lambda x: pd.cut(
        x['pop'].fillna(0),
        bins=[-np.inf, 0, 15000, 30000, 50000, 100000, np.inf],
        labels=['Missing or 0', '<15k', '15k–30k', '30k–50k', '50k–100k', '100k+'],
        include_lowest=True
    ),
    year=lambda x: x['year'].astype(int)
)
)

To adapt the code:

Line 5: If you used a different shapefile or admin join level earlier, update by = “adm3” to the level you used (e.g., “adm2” or “adm1”).
Line 8: Change the year filter if you are interested in a different range (e.g., year >= 2015).
Line 12: You may adjust the population bins to match meaningful ranges for your context.

Once updated, run the code to set up your population data for mapping.

Step 5.2: Visualize Aggregated Population by Region

We now generate faceted maps showing population distribution at the adm3 level for the years 2020 to 2024. Each facet represents a different year, allowing for visual comparison across time.

Grey regions may indicate missing data or failed joins. These should be reviewed before using the dataset in subsequent steps like stratification or targeting.

Show full code

# generate adm0-level shapefile
shp_adm0 <- shp_adm3 |>
  dplyr::group_by(adm0) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# generate adm1-level shapefile
shp_adm1 <- shp_adm3 |>
    dplyr::group_by(adm1) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# plot pop from 2020 to 2023
adm3_pop_map <- shp_adm3_pop |>
  ggplot2::ggplot() +
  ggplot2::geom_sf(
    ggplot2::aes(fill = pop_bin),
    color = "white",
    size = 0.2
  ) +
  # add adm1 outline as a black border
  ggplot2::geom_sf(
    data = shp_adm1,
    fill = NA, color = "black", size = 0.5
  ) +
  ggplot2::scale_fill_manual(
    name = "Population Range",
    values = c("Missing or 0" = "grey80",
               "<15k" = "#fee5d9",
               "15k–30k" = "#fcae91",
               "30k–50k" = "#fb6a4a",
               "50k–100k" = "#de2d26",
               "100k+" = "#a50f15"),
    na.value = "grey80",
    guide = ggplot2::guide_legend(
      label.position = "bottom",
      title.position = "top"
    )
  ) +
  ggplot2::facet_wrap(~year, nrow = 1) +
  ggplot2::labs(
    title = "Total Population by Chiefdom (adm3)",
    subtitle = "Between 2020 and 2024",
    fill = "Population"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(size = 20), 
    plot.subtitle = ggplot2::element_text(size = 20),
    strip.text = ggplot2::element_text(size = 18),     
    legend.title = ggplot2::element_text(size = 18),    
    legend.text = ggplot2::element_text(size = 18),     
    legend.position = "bottom",
    legend.title.position = "top",
    legend.key.width = grid::unit(1, "cm"),
    axis.text.x = ggplot2::element_text(
      angle = 75, vjust = 1, hjust = 1
    )
  )

Output

To adapt the code:

Line 14 and 42: If you joined at a different admin level (e.g., adm2), update shp_adm3_pop and the map title accordingly.
Line 23: If your admin1 shapefile is named differently, update data = shp_adm1 to your object name.
Line 28-33: Adjust color values and ranges if you defined different population bins.

Once updated, run the code to map your aggregated data.

Show full code

# generate adm0-level shapefile
shp_adm0 = (shp_adm3
.dissolve(by='adm0')
.reset_index()
)

# generate adm1-level shapefile
shp_adm1 = (shp_adm3
.dissolve(by='adm1')
.reset_index()
)

# plot pop from 2020 to 2023
years = sorted(shp_adm3_pop['year'].unique())

adm3_pop_map, axes = plt.subplots(1, len(years), figsize=(12, 5),constrained_layout=True)

# handle case where there's only one year
if len(years) == 1:
    axes = [axes]

# define color mapping
color_map = {
    'Missing or 0': 'lightgrey',
    '<15k': '#fee5d9',
    '15k–30k': '#fcae91', 
    '30k–50k': '#fb6a4a',
    '50k–100k': '#de2d26',
    '100k+': '#a50f15'
}

for i, year in enumerate(years):
    ax = axes[i]
    
    # plot population data
    year_data = shp_adm3_pop[shp_adm3_pop['year'] == year]
    
    for pop_bin, color in color_map.items():
        subset = year_data[year_data['pop_bin'] == pop_bin]
        if not subset.empty:
            subset.plot(ax=ax, 
            color = color, 
            edgecolor = 'white', 
            linewidth = 0.3)
    
    # add adm1 outline as a black border
    shp_adm1.boundary.plot(ax=ax, color='black', linewidth=0.5)
    
    ax.set_title(f'{year}',fontsize = 16)
    ax.set_axis_off()

# add overall title and legend
adm3_pop_map.suptitle('Total Population by Chiefdom (adm3)\nBetween 2020 and 2024', fontsize=20)

# create custom legend
handles = [plt.Rectangle((0,0),1,1, color=color) for color in color_map.values()]
labels = list(color_map.keys())
adm3_pop_map.legend(
  handles, 
  labels, 
  title='Population Range', 
  loc='lower center', 
  bbox_to_anchor=(0.5, -0.03), 
  ncol=6, 
  fontsize=18,
  title_fontsize=18,
  frameon=False,
  columnspacing=0.8,
  handletextpad=0.5)
  
plt.show()

Output

To adapt the code:

Lines 14, 36, 53: If you joined at a different admin level (e.g., adm2), update shp_adm3_pop and the map title accordingly.
Line 47: If your admin1 shapefile is named differently, update shp_adm1 to your object name.
Line 23-29: Adjust color values and ranges if you defined different population bins.
Line 16: Adjust the width and height in figsize=(12, 5) if you need a different image size.

Once updated, run the code to map your aggregated data.

All chiefdoms are shaded on the map, indicating that population values were successfully joined—none are missing or zero for the years 2020 to 2024. This suggests that the merge step worked correctly and there are no gaps in coverage at the adm3 level.

Step 5.3: Save plots

The map is then saved as a high-resolution PNG for future reference.

# save plot
ggplot2::ggsave(
  plot = adm3_pop_map,
  here::here("03_output/3a_figures/pop_check_adm3_2020_2024.png"),
  width = 12, height = 5, dpi = 300)

To adapt the code:

Line 4: Update the filepath to match your folder structure by changing 03_output/3a_figures/.
Line 4: Update the filename by changing pop_check_adm3_2020_2024.png.
Line 5: Adjust the width, height, and dpi if you need a different image size or quality.

Once updated, run the code to save the plot as a PNG file.

# save plot
plt.savefig(
    here("03_output", "3a_figures", "pop_check_adm3_2020_2024.png"),
    dpi=300, 
    bbox_inches='tight'
)

To adapt the code:

Line 3: Update the filepath to match your folder structure by changing "03_output","3a_figures".
Line 3: Update the filename by changing pop_check_adm3_2020_2024.png.
Line 4: Adjust the dpi if you need a different image quality.

Once updated, run the code to save the plot as a PNG file.

Based on the folded code above in Step 5.2, this plot confirms that our population data is complete and well structured. Each adm3 has a population value for all the years we’re interested in (2020–2024), and joins cleanly to the shapefile. This quick visual check helps ensure that the data is aligned before we move on to deeper analysis.

Step 6: Save Data

Once the census population data is cleaned, reshaped, and aligned with your admin boundaries, it’s a good idea to save it in a ready-to-use format. This avoids repeating the cleaning steps and makes it easier to join with other datasets later on.

We save the processed data as .rds, .csv, and .xlsx—so it’s easy to load in R or share with others.

# define output folder
save_path <- here::here(
  "01_data", "1.1_foundational", "1.1c_population", "1ci_country_sourced"
)

# save as RDS
rio::export(
  pop_long,
  here::here(save_path, "sierra_leone_pop_census_2020_2023.rds")
)

# save as CSV
rio::export(
  pop_long,
  here::here(save_path, "sierra_leone_pop_census_2020_2023.csv")
)

# save as XLSX
rio::export(
  pop_long,
  here::here(save_path, "sierra_leone_pop_census_2020_2023.xlsx")
)

To adapt the code:

Line 3: Update save_path to your output directory.
File Names: Change file names (e.g., sierra_leone_pop_census.csv) to reflect your dataset or time range.

Once updated, run the code to save your outputs in both raw and processed formats.

# define output folder
save_path = here("01_data", "1.1_foundational", "1.1c_population", "1ci_country_sourced")
save_path.mkdir(parents=True, exist_ok=True)

# save as CSV
pop_long.to_csv(save_path / "sierra_leone_pop_census_2020_2023.csv", index=False)

# save as XLSX
pop_long.to_excel(save_path / "sierra_leone_pop_census_2020_2023.xlsx", index=False)

To adapt the code:

Line 2: Update save_path to your output directory.
File Names: Change file names (e.g., sierra_leone_pop_census.csv) to reflect your dataset or time range.

Once updated, run the code to save your outputs in both raw and processed formats.

Summary

We’ve now covered how to get census-based population data cleaned, reshaped, and ready for use. We brought in the raw files, reshaped them into a long tidy format, joined admin names using shapefiles, and created aggregated totals across different administrative levels. Everything is now cleaned, structured, and ready to be merged with survey or routine datasets later in the SNT pipeline. You’ll find the full code at the bottom of this section, collapsed for easy reference. It can be reused as-is or tweaked for your own setup—just update the file paths, column names, or admin levels where needed.

Full Code

Find the full code script for processing national population data below.

Show full code

################################################################################
#################### ~ National population data full code ~ ####################
################################################################################

### Step 1: Initial Setup ------------------------------------------------------

#### Step 1.1: Define Functions and Get Packages -------------------------------

# install or load relevant packages
pacman::p_load(
  tidyverse,   # core tidy tools (includes dplyr, tidyr, readr, etc.)
  readxl,      # read Excel files
  fs,          # handle file paths
  purrr,       # iterate over lists (map, walk, etc.)
  rlang,       # for tidy evaluation and error messaging
  devtools     # for download github packages
)

#### Step 1.2: Import Data -----------------------------------------------------

# import shapefile
shp_adm3 <- sf::read_sf(
  here::here(
    "1.1_foundational",
    "1.1a_administrative_boundaries",
    "Chiefdom2021.shp"
  )
) |>
  dplyr::mutate(adm0 = "SIERRA LEONE") |>
  dplyr::select(
    adm0,
    adm1 = FIRST_REGI,
    adm2 = FIRST_DNAM,
    adm3 = FIRST_CHIE
  ) |>
  sf::st_as_sf()

# get the pop data
pop_df <- rio::import(
  here::here(
    "1.1_foundational",
    "1.1c_population",
    "1.1ci_national_population",
    "raw",
    "sle_pop_data.xlsx"
  )
)

# check the top 6 rows
head(pop_df)

### Step 2: Transform Data -----------------------------------------------------

#### Step 2.1: Reshape Wide Data to Long Format --------------------------------

pop_long <- pop_df |>
  # add country column
  dplyr::mutate(adm0 = "SIERRA LEONE") |>
  # keep only admin and population columns
  dplyr::select(adm0,
                adm2,
                adm3,
                dplyr::starts_with("pop")) |>
  # reshape to long format: one row per year per adm3
  tidyr::pivot_longer(
    cols = dplyr::starts_with("pop"),
    names_to = "col_name",
    values_to = "pop"
  ) |>
  # extract year from column name and drop the original
  dplyr::mutate(
    year = stringr::str_extract(col_name, "\\d{4}"),
    year = as.integer(year),
    .keep = "unused"
  )

### Step 3: Admin Unit Alignment -----------------------------------------------

#### Step 3.1: Align Admin Levels Using Shapefile ------------------------------

# get shapefile
# create admin lookup df
shp_lookup  <-  shp_adm3 |>
  sf::st_drop_geometry() |>
  dplyr::distinct()

# check results
pop_long |> head()

# join the adm1 from shapefile lookup df
pop_long2 <- pop_long |>
  dplyr::mutate(
    adm2 = toupper(adm2),
    adm3 = toupper(adm3)) |>
  dplyr::left_join(
    dplyr::distinct(shp_lookup, adm1, adm2),
    by = "adm2"
  ) |>
  # reorder columns
  dplyr::select(
    adm0,
    adm1,
    adm2,
    adm3,
    year,
    pop
  )

# check results after adding in adm1
pop_long2 |> head()

#### Step 3.2: Check for Admin Value Mismatches --------------------------------

# extract unique admin combinations
keys = c("adm0", "adm1", "adm2", "adm3")
lookup_keys <- pop_long2 |> dplyr::distinct(!!!rlang::syms(keys))
pop_keys <- shp_lookup |> dplyr::distinct(!!!rlang::syms(keys))

# identify mismatches (either direction)
mismatches <-
  dplyr::anti_join(
    pop_keys, lookup_keys, by = keys) |>
  dplyr::bind_rows(
    dplyr::anti_join(lookup_keys, pop_keys, by = keys)
  )

# check if all admin level combinations (adm0–adm3) match between sources
if (nrow(mismatches) == 0) {
  cli::cli_alert_success(
    "admin levels (adm0 to adm3) match between pop data and shapefile."
  )
} else {
  cli::cli_alert_danger(
    "Mismatch detected in admin levels between pop data and shapefile."
  )
}

#### Step 3.3: Resolving Mismatches in Admin Units -----------------------------

# Install the sntutils package from GitHub
# has a number of useful helper functions
devtools::install_github("ahadi-analytics/sntutils")

# set up location to save cache
cache_path <- "1.1_foundational/1d_cache_files"

# Harmonise admin names between population data and shapefile
pop_harmonised <-
  sntutils::prep_geonames(
    target_df = pop_long2,
    lookup_df = lookup_keys,
    level0 = "adm0",
    level1 = "adm1",
    level2 = "adm2",
    level3 = "adm3",
    interactive = TRUE,
    cache_path = here::here(cache_path, "geoname_decisions.rds")
  )

### Step 4: Aggregating Population Data ----------------------------------------

# summarize total population by adm0
pop_summary_adm0 <- pop_long2 |>
  dplyr::group_by(adm0, year) |>
  dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))

# summarize total population by adm1
pop_summary_adm1 <- pop_long2 |>
  dplyr::group_by(adm1, year) |>
  dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))

# summarize total population by adm2
pop_summary_adm2 <- pop_long2 |>
  dplyr::group_by(adm2, year) |>
  dplyr::reframe(pop_total = sum(pop, na.rm = TRUE))

# summarize total population by adm3
pop_summary_adm3 <- pop_long2

# check to see if aggregation worked
head(pop_summary_adm1, n = 9)

### Step 5: Visualizing Aggregated Population Data -----------------------------

#### Step 5.1: Prepare Population Data for Mapping -----------------------------

# join pop data to shapefile
shp_adm3_pop <- shp_adm3 |>
  dplyr::left_join(
    pop_summary_adm3,
    by = c("adm0", "adm1", "adm2", "adm3"),
    relationship = "many-to-many"
  ) |>
  dplyr::filter(year >= 2020) |>
  dplyr::mutate(
    pop_bin = dplyr::case_when(
      is.na(pop) | pop == 0 ~ "Missing or 0",
      pop <= 15000 ~ "<15k",
      pop <= 30000 ~ "15k–30k",
      pop <= 50000 ~ "30k–50k",
      pop <= 100000 ~ "50k–100k",
      pop > 100000 ~ "100k+"
    ),
    pop_bin = factor(
      pop_bin,
      levels = c(
        "Missing or 0",
        "<15k",
        "15k–30k",
        "30k–50k",
        "50k–100k",
        "100k+"
      )
    )
  )

#### Step 5.2: Visualize Aggregated Population by Region -----------------------

# generate adm0-level shapefile
shp_adm0 <- shp_adm3 |>
  dplyr::group_by(adm0) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# generate adm1-level shapefile
shp_adm1 <- shp_adm3 |>
    dplyr::group_by(adm1) |>
  dplyr::summarise(.groups = "drop") |>
  sf::st_union(by_feature = TRUE)

# plot pop from 2020 to 2023
adm3_pop_map <- shp_adm3_pop |>
  ggplot2::ggplot() +
  ggplot2::geom_sf(
    ggplot2::aes(fill = pop_bin),
    color = "white",
    size = 0.2
  ) +
  # add adm1 outline as a black border
  ggplot2::geom_sf(
    data = shp_adm1,
    fill = NA, color = "black", size = 0.5
  ) +
  ggplot2::scale_fill_manual(
    name = "Population Range",
    values = c("Missing or 0" = "grey80",
               "<15k" = "#fee5d9",
               "15k–30k" = "#fcae91",
               "30k–50k" = "#fb6a4a",
               "50k–100k" = "#de2d26",
               "100k+" = "#a50f15"),
    na.value = "grey80",
    guide = ggplot2::guide_legend(
      label.position = "bottom",
      title.position = "top"
    )
  ) +
  ggplot2::facet_wrap(~year, nrow = 1) +
  ggplot2::labs(
    title = "Total Population by Chiefdom (adm3)",
    subtitle = "Between 2020 and 2024",
    fill = "Population"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(size = 20), 
    plot.subtitle = ggplot2::element_text(size = 20),
    strip.text = ggplot2::element_text(size = 18),     
    legend.title = ggplot2::element_text(size = 18),    
    legend.text = ggplot2::element_text(size = 18),     
    legend.position = "bottom",
    legend.title.position = "top",
    legend.key.width = grid::unit(1, "cm"),
    axis.text.x = ggplot2::element_text(
      angle = 75, vjust = 1, hjust = 1
    )
  )

#### Step 5.3: Save plots ------------------------------------------------------

# save plot
ggplot2::ggsave(
  plot = adm3_pop_map,
  here::here("03_output/3a_figures/pop_check_adm3_2020_2024.png"),
  width = 12, height = 5, dpi = 300)

### Step 6: Save Data ----------------------------------------------------------

# define output folder
save_path <- here::here(
  "01_data", "1.1_foundational", "1.1c_population", "1ci_country_sourced"
)

# save as RDS
rio::export(
  pop_long,
  here::here(save_path, "sierra_leone_pop_census_2020_2023.rds")
)

# save as CSV
rio::export(
  pop_long,
  here::here(save_path, "sierra_leone_pop_census_2020_2023.csv")
)

# save as XLSX
rio::export(
  pop_long,
  here::here(save_path, "sierra_leone_pop_census_2020_2023.xlsx")
)

Show full code

################################################################################
#################### ~ National population data full code ~ ####################
################################################################################

### Step 1: Initial Setup ------------------------------------------------------

#### Step 1.1: Define Functions and Get Packages -------------------------------

# load relevant packages
import pandas as pd               # data manipulation
import geopandas as gpd           # spatial data handling 
import numpy as np                # numerical operations 
import matplotlib.pyplot as plt   # plotting and visualization
import seaborn as sns             # statistical plotting
from pyhere import here           # handle file paths
import re                         # string operations

#### Step 1.2: Import Data -----------------------------------------------------

# import shapefile
shp_adm3 = (gpd.read_file(
    here(
        "1.1_foundational",
        "1.1a_administrative_boundaries",
        "Chiefdom2021.shp"
    )
)
.assign(adm0="SIERRA LEONE")
.rename(columns={
    'FIRST_REGI': 'adm1',
    'FIRST_DNAM': 'adm2', 
    'FIRST_CHIE': 'adm3'
})
[['adm0', 'adm1', 'adm2', 'adm3', 'geometry']]
)

# get the pop data
pop_df = pd.read_excel(
    here(
        "1.1_foundational",
        "1.1c_population",
        "1.1ci_national_population",
        "raw",
        "sle_pop_data.xlsx"
    )
)

# check the top 6 rows
print(pop_df.head(6))

### Step 2: Transform Data -----------------------------------------------------

#### Step 2.1: Reshape Wide Data to Long Format --------------------------------

pop_long = (pop_df
# add country column
.assign(adm0="SIERRA LEONE")
# keep only admin and population columns
.filter(regex=r'^(adm0|adm2|adm3|pop)')
# reshape to long format: one row per year per adm3
.melt(
    id_vars=['adm0', 'adm2', 'adm3'],
    value_vars=[col for col in pop_df.columns if col.startswith('pop')],
    var_name='col_name',
    value_name='pop'
)
# extract year from column name and drop the original
.assign(
    year=lambda x: x['col_name'].str.extract(r'(\d{4})')[0].astype(int)
)
.drop(columns=['col_name'])
.sort_values(['adm3', 'year']) # this is optional (added to match R output ordering)
.reset_index(drop=True)
)

### Step 3: Admin Unit Alignment -----------------------------------------------

#### Step 3.1: Align Admin Levels Using Shapefile ------------------------------

# get shapefile
# create admin lookup df
shp_lookup = (shp_adm3
.drop(columns=['geometry'])
.drop_duplicates()
)

# check results
print(pop_long.head())

# join the adm1 from shapefile lookup df
pop_long2 = (pop_long
.assign(
    adm2=lambda x: x['adm2'].str.upper(),
    adm3=lambda x: x['adm3'].str.upper()
)
.merge(
    shp_lookup[['adm1', 'adm2']].drop_duplicates(),
    on='adm2',
    how='left'
)
# reorder columns
[['adm0', 'adm1', 'adm2', 'adm3', 'year', 'pop']]
.sort_values(['adm3', 'year']) # this is optional (added to match R output ordering)
.reset_index(drop=True)
)

# check results after adding in adm1
print(pop_long2.head())

#### Step 3.2: Check for Admin Value Mismatches --------------------------------

# extract unique admin combinations
keys = ['adm0', 'adm1', 'adm2', 'adm3']
lookup_keys = pop_long2[keys].drop_duplicates()
pop_keys = shp_lookup[keys].drop_duplicates()

# identify mismatches (either direction)
mismatches = pd.concat([
    pop_keys.merge(lookup_keys, on=keys, how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns=['_merge']),
    lookup_keys.merge(pop_keys, on=keys, how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns=['_merge'])
], ignore_index=True)

# check if all admin level combinations (adm0–adm3) match between sources
if len(mismatches) == 0:
    print("✓ admin levels (adm0 to adm3) match between pop data and shapefile.")
else:
    print("✗ Mismatch detected in admin levels between pop data and shapefile.")
    
#### Step 3.3: Resolving Mismatches in Admin Units -----------------------------

# load the prep_geonames function from the sntutils-py package 
from sntutils.geo import prep_geonames

# set up location to save cache
cache_path = "1.1_foundational/1d_cache_files"

# Harmonise admin names between population data and shapefile
pop_harmonised = prep_geonames(
    target_df=pop_long2,
    lookup_df=lookup_keys,
    level0="adm0",
    level1="adm1",
    level2="adm2",
    level3="adm3",
    method="jw",
    cache_path=here(cache_path, "geoname_decisions.csv")
    preserve_case=True
)
  
### Step 4: Aggregating Population Data ----------------------------------------

# summarize total population by adm0
pop_summary_adm0 = (pop_long2
.groupby(['adm0', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)

# summarize total population by adm1
pop_summary_adm1 = (pop_long2
.groupby(['adm1', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)

# summarize total population by adm2
pop_summary_adm2 = (pop_long2
.groupby(['adm2', 'year'])
.agg({'pop': 'sum'})
.reset_index()
.rename(columns={'pop': 'pop_total'})
)

# summarize total population by adm3
pop_summary_adm3 = pop_long2.copy()

# check to see if aggregation worked
print(pop_summary_adm1.head(9))

### Step 5: Visualizing Aggregated Population Data -----------------------------

#### Step 5.1: Prepare Population Data for Mapping -----------------------------

# join pop data to shapefile
shp_adm3_pop = (shp_adm3
.merge(
    pop_summary_adm3,
    on=['adm0', 'adm1', 'adm2', 'adm3'],
    how='left'
)
.query('year >= 2020')
.assign(
    pop_bin=lambda x: pd.cut(
        x['pop'].fillna(0),
        bins=[-np.inf, 0, 15000, 30000, 50000, 100000, np.inf],
        labels=['Missing or 0', '<15k', '15k–30k', '30k–50k', '50k–100k', '100k+'],
        include_lowest=True
    ),
    year=lambda x: x['year'].astype(int)
)
)

#### Step 5.2: Visualize Aggregated Population by Region -----------------------

# generate adm0-level shapefile
shp_adm0 = (shp_adm3
.dissolve(by='adm0')
.reset_index()
)

# generate adm1-level shapefile
shp_adm1 = (shp_adm3
.dissolve(by='adm1')
.reset_index()
)

# plot pop from 2020 to 2023
years = sorted(shp_adm3_pop['year'].unique())

adm3_pop_map, axes = plt.subplots(1, len(years), figsize=(12, 5),constrained_layout=True)

# handle case where there's only one year
if len(years) == 1:
    axes = [axes]

# define color mapping
color_map = {
    'Missing or 0': 'lightgrey',
    '<15k': '#fee5d9',
    '15k–30k': '#fcae91', 
    '30k–50k': '#fb6a4a',
    '50k–100k': '#de2d26',
    '100k+': '#a50f15'
}

for i, year in enumerate(years):
    ax = axes[i]
    
    # plot population data
    year_data = shp_adm3_pop[shp_adm3_pop['year'] == year]
    
    for pop_bin, color in color_map.items():
        subset = year_data[year_data['pop_bin'] == pop_bin]
        if not subset.empty:
            subset.plot(ax=ax, 
            color = color, 
            edgecolor = 'white', 
            linewidth = 0.3)
    
    # add adm1 outline as a black border
    shp_adm1.boundary.plot(ax=ax, color='black', linewidth=0.5)
    
    ax.set_title(f'{year}',fontsize = 16)
    ax.set_axis_off()

# add overall title and legend
adm3_pop_map.suptitle('Total Population by Chiefdom (adm3)\nBetween 2020 and 2024', fontsize=20)

# create custom legend
handles = [plt.Rectangle((0,0),1,1, color=color) for color in color_map.values()]
labels = list(color_map.keys())
adm3_pop_map.legend(
  handles, 
  labels, 
  title='Population Range', 
  loc='lower center', 
  bbox_to_anchor=(0.5, -0.03), 
  ncol=6, 
  fontsize=18,
  title_fontsize=18,
  frameon=False,
  columnspacing=0.8,
  handletextpad=0.5)
  
plt.show()

#### Step 5.3: Save plots ------------------------------------------------------

# save plot
plt.savefig(
    here("03_output", "3a_figures", "pop_check_adm3_2020_2024.png"),
    dpi=300, 
    bbox_inches='tight'
)

### Step 6: Save Data ----------------------------------------------------------

# define output folder
save_path = here("01_data", "1.1_foundational", "1.1c_population", "1ci_country_sourced")
save_path.mkdir(parents=True, exist_ok=True)

# save as CSV
pop_long.to_csv(save_path / "sierra_leone_pop_census_2020_2023.csv", index=False)

# save as XLSX
pop_long.to_excel(save_path / "sierra_leone_pop_census_2020_2023.xlsx", index=False)