Spatial data overview

Beginner

Overview

Spatial data is central to the SNT process. It allows the SNT team to quantify malaria burden, assess intervention coverage, and target strategies to specific geographic areas. Each step of the analysis is grounded in spatial units that align with the operational needs of decision-making, such as districts, chiefdoms, or program-defined areas.

SNT workflows rely on two primary types of spatial data: vector and raster. Vector data captures discrete geographic features such as administrative boundaries, roads, and health facility locations (point data can also be stored as a table of coordinates with associated metadata), whereas raster data represents continuous surfaces such as rainfall, temperature, or population density. Although “spatial vector data” is the formally correct term, this documentation uses “shapefile” throughout because it is the term most widely used in public health contexts and malaria programs. More detailed definitions of spatial data types and formats are covered in the following sections.
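To make the vector/raster distinction concrete, here is a minimal sketch using only plain Python structures. The facility, district, rainfall values, and the helper `cell_value` are all made up for illustration; real workflows would typically use dedicated spatial libraries rather than hand-built dictionaries.

```python
# Vector data: discrete features with explicit coordinates plus attributes.
# Hypothetical health facility (point) and district boundary (polygon ring).
facility = {"name": "Facility A", "lon": -11.75, "lat": 8.45}
district = {
    "name": "District X",
    "ring": [(-12.0, 8.0), (-11.0, 8.0), (-11.0, 9.0), (-12.0, 9.0), (-12.0, 8.0)],
}

# Raster data: a regular grid of cell values plus georeferencing information.
# Made-up 2 x 2 rainfall grid covering the same extent; row 0 is the northernmost row.
rainfall = {
    "x_min": -12.0,    # longitude of the left edge
    "y_max": 9.0,      # latitude of the top edge
    "cell_size": 0.5,  # cell width and height in degrees
    "values": [[120.0, 135.0], [98.0, 110.0]],
}


def cell_value(raster, lon, lat):
    """Return the value of the raster cell that contains (lon, lat)."""
    col = int((lon - raster["x_min"]) // raster["cell_size"])
    row = int((raster["y_max"] - lat) // raster["cell_size"])
    n_rows, n_cols = len(raster["values"]), len(raster["values"][0])
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise ValueError("Point lies outside the raster extent")
    return raster["values"][row][col]


# Look up the rainfall cell under the facility point.
print(cell_value(rainfall, facility["lon"], facility["lat"]))
```

The vector features carry their own coordinates, while the raster stores only values and relies on its origin and cell size to place them on the ground.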

Objectives

Understand the role of shapefile data in SNT workflows
Understand the importance of properly-sourced shapefiles for SNT
Understand spatial data formats and the importance of coordinate reference systems

Spatial Data in the SNT Code Library

The code library contains many pages dedicated to demonstrating how to work with spatial data. For working with shapefiles, see:

Basic shapefile use and visualization - Basics of loading and visualizing shapefiles and associated attribute data
Health facility coordinates and point data - Loading point data from a table and visualizing point data, with the example use case of mapping health facility coordinates
Shapefile management and customization - Validating and troubleshooting shapefiles, and making custom shapefiles
Shapefile merging with tabular data - Combining shapefiles with geographic data from elsewhere in the SNT process to prepare for visualization

SNT workflows also rely on raster data and geospatial model estimates that are available as raster-format outputs. For information on working with raster data, see the pages below:

Working with geospatial model estimates - Extract, visualize, and aggregate raster data to administrative units for SNT analysis, with the example use case of prevalence of malaria infection
WorldPop population rasters - Download, extract, visualize, and aggregate population estimates from WorldPop
Climate and environment data extraction from raster - Download, extract, visualize, and aggregate climate and environmental data
Modeled estimates of malaria mortality and proxies - Download, extract, visualize, and aggregate estimates of all-cause child mortality
Modeled Estimates of Entomological Indicators - Download, extract, visualize, and aggregate modeled estimates of entomological indicators such as relative abundance of vector species and insecticide resistance

Working with Shapefiles

Shapefiles are one of the two primary ways to represent geospatial data, alongside raster data. Shapefiles represent real-world features as points, lines, and polygons, along with associated attributes, and are foundational to most geographic information system (GIS) applications used in SNT workflows.

While the term shapefile is commonly used in public health contexts and malaria programs to refer to geographic boundary data, it is technically one specific file format for storing spatial vector data, developed by Esri in the 1990s. It remains widely used across public health GIS, including in many national malaria programs. However, spatial vector data can also be stored in other formats such as GeoJSON, GeoPackage, and File Geodatabase.

Spatial Vector Formats in SNT Workflows

The following are commonly used spatial vector formats in SNT and public health contexts:

Shapefile (.shp + related files) – a legacy but widely supported format made up of several linked files. Common across public health and general GIS workflows despite limitations (e.g., short field names, limited character encoding).
GeoJSON – a lightweight, human-readable format based on JSON. Common in web mapping and easily supported in R, Python, and GIS tools.
GeoPackage (.gpkg) – an open, single-file format that can store multiple vector (and raster) layers with metadata. Supported across platforms including R, Python, QGIS, and ArcGIS.
File Geodatabase (.gdb) – Esri’s proprietary format for managing multiple layers in a structured folder. Primarily used in ArcGIS but accessible in R and Python.
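As a small illustration of why GeoJSON is described as human-readable, the sketch below parses a made-up FeatureCollection with Python's standard `json` module. The feature, its attribute names (`adm2`, `pop`), and the helper `summarize_features` are hypothetical; in practice a spatial library would normally handle the import.

```python
import json

# A minimal, made-up GeoJSON FeatureCollection with one polygon feature.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"adm2": "District X", "pop": 1000},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-12.0, 8.0], [-11.0, 8.0], [-11.0, 9.0],
                         [-12.0, 9.0], [-12.0, 8.0]]]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)


def summarize_features(collection):
    """Return (geometry type, sorted attribute names) for each feature."""
    return [
        (feature["geometry"]["type"], sorted(feature["properties"]))
        for feature in collection["features"]
    ]


summary = summarize_features(collection)
print(summary)
```

Each feature pairs a geometry with a set of properties, which is the same geometry-plus-attributes structure that the other vector formats store in binary form.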

While spatial vector formats such as Shapefile, GeoJSON, or File Geodatabase differ in how they are stored on disk, once imported into R or Python they are all represented as standard spatial objects (e.g., an sf object in R or a GeoDataFrame in Python). In practice, this means that only the import step depends on the file format; after import, the data are handled the same way in analysis regardless of the original format.

What Is a Shapefile?

In the code library, we primarily use data from Sierra Leone provided in Esri shapefile format (.shp). The Esri shapefile remains one of the most frequently used formats in many national malaria programs and existing SNT workflows. However, we also demonstrate how to import spatial vectors stored in other formats, such as GeoJSON, GeoPackage, and File Geodatabase, to accommodate different format preferences and institutional standards across SNT workflows.

Unlike the other spatial vector formats, the Esri shapefile (.shp) comes with certain structural caveats. Although commonly referred to as a “shapefile”, it is not a single file but a collection of multiple files that work together to store both the geometry and attribute information for spatial features.

For most SNT applications, the three most essential components are:

.shp – contains the geometry of spatial features
.shx – a positional index of the geometry records in the .shp file, used to locate each feature quickly
.dbf – stores the attributes for each feature (in dBASE format)

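These files must be kept together in the same folder with the same base name, and their records are matched by position. Additional files like .prj (projection) or .cpg (character encoding) are often included and are not strictly required to open the data, but without a .prj file the coordinate reference system is undefined and must be confirmed with the data provider and assigned manually.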
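A quick way to confirm that a shapefile arrived complete is to check for its companion files before importing it. The following is a minimal sketch using only the standard library; the function name `check_shapefile_components` and the example path are hypothetical, and it assumes the companion files use lowercase extensions.

```python
from pathlib import Path

REQUIRED = (".shp", ".shx", ".dbf")  # needed to read geometry and attributes
OPTIONAL = (".prj", ".cpg")          # CRS definition and character encoding


def check_shapefile_components(shp_path):
    """Report which companion files are missing next to a .shp file."""
    base = Path(shp_path)
    # with_suffix swaps the extension while keeping the folder and base name.
    found = {ext: base.with_suffix(ext).exists() for ext in REQUIRED + OPTIONAL}
    return {
        "missing_required": [ext for ext in REQUIRED if not found[ext]],
        "missing_optional": [ext for ext in OPTIONAL if not found[ext]],
    }


# Example call with a hypothetical path:
# check_shapefile_components("data/boundaries/adm3.shp")
```

An empty `missing_required` list means the three core files are present; a missing `.prj` is worth raising with whoever supplied the data.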

Choosing and Sourcing Shapefiles for SNT

Before any analysis begins, one of the first tasks of the SNT team is to confirm the lowest operational administrative unit where programmatic decisions and interventions can realistically be made. This choice, whether it is adm1, adm2, or a custom unit, defines the unit of analysis for the entire exercise. It also determines which boundary layer (shapefile) must be used. We should always confirm that this decision has been made before starting any spatial work, or initiate the discussion if it has not yet occurred.

Consult with SNT team

All shapefiles used in SNT must be reviewed and validated by the SNT team and must represent the official national boundary set. This ensures alignment with national standards, guarantees boundary accuracy, and avoids discrepancies in spatial data.

Only boundary data (shapefiles) officially provided by the SNT team should be used for SNT analyses. Publicly available boundary data (such as from HDX or GADM) may be useful for learning or exploration, but they must not be used in SNT workflows unless reviewed and explicitly approved by the SNT team. If no official boundary file is available, the SNT team may decide to source from a trusted public dataset—but this decision must be made centrally.

Make sure you have all the shapefiles you need

Always request all relevant administrative levels, not just the chosen unit of analysis. For example, if we are conducting the analysis at adm2 level, we should also obtain the corresponding adm1 shapefile. This ensures that output maps can include higher-level boundaries for clear orientation and interpretation.

If we only have lower-level units (e.g., adm3) and lack the higher ones, there are ways to derive them from the lower units. These steps are covered later in the guide. Use this approach with caution and always validate the results against a known, trusted source.

For SNT purposes, use the official shapefiles from the SNT team rather than a publicly available set. However, publicly available shapefiles can still be very useful for other analyses. Sources for general shapefile access include:

GADM: Global administrative boundaries
Humanitarian Data Exchange (HDX): Humanitarian and administrative boundary datasets

Identify the unique key

When we receive an official shapefile, one of our first tasks is to identify the unique identifier column (e.g., adm3 or FIRST_CHIE). This code or name is used to join the spatial data with all our tabular data (e.g., MFL, routine data, etc.). Confirm with the SNT team that this identifier is consistent across all our datasets.
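Before joining, it can help to check that the identifier is truly unique in the shapefile and that its values match the tabular data exactly. The sketch below uses plain lists of key values; the function `check_join_key` and the example values are illustrative assumptions, and in practice the lists would come from the identifier columns of the imported shapefile and table.

```python
from collections import Counter


def check_join_key(shape_keys, table_keys):
    """Compare identifier values from a shapefile and a table before joining."""
    counts = Counter(shape_keys)
    shape_set, table_set = set(shape_keys), set(table_keys)
    return {
        # Each spatial unit should appear exactly once in the shapefile.
        "duplicates_in_shapefile": sorted(k for k, n in counts.items() if n > 1),
        # Units with no matching rows in the table (would map as missing data).
        "only_in_shapefile": sorted(shape_set - table_set),
        # Table values with no matching unit (would be dropped by the join).
        "only_in_table": sorted(table_set - shape_set),
    }


# Illustrative values only; note the lowercase spelling in the table.
shape_keys = ["Bo", "Bombali", "Kailahun", "Kenema"]
table_keys = ["Bo", "Bombali", "kailahun", "Kenema", "Kono"]

report = check_join_key(shape_keys, table_keys)
print(report)
```

Mismatches caused by case, spacing, or spelling differences show up in both the `only_in_shapefile` and `only_in_table` lists, which makes them easy to spot and resolve with the SNT team.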

Coordinate Reference Systems (CRS)

A Coordinate Reference System (CRS) is a coordinate-based framework used to locate features on the Earth’s surface. It consists of several components:

Coordinate system – Defines how positions are described (e.g., latitude/longitude, UTM).
Units – Specifies the measurement units (e.g., decimal degrees, meters).
Datum – The model of the Earth used as a reference (e.g., WGS84, NAD83). All spatial layers must use the same datum to align properly.
Projection – The mathematical transformation used to represent the curved Earth on a flat surface. This affects how distance, area, and shape are calculated.

Different CRSs serve different purposes:

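WGS84 (EPSG:4326) is a global geographic CRS that expresses positions as latitude and longitude in decimal degrees on the WGS84 ellipsoid. It is ideal for global mapping and GPS-based data collection, but less suitable for accurate distance or area measurements because a degree does not correspond to a fixed distance on the ground. For standard visualization mapping in SNT, WGS84 is usually used.
UTM zones (EPSG:326XX for northern-hemisphere zones and EPSG:327XX for southern-hemisphere zones on the WGS84 datum) project the Earth onto a flat grid within 6°-wide longitudinal zones, which keeps distortion small inside each zone. They use meters as units and are better suited for measuring distances or areas at the country or subnational level.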
Equal-area projections (e.g., EPSG:6933) preserve area relationships and are useful when comparing land coverage or performing zonal statistics.
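To see why degree-based coordinates are awkward for measurement, the sketch below estimates the ground length of one degree of longitude at several latitudes. It uses the haversine formula on a sphere with a mean Earth radius, which is an approximation chosen for simplicity rather than the ellipsoidal calculation a GIS library would perform; the function name `haversine_km` is our own.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in km between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Ground length of one degree of longitude at different latitudes.
for lat in (0, 8.5, 45, 60):
    print(lat, round(haversine_km(0, lat, 1, lat), 1))
```

One degree of longitude spans roughly 111 km at the equator but only about half that at 60° latitude, so distances or buffers expressed in degrees do not mean the same thing everywhere.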

When calculating distances or areas, use a projected CRS that has linear units (e.g., meters) and low distortion over the study area. For example, if you need to calculate service area coverage of a health facility, measure distance to the nearest health facility, or create buffer zones, UTM would be a better choice than WGS84.
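Choosing a UTM zone can be sketched with the standard zone formula: zones are 6° wide and numbered from 1 starting at 180°W. The helper below, `utm_epsg_from_lonlat`, is a hypothetical name; it assumes the WGS84 datum and ignores the special zone exceptions around Norway and Svalbard. The example coordinates are approximate.

```python
import math


def utm_epsg_from_lonlat(lon, lat):
    """Return the EPSG code of the WGS84 UTM zone containing (lon, lat)."""
    if not (-180 <= lon <= 180 and -80 <= lat <= 84):
        raise ValueError("UTM is defined for lon in [-180, 180], lat in [-80, 84]")
    # Zones are 6 degrees wide, numbered 1 to 60 eastward from 180 degrees W.
    zone = min(int(math.floor((lon + 180) / 6)) + 1, 60)
    # 326XX codes are northern-hemisphere zones, 327XX southern-hemisphere zones.
    base = 32600 if lat >= 0 else 32700
    return base + zone


# Approximate points in western and eastern Sierra Leone.
print(utm_epsg_from_lonlat(-13.23, 8.48))
print(utm_epsg_from_lonlat(-11.0, 8.0))
```

A country can span more than one UTM zone, as the two points above suggest, so the zone is usually chosen from a central point of the study area or taken from the national standard where one exists.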

Additional Resources for Learning About Spatial Vectors and GIS Vector Formats

If we are new to shapefiles or want to deepen our understanding of how shapefile data is stored and managed in GIS, the following resources provide clear, reliable guidance:

For R Users

📘 Applied Spatial Data Analysis with R (ASDAR) – A foundational textbook for spatial data analysis using R, covering theory and practical applications.
🧑‍🏫 Introduction to Geospatial Data in R (Data Carpentry) – A beginner-friendly workshop series covering shapefile import, visualization, and manipulation in R.
🧰 Simple Features for R (sf) R package – Official documentation for the sf package, the standard for handling vector data in R.

For Python Users

🎓 Python Foundation for Spatial Analysis (Spatial Thoughts) – A full course offering a gentle introduction to Python programming with a focus on geospatial data workflows.
🧭 Introduction to Python for Geographic Data Analysis – An open-access, well-structured guide for working with spatial data in Python using key libraries like geopandas.
💡 Geopandas Documentation – The official site for geopandas, the most widely used Python package for working with vector data.

These resources provide both conceptual grounding and practical guidance, whether we are working in R, Python, or any other GIS environment.

Summary

This section provides background knowledge for working with spatial data in SNT analysis. It introduces core concepts and spatial vector formats, with a focus on shapefiles, and explains the importance of proper sourcing of shapefiles in particular. References are provided for other code library content on working with spatial data and further resources for learning.