Getting started: For analysts

Overview

This section of the SNT Code Library is for analysts—whether national program staff, implementing partners, or technical advisors—who are responsible for executing the data workflows that power subnational tailoring. It covers the essential setup steps to help you use the code library effectively: how the guide is structured, what baseline skills are expected, how to organize your project folders, and how to write clean, consistent code across languages.

The goal is to reduce friction, prevent early errors, and establish a smooth, reproducible workflow that others can follow. Whether you’re starting a new analysis or continuing work someone else began, this section ensures your environment is set up correctly, your data is organized, and your code is structured in a portable and collaborative way.

If you’re new to coding in your preferred language (R, Python, or Stata), we link to trusted resources to help you build foundational skills quickly. If you’re already experienced, this section helps you align with the conventions used throughout the SNT codebase. Investing a bit of time here at the beginning will save hours later when scaling, adapting, or sharing your work across teams.

Once you’ve completed this section, you’ll be ready to begin running the workflows across the SNT library.

How to Work with the SNT Team

Strong collaboration between analysts and the SNT team is central to ensuring that analysis is relevant, trusted, and actionable. While the code library provides technical workflows, it is not a substitute for programmatic judgment. These workflows are meant to support a dialogue-led process, with the SNT team at the center.

Here are some principles to keep in mind when working with the SNT team:

  • Use the right data, with the right approval: Analysts should only use data sources that have been approved and provided by the SNT team or agreed upon jointly. Where modeled data (e.g., MAP, WorldPop) or proxies are used, these choices must be discussed and cleared before use. Always document the origin and version of every dataset used.

  • Be transparent about your methods: All analysis steps should be documented clearly in code and supporting notes. Assumptions, transformations, exclusions, and calculations should be explainable. The goal is to make the reasoning behind outputs visible—not just the outputs themselves.

  • Expect review and validation: The SNT team will need to formally review and sign off on key outputs, including stratification results, prioritization scores, and scenario comparisons. Analysts should expect multiple rounds of feedback and engage with it constructively. Not all feedback will lead to changes—but analysts should explain clearly when something cannot or should not be changed, and why. SNT is rarely black-and-white, and building shared understanding is part of the job.

  • Be clear when changes aren’t feasible: Not all requests from the SNT team can or should be implemented. Analysts should be ready to explain when limitations in data, methods, or timelines prevent a revision—and to justify why. Being honest about constraints helps strengthen trust and avoids misinterpretation.

  • Stay aligned on purpose: Analysts are not just here to run code. They’re here to help the SNT team interpret what the data is saying, how it connects to past and future decisions, and what uncertainties or gaps need to be acknowledged. This means engaging in dialogue, not just delivering files.

  • Expect to revise and iterate: SNT is not a one-time exercise. You should expect to revisit analyses as new input, questions, or feedback comes from the SNT team. This is a normal and necessary part of building ownership and ensuring the outputs serve real planning needs.

  • Understand that iteration ≠ rework: Revisiting outputs is not about redoing past work; it’s about responding to shifting questions and making the analysis more useful. Analysts should approach this as an expected and constructive part of the process.

  • Prioritize dialogue over delivery: Treat every output as a conversation starter. Don’t just send files; engage the SNT team to help them interpret the meaning and implications of results. Use clear visual aids, summaries, or verbal briefings where helpful.

  • Put local context first: When outputs, especially those based on models, conflict with what’s known locally (e.g. about malaria seasonality or burden), local input should guide decisions. Analysts should be ready to adjust or annotate outputs to reflect this context.

  • Document trade-offs and constraints: Keep a running log of what compromises were made, such as proxy choices, assumptions, or omitted datasets. This is key for transparency and for explaining decisions to stakeholders.

  • Keep a shared record of results and discussions: Alongside your code, maintain a shared PowerPoint deck (or similar document) that captures each step of the work, results, feedback, decisions, and rationale. Update it regularly so there’s a clear, centralized history of what’s been done and agreed. This makes onboarding new people easier, reduces repetition, and helps preserve institutional memory.

Together, these principles ensure that analysis is not just technically sound, but also aligned with country priorities and decision-making processes. Working closely with the SNT team helps build trust in the outputs, ensures that methods reflect local context, and increases the likelihood that results will be used to inform strategy. Think of the code library as a starting point—the real value comes from adapting it with the SNT team’s guidance.

Keeping records

Analysts should keep good records of their work. This may include recording all process details and results in a growing PowerPoint slide deck (or other living document) that is shared with the SNT team at each update. The living document should also capture discussions with the SNT team and their conclusions, so that the complete record of the analysis exists in a single place.

This record, while extensive, should explain in a clear and logical way what was done, what was decided, and why. Analysts should also keep minutes of SNT team discussions pertaining to their work and disseminate minutes after meetings along with clear action items and assignees.

Keeping good records along the way will make it a lot easier and faster to prepare a thorough and clear final report of your work.

Orientation and Setup

How to Use This Guide

Each section of the SNT code library is designed to be clear, structured, and standalone. Whether you’re looking to process DHIS2 data, extract population rasters, or calculate incidence rates, you’ll find that every page follows the same consistent logic, allowing you to focus on the analytical steps without needing to reorient yourself each time.

Here’s how the guide is organized:

  1. Overview at the top

Every section begins with an Overview that explains what the workflow does, what kind of data it is designed for, and how it fits into the broader SNT process. The goal is to help you get oriented before any code is introduced. It lays out what the section covers, when the workflow is relevant, and how it connects with other parts of the pipeline. If there are points where consultation with the SNT team is required, for example, if you need to confirm whether a modeled dataset is appropriate or check that the method being used has been validated, this is noted clearly at the start, so it is not missed later in the steps.

  2. Step-by-Step Guidance

The core of each page is a series of steps that walk through the task in detail. Each step includes executable R, Python, and Stata code, brief explanations of what the code is doing, and, where appropriate, short comments on assumptions, logic, or typical pitfalls.

  3. Call-outs for Clarity and Emphasis

Throughout the documentation, visual call-outs are used to surface important information:

Objective callout

Objective call-outs appear at the beginning of each page and summarize what the section is designed to help you achieve. They help orient you before any code begins.

Tip callouts

Tip call-outs offer practical suggestions or useful reminders—naming conventions, performance tricks, or small ways to reduce complexity. These are based on hands-on experience and are meant to make your workflow smoother.

Important callouts

Important call-outs are used to flag technical requirements, data limitations, or points where misinterpretation is easy. You’ll often see these when using modeled data like MAP or WorldPop, or working with incomplete datasets.

Consult SNT Team

Consult SNT Team call-outs appear when you should check with the SNT team before moving forward—for example, before using modeled rasters, setting classification rules, or applying proxy indicators. These are often placed early in the section, so they’re not missed later.

Validate with SNT Team

Validate with SNT Team call-outs are used after code execution steps that generate results requiring review—such as stratification outputs, prioritized district lists, or visualized risk layers. These steps must be reviewed by the national SNT team before being used in decision-making.

  4. How to adapt the code

At the end of most code blocks, there is a “How to adapt the code” note. These are tailored for users applying the workflow in a different country, with different filenames, or using different admin boundaries. These instructions are minimal but focused—just enough to show you what to modify and how, without slowing you down.

  5. Full code at the bottom

Each section ends with a full code block that puts everything together. If you already understand what the section does and just want to run the entire thing, you can scroll to the bottom and copy it in one go. It’s also helpful when reviewing or troubleshooting a script you’ve already written.

A final note: Many examples in this library use Sierra Leone’s DHIS2 data to illustrate how different workflows operate. These examples are meant to demonstrate structure and logic—not to be copied directly. You will need to adapt the code to fit your own context, including filenames, admin boundaries, indicator names, and data structures. Each section includes notes (in the How to adapt the code section) to guide this process, but it is your responsibility to ensure the workflow aligns with your country’s data and analysis requirements.

System Requirements and Setup

The SNT Code Library supports workflows in R, Python, and Stata, with core modules designed to function consistently across languages. While each page of the code library may present one language first, equivalent scripts and logic are provided in tabbed sections throughout, allowing users to follow the same workflow in R, Python, or Stata.

To get started, choose the language you’ll be using and ensure your environment is properly set up. Each tab below outlines what’s required and expected for that language, including installation instructions, recommended tools, and package guidance.

  • R
  • Python
  • Stata

The code library is designed for use in an R-based workflow. All examples are written in R and assume you are working from RStudio, which offers the most reliable environment for executing code chunks and previewing outputs, navigating folders and managing file paths, working with Quarto documents, and integrating version control via Git. At minimum, you should have:

  • R version 4.2 or higher (which you can download from here)
  • the latest version of RStudio (available here)
  • an active internet connection for installing packages and downloading external data when needed

All packages are handled using the pacman package, which simplifies both loading and installation. You’ll see this approach used throughout the code library:

pacman::p_load(
  dplyr,         # data manipulation
  sf,            # vector spatial data
  exactextractr, # raster-to-polygon extraction
  terra          # raster handling
)

Assumed Knowledge

This code library assumes basic working knowledge of R, Python, or Stata, depending on the language you are using to follow the examples.

For R users, you should be comfortable running code in RStudio, reading and writing data files using functions like rio::import() and readRDS(), and using commonly applied packages such as dplyr and ggplot2.
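For instance, a typical import/save round-trip with these functions looks like the sketch below (the file paths are placeholders, not actual library files):

```r
# Sketch: reading and writing data files (placeholder paths)
pacman::p_load(rio)

# rio::import() infers the file format from the extension
df <- rio::import("01_data/example_dataset.csv")

# Save as RDS for fast, lossless re-loading, then read it back
saveRDS(df, "01_data/example_dataset.rds")
df <- readRDS("01_data/example_dataset.rds")
```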

Many of the workflows rely on pre-built functions from the sntutils package, which handles common tasks like downloading, cleaning, aggregating, or visualizing data. Your main task is to supply the right inputs and understand how the output fits into the broader pipeline.

> If you encounter issues with a function in sntutils and the built-in help or documentation doesn’t resolve it, please contact [info@appliedhealthanalytics.org](mailto:info@appliedhealthanalytics.org) for support. This ensures that any bugs or unclear behaviors are flagged and addressed centrally.

For example, the function below downloads monthly CHIRPS rainfall rasters for January to March 2022 across Africa and saves them in a local folder:

# Download Africa monthly rainfall for Jan to Mar 2022
download_chirps(
  dataset = "africa_monthly",
  start = "2022-01",
  end = "2022-03",
  out_dir = "data/chirps"
)

You’re not expected to look inside these functions or modify them—they’re built to simplify the workflow and reduce errors.

What if I’m New to Coding?

If you’re new to R, Python, or Stata, don’t worry: there are excellent, beginner-friendly resources to help you get started. While the SNT code library is designed to be readable and reusable, it assumes some basic familiarity with syntax, functions, and how to run scripts in your chosen language.

This library is not meant to teach any language from scratch, but the skills it assumes are widely taught and easy to build with the right tools. Below are curated resources to help you gain the foundation needed to work effectively with the code library.

  • R
  • Python
  • Stata

Here are some recommended resources for getting started with R, all of which are free and well-regarded in the data science and epidemiology communities. These will help you build the baseline skills needed to work with the SNT code library effectively:

Beginner books and tutorials

  • R for Data Science (2e) by Hadley Wickham and Mine Çetinkaya-Rundel: An excellent and accessible introduction to R and the tidyverse. Chapters walk you through data import, wrangling, visualization, and modeling, with runnable examples.

  • The Epidemiologist R Handbook: Tailored to public health and epidemiology use cases. Offers concise examples and practical advice on real-world epidemiological workflows, including dplyr, sf, and ggplot2.

Self-paced interactive tutorials

  • RStudio Primers: Free, browser-based interactive tutorials hosted by Posit (formerly RStudio). Ideal for learning tidyverse packages such as dplyr, ggplot2, and tidyr through guided practice.

  • Swirl: A package that teaches you R directly in your RStudio console. Great for beginners—install it, run swirl::swirl(), and start learning from within R itself. Covers topics like data types, dplyr, and plotting.

Videos and MOOCs

  • Datacamp: Introduction to R: A hands-on beginner course that walks through R syntax, data types, and vectors. Offers interactive code challenges in the browser (free tier includes limited access).

  • Coursera: R Programming by Johns Hopkins: A foundational course that introduces R language basics and builds up to functions and data structures. Free to audit.

Other useful resources

  • Posit Cheatsheets: Compact PDF references for popular packages like dplyr, ggplot2, and sf. Useful for quick lookups and reminders as you work through the SNT code library.

  • R Bloggers: A community-driven blog aggregator for all things R. Good for discovering tutorials, best practices, and troubleshooting advice.

These resources are more than enough to help you build the baseline skills expected here. You don’t need to master everything before using the SNT code library, but gaining comfort with core syntax and workflows in your preferred language will make the process smoother and the code easier to adapt to your own context.

Working Conventions

Folder Structures and File Paths

Organizing your project files consistently is one of the most important steps you can take to reduce errors, avoid duplication, and make your analysis reproducible. In the SNT code library, we assume that you are working within a structured folder system that mirrors the logic of the SNT workflow, with dedicated folders for inputs (like raw shapefiles or survey data), outputs (like plots or summaries), and coding scripts grouped by topic.

We strongly recommend initializing each country analysis or workstream as its own self-contained folder system, ideally structured as an RStudio Project (see below). This ensures that relative paths will behave as expected and outputs don’t get scattered across your machine.

Example structure

Here’s a basic version of what this might look like:

your-snt-project/
│
├── 01_data/
│   ├── 1.1_foundational/
│   │   ├── 1a_admin_boundaries/
│   │   ├── 1b_population/
│   │   └── 1c_climate/
│   └── 1.2_epidemiology/
│       ├── 1.2a_dhis2/
│       └── 1.2b_pfpr_estimates/
├── 02_scripts/
│   ├── 02a_cleaning/
│   └── 02b_analysis/
├── 03_output/
│   ├── 3a_figures/
│   └── 3b_tables/
└── README.md

You don’t need to copy this exactly; the point is to have a clear separation between raw data, processed data, scripts, and outputs, and to keep your structure consistent across projects and team members.
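If you want to build the example structure programmatically rather than by hand, a few lines of base R are enough. A sketch, using the folder names from the example above (adapt them to your project):

```r
# Sketch: create the example folder tree with base R
folders <- c(
  "01_data/1.1_foundational/1a_admin_boundaries",
  "01_data/1.1_foundational/1b_population",
  "01_data/1.1_foundational/1c_climate",
  "01_data/1.2_epidemiology/1.2a_dhis2",
  "01_data/1.2_epidemiology/1.2b_pfpr_estimates",
  "02_scripts/02a_cleaning",
  "02_scripts/02b_analysis",
  "03_output/3a_figures",
  "03_output/3b_tables"
)

# recursive = TRUE creates any missing parent directories
for (f in folders) {
  dir.create(
    file.path("your-snt-project", f),
    recursive = TRUE,
    showWarnings = FALSE
  )
}
```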

Use the SNT template function (optional but recommended)

If you’re starting from scratch, the sntutils package (R example shown here) includes a utility function that sets up this kind of folder system for you. This saves time and helps reduce early friction with file management.

sntutils::initialize_project_structure(
  base_path = "sierra_leone_snt"
)

This will generate a directory like the one below, with subfolders already created for foundational data, epidemiology, interventions, environment, health systems, and other major domains of SNT analysis:

sierra_leone_snt/
│
├── 01_data/
│   ├── 1.1_foundational/
│   │   ├── 1.1a_admin_boundaries/
│   │   ├── 1.1b_health_facilities/
│   │   └── 1.1c_population/
│   │       ├── 1.1ci_national/
│   │       └── 1.1cii_worldpop_rasters/
│   ├── 1.2_epidemiology/
│   │   ├── 1.2a_routine_surveillance/
│   │   ├── 1.2b_pfpr_estimates/
│   │   └── 1.2c_mortality_estimates/
│   ├── 1.3_interventions/
│   ├── 1.4_drug_efficacy_resistance/
│   ├── 1.5_environment/
│   │   ├── 1.5a_climate/
│   │   ├── 1.5b_accessibility/
│   │   └── 1.5c_land_use/
│   ├── 1.6_health_systems/
│   │   └── 1.6a_dhs/
│   ├── 1.7_entomology/
│   └── 1.8_commodities/
├── 02_scripts/
├── 03_outputs/
│   └── plots/
├── 04_reports/
└── metadata_docs/

R Projects and Relative Paths (R users only)

For those using R, the SNT code library assumes that each country’s analysis is managed as a dedicated RStudio Project. This means that all scripts, data, outputs, and reports live in the same folder, with the .Rproj file at the root. Setting up an R Project is straightforward, and more detail can be found here on how to create one using RStudio.

Working this way has several benefits:

  • File paths are predictable and portable.
  • Hard-coded paths that only work on one machine are avoided.
  • Inputs and outputs are organized and easy to locate.
  • Errors from mismatched working directories (getwd()) are minimized.
  • Code becomes easier to share across teams when everyone uses the same structure.

When you open a project via the .Rproj file (e.g., sierra_leone_snt.Rproj), RStudio sets that folder as your working directory. You don’t need to use setwd() or manage the working directory manually.

Always use here::here() for file paths

This code library consistently uses here::here() to construct file paths. It ensures that scripts can access inputs and save outputs reliably, no matter the machine or operating system. For example:

# Avoid this
readRDS("C:/Users/yourname/Desktop/snt_project/01_data/...")

# Use this
readRDS(
  here::here(
    "01_data",
    "1.1_foundational",
    "1.1a_admin_boundaries",
    "sle_adm3_shp.rds"
  )
)

Example: Sierra Leone project folder

If this setup were applied to the Sierra Leone SNT workflow, the top-level folder would look like:

sierra_leone_snt/
│
├── sierra_leone_snt.Rproj
├── 01_data/
├── 02_scripts/
├── 03_output/
│   ├── 3a_figures/
│   └── 3b_tables/
└── README.md

By keeping everything in a single, self-contained folder and using relative paths with here::here(), you simplify collaboration, make debugging easier, and ensure your code remains portable and reproducible: core principles throughout the SNT code library. This structure also lays the groundwork for how data is organized and accessed in later steps. For more detail on how inputs, outputs, and file types are structured across the workflow, see the Data Structures section.

Coding Style and Formatting

To support clarity, collaboration, and ease of adaptation across projects, the SNT code library follows a consistent approach to writing code. Equivalent examples for R, Python, and Stata are provided throughout using tabsets. While the syntax differs, the same principles of clean structure, consistent logic, and readable formatting apply across all languages. We’re not strict about enforcing every line, but we encourage you to adopt these habits as they promote clarity and good practice, especially in shared or long-term workflows.

  • R
  • Python
  • Stata

General Style Principles

  • Prefer tidyverse-style syntax: We write most code using dplyr, tidyr, and other tidyverse tools. This keeps the logic transparent and consistent. We avoid mixing base R idioms with tidyverse unless needed.

  • Use the base R pipe |> by default: The native pipe |> is built into R (since version 4.1), requires no extra packages, and avoids unnecessary dependencies. This helps newer users avoid loading magrittr or dplyr just to use pipes. You’ll still see %>% used in tidyverse-heavy scripts, but for general chaining, |> is preferred and more future-proof.

  • Use one pipe per line: Avoid writing long chains without breaks. Instead, write each step clearly on a new line. This makes debugging easier and helps others trace your logic.

  • Always include package namespaces: Write dplyr::mutate() or readxl::read_excel() rather than relying on library(). This avoids ambiguity when functions from different packages share names (e.g., filter(), which exists in both stats and dplyr), and it makes it easier for someone to understand where each function comes from.

  • Keep code to 80 characters wide: Line-wrapping helps with readability and makes version control differences much easier to follow. If a line is getting too long, split it logically across lines.

  • Use comments to structure your script: Start each section with a short header and use comments to explain logic where it’s not obvious. Even a few lines like # Import data or # Convert date columns help orient the reader.

To see how these principles come together, unfold the example below. It shows a well-organized script that follows the conventions we recommend: tidyverse-style syntax, base R pipes, explicit namespaces, structured comments, and clean formatting throughout. This structure improves readability, reduces bugs, and makes it easier to reuse or adapt the code in future workflows.

Unfold below to see these principles and style applied in practice, using DHIS2 data cleaning as an example.

Show example code
# Set up and data import -------------------------------------------------------

# Install pacman if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Load required packages
pacman::p_load(
  readxl, # excel import
  dplyr,  # data manipulation
  tidyr,  # reshaping
  janitor # cleaning names
)

# Import raw DHIS2 data
raw_data <- readxl::read_excel("data/dhis2_malaria_data.xlsx")

# Data cleaning and wrangling --------------------------------------------------

raw_data2 <- raw_data |>
  janitor::clean_names() |>
  dplyr::rename(
    adm1 = orgunitlevel2,
    adm2 = orgunitlevel3,
    facility = organisationunitname,
    period = periodname
  ) |>
  tidyr::separate(period, into = c("month", "year"), sep = " ") |>
  dplyr::mutate(
    month = match(
      month,
      c(
        "Janvier",
        "Fevrier",
        "Mars",
        "Avril",
        "Mai",
        "Juin",
        "Juillet",
        "Aout",
        "Septembre",
        "Octobre",
        "Novembre",
        "Decembre"
      )
    ),
    month = base::as.integer(month),
    year = base::as.integer(year)
  )

A consistent coding style makes your work easier to read, review, and reuse. The practices shared here (breaking code into clear sections, using descriptive naming, and applying consistent formatting) are not strict rules but helpful habits. They reduce errors, support collaboration, and make it easier to maintain and adapt your code over time. For newer users, these are especially good habits to build early.

Data Structures for SNT Workflows

After setting up your working environment and organizing your folder structure, the next step is to ensure that your data is structured in a way that supports analysis, collaboration, and reproducibility. This involves thoughtful design of how data is formatted, both within individual datasets and across the wider data repository.

Why this matters for SNT

SNT relies on combining diverse datasets — routine surveillance, PfPR estimates, intervention coverage, population, entomology, and more. These data must be interoperable: they should align on location, time, and variable definitions. Without structure, analysis becomes ad-hoc, manual, and error-prone. With good structure, workflows become automated, datasets become reusable, and analyses become easier to validate and share.

There are two levels to data structure:

  • Structure within datasets: Structure begins with internal clarity. Are columns named in a way that is intuitive and unambiguous? Are values—such as dates, codes, or categories—formatted consistently across all rows? Are geographic units and time variables standardized to support joining, filtering, and comparison? A well-structured dataset is tidy, analysis-ready, and reduces ambiguity before any integration or analysis begins.
  • Structure between datasets: Structure also means coherence across the different SNT datasets. Can they be reliably joined using common keys like admin1, admin2, year_month, or indicator_code? Are file formats and folder names consistent and intuitive? This ensures datasets can be joined without issues and reside in an organized, navigable repository—making the entire system easier to manage, scale, and trust.

In the sections that follow, we outline the principles and practical steps to make your datasets both tidy and harmonized—ready to plug into any SNT workflow.
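To make the idea concrete, here is a minimal sketch of what interoperability allows: two tidy tables that share adm2 and year keys can be joined in a single step. The data and column names below are invented for illustration only:

```r
# Sketch: joining two tidy datasets on shared keys (hypothetical data)
pacman::p_load(dplyr)

cases <- data.frame(
  adm2       = c("Gaoual", "Koundara"),
  year       = c(2020, 2020),
  conf_cases = c(1200, 950)
)

population <- data.frame(
  adm2       = c("Gaoual", "Koundara"),
  year       = c(2020, 2020),
  population = c(47000, 39000)
)

# Shared keys make the join a one-liner
incidence <- cases |>
  dplyr::left_join(population, by = c("adm2", "year")) |>
  dplyr::mutate(incidence_per_1000 = conf_cases / population * 1000)
```

If the keys were inconsistent (misspelled district names, mismatched year formats), the join above would silently drop or misalign rows, which is exactly the failure mode standardization prevents.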

Structure Within a Dataset

Structure starts within each dataset. Each dataset should be internally clear, tidy, and analysis-ready; this forms the foundation for reliable joining, filtering, and aggregation. It should follow tidy data principles, use standardized column names, and be self-contained with clear labels and documentation.

To achieve this, we focus on a few key practices:

1. Should be tidy

Tidy data is a concept introduced by Hadley Wickham that defines a simple, structured way to organize data in a consistent and usable manner.

According to this principle:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms its own table

This structure minimizes ambiguity and reduces friction during analysis. Instead of spending time reshaping and cleaning poorly organized tables, analysts can focus on more substantive tasks. Tidy datasets make it easier to merge with other sources, such as combining routine case data with population estimates or environmental rasters, and enhance code reusability across shared scripts and analysis pipelines.

For example, consider a population dataset provided in this wide format:

| region | district | pop_u5_2020 | pop_u5_2021 | pop_total_2020 | pop_total_2021 |
|--------|----------|-------------|-------------|----------------|----------------|
| Boké   | Gaoual   | 6500        | 6800        | 47000          | 48500          |

This structure makes it difficult to filter by year or age group. A tidy version reshapes the data into a long format, making it much more usable:

| region | district | year | age_group | population |
|--------|----------|------|-----------|------------|
| Boké   | Gaoual   | 2020 | u5        | 6500       |
| Boké   | Gaoual   | 2020 | total     | 47000      |
| Boké   | Gaoual   | 2021 | u5        | 6800       |
| Boké   | Gaoual   | 2021 | total     | 48500      |
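This wide-to-long reshape can be done with tidyr::pivot_longer(). A sketch, assuming the wide table has been loaded as pop_wide:

```r
pacman::p_load(dplyr, tidyr)

# Wide-format input as received (illustrative values)
pop_wide <- data.frame(
  region = "Boké",
  district = "Gaoual",
  pop_u5_2020 = 6500,
  pop_u5_2021 = 6800,
  pop_total_2020 = 47000,
  pop_total_2021 = 48500
)

# Reshape: column names like pop_u5_2020 split into age_group and year
pop_data <- pop_wide |>
  tidyr::pivot_longer(
    cols = dplyr::starts_with("pop_"),
    names_to = c("age_group", "year"),
    names_pattern = "pop_(.*)_(.*)",
    values_to = "population"
  ) |>
  dplyr::mutate(year = as.integer(year))
```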

With data in tidy format, filtering and transforming becomes intuitive. For example:

  • R
  • Python
  • Stata
pop_data |>
  dplyr::filter(year == 2020) |>
  dplyr::select(
    district, year,
    age_group, population
  )

Which gives:

district  year  age_group  population
Gaoual    2020  u5         6500
Gaoual    2020  total      47000

Tidy format makes it easier to filter, group, join, visualize, or model your data—without constant reshaping. By organizing datasets in a consistent, long format, you reduce friction throughout the workflow. This approach supports clear logic, scalable code, and smoother integration with tools across platforms. For more guidance on tidy data, refer to R for Data Science – Chapter 12: Tidy Data.

2. Should Have Standardized Columns

Tidy data gives us a solid foundation, but structure alone isn’t enough. To make data truly SNT-ready, we need standardization. Standardization means aligning key elements, including geographies, dates, variable names, and population groups, so datasets can be joined, compared, and analyzed reliably. Without this, merges fail and outputs become misleading. With it, your workflows are simpler, more reusable, and less error-prone.

At its core, standardization is about aligning key dimensions across datasets so they are interoperable:

  • Geographies: Use consistent names and codes for admin units (adm1, adm2, etc.). Mismatched district names or missing codes can silently break joins.

    Example: If adm2 is spelled Gaoual in one dataset and Guoal in another, the two won’t join properly, leading to missing or mismatched records.

  • Dates: Adopt a unified date format (e.g., year_month = “2023-09”). Reference periods should be aligned so that indicators from different sources reflect the same time frame.

    Example: Inconsistent formats like “Jan 2023” in one dataset and “2023-01” in another can cause joins to fail silently or produce misaligned results.

  • Variables: Column names and definitions must follow shared conventions (e.g., use conf_mic, not confirmed_microscopy; use adm2, not district_name).

    Example: If one dataset uses conf_mic and another uses confirmed_microscopy, trying to bind rows or join tables will fail unless columns are renamed first. This is why it is better to agree on a naming convention early on and apply it consistently across all datasets.

  • Population Groups: If datasets are disaggregated (e.g., by age, sex, or risk group), ensure categories align. For instance, u5, 5_14, 15plus should be used consistently across indicators.

    Example: If one dataset uses u5, 5_14, 15plus, but another uses under_5, 5to14, 15_and_above, grouping and comparison across age bands becomes error-prone.

  • Units of Measure: Align how values are reported. Is prevalence reported as 0.1 or 10%? Are rates per 10,000 or per 100,000? This can affect comparisons and further analysis if not harmonized.

    Example: One dataset might report incidence as a percentage (1.2%), another as a decimal (0.012), and a third per 10,000 population. Without harmonizing these units, comparisons or weighted summaries will be misleading.
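
The alignment steps above can be sketched in Python with pandas; the column names, date format, age-group mapping, and unit conversion below are illustrative assumptions, not fixed conventions:

```python
import pandas as pd

# A dataset using non-standard conventions (illustrative values)
df = pd.DataFrame({
    "district_name": ["Gaoual"],
    "month": ["Jan 2023"],
    "confirmed_microscopy": [120],
    "age_band": ["under_5"],
    "incidence_pct": [1.2],   # incidence reported as a percentage
})

# 1. Variables: rename columns to the agreed convention
df = df.rename(columns={"district_name": "adm2",
                        "confirmed_microscopy": "conf_mic"})

# 2. Dates: unify to year_month = "YYYY-MM"
df["year_month"] = pd.to_datetime(df["month"], format="%b %Y").dt.strftime("%Y-%m")

# 3. Population groups: map local labels to the shared categories
age_map = {"under_5": "u5", "5to14": "5_14", "15_and_above": "15plus"}
df["age_group"] = df["age_band"].map(age_map)

# 4. Units: convert a percentage to a decimal proportion
df["incidence"] = df["incidence_pct"] / 100

df = df[["adm2", "year_month", "age_group", "conf_mic", "incidence"]]
```

Running these steps once, in a shared script, means every downstream join and comparison starts from the same conventions.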

Investing in standardization upfront reduces downstream friction—enabling cleaner joins, consistent comparisons, and easier reuse across SNT workflows.

3. Should Be Self-Contained

With tidy and standardized data in place, the final step is making it self-contained. This means all necessary context (labels, definitions, units, and versioning) is either embedded or clearly linked. It’s a small but crucial step that makes data reusable, reproducible, and easy to build on.

Good practices include:

  • Metadata: Each variable should have labels, units, and clear definitions. For example, it should be obvious whether a value is a count, a percentage, or a rate per 1,000 people.

    Metadata doesn’t always live in the dataset itself—it can also include standalone notes or documents that capture important analytical decisions. For example, if reporting rates were calculated differently during the COVID-19 period due to disruptions, this should be explicitly noted. A simple Word document outlining these adjustments—for instance, how malaria indicator values were imputed or interpreted differently in 2020–2021—adds essential context. These less visible decisions shape how data should be understood and compared, and documenting them ensures transparency, reproducibility, and credibility across the team and with stakeholders.

    You can also use inline comments in your code to explain decisions at the point they were made. This is just as important as standalone metadata—it helps other analysts follow your thinking, and reminds you why something was done months down the line.

    Finally, consider generating a data dictionary automatically at the end of your script. For example, write code that extracts variable names, labels, and units into a table and saves it as an Excel or database file. This kind of summary is extremely useful—for your own reference, for handovers, and for sharing with the SNT team.

  • Data dictionary: A separate file (or a tab in a dataset) that explains each column in plain terms: what it means, how it was calculated, and what categories or units are used. This is especially important when datasets are shared or reused by others.

    A strong data dictionary does more than describe variables. It also documents key processing steps. For instance, if confirmed_cases was imputed in some districts, the dictionary might note: “Imputed using average values from the same period and district.”

    It may also be worth including translated versions of the dictionary—especially when teams operate in multilingual settings. If analysts work in French but stakeholders prefer English, having both versions side by side makes the dataset more accessible. This is straightforward to generate using tools like the gtranslate R package.
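
A data dictionary like this can be generated automatically at the end of a script. Below is a Python sketch; the variable labels and units are illustrative assumptions, and it writes to CSV for simplicity (pandas' to_excel works the same way for an Excel file):

```python
import pandas as pd

# Analysis-ready dataset (illustrative)
data = pd.DataFrame({
    "adm2": ["Gaoual"],
    "year_month": ["2023-01"],
    "conf_mic": [120],
})

# Hand-maintained labels and units for each variable (assumed definitions)
labels = {
    "adm2": ("Admin level 2 (district) name", "text"),
    "year_month": ("Reporting period", "YYYY-MM"),
    "conf_mic": ("Cases confirmed by microscopy", "count"),
}

# Build the dictionary table from the dataset's actual columns,
# so it never drifts out of sync with the data itself
dictionary = pd.DataFrame({
    "variable": data.columns,
    "type": [str(t) for t in data.dtypes],
    "label": [labels[c][0] for c in data.columns],
    "unit": [labels[c][1] for c in data.columns],
})

dictionary.to_csv("data_dictionary.csv", index=False)
```

Because the table is derived from the dataset's columns, adding a variable without a label raises an error, prompting you to document it.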

  • Consistent versioning: Files should follow a predictable naming format (e.g., gin_dhis2_processed_2018-2024_v2025-03-28.rds) so that updates are traceable and confusion is avoided.

    In this example, gin refers to the country (Guinea), dhis2 is the data source, processed indicates the state of the dataset (processed vs. raw), 2018-2024 is the date range the data covers, and v2025-03-28 is the version date showing when the file was created.
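
Building such names programmatically keeps versions consistent; a minimal Python sketch following the same pattern:

```python
from datetime import date

def versioned_name(country, source, state, start_year, end_year,
                   version_date, ext="rds"):
    """Build a predictable, versioned file name:
    <country>_<source>_<state>_<start>-<end>_v<version-date>.<ext>"""
    return (f"{country}_{source}_{state}_"
            f"{start_year}-{end_year}_v{version_date.isoformat()}.{ext}")

name = versioned_name("gin", "dhis2", "processed", 2018, 2024,
                      date(2025, 3, 28))
# → "gin_dhis2_processed_2018-2024_v2025-03-28.rds"
```

Generating the name in code, rather than typing it by hand, avoids typos and guarantees every output of a pipeline run follows the convention.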

Together, these steps make your SNT datasets self-explanatory—clear enough to stand on their own without someone needing to walk you through them. This reduces friction in collaboration and makes data pipelines easier to debug, reuse, and maintain.

Structure Across Datasets

With individual datasets now tidy, standardized, and self-contained, the next step is ensuring they can work together. Structure across datasets is about building a coherent system — one where files are logically organized, consistently formatted, and join-ready. This enables analysis to scale smoothly across domains, time periods, and geographies.

At its core, this step involves two main elements:

1. Organized Repository

A well-organized folder structure reduces the time it takes to find relevant files, lowers the risk of mistakes, and makes it far easier to hand over work or onboard new users. As introduced earlier in the Folder Structure section, this organization should already be in place and consistently followed.

Datasets should be grouped thematically and not all thrown into one folder. For example, keep epidemiology, interventions, and environment data in separate folders, each with their own subfolders.

Each theme should also follow a clear split between:

  • raw/: untouched, original files. These are your source of truth and should never be edited directly.
  • processed/: cleaned, harmonized, and analysis-ready datasets.

The idea behind this separation is that it prevents confusion over which file was modified, preserves traceability, and ensures that scripts can always be rerun from the same raw inputs.

Below is an example of a clear folder breakdown for core datasets. Any raw DHIS2 data should be stored in the raw/ folder, while cleaned, analysis-ready versions belong in processed/:

data/
├── 02_epidemiology/
│   └── 2a_routine_surveillance/
│       ├── raw/
│       └── processed/
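
A layout like this can be created from the command line in one step (a sketch; extend with your own theme folders):

```shell
# Create the raw/ and processed/ split for the routine surveillance theme
mkdir -p data/02_epidemiology/2a_routine_surveillance/raw
mkdir -p data/02_epidemiology/2a_routine_surveillance/processed
```

Scripting the folder creation, rather than clicking folders into existence, means a new country or project repository starts from the same predictable structure every time.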

A clear, thematic layout supports scaling—adding a new country or indicator becomes easier when your structure is predictable. It also makes handover and collaboration less error-prone.

2. Consistent, Join-Ready Formatting

Many of the principles covered under Structure Within a Dataset (clear column and row structures, standardized date formats, harmonized admin units) are most effective when applied consistently across all SNT datasets. This is what consistent formatting across datasets means.

When every dataset uses the same conventions (adm1, adm2, year_month, and common variable names), joining becomes easy. You are no longer wasting time fixing mismatched columns or reconciling differences in formatting. Your data is analysis-ready and flows cleanly into the next SNT step.

Example: Even small differences can silently break joins. If adm2 is spelled Gaoual in one file but Gaoal in another, those rows will not match. If dates are 2023-04 in one and April 2023 in another, alignment becomes messy. And if one dataset uses u5 and another under_5, you’ll spend extra time fixing groupings. These are avoidable issues—standardizing early avoids friction later.
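
Mismatches like these can be caught before they silently drop records. A Python sketch using pandas' merge indicator; the district names reuse the illustrative misspelling from the example, and the values are made up:

```python
import pandas as pd

cases = pd.DataFrame({"adm2": ["Gaoual", "Koundara"],
                      "conf_mic": [120, 85]})
population = pd.DataFrame({"adm2": ["Gaoal", "Koundara"],
                           "population": [48500, 39000]})

# An outer join with indicator=True flags every row that failed to match
merged = cases.merge(population, on="adm2", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]

# "Gaoual" (cases only) and "Gaoal" (population only) both appear as
# unmatched, exposing the spelling inconsistency before analysis proceeds
```

Running a check like this after every join, and failing loudly when `unmatched` is non-empty, turns silent data loss into a visible, fixable error.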

Consistent formatting turns your datasets into interoperable building blocks. It ensures that each dataset speaks the same language, across geographies, dates, variables, and groups. This reduces friction, increases reuse, and makes your SNT workflow easier to scale or hand off. When formatting is inconsistent, tidy datasets lose reliability. Applied consistently, they become reusable and robust across the SNT pipeline.

Summary

Establishing a clean setup, from file structure to coding conventions to dataset formatting, is the first and most critical step in building reliable SNT workflows. By grounding your work in these principles and staying in close coordination with the SNT team, you create a process that is not only technically sound but also aligned with national priorities. The SNT Code Library offers the tools and structure to get started, but its true value comes when it’s adapted, reviewed, and iterated in context. With these foundations in place, you’re ready to begin running workflows that are reproducible, scalable, and decision-ready.

 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved