Getting started: for analysts

Overview

This section of the SNT Code Library is for analysts, whether national program staff, implementing partners, or technical advisors, who are responsible for executing the data workflows that power subnational tailoring. It covers the necessary setup steps to help us use the code library effectively: how the guide is structured, what baseline skills are expected, how to organize project folders, and how to write clean, consistent code across languages.

The goal is to reduce friction, prevent early errors, and establish a smooth, reproducible workflow that others can follow. Whether we are starting a new analysis or continuing work someone else began, this section ensures our environment is set up correctly, our data is organized, and our code is structured in a portable and collaborative way.

If we are new to coding in our preferred language (R, Python, or Stata), this section links to trusted resources to help build foundational skills quickly. If we are already experienced, it helps align our work with the conventions used throughout the SNT codebase. Investing a bit of time here at the beginning will save hours later when scaling, adapting, or sharing work across teams.

Once we have completed this section, we will be ready to begin running the workflows across the SNT library. In practice, analysis work is sequenced with the national malaria strategic plan (NMSP) cycle, including programme reviews, plan development, and funding application timelines, so that analyses feed into decisions when they are needed.

How to Work with the SNT Team

See Annex 1 — Terms of reference for the SNT team lead in WHO’s Manual for subnational tailoring of malaria interventions for the composition and responsibilities of the SNT team, including how the analysis team fits in.

Strong collaboration between analysts and the SNT team is central to ensuring that analysis is relevant, trusted, and actionable. While the code library provides technical workflows, it is not a substitute for programmatic judgment. These workflows are meant to support a dialogue-led process, with the SNT team at the center.

Here are some principles to keep in mind when working with the SNT team:

Use the right data, with the right approval: We should only use data sources that have been approved and provided by the SNT team or agreed upon jointly. Access to country data is governed by the national malaria program and, where applicable, by formal data-sharing agreements; see the data assembly and management chapter of the WHO SNT manual for the recommended approach. Where modeled data (e.g., MAP, WorldPop) or proxies are used, these choices must be discussed and cleared before use. Always document the origin and version of every dataset used.
Be transparent about methods: All analysis steps should be documented clearly in code and supporting notes. Assumptions, transformations, exclusions, and calculations should be explainable. The goal is to make the reasoning behind outputs visible, not just the outputs themselves.
Expect review and validation: The SNT team will need to formally review and sign off on key outputs, including stratification results, prioritization scores, and scenario comparisons. We should expect multiple rounds of feedback and engage with it constructively. Not all feedback will lead to changes, but we should explain clearly when something cannot or should not be changed, and why. SNT is rarely black-and-white, and building shared understanding is part of the job.
Be clear when changes are not feasible: Not all requests from the SNT team can or should be implemented. We should be ready to explain when limitations in data, methods, or timelines prevent a revision, and to justify why. Being honest about constraints helps strengthen trust and avoids misinterpretation.
Stay aligned on purpose: Our role extends beyond running code. We help the SNT team interpret what the data is saying, how it connects to past and future decisions, and what uncertainties or gaps need to be acknowledged. This means engaging in dialogue rather than delivering files alone.
Expect to revise and iterate: SNT is not a one-time exercise. We should expect to revisit analyses as new input, questions, or feedback comes from the SNT team. This is a normal and necessary part of building ownership and ensuring the outputs serve real planning needs.
Understand that iteration ≠ rework: Revisiting outputs is not about redoing past work; it is about responding to shifting questions and making the analysis more useful. We should approach this as an expected and constructive part of the process.
Prioritize dialogue over delivery: Treat every output as a conversation starter. Rather than just sending files, engage the SNT team to help them interpret the meaning and implications of results. Use clear visual aids, summaries, or verbal briefings where helpful.
Put local context first: When outputs, especially those based on models, conflict with what is known locally (e.g. about malaria seasonality or burden), local input should guide decisions. We should be ready to adjust or annotate outputs to reflect this context.
Document trade-offs and constraints: Keep a running log of what compromises were made, such as proxy choices, assumptions, or omitted datasets. This is key for transparency and for explaining decisions to stakeholders.
Keep a shared record of results and discussions: Alongside the code, maintain a shared PowerPoint deck (or similar document) that captures each step of the work, results, feedback, decisions, and rationale. Update it regularly so there is a clear, centralized history of what has been done and agreed. This makes onboarding new people easier, reduces repetition, and helps preserve institutional memory.

Together, these principles ensure that analysis is technically sound and aligned with country priorities and decision-making processes. Working closely with the SNT team helps build trust in the outputs, ensures that methods reflect local context, and increases the likelihood that results will be used to inform strategy. Think of the code library as a starting point. The real value comes from adapting it with the SNT team’s guidance.
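One lightweight way to document the origin and version of every dataset, and to keep the running log of trade-offs described above, is a small machine-readable log stored alongside the project. The sketch below is illustrative only: the function name `log_dataset`, the field names, and the choice of a CSV log are assumptions, not part of the SNT code library or sntutils.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path

# hypothetical field names; adapt to what the SNT team wants recorded
LOG_FIELDS = [
    "file", "source", "version", "approved_by",
    "sha256", "logged_on", "notes",
]


def log_dataset(log_path, data_file, source, version, approved_by, notes=""):
    """Append one provenance record for data_file to a CSV log."""
    log_path = Path(log_path)
    data_file = Path(data_file)

    # a checksum lets others confirm they hold exactly the same file
    checksum = hashlib.sha256(data_file.read_bytes()).hexdigest()

    record = {
        "file": data_file.name,
        "source": source,
        "version": version,
        "approved_by": approved_by,
        "sha256": checksum,
        "logged_on": date.today().isoformat(),
        "notes": notes,
    }

    # write the header only when the log is first created
    write_header = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(record)
    return record
```

Calling `log_dataset()` once per input file, with a note on any proxy or assumption involved, builds up a record that can be shared with the SNT team at each update.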

Keeping records

We should keep good records of our work. This may include storing all process and results details through a growing PowerPoint slide deck (or other living document), which is shared with the SNT team at each update. The living document should also include records of discussions with the SNT team and their conclusions, such that the complete record of analysis exists in a single place.

This record, while extensive, should explain in a clear and logical way what was done, what was decided, and why. We should also keep minutes of SNT team discussions pertaining to our work and disseminate minutes after meetings along with clear action items and assignees.

Keeping good records along the way will make it easier and faster to prepare a thorough and clear final report of the work.

Key validation milestones

See Table 1 of the WHO SNT manual for the full step-by-step process and where the SNT team formally signs off on analysis outputs.

At a minimum, the analysis team should plan to present outputs to the SNT team for validation at these milestones:

After data assembly and the situation review, to confirm what data is being used and how it has been processed.
After stratification of malaria epidemiology and its determinants, to confirm classification choices and resulting maps.
After intervention tailoring, to confirm the proposed intervention mix and any targeting criteria.
After costed strategic and operational scenarios, to confirm prioritisation and resource allocation.

Each milestone is a point at which the analysis pauses for review and decision, rather than a routine status update.

Orientation and Setup

How to Use This Guide

Each section of the SNT code library is designed to be clear, structured, and standalone. Whether we are looking to process DHIS2 data, extract population rasters, or calculate incidence rates, we will find that every page follows the same consistent logic, allowing us to focus on the analytical steps without needing to reorient ourselves each time.

Here is how the guide is organized:

Overview at the top

Every section begins with an Overview that explains what the workflow does, what kind of data it is designed for, and how it fits into the broader SNT process. The goal is to help us get oriented before any code is introduced. It lays out what the section covers, when the workflow is relevant, and how it connects with other parts of the pipeline. If there are points where consultation with the SNT team is required, for example, if we need to confirm whether a modeled dataset is appropriate or check that the method being used has been validated, this is noted clearly at the start, so it is not missed later in the steps.

Step-by-Step Guidance

The core of each page is a series of steps that walk through the task in detail. Each step includes executable R, Python and Stata code, brief explanations of what the code is doing, and where appropriate, short comments on assumptions, logic, or typical pitfalls.

Call-outs for Clarity and Emphasis

Throughout the documentation, visual call-outs are used to surface important information:

Objective call-outs

Objective call-outs appear at the beginning of each page and summarize what the section is designed to help us achieve. They help orient us before any code begins.

Tip call-outs

Tip call-outs offer practical suggestions or useful reminders, such as naming conventions, performance tricks, or small ways to reduce complexity. These are based on hands-on experience and are meant to make our workflow smoother.

Important call-outs

Important call-outs are used to flag technical requirements, data limitations, or points where misinterpretation is easy. We will often see these when using modeled data like MAP or WorldPop, or working with incomplete datasets.

Consult with SNT team

Consult SNT Team call-outs appear when we should check with the SNT team before moving forward, for example, before using modeled rasters, setting classification rules, or applying proxy indicators. These are often placed early in the section, so they are not missed later.

Validate with SNT team

Validate with SNT Team call-outs are used after code execution steps that generate results requiring review, such as stratification outputs, prioritized district lists, or visualized risk layers. These steps must be reviewed by the national SNT team before being used in decision-making.

How to adapt the code

At the end of most code blocks, there is a “How to adapt the code” note. These are tailored for users applying the workflow in a different country, with different filenames, or using different admin boundaries. These instructions are minimal but focused, just enough to show what to modify and how, without slowing down the workflow.
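One way to keep such adaptations contained is to gather country-specific values in a single place at the top of a script, so that switching context means editing one block rather than searching through the code. The sketch below assumes a hypothetical `CountrySettings` class and a file-naming pattern modelled on the `sle_spatial_adm3_2021.rds` example used elsewhere in this guide; neither is a formal convention of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountrySettings:
    """Country-specific values that change between adaptations."""

    iso3: str        # lowercase ISO3 country code, e.g. "sle"
    adm_level: int   # administrative level used for the analysis
    year: int        # reference year of the boundary file

    def boundary_filename(self, extension="rds"):
        # assumed pattern: <iso3>_spatial_adm<level>_<year>.<extension>
        return (
            f"{self.iso3}_spatial_adm{self.adm_level}_{self.year}"
            f".{extension}"
        )


# edit only this block when adapting the workflow to another country
settings = CountrySettings(iso3="sle", adm_level=3, year=2021)
print(settings.boundary_filename())
```

Keeping these values together also makes it easier for a reviewer to see at a glance which country, admin level, and year a script was run for.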

Full code at the bottom

Each section ends with a full code block that puts everything together. If we already understand what the section does and just want to run the entire thing, we can scroll to the bottom and copy it in one go. It is also helpful when reviewing or troubleshooting a script we have already written.

A final note: Many examples in this library use Sierra Leone’s DHIS2 data to illustrate how different workflows operate. These examples are meant to demonstrate structure and logic, not to be copied directly. We will need to adapt the code to fit our own context, including filenames, admin boundaries, indicator names, and data structures. Each section includes notes (in the How to adapt the code section) to guide this process, but it is our responsibility to ensure the workflow aligns with our country’s data and analysis requirements.

System Requirements and Setup

The SNT Code Library supports workflows in R, Python, and Stata, with core modules designed to function consistently across languages. While each page of the code library may present one language first, equivalent scripts and logic are provided in tabbed sections throughout, allowing users to follow the same workflow in R, Python, or Stata.

To get started, choose the language we will be using and ensure our environment is properly set up. Each tab below outlines what is required and expected for that language, including installation instructions, recommended tools, and package guidance.

For R users, the code library assumes an R-based workflow run from RStudio, which offers the most reliable environment for executing code chunks and previewing outputs, navigating folders and managing file paths, working with Quarto documents, and integrating version control via Git. At minimum, we should have R version 4.2 or higher (available from CRAN), the latest version of RStudio (available from Posit), and an active internet connection for installing packages and downloading external data when needed.

All packages are handled using the pacman package, which simplifies both loading and installation. We will see this approach used throughout the code library:

pacman::p_load(
  dplyr,
  sf,
  exactextractr,
  terra)

The code library can also be used in a Python-based workflow. Examples assume we are working from either JupyterLab, Jupyter Notebook, or VS Code, which are the most common environments for interactive coding, data analysis, and visualization. At minimum, we should have: Python 3.9 or higher (downloadable from python.org or included in Anaconda), and an active internet connection for installing packages and downloading external data when needed.

All packages are handled using pip or conda (if using Anaconda). To simplify package management across different environments, we recommend creating a virtual environment or conda environment for the project. A minimal set of required packages can be installed as follows:

# Using pip
pip install pandas geopandas rasterio shapely matplotlib

# Or using conda
conda install pandas geopandas rasterio shapely matplotlib -c conda-forge

In Python scripts throughout the library, packages are typically imported at the top of each script like this:

import pandas as pd
import geopandas as gpd
import rasterio
from shapely.geometry import Point
import matplotlib.pyplot as plt
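Before running any workflow, it can help to confirm that the interpreter and packages meet these requirements. The sketch below is a convenience check rather than part of the code library: the helper name `missing_packages` is hypothetical, and the package list simply mirrors the install command above.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 9)  # minimum version stated in this guide
REQUIRED = ["pandas", "geopandas", "rasterio", "shapely", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# compare the running interpreter against the minimum version
python_ok = sys.version_info >= MIN_PYTHON
print(
    f"Python {sys.version_info.major}.{sys.version_info.minor} "
    f"meets minimum: {python_ok}"
)

# report any packages that still need to be installed
missing = missing_packages(REQUIRED)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Running this inside the project's virtual or conda environment confirms that the check applies to the environment the scripts will actually use.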

Assumed Knowledge

This code library assumes basic working knowledge of R, Python, or Stata, depending on the language we are using to follow the examples.

A typical SNT analyst profile combines several complementary skills:

Comfort with at least one of R, Python, or Stata for data wrangling and visualisation.
Basic understanding of geographic information systems (GIS) and spatial data, since most SNT outputs are mapped at adm1, adm2, or adm3.
Working familiarity with the main SNT data sources (routine HMIS / DHIS2, household surveys, modeled estimates, intervention coverage).
Comfort with version control (typically Git) and reproducible project structures.

We do not need to be expert in every area to use the library, but we should plan to build up any missing pieces as analyses progress.

For R users, we should be comfortable running code in RStudio, reading and writing data files using functions like rio::import() and readRDS(), and using commonly applied packages such as dplyr and ggplot2.

Many of the workflows rely on pre-built functions from the sntutils package which handles common tasks like downloading, cleaning, aggregating, or visualising data. Our main task is to supply the right inputs and understand how the output fits into the broader pipeline.

If we encounter issues with a function in sntutils and the built-in help or documentation does not resolve it, please contact info@appliedhealthanalytics.org for support. This ensures that any bugs or unclear behaviors are flagged and addressed centrally.

For example, the function below downloads monthly CHIRPS rainfall rasters for January to March 2022 across Africa and saves them in a local folder:

# download Africa monthly rainfall for jan to mar 2022
sntutils::download_chirps(
  dataset = "africa_monthly",
  start = "2022-01",
  end = "2022-03",
  out_dir = "data/chirps"
)

We are not expected to look inside these functions or modify them. They are built to simplify the workflow and reduce errors.

What if I’m New to Coding?

If we are new to R, Python, or Stata, there are excellent, beginner-friendly resources to help get started. While the SNT code library is designed to be readable and reusable, it assumes some basic familiarity with syntax, functions, and how to run scripts in the chosen language.

This library is not meant to teach any language from scratch, but the skills it assumes are widely taught and easy to build with the right tools. Below are curated resources to help gain the foundation needed to work effectively with the code library.

Here are some recommended resources for getting started with R, all of which are free and well-regarded in the data science and epidemiology communities. These will help build the baseline skills needed to work with the SNT code library effectively:

Beginner books and tutorials

R for Data Science (2e) by Hadley Wickham and Mine Çetinkaya-Rundel: An excellent and accessible introduction to R and the tidyverse. Chapters walk through data import, wrangling, visualization, and modeling, with runnable examples.
The Epidemiologist R Handbook: Tailored to public health and epidemiology use cases. Offers concise examples and practical advice on real-world epidemiological workflows, including dplyr, sf, and ggplot2.

Self-paced interactive tutorials

RStudio Primers: Free, browser-based interactive tutorials hosted by Posit (formerly RStudio). Ideal for learning tidyverse packages such as dplyr, ggplot2, and tidyr through guided practice.
Swirl: A package that teaches R directly in the RStudio console. Great for beginners: install it, run swirl::swirl(), and start learning from within R itself. Covers topics like data types, dplyr, and plotting.

Videos and MOOCs

DataCamp: Introduction to R: A hands-on beginner course that walks through R syntax, data types, and vectors. Offers interactive code challenges in the browser (free tier includes limited access).
Coursera: R Programming by Johns Hopkins: A foundational course that introduces R language basics and builds up to functions and data structures. Free to audit.

Other useful resources

Posit Cheatsheets: Compact PDF references for popular packages like dplyr, ggplot2, and sf. Useful for quick lookups and reminders as we work through the SNT code library.
R Bloggers: A community-driven blog aggregator for all things R. Good for discovering tutorials, best practices, and troubleshooting advice.

Python is a high-level, general-purpose programming language that emphasizes readable code and uses significant indentation to structure programs.

Python ranks among the most popular programming languages and enjoys wide adoption in machine learning. Many instructors also use it to introduce programming.

Python’s versatility makes it ideal for data-driven projects, including epidemiology and public health analyses. To use the SNT code library effectively, we recommend the following resources. They teach key Python concepts, data manipulation, and analysis techniques, preparing us to work confidently with routine malaria data and related datasets.

Beginner books and tutorials

Python for Everybody: A widely used beginner-friendly book that introduces Python basics, data structures, and working with data files.
Automate the Boring Stuff with Python: A practical introduction to Python with real-world examples like file handling, spreadsheets, and automation. Great for beginners who want hands-on practice.
Think Python: A free and clear introduction that builds solid foundations in programming concepts using Python.

Self-paced interactive tutorials

Kaggle Learn Python: Short, free interactive courses designed for beginners. Includes Python basics, pandas, and data visualization.
DataCamp: Introduction to Python: A hands-on beginner course focused on Python for data science, covering variables, data structures, and simple plotting (free tier includes limited access).
Google’s Python Class: A free resource for people with a little programming experience who want to learn Python, with written lessons and exercises.

Videos and MOOCs

Coursera: Python for Everybody: A full beginner specialization covering Python fundamentals and data handling. Free to audit.
Harvard CS50’s Introduction to Programming with Python: An excellent free MOOC introducing core programming concepts with Python.

Other useful resources

Python Cheatsheet: A compact reference to common Python commands and syntax.
Real Python: A community-driven platform offering tutorials and best practices for Python, from beginner to advanced.
Stack Overflow Python Tag: A go-to place for troubleshooting and community support.

These resources are more than enough to help build the baseline skills expected here. We do not need to master everything before using the SNT code library, but gaining comfort with core syntax and workflows in the preferred language will make the entire process smoother and make it easier to adapt the code to our own context.

Working Conventions

Folder Structures and File Paths

Organizing project files consistently is one of the most important steps we can take to reduce errors, avoid duplication, and make analysis reproducible. In the SNT code library, we assume that we are working within a structured folder system that mirrors the logic of the SNT workflow, with dedicated folders for inputs (like raw shapefiles or survey data), outputs (like plots or summaries), and coding scripts grouped by topic.

We strongly recommend initializing each country analysis or workstream as its own self-contained folder system, ideally structured as an RStudio Project (see below). This ensures that relative paths will behave as expected and outputs do not get scattered across the machine.

Example structure

Here is a basic version of what this might look like:

your-snt-project/
│
├── 01_data/
│   ├── 1.1_foundational/
│   │   ├── 1.1a_admin_boundaries/{raw,processed}/
│   │   ├── 1.1c_health_facilities/{raw,processed}/
│   │   └── 1.1e_population/
│   │       ├── 1.1ei_national/{raw,processed}/
│   │       └── 1.1eii_worldpop_rasters/{raw,processed}/
│   ├── 1.2_epidemiology/
│   │   ├── 1.2a_routine_surveillance/{raw,processed}/
│   │   └── 1.2b_pfpr_estimates/{raw,processed}/
│   └── 1.5_environment/
│       └── 1.5a_climate/{raw,processed}/
├── 02_scripts/
├── 03_outputs/
│   ├── 3.1_validation/{figures,tables}/
│   ├── 3.2_intermediate_products/{figures,tables}/
│   ├── 3.3_final_snt_outputs/{figures,tables}/
│   └── 3.4_model/{figures,tables}/
├── 04_reports/
├── 05_metadata_docs/
└── README.md

We do not need to copy this exactly. The point is to have a clear separation between raw data, processed data, scripts, and outputs, and to keep the structure consistent across projects and team members.

Use the SNT template function (optional but recommended)

If we are starting from scratch, the sntutils package (R example shown here) includes a utility function that sets up this kind of folder system. This saves time and helps reduce early friction with file management.

sntutils::initialize_project_structure(
  base_path = "sierra_leone_snt"
)

This will generate a directory like the one below, with subfolders already created for foundational data, epidemiology, interventions, environment, health systems, and other major domains of SNT analysis:

sierra_leone_snt/
│
├── 01_data/
│   ├── 1.1_foundational/
│   │   ├── 1.1a_admin_boundaries/{raw,processed}/
│   │   ├── 1.1b_physical_features/{raw,processed}/
│   │   ├── 1.1c_health_facilities/{raw,processed}/
│   │   ├── 1.1d_community_health_workers/{raw,processed}/
│   │   ├── 1.1e_population/
│   │   │   ├── 1.1ei_national/{raw,processed}/
│   │   │   ├── 1.1eii_worldpop_rasters/{raw,processed}/
│   │   │   └── 1.1eiii_displaced_pop/{raw,processed}/
│   │   └── 1.1f_cache_files/{raw,processed}/
│   ├── 1.2_epidemiology/
│   │   ├── 1.2a_routine_surveillance/{raw,processed}/
│   │   ├── 1.2b_pfpr_estimates/{raw,processed}/
│   │   └── 1.2c_mortality_estimates/{raw,processed}/
│   ├── 1.3_interventions/
│   │   ├── 1.3a_itns/{raw,processed}/
│   │   ├── 1.3b_iptp/{raw,processed}/
│   │   ├── 1.3c_smc/{raw,processed}/
│   │   ├── 1.3d_vap/{raw,processed}/
│   │   ├── 1.3e_anc/{raw,processed}/
│   │   └── 1.3f_irs/{raw,processed}/
│   ├── 1.4_drug_efficacy_resistance/{raw,processed}/
│   ├── 1.5_environment/
│   │   ├── 1.5a_climate/{raw,processed}/
│   │   ├── 1.5b_accessibility/{raw,processed}/
│   │   └── 1.5c_land_use/{raw,processed}/
│   ├── 1.6_health_systems/
│   │   └── 1.6a_dhs/{raw,processed}/
│   ├── 1.7_entomology/{raw,processed}/
│   ├── 1.8_commodities/{raw,processed}/
│   ├── 1.9_finance/{raw,processed}/
│   └── 1.10_final/
├── 02_scripts/
├── 03_outputs/
│   ├── 3.1_validation/{figures,tables}/
│   ├── 3.2_intermediate_products/{figures,tables}/
│   ├── 3.3_final_snt_outputs/{figures,tables}/
│   └── 3.4_model/{figures,tables}/
├── 04_reports/
└── 05_metadata_docs/
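For Python users, a comparable skeleton can be created with the standard library. The sketch below is not part of sntutils: the function name `create_project_skeleton` is hypothetical, and it builds only the reduced example structure shown earlier rather than the full tree above.

```python
from pathlib import Path

# data folders that each receive raw/ and processed/ subfolders
DATA_FOLDERS = [
    "1.1_foundational/1.1a_admin_boundaries",
    "1.1_foundational/1.1c_health_facilities",
    "1.1_foundational/1.1e_population/1.1ei_national",
    "1.1_foundational/1.1e_population/1.1eii_worldpop_rasters",
    "1.2_epidemiology/1.2a_routine_surveillance",
    "1.2_epidemiology/1.2b_pfpr_estimates",
    "1.5_environment/1.5a_climate",
]
# output folders that each receive figures/ and tables/ subfolders
OUTPUT_FOLDERS = [
    "3.1_validation",
    "3.2_intermediate_products",
    "3.3_final_snt_outputs",
    "3.4_model",
]
OTHER_FOLDERS = ["02_scripts", "04_reports", "05_metadata_docs"]


def create_project_skeleton(base_path):
    """Create the example folder structure under base_path."""
    base = Path(base_path)
    targets = [base / name for name in OTHER_FOLDERS]
    for folder in DATA_FOLDERS:
        for stage in ("raw", "processed"):
            targets.append(base / "01_data" / folder / stage)
    for folder in OUTPUT_FOLDERS:
        for kind in ("figures", "tables"):
            targets.append(base / "03_outputs" / folder / kind)
    for target in targets:
        # exist_ok makes the function safe to re-run on an existing project
        target.mkdir(parents=True, exist_ok=True)
    return base
```

Calling `create_project_skeleton("sierra_leone_snt")` from the folder where the project should live creates the directories without touching any files already present.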

R Projects and Relative Paths (R Users Only)

For those using R, the SNT code library assumes that each country’s analysis is managed as a dedicated RStudio Project. This means that all scripts, data, outputs, and reports live in the same folder, with the .Rproj file at the root. Setting up an R Project is straightforward; the RStudio documentation explains how to create one.

Working this way has several benefits:

File paths are predictable and portable.
Hard-coded paths that only work on one machine are avoided.
Inputs and outputs are organized and easy to locate.
Errors from mismatched working directories (getwd()) are minimized.
Code becomes easier to share across teams when everyone uses the same structure.

When we open a project via the .Rproj file (e.g., sierra_leone_snt.Rproj), RStudio sets that folder as the working directory. We do not need to use setwd() or manage the working directory manually.

Always use here::here() for file paths

This code library consistently uses here::here() to construct file paths. It ensures that scripts can access inputs and save outputs reliably, no matter the machine or operating system. For example:

# avoid this
readRDS("C:/Users/yourname/Desktop/snt_project/01_data/...")

# use this
readRDS(
  here::here(
    "01_data",
    "1.1_foundational",
    "1.1a_admin_boundaries",
    "sle_spatial_adm3_2021.rds"
  )
)
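Python users can get similar behaviour from the standard library. The sketch below is an illustration rather than library code: `find_project_root` and `project_path` are hypothetical helpers that walk up from a starting folder until they find a marker file (assumed here to be `README.md` or any `.Rproj` file), then build paths relative to that root.

```python
from pathlib import Path


def find_project_root(start=None, markers=("README.md", "*.Rproj")):
    """Walk upwards from start until a folder contains a marker file."""
    current = Path(start or Path.cwd()).resolve()
    for folder in (current, *current.parents):
        # glob() yields matches lazily; any() stops at the first one
        if any(any(folder.glob(pattern)) for pattern in markers):
            return folder
    raise FileNotFoundError("No project marker found above " + str(current))


def project_path(*parts, start=None):
    """Build a path relative to the project root, similar to here::here()."""
    return find_project_root(start).joinpath(*parts)
```

With these helpers, `project_path("01_data", "1.1_foundational", "1.1a_admin_boundaries", "sle_spatial_adm3_2021.rds")` resolves to the same file regardless of which subfolder the script is run from.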

Example: Sierra Leone project folder

If this setup were applied to the Sierra Leone SNT workflow, the top-level folder would look like:

sierra_leone_snt/
│
├── sierra_leone_snt.Rproj
├── 01_data/
├── 02_scripts/
├── 03_outputs/
├── 04_reports/
├── 05_metadata_docs/
└── README.md

By keeping everything in a single, self-contained folder and using relative paths with here::here(), we simplify collaboration, make debugging easier, and ensure our code remains portable and reproducible, core principles throughout the SNT code library. This structure also lays the groundwork for how data is organized and accessed in later steps. For more detail on how inputs, outputs, and file types are structured across the workflow, see the Data Structures section.

Coding Style and Formatting

To support clarity, collaboration, and ease of adaptation across projects, the SNT code library follows a consistent approach to writing code. Equivalent examples for R, Python, and Stata are provided throughout using tabsets. While the syntax differs, the same principles of clean structure, consistent logic, and readable formatting apply across all languages. We are not strict about enforcing every line, but we encourage adopting these habits as they promote clarity and good practice, especially in shared or long-term workflows.

General Style Principles

Prefer tidyverse-style syntax: We write most code using dplyr, tidyr, and other tidyverse tools. This keeps the logic transparent and consistent. We avoid mixing base R idioms with tidyverse unless needed.
Use the base R pipe |> by default: The native pipe has been part of R since version 4.1, requires no extra packages, and avoids unnecessary dependencies. This helps newer users avoid loading magrittr or dplyr just to use pipes. We will still see %>% used in tidyverse-heavy scripts, but for general chaining, |> is preferred and more future-proof.
Use one pipe per line: Avoid writing long chains without breaks. Instead, write each step clearly on a new line. This makes debugging easier and helps others trace the logic.
Always include package namespaces: Write dplyr::mutate() or readxl::read_excel() rather than relying on library(). This avoids ambiguity when functions from different packages share names (e.g., filter(), which exists in both dplyr and the stats package that ships with R), and it makes it easier to understand where each function comes from.
Keep code to 80 characters wide: Line-wrapping helps with readability and makes version control differences much easier to follow. If a line is getting too long, split it logically across lines.
Use comments to structure the script: Start each section with a short header and use comments to explain logic where it is not obvious. Even a few lines like # Import data or # Convert date columns help orient the reader.

To see how these principles come together, the example below shows a well-organized script that follows the conventions we recommend: tidyverse-style syntax, base R pipes, explicit namespaces, structured comments, and clean formatting throughout. It uses DHIS2 data cleaning as the example. This structure improves readability, reduces bugs, and makes it easier to reuse or adapt the code in future workflows.

# set up and data import -------------------------------------------------------

# install pacman if missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# load required packages
pacman::p_load(
  here,   # project-relative paths
  readxl, # excel import
  dplyr,  # data manipulation
  tidyr,  # reshaping
  janitor # cleaning names
)

# import raw DHIS2 data
raw_data <- readxl::read_excel(
  here::here("data", "dhis2_malaria_data.xlsx")
)

# data cleaning and wrangling --------------------------------------------------

raw_data2 <- raw_data |>
  janitor::clean_names() |>
  dplyr::rename(
    adm1 = orgunitlevel2,
    adm2 = orgunitlevel3,
    facility = organisationunitname,
    period = periodname
  ) |>
  tidyr::separate(period, into = c("month", "year"), sep = " ") |>
  dplyr::mutate(
    month = match(
      month,
      c(
        "Janvier",
        "Fevrier",
        "Mars",
        "Avril",
        "Mai",
        "Juin",
        "Juillet",
        "Aout",
        "Septembre",
        "Octobre",
        "Novembre",
        "Decembre"
      )
    ),
    month = base::as.integer(month),
    year = base::as.integer(year)
  )

General Style Principles for Python Scripts

Use clear, explicit imports

Always import all required packages at the top of the script

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

Use short, meaningful aliases for readability

# 'pd' for pandas, 'gpd' for geopandas, 'plt' for matplotlib
df = pd.read_csv("malaria_data.csv")
gdf = gpd.read_file("chiefdom_boundaries.shp")

Avoid from module import * because it hides the source of functions

# Bad practice:
# from pandas import *
# This can cause conflicts and make code unclear

Follow PEP 8 style guidelines

Use 4 spaces per indent and lowercase_with_underscores for variable names

# Example function showing proper indentation
import pandas as pd


def calculate_average_coverage(df):
    total_cases = df["cases"].sum()
    total_population = df["population"].sum()
    # .sum() returns scalars, which have no .div() method, so use /
    average_coverage = total_cases / total_population * 100
    return average_coverage


# Example DataFrame
df = pd.DataFrame({
    "cases": [1000, 2000, 0],
    "population": [5000, 0, 2500]
})

avg_coverage = calculate_average_coverage(df)
print(f"Average coverage: {avg_coverage:.2f}%")

Guard against division by zero when working with DataFrames

# Example DataFrame
import pandas as pd

df = pd.DataFrame({
    "cases": [100, 50, 120],
    "population": [1000, 0, 2000]  # note the zero
})

# Dividing a Series by zero does not raise an error: / and .div() both return inf.
# Mask zero denominators first so the result is NaN instead.
safe_population = df["population"].where(df["population"] > 0)
df["coverage"] = df["cases"].div(safe_population) * 100

print(df)

Use docstrings to explain inputs, outputs, and purpose

def calculate_coverage(cases, population):
    """
    Calculate coverage as a percentage of cases over population.

    Args:
        cases (int or float): Number of cases.
        population (int or float): Total population.

    Returns:
        float: Coverage percentage.
    """
    return (cases / population) * 100

Use method chaining for readability

Chain methods instead of creating multiple intermediate variables

df_cleaned = (
    df.dropna(subset=["cases", "population"])  # Remove rows where 'cases' or 'population' are NaN
      .assign(coverage=lambda x: x["cases"] / x["population"] * 100)  # Calculate coverage %
      .query("coverage <= 100")  # Keep rows where coverage is <= 100
)

Keep one operation per line for clarity

# Each chained step is on its own line for readability
df_filtered = (
    df.dropna(subset=["cases"])
      .assign(coverage=lambda x: x["cases"] / x["population"] * 100)
)

Comment generously and structure the code

Start sections with a header for easy navigation

# ==== 1. Load Data ====
df = pd.read_csv("malaria_data.csv")

Add inline comments for complex or non-obvious logic

df["coverage"] = df["cases"] / df["population"] * 100  # convert to %

Separate sections visually for clarity

# ==== 2. Filter Data ====
df_filtered = df.query("coverage <= 100")

Handle errors and exceptions

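Use try/except blocks so that scripts stop with a clear message instead of an obscure traceback.

try:
    df = pd.read_csv("malaria_data.csv")
except FileNotFoundError as err:
    # Stop here: continuing without the data would only fail later in the script
    raise SystemExit("File not found. Please check the path.") from err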

A consistent coding style makes our work easier to read, review, and reuse. The practices shared here, such as breaking code into clear sections, using descriptive naming, and applying consistent formatting, are not strict rules but helpful habits. They reduce errors, support collaboration, and make it easier to maintain and adapt code over time. For newer users, these are especially good habits to build early.

Data Structures for SNT Workflows

After setting up our working environment and organizing our folder structure, the next step is to ensure that our data is structured in a way that supports analysis, collaboration, and reproducibility. This involves thoughtful design of how data is formatted, both within individual datasets and across the wider data repository.

Why this matters for SNT

SNT relies on combining diverse datasets: routine surveillance, PfPR estimates, intervention coverage, population, entomology, and more. These data must be interoperable: they should align on location, time, and variable definitions. Without structure, analysis becomes ad-hoc, manual, and error-prone. With good structure, workflows become automated, datasets become reusable, and analyses become easier to validate and share.

Data structure applies at two levels:

Structure within datasets: Structure begins with internal clarity. Are columns named in a way that is intuitive and unambiguous? Are values, such as dates, codes, or categories, formatted consistently across all rows? Are geographic units and time variables standardized to support joining, filtering, and comparison? A well-structured dataset is tidy, analysis-ready, and reduces ambiguity before any integration or analysis begins.

Structure between datasets: Structure also means coherence across the different SNT datasets. Can they be reliably joined using common keys like adm1, adm2, year_month, or indicator_code? Are file formats and folder names consistent and intuitive? This ensures datasets can be joined without issues and reside in an organized, navigable repository, making the entire system easier to manage, scale, and trust.

In the sections that follow, we outline the principles and practical steps to make our datasets both tidy and harmonized, ready to plug into any SNT workflow.

Structure Within a Dataset

Structure starts within each dataset. Each dataset should be internally clear, tidy, and analysis-ready, because this forms the foundation for reliable joining, filtering, and aggregation. It should follow tidy data principles, use standardized column names, and be self-contained with clear labels and documentation.

To achieve this, we focus on a few key practices:

1. Should Be Tidy

Tidy data is a concept introduced by Hadley Wickham that defines a simple, consistent way to organize data.

According to this principle:

- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms its own table

This structure minimizes ambiguity and reduces friction during analysis. Instead of spending time reshaping and cleaning poorly organized tables, analysts can focus on more substantive tasks. Tidy datasets make it easier to merge with other sources, such as combining routine case data with population estimates or environmental rasters, and enhance code reusability across shared scripts and analysis pipelines.

For example, consider a population dataset provided in this wide format:

region	district	pop_u5_2020	pop_u5_2021	pop_total_2020	pop_total_2021
Boké	Gaoual	6500	6800	47000	48500

This structure makes it difficult to filter by year or age group. A tidy version reshapes the data into a long format, making it much more usable:

region	district	year	age_group	population
Boké	Gaoual	2020	u5	6500
Boké	Gaoual	2020	total	47000
Boké	Gaoual	2021	u5	6800
Boké	Gaoual	2021	total	48500
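As a minimal sketch, the wide-to-long reshape described above could be done in pandas as follows. The table is retyped from the example; the names pop_wide, column_name, and the regular expression are assumptions made for illustration and are not part of the SNT library.

```python
import pandas as pd

# Wide table from the example above: one column per age group and year
pop_wide = pd.DataFrame({
    "region": ["Boké"],
    "district": ["Gaoual"],
    "pop_u5_2020": [6500],
    "pop_u5_2021": [6800],
    "pop_total_2020": [47000],
    "pop_total_2021": [48500],
})

# Stack the population columns into rows (one row per original column)
pop_long = pop_wide.melt(
    id_vars=["region", "district"],
    var_name="column_name",
    value_name="population",
)

# Split names such as "pop_u5_2020" into an age group and a year
parts = pop_long["column_name"].str.extract(
    r"^pop_(?P<age_group>.+)_(?P<year>\d{4})$"
)

pop_data = (
    pop_long
    .assign(age_group=parts["age_group"], year=parts["year"].astype(int))
    .drop(columns="column_name")
    .loc[:, ["region", "district", "year", "age_group", "population"]]
    .sort_values(["district", "year"], kind="stable")
    .reset_index(drop=True)
)

print(pop_data)
```

Encoding the age group and year in the column name is what makes the wide table hard to filter; once they are ordinary columns, the same code works no matter how many years or groups are added.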

With data in tidy format, filtering and transforming becomes intuitive. For example:

# R
pop_data |>
  dplyr::filter(year == 2020) |>
  dplyr::select(district, year, age_group, population)

# Python (pandas), using method chaining
pop_2020 = (
    pop_data
    .query("year == 2020")
    .loc[:, ["district", "year", "age_group", "population"]]
)

# Alternative pandas syntax: boolean mask, then column selection
pop_data[pop_data["year"] == 2020][["district", "year", "age_group", "population"]]

# Or both steps in a single .loc call
pop_data.loc[pop_data["year"] == 2020, ["district", "year", "age_group", "population"]]

Which gives:

district	year	age_group	population
Gaoual	2020	u5	6500
Gaoual	2020	total	47000

Tidy format makes it easier to filter, group, join, visualize, or model data without constant reshaping. By organizing datasets in a consistent, long format, we reduce friction throughout the workflow. This approach supports clear logic, scalable code, and smoother integration with tools across platforms. For more guidance on tidy data, refer to R for Data Science – Chapter 12: Tidy Data.

2. Should Have Standardized Columns

Tidy data gives us a solid foundation, but structure alone is not enough. To make data truly SNT-ready, we need standardization. Standardization means aligning key elements, including geographies, dates, variable names, and population groups, so datasets can be joined, compared, and analyzed reliably. Without this, merges fail and outputs become misleading. With it, our workflows are simpler, more reusable, and less error-prone.

At its core, standardization is about aligning key dimensions across datasets so they are interoperable:

Geographies: Use consistent names and codes for admin units (adm1, adm2, etc.). Mismatched district names or missing codes can silently break joins.

Example: If adm2 is spelled Gaoual in one dataset and Guoal in another, the two won’t join properly, leading to missing or mismatched records.

Dates: Adopt a unified date format (e.g., year_month = “2023-09”). Reference periods should be aligned so that indicators from different sources reflect the same time frame.

Example: Inconsistent formats like “Jan 2023” in one dataset and “2023-01” in another can cause joins to fail silently or produce misaligned results.

Variables: Column names and definitions must follow shared conventions (e.g., use conf_mic, not confirmed_microscopy; use adm2, not district_name).

Example: If one dataset uses conf_mic and another uses confirmed_microscopy, trying to bind rows or join tables will fail unless columns are renamed first. This is why it is better to agree on a naming convention early on and apply it consistently across all datasets.

Population Groups: If datasets are disaggregated (e.g., by age, sex, or risk group), ensure categories align. For instance, u5, 5_14, 15plus should be used consistently across indicators.

Example: If one dataset uses u5, 5_14, 15plus, but another uses under_5, 5to14, 15_and_above, grouping and comparison across age bands becomes error-prone.

Units of Measure: Align how values are reported. Is prevalence reported as 0.1 or 10%? Are rates per 10,000 or per 100,000? This can affect comparisons and further analysis if not harmonized.

Example: One dataset might report incidence as a percentage (1.2%), another as a decimal (0.012), and a third per 10,000 population. Without harmonizing these units, comparisons or weighted summaries will be misleading.

Investing in standardization upfront reduces downstream friction, enabling cleaner joins, consistent comparisons, and easier reuse across SNT workflows.
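The sketch below shows one way these harmonization steps might look in pandas. The lookup tables reuse the spellings and category names from the examples above; the input rows, the prevalence_pct column, and the helper to_year_month are hypothetical and only serve the illustration. In practice the mappings would be agreed with the SNT team.

```python
import pandas as pd

# Hypothetical lookup tables based on the examples above
COLUMN_MAP = {"district_name": "adm2", "confirmed_microscopy": "conf_mic"}
ADM2_FIXES = {"Guoal": "Gaoual"}
AGE_GROUP_MAP = {"under_5": "u5", "5to14": "5_14", "15_and_above": "15plus"}
MONTH_NUMBERS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def to_year_month(label):
    """Convert a label such as 'Jan 2023' to '2023-01'."""
    month, year = label.split()
    return f"{int(year):04d}-{MONTH_NUMBERS[month]:02d}"


# Made-up input rows that break each convention once
raw = pd.DataFrame({
    "district_name": ["Gaoual", "Guoal"],
    "period": ["Jan 2023", "Feb 2023"],
    "age_group": ["under_5", "15_and_above"],
    "confirmed_microscopy": [120, 95],
    "prevalence_pct": [1.2, 10.0],
})

standardized = (
    raw.rename(columns=COLUMN_MAP)                                  # variables
    .assign(
        adm2=lambda d: d["adm2"].replace(ADM2_FIXES),               # geographies
        year_month=lambda d: d["period"].map(to_year_month),        # dates
        age_group=lambda d: d["age_group"].replace(AGE_GROUP_MAP),  # groups
        prevalence=lambda d: d["prevalence_pct"] / 100,             # units: % to proportion
    )
    .drop(columns=["period", "prevalence_pct"])
)

print(standardized)
```

Keeping the mappings in named dictionaries, rather than scattered through the script, makes them easy to review and to reuse across datasets.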

3. Should Be Self-Contained

With tidy and standardized data in place, the final step is making it self-contained. This means all necessary context, including labels, definitions, units, and versioning, is either embedded or clearly linked. It’s a small but important step that makes data reusable, reproducible, and easy to build on.

Good formatting practices include:

Metadata: Each variable should have labels, units, and clear definitions. For example, it should be obvious whether a value is a count, a percentage, or a rate per 1,000 people.

Metadata doesn’t always live in the dataset itself. It can also include standalone notes or documents that capture important analytical decisions. For example, if reporting rates were calculated differently during the COVID-19 period due to disruptions, this should be explicitly noted. A simple Word document outlining these adjustments, for instance, how malaria indicator values were imputed or interpreted differently in 2020–2021, adds important context. These less visible decisions shape how data should be understood and compared, and documenting them ensures transparency, reproducibility, and credibility across the team and with stakeholders.

We can also use inline comments in code to explain decisions at the point they were made. This is just as important as standalone metadata. It helps other analysts follow our thinking, and reminds us why something was done months down the line.

Data dictionary: A separate file (or tab in a dataset) that explains each column in plain terms: what it means, how it was calculated, and what categories or units are used. This is especially important when datasets are shared or reused by others.

A strong data dictionary does more than describe variables. It also documents key processing steps. For instance, if confirmed_cases was imputed in some districts, the dictionary might note: “Imputed using average values from the same period and district.”

We can even generate this dictionary automatically at the end of the script. For example, write code that extracts variable names, labels, and units into a table and saves it as an Excel file or database. This kind of summary is extremely useful for our own reference, for handovers, and for sharing with the SNT team.

It may also be worth including translated versions of the dictionary, especially when teams operate in multilingual settings. If analysts work in French but stakeholders prefer English, having both versions side by side makes the dataset more accessible. This is straightforward to generate using tools like the gtranslate R package.

Consistent versioning: Files should follow a predictable naming format (e.g., gin_dhis2_processed_2018-2024_v2025-03-28.rds) so that updates are traceable and confusion is avoided.

In this example, gin refers to the country (Guinea), dhis2 is the data source, processed indicates the state of the dataset (processed vs. raw), 2018-2024 is the date range the data covers, and v2025-03-28 is the version date showing when the file was created.
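A small helper can keep file names consistent with this pattern. The function below is a hypothetical sketch, not part of the SNT library; the argument names are assumptions, and the default extension simply reproduces the example above.

```python
from datetime import date


def build_versioned_name(country, source, state, start_year, end_year,
                         version_date, ext="rds"):
    """Assemble <country>_<source>_<state>_<start>-<end>_v<YYYY-MM-DD>.<ext>."""
    return (
        f"{country}_{source}_{state}_{start_year}-{end_year}"
        f"_v{version_date.isoformat()}.{ext}"
    )


name = build_versioned_name("gin", "dhis2", "processed", 2018, 2024, date(2025, 3, 28))
print(name)  # gin_dhis2_processed_2018-2024_v2025-03-28.rds
```

Passing date.today() as version_date stamps the file with the day it was created, so a rerun produces a new version rather than overwriting an older file.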

Together, these steps make SNT datasets self-explanatory, clear enough to stand on their own without someone needing to walk through them. This reduces friction in collaboration and makes data pipelines easier to debug, reuse, and maintain.
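As a rough sketch of the automatically generated data dictionary mentioned above, the code below builds one row per column from a DataFrame and a hand-written table of labels and units. The dataset, the labels, and the function name build_data_dictionary are all hypothetical; real labels and definitions would come from the team's own documentation.

```python
import pandas as pd

# Made-up dataset using the column conventions from this section
data = pd.DataFrame({
    "adm2": ["Gaoual", "Koundara"],
    "year_month": ["2023-01", "2023-01"],
    "conf_mic": [120, 95],
})

# Hypothetical labels, units, and processing notes, maintained by the analyst
VARIABLE_INFO = {
    "adm2": {"label": "District name", "unit": "", "notes": ""},
    "year_month": {"label": "Reporting month", "unit": "YYYY-MM", "notes": ""},
    "conf_mic": {"label": "Cases confirmed by microscopy", "unit": "count", "notes": ""},
}


def build_data_dictionary(df, info):
    """Return one row per column with its type, label, unit, notes, and missing count."""
    rows = []
    for column in df.columns:
        meta = info.get(column, {})  # columns without metadata get empty fields
        rows.append({
            "variable": column,
            "dtype": str(df[column].dtype),
            "label": meta.get("label", ""),
            "unit": meta.get("unit", ""),
            "notes": meta.get("notes", ""),
            "n_missing": int(df[column].isna().sum()),
        })
    return pd.DataFrame(rows)


data_dictionary = build_data_dictionary(data, VARIABLE_INFO)
print(data_dictionary)

# To save alongside the dataset (to_excel needs an Excel writer such as openpyxl):
# data_dictionary.to_excel("data_dictionary.xlsx", index=False)
```

Because the dictionary is rebuilt from the data each time the script runs, it cannot drift out of date when columns are added or renamed; any column missing from VARIABLE_INFO shows up with an empty label, which is an easy prompt to document it.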

Structure Across Datasets

With individual datasets now tidy, standardized, and self-contained, the next step is ensuring they can work together. Structure across datasets is about building a coherent system, one where files are logically organized, consistently formatted, and join-ready. This enables analysis to scale smoothly across domains, time periods, and geographies.

At its core, this step involves two main elements:

1. Organized Repository

A well-organized folder structure reduces the time it takes to find relevant files, lowers the risk of mistakes, and makes it far easier to hand over work or onboard new users. As introduced earlier in the Folder Structure section, this organization should already be in place and consistently followed.

Datasets should be grouped thematically and not all thrown into one folder. For example, keep epidemiology, interventions, and environment data in separate folders, each with their own subfolders.

Each theme should also follow a clear split between:

raw/: untouched, original files. These are the source of truth and should never be edited directly.
processed/: cleaned, harmonized, and analysis-ready datasets.

The idea behind this separation is that it prevents confusion over which file was modified, preserves traceability, and ensures that scripts can always be rerun from the same raw inputs.

Below is an example of a clear folder breakdown for core datasets. Any raw DHIS2 data should be stored in the raw/ folder, while cleaned, analysis-ready versions belong in processed/:

data/
├── 02_epidemiology/
│   └── 2a_routine_surveillance/
│       ├── raw/
│       └── processed/

A clear, thematic layout supports scaling: adding a new country or indicator becomes easier when the structure is predictable. It also makes handover and collaboration less error-prone.
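The layout above can be created and referenced from code rather than by hand. The following is a minimal sketch using pathlib; the function name create_theme_folders is an assumption, and a temporary directory stands in for the real project root so the example does not touch existing files.

```python
import tempfile
from pathlib import Path


def create_theme_folders(root, theme, subtheme):
    """Create raw/ and processed/ under data/<theme>/<subtheme>/ and return both paths."""
    base = Path(root) / "data" / theme / subtheme
    paths = {state: base / state for state in ("raw", "processed")}
    for path in paths.values():
        # parents=True builds missing levels; exist_ok=True makes reruns harmless
        path.mkdir(parents=True, exist_ok=True)
    return paths


# A temporary directory stands in for the project root in this sketch
project_root = tempfile.mkdtemp()
paths = create_theme_folders(project_root, "02_epidemiology", "2a_routine_surveillance")

print(paths["raw"])
print(paths["processed"])
```

Scripts can then read from paths["raw"] and write to paths["processed"], which keeps the raw files untouched and avoids hard-coded paths that break on another machine.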

2. Consistent Formatting and Join-Ready

Many of the principles we covered under Structure Within a Dataset (clear column and row structures, standardized date formats, harmonized admin units) are most effective when applied consistently across all SNT datasets. This is what is meant by consistent formatting across datasets.

When every dataset uses the same conventions (adm1, adm2, year_month, and common variable names), joining becomes easy. We no longer waste time fixing mismatched columns or reconciling differences in formatting. Our data is analysis-ready and flows cleanly into the next SNT step.

Example: Even small differences can silently break joins. If adm2 is spelled Gaoual in one file but Gaoal in another, those rows will not match, so they drop out of the merge or come back with missing values. If dates are 2023-04 in one and April 2023 in another, alignment becomes messy. And if one dataset uses u5 and another under_5, we will spend extra time fixing groupings. These issues are avoidable: standardizing early prevents friction later.

Consistent formatting turns our datasets into interoperable building blocks. It ensures that each dataset speaks the same language, across geographies, dates, variables, and groups. This reduces friction, increases reuse, and makes our SNT workflow easier to scale or hand off. When formatting is inconsistent, even tidy datasets become unreliable. When it is applied consistently, they become reusable and dependable across the SNT pipeline.
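Because mismatched keys do not raise an error, it helps to check every join explicitly. The sketch below uses the indicator and validate options of pandas merge to surface rows that found no match; the two small tables are made up for illustration and reuse the Gaoual and Gaoal spellings from the example above.

```python
import pandas as pd

# Made-up case data keyed by adm2 and year_month
cases = pd.DataFrame({
    "adm2": ["Gaoual", "Koundara"],
    "year_month": ["2023-04", "2023-04"],
    "conf_mic": [120, 95],
})

# Made-up population data with one misspelled district
population = pd.DataFrame({
    "adm2": ["Gaoal", "Koundara"],
    "year_month": ["2023-04", "2023-04"],
    "population": [47000, 30000],
})

merged = cases.merge(
    population,
    on=["adm2", "year_month"],
    how="left",             # keep every row of the case data
    validate="one_to_one",  # raise if keys are duplicated on either side
    indicator=True,         # adds a _merge column showing where each row matched
)

# Rows present in the case data but missing from the population data
unmatched = merged.loc[merged["_merge"] == "left_only", ["adm2", "year_month"]]
print(unmatched)
```

An empty unmatched table is a quick confirmation that the keys line up; a non-empty one lists exactly which districts and periods need to be harmonized before the analysis continues.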

Summary

Establishing a clean setup, from file structure to coding conventions to dataset formatting, is the first and most important step in building reliable SNT workflows. By grounding our work in these principles and staying in close coordination with the SNT team, we create a process that is technically sound and aligned with national priorities. The SNT Code Library offers the tools and structure to get started, but its true value comes when it is adapted, reviewed, and iterated in context. With these foundations in place, we are ready to begin running workflows that are reproducible, scalable, and decision-ready.