Incidence overview and crude incidence

Overview

Clinical malaria case incidence represents the number of newly diagnosed malaria cases during a defined period in a specified population, usually measured as total number of clinical cases per 1000 person-years at risk (cases per 1000 population per year).

In most malaria endemic areas, the measurement of true population-level clinical malaria incidence is affected by factors along the cascade from onset of symptoms to reporting of a case, all of which differ by area and over time:

  • Decision to access care
  • Barriers to access care
  • Choice of care (public, private or informal)
  • Coverage of the surveillance system, as most integrated facilities are from the public health sector only
  • Quality of care and availability of diagnostic tools to confirm a suspected malaria case
  • Health facility reporting of detected cases into routine surveillance

These factors all affect crude incidence as measured in routine surveillance. However, crude incidence can be adjusted to attempt to account for some of these factors.

This section demonstrates how to calculate crude malaria incidence from routine case reporting. It assumes routine data has been extracted, cleaned, and reviewed as described in the code library sections on routine data.

Objectives
  • TBD

Key concepts for incidence and incidence adjustments

Data sources for incidence estimation

Several data sources are required to produce an adjusted estimate of clinical malaria incidence that reflects population-level incidence.

Routine malaria data – Variables

Several malaria data elements collected routinely and reported to the national Health Management and Information Systems (HMIS) can be used to estimate clinical malaria incidence in nearly all malaria endemic countries. The minimally essential data elements for incidence calculation include:

  • Uncomplicated confirmed malaria cases per parasite strain and diagnostic tool: The temporal accuracy of confirmed cases can be affected by the changes in diagnostic tools with different diagnostic accuracies used throughout time. The number of confirmed cases reported for specific HFs can also be affected by the main diagnostic tool used and the skills of the health staff to read and interpret the results.

  • Tested cases per diagnostic tool: Usually, HFs report the number of microscopy and RDT tests conducted. Before adding the two to obtain the total number of patients tested, it is important to consider whether doing so will lead to duplication. This is particularly relevant when microscopy is used to confirm an RDT result in the same HF, or when referred patients are tested twice within the same district, once in the peripheral HF and again in the referral hospital.

  • Presumed malaria cases: Suspected malaria cases (usually fever cases) that are not confirmed to be malaria infections but are presumed to be so, and are usually treated as malaria cases. Presumed cases are sometimes reported directly by HFs as an individual data element, but often they are calculated instead. NMPs use several calculations depending on what they consider most appropriate: a) number of suspected cases minus tested; b) number of malaria treatments distributed minus confirmed cases; etc. Whether reported directly or calculated, this variable is affected by substantial bias that depends on health staff’s understanding of the concept and on the clinical perception of a suspected case. Presumed cases are generally reported in higher numbers in areas affected by low stocks or complete stock-outs of diagnostics.
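
The two calculations for presumed cases named above can be sketched in base R as follows. This is a minimal illustration on invented numbers, not the method of any particular NMP. The column `treated` and the choice to flag negative differences and floor them at zero are assumptions made for the example.

```r
# Toy facility-month records; all values are invented for illustration
hf_month <- data.frame(
  hf      = c("HF A", "HF B", "HF C"),
  susp    = c(120, 80, 60),   # suspected malaria cases
  test    = c(100, 80, 65),   # suspected cases tested (RDT or microscopy)
  conf    = c(40, 30, 20),    # confirmed cases
  treated = c(55, 30, 18)     # malaria treatments given (hypothetical column name)
)

# Option a) suspected cases minus tested cases
hf_month$pres_a <- hf_month$susp - hf_month$test

# Option b) malaria treatments distributed minus confirmed cases
hf_month$pres_b <- hf_month$treated - hf_month$conf

# Assumption: a negative difference signals a data inconsistency, so it is
# flagged for review and floored at zero instead of kept as a negative count
hf_month$flag_a <- hf_month$pres_a < 0
hf_month$flag_b <- hf_month$pres_b < 0
hf_month$pres_a <- pmax(hf_month$pres_a, 0)
hf_month$pres_b <- pmax(hf_month$pres_b, 0)

print(hf_month)
```

In this toy data the two options give different results for HF A (20 versus 15 presumed cases), which is why the choice of calculation should be agreed and documented before incidence is estimated.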

Routine data - Aggregations

Data should ideally be extracted at the health facility-month level to allow a detailed review of completeness and quality. After review, the routine data need to be aggregated to the selected unit of analysis (usually the district) to estimate incidence at that level. When aggregating per district, pay special attention to the data reported by regional or national hospitals, for two reasons:

  1. Catchment area: Such hospitals have catchment populations that extend beyond a single district. To make the most of hospital data, countries may choose to estimate malaria incidence for the district where the hospital is located twice, first including and then excluding the data from the regional hospital. Comparing the two estimates shows the impact of a regional or national hospital on the results for that district.

  2. Duplications: Patients with malaria detected in regional hospitals may sometimes be referred from peripheral health facilities or district hospitals. Such cases are usually admitted as severe malaria cases and should not be reported as uncomplicated malaria in the hospital. In settings where the hospital reporting clearly differentiates outpatient uncomplicated malaria patients from referred severe malaria patients (that were detected, and likely reported in another health facility), the risk of double counting a case is low. However, settings with less reliable surveillance protocols should discuss the pros and cons of using hospital data to estimate incidence.
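
The include-then-exclude comparison described under "Catchment area" could look like the sketch below. It uses invented annual totals, and the `hf_type` column and its labels are hypothetical; a real master list may code facility type differently.

```r
# Toy annual confirmed cases by facility; all names and values are invented
hf_cases <- data.frame(
  adm2    = c("District 1", "District 1", "District 1",
              "District 2", "District 2"),
  hf      = c("Health Centre A", "Health Centre B", "Regional Hospital",
              "Health Centre C", "Health Centre D"),
  hf_type = c("health centre", "health centre", "regional hospital",
              "health centre", "health centre"),   # hypothetical coding
  conf    = c(400, 350, 900, 500, 300)
)

# Toy district populations
district_pop <- data.frame(
  adm2 = c("District 1", "District 2"),
  pop  = c(50000, 40000)
)

# Incidence per 1000 with and without the regional hospital
inc_compare <- hf_cases |>
  dplyr::group_by(adm2) |>
  dplyr::summarise(
    conf_all     = sum(conf),
    conf_no_hosp = sum(conf[hf_type != "regional hospital"]),
    .groups = "drop"
  ) |>
  dplyr::left_join(district_pop, by = "adm2") |>
  dplyr::mutate(
    inc_all     = conf_all / pop * 1000,
    inc_no_hosp = conf_no_hosp / pop * 1000,
    inc_diff    = inc_all - inc_no_hosp
  )

print(inc_compare)
```

A large `inc_diff` in a district that hosts a hospital suggests that the hospital is drawing patients from outside the district and that the estimate should be interpreted with care.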

Health facility master list

An updated and informative master list of health facilities, including the period during which each one is functional, makes it possible to measure the actual number of HFs that should have reported each month. This is the expected number of records for a given variable, and therefore the denominator for completeness rates. A master list that also gives the HF type (hospital, district health centre, peripheral health centre of different levels, etc.) and the ownership of the HF (public, private, or other) helps in understanding the data submitted by each HF and how it should be treated when calculating incidence.
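
As a rough sketch of how a master list yields the expected number of reports, the base R example below counts the facilities that were functional in each month and uses that count as the denominator of a reporting rate. The column names (`open_date`, `close_date`) and the rule for what counts as functional are assumptions for illustration; the reporting rate itself is covered in the routine data sections of the library.

```r
# Toy master list; facility names, dates and column names are invented
master <- data.frame(
  hf         = c("HF A", "HF B", "HF C"),
  open_date  = as.Date(c("2015-01-01", "2023-03-01", "2015-01-01")),
  close_date = as.Date(c(NA, NA, "2023-01-31"))   # NA = still functional
)

# Reports actually received per month (toy values)
received <- data.frame(
  month_start = as.Date(c("2023-01-01", "2023-02-01",
                          "2023-03-01", "2023-04-01")),
  n_reports   = c(2, 1, 1, 2)
)

# Assumption: a facility is expected to report in a month if it opened on or
# before the first day of that month and had not closed before that day
received$n_expected <- vapply(
  seq_len(nrow(received)),
  function(i) {
    m <- received$month_start[i]
    sum(master$open_date <= m &
          (is.na(master$close_date) | master$close_date >= m))
  },
  integer(1)
)

# Reporting completeness = reports received / reports expected
received$reporting_rate <- received$n_reports / received$n_expected

print(received)
```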

Treatment seeking data

Routine malaria data, even in its most complete form (if all malaria fever cases were tested and reported), only provides information on those fevers that are detected through the surveillance system, which generally covers public HFs and, in some countries, part of the formal private sector. However, in nearly all areas where malaria is or has been endemic, an additional number of clinical malaria episodes are expected to be detected through the private sector (formal or informal), or never detected by any sector because patients do not seek care for a fever and are therefore not tested. Areas where public care seeking is low also tend to have low access to care and to other malaria or health services. These areas may therefore report far fewer cases than would be expected if all febrile cases in the community were tested. Consequently, using the incidence reported to the routine surveillance system in these areas may underestimate the true clinical malaria incidence, and therefore the true transmission intensity.

Demographic information

Reliable population denominators are required to estimate community-level incidence. Most malaria endemic countries rely on official population projections provided by the National Institute of Statistics, based on the last available census, the date of which varies by country. NMPs should use whichever population source they deem most appropriate to calculate incidence at the selected unit of analysis (usually the district, but it can be lower). Countries with special populations that may require specific targeted malaria approaches should also explore the most reliable data available on the presence of these groups per unit of analysis.

Methods for incidence estimation

Routine malaria data are affected by various factors that must be considered when estimating the expected clinical malaria incidence in a given area and point in time, within and beyond the public health sector. These factors include testing and reporting rates within the public health sector, spatial and temporal changes in fever care-seeking behaviour in the public and private sectors, and the share of people who do not generally seek care. WHO recommends a stepwise correction approach to control for these common factors, with the objective of producing a community-level estimate of clinical malaria incidence. Countries are advised to conduct this analysis using monthly data, because the impact of several of the factors being adjusted for (testing and reporting rates) varies substantially within a given year, particularly in countries with intense seasonality.

Step-by-Step Instructions

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Load packages and data

Step 1.1: Load required packages

The first step is to install and load the libraries required for this section.

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install (if needed) and load the packages used on this page
pacman::p_load(
  rio,        # import and export files
  here,       # build file paths relative to the project root
  tidyverse,  # data manipulation (dplyr, tidyr, stringr)
  sf          # spatial features, for reading the shapefile
)

Step 1.2: Load routine case data, population data and shape file

Here we show how to import the datasets needed for this section. The SLE routine dataset we use contains monthly reported counts of suspected, tested, confirmed and treated malaria cases, among other variables, for each health facility in the country. The dataset spans 9 years, from 2015 to 2023. We also load the shapefile for the operational administrative level (adm3) and the population data at the same admin level for the same years as the routine case data.


Step 2: Prepare data for incidence estimation

Step 2.1: Check the structure of the data sets

Check the structure of the datasets to understand the types and formats of the variables, and use summary() to see the range of values of each variable.

# Check the structure of the routine data
str(clean_malaria_routine_data_final)

# Summarize the routine dataset to observe the range of values
summary(clean_malaria_routine_data_final)

# Summarize the population data
summary(pop)

We observe that the population data is in wide format, with one column for each population year. We will reshape it to long format in Step 3.1 so that it can be used in our analysis.

Step 2.2 Clean up the admin names across the datasets

Clean up the admin names and select the variables needed for the incidence estimations. Here we assign a new name to the routine case data so that the original stays separate from the dataset used for analysis. We do the same for the population data.

# clean up the adm1, adm2 and adm3 names in the routine data set
inc_data <- clean_malaria_routine_data_final |>
  dplyr::mutate(
    adm1 = stringr::str_to_title(adm1),
    adm2 = stringr::str_to_title(adm2),
    adm3 = stringr::str_to_title(adm3)
  ) |>
  # select the relevant malaria variables
  dplyr::select(adm1, adm2, adm3, hf, year, month,
                susp, test, conf, pres) |>
  # make sure year and month are numeric
  dplyr::mutate(year = as.numeric(year),
                month = as.numeric(month))

# clean up the adm1, adm2 and adm3 names in the population data set
pop_data <- pop |>
  dplyr::mutate(
    adm1 = stringr::str_to_title(adm1),
    adm2 = stringr::str_to_title(adm2),
    adm3 = stringr::str_to_title(adm3)
  )

Step 2.3: Aggregate data at operational unit level

Countries should decide the appropriate administrative level for the analysis. In this example, the dataset is at the facility level while the operational unit for decision-making is adm3, so we aggregate the case data at adm3 level.

# Summarise data by month-year for each adm3 level
inc_data  <-  inc_data |>
  dplyr::group_by(adm1, adm2, adm3, year, month) |> 
  dplyr::summarise(across(susp:pres, ~sum(.x, na.rm=TRUE))) |> 
  dplyr::ungroup() 
 
# visualize the first observations of the data set
head(inc_data)

Step 3: Join case data and population data

To compute crude incidence estimates, we need the population figures for each year of data in our incidence dataset.

Step 3.1: Reshape population data if needed

In most cases, population data comes in wide format, with a separate column for each year’s population. This is difficult to use when the incidence data has one column for year and another for month. To get around this, reshape the population dataset from wide to long format so that it aligns with the routine dataset.

# first convert the cleaned pop data (pop_data, from Step 2.2) into long format
# and create a numeric variable for year
pop_long <- pop_data |>
  tidyr::pivot_longer(
    cols = pop2015:pop2023,
    names_to = "pop_year",
    values_to = "pop"
  ) |>
  # "pop2015" -> 2015
  dplyr::mutate(year = as.numeric(sub("pop", "", pop_year))) |>
  dplyr::select(adm1, adm2, adm3, year, pop)

# visualize the first observations of the data set
head(pop_long)

Step 3.2: Join case data and population data

Now that the population data is in long format, we can join it to the routine dataset by adm3 and year. It is advisable to use the population data as the base (left-hand) table of a right join: every record in the routine data is kept, and any record that could not be matched to a population figure appears with a missing pop value.

# join the new pop_data to the incidence data
inc_data_pop <- pop_long |>
  dplyr::right_join(inc_data, by = c("adm3", "year"),
          relationship = "one-to-many") 

# visualize the first observations of the data set
head(inc_data_pop)

The data now has duplicate columns for adm1 and adm2, identified by the suffixes .x and .y, because both columns exist in each dataset but were not used as join keys. We need to keep only one of each. The next step shows how to do that.

Step 3.3: Clean up column names

If the two datasets share variable names that are not used as join keys, R adds a suffix to the common variables to show which dataset each one came from. We therefore rename the columns we want to keep and select the final set of variables to retain.

# keep one copy of adm1 and adm2 and drop the .x/.y duplicates
# (the .y columns come from the routine data, which the right join keeps in full)
inc_data_pop <- inc_data_pop |>
  dplyr::mutate(adm1 = adm1.y,
                adm2 = adm2.y) |>
  dplyr::select(adm1, adm2, adm3, year, month,
                pop, susp, test, conf, pres)

# visualize the first observations of the data set
head(inc_data_pop)
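
Before calculating incidence, it can help to list the routine records that did not receive a population figure in the join. The sketch below reproduces the same right join on two small invented tables, one of which contains a deliberately mismatched admin name, and then filters for a missing pop. On the real data, the same filter could be applied to inc_data_pop.

```r
# Toy routine data; "Area-B" is deliberately spelled differently from the
# population table so that it fails to match (all values are invented)
cases_toy <- data.frame(
  adm3  = c("Area A", "Area A", "Area-B"),
  year  = c(2023, 2023, 2023),
  month = c(1, 2, 1),
  conf  = c(150, 130, 90)
)

# Toy population data in long format
pop_toy <- data.frame(
  adm3 = c("Area A", "Area B"),
  year = c(2023, 2023),
  pop  = c(12000, 8000)
)

# Right join: every routine record is kept, matched or not
joined <- pop_toy |>
  dplyr::right_join(cases_toy, by = c("adm3", "year"))

# Routine records without a population match have a missing pop
unmatched <- joined |>
  dplyr::filter(is.na(pop)) |>
  dplyr::distinct(adm3, year)

print(unmatched)
```

Any admin unit listed in `unmatched` points to a naming or coverage problem that should be resolved in the name-cleaning step rather than left as a missing denominator.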

Step 4: Calculate monthly and annual incidence

Incidence is estimated by dividing the crude or adjusted number of cases in each district by the population at risk. If the analysis is conducted at monthly level, monthly incidence is the crude or adjusted number of cases in a month divided by the estimated population of the district for that year (the population is the same for all months of a given year). For annual incidence, the monthly crude or adjusted cases are summed over the year and divided by the annual population projection for that district-year. Multiplying by 1000 expresses the result as cases per 1000 population.
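
The arithmetic described above can be checked on a single district-year. The numbers below are invented for illustration; the point of the sketch is that, because the population is held constant within a year, the twelve monthly rates add up to the annual rate.

```r
# Toy values for one district-year; all numbers are invented
pop_2023   <- 50000   # annual population projection
conf_month <- c(120, 100, 90, 80, 150, 300, 450, 500, 400, 250, 160, 100)

# Monthly incidence per 1000 population (same denominator every month)
inc_month <- conf_month / pop_2023 * 1000

# Annual incidence per 1000: cases summed over the year / annual population
inc_annual <- sum(conf_month) / pop_2023 * 1000

# With a constant denominator, the monthly rates sum to the annual rate
rates_agree <- isTRUE(all.equal(sum(inc_month), inc_annual))

print(round(inc_month, 1))
print(inc_annual)
print(rates_agree)
```

This equivalence would not hold if a different population were used for each month, in which case annual incidence should be computed from annual case totals and a single annual denominator.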

Step 4.1: Calculate Monthly Crude Incidence

Crude incidence: Calculating crude incidence requires two variables: confirmed cases and population. We assign the result back to inc_data so that the same name can be used in subsequent analyses.

## Calculation of Monthly Crude Incidence
inc_data <- inc_data_pop |>
  dplyr::mutate(
    crudecases = conf,             # crude cases = confirmed cases
    crudeinc = crudecases / pop    # cases per person; multiply by 1000 for cases per 1000
  )

# visualize the first observations of the data set
head(inc_data)

Step 4.2: Calculate Annual Crude Incidence

In the section above, we calculated monthly incidence estimates for each adm3 unit. Monthly estimates are of limited operational use on their own, unless we want to answer further questions such as identifying months of missed “epidemics”. For the purposes of SNT, annual incidence estimates are more useful for comparing years and admin units.

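## Aggregate the dataset by year
cr_inc_ann <- inc_data |>
  dplyr::group_by(adm1, adm2, adm3, year) |>
  dplyr::summarise(
    # sum monthly counts and monthly rates within each adm3-year
    dplyr::across(susp:crudeinc, ~ sum(.x, na.rm = TRUE)),
    # population is constant within a year, so the mean returns that value
    pop_mean = mean(pop, na.rm = TRUE),
    .groups = "drop"
  )

# calculate annual crude incidence per 1000 population
# (monthly rates share one denominator within a year, so their sum is the annual rate)
cr_inc_ann <- cr_inc_ann |>
  dplyr::mutate(ann_crude = crudeinc * 1000)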

# visualize the first observations of the data set
head(cr_inc_ann)

Step 5: Save Files

Now we save the monthly and annual incidence datasets as CSV files.

# Save monthly incidence data set
rio::export(
  inc_data,
  file = here::here("english/library/stratification/epidemiological/data_r/monthly_inc_data.csv")
)

# Save annual incidence data set
rio::export(
  cr_inc_ann,
  file = here::here("english/library/stratification/epidemiological/data_r/annual_inc_data.csv")
)

Summary

TBD

Full code

#===============================================================================
# End of Script
#===============================================================================
 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved