
Incidence adjustment 2: incomplete reporting

Overview

A second adjustment accounts for reporting rates (RRs) that vary by area and time period. The number of confirmed cases corrected for testing (N1) is inflated by the fraction of expected records that were not received, giving N2. This step assumes that the non-reported data follow a distribution similar to that of the reported data. Reporting rates can be calculated per health facility (HF) type to avoid over- or underestimating the effect of missing data from smaller or larger health facilities, respectively.

An alternative approach is to impute data for the months with missing values in each health facility. This can be computationally intensive and requires a relatively complete database to inform the imputations appropriately, but it yields a complete database for which no reporting rate adjustment is necessary.

The equation for the second incidence adjustment is:

N2 = N1 / d

Where

  • N2 is the number of cases corrected for both testing and reporting rates;
  • N1 is the number of cases corrected for testing rates (from incidence adjustment 1);
  • d is the reporting rate (records received / records expected), which can be weighted by the type of HF that did not report at a given point in time.
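
As a small worked illustration of the equation, the sketch below applies N2 = N1 / d to a single admin unit and month. The values of `n1` and `d` are made up for the example and do not come from any data set used on this page.

```r
# Hypothetical values for one admin unit in one month
n1 <- 80    # cases already corrected for incomplete testing (N1)
d  <- 0.8   # reporting rate: records received / records expected

# Second adjustment: N2 = N1 / d
n2 <- n1 / d

# Equivalent view: N1 plus the cases attributed to the missing reports
extra_cases <- n1 * (1 - d) / d

n2           # 100
extra_cases  # 20
```

With 80% of expected reports received, the 80 adjusted cases are scaled up to 100, so 20 cases are attributed to facilities that did not report.
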
Objectives
  • TBD

Step-by-Step Instructions

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Load required packages and files

Step 1.1: Load packages

The first step is to install and load the libraries required for this section.

# Install pacman only if it is not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install (if needed) and load the relevant packages
pacman::p_load(
  readxl,    # import and read Excel files
  ggplot2,   # plotting
  rio,       # import and export files
  gridExtra, # plot arrangements
  here,      # build file paths relative to the project root
  stringr,   # clean up names
  xts,       # return first or last element of a vector
  tidyverse, # data manipulation functions (includes dplyr)
  sf,        # spatial features for use in mapping
  scales,    # calculates "pretty" breaks
  knitr      # kable() for printing tables
)

Step 1.2: Import data files

We bring in all the data files used in this section:

  1. The monthly routine data at facility level that we prepared on the incidence adjustment 1 page
  2. The monthly incidence data that we saved on the incidence adjustment 1 page
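
Whatever files are imported, the later steps rely on a specific set of columns. The sketch below is one way to check for them right after import. The helper `check_required_cols()` and the toy data frame are hypothetical and written in base R; the column names are the ones referenced by the code on this page.

```r
# Hypothetical helper: stop early if a data frame lacks columns used later
check_required_cols <- function(df, required, df_name = "data") {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(df_name, " is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Columns referenced by the code on this page
routine_cols <- c("hf_uid", "date_obj", "adm0", "adm1", "adm2", "adm3",
                  "year", "month", "susp", "pres", "conf_tpr")
inc_cols <- c("adm0", "adm1", "adm2", "adm3", "year", "month",
              "adjcases1", "pop")

# Toy stand-in for an imported routine data set (made-up values)
toy_routine <- data.frame(
  hf_uid = "HF001", date_obj = as.Date("2023-01-01"),
  adm0 = "A", adm1 = "B", adm2 = "C", adm3 = "D",
  year = 2023, month = 1, susp = 20, pres = 0, conf_tpr = 10
)

# Passes silently because every required column is present
check_required_cols(toy_routine, routine_cols, "routine_data")
```

In practice the same call would be made on `routine_data` with `routine_cols` and on `inc_data` with `inc_cols`.
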


Step 2: Calculate reporting rate

Here we calculate the monthly reporting rate using the cases adjusted for testing (N1) as our indicator of interest. Using N1 accounts for facility-months in which RDTs were out of stock, so that all cases were presumed and none were tested. We use the monthly facility-level data for this calculation and then summarize at the operational admin level.

The code performs the following steps:

  • Create a variable that takes the value 1 if the facility reported N1 and 0 otherwise

  • Create a variable that takes the facility's first reporting date into account and takes the value 1 if the facility is expected to report N1 and 0 otherwise

  • Count the number of non-NA reports for the indicator of interest (observed reports), aggregated by month and by the admin unit level of analysis

  • Count the number of expected reports by month and admin unit level; here we find the first time the facility reported in the time series to determine its reporting status for the rest of the time series

  • Merge the two datasets (observed reports and expected reports, i.e. reporting rate numerator and denominator)

  • Compute the reporting rate

mon_rep_vars <- routine_data |>
  dplyr::mutate(
    date_obj = as.Date(date_obj),
    # flag facility-months in which suspected or presumed cases were reported
    rep_var = dplyr::if_else(!is.na(susp) | !is.na(pres), 1L, 0L)
  ) |>
  dplyr::group_by(hf_uid) |>
  dplyr::mutate(
    # first date on which the facility reported (NA if it never reported)
    first_month_reported_date = suppressWarnings(
      min(date_obj[rep_var == 1], na.rm = TRUE)
    ),
    first_month_reported_date = dplyr::if_else(
      is.infinite(first_month_reported_date),
      as.Date(NA),
      first_month_reported_date
    ),
    # the facility is expected to report from its first reporting date onwards
    rep_expected = dplyr::if_else(
      !is.na(first_month_reported_date) & date_obj >= first_month_reported_date,
      1L,
      0L
    )
  ) |>
  dplyr::ungroup()

# Define observed reporting for N1, summarize at adm3, and calculate the reporting rate
mon_rep_rate <- mon_rep_vars |>
  # flag facility-months in which N1 was reported
  dplyr::mutate(N1_rep = dplyr::if_else(!is.na(conf_tpr), 1L, 0L)) |>
  # aggregate at adm3 level
  dplyr::group_by(adm0, adm1, adm2, adm3, year, month) |>
  # count expected and observed reports
  dplyr::summarise(
    exp_rep = sum(rep_expected, na.rm = TRUE),
    obs_rep = sum(N1_rep, na.rm = TRUE),
    .groups = "drop"
  ) |>
  # calculate reporting rate
  dplyr::mutate(reprate = obs_rep / exp_rep)

# check whether any reporting rate is greater than 1
mon_rep_rate |>
  dplyr::filter(reprate > 1)

# view data
knitr::kable(head(mon_rep_rate, 15))
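
To make the expected versus observed logic concrete, here is a minimal sketch on toy data with two facilities in one admin unit, where facility "B" only starts reporting in March. It is written in base R so that it runs without the packages loaded above, and all values are made up. It mirrors the steps described in this section but is not a replacement for the code above.

```r
# Toy facility-month data (made-up values); facility "B" first reports in March
months <- as.Date(c("2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01"))
toy <- data.frame(
  hf_uid   = rep(c("A", "B"), each = 4),
  date_obj = rep(months, times = 2),
  susp     = c(20, 15, 25, 18, NA, NA, 9, 11),
  conf_tpr = c(10, NA, 12, 8, NA, NA, 5, NA)
)

# 1 if the facility reported anything that month, 0 otherwise
toy$rep_var <- as.integer(!is.na(toy$susp))

# First reporting date per facility, as a number (Inf if it never reported)
d_num <- as.numeric(toy$date_obj)
first_rep <- ave(ifelse(toy$rep_var == 1, d_num, Inf), toy$hf_uid, FUN = min)

# A facility is expected to report from its first reporting date onwards
toy$rep_expected <- as.integer(d_num >= first_rep)

# 1 if N1 (conf_tpr) was reported that month, 0 otherwise
toy$N1_rep <- as.integer(!is.na(toy$conf_tpr))

# Observed and expected reports per month, then the reporting rate
agg <- aggregate(cbind(rep_expected, N1_rep) ~ date_obj, data = toy, FUN = sum)
agg$reprate <- agg$N1_rep / agg$rep_expected
agg
```

In this toy example the reporting rate is 1 in January and March, 0 in February (the only expected facility did not report N1) and 0.5 in April. Facility "B" does not count towards the denominator before its first report.
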

To adapt the code:

Step 3: Join incidence data with reporting rate data

We start by joining the reporting rate file created above to the incidence file we have been working with. Reporting rates are usually summarized by month and year at the nearest operational admin level above health facilities. Here we use adm3 for illustration, but countries can adapt this to their setting.

Note: it is highly recommended that the first and second adjusted incidence cases be calculated by month.

Step 3.1: Join the datasets

# Join incidence data with reporting rate data, keeping every incidence row
inc_rep_rate <- inc_data |>
  dplyr::left_join(
    mon_rep_rate,
    by = c("adm0", "adm1", "adm2", "adm3", "month", "year"),
    relationship = "one-to-one"
  )

head(inc_rep_rate, 10)
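
After a join like this it can help to confirm that the keys are unique on both sides and to list any incidence rows that did not receive a reporting rate. The sketch below shows one way to do that on toy data in base R; the data frames `inc_toy` and `rep_toy` are hypothetical and use a reduced set of key columns for brevity.

```r
# Toy incidence and reporting rate tables (made-up values)
inc_toy <- data.frame(
  adm3 = c("X", "X", "Y"),
  year = 2023,
  month = c(1, 2, 1),
  adjcases1 = c(40, 55, 30)
)
rep_toy <- data.frame(
  adm3 = c("X", "X"),
  year = 2023,
  month = c(1, 2),
  reprate = c(0.8, 1.0)
)
keys <- c("adm3", "year", "month")

# Each key combination should appear at most once on either side
stopifnot(!anyDuplicated(inc_toy[keys]), !anyDuplicated(rep_toy[keys]))

# Left join: keep every incidence row, add reprate where available
joined <- merge(inc_toy, rep_toy, by = keys, all.x = TRUE)

# Rows without a reporting rate need attention before computing N2
unmatched <- joined[is.na(joined$reprate), keys]
unmatched
```

Here admin unit "Y" has no matching reporting rate, so it is listed in `unmatched`.
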


Step 3.2: Map reporting rate

# Examine values of reporting rate data

ggplot(inc_rep_rate, aes(x = factor(adm3), y = reprate)) +
  geom_boxplot() +
  labs(title = "Distribution of Reporting Rate by Admin3",
       x = "Admin3",
       y = "Reporting Rate") +
  theme_minimal()
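
When there are many admin units, a short numeric summary can complement the boxplot. The sketch below computes the median reporting rate per unit on toy data and flags units below a cut-off. The data, the column name `low_reporting` and the threshold of 0.8 are assumptions for illustration only, not recommended values.

```r
# Toy monthly reporting rates for three hypothetical admin3 units
rr_toy <- data.frame(
  adm3 = rep(c("X", "Y", "Z"), each = 4),
  reprate = c(1.0, 0.9, 1.0, 0.95,
              0.6, 0.7, 0.5, 0.65,
              0.8, NA, 0.85, 0.9)
)

# Median reporting rate per unit (rows with NA are dropped by the formula interface)
rr_median <- aggregate(reprate ~ adm3, data = rr_toy, FUN = median)

# Hypothetical threshold; choose one appropriate to the setting
low_rr_threshold <- 0.8
rr_median$low_reporting <- rr_median$reprate < low_rr_threshold
rr_median
```

In this toy example only unit "Y" falls below the threshold.
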

Step 4: Calculate adjusted incidence (N2)

Step 4.1: Calculate monthly adjusted incidence (N2) cases

Next we calculate the adjusted cases by accounting for reporting rates at adm3 level. Following the equation in the overview, we divide the adjusted 1 cases (N1) by the reporting rate (d). The result is the number of cases expected had all facilities reported; the difference between N2 and N1 is the additional number of cases attributed to non-reporting. Where the reporting rate is missing or zero, N2 cannot be computed and is set to NA.

# here we reuse the name of our original data set (inc_data)
inc_data <- inc_rep_rate |>
  dplyr::mutate(
    # N2 = N1 / d; undefined when the reporting rate is missing or zero
    adjcases2 = dplyr::if_else(!is.na(reprate) & reprate > 0, adjcases1 / reprate, NA_real_),
    # adjusted incidence 2: adjusted cases divided by the population
    adjinc2 = adjcases2 / pop
  )

head(inc_data)
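
The edge cases of the adjustment are easiest to see on a few rows. The sketch below applies N2 = N1 / d (the equation from the overview) to toy data in base R, returning NA where the reporting rate is zero or missing. The helper `adjust_for_reporting()` and all values are hypothetical.

```r
# Hypothetical helper: N2 = N1 / d, returning NA where d is missing or zero
adjust_for_reporting <- function(n1, reprate) {
  ok <- !is.na(reprate) & reprate > 0
  n2 <- rep(NA_real_, length(n1))
  n2[ok] <- n1[ok] / reprate[ok]
  n2
}

# Toy admin-unit months (made-up values)
toy <- data.frame(
  adjcases1 = c(80, 50, 30, 20),
  reprate   = c(0.8, 1.0, 0.0, NA),
  pop       = c(10000, 8000, 5000, 4000)
)
toy$adjcases2 <- adjust_for_reporting(toy$adjcases1, toy$reprate)
toy$adjinc2   <- toy$adjcases2 / toy$pop
toy
```

The first row is inflated from 80 to 100 cases, the second is unchanged because reporting was complete, and the last two are NA because no adjustment can be derived from a zero or missing reporting rate. How such months should be handled (for example by imputation) is a decision for the analysis team.
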

Step 4.2: Calculate annual adjusted incidence 2

For the purposes of SNT, annual incidence estimates are more useful for comparing between years and admin levels.

## Aggregate the data set by year
adj2_inc_ann <- inc_data |>
  dplyr::group_by(adm0, adm1, adm2, adm3, year) |>
  dplyr::summarise(
    # sum counts and monthly incidence over the year
    # (crudeinc is listed explicitly because it is used below)
    dplyr::across(
      c(susp:conf_tpr, crudeinc, adjcases1, adjinc1, adjcases2, adjinc2),
      \(x) sum(x, na.rm = TRUE)
    ),
    # average population and rates over the year
    dplyr::across(c(pop, test_rate, tpr, reprate), \(x) mean(x, na.rm = TRUE)),
    .groups = "drop"
  )

head(adj2_inc_ann)

# express annual incidence per 1,000 population
adj2_inc_ann <- adj2_inc_ann |>
  dplyr::mutate(
    ann_crude = crudeinc * 1000,
    ann_adjinc1 = adjinc1 * 1000,
    ann_adjinc2 = adjinc2 * 1000
  )

# view the first observations of the data set
head(adj2_inc_ann, 10)
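
The annual figures can be sanity-checked by computing incidence per 1,000 in two ways: summing the monthly incidence, and dividing annual cases by the mean population. The sketch below does this on toy data in base R with made-up values. The two results agree here because the population is constant within the year, which is an assumption of the example.

```r
# Toy monthly data for one admin unit over three months (made-up values)
monthly <- data.frame(
  adm3 = "X", year = 2023, month = 1:3,
  adjcases2 = c(100, 150, 50),
  pop = 10000
)
monthly$adjinc2 <- monthly$adjcases2 / monthly$pop

# Sum cases and monthly incidence; average the population
annual <- aggregate(cbind(adjcases2, adjinc2) ~ adm3 + year, data = monthly, FUN = sum)
annual$pop <- aggregate(pop ~ adm3 + year, data = monthly, FUN = mean)$pop

# Incidence per 1,000 population, computed two ways
annual$ann_adjinc2 <- annual$adjinc2 * 1000
annual$check <- annual$adjcases2 / annual$pop * 1000
annual
```

Both columns give 30 cases per 1,000 population. If the monthly population varied within a year, the two calculations would differ slightly.
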

Step 5: Save Updated Files

Now we save the routine data and the monthly and annual incidence data as CSV files.

# Save routine data
rio::export(
  routine_data,
  file = here("english/library/stratification", "epidemiological/data_r", "routine_data_hf.csv")
)

# Save monthly incidence data set
rio::export(
  inc_data,
  file = here("english/library/stratification", "epidemiological/data_r", "monthly_inc_data.csv")
)

# Save annual incidence data set
rio::export(
  adj2_inc_ann,
  file = here("english/library/stratification", "epidemiological/data_r", "annual_inc_data.csv")
)

Summary

TBD

Full code

#===============================================================================
# End of Script
#===============================================================================
 

©2026 Applied Health Analytics for Delivery and Innovation. All rights reserved