Incidence adjustment 1: incomplete testing

Overview

First adjustment: To adjust for incomplete testing rate, a first correction is applied to the number of presumed cases using the test positivity rate (TPR = confirmed / tested) and added to the number of confirmed malaria cases by either RDT or microscopy. This adjustment allows estimating the number of additional cases not tested that were likely to be true malaria cases (N1), assuming that the TPR among the presumed is similar to the TPR among the cases tested. the formula for estimating the first incidence adjustment is given by; N1 = a + (b*(a/c))

Where

N1 is the corrected number of cases for testing rates
a is the confirmed malaria cases reported from the public sector;
b is the presumed cases (not tested but treated as malaria cases);
c is the tested cases;

Objectives

Step-by-Step Instructions

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Load packages and datasets

Step 1.1: Load required packages

The first step is to install and load the libraries required for this section.

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# install or load relevant packages
pacman::p_load(
  readxl,    # import and read Excel files
  ggplot2,   # plotting
  rio,       # for importing and exporting files
  gridExtra, # plot arrangements
  here,    # shows path to file
  stringr,    # clean up names,
  xts,       # return first or last element of a vector
  tidyverse,  # contains functions for data manipulations
  sf,          # spatial features for use in mapping
  scales      # calculates "pretty" breaks
)

Step 1.2: Load incidence data file and adm3 shapefile

We bring in the monthly incidence data we saved in the crude incidence page as well as the facility level cleaned routine data we will be using to calculate test positivity rate.

Step 2: Prepare and examine test positivity rates (TPR)

We then calculate monthly testing rates and test positivity rates. We will conduct the calculation at the facility level and later join the datasets to the incidence data we have been working with.

Step 2.1: Calculate testing rate and test positivity rate

## Calculating Adjusted Incidence 1
routine_data <- clean_malaria_routine_data_final |>
  # calculate testing rate
  dplyr::mutate(test_rate = case_when(
     susp == 0 & test == 0 ~ NA_real_,  # if both susp and test are 0, set to NA
     susp == 0 ~ NA_real_, # if susp is 0 but test is not, set to NA
     TRUE ~ test / susp  # perform normal cacluation if otherwise
  ),
  # Perform a similar algorithm for tpr
  # calculate test positivity rate
        tpr = case_when(
          test == 0 & conf == 0 ~ NA_real_,  # if both test and conf are 0, set to NA
          test == 0 ~ NA_real_, # if test is 0 but conf is not, set to NA
          TRUE ~ conf/test  # perform normal cacluation if otherwise
        )
  )
summary(routine_data)

############################################################### THE FOLLOWING STEPS ARE STILL UNDER REVIEW SO YOU CAN SKIP TO STEP 2.4 ############################################################### #### Step 2.2: Visualize testing rate

Step 2.2.1 Merging shapefiles to data

In Step 1.2, we noticed there are differences adm2 and adm3 names in the shapefile and the incidence data. If we don’t fix this, we may not be able to visualize our data correctly on map. We will adapt codes from Section 2.1 - Merging shapefiles with tabular data to clean the adm names of the incidence data to make it similar to the shapefile.

Looking at the incidence data extract, the key issue are at the adm1, adm2 and adm3 levels. In the incidence data, adm1 is not the region as in the shapefile. Instead, it repeats the district name, while adm2 carries the council structure such as District Council, City Council, or Municipal Council. By contrast, the shapefile follows a geographic hierarchy: adm1 = region, adm2 = district, adm3 = chiefdom. This mismatch means DHIS2 data lacks a proper regional tier, and its names are programmatic rather than purely geographic.

The fix involves cleaning and standardizing names by converting them to uppercase and removing suffixes like District and Chiefdom. Once names are harmonized, the regional tier can be rebuilt by grouping districts into their correct higher-level categories.

# initial cleaning
inc_data <- inc_data |>
  dplyr::mutate(
    adm0 = toupper(adm0),
    adm2 = toupper(adm1),
    adm3 = toupper(adm3),
    adm2 = stringr::str_replace(adm2, stringr::fixed(" DISTRICT"), ""),
    adm3 = stringr::str_replace(adm3, stringr::fixed(" CHIEFDOM"), "")
  ) |>
  dplyr::mutate(
    adm1 = dplyr::case_when(
      adm2 %in% c("KAILAHUN", "KENEMA", "KONO") ~ "EASTERN",
      adm2 %in% c("BOMBALI", "FALABA", "KOINADUGU", "TONKOLILI") ~ "NORTH EAST",
      adm2 %in% c("KAMBIA", "KARENE", "PORT LOKO") ~ "NORTH WEST",
      adm2 %in% c("BO", "BONTHE", "MOYAMBA", "PUJEHUN") ~ "SOUTHERN",
      adm2 %in% c("WESTERN AREA RURAL", "WESTERN AREA URBAN") ~ "WESTERN"
    )
  )

# check DHIS2 data and shapefile
cli::cli_h2("Incidence data (initial cleaning) ")

inc_data |>
  dplyr::distinct(adm0, adm1, adm2, adm3) |>
  dplyr::arrange(adm0, adm1, adm2, adm3) |>
  head(n = 10)

cli::cli_h2("Adm3 shapefile")
shp_adm3 |>
  dplyr::distinct(adm0, adm1, adm2, adm3) |>
  dplyr::arrange(adm0, adm1, adm2, adm3) |>
  head(n = 10)

Step 2.2.2: Assessing whether the two datasets can join

The outputs in Step 2.2.1 now show that administrative names (adm0, adm1, adm2, and adm3) in both datasets follow a consistent format: uppercase, and organized within the same hierarchy. With regions at the top, districts in the middle, and chiefdoms at the third level, both datasets are now aligned. This alignment allows us to attempt an inner join on these fields to merge the shapefile with the DHIS2 dataset.

To achieve a complete match, we must be clean and aligned incidence names with the shapefile naming conventions.

Step 2.2.2.1: Resolving mismatches with sntutils::prep_geonames()

Interactive harmonization (prep_geonames()) Uses the same preprocessing steps, then suggests likely matches for user confirmation. Decisions are stored and reused in future runs.

sntutils::prep_geonames() harmonizes the administrative hierarchy of a target dataset (e.g. DHIS2) with a reference dataset (e.g. a shapefile). The function works hierarchically:

Start with adm0 (country). Progress through adm1, adm2, and adm3. At each level, check if target values exist in the reference. If not, apply: uppercasing and trimming accent and punctuation removal string distance matching (Levenshtein, Jaro-Winkler etc.,) to suggest best matches

Unmatched units are returned with ranked suggestions by similarity score. The process is context-aware: names are only compared within their parent unit (e.g. adm2 only compared within the same adm1). This reduces false positives and keeps matches geographically consistent.

Corrections can be applied interactively or scripted. To streamline workflows, use the cache_path argument to save decisions. Cached corrections are automatically reused on updated datasets, ensuring consistency across reruns and team members.

R

sntutils::prep_geonames(
  target_df = inc_data,
  lookup_df = shp_adm3,
  level0 = "adm0",
  level1 = "adm1",
  level2 = "adm2",
  level3 = "adm3",
  method = "lv",
  cache_path = here::here("english/library/stratification/epidemiological/data_r/geoname_cache.rds")
)

Step 2.2.2.2: Verify the final join with the shapefile

With all unmatched names resolved using prep_geonames(), the Incidence data is now aligned with the shapefile structure. Corrections are saved in the cache for consistent reuse.

Let’s now verify that the cleaned incidence data fully matches the shapefile, ensuring no unmatched units remain.

R

# get distinct admin names from incidence data
inc_admins <-
  inc_data |>
  dplyr::distinct(adm0, adm1, adm2, adm3)

# drop geometry for name-only joins
shp_names <-
  shp_adm3 |>
  sf::st_drop_geometry() |>
  dplyr::distinct(adm0, adm1, adm2, adm3)

# inner join to count matches (polygons ∩ inc)
shp_with_inc <-
  shp_names |>
  dplyr::inner_join(
    inc_admins,
    by = c("adm0", "adm1", "adm2", "adm3")
  )

# polygons with no matching dhis2 admin (shapefile ▷ dhis2)
polygons_without_inc <-
  shp_names |>
  dplyr::anti_join(
    inc_admins,
    by = c("adm0", "adm1", "adm2", "adm3")
  )

# dhis2 admins with no matching polygon (dhis2 ▷ shapefile)
inc_without_polygons <-
  inc_admins |>
  dplyr::anti_join(
    shp_names,
    by = c("adm0", "adm1", "adm2", "adm3")
  )

# counts for clear diagnostics
total_polygons <- nrow(shp_adm3)
matched_polygons <- nrow(shp_with_inc)
polygons_no_match <- nrow(polygons_without_inc)
inc_no_match <- nrow(inc_without_polygons)

# report
cli::cli_h2("Admin join diagnostics (adm0 to adm3)")

# polygons perspective
cli::cli_alert_success(
  "Shapefile admin units (unique): {total_polygons}"
)
cli::cli_alert_success(
  "Matched with Inc: {matched_polygons}"
)

# DHIS2 perspective
cli::cli_alert_warning(
  "Unmatched Inc admin units (no polygon): {inc_no_match}"
)

cli::cli_alert_warning(
  "Unmatched shapefile admin units (no DHIS2 entry): {polygons_no_match}"
)

# previews
if (nrow(inc_without_polygons) > 0) {
  cli::cli_h2("DHIS2 admins with no polygon match (sample)")
  inc_without_polygons |> head(10)
}

if (nrow(polygons_without_inc) > 0) {
  cli::cli_h2("Polygons with no DHIS2 match (sample)")
  polygons_without_inc |> head(10)
}

After calculating for tr and tpr you want to visualize the range of values. We plot values of testing rate for each health facility over time to check for outliers

## Plot a Box Plot of Testing Rate
ggplot(routine_data, aes(x = factor(hf), y = test_rate)) +
  geom_boxplot() +
  labs(title = "Distribution of Testing Rate by health facility",
       x = "Health facility",
       y = "Testing Rate") +
  theme_minimal()
# if the values are within expected ranges, you can use the density plot below which provide insights into the distribution
ggplot(routine_data, aes(x = test_rate)) +
  geom_density(fill = "green", alpha = 0.5) +
  labs(title = "Density Plot of Testing Rate",
       x = "Test Positivity Rate",
       y = "Density") +
  theme_minimal()

Step 2.3: Visualize test positivity rate

We plot values of test positivity rate for each health facility over time to check for outliers

## Plot a Boxplot of Testing Rate
ggplot(routine_data, aes(x = factor(hf), y = tpr)) +
  geom_boxplot() +
  labs(title = "Distribution of Test Positivity Rate by Health facility",
       x = "Health facility",
       y = "Test Positivity Rate") +
  theme_minimal()

# if the values are within expected ranges, you can use the density plot below which provide insights into the distribution
ggplot(routine_data, aes(x = tpr)) +
  geom_density(fill = "green", alpha = 0.5) +
  labs(title = "Density Plot of Test Positivity Rate",
       x = "Test Positivity Rate",
       y = "Density") +
  theme_minimal()

Step 2.4: Generate clean database with valid TPR for each row

(row = admin unit or facility depending on what you chose in Step 1)

If you identify testing rates and tpr above 1, efforts should be made find reasons why testing rates and tpr are >1 for a particular month and the necessary corrections effected. As a conservative measure, we may equate testing rates >1 or tpr >1 to 1.

## Calculating Adjusted Incidence 1
routine_data <- routine_data |>
  # equate testing rate over 1 to 1
  dplyr::mutate(
  test_rate = ifelse(test_rate > 1, 1, test_rate)) |>
  # equate tpr over 1 to 1
  mutate(
         tpr = ifelse(tpr > 1, 1, tpr)
         )
summary(routine_data)

Step 3: Estimate additional cases if presumed cases had been tested

Step 3.1: Apply test positivity rate to the presumed cases

To be able to determine the number of confirmed cases out the presumed cases, we assume that the tpr for those not tested are the same as those tested. Here we apply the calculated tpr to the pres cases to derive additional confirmed cases

## Calculating Adjusted Incidence 1
routine_data <- routine_data |>
  
  # compute additional cases if tpr is not missing
  dplyr::mutate(
  conf_tpr = ifelse(!is.na(tpr),
                    round(tpr * pres, 0), # round values to nearest whole number
                    NA))
head(routine_data)

Step 3.2: Aggregate calculated cases at operational unit level by month

To be able to proceed with our analysis, we aggregate the TPR dataset at admin3 level (since Sierra Leone is making decisions at the adm3 level) and join to the incidence data we have been using.

## Calculating Adjusted Incidence 1
agg_data <- routine_data |>
  group_by(adm0, adm1, adm2, adm3, year, month) |>
  # select appropriate statistic to summarize variables of interest over. Use sum for counts and mean for proportions
  dplyr::summarise(
    across(c(allout:maldth, conf_tpr), sum, na.rm = TRUE), 
    across(c(test_rate, tpr), mean, na.rm = TRUE)
  ) |>
  dplyr::mutate(year = as.numeric(year),
                month = as.numeric(month))

# view data
knitr::kable(head(agg_data))

Step 3.3: Join dataset with additional confirmed cases to incidence dataset

Once we have calculated our additional confirmed cases, we join to the incidence data we have been using. Since the incidence data already has the case count variables (alout, susp, test, etc) we only select only the calculated variables to join to the incidence dataset

## Calculating Adjusted Incidence 1
join_data <- inc_data |>
dplyr::left_join(agg_data |>
                 select(conf_tpr, test_rate, tpr, adm0, adm1, adm2, adm3, year, month),
                 by = c("adm0", "adm1", "adm2", "adm3", "year", "month")
                 )
knitr::kable(head(join_data))

Step 4: Calculate adjusted cases and adjusted incidence (N1)

Step 4.1 Calculate monthly adjusted incidence cases and incidence

We compute monthly adjusted cases1 by adding the additional confirmed cases calculated to the crude cases

## Calculating Adjusted Incidence 1
inc_data <- join_data |>
  mutate(adjcases1 = crudecases + conf_tpr,
          # adjusted incidence 1
         adjinc1 = adjcases1/pop)
# visualise the first 5 observations of the data
head(inc_data)

Step 4.2: Calculate annual adjusted incidence proportions (N1)

In the section above, we calculated monthly adjusted incidence estimates for each adm3 level. This might not be helpful enough operationally, unless we want to conduct further analysis to answer other question like the effect of low testing rates on case management.

For SNT analysis, it is helpful to aggregate the data at operational level annually. Note: for count variables, you summarize over sum and use mean for proportions.

## Aggregate the dataset by year
adj1_inc_ann <- inc_data |>
  dplyr::group_by(adm0, adm1, adm2, adm3, year) |>
  dplyr::summarise(
                   across(c(susp:conf_tpr, adjcases1, adjinc1), sum, na.rm=TRUE),
                   across(c(pop, test_rate, tpr), mean, na.rm = TRUE)
                   ) |>
   ungroup()

head(adj1_inc_ann)

Step 4.3: Standardize the calculated incidence proportions

We have calculated the annual adjusted1 incidence proportions. Here we standardize by multiplying by 100 which becomes more useful to compare between years and admin levels

# calculate annual crude incidence
adj1_inc_ann <- adj1_inc_ann   |>
  dplyr::mutate(
    ann_crude = crudeinc*1000,
    ann_adjinc1 = adjinc1*1000)
# visualize the first observations of the data set
head(adj1_inc_ann)

########## THIS STEP IS ALSO UNDER REVIEW SO SKIP TO STEP 6 ###########

Step 5: Map testing rates and tpr

We want to map and compare testing rates and tpr over time. We reformat the data into long and then join with adm3 shapefile

Step 5.1: Join incidence dataset with shapefile

## Aggregate the dataset by year
adj1_adm_shp <- shp_adm3 |>
  dplyr::right_join(adj1_inc_ann |> 
                      select(adm3:ann_crude), 
                    by = "adm3",
                   relationship = "one-to-many",
                  keep = TRUE) |>
  sf::st_as_sf()

sf::st_drop_geometry(adj1_adm_shp)

head(adj1_adm_shp, 20)

Step 5.2: Map testing rates by year

Python

Step 5.3: Map test positivity rates by year

Python

Step 6: Save Files

Now we save our incidence data as a csv file

# Save monthly routine health facility data
rio::export(
  routine_data,
  file = 
    here("english/library/stratification", 'epidemiological/data_r', 'routine_data_hf.csv')
)

## Save monthly incidence data set
rio::export(
  inc_data,
  file = 
  here("english/library/stratification/epidemiological/data_r/monthly_inc_data.csv"
))
# Save annual incidence data set
rio::export(
  adj1_inc_ann,
  file = here("english/library/stratification/epidemiological/data_r/annual_inc_data.csv"
              )
)

Summary

TBD

Full code

Show full code

#===============================================================================
# End of Script
#===============================================================================