# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}
# Install (if needed) and load the relevant packages
pacman::p_load(
  readxl,    # import and read Excel files
  ggplot2,   # plotting
  rio,       # importing and exporting files
  gridExtra, # plot arrangements
  writexl,   # export Excel files
  stringr,   # clean up names
  xts,       # return first or last element of a vector
  tidyverse, # functions for data manipulation
  sf         # spatial features for use in mapping
)
Incidence overview and crude incidence
Overview
Clinical malaria case incidence represents the number of newly diagnosed malaria cases during a defined period in a specified population, usually measured as total number of clinical cases per 1000 person-years at risk (cases per 1000 population per year).
In most malaria-endemic areas, the measurement of true population-level clinical malaria incidence is affected by factors along the cascade from onset of symptoms to reporting of a case, all of which differ by area and over time:
- Decision to access care
- Barriers to access care
- Choice of care (public, private or informal)
- Coverage of the surveillance system, as most facilities integrated into it belong to the public health sector only
- Quality of care and availability of diagnostic tools to confirm a suspected malaria case
- Health facility reporting of detected cases into routine surveillance
These factors all affect crude incidence as measured in routine surveillance. However, crude incidence can be adjusted to attempt to account for some of these factors.
This section demonstrates how to calculate crude malaria incidence from routine case reporting. It assumes routine data has been extracted, cleaned, and reviewed as described in the code library sections on routine data.
- TBD
Key concepts for incidence and incidence adjustments
Data sources for incidence estimation
Several data sources are required to produce an adjusted estimate of clinical malaria incidence that reflects population-level incidence.
Routine malaria data – Variables
Several malaria data elements collected routinely and reported to the national Health Management and Information Systems (HMIS) can be used to estimate clinical malaria incidence in nearly all malaria-endemic countries. The minimum essential data elements for incidence calculation include:
- Uncomplicated confirmed malaria cases per parasite species and diagnostic tool: The comparability of confirmed case counts over time can be affected by changes in the diagnostic tools used, as these differ in diagnostic accuracy. The number of confirmed cases reported by a specific health facility (HF) can also be affected by the main diagnostic tool used and by the skill of health staff in reading and interpreting the results.
- Tested cases per diagnostic tool: HFs usually report the number of microscopy and RDT tests conducted. Before adding the two together to obtain the total number of patients tested, it is important to consider whether doing so would double count patients. This is particularly relevant when microscopy is used to confirm an RDT result in the same HF, or when referred patients within the same district are tested twice, once in the peripheral HF and again in the referral hospital.
- Presumed malaria cases: Suspected malaria cases (usually fever cases) that are not confirmed as malaria infections but are presumed to be so, and are usually treated as malaria cases. Presumed cases are sometimes reported directly by HFs as an individual data element, but they are often calculated instead. National malaria programmes (NMPs) use several calculations depending on what they consider most appropriate: a) number of suspected cases minus number tested; b) number of malaria treatments distributed minus confirmed cases; etc. Whether reported directly or calculated, this variable is subject to substantial bias, depending on health staff's understanding of the concept and their clinical perception of a suspected case. Presumed cases are generally reported in higher numbers in areas affected by low stocks or complete stock-outs of diagnostics.
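The two presumed-case calculations mentioned above can be sketched as follows. This is a minimal illustration on made-up numbers: the columns `susp`, `test` and `conf` follow the routine dataset used later in this section, while `maltreat` (malaria treatments distributed) and the choice to floor negative differences at zero are assumptions made for the example, not a prescribed method.

```r
library(dplyr)

# Toy facility-month records (made-up numbers)
hf_month <- data.frame(
  hf       = c("HF A", "HF B", "HF C"),
  susp     = c(120, 80, 60),  # suspected malaria cases
  test     = c(100, 80, 65),  # cases tested (RDT or microscopy)
  conf     = c(40, 30, 20),   # confirmed cases
  maltreat = c(55, 30, 18)    # hypothetical column: malaria treatments distributed
)

presumed <- hf_month |>
  mutate(
    # option a) suspected minus tested, floored at zero (assumption)
    pres_a = pmax(susp - test, 0),
    # option b) treatments distributed minus confirmed, floored at zero (assumption)
    pres_b = pmax(maltreat - conf, 0)
  )

presumed
```

Negative differences (for example, more tests than suspected cases in HF C) usually point to a data quality issue worth reviewing before deciding how to treat them.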
Routine data - Aggregations
Data should ideally be extracted at the health facility-month level to allow a detailed review of completeness and quality. After review, the routine data need to be aggregated to the selected unit of analysis (usually the district) to estimate incidence at that level. When aggregating per district, special attention should be paid to data reported by regional or national hospitals, for two reasons:
- Catchment area: Such hospitals have catchment populations that extend beyond a single district. To make the most of hospital data, countries may choose to estimate malaria incidence using the aggregated data from the health facilities in the district where the hospital is located, first including and then excluding the data from the regional hospital. This shows the impact that the presence of a regional or national hospital has on the estimates for specific districts.
- Duplications: Patients with malaria detected in regional hospitals may have been referred from peripheral health facilities or district hospitals. Such cases are usually admitted as severe malaria cases and should not be reported as uncomplicated malaria by the hospital. In settings where hospital reporting clearly differentiates outpatient uncomplicated malaria patients from referred severe malaria patients (who were detected, and likely reported, in another health facility), the risk of double counting a case is low. However, settings with less reliable surveillance protocols should discuss the pros and cons of using hospital data to estimate incidence.
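The include/exclude comparison described for regional hospitals can be sketched as below. The data are made up, and the `hf_type` column with the value "Regional hospital" is a hypothetical field that would typically come from the health facility master list.

```r
library(dplyr)

# Toy facility-level confirmed cases for one year (made-up numbers)
hf_cases <- data.frame(
  adm2    = c("District 1", "District 1", "District 1", "District 2", "District 2"),
  hf      = c("HC 1", "HC 2", "Regional Hosp", "HC 3", "HC 4"),
  hf_type = c("Health centre", "Health centre", "Regional hospital",
              "Health centre", "Health centre"),  # hypothetical column
  conf    = c(300, 200, 500, 150, 250)
)

hosp_compare <- hf_cases |>
  group_by(adm2) |>
  summarise(
    # district total including the regional hospital
    conf_incl_hosp = sum(conf, na.rm = TRUE),
    # district total excluding the regional hospital
    conf_excl_hosp = sum(conf[hf_type != "Regional hospital"], na.rm = TRUE),
    .groups = "drop"
  ) |>
  # share of the district's confirmed cases contributed by the hospital
  mutate(hosp_share = 1 - conf_excl_hosp / conf_incl_hosp)

hosp_compare
```

A large `hosp_share` in a district flags that its incidence estimate is sensitive to how hospital data are handled.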
Health facility master list
An up-to-date and informative master list of health facilities, including the period during which each was functional, allows us to determine the actual number of HFs that should have reported in each month. This is the expected number of records for a given variable and therefore the denominator for calculating completeness rates. A master list that also provides the HF type (hospital, district health centre, peripheral health centre of different levels, etc.) and the ownership of the HF (public, private, or other) helps us to understand the data submitted by each HF and how they should be treated when calculating incidence.
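As a rough sketch of how a master list provides the completeness denominator, the example below counts expected and received reports per adm3 and month. All names here (`master`, `reports`, `open_month`, `close_month`) and the numbers are hypothetical; a real master list may record functional periods differently.

```r
library(dplyr)

# Hypothetical master list: months of one year in which each HF was functional
master <- data.frame(
  hf          = c("HF A", "HF B", "HF C"),
  adm3        = c("Unit 1", "Unit 1", "Unit 1"),
  open_month  = c(1, 1, 3),
  close_month = c(12, 2, 12)
)

# Reports received for the variable of interest (one row per HF-month)
reports <- data.frame(
  hf    = c("HF A", "HF A", "HF A", "HF B", "HF C"),
  month = c(1, 2, 3, 1, 3),
  conf  = c(10, 12, NA, 5, 7)
)

# Every HF-month in which a report was expected
# (merge() with no shared columns returns all combinations)
expected <- merge(master, data.frame(month = c(1, 2, 3)), by = NULL) |>
  filter(month >= open_month, month <= close_month)

completeness <- expected |>
  left_join(reports, by = c("hf", "month")) |>
  group_by(adm3, month) |>
  summarise(
    expected_n = n(),                 # HFs that should have reported
    reported_n = sum(!is.na(conf)),   # HFs with a non-missing value
    .groups = "drop"
  ) |>
  mutate(completeness = reported_n / expected_n)

completeness
```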
Treatment seeking data
Routine malaria data, even in their most complete form (if all malaria fever cases were tested and reported), only provide information on fevers detected through the surveillance system, which generally covers public HFs and, in some countries, part of the formal private sector. However, in nearly all areas where malaria is or has been endemic, an additional number of clinical malaria episodes are expected to be detected through the private sector (formal or informal), or never detected by any sector because patients do not seek care for a fever and are therefore not tested. Areas where public care seeking is low also tend to have poor access to care and to other malaria or health services. These areas may therefore report far fewer cases than would be expected if all febrile cases in the community were tested. Consequently, using the incidence reported to the routine surveillance system in these areas may underestimate the true clinical malaria incidence, and therefore the true transmission intensity.
Demographic information
Reliable population denominators are required to estimate community-level incidence. Most malaria-endemic countries rely on official population projections provided by the National Institute of Statistics, based on the most recent census available, the date of which varies by country. NMPs should use whichever population source they deem most appropriate to calculate incidence at the selected unit of analysis (usually the district, but it can be lower). Countries with special populations that may require specific targeted malaria approaches should also explore the most reliable data available on the presence of these groups per unit of analysis.
Methods for incidence estimation
Routine malaria data are affected by various factors that need to be considered in order to estimate the expected clinical malaria incidence in a given area and point in time, within and beyond the public health sector. These factors include testing and reporting rates within the public health sector, spatial and temporal changes in fever care-seeking behaviour in the public and private sectors, and the share of people who do not generally seek care. WHO recommends a stepwise correction approach to control for these common factors, with the objective of producing a community-level estimate of clinical malaria incidence. It is recommended that countries conduct this analysis using monthly data, as the impact of several of the factors being adjusted for (testing and reporting rates) varies substantially within a given year, particularly in countries with intense seasonality patterns.
Step-by-Step Instructions
To skip the step-by-step explanation, jump to the full code at the end of this page.
Step 1: Load packages and data
Step 1.1: Load required packages
The first step is to install and load the libraries required for this section.
Step 1.2: Load routine case data, population data and shape file
Here we show how to import the datasets needed for this section. The SLE routine dataset we will be using contains monthly reported numbers of suspected, tested, confirmed and treated malaria cases, among other variables, for each health facility in the country. The dataset spans 9 years, from 2015 to 2023. We also load the shapefile for the operational administrative level (adm3) and the population data for that admin level for the same years as the routine case data.
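A minimal sketch of the import step is shown below. The file names and locations are hypothetical and should be replaced with those used in your project; small stand-in files are written to a temporary folder first so that the sketch runs on its own. The object names `clean_malaria_routine_data_final` and `pop` match those used in the later steps.

```r
library(rio)

# Hypothetical file locations; replace with your own project paths
data_dir     <- tempdir()
routine_path <- file.path(data_dir, "clean_malaria_routine_data_final.csv")
pop_path     <- file.path(data_dir, "pop_adm3.csv")

# Stand-in files (made-up values) so this example is self-contained
rio::export(
  data.frame(adm1 = "REGION A", adm2 = "DISTRICT A", adm3 = "UNIT 1",
             hf = "HF 1", year = 2015, month = 1,
             susp = 50, test = 45, conf = 20, pres = 5),
  routine_path
)
rio::export(
  data.frame(adm1 = "REGION A", adm2 = "DISTRICT A", adm3 = "UNIT 1",
             pop2015 = 10000, pop2016 = 10300),
  pop_path
)

# Import the routine case data and the population data
clean_malaria_routine_data_final <- rio::import(routine_path)
pop <- rio::import(pop_path)

# The adm3 shapefile would be read with sf (hypothetical path, not run here):
# shp_adm3 <- sf::st_read("path/to/adm3_shapefile.shp")
```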
Step 2: Prepare data for incidence estimation
Step 2.1: Check the structure of the data sets
Check the structure of the datasets to understand the types and formats of the variables, and summarise each dataset to understand the range of values of each variable.
We observe that the population data is in wide format, with one column for each population year. Step 3.1 shows how to reshape the population data so that it can be used in our analysis.
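The checks described above can be sketched with base R functions. The two data frames here are made-up stand-ins; the `pop<year>` column naming is taken from the reshaping step later in this section.

```r
# Toy stand-ins for the routine and population datasets (made-up values)
routine <- data.frame(
  adm3  = c("Unit 1", "Unit 1", "Unit 2"),
  year  = c(2015, 2015, 2015),
  month = c(1, 2, 1),
  conf  = c(20, 35, NA)
)
pop <- data.frame(
  adm3    = c("Unit 1", "Unit 2"),
  pop2015 = c(10000, 8000),
  pop2016 = c(10300, 8200)
)

# Variable types and formats
str(routine)
str(pop)

# Range of values and number of missing values per variable
summary(routine)

# One column per year signals that the population data is in wide format
pop_year_cols <- grep("^pop[0-9]{4}$", names(pop), value = TRUE)
pop_year_cols
```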
Step 2.2: Clean up the admin names across the datasets
Clean up the admin names and select the variables needed for the incidence estimations. Here we assign a new name to our routine case data so that the original is kept separate from the dataset we will use for analysis. We do the same for the population data.
# clean up the adm1, adm2 and adm3 names in the routine data set
inc_data <- clean_malaria_routine_data_final |>
  dplyr::mutate(adm1 = stringr::str_to_title(adm1),
                adm2 = stringr::str_to_title(adm2),
                adm3 = stringr::str_to_title(adm3)) |>
  # select the relevant malaria variables
  dplyr::select(adm1, adm2, adm3, hf, year, month,
                susp, test, conf, pres) |>
  # make sure year and month are numeric
  dplyr::mutate(year = as.numeric(year),
                month = as.numeric(month))
# clean up the adm1, adm2 and adm3 names in the population data set
pop_data <- pop |>
  dplyr::mutate(adm1 = stringr::str_to_title(adm1),
                adm2 = stringr::str_to_title(adm2),
                adm3 = stringr::str_to_title(adm3))
Step 2.3: Aggregate data at operational unit level
Countries should decide the appropriate administrative level for the analysis. In this example, the dataset is at the facility level while the operational unit for decision-making is adm3, so we aggregate the case data at adm3 level.
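A minimal sketch of the aggregation from facility level to adm3-month level is shown below, using made-up data with the columns selected in Step 2.2. The output name `inc_adm3` is chosen for illustration only. Note that `na.rm = TRUE` treats missing facility reports as zero, which is an assumption that should be weighed against the completeness review.

```r
library(dplyr)

# Toy facility-month data (made-up values)
inc_data <- data.frame(
  adm1  = "Region A",
  adm2  = "District A",
  adm3  = c("Unit 1", "Unit 1", "Unit 2", "Unit 2"),
  hf    = c("HF 1", "HF 2", "HF 3", "HF 4"),
  year  = 2023,
  month = 1,
  susp  = c(50, 30, 40, NA),
  test  = c(45, 30, 35, 20),
  conf  = c(20, 10, 15, 5),
  pres  = c(5, 0, 5, 0)
)

# Sum the case variables over all facilities in each adm3-month
inc_adm3 <- inc_data |>
  group_by(adm1, adm2, adm3, year, month) |>
  summarise(
    across(c(susp, test, conf, pres), \(x) sum(x, na.rm = TRUE)),
    .groups = "drop"
  )

inc_adm3
```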
Step 3: Join case data and population data
To compute crude incidence estimates, we need the population figures for each year of data in our incidence dataset.
Step 3.1: Reshape population data if needed
In most cases, population data comes in a wide format, with a separate column for each year's population. This is difficult to use for analysis, especially when our incidence data has one column for all the years and another column for months. To work around this, it is helpful to reshape the population dataset from wide to long format so that it aligns with the routine dataset.
# first convert the cleaned pop data (pop_data from Step 2.2) into long format
# and create a variable for year
pop_long <- pop_data |>
  tidyr::pivot_longer(
    cols = pop2015:pop2023,
    names_to = "pop_year",
    values_to = "pop"
  ) |>
  # extract the numeric year from column names such as "pop2015"
  dplyr::mutate(year = as.numeric(sub("pop", "", pop_year))) |>
  dplyr::select(adm1, adm2, adm3, year, pop)
# visualize the first observations of the data set
head(pop_long)
Step 3.2: Join case data and population data
Now that we have reformatted the population data, we can join it to the routine dataset at adm3 level and by year. It is advisable to use the population data as the base, to be able to identify records in the routine data that could not be correctly matched.
The data now has duplicate columns for adm1 and adm2, identified by the suffixes .x and .y, coming from the two datasets. We need to keep only one of each. The next step shows how to do that.
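The join described above can be sketched as follows, using made-up data. The names `inc_adm3` and `inc_joined` are illustrative, and joining on `adm3` and `year` only is what produces the duplicated `adm1` and `adm2` columns.

```r
library(dplyr)

# Toy long-format population data and adm3-month case data (made-up values)
pop_long <- data.frame(
  adm1 = "Region A", adm2 = "District A",
  adm3 = c("Unit 1", "Unit 2", "Unit 3"),
  year = 2023,
  pop  = c(10000, 8000, 5000)
)
inc_adm3 <- data.frame(
  adm1 = "Region A", adm2 = "District A",
  adm3 = c("Unit 1", "Unit 2"),
  year = 2023, month = 1,
  conf = c(30, 20)
)

# Population data as the base, case data joined by adm3 and year
inc_joined <- pop_long |>
  left_join(inc_adm3, by = c("adm3", "year"))

# adm1 and adm2 exist in both inputs, so they appear with .x and .y suffixes
names(inc_joined)

# Units in the population data with no matching routine records
unmatched <- anti_join(pop_long, inc_adm3, by = c("adm3", "year"))
unmatched
```

Unmatched units usually indicate spelling differences in admin names or units missing from the routine data.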
Step 3.3: Clean up column names
If the two datasets share variable names that are not used as join keys, R adds a suffix to those variables to show which dataset each came from. We therefore rename the columns we want to keep and select the final set of variables to retain in the dataset.
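A minimal sketch of this clean-up is shown below. It assumes the admin names from the population data (suffix `.x`) are the ones to keep; the data and the object names are made up for illustration.

```r
library(dplyr)

# Toy joined data with duplicated admin columns (made-up values)
inc_joined <- data.frame(
  adm1.x = "Region A", adm2.x = "District A",
  adm3   = c("Unit 1", "Unit 2"),
  year   = 2023,
  pop    = c(10000, 8000),
  adm1.y = "Region A", adm2.y = "District A",
  month  = 1,
  conf   = c(30, 20)
)

inc_clean <- inc_joined |>
  # keep the admin names that came from the population data (assumption)
  rename(adm1 = adm1.x, adm2 = adm2.x) |>
  # retain only the variables needed for the incidence calculation
  select(adm1, adm2, adm3, year, month, conf, pop)

names(inc_clean)
```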
Step 4: Calculate monthly and annual incidence
Incidence is estimated by dividing the crude or adjusted number of cases in each district by the population at risk. If the analysis is conducted at monthly level, monthly incidence can be estimated by dividing the crude or adjusted (Ns) number of cases by the estimated population of the district for that year (the population is the same for all months of a given year). To calculate annual incidence estimates, the monthly crude or adjusted cases should be summed over the year and divided by the annual population projection for that district-year.
Step 4.1: Calculate Monthly Crude Incidence
Crude incidence is calculated from two variables: confirmed cases and population. Here we keep the name of our incidence data (inc_data) for subsequent analysis.
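The monthly calculation can be sketched as below on made-up data. It assumes `crudeinc` is stored as cases per person (confirmed cases divided by population), which is consistent with the annual step multiplying the summed `crudeinc` by 1000; the extra column `crudeinc_1000` and the guard against missing or zero populations are additions for illustration.

```r
library(dplyr)

# Toy adm3-month data after joining cases and population (made-up values)
inc_data <- data.frame(
  adm3  = c("Unit 1", "Unit 2", "Unit 3"),
  year  = 2023,
  month = 1,
  conf  = c(30, 20, 12),
  pop   = c(10000, 8000, NA)
)

inc_data <- inc_data |>
  mutate(
    # monthly crude incidence as cases per person; NA if population is unusable
    crudeinc = if_else(!is.na(pop) & pop > 0, conf / pop, NA_real_),
    # the same figure expressed per 1000 population (illustrative column)
    crudeinc_1000 = crudeinc * 1000
  )

inc_data
```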
Step 4.2: Calculate Annual Crude Incidence
In the step above, we calculated monthly incidence estimates for each adm3 unit. Monthly estimates may not be the most useful operationally, unless we want to conduct further analysis to answer other questions, such as identifying months with missed "epidemics". For the purposes of subnational tailoring (SNT), annual incidence estimates are more useful for comparing between years and admin units.
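# Aggregate the dataset by year
cr_inc_ann <- inc_data |>
  dplyr::group_by(adm1, adm2, adm3, year) |>
  dplyr::summarise(
    # population is constant within a year, so the mean returns the annual figure;
    # computed first so it always uses the original monthly pop values
    pop_mean = mean(pop, na.rm = TRUE),
    # sum the monthly values within each year
    dplyr::across(c(susp:crudeinc), \(x) sum(x, na.rm = TRUE)),
    .groups = "drop"
  )
# calculate annual crude incidence per 1000 population
cr_inc_ann <- cr_inc_ann |>
  dplyr::mutate(ann_crude = crudeinc * 1000)
# visualize the first observations of the data set
head(cr_inc_ann)
Step 5: Save Files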
Now we save our monthly and annual incidence estimates as CSV files.
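A minimal sketch of the export step is given below. The output folder and file names are hypothetical, and small made-up stand-ins for `inc_data` and `cr_inc_ann` are created so that the example runs on its own.

```r
library(rio)

# Made-up stand-ins for the monthly and annual incidence datasets
inc_data <- data.frame(adm3 = "Unit 1", year = 2023, month = 1:2,
                       conf = c(30, 25), pop = 10000,
                       crudeinc = c(0.003, 0.0025))
cr_inc_ann <- data.frame(adm3 = "Unit 1", year = 2023,
                         conf = 55, pop_mean = 10000, ann_crude = 5.5)

# Hypothetical output folder; replace with your project's output directory
out_dir <- file.path(tempdir(), "incidence_outputs")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

monthly_file <- file.path(out_dir, "crude_incidence_monthly.csv")
annual_file  <- file.path(out_dir, "crude_incidence_annual.csv")

# The file extension tells rio which format to write
rio::export(inc_data, monthly_file)
rio::export(cr_inc_ann, annual_file)
```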
Summary
TBD