# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman")}
# Install or load relevant packages
pacman::p_load(
dplyr, # Data manipulation
tidyr, # Additional data manipulation tools
purrr, # Functional programming tools
here, # Path management
ggplot2, # Plotting
ggthemes, # Plot themes and scales
patchwork, # Combining plots
RColorBrewer, # Color palettes
scales, # Plot scales and formatting
tibble, # Modern data frames
readr, # Reading data files
knitr, # Formatted table outputs
stringr) # String manipulation
# Install/update and load sntutils
if (!requireNamespace("sntutils", quietly = TRUE)) {
devtools::install_github("ahadi-analytics/sntutils", quiet = TRUE, upgrade = "always")
} else {
devtools::install_github("ahadi-analytics/sntutils", quiet = TRUE, upgrade = "always")
}
library(sntutils)
dhis2_df <- read_csv(here::here(
"1.2_epidemiology/1.2a_routine_surveillance/raw",
"dhis2_processed_data_python.csv"))
knitr::kable(head(dhis2_df, 10))Health facility reporting rate
Overview
In the SNT workflow, routine surveillance data are often used to calculate key indicators such as malaria incidence, test positivity rate, and confirmed case counts. To interpret and use this data properly, we need to first assess their quality and completeness. One important assessment is of what proportion of health facilities are reporting consistently and completely across geography and time.
Reporting rates provide a simple and essential metric to evaluate the completeness of routine data. They help identify where gaps in reporting may affect the reliability of indicators used in decision-making and where routine surveillance should be strengthened. Reporting rates can also be used to estimate the number of additional cases (or other count indicator) that would have also been included in routine surveillance, if all facilities had reported.
This section outlines how to calculate, inspect, and save reporting rates using a reproducible approach, while grounding all choices in dialogue with the national malaria program and SNT team.
- Understand how to calculate reporting rates from routine health facility data
- Visualize and interpret reporting rate patterns at any admin unit level
- Compile and save validated reporting rate outputs for use in later SNT workflow steps
Defining and Calculating Reporting Rate for Routine Indicators
To ensure routine data can be used reliably in the SNT workflow, we need a clear and consistent method for calculating reporting rates across facilities and time. This section introduces a process to calculate monthly reporting rates for any routinely reported indicator.
What is reporting rate?
Reporting rate is the proportion of entities, such as health facilities or community health workers, in a given admin unit that reported on an indicator during a time period of interest.
In the example on this page, we are using monthly data from DHIS2, so we calculate monthly reporting rates. However, you should calculate reporting rate for the relevant reporting period in your dataset. For example, if you are analyzing weekly surveillance data, your reporting rate should be calculated on a weekly basis.
When presenting reporting rates, it is important to specify what indicator(s) are used to define the entity as reporting. In SNT, it is best practice to re-calculate reporting rate for each indicator of interest, as reporting practice may vary across indicators within the same facility. For example, a facility may prioritize reporting confirmed cases, as their stock replenishment depends on showing consistent reporting, but neglect to report all-cause outpatient visits.
An overall reporting rate can also be calculated for each entity in a given time period. In this case, a facility may only be defined as reporting if it reports suspected cases, tested cases, confirmed cases, and treated cases. This aggregated reporting rate should be at most the minimum of the reporting rate of each individual indicator.
Establishing the denominator: Which facilities are expected to report?
Before evaluating the proportion of facilites that reported in a given reporting period, we need to first determine the number of facilities that should report. To avoid underestimating the reporting rate and gaining and inaccurate assessment of the quality of surveillance, the SNT team may, for example, consider:
When calculating reporting rate for confirmed malaria cases, exclude facility types that do not test or treat malaria. For example, HIV clinics or maternity wards.
When calculating reporting rate for malaria admissions, exclude facility types that only handle outpatients, for example community health workers or health posts.
When calculating reporting rate for an indicator, exclude facilities that have closed, that are not yet active, or are temporarily nonfunctional. This avoids penalizing newly opened facilities that weren’t expected to report in earlier months, or facilities that are permanently or temporarily closed and therefore are not expected to report.
Up-to-date master facility lists (MFL) that track facility type and activity status are very helpful for determining which health facilities should be included in the denominator for reporting rate of each indicator, for each reporting period. In the absence of an MFL, or official determination of activity status, it is still possible to infer which health facilities should be excluded.
For more guidance on how to consider active vs inactive facilities, please see the code library page Determining active and inactive status.
Code for excluding by specific facility type is included on this page.
Consult the SNT team to understand how to determine which facilities, if any, should not be included in the denominator for reporting rate. National practices vary, and the surveillance focal person on the SNT team should explain what would be appropriate for each indicator.
Calculating reporting rates
Once health facility activity status has been established, reporting rates can be calculated. These rates reflect the proportion of expected facilities that submitted valid data for a given indicator in a given time period.
For each indicator of interest, reporting rate is defined as:
\[ \text{Indicator Reporting Rate}_{a,t} = \frac{o_{a,t}}{e_{a,t}} \]
Where:
- \(a\) is the administrative unit (e.g. chiefdom or district)
- \(t\) is the time period (e.g. “2022-03”)
- \(o_{a,t}\) is the number of observed facilities in unit \(a\) during time \(t\)
- \(e_{a,t}\) is the number of expected facilities in unit \(a\) during time \(t\)
Observed facilities are those that submitted a valid (non-missing) value for the indicator of interest during time \(t\).
Expected facilities are those who were expected to report during time \(t\) (see previous section). Remember to consult with the SNT team to decide what rules determine whether a facility is expected to report.
Worked example
Suppose we are calculating the reporting rate for total confirmed cases for Kailahun District in March 2022.
- There are 89 health facilities in Kailahun that have ever submitted data on any key malaria indicator
- All 89 submitted their first report on or before March 2022, so they are assumed to be active and expected to report that month
- Of these, 80 facilities reported a valid value for
conf(total confirmed cases) in March 2022 → 80 are observed reporting - The other two do not have a valid value for total confirmed cases (they show
NAin the database) for March 2022
The reporting rate is calculated as:
\[ \text{Reporting Rate for Total Confirmed Cases}_{\text{Kailahun}, \text{Mar 2022}} = \frac{80}{89} = 0.899 \]
Weighted reporting rates
For some SNT applications, a weighted reporting rate may be of interest. The weighted reporting rate is an estimate of the proportion of an indicator’s expected total counts in a given admin unit over a given time period that was reported into routine surveillance.
This means that if a non-reporting facility generally reports fewer confirmed cases than average in its admin unit, the weighted reporting rate for confirmed cases is higher than the unweighted reporting rate (fewer cases are missing). Conversely, if a non-reporting facility generally reports relatively many confirmed cases for its admin unit, the weighted reporting rate for confirmed cases is lower than the unweighted reporting rate (more cases are missing).
Calculating weighted reporting rates
PLACEHOLDER EXPLANATION BELOW. WOULD BE GREAT TO INCLUDE THE SIMPLE EXAMPLE FROM BEA’S SLIDE DECK incl the illustration if possible
The monthly weighted reporting rate for each health district was determined as follows. For each calendar month (January through December), health facility weights were calculated by dividing the health facility’s average number of malaria cases reported for that month, across all years of data, by the district sum of the average number of malaria cases reported for that month, across all years of data. For dates in which a health facility is inactive, the average number of malaria cases reported for that month-and-year pair would be 0, and the weights for the health facilities in that district for that date would be calculated as usual. The district monthly weighted reporting rate was then calculated by summing the weights of the active health facilities. This value captures the proportion of expected confirmed malaria cases that are reported at the district level each month by the active health facilities included in the HMIS database.
How do I know whether to use unweighted or weighted reporting rate?
To assess the performance of the routine surveillance system, the unweighted reporting rate is likely to be more appropriate.
To estimate an admin unit’s unreported confirmed cases (as an example indicator), the weighted reporting rate may be the more accurate option, as it will account for the size of the non-reporting facility. If you choose to use weighted reporting rates, best practice is to also calculate the unweighted reporting rates for the same indicator, compare the two outputs, and discuss with the SNT team.
WOULD BE GREAT TO INCLUDE IN THE STEP BY STEP EXAMPLE VISUALIZATIONS. BOTH SEBASTIAN’S LINE PLOT VERSION AND OUSMANE’S HEATMAPS
If you think the weighted reporting rate might be the better option, produce reporting rates for using both methods and present to the SNT team. Discuss with the SNT team to understand how, where, when, and why the two reporting rates are different. Together with the SNT team, discuss which to use for downstream analysis.
Step-by-Step
Now that we’ve defined how reporting rates are constructed—by identifying active facilities and calculating observed reporting—we move into the step-by-step process for implementing this in code using example DHIS2 data from Sierra Leone. In this section, we walk through the steps for calculating and visualising monthly reporting rates. Each step is designed to guide you through the process. Follow the notes in the code, especially where edits are required.
To skip the step-by-step explanation, jump to the full code at the end of this page.
- Calculate monthly reporting rates
- Visualise reporting rate over time, by indicator
Step 1: Import relevant packages and pre-processed routine data
In this step, we load the necessary packages to run this section, as well as the routine DHIS2 dataset that was initially processed in the DHIS2 Data Preprocessing section of this code library.
To adapt the code:
# Import packages
import pandas as pd
import numpy as np
from pyhere import here
import matplotlib.pyplot as plt
import seaborn as sns
from collections import OrderedDict
import matplotlib.patches as mpatches
# import dhis2 data data
dhis2_df = pd.read_parquet(here('english/data_r/routine_cases', 'dhis2_processed_data_python.parquet'))
# Inspect
dhis2_df.head(10).styleTo adapt the code: Edit the filepath on line 11.
Step 2: Import the number of expected reports per reporting period
Here we import the expected reports dataframe we built in in our analysis of active and inactive health facilities. This dataframe contains, for each reporting period (month, in our example), the number of health facilities expected to report.
Make sure that when using a calculated number of expected reports, that the method of calculation has been validated by the SNT team. The set of indicators used to define reporting and non-reporting status should also be clearly stated and validated by the SNT team.
Depending on the indicator set for defining reporting status, and the indicator of interest to evaluate reporting rate, you may need to use different dataframes of expected reports for the reporting rate denominator.
This data corresponds to the denominator when calculating the reporting rate. We import it here to avoid having to recalculate the number of expected reports everytime we need to calculate the reporting rate.
In following steps, we will calculate the number of observed reports (the numerator for the reporting rate), merge the two datasets and then calculate the reporting rate.
To adapt the code:
Step 3: Define function to calculate reporting rate
Now we define a function to calculate the monthly reporting rate, for a given indicator, at a given admin unit level. The function performs the following steps:
- Count the number of non-NA reports for the indicator of interest (observed reports), aggregated by month and by the admin unit level requested by the user
- Count the number of expected reports by month and admin uit unit level requested by the user, by aggregating the dataframe of expected reports
- Merge the two datasets (observed reports and expected reports, i.e. reporting rate numerator and denominator)
- Compute the reporting rate
compute_RR <- function(base_df, df_expected, level, indicator) {
cols_group <- c('YM', paste0('adm', level), paste0('adm', level, '_uid'))
# count number of non-null reports for confirmed cases
df <- base_df |>
dplyr::group_by(dplyr::across(dplyr::all_of(cols_group))) |>
dplyr::summarise(observed = sum(!is.na(.data[[indicator]])), .groups = "drop")
# aggregate denominator at this level
temp <- df_expected |>
dplyr::group_by(dplyr::across(dplyr::all_of(cols_group))) |>
dplyr::summarise(expected = sum(denominator, na.rm = TRUE), .groups = "drop")
# merge numerator and denominator
df <- df |>
dplyr::full_join(temp, by = cols_group) |>
dplyr::distinct() # equivalent to validate = '1:1'
# compute reporting rate
df <- df |>
dplyr::mutate(
"{indicator}_RR" := dplyr::if_else(
expected == 0,
NA_real_,
observed / expected
)
)
# select and output dataframe
df <- df |>
dplyr::select(
YM,
paste0('adm', level),
paste0('adm', level, '_uid'),
observed,
expected,
paste0(indicator, '_RR')
)
return(df)
}To adapt the code:
def compute_RR(base_df, df_expected, level, indicator):
cols_group = ['YM', f'adm{level}', f'adm{level}_uid']
df = base_df.copy()
# count number of non-null reports for confirmed cases
df = df.groupby(cols_group)[indicator].count().reset_index()
# aggregate denominator at this level
temp = (df_expected.groupby(cols_group)['denominator']
.sum(min_count = 1)
.reset_index())
# merge numerator and denominator
df = df.merge(temp, on = cols_group, how = 'outer', validate = '1:1')
# compute reporting rate
df.insert(len(df.columns), f'{indicator}_RR', np.where(df['denominator'] == 0, np.nan, df[indicator].div(df['denominator'])))
# rename columns and output dataframe
df = df.rename(columns = {indicator: 'observed', 'denominator': 'expected'})
df = df[['YM', f'adm{level}', f'adm{level}_uid', 'observed', 'expected', f'{indicator}_RR']]
return dfTo adapt the code: Nothing to adapt here.
Step 4: Calculate reporting rate for a given indicator, at a given admin unit level
Step 4.1: Unweighted reporting rate
We now use the function defined in Step 3 above, to calculate the reporting rate. We pass two arguments ot the function:
- the indicator we are calculating reporting rate for;
- the admin unit level we are performing the calculation at.
The resulting dataframe is saved to file if needed.
# define indicator and admin unit level
level <- 3
indicator <- 'conf'
# calculate reporting rate
df <- compute_RR(dhis2_df, df_expected, level, indicator)
# Save
readr::write_csv(df, here::here('english', 'data_r', 'routine_cases', paste0('Monthly_', indicator, '_RR_adm', level, '.csv')))
# Inspect results
knitr::kable(utils::head(df, 10))sntutils package
# calculate reporting rate using sntutils
df <- sntutils::calculate_reporting_metrics(
data = dhis2_df,
vars_of_interest = indicator,
x_var = "YM",
y_var = paste0("adm", level),
hf_col = "hf_uid",
key_indicators = c("allout", "conf", "test", "treat", "pres"),
method = 2 # uses method 2 (first-to-last) for facility activity classification
)To adapt the code:
To adapt the code: Set the parameters on lines 2 and 3, and adapt the filepath on line 9.
Step 4.2: Weighted reporting rate
Weighted reporting rates account for the relative size of each facility when calculating reporting completeness. Unlike unweighted rates that treat all facilities equally, weighted rates give more influence to facilities that typically report higher case numbers. This provides a better estimate of what proportion of total expected cases were actually reported.
sntutils package
level <- 3
indicator <- 'conf'
# calculate weighted reporting rate using sntutils
df_weighted <- sntutils::calculate_reporting_metrics(
data = dhis2_df,
vars_of_interest = indicator,
x_var = "YM",
y_var = paste0("adm", level),
hf_col = "hf_uid",
key_indicators = c("allout", "conf", "test", "treat", "pres"),
method = 2,
weighting = TRUE,
weight_var = "test",
weight_window = 12, # 12-month rolling window for typical facility size
exclude_current_x = TRUE, # exclude current period from weight calculation
cold_start = "median_within_y" # use median within admin unit for new facilities
)To adapt the code:
To adapt the code: Set the parameters on lines 2 and 3, and adapt the filepath on line 9.
Step 4.3: Compare unweighted and weighted reporting rates
Weighted reporting rates account for the relative size of each facility when calculating reporting completeness. Unlike unweighted rates that treat all facilities equally, weighted rates give more influence to facilities that typically report higher case numbers. This provides a better estimate of what proportion of total expected cases were actually reported.
Show the code
# Compare weighted vs unweighted results
df_comparison <- df |>
dplyr::select(YM, adm3, reprate) |>
dplyr::left_join(
df_weighted |> dplyr::select(YM, adm3, reprate_w),
by = c("YM", "adm3")
) |>
dplyr::mutate(
difference = reprate_w - reprate,
abs_difference = abs(difference),
weighted_higher = reprate_w > reprate # ADD THIS LINE - creates the missing column
)
# Inspect comparison
cat("Comparison of first 10 rows:\n")
knitr::kable(utils::head(df_comparison, 10))
# Summary of differences
comparison_summary <- df_comparison |>
dplyr::filter(!is.na(difference)) |>
dplyr::summarise(
n_comparisons = dplyr::n(),
mean_diff = mean(difference),
median_diff = median(difference),
max_abs_diff = max(abs_difference),
prop_weighted_higher = mean(weighted_higher, na.rm = TRUE) # Now this will work
)
cat("\nWeighted vs Unweighted Comparison Summary:\n")
print(comparison_summary)
# Show examples with largest differences
largest_differences <- df_comparison |>
dplyr::arrange(desc(abs_difference)) |>
head(10)
cat("\nLargest differences between weighted and unweighted:\n")
knitr::kable(largest_differences)Step 5: Visualise mulitple indicator-specific reporting rates at the national level
Step 5.1 Multiple indicator heatmap at the national level
Here we produce a visual to display the monthly reporting rate for a list of indicators, at national level. The first visual is a heatmap, convenient for quickly comparing reporting rate over time (x-axis) and between indicators (y-axis).
Show the code
# Make a heatmap of monthly reporting rate by variable
indicators <- c('allout', 'test', 'conf', 'pres', 'maltreat')
level <- 0
# Initialize dataframe and merge all indicators
df <- tibble::tibble(YM = character())
for (i in indicators) {
t <- compute_RR(dhis2_df, df_expected, level, i) |>
dplyr::select(-dplyr::all_of(c(paste0('adm', level), paste0('adm', level, '_uid'), 'observed', 'expected'))) |>
dplyr::mutate("{i}_RR" := .data[[paste0(i, '_RR')]] * 100)
df <- df |>
dplyr::full_join(t, by = 'YM')
}
# Rearrange for heatmap (transpose and set YM as row names)
df_heatmap <- df |>
tibble::column_to_rownames('YM') |>
t() |>
as.data.frame()
# Get every other month for x-axis labels
ym_labels <- names(df_heatmap)
every_other_ym <- ym_labels[seq(1, length(ym_labels), by = 3)]
# Create heatmap
# Create heatmap
heatmap_plot <- ggplot2::ggplot(
data = df_heatmap |>
tibble::rownames_to_column('Indicator') |>
tidyr::pivot_longer(
cols = -Indicator,
names_to = 'YM',
values_to = 'RR'
) |>
dplyr::mutate(Indicator = stringr::str_remove(Indicator, '_RR')) |>
dplyr::mutate(Indicator = factor(Indicator, levels = c('test', 'pres', 'maltreat', 'conf', 'allout'))),
ggplot2::aes(x = YM, y = Indicator, fill = RR)
) +
ggplot2::geom_tile() +
ggplot2::scale_fill_viridis_c(
name = '%',
limits = c(0, 100),
na.value = 'white'
) +
ggplot2::scale_x_discrete(
breaks = every_other_ym # Show only every other month
) +
ggplot2::labs(
x = '',
y = '',
title = 'Monthly reporting rate'
) +
ggplot2::theme_minimal() +
ggplot2::theme(
axis.text.x = ggplot2::element_text(angle = 45, hjust = 1),
axis.text.y = ggplot2::element_text(hjust = 1),
plot.title = ggplot2::element_text(hjust = 0.5),
legend.position = 'right'
)
# Save plot
ggplot2::ggsave(
here::here('english', 'data_r', 'routine_cases', paste0('Monthly_reporting_rate_by_variable_heatmap.png')),
plot = heatmap_plot,
width = 15,
height = 6,
bg = 'white'
)sntutils package
# Make a heatmap of monthly reporting rate by variable using sntutils
sntutils::reporting_rate_plot(
data = dhis2_df,
x_var = "YM",
vars_of_interest = c('allout', 'test', 'conf', 'pres', 'maltreat'),
use_reprate = TRUE,
plot_path = here::here('english', 'data_r', 'routine_cases'),
plot_width = 15,
plot_height = 6
)To adapt the code:
Show the code
# Make a heatmap of monthly reporting rate by variable
indicators = ['allout', 'test', 'conf', 'pres', 'maltreat']
level = 0
df = pd.DataFrame(columns = ['YM'])
for i in indicators:
t = compute_RR(dhis2_df, df_expected, level, i).drop([f'adm{level}', f'adm{level}_uid', 'observed', 'expected'], axis = 1)
# make RR a percentage for visual
t[f'{i}_RR'] = t[f'{i}_RR']*100
df = df.merge(t, on = 'YM', how = 'outer', validate = '1:1')
# rearrange for heatmap
df = df.set_index('YM').T
fig, ax = plt.subplots(figsize = (15, 6))
cbar_ax = fig.add_axes([.91, .2, .02, .6])
sns.heatmap(ax = ax
, data = df
, cmap = 'viridis'
, cbar_ax = cbar_ax
, vmin = 0
, vmax = 100
, cbar_kws = {'label': '%'})
ax.set_xlabel('')
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')
ax.set_yticks(ax.get_yticks())
ax.set_yticklabels([l.get_text()[0:-3] for l in ax.get_yticklabels()], rotation = 0, ha = 'right')
ax.set_title('Monthly reporting rate')
fig.savefig(here('english/data_r/routine_cases', f'Monthly_reporting_rate_by_variable_heatmap.png')
, facecolor = 'white'
, bbox_inches = 'tight')
plt.show()To adapt the code:
Step 5.2 Multiple indicator line plot at the national level
The second visual is a line plot, another more traditional way of quickly comparing reporting rate (y-axis) over time (x-axis) and between indicators (different colours).
Show the code
# Make a line plot of monthly reporting rate by variable
indicators <- c('allout', 'test', 'conf', 'pres', 'maltreat')
# Create color palette for each indicator
indicator_colors <- c('#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd') # matplotlib default colors
names(indicator_colors) <- indicators
# Prepare data for plotting
plot_data <- purrr::map_dfr(indicators, ~ {
t <- compute_RR(dhis2_df, df_expected, 0, .x)
t |>
dplyr::select(YM, !!paste0(.x, '_RR')) |>
dplyr::mutate(indicator = .x, value = .data[[paste0(.x, '_RR')]]) |>
dplyr::select(YM, indicator, value)
})
# Get every other YM for x-axis labels
ym_levels <- unique(plot_data$YM)
every_other_ym <- ym_levels[seq(1, length(ym_levels), by = 2)]
# Create the plot
line_plot <- plot_data |>
ggplot2::ggplot(ggplot2::aes(x = YM, y = value, color = indicator, group = indicator)) +
ggplot2::geom_line(linewidth = 1) +
ggplot2::scale_color_manual(values = indicator_colors) +
ggplot2::scale_x_discrete(breaks = every_other_ym) + # Fewer x-axis labels
ggplot2::coord_cartesian(ylim = c(0, 1)) +
ggplot2::labs(
x = '',
y = 'Reporting rate',
title = 'Monthly reporting rate, Sierra Leone',
color = 'Indicator'
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = ggplot2::element_text(hjust = 0.5),
axis.text.x = ggplot2::element_text(angle = 45, hjust = 1),
legend.position = 'bottom'
)
# Save plot
ggplot2::ggsave(
here::here('english', 'data_r', 'routine_cases', 'Monthly_reporting_rate_line_plot.png'),
plot = line_plot,
width = 6,
height = 4,
bg = 'white'
)To adapt the code:
Show the code
# Make a line plot of monthly reporting rate by variable
indicators = ['allout', 'test', 'conf', 'pres', 'maltreat']
fig, ax = plt.subplots(figsize = (6, 4))
for ind in indicators:
t = compute_RR(dhis2_df, df_expected, 0, ind)
t.plot(ax = ax, x = 'YM', y = f'{ind}_RR')
ax.set_ylim(0, 1)
ax.set_xlabel('')
ax.set_ylabel('Reporting rate')
ax.set_title('Monthly reporting rate, Sierra Leone')
fig.tight_layout()To adapt the code: Specify your list of indicators on line 1.
Step 6: Visualise single indicator reporting rates by admin unit
Step 6.1 Single indicator heatmap by admin unit
Here we focus on displaying the reporting rate for a single indicator, but we disaggregate by admin unit. First we use the heatmap format, which now allows for easy comparison over time (x-axis) and between admin units (y-axis).
Show the code
# VT to clean this up after agreeing on import.qmd with team
dftree <- dhis2_df |>
dplyr::select(adm0, adm0_uid, adm1, adm1_uid, adm2, adm2_uid, adm3, adm3_uid) |>
dplyr::distinct() |>
dplyr::mutate(index = dplyr::row_number())
level <- 3
level_yaxis <- 1
indicator <- 'conf'
fs <- 15
df <- compute_RR(dhis2_df, df_expected, level, indicator) |>
dplyr::mutate("{indicator}_RR" := .data[[paste0(indicator, '_RR')]] * 100)
# Pivot to wide format (equivalent to Python's pivot)
df_wide <- df |>
tidyr::pivot_wider(
names_from = YM,
values_from = paste0(indicator, '_RR'),
id_cols = paste0('adm', level)
)
# Merge with adm1 and adm2 information
t <- dftree |>
dplyr::select(paste0('adm', level_yaxis), adm2, paste0('adm', level)) |>
dplyr::distinct()
df_merged <- df_wide |>
dplyr::left_join(t, by = paste0('adm', level), relationship = "many-to-one") |>
dplyr::arrange(.data[[paste0('adm', level_yaxis)]], .data[[paste0('adm', level)]]) |>
tibble::column_to_rownames(paste0('adm', level))
# Extract the numeric data for heatmap
heatmap_data <- df_merged |>
dplyr::select(-dplyr::all_of(c(paste0('adm', level_yaxis), "adm2"))) |>
as.matrix() |>
apply(2, as.numeric)
# Get adm1 groups for labeling and gaps
adm1_groups <- df_merged[[paste0('adm', level_yaxis)]]
adm2_names <- df_merged$adm2 # Get district names
# Use adm2 (district names) as row labels
row_labels <- df_merged$adm2 # These are district names
# Get the adm1 groups for the labeling trick
unique_adm1 <- unique(adm1_groups)
# Calculate gaps between adm1 groups
adm1_counts <- table(adm1_groups)
hline_positions <- cumsum(adm1_counts)[-length(adm1_counts)]
# Show district name only once per region group
for (adm1 in unique_adm1) {
adm1_indices <- which(adm1_groups == adm1)
if (length(adm1_indices) > 0) {
# Only label the first row in the group with district name
row_labels[adm1_indices[1]] <- row_labels[adm1_indices[1]] # Just the district name
# Clear labels for other rows in the same district group
if (length(adm1_indices) > 1) {
row_labels[adm1_indices[2:length(adm1_indices)]] <- ""
}
}
}
# Create heatmap
heatmap_plot <- pheatmap::pheatmap(
heatmap_data,
color = viridis::viridis(100),
breaks = seq(0, 100, length.out = 101),
cluster_rows = FALSE,
cluster_cols = FALSE,
display_numbers = FALSE,
legend = TRUE,
main = paste('Monthly', indicator, 'reporting rate by adm', level),
angle_col = 45,
fontsize = fs,
fontsize_row = fs - 5,
fontsize_col = fs -5,
gaps_row = hline_positions,
labels_row = row_labels,
treeheight_row = 0,
treeheight_col = 0,
legend = TRUE,
na_col = 'grey',
silent = FALSE
)
# Save plot
ggplot2::ggsave(
here::here('english', 'data_r', 'routine_cases', paste0('Monthly_', indicator, '_RR_adm', level, '_heatmap.png')),
plot = heatmap_plot,
width = 15,
height = 6,
bg = 'white'
)sntutils package
# Create admin unit heatmap using sntutils
level <- 3
indicator <- 'conf'
sntutils::reporting_rate_plot(
data = dhis2_df,
x_var = "YM",
y_var = paste0("adm", level),
vars_of_interest = indicator,
hf_col = "hf_uid",
use_reprate = TRUE,
plot_path = here::here('english', 'data_r', 'routine_cases'),
plot_width = 15,
plot_height = 10
)To adapt the code:
Show the code
# VT to clean this up after agreeing on import.qmd with team
dftree = dhis2_df[['adm0', 'adm0_uid', 'adm1', 'adm1_uid', 'adm2', 'adm2_uid', 'adm3', 'adm3_uid']].drop_duplicates().reset_index(drop = True)
level = 3
level_yaxis = 1
indicator = 'conf'
fs = 15
df = compute_RR(dhis2_df, df_expected, level, indicator)
df[f'{indicator}_RR'] = df[f'{indicator}_RR']*100
# rearrange for heatmap
df = df.pivot(index = [f'adm{level}'], columns = 'YM', values = f'{indicator}_RR')
t = dftree[[f'adm{level_yaxis}', f'adm{level}']].drop_duplicates()
df = df.merge(t, on = f'adm{level}', how = 'left', validate = 'm:1')
df = (df.sort_values(by = [f'adm{level_yaxis}', f'adm{level}'])
.reset_index(drop= True)
.set_index([f'adm{level_yaxis}', f'adm{level}']))
fig, ax = plt.subplots(figsize = (30, 15))
cbar_ax = fig.add_axes([.91, .2, .02, .6])
sns.heatmap(ax = ax
, data = df
, cmap = 'viridis'
, cbar_ax = cbar_ax
, vmin = 0
, vmax = 100
, cbar_kws = {'label': '%'})
_ = ax.set_xlabel('')
_ = ax.set_xticks(ax.get_xticks())
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')
# trick to label adm unit names nicely on the yaxis
yticklabels = [l.get_text().split('-')[0] for l in ax.get_yticklabels()]
t = pd.DataFrame(yticklabels).reset_index()
t1 = t.groupby(0)['index'].mean().astype(int).reset_index()
t.insert(0, 'pos', t[0].map(dict(zip(t1[0], t1['index']))))
t = t[[0, 'pos', 'index']]
t[0] = np.where(t.pos == t['index'], t[0], '')
test = t[0].to_list()
_ = ax.set_yticks(ax.get_yticks())
_ = ax.set_yticklabels(test, size = fs)
_ = ax.set_ylabel(f'adm{level}', size = fs)
ylabel_mapping = OrderedDict()
for adm1, adm3_uid in df.index:
ylabel_mapping.setdefault(adm1, [])
ylabel_mapping[adm1].append(adm3_uid)
hline = []
new_ylabels = []
for adm1, adm3_list in ylabel_mapping.items():
adm3_list[0] = "{} - {}".format(adm1, adm3_list[0])
new_ylabels.extend(adm3_list)
if hline:
hline.append(len(adm3_list) + hline[-1])
else:
hline.append(len(adm3_list))
ax.hlines(hline, xmin = -10, xmax = 0, color = "grey", linewidth = 2, clip_on = False)
# color NaN values
colour = 'grey'
ax.set_facecolor(colour);
handle = [mpatches.Patch(color = colour, label = 'No data')]
ax.legend(handles = handle
, fontsize = fs
, bbox_to_anchor = (1, 0)
, loc = 'lower left')
ax.set_title(f'Monthly {indicator} reporting rate by adm{level}', size = fs);
plt.show()To adapt the code:
Step 5.2: Single indicator line plot by admin unit
Now we use the more traditonal line plot format. We specifcy am indicator, a district (here adm2) and plot the reporting rate for each chiefdom (here adm3).
Show the code
indicator <- 'test'
level <- 3
adm2 <- 'Western Area Rural District Council'
# Compute reporting rate and merge with administrative tree
df <- compute_RR(dhis2_df, df_expected, level, indicator)
dftree <- dhis2_df |>
dplyr::select(adm2, adm2_uid, adm3_uid) |>
dplyr::distinct() |>
dplyr::mutate(index = dplyr::row_number())
df <- df |>
dplyr::left_join(dftree, by = 'adm3_uid', relationship = "many-to-one") |>
dplyr::filter(adm2 == !!adm2)
# Get unique adm3 values for coloring
adm3_list <- unique(df$adm3)
# Use standard matplotlib colors
adm3_colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
"#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf")
names(adm3_colors) <- adm3_list[1:length(adm3_colors)]
# Prepare data for plotting
plot_data <- purrr::map_dfr(adm3_list, ~ {
df |>
dplyr::filter(adm3 == .x) |>
dplyr::select(YM, !!paste0(indicator, '_RR')) |>
dplyr::mutate(adm3 = .x, value = .data[[paste0(indicator, '_RR')]]) |>
dplyr::select(YM, adm3, value)
})
# Get every other YM for x-axis labels
ym_levels <- unique(plot_data$YM)
every_other_ym <- ym_levels[seq(1, length(ym_levels), by = 8)]
# Create the plot
line_plot <- plot_data |>
ggplot2::ggplot(ggplot2::aes(x = YM, y = value, color = adm3, group = adm3)) +
ggplot2::geom_line(linewidth = 1) +
ggplot2::scale_color_manual(values = adm3_colors) +
ggplot2::scale_x_discrete(breaks = every_other_ym) + # Add this line
ggplot2::coord_cartesian(ylim = c(0, 1)) +
ggplot2::labs(
x = '',
y = 'Reporting rate',
title = paste('Monthly', indicator, 'reporting rate\n', adm2),
color = 'Administrative Unit 3'
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = ggplot2::element_text(hjust = 0.5),
legend.position = 'bottom',
legend.title = ggplot2::element_blank()
)
# Apply tight layout equivalent
line_plot <- line_plot + ggplot2::theme(plot.margin = ggplot2::unit(c(1, 1, 1, 1), "cm"))
# Save plot
ggplot2::ggsave(
here::here('english', 'data_r', 'routine_cases', paste0('Monthly_', indicator, '_RR_adm2_', stringr::str_replace_all(adm2, ' ', '_'), '.png')),
plot = line_plot,
width = 6,
height = 4,
bg = 'white'
)To adapt the code:
Show the code
indicator = 'test'
level = 3
adm2 = 'Western Area Rural District Council'
fig, ax = plt.subplots(figsize = (6, 4))
df = compute_RR(dhis2_df, df_expected, level, indicator)
dftree = dhis2_df[['adm2', 'adm2_uid', 'adm3_uid']].drop_duplicates().reset_index(drop = True)
df = df.merge(dftree, on = 'adm3_uid', how = 'left', validate = 'm:1')
df = df[df['adm2'] == adm2].copy()
for adm3 in df['adm3'].unique():
t = df[df['adm3'] == adm3]
t.plot(ax = ax, x = 'YM', y = f'{indicator}_RR', label = adm3)
ax.set_ylim(0, 1.1)
ax.set_xlabel('')
ax.set_ylabel('Reporting rate')
ax.set_title(f'Monthly {indicator} reporting rate\n{adm2}')
fig.tight_layout()
plt.show()To adapt the code: Specify your indicator, level for reporting rate calculation and district (adm2) in lines 1, 2, and 3.
Summary
This page has covered how to calculate, visualize, and interpret reporting rates. By establishing clear denominators through expected facility counts and calculating indicator-specific reporting completeness, we can identify geographic and temporal patterns in data reporting. The visualization approaches help communicate patterns effectively to SNT teams. Validated reporting rates form the foundation for reliable burden estimation, stratification, and intervention targeting in subsequent SNT workflow steps.
Full code
Find the full code script for calculating reporting rates below.