Code library for subnational tailoring (English version)

On this page

  • Overview
  • Final Database Assembly Prerequisites
  • Proposed Final Database Assembly Workflow
  • Step-by-step
    • Step 1: Load packages and data files
    • Step 2: Examine data files
      • Step 2.1: Create column inventory
      • Step 2.2: Classify columns by type and origin
    • Step 3: Merge data files
      • Step 3.1: Apply decision tree to each column
      • Step 3.2: Identify and merge corrected/imputed columns
    • Step 3.3: Flag computed indicators for re-calculation
    • Step 4: Select and rename merged indicators
    • Step 5: Re-aggregate and calculate computed indicators
    • Step 6: Final database quality control
    • Step 7: Save and export final database
  • Full code

Final database

Overview

This page covers the creation of the final, analysis-ready database by consolidating all corrected and imputed data from previous preprocessing steps. The final database serves as the authoritative source for all subsequent analyses, visualizations, and intervention tailoring decisions.

Objectives
  • Consolidate all individually processed malaria indicators into a single, unified dataset
  • Standardize column names and data formats for consistent analysis
  • Ensure data integrity by verifying no duplicates and maintaining original structure
  • Document the final dataset structure for future reference

Final Database Assembly Prerequisites

The final database is the finalized, re-assembled DHIS2 dataset. It must be re-assembled whenever the data have been modified, for example through imputation or outlier correction. The assembly must include not only these imputed and corrected columns but also any aggregated totals or other computed columns.

  • Completed previous routine data preprocessing steps
  • Access to all intermediate corrected data files
  • Decisions from SNT team on which correction/imputation methods to use and include

Proposed Final Database Assembly Workflow

graph TD
    A[Start: Column from Preprocessed DHIS2 Data] --> B{Is this metadata?<br/>adm, period, etc.}
    
    B -->|Yes| C{Was metadata cleaned?<br/>Check preprocessing log}
    B -->|No| D{Check preprocessing log<br/>for column history}
    
    C -->|Yes - cleaned| E[Type: Cleaned Metadata<br/>Processing: Join as-is]
    C -->|No - original| F[Type: Original Metadata<br/>Processing: Join as-is]
    
    D --> G{Column processing history?}
    G -->|Corrected/Imputed| H[Type: Corrected Disaggregated<br/>Processing: Join as-is<br/>Note: Already corrected at right level]
    G -->|Computed indicator| I[Type: Computed<br/>Processing: Always recalculate]
    G -->|No log entry| J[Type: Direct Extract<br/>Processing: Join as-is<br/>Note: Raw disaggregated data]
    
    E --> K[Apply final naming rules]
    F --> K
    H --> K
    I --> K
    J --> K
    
    K --> L[Add to column classification table]
    L --> M[Next column]
    
    style A fill:#e1f5fe
    style K fill:#fff3e0
    style L fill:#c8e6c9

Step-by-step

The step-by-step process for creating the final version of your DHIS2 database is provided below. Each step includes a detailed explanation to complement the code.

To skip the step-by-step explanation, jump to the full code at the end of this page.

Step 1: Load packages and data files

Notes:

  • An inventory of all extracted DHIS2 columns is created in Step 2.1 below.
  • Keeping a preprocessing log of every change made to the DHIS2 data makes the column classification in Step 2.2 much easier.
  • To keep the final file from becoming too heavy, only the key variables (allout, susp, test, conf, maltreat, pres, maladm, maldth) are included in the final database.
  • R
  • Python
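A minimal Python (pandas) sketch of this step, using the indicator list and commit-pinned file URLs from the full R code at the end of this page; reading `.xlsx` files with pandas requires the openpyxl engine:

```python
import pandas as pd

# Commit-pinned location of the per-indicator result files
# (taken from the full R code at the end of this page)
BASE = (
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/"
    "58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files"
)
INDICATORS = ["allout", "susp", "test", "conf",
              "maltreat", "pres", "maladm", "maldth"]
MERGE_KEYS = ["adm1", "adm2", "adm3", "hf", "hf_uid", "year", "month"]

file_urls = [f"{BASE}/{ind}_results.xlsx" for ind in INDICATORS]

def load_results(urls):
    """Load each per-indicator workbook into its own DataFrame."""
    frames = {}
    for url in urls:
        try:
            frames[url] = pd.read_excel(url)  # needs the openpyxl engine
        except Exception as exc:  # keep going if one file is unavailable
            print(f"Error loading {url}: {exc}")
    return frames
```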

Step 2: Examine data files

Step 2.1: Create column inventory

  • R
  • Python
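A pandas sketch of the column inventory, one row per (source file, column) pair; the toy frames below stand in for the real per-indicator files:

```python
import pandas as pd

def column_inventory(frames):
    """Build an inventory table with one row per (source file, column) pair."""
    rows = [
        {"source_file": name, "column": col, "dtype": str(df[col].dtype)}
        for name, df in frames.items()
        for col in df.columns
    ]
    return pd.DataFrame(rows)

# Toy stand-ins for two of the real indicator files
frames = {
    "conf_results": pd.DataFrame({"hf_uid": ["A1"],
                                  "conf_corrected_winsorised": [10]}),
    "test_results": pd.DataFrame({"hf_uid": ["A1"],
                                  "test_corrected_winsorised": [25]}),
}
inventory = column_inventory(frames)
```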

Step 2.2: Classify columns by type and origin

  • R
  • Python
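A sketch of the classification, assuming the `_corrected`/`_imputed`/`_winsorised` suffix conventions used in the full code; computed indicators are identified from the preprocessing log rather than from the column name:

```python
# Classify a column by naming convention, following the decision tree above.
METADATA = {"adm1", "adm2", "adm3", "hf", "hf_uid", "year", "month"}
PROCESSED_TAGS = ("_corrected", "_imputed", "_winsorised")

def classify_column(name, computed_in_log=()):
    """Return the column type used by the assembly decision tree."""
    if name in METADATA:
        return "metadata"
    if name in computed_in_log:  # recorded in the preprocessing log
        return "computed"
    if any(tag in name for tag in PROCESSED_TAGS):
        return "corrected_disaggregated"
    return "direct_extract"
```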

Step 3: Merge data files

Step 3.1: Apply decision tree to each column

  • R
  • Python
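The decision tree above, reduced to a lookup from column type to merge-time handling (a sketch, not a prescribed API):

```python
# How each column type from the decision tree is handled during the merge
ACTIONS = {
    "metadata": "join as-is",
    "corrected_disaggregated": "join as-is (already corrected at the right level)",
    "direct_extract": "join as-is (raw disaggregated data)",
    "computed": "always recalculate",
}

def processing_action(column_type):
    return ACTIONS[column_type]
```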

Step 3.2: Identify and merge corrected/imputed columns

  • R
  • Python
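A pandas sketch of the merge, mirroring the `dplyr::full_join` chain in the full R code; the toy example uses a reduced key set:

```python
import pandas as pd
from functools import reduce

MERGE_KEYS = ["adm1", "adm2", "adm3", "hf", "hf_uid", "year", "month"]

def merge_results(frames, keys=MERGE_KEYS):
    """Full (outer) join of all per-indicator frames on the shared keys."""
    return reduce(
        lambda left, right: left.merge(right, on=keys, how="outer"), frames
    )

# Toy example on a reduced key set
a = pd.DataFrame({"hf_uid": ["A1"], "year": [2023], "month": [1],
                  "conf_corrected_winsorised": [10]})
b = pd.DataFrame({"hf_uid": ["A1"], "year": [2023], "month": [1],
                  "test_corrected_winsorised": [25]})
merged = merge_results([a, b], keys=["hf_uid", "year", "month"])
```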

Step 3.3: Flag computed indicators for re-calculation

  • R
  • Python
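A sketch of the flagging step; the indicator name `tpr` is hypothetical and stands in for whatever computed columns your preprocessing log records:

```python
import pandas as pd

def flag_computed(df, computed_cols):
    """Drop computed indicators from the merged data and return the list
    of columns that must be re-derived in Step 5."""
    to_recalculate = [c for c in computed_cols if c in df.columns]
    return df.drop(columns=to_recalculate), to_recalculate

# Toy example; "tpr" is a hypothetical computed indicator
df = pd.DataFrame({"hf_uid": ["A1"], "conf": [10], "test": [25], "tpr": [0.4]})
df, recalc = flag_computed(df, computed_cols=["tpr"])
```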

Step 4: Select and rename merged indicators

  • R
  • Python
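A pandas counterpart of the select-and-rename stage in the full R code, keeping the merge keys plus the winsorised columns and renaming them to the base indicator names:

```python
import pandas as pd

MERGE_KEYS = ["adm1", "adm2", "adm3", "hf", "hf_uid", "year", "month"]
INDICATORS = ["allout", "susp", "test", "conf",
              "maltreat", "pres", "maladm", "maldth"]
RENAME_MAP = {f"{ind}_corrected_winsorised": ind for ind in INDICATORS}

def select_and_rename(df):
    """Keep the keys plus the chosen corrected columns, renamed to the
    base indicator names."""
    keep = [c for c in MERGE_KEYS + list(RENAME_MAP) if c in df.columns]
    return df[keep].rename(columns=RENAME_MAP)

# Toy example with one key, one corrected column, and one scratch column
out = select_and_rename(pd.DataFrame({
    "hf_uid": ["A1"], "conf_corrected_winsorised": [10], "scratch": [1]
}))
```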

Step 5: Re-aggregate and calculate computed indicators

  • R
  • Python
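A sketch of re-aggregation, under the assumption that count indicators are summed to the chosen level and derived indicators are then recomputed from the aggregated counts (the test positivity rate shown here is illustrative, not prescribed by this page):

```python
import pandas as pd

def reaggregate(df, level, counts):
    """Sum count indicators up to the chosen level, then recompute derived
    indicators from the aggregated counts (never average pre-computed rates)."""
    out = df.groupby(level, as_index=False)[counts].sum()
    out["tpr"] = out["conf"] / out["test"]  # hypothetical computed indicator
    return out

# Toy example: two facilities rolled up to adm1 by month
df = pd.DataFrame({
    "adm1": ["North", "North"], "year": [2023, 2023], "month": [1, 1],
    "conf": [10, 20], "test": [50, 50],
})
agg = reaggregate(df, level=["adm1", "year", "month"], counts=["conf", "test"])
```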

Step 6: Final database quality control

This step verifies aggregation results, checks for consistency with preprocessing, and validates naming conventions.

  • R
  • Python
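A minimal sketch of three such checks: duplicate merge keys, negative counts, and expected column names:

```python
import pandas as pd

def quality_control(df, keys, expected_cols):
    """Return a list of QC problems found in the final database."""
    issues = []
    if df.duplicated(subset=keys).any():
        issues.append("duplicate rows on merge keys")
    indicator_cols = [c for c in df.columns if c not in keys]
    if (df[indicator_cols] < 0).any().any():
        issues.append("negative indicator values")
    missing = sorted(set(expected_cols) - set(df.columns))
    if missing:
        issues.append(f"missing columns: {missing}")
    return issues

# Toy example with a duplicated key and a negative count
df = pd.DataFrame({"hf_uid": ["A1", "A1"], "month": [1, 1], "conf": [10, -2]})
problems = quality_control(df, keys=["hf_uid", "month"],
                           expected_cols=["hf_uid", "month", "conf", "test"])
```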

Step 7: Save and export final database

  • R
  • Python
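A sketch of the export, reusing the `clean_malaria_routine_data_final` file name from the full R code; the additional CSV copy is an assumption for lighter-weight downstream use, and writing `.xlsx` with pandas requires the openpyxl engine:

```python
import os
import tempfile
import pandas as pd

def export_final(df, path_stem):
    """Write the final database: CSV always, plus an .xlsx copy when the
    openpyxl engine is available."""
    df.to_csv(f"{path_stem}.csv", index=False)
    try:
        df.to_excel(f"{path_stem}.xlsx", index=False)
    except ImportError:
        print("openpyxl not installed; skipped the .xlsx copy")

# Demo with a toy one-row database written to a temporary directory
stem = os.path.join(tempfile.mkdtemp(), "clean_malaria_routine_data_final")
export_final(pd.DataFrame({"hf_uid": ["A1"], "conf": [10]}), stem)
```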

Full code

  • R
  • Python
# Install and load required packages
if (!require("readxl")) install.packages("readxl")
if (!require("dplyr")) install.packages("dplyr")
if (!require("writexl")) install.packages("writexl")

library(readxl)
library(dplyr)
library(writexl)

merge_all_results <- function() {
  # Define the merge keys
  merge_keys <- c("adm1", "adm2", "adm3", "hf", "hf_uid", "year", "month")

  # Create a list of file URLs
  file_urls <- c(
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/allout_results.xlsx",
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/susp_results.xlsx",
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/test_results.xlsx",
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/conf_results.xlsx",
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/maltreat_results.xlsx",
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/pres_results.xlsx",
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/maladm_results.xlsx",
    "https://raw.githubusercontent.com/ahadi-analytics/snt-data-files/58a0d28bec84fa4c00b69fd3edc680b5318ce372/Excel%20files/maldth_results.xlsx"
  )

  # Initialize list to store dataframes
  df_list <- list()

  # Load each Excel file into a separate data frame
  for (i in seq_along(file_urls)) {
    tryCatch({
      temp <- tempfile(fileext = ".xlsx")
      download.file(url = file_urls[i], destfile = temp, mode = "wb")
      df_list[[i]] <- read_excel(temp)
      unlink(temp)
      cat(sprintf("Loaded file %d successfully.\n", i))
    }, error = function(e) {
      cat(sprintf("Error loading file %d: %s\n", i, e$message))
    })
  }

  # Check if any dataframes were loaded
  if (length(df_list) == 0) {
    stop("No files were loaded successfully")
  }

  # Start with the first dataframe
  merged_df <- df_list[[1]]

  # Merge remaining dataframes
  if (length(df_list) > 1) {
    for (i in 2:length(df_list)) {
      merged_df <- full_join(merged_df, df_list[[i]], by = merge_keys)
      cat(sprintf("Merged file %d into the dataset.\n", i))
    }
  }

  # Sort by the merge keys
  merged_df <- merged_df |> arrange(across(all_of(merge_keys)))

  # Save the merged data frame to an Excel file
  write_xlsx(merged_df, "clean_routine_data.xlsx")
  cat("Merged data saved to 'clean_routine_data.xlsx'\n")

  return(merged_df)
}

# Perform the merge
merged_df <- merge_all_results()
df <- merged_df

head(df, 13)



# Function to save selected columns to a new Excel file
save_selected_columns_to_excel <- function(input_file, selected_columns, output_file) {
  tryCatch({
    # Read the complete dataset
    df <- read_excel(input_file)

    # Select only the specified columns that exist in the dataset
    existing_columns <- intersect(selected_columns, colnames(df))
    if (length(existing_columns) == 0) {
      stop("None of the selected columns exist in the dataset.")
    }

    # Create a data frame with only the selected columns
    selected_df <- df |> select(all_of(existing_columns))

    # Save to Excel
    write_xlsx(selected_df, output_file)
    cat(sprintf("Selected columns saved to '%s' successfully!\n", output_file))
    cat("Columns saved: ", paste(existing_columns, collapse = ", "), "\n")
  }, error = function(e) {
    cat("Error:", e$message, "\n")
  })
}

# Function to rename columns in an Excel file and save to a new file
rename_columns_in_excel <- function(file_path, rename_mapping, output_file) {
  tryCatch({
    # Read the dataset from the existing file
    df <- read_excel(file_path)

    # Rename the columns
    colnames(df) <- dplyr::recode(colnames(df), !!!rename_mapping)

    # Save the renamed data frame to a new Excel file
    write_xlsx(df, output_file)
    cat(sprintf("Renamed columns saved to '%s' successfully!\n", output_file))
    cat("Columns renamed: ", paste(names(rename_mapping), "->", rename_mapping, collapse = ", "), "\n")
  }, error = function(e) {
    cat("Error:", e$message, "\n")
  })
}

# Specify input file, selected columns, and intermediate file
input_file <- "clean_routine_data.xlsx"
selected_columns <- c(
  "adm1", "adm2", "adm3", "hf", "hf_uid", "year", "month",
  "allout_corrected_winsorised", "susp_corrected_winsorised", "test_corrected_winsorised",
  "conf_corrected_winsorised", "maltreat_corrected_winsorised", "pres_corrected_winsorised",
  "maladm_corrected_winsorised", "maldth_corrected_winsorised"
)
intermediate_file <- "intermediate_selected_data.xlsx"

# Save selected columns to an intermediate file
save_selected_columns_to_excel(input_file, selected_columns, intermediate_file)

# Specify rename mapping and final output file
rename_mapping <- c(
  "allout_corrected_winsorised" = "allout",
  "susp_corrected_winsorised" = "susp",
  "test_corrected_winsorised" = "test",
  "conf_corrected_winsorised" = "conf",
  "maltreat_corrected_winsorised" = "maltreat",
  "pres_corrected_winsorised" = "pres",
  "maladm_corrected_winsorised" = "maladm",
  "maldth_corrected_winsorised" = "maldth"
)

final_output_file <- "clean_malaria_routine_data_final.xlsx"

# Rename columns and save to the final file
rename_columns_in_excel(intermediate_file, rename_mapping, final_output_file)

Clean database

 

©2025 Applied Health Analytics for Delivery and Innovation. All rights reserved