Standardize name-like strings with optional steps — standardize

Applies a sequence of standardization steps to a vector of names. Each step can be enabled or disabled. By default, all steps are enabled.

Usage

standardize_names(
  name_vec,
  to_upper = TRUE,
  replace_punct = TRUE,
  squish_spaces = TRUE,
  normalize_accents = TRUE,
  normalize_spaces = TRUE,
  roman_to_arabic = TRUE,
  sort_tokens = TRUE
)

Arguments

name_vec: Atomic vector. Values to standardize.
to_upper: Logical. Convert to uppercase. Default is TRUE.
replace_punct: Logical. Replace punctuation with spaces. Default is TRUE.
squish_spaces: Logical. Trim/condense spaces. Default is TRUE.
normalize_accents: Logical. Convert accents to ASCII. Default is TRUE.
normalize_spaces: Logical. Normalize all space-like chars to " ". Default is TRUE.
roman_to_arabic: Logical. Convert roman numerals I..IX to 1..9 at token boundaries. Default is TRUE.
sort_tokens: Logical. Sort tokens with letters first, numbers last, both groups alphabetically. Default is TRUE.

Value

Character vector of standardized names.

Examples

# basic usage
standardize_names(c("Chp Kpalime-III", "Kpalimé CHP 3"))
#> [1] "CHP KPALIME 3" "CHP KPALIME 3"

# disable token sorting and roman conversion
standardize_names(
  c("USP Hanyigba-Todzi ii", "USP Hanyigba Todzi 2"),
  roman_to_arabic = FALSE, sort_tokens = FALSE
)
#> [1] "USP HANYIGBA TODZI II" "USP HANYIGBA TODZI 2" 

# keep accents and case, only squish spaces and replace punctuation
standardize_names(
  c("Hôpital d’Adéta", "Hopital d'Adeta"),
  to_upper = FALSE, normalize_accents = FALSE
)
#> [1] "Adéta Hôpital d" "Adeta Hopital d"