| Title: | Fuzzy Matching For Proper Names |
|---|---|
| Description: | Tools for comparing sets of proper names, accounting for common types of variation in format and style. |
| Authors: | Patrick Barks [aut, cre] |
| Maintainer: | Patrick Barks <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2024-10-31 08:22:23 UTC |
| Source: | https://github.com/epicentre-msf/nmatch |
Example hospital datasets, one from an in-patient department (`dat_ipd`) and the other from an ICU department (`dat_icu`). The datasets contain some common patients, but with variation in how the names are written. Note these data are fake: the patient names are simply random combinations of common French names.

Data frames, each with two columns:

- patient name
- date of entry to the given department
Evaluate token match details to determine overall match status.

`match_eval(k_x, k_y, n_match, n_match_crit, ...)`

| Argument | Description |
|---|---|
| `k_x` | Integer vector specifying the number of tokens in names `x` |
| `k_y` | Integer vector specifying the number of tokens in names `y` |
| `n_match` | Integer vector specifying the number of aligned tokens between `x` and `y` that match |
| `n_match_crit` | Minimum number of matching tokens for names `x` and `y` to be considered an overall match |
| `...` | Additional arguments (not used) |

Returns a logical vector indicating whether names `x` and `y` match, based on the token match details provided as arguments.
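The default match rule can be sketched as a short vectorized function. This is an illustration consistent with the default behavior described under `nmatch()` below (match if at least `n_match_crit` tokens match, or if both names are single-token and that token matches), not the package's actual source code.

```r
# Sketch of the default overall-match rule (illustration only):
# a pair matches if n_match reaches the criterion, or if both names
# consist of a single token and that token matches.
match_eval_sketch <- function(k_x, k_y, n_match, n_match_crit, ...) {
  n_match >= n_match_crit | (k_x == 1L & k_y == 1L & n_match == 1L)
}

match_eval_sketch(k_x = c(3L, 1L), k_y = c(2L, 1L),
                  n_match = c(2L, 1L), n_match_crit = 2L)
#> [1] TRUE TRUE
```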
Evaluate string distance and token lengths to determine whether two tokens match.

`match_eval_token(nchar_x, nchar_y, nchar_max, dist, ...)`

| Argument | Description |
|---|---|
| `nchar_x` | Number of characters in token `x` |
| `nchar_y` | Number of characters in token `y` |
| `nchar_max` | Maximum of `nchar_x` and `nchar_y` |
| `dist` | String distance between tokens `x` and `y` |
| `...` | Additional arguments (not used) |

Returns a logical vector indicating whether tokens `x` and `y` match, based on their respective lengths and the string distance between them.
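A length-aware rule of this kind can be sketched as follows. The specific thresholds below are assumptions for illustration, not the package defaults: short tokens must match exactly, while longer tokens tolerate more edit distance.

```r
# Illustrative sketch only: tokens match if their string distance is
# within a threshold that grows with token length. Threshold values
# are assumptions, not the package's actual defaults.
match_eval_token_sketch <- function(nchar_x, nchar_y, nchar_max, dist, ...) {
  dist_max <- ifelse(nchar_max <= 3, 0L, ifelse(nchar_max <= 7, 1L, 2L))
  dist <= dist_max
}

match_eval_token_sketch(nchar_x = 7, nchar_y = 7, nchar_max = 7, dist = 1)
#> [1] TRUE
```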
Standardize strings prior to performing a match, using the following transformations:

- standardize case (`base::toupper`)
- remove accents/diacritics (`stringi::stri_trans_general`)
- replace punctuation characters with whitespace
- remove extraneous space characters (`stringr::str_squish`)

`name_standardize(x)`

| Argument | Description |
|---|---|
| `x` | a string |

Returns the standardized version of `x`.

```r
name_standardize("angela_merkel")
name_standardize("QUOIREZ, Fran\U00E7oise D.")
```
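The listed transformations can be reproduced in a few lines, applying them in sequence. This is a sketch of the documented behavior, not necessarily the package's exact implementation.

```r
# Sketch of the documented transformations: upper-case, strip accents,
# replace punctuation with spaces, then collapse/trim whitespace.
library(stringi)
library(stringr)

name_standardize_sketch <- function(x) {
  x <- toupper(x)                            # standardize case
  x <- stri_trans_general(x, "Latin-ASCII")  # remove accents/diacritics
  x <- gsub("[[:punct:]]+", " ", x)          # punctuation -> whitespace
  str_squish(x)                              # remove extraneous spaces
}

name_standardize_sketch("QUOIREZ, Fran\U00E7oise D.")
#> [1] "QUOIREZ FRANCOISE D"
```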
Example data with proper names from two different sources.

`names_ex`

A data.frame with 6 rows and 2 variables, both of class character:

- name from source 1
- name from source 2
Compare proper names across two sources, using string standardization to account for variation in punctuation, accents, and character case, token permutation to account for variation in name order, and fuzzy matching to handle alternate spellings. The specific steps are:

1. Standardize strings. The default function is `name_standardize`, which removes accents and punctuation, standardizes case, and removes extra whitespace. E.g. "Brontë, Emily J." is standardized to "BRONTE EMILY J".
2. Tokenize the standardized names, optionally retaining only tokens larger than a given nchar limit.
3. For each pair of names, calculate string distance between all combinations of tokens, and find the best overall token alignment (i.e. the alignment that minimizes the summed string distance). If the two names being compared differ in their number of tokens, the alignment is made with respect to the smaller number of tokens. E.g. if comparing "Angela Dorothea Merkel" to "Merkel Angela", the token "Dorothea" would be omitted from the best alignment.
4. For each pair of tokens in the best alignment, classify whether the tokens match (TRUE/FALSE) based on their respective lengths and the string distance between them.
5. Summarize the number of tokens in each name, the number of tokens in the best alignment, the number of aligned tokens that match, and the summed string distance of the best alignment.
6. Classify overall match status (TRUE/FALSE) based on the match details described in step (5). By default, two names are considered to match if two or more tokens match across names (e.g. "Merkel Angela" matches "Angela Dorothea Merkel"), or if both names consist of only a single token which is matching (e.g. "Beyonce" matches "Beyoncé").
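The alignment step (step 3) can be sketched by brute force over index assignments, assuming the stringdist package for the distance matrix. This is an illustration of the idea, not the package's internals.

```r
# Illustrative sketch of step 3: find the token alignment that minimizes
# the summed string distance, by enumerating assignments of distinct
# tokens of x to each token of y. Assumes the stringdist package.
library(stringdist)

tokens_x <- c("ANGELA", "DOROTHEA", "MERKEL")  # standardized tokens of name x
tokens_y <- c("MERKEL", "ANGELA")              # standardized tokens of name y

dm <- stringdistmatrix(tokens_x, tokens_y, method = "osa")

k <- min(length(tokens_x), length(tokens_y))   # align w.r.t. the smaller name
idx <- expand.grid(rep(list(seq_along(tokens_x)), k))
idx <- idx[apply(idx, 1, function(r) length(unique(r)) == k), , drop = FALSE]

costs <- apply(idx, 1, function(r) sum(dm[cbind(r, seq_len(k))]))
best <- as.integer(unlist(idx[which.min(costs), ]))

# best alignment: "MERKEL" and "ANGELA" pair exactly; "DOROTHEA" is omitted
data.frame(token_y = tokens_y, token_x = tokens_x[best],
           dist = dm[cbind(best, seq_len(k))])
#>   token_y token_x dist
#> 1  MERKEL  MERKEL    0
#> 2  ANGELA  ANGELA    0
```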
```r
nmatch(
  x,
  y,
  token_split = "[-_[:space:]]+",
  nchar_min = 2L,
  dist_method = "osa",
  std = name_standardize,
  ...,
  return_full = FALSE,
  eval_fn_token = match_eval_token,
  eval_fn = match_eval,
  eval_params = list(n_match_crit = 2),
  token_freq = NULL
)
```

| Argument | Description |
|---|---|
| `x`, `y` | Vectors of proper names to compare. Must be of the same length. |
| `token_split` | Regex pattern to split strings into tokens. Defaults to `"[-_[:space:]]+"`. |
| `nchar_min` | Minimum token size to compare. Defaults to `2L`. |
| `dist_method` | Method to use for string distance calculation (see stringdist-metrics). Defaults to `"osa"`. |
| `std` | Function to standardize strings during matching. Defaults to `name_standardize`. |
| `...` | Additional arguments passed to downstream functions |
| `return_full` | Logical indicating whether to return a data frame with a full summary of match details (`TRUE`) or a logical vector of match status (`FALSE`). Defaults to `FALSE`. |
| `eval_fn_token` | Function to determine token match status. Defaults to `match_eval_token`. |
| `eval_fn` | Function to determine overall match status. Defaults to `match_eval`. |
| `eval_params` | List of additional arguments passed to `eval_fn`. |
| `token_freq` | Optional data frame containing the frequencies of name tokens within the population of interest. Must have two columns. |
If `return_full = FALSE` (the default), returns a logical vector indicating which elements of `x` and `y` are matches.

If `return_full = TRUE`, returns a tibble-style data frame summarizing the match details, including columns:

- `is_match`: logical vector indicating overall match status
- `k_x`: number of tokens in `x` (excludes tokens smaller than `nchar_min`)
- `k_y`: number of tokens in `y` (excludes tokens smaller than `nchar_min`)
- `k_align`: number of aligned tokens (i.e. `min(k_x, k_y)`)
- `n_match`: number of aligned tokens that match (i.e. distance <= `dist_max`)
- `dist_total`: summed string distance across aligned tokens
```r
names1 <- c(
  "Angela Dorothea Merkel",
  "Emmanuel Jean-Michel Fr\u00e9d\u00e9ric Macron",
  "Mette Frederiksen",
  "Katrin Jakobsd\u00f3ttir",
  "Pedro S\u00e1nchez P\u00e9rez-Castej\u00f3n"
)

names2 <- c(
  "MERKEL, Angela",
  "MACRON, Emmanuel J.-M. F.",
  "FREDERICKSON, Mette",
  "JAKOBSDOTTIR Kathríne",
  "PEREZ-CASTLEJON, Pedro"
)

# return logical vector specifying which names are matches
nmatch(names1, names2)

# return data frame with full match details
nmatch(names1, names2, return_full = TRUE)

# use a custom function to classify matches
classify_matches <- function(k_align, n_match, dist_total, ...) {
  n_match == k_align & dist_total < 2
}

nmatch(names1, names2, return_full = TRUE, eval_fn = classify_matches)
```