Title: | Tools for Cleaning and Matching Hierarchically-Structured Data |
---|---|
Description: | Tools for matching raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset. |
Authors: | Patrick Barks [aut, cre] , Paul Campbell [ctb] |
Maintainer: | Patrick Barks <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0.9000 |
Built: | 2024-10-24 16:07:44 UTC |
Source: | https://github.com/epicentre-msf/hmatch |
Tokenized matching of hierarchical columns can yield false positives when there are tokens that occur frequently in multiple unique hierarchical values (e.g. "South", "North", "City", etc.).
This is a helper function to find such frequently-occurring tokens, which can then be passed to the exclude_values argument of hmatch_tokens. The frequency calculated is the number of unique, string-standardized values in which a given token is found.
count_tokens( x, split = "[-_[:space:]]+", min_freq = 2, min_nchar = 3, return_values = TRUE, std_fn = string_std, ... )
x |
a character vector (generally a hierarchical column) |
split |
regex pattern used to split values into tokens. By default splits on any sequence of one or more space characters ("[:space:]"), dashes ("-"), and/or underscores ("_"). |
min_freq |
minimum token frequency (i.e. number of unique values in which a given token occurs). Defaults to 2. |
min_nchar |
minimum token size in number of characters. Defaults to 3. |
return_values |
logical indicating whether to return the standardized
values in which each token is found ( |
std_fn |
function to standardize strings, as performed within all
|
... |
additional arguments passed to |
french_departments <- c(
  "Alpes-de-Haute-Provence", "Hautes-Alpes", "Ardennes", "Bouches-du-Rhône",
  "Corse-du-Sud", "Haute-Corse", "Haute-Garonne", "Ille-et-Vilaine",
  "Haute-Loire", "Hautes-Pyrénées", "Pyrénées-Atlantiques", "Hauts-de-Seine"
)

count_tokens(french_departments)
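To make the frequency definition concrete, the following base-R sketch counts, for each token, the number of unique standardized values that contain it. It is an illustration only and not the package's implementation: the helper name count_tokens_sketch is hypothetical, and tolower() stands in for the real string standardization.

```r
# Hypothetical helper illustrating the token-frequency idea in base R.
count_tokens_sketch <- function(x, split = "[-_[:space:]]+",
                                min_freq = 2, min_nchar = 3) {
  # stand-in for string standardization (assumption: lower-casing only)
  values <- unique(tolower(x[!is.na(x)]))
  # unique tokens within each value, so a token counts once per value
  tokens <- lapply(strsplit(values, split), unique)
  freq <- table(unlist(tokens))
  tok <- names(freq)
  n <- as.integer(freq)
  keep <- n >= min_freq & nchar(tok) >= min_nchar
  out <- data.frame(token = tok[keep], n = n[keep], stringsAsFactors = FALSE)
  out[order(-out$n, out$token), , drop = FALSE]
}

regions <- c("North Bay", "North York", "South Bay", "York", "NORTH BAY")
res <- count_tokens_sketch(regions)
res
```

Here "North Bay" and "NORTH BAY" collapse to a single standardized value, so "north" and "bay" are each counted in two unique values rather than three.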
During hierarchical matching with the hmatch_ group of functions, values within raw can be temporarily recoded to match values within ref based on a dictionary (argument dict) that maps raw values to their desired replacement values (optionally limited to a given hierarchical column). Note that this recoding is done internally, and doesn't actually modify the values of raw that are returned (it just enables a match to the proper values of ref).
For example, if the raw data contains entries of "USA" for variable "adm0", which we know correspond to the value "United States" within the reference data, we can specify a dictionary as follows:
dict <- data.frame(value = "USA", replacement = "United States", variable = "adm0")
The column names in the dictionary don't actually matter, but the column order must be:
1. value in raw to temporarily replace
2. replacement value (to match value in ref)
3. (optional) name of hierarchical column in raw to recode
If the dictionary contains only two columns (values and replacements), then all recoding will be applied to every hierarchical column.
To apply only a portion of the dictionary to all hierarchical columns (and the rest to specified columns), a user can specify a third dictionary column with values of <NA> in rows where the recoding should apply to all hierarchical columns. E.g.
dict <- data.frame(value = c("USA", "Washingtin"), replace = c("United States", "Washington"), variable = c("adm0", NA))
For example, the dictionary above specifies that values of "USA" within column "adm0" will be temporarily replaced with "United States", while values of "Washingtin" within any hierarchical column will be replaced with "Washington".
Note that string standardization as specified by argument std_fn (see string_standardization) also applies to dictionaries. For example, given the default standardization function which includes case-standardization, a dictionary value of "USA" will match (and therefore recode) raw entries "USA" and "usa", but not e.g. "U.S.A.".
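The recoding rules described above (column order, an NA in the third column meaning "all hierarchical columns", and standardization applied to dictionary values) can be sketched in base R as follows. This is only an illustration under stated assumptions: recode_sketch is a hypothetical helper, and the std() function here (lower-casing and trimming) merely stands in for the package's standardization.

```r
# Hypothetical helper: apply a recoding dictionary to hierarchical columns.
recode_sketch <- function(raw, dict, cols) {
  std <- function(x) tolower(trimws(x))  # assumption: simplified std_fn
  out <- raw
  for (i in seq_len(nrow(dict))) {
    # third column (if present and non-missing) restricts the target column
    target <- if (ncol(dict) >= 3 && !is.na(dict[[3]][i])) dict[[3]][i] else cols
    for (col in intersect(target, cols)) {
      hit <- !is.na(out[[col]]) & std(out[[col]]) == std(dict[[1]][i])
      out[[col]][hit] <- dict[[2]][i]
    }
  }
  out
}

raw <- data.frame(
  adm0 = c("usa", "USA", "U.S.A."),
  adm1 = c("Washingtin", "Oregon", "washingtin"),
  stringsAsFactors = FALSE
)
dict <- data.frame(
  value = c("USA", "Washingtin"),
  replacement = c("United States", "Washington"),
  variable = c("adm0", NA),
  stringsAsFactors = FALSE
)

recoded <- recode_sketch(raw, dict, cols = c("adm0", "adm1"))
recoded
```

Note that the sketch returns the recoded data for inspection, whereas the hmatch_ functions apply the recoding internally and return the original values of raw.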
Create codes to identify each unique combination of hierarchical levels in a reference dataset. Codes may be integer-based (function hcodes_int) or string-based (hcodes_str). Integer-based codes reflect the alphabetical ranking of each level within the next-highest level. They are constant-width and may optionally be prefixed with any given string. String-based codes are created by pasting together the values of each hierarchical level with a given separator (with options for string standardization prior to collapsing).
hcodes_str(ref, pattern, by, sep = "__", std_fn = string_std)
hcodes_int(ref, pattern, by, prefix = "")
ref |
|
pattern |
regex pattern to match the names of the hierarchical columns
in |
by |
vector giving the names of the hierarchical columns in |
sep |
(only for |
std_fn |
(only for |
prefix |
(only for |
A vector of codes
data(ne_ref)

# string-based codes
hcodes_str(ne_ref, pattern = "^adm")

# integer-based codes
hcodes_int(ne_ref, pattern = "^adm")
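The string-based approach (standardize each level, then paste the levels together with a separator) can be illustrated with a small base-R sketch. The helper codes_str_sketch, the simplified std() function, and the choice to drop missing levels before pasting are all assumptions made for illustration; the output of hcodes_str may differ.

```r
ref <- data.frame(
  adm0 = c("Canada", "Canada", "United States", "United States"),
  adm1 = c(NA, "Ontario", NA, "New York"),
  stringsAsFactors = FALSE
)

# assumption: simplified standardization (lower-case, non-alphanumerics to "_")
std <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

# Hypothetical helper: paste standardized, non-missing levels row by row.
codes_str_sketch <- function(ref, cols, sep = "__") {
  vapply(seq_len(nrow(ref)), function(i) {
    row <- unlist(ref[i, cols], use.names = FALSE)
    paste(std(row[!is.na(row)]), collapse = sep)
  }, character(1))
}

codes <- codes_str_sketch(ref, c("adm0", "adm1"))
codes
```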
Match sets of hierarchical values (e.g. province, county, township) in a raw, messy dataset to corresponding values within a reference dataset, optionally accounting for discrepancies between the datasets such as:
variation in character case, use of accents, or spelling
variation in hierarchical resolution (e.g. some entries specified to municipality but others only to region)
missing values at one or more hierarchical levels
hmatch( raw, ref, pattern, pattern_ref = pattern, by, by_ref = by, type = "left", allow_gaps = TRUE, fuzzy = FALSE, fuzzy_method = "osa", fuzzy_dist = 1L, dict = NULL, ref_prefix = "ref_", std_fn = string_std, ... )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in Note: hierarchical column names can be matched using either the |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
type |
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. |
allow_gaps |
logical indicating whether to allow missing values below
the match level, where 'match level' is the highest level with a
non-missing value within a given row of |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the
|
fuzzy_method |
if |
fuzzy_dist |
if |
dict |
optional dictionary for recoding values within the hierarchical
columns of |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
In hmatch, if argument type corresponds to a resolve join, rows of raw with multiple matches to ref are always resolved to 'no match'. This is because hmatch does not accept matches below the highest non-missing level within a given row of raw. E.g.
raw:
1. | United States | <NA> | Jefferson |
Relevant rows from ref:
1. | United States | New York | Jefferson |
2. | United States | Pennsylvania | Jefferson |
In a regular join with hmatch, the single row from raw (above) will match both rows of ref. However, in a resolve join the multiple conflicting matches (i.e. conflicting values at the 2nd hierarchical level) will result in the row from raw being treated as non-matching to ref.
data(ne_raw)
data(ne_ref)

hmatch(ne_raw, ne_ref, pattern = "adm", type = "inner")
Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using a variety of matching strategies implemented in sequence to identify the best-possible match (i.e. highest-resolution) for each row.
The sequence of matching strategies is:
(optional) manually-specified matching with hmatch_manual
complete matching with hmatch(..., allow_gaps = FALSE)
partial matching with hmatch(..., allow_gaps = TRUE)
fuzzy partial matching with hmatch(allow_gaps = TRUE, fuzzy = TRUE)
best-possible matching with hmatch_settle
Each approach is implemented only on the rows of data for which a single match has not already been identified using the previous approaches.
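The general pattern of applying strategies in sequence, each only to rows that are still unmatched, can be sketched in base R. This is a toy illustration on single strings rather than hierarchical rows: match_in_sequence and the three matcher functions are hypothetical, and utils::adist() (generalized Levenshtein distance) is used in place of the package's fuzzy-matching method.

```r
# Hypothetical helper: try each strategy only on the still-unmatched keys.
match_in_sequence <- function(keys, strategies) {
  result <- rep(NA_character_, length(keys))
  for (f in strategies) {
    todo <- is.na(result)
    if (!any(todo)) break
    result[todo] <- f(keys[todo])
  }
  result
}

ref <- c("new york", "ontario")

# three toy strategies, from strictest to most permissive
exact <- function(x) ref[match(x, ref)]
case_insensitive <- function(x) ref[match(tolower(x), ref)]
fuzzy <- function(x) {
  vapply(x, function(v) {
    d <- utils::adist(tolower(v), ref)[1, ]
    hit <- which(d <= 1)
    # accept only an unambiguous (single) fuzzy match
    if (length(hit) == 1) ref[hit] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

out <- match_in_sequence(
  c("ontario", "New York", "Ontaro", "Quebec"),
  list(exact, case_insensitive, fuzzy)
)
out
```

Ordering the strategies from strictest to most permissive means a value that can be matched exactly is never exposed to the looser, more error-prone strategies.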
hmatch_composite( raw, ref, man, pattern, pattern_ref = pattern, by, by_ref = by, code_col, type = "resolve_left", allow_gaps = TRUE, fuzzy = FALSE, fuzzy_method = "osa", fuzzy_dist = 1L, dict = NULL, ref_prefix = "ref_", std_fn = string_std, ... )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
man |
(optional) data frame of manually-specified matches, relating a
given set of hierarchical values to the code within |
pattern |
regex pattern to match the hierarchical columns in |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
code_col |
name of the code column containing codes for matching |
type |
type of join ("resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "resolve_left". See join_types. |
allow_gaps |
logical indicating whether to allow missing values below
the match level, where 'match level' is the highest level with a
non-missing value within a given row of |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the
|
fuzzy_method |
if |
fuzzy_dist |
if |
dict |
optional dictionary for recoding values within the hierarchical
columns of |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
data(ne_raw)
data(ne_ref)

hmatch_composite(ne_raw, ne_ref, fuzzy = TRUE)
Match a data.frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using a dictionary of manually-specified matches.
hmatch_manual( raw, ref, man, pattern, pattern_ref = pattern, by, by_ref = by, code_col, type = "left", ref_prefix = "ref_", std_fn = string_std, ... )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
man |
|
pattern |
regex pattern to match the hierarchical columns in |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
code_col |
name of the code column containing codes for matching |
type |
type of join ("left", "inner", or "anti"). Defaults to "left".
See join_types. Note that this function does not allow 'resolve
joins', unlike most other |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
a data frame obtained by matching the hierarchical columns in raw and ref based on sets of matches specified in man, using the join type specified by argument type (see join_types for more details)
data(ne_raw)
data(ne_ref)

# create df mapping sets of raw hierarchical values to codes within ref
ne_man <- data.frame(
  adm0 = NA_character_,
  adm1 = NA_character_,
  adm2 = "Bergen, N.J.",
  hcode = "211",
  stringsAsFactors = FALSE
)

# find manual matches
hmatch_manual(ne_raw, ne_ref, ne_man, code_col = "hcode", type = "inner")
Match a hierarchical column (e.g. region, province, or county) within a raw, potentially messy dataset against a corresponding column within a reference dataset, by searching for similar sets of 'offspring' (i.e. values at the next hierarchical level).
For example, if the raw dataset uses admin1 level "NY" whereas the reference dataset uses "New York", it would be difficult to automatically match these values using only fuzzy-matching. However, we might nonetheless be able to match "NY" to "New York" if they share a common and unique set of 'offspring' (i.e. admin2 values) across both datasets (e.g. "Kings", "Queens", "New York", "Suffolk", "Bronx", etc.).
Unlike other hmatch functions, the data frame returned by hmatch_parents only includes unique hierarchical combinations and only relevant hierarchical levels (i.e. the parent level and above), along with additional columns giving the number of matching children and total number of children for a given parent.
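The offspring-based idea can be sketched in base R by counting, for every pair of raw and reference parents, how many child values they share. This is a simplified illustration and not the package's algorithm: the data are made up, tolower() stands in for string standardization, and the threshold of 2 plays the role of min_matches.

```r
raw <- data.frame(
  adm1 = c("NY", "NY", "NY", "NJ", "NJ"),
  adm2 = c("Kings", "Queens", "Bronx", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  adm1 = c("New York", "New York", "New York", "New Jersey", "New Jersey"),
  adm2 = c("Kings", "Queens", "Suffolk", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)

# children of each parent (assumption: lower-casing as standardization)
kids_raw <- split(tolower(raw$adm2), raw$adm1)
kids_ref <- split(tolower(ref$adm2), ref$adm1)

# every raw parent paired with every ref parent
pairs <- expand.grid(
  raw_parent = names(kids_raw),
  ref_parent = names(kids_ref),
  stringsAsFactors = FALSE
)

# number of shared children for each pair
pairs$n_child_match <- mapply(
  function(r, f) length(intersect(kids_raw[[r]], kids_ref[[f]])),
  pairs$raw_parent, pairs$ref_parent,
  USE.NAMES = FALSE
)

# keep pairs sharing at least 2 children
matches <- pairs[pairs$n_child_match >= 2, ]
matches
```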
hmatch_parents( raw, ref, pattern, pattern_ref = pattern, by, by_ref = by, level, min_matches = 1L, type = "left", fuzzy = FALSE, fuzzy_method = "osa", fuzzy_dist = 1L, ref_prefix = "ref_", std_fn = string_std, ... )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in Note: hierarchical column names can be matched using either the |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
level |
name or integer index of the hierarchical level to match at
(i.e. the 'parent' level). If a name, must correspond to a hierarchical
column within |
min_matches |
minimum number of matching offspring required for parents to be considered a match. Defaults to 1. |
type |
type of join ("left", "inner" or "anti") (defaults to "left") |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the
|
fuzzy_method |
if |
fuzzy_dist |
if |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
a data frame obtained by matching the hierarchical columns in raw and ref (at the parent level and above), using the join type specified by argument type (see join_types for more details). Note that unlike other hmatch_ functions, hmatch_parents returns only unique rows and relevant hierarchical columns (i.e. the parent level and above), along with additional columns describing the number of matching children and total number of children for a given parent.
... |
hierarchical columns from |
... |
hierarchical columns from |
n_child_raw |
total number of unique children belonging to the parent within |
n_child_ref |
total number of unique children belonging to the parent within |
n_child_match |
number of children in |
# e.g. match abbreviated adm1 names to full names based on common offspring
raw <- ne_ref
raw$adm1[raw$adm1 == "Ontario"] <- "ON"
raw$adm1[raw$adm1 == "New York"] <- "NY"
raw$adm1[raw$adm1 == "New Jersey"] <- "NJ"
raw$adm1[raw$adm1 == "Pennsylvania"] <- "PA"

hmatch_parents(
  raw, ne_ref, pattern = "adm", level = "adm1", min_matches = 2, type = "left"
)
Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using sequential permutation of the hierarchical columns to allow for values entered at the wrong hierarchical level.
The function calls hmatch on each possible permutation of the hierarchical columns, and then combines the results. Rows of raw yielding multiple matches to ref can optionally be resolved using a resolve-type join (see section Resolve joins below).
hmatch_permute( raw, ref, pattern, pattern_ref = pattern, by, by_ref = by, type = "left", allow_gaps = TRUE, fuzzy = FALSE, fuzzy_method = "osa", fuzzy_dist = 1L, dict = NULL, ref_prefix = "ref_", std_fn = string_std, ... )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in Note: hierarchical column names can be matched using either the |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
type |
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. |
allow_gaps |
logical indicating whether to allow missing values below
the match level, where 'match level' is the highest level with a
non-missing value within a given row of |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the
|
fuzzy_method |
if |
fuzzy_dist |
if |
dict |
optional dictionary for recoding values within the hierarchical
columns of |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
In hmatch_permute, if argument type corresponds to a resolve join, rows of raw with multiple matches to ref are resolved to the highest hierarchical level that is common among all matches (or no match if there is a conflict at the very first level). E.g.
raw:
1. | United States | <NA> | New York |
Relevant rows from ref:
1. | United States | New York | <NA> |
2. | United States | New York | New York |
In a regular join with hmatch_permute, the single row from raw (above) will match both of the depicted rows from ref. However, in a resolve join the two matches will resolve to the first row from ref, because it reflects the highest hierarchical level that is common to all matches.
data(ne_raw)
data(ne_ref)

hmatch_permute(ne_raw, ne_ref, pattern = "^adm", type = "inner")
Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, sequentially over each hierarchical level. Specifically, implements hmatch at each successive hierarchical level, starting with only the first level (lowest resolution), then the first and second, then the first, second, and third, etc.
After the initial matching over all levels, users can optionally use a resolve join to 'settle' for the highest match possible for each row of raw data, even if that match is below the highest-resolution level specified.
hmatch_settle( raw, ref, pattern, pattern_ref = pattern, by, by_ref = by, type = "left", allow_gaps = TRUE, fuzzy = FALSE, fuzzy_method = "osa", fuzzy_dist = 1L, dict = NULL, ref_prefix = "ref_", std_fn = string_std, ... )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in Note: hierarchical column names can be matched using either the |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
type |
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. |
allow_gaps |
logical indicating whether to allow missing values below
the match level, where 'match level' is the highest level with a
non-missing value within a given row of |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the
|
fuzzy_method |
if |
fuzzy_dist |
if |
dict |
optional dictionary for recoding values within the hierarchical
columns of |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
In a resolve type join with hmatch_settle, rows of raw with multiple matches to ref are resolved to the highest hierarchical level that is non-conflicting among all matches (or no match if there is a conflict at the very first level). E.g.
raw:
1. | United States | <NA> | Jefferson |
Relevant rows from ref:
1. | United States | <NA> | <NA> |
2. | United States | New York | Jefferson |
3. | United States | Pennsylvania | Jefferson |
In a regular join, the single row from raw (above) will match all three rows from ref. However, in a resolve join the multiple matches will be resolved to the first row from ref, because only the first hierarchical level ("United States") is non-conflicting among all possible matches.
Note that there's a distinction between "common" values at a given hierarchical level (i.e. a single unique value in each row) and "non-conflicting" values (i.e. a single unique value or a missing value). E.g.
raw:
1. | United States | New York | New York |
Relevant rows from ref:
1. | United States | <NA> | <NA> |
2. | United States | New York | <NA> |
3. | United States | New York | New York |
In the example above, only the 1st hierarchical level ("United States") is "common" to all matches, but all hierarchical levels are "non-conflicting" (i.e. because row 2 is a hierarchical child of row 1, and row 3 a child of row 2), and so a resolve-type match will be made to the 3rd row in ref.
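The "non-conflicting" rule can be illustrated with a base-R sketch that walks down the hierarchical levels of a set of candidate matches and stops at the first level with more than one distinct non-missing value. The helper resolve_sketch is hypothetical and makes simplifying assumptions (columns ordered from the first level downward, no gaps within candidate rows); it is not the package's implementation.

```r
# Hypothetical helper: resolve candidate ref rows to the deepest
# non-conflicting level.
resolve_sketch <- function(cands) {
  depth <- 0
  for (j in seq_along(cands)) {
    vals <- unique(cands[[j]][!is.na(cands[[j]])])
    if (length(vals) != 1) break  # conflict (or nothing specified) at level j
    depth <- j
  }
  # conflict at the very first level: no match
  if (depth == 0) return(cands[0, , drop = FALSE])
  # assumption: candidate rows have no gaps, so the row specified to exactly
  # `depth` levels is the resolved match
  n_filled <- rowSums(!is.na(cands))
  cands[n_filled == depth, , drop = FALSE]
}

# candidates from the "Jefferson" example: conflict at the 2nd level
c1 <- data.frame(
  adm0 = rep("United States", 3),
  adm1 = c(NA, "New York", "Pennsylvania"),
  adm2 = c(NA, "Jefferson", "Jefferson"),
  stringsAsFactors = FALSE
)

# candidates from the "New York" example: no conflict at any level
c2 <- data.frame(
  adm0 = rep("United States", 3),
  adm1 = c(NA, "New York", "New York"),
  adm2 = c(NA, NA, "New York"),
  stringsAsFactors = FALSE
)

r1 <- resolve_sketch(c1)  # resolves to the country-level row
r2 <- resolve_sketch(c2)  # resolves to the most specific row
```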
data(ne_raw)
data(ne_ref)

# return matches at all levels
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "inner")

# use a resolve join to settle for the best possible match for each row
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner")
Implements hierarchical matching, separately at each hierarchical level within the data. For a given level, the raw data that is matched includes every unique combination of values at and below the level of interest. E.g.
Level 1: | Canada |
| United States |
Level 2: | Canada | Ontario |
| United States | New York |
| United States | Pennsylvania |
Level 3: | Canada | Ontario | Ottawa |
| Canada | Ontario | Toronto |
| United States | New York | Bronx |
| United States | New York | New York |
| United States | Pennsylvania | Philadelphia |
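The per-level splitting shown above can be approximated in base R by taking, for each level, the unique combinations of that level's column and all higher-level columns. This is a sketch with made-up data; the names raw, cols, and by_level are illustrative only.

```r
raw <- data.frame(
  adm0 = c("Canada", "Canada", "United States", "United States"),
  adm1 = c("Ontario", "Ontario", "New York", "New York"),
  adm2 = c("Ottawa", "Toronto", "Bronx", "Bronx"),
  stringsAsFactors = FALSE
)

cols <- c("adm0", "adm1", "adm2")

# one data frame per level, holding the unique combinations up to that level
by_level <- lapply(seq_along(cols), function(i) {
  unique(raw[, cols[seq_len(i)], drop = FALSE])
})
names(by_level) <- cols

n_rows <- unname(vapply(by_level, nrow, integer(1)))
n_rows
```

Each element of by_level could then be matched against the reference data separately, which is the role played by argument fn in hmatch_split.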
hmatch_split( raw, ref, pattern, pattern_ref = pattern, by, by_ref = by, fn = "hmatch", type = "left", allow_gaps = TRUE, fuzzy = FALSE, fuzzy_method = "osa", fuzzy_dist = 1L, dict = NULL, ref_prefix = "ref_", std_fn = string_std, ..., levels = NULL, always_list = FALSE, man, code_col, always_tokenize = FALSE, token_split = "_", exclude_freq = 3, exclude_nchar = 3, exclude_values = NULL )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in Note: hierarchical column names can be matched using either the |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
fn |
which function to use for matching. Current options are
Note that some subsequent arguments are only relevant for specific
functions (e.g. the |
type |
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. Note that the details of resolve joins vary somewhat among hmatch functions
(see documentation for the relevant function), and that function
|
allow_gaps |
logical indicating whether to allow missing values below
the match level, where 'match level' is the highest level with a
non-missing value within a given row of |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the
|
fuzzy_method |
if |
fuzzy_dist |
if |
dict |
optional dictionary for recoding values within the hierarchical
columns of |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
levels |
a vector of names or integer indices corresponding to one or
more of the hierarchical columns in |
always_list |
logical indicating whether to always return a list, even
when argument |
man |
(optional) data frame of manually-specified matches, relating a
given set of hierarchical values to the code within |
code_col |
name of the code column containing codes for matching |
always_tokenize |
logical indicating whether to tokenize all values
prior to matching ( |
token_split |
regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument
|
exclude_freq |
exclude tokens from matching if they have a frequency
greater than or equal to this value. Refers to the number of unique,
string-standardized values at a given hierarchical level in which a given
token occurs, as calculated by |
exclude_nchar |
exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3. |
exclude_values |
character vector of additional tokens to exclude from matching. Subject to string-standardization with argument |
A list of data frames, each returned by a call to fn on the unique combination of hierarchical values at the given hierarchical level. The number of elements in the list corresponds to the number of hierarchical columns in raw, or, if specified, the number of elements in argument levels.
However, if always_list = FALSE and length(levels) == 1, a single data frame is returned (i.e. not wrapped in a list).
data(ne_raw)
data(ne_ref)

# by default calls fn `hmatch` separately for each hierarchical level
hmatch_split(ne_raw, ne_ref)

# can also specify other hmatch functions, and subsets of hierarchical levels
hmatch_split(ne_raw, ne_ref, fn = "hmatch_tokens", levels = 2:3)
Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, using tokenization to help match multi-term values that might otherwise be difficult to match (e.g. "New York City" vs. "New York").
Includes options for ignoring matches from frequently-occurring tokens (e.g. "North", "South", "City"), small tokens (e.g. "El", "San", "New"), or any other set of tokens specified by the user.
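A minimal base-R sketch of token-based comparison, including the exclusion of short or user-specified tokens, is given below. The helper tokens_match is hypothetical and compares only two single strings; tolower() stands in for string standardization, and frequency-based exclusion is omitted for brevity.

```r
# Hypothetical helper: do two values share at least `token_min` usable tokens?
tokens_match <- function(a, b, split = "[-_[:space:]]+", token_min = 1,
                         exclude_nchar = 3, exclude_values = character(0)) {
  tok <- function(x) {
    t <- unique(strsplit(tolower(x), split)[[1]])
    # drop short tokens and explicitly excluded tokens
    t[nchar(t) > exclude_nchar & !t %in% tolower(exclude_values)]
  }
  length(intersect(tok(a), tok(b))) >= token_min
}

# "york" is shared ("new" is dropped as a short token)
m1 <- tokens_match("New York City", "New York")

# "south" is shared: a false positive from a frequently-occurring token
m2 <- tokens_match("South Bend", "South Park")

# excluding "south" removes the false positive
m3 <- tokens_match("South Bend", "South Park", exclude_values = "south")
```

The second and third calls show why excluding frequent tokens matters: without the exclusion, two unrelated places match on "south" alone.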
hmatch_tokens( raw, ref, pattern, pattern_ref = pattern, by, by_ref = by, type = "left", allow_gaps = TRUE, always_tokenize = FALSE, token_split = "_", token_min = 1, exclude_freq = 3, exclude_nchar = 3, exclude_values = NULL, fuzzy = FALSE, fuzzy_method = "osa", fuzzy_dist = 1L, dict = NULL, ref_prefix = "ref_", std_fn = string_std, ... )
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in Note: hierarchical column names can be matched using either the |
pattern_ref |
regex pattern to match the hierarchical columns in |
by |
vector giving the names of the hierarchical columns in |
by_ref |
vector giving the names of the hierarchical columns in |
type |
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. |
allow_gaps |
logical indicating whether to allow missing values below
the match level, where 'match level' is the highest level with a
non-missing value within a given row of |
always_tokenize |
logical indicating whether to tokenize all values
prior to matching ( |
token_split |
regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument
|
token_min |
minimum number of tokens that must match for a term to be considered matching overall. Defaults to 1. |
exclude_freq |
exclude tokens from matching if they have a frequency
greater than or equal to this value. Refers to the number of unique,
string-standardized values at a given hierarchical level in which a given
token occurs, as calculated by |
exclude_nchar |
exclude tokens from matching if they have nchar
less than or equal to this value. Defaults to |
exclude_values |
character vector of additional tokens to exclude from
matching. Subject to string-standardization
with argument |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the
|
fuzzy_method |
if |
fuzzy_dist |
if |
dict |
optional dictionary for recoding values within the hierarchical
columns of |
ref_prefix |
prefix to add to names of returned columns from |
std_fn |
function to standardize strings during matching. Defaults to
|
... |
additional arguments passed to |
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
Uses the same approach to resolve joins as hmatch.
data(ne_raw)
data(ne_ref)

# add tokens to some values within ref to illustrate tokenized matching
ne_ref$adm0[ne_ref$adm0 == "United States"] <- "United States of America"
ne_ref$adm1[ne_ref$adm1 == "New York"] <- "New York State"

hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 1)
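The token-based comparison can be illustrated with a small base-R sketch. This is not hmatch's implementation: the helper tokens_match() is hypothetical, and it only mimics the documented ideas of splitting standardized strings on token_split, dropping tokens with nchar less than or equal to exclude_nchar, dropping tokens listed in exclude_values, and requiring at least token_min shared tokens. It assumes both inputs have already been string-standardized.

```r
# Hypothetical helper, for illustration only (not part of hmatch).
# Assumes inputs are already standardized (lowercase, "_" between tokens).
tokens_match <- function(x, y, token_split = "_", token_min = 1,
                         exclude_nchar = 3, exclude_values = character(0)) {
  tokenize <- function(s) {
    tok <- unique(strsplit(s, token_split)[[1]])
    tok <- tok[nchar(tok) > exclude_nchar]  # drop small tokens
    setdiff(tok, exclude_values)            # drop user-specified tokens
  }
  # two values match if they share at least token_min tokens
  length(intersect(tokenize(x), tokenize(y))) >= token_min
}

m1 <- tokens_match("united_states", "united_states_of_america")
m2 <- tokens_match("new_york", "new_york_state")
m3 <- tokens_match("new_york", "new_jersey")
m4 <- tokens_match("new_york", "new_york_state", exclude_values = "york")
```

In this sketch the token "new" is ignored because it has only three characters, so "new_york" and "new_jersey" share no usable tokens, which is the kind of false positive the exclusion arguments are meant to avoid.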
The basic join types used in the hmatch package ("left", "inner", "anti") are conceptually equivalent to dplyr's join types.
For each of the three join types there is also a counterpart prefixed by "resolve_" ("resolve_left", "resolve_inner", "resolve_anti"). In a resolve join, rows of raw with matches to multiple rows of ref are resolved either to a single best match or no match before the subsequent join type is implemented. In a resolve join, rows of raw are never duplicated.
The exact details of match resolution vary somewhat among functions, and are explained within each function's documentation.
left |
return all rows from |
inner |
return only the rows of |
anti |
return all rows from |
resolve_left |
similar to "left", except that any row of |
resolve_inner |
similar to "inner", except that any row of |
resolve_anti |
similar to "anti", except that any row of |
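As a rough illustration of the three basic join types, the following base-R sketch uses merge() and subsetting on a toy exact-match key. The data and column names are made up, and the code is only a conceptual stand-in: hmatch functions match hierarchically and do not work this way internally.

```r
# Toy data; column names are hypothetical.
raw <- data.frame(id = 1:3, adm1 = c("ontario", "quebec", "utopia"))
ref <- data.frame(adm1 = c("ontario", "quebec"), code = c("on", "qc"))

# "left": all rows of raw, with ref columns where a match exists
res_left <- merge(raw, ref, by = "adm1", all.x = TRUE)

# "inner": only the rows of raw that have a match in ref
res_inner <- merge(raw, ref, by = "adm1")

# "anti": only the rows of raw that have no match in ref
res_anti <- raw[!raw$adm1 %in% ref$adm1, , drop = FALSE]
```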
Given a data frame with columns specifying hierarchically-nested levels, find the maximum non-missing hierarchical level for each row.
max_levels(x, pattern, by, type = c("index", "name"))
x |
a data frame containing hierarchical columns |
pattern |
regex pattern to match the names of the hierarchical columns
in |
by |
vector giving the names of the hierarchical columns in |
type |
type of return, either "index" to return integer indices
(starting at 1) or "name" to return column names (as matched by |
Vector of indices or names corresponding to the maximum non-missing hierarchical level for each row
data(ne_raw)

# return integer indices (starting at 1)
max_levels(ne_raw, pattern = "^adm")

# return column names
max_levels(ne_raw, pattern = "^adm", type = "name")
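The idea of a maximum non-missing level can be sketched in base R as follows. The helper max_level_index() is hypothetical and is not the package's max_levels(); in particular, returning 0 for a row with no non-missing values is an assumption of this sketch.

```r
# Toy data with three hierarchical columns, ordered broadest to finest.
x <- data.frame(
  adm0 = c("canada", "canada", "usa", NA),
  adm1 = c("ontario", NA, "new york", NA),
  adm2 = c("peel", NA, NA, NA)
)

# Hypothetical helper: index of the last non-missing column in each row
# (0 if every column is missing, which is an assumption of this sketch).
max_level_index <- function(df) {
  apply(!is.na(df), 1, function(present) max(c(0L, which(present))))
}

idx <- as.integer(max_level_index(x))

# the equivalent of returning names rather than indices
lvl <- ifelse(idx > 0, names(x)[pmax(idx, 1L)], NA_character_)
```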
Raw entries of select administrative districts from the northeastern portion of North America.
ne_raw
A data.frame with 15 rows and 4 variables:
Identifier
Name of administrative 0 level (country)
Name of administrative 1 level (state/province)
Name of administrative 2 level (county/census division)
Reference table of select administrative districts in the northeastern portion of North America.
ne_ref
A data.frame with 31 rows and 4 variables, all of class character:
Administrative level
Name of administrative 0 level (country)
Name of administrative 1 level (state/province)
Name of administrative 2 level (county/census division)
Hierarchical code
For example, a municipality-level reference data.frame might contain three hierarchical columns — country, state, and municipality — but nonetheless only reflect the municipality level in that all rows represent a unique municipality. The lower-resolution levels (state, country) are implied but not explicitly represented as unique rows. If we wish to allow matches to the lower-resolution levels, we need additional rows specific to these levels.
This function takes a reference data.frame with N hierarchical columns, and adds rows for each unique combination of each level that is not currently explicitly represented.
ref_expand(ref, pattern, by, lowest_level = 1L)
ref |
|
pattern |
regex pattern to match the names of the hierarchical columns
in |
by |
vector giving the names of the hierarchical columns in |
lowest_level |
integer representing the lowest-resolution level
(defaults to |
A data.frame created by expanding ref to all implied hierarchical levels
# subset example reference df to the admin-2 level
ne_ref_adm2 <- ne_ref[!is.na(ne_ref$adm2),]

# expand back to all levels
ref_expand(ne_ref_adm2, pattern = "adm", lowest_level = 0)
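To make the expansion concrete, here is a minimal base-R sketch of the idea using the municipality-level scenario described above. The helper expand_levels() is hypothetical and is not the package's ref_expand(); it assumes every column of the input is hierarchical, that columns are ordered broadest to finest, and that no implied level is already present as its own row.

```r
# Municipality-level reference: every row is a unique municipality.
ref <- data.frame(
  country = c("canada", "canada", "usa"),
  state   = c("ontario", "ontario", "new york"),
  muni    = c("toronto", "ottawa", "buffalo")
)

# Hypothetical helper: for each level j, keep the unique combinations of the
# first j columns and set all finer levels to NA, then stack the results.
expand_levels <- function(ref) {
  n <- ncol(ref)
  parts <- lapply(seq_len(n), function(j) {
    out <- unique(ref[, seq_len(j), drop = FALSE])
    if (j < n) out[names(ref)[(j + 1):n]] <- NA_character_
    out
  })
  do.call(rbind, parts)
}

expanded <- expand_levels(ref)
```

The result has two country-level rows, two state-level rows and the three original municipality-level rows.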
Separate a data frame column containing hierarchical codes into multiple columns, one for each level within the hierarchical code.
Like tidyr::separate except that successive levels are cumulative rather than independent. E.g. the code "canada__ontario__toronto" would be split into three levels:
"canada"
"canada__ontario"
"canada__ontario__toronto"
separate_hcode( x, col, into, sep = "__", extra = c("warn", "drop"), remove = FALSE )
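The cumulative splitting described above can be sketched for a single code in base R. The helper cumulative_levels() is hypothetical and is not the package's separate_hcode(), which operates on a data frame column; the sketch only shows how each level keeps the prefix of all broader levels.

```r
# Hypothetical helper: split one hierarchical code into cumulative levels.
cumulative_levels <- function(code, sep = "__") {
  parts <- strsplit(code, sep, fixed = TRUE)[[1]]
  vapply(
    seq_along(parts),
    function(i) paste(parts[seq_len(i)], collapse = sep),
    character(1)
  )
}

levels_out <- cumulative_levels("canada__ontario__toronto")
```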
x |
|
col |
Name of the column within |
into |
Vector of column names to separate |
sep |
Separator between levels in the hierarchical codes. Defaults to "__". |
extra |
What to do if a hierarchical code contains more levels than are
implied by argument
|
remove |
Logical indicating whether to remove |
The original data.frame x with additional columns for each level of the hierarchical code
data(ne_ref)

# generate pcode
ne_ref$pcode <- hcodes_str(ne_ref, pattern = "^adm\\d")

# separate pcode into constituent levels
separate_hcode(
  ne_ref,
  col = "pcode",
  into = c("adm0_pcode", "adm1_pcode", "adm2_pcode")
)
pattern
or by
Within the hmatch_ group of functions, there are three ways to specify the hierarchical columns to be matched.
In all cases, it is assumed that matched columns are already correctly ordered, with the first matched column reflecting the broadest hierarchical level (lowest-resolution, e.g. country) and the last column reflecting the finest level (highest-resolution, e.g. township).
raw
and ref
If neither pattern nor by are specified (the default), then the hierarchical columns are assumed to be all column names that are common to both raw and ref.
Arguments pattern and pattern_ref take regex patterns to match the hierarchical columns in raw and ref, respectively. Argument pattern_ref only needs to be specified if it's different from pattern (i.e. if the hierarchical columns have different names in raw vs. ref).
For example, if the hierarchical columns in raw are "ADM_1", "ADM_2", and "ADM_3", which correspond respectively to columns within ref named "REF_ADM_1", "REF_ADM_2", and "REF_ADM_3", then the pattern arguments can be specified as:
pattern = "^ADM_[[:digit:]]"
pattern_ref = "^REF_ADM_[[:digit:]]"
Alternatively, because pattern_ref defaults to the same value as pattern (unless otherwise specified), one could specify a single regex pattern that matches the hierarchical columns in both raw and ref, e.g.
pattern = "ADM_[[:digit:]]"
However, the user should exercise care to ensure that there are no non-hierarchical columns within raw or ref that may inadvertently be matched by the given pattern.
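One way to check a pattern before matching is to apply it to the column names directly. The sketch below assumes the pattern behaves like an ordinary base-R regex applied to column names with grep(); the column "ADM_1_POP" is a made-up non-hierarchical column added to show an inadvertent match.

```r
# Hypothetical column names for raw and ref.
raw_names <- c("id", "ADM_1", "ADM_2", "ADM_3", "ADM_1_POP")
ref_names <- c("REF_ADM_1", "REF_ADM_2", "REF_ADM_3")

# Separate, anchored patterns for raw and ref
hit_raw <- grep("^ADM_[[:digit:]]", raw_names, value = TRUE)
hit_ref <- grep("^REF_ADM_[[:digit:]]", ref_names, value = TRUE)

# A single unanchored pattern matches the hierarchical columns in both
hit_ref_shared <- grep("ADM_[[:digit:]]", ref_names, value = TRUE)

# Anchoring the end of the pattern excludes the population column
hit_raw_strict <- grep("^ADM_[[:digit:]]$", raw_names, value = TRUE)
```

Here hit_raw still contains "ADM_1_POP", whereas hit_raw_strict contains only the three hierarchical columns.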
If the hierarchical columns cannot easily be matched with a regex pattern, one can specify the relevant column names in vector form using arguments by and by_ref. As with pattern_ref, argument by_ref only needs to be specified if it's different from by (i.e. if the hierarchical columns have different names in raw vs. ref).
For example, if the hierarchical columns in raw are "state", "county", and "township", which correspond respectively to columns within ref named "admin1", "admin2", and "admin3", then the by arguments can be specified with:
by = c("state", "county", "township")
by_ref = c("admin1", "admin2", "admin3")
Prior to matching raw and reference datasets, one might wish to standardize the strings within the match columns to account for differences in case, punctuation, etc.
By default, this standardization is performed with function string_std, which implements four transformations:
standardize case (base::tolower)
remove sequences of non-alphanumeric characters at start or end of string
replace remaining sequences of non-alphanumeric characters with "_"
remove diacritics (stringi::stri_trans_general)
(optional) convert roman numerals (I, II, ..., XLIX) to arabic (1, 2, ..., 49)
Alternatively, the user may provide any function that takes a vector of strings and returns a vector of transformed strings. To omit any transformation, set argument std_fn = NULL.
Note that the standardized versions of the match columns are never returned. They are used only during matching, and then removed prior to the return.
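As an illustration of a user-supplied standardization function, the sketch below defines a simplified stand-in in base R. The name simple_std() is hypothetical; unlike string_std it does not remove diacritics or convert roman numerals, and it has only been checked here on ASCII input.

```r
# Hypothetical, simplified standardization function (not string_std itself).
# Takes a character vector and returns a character vector of the same length.
simple_std <- function(x) {
  x <- tolower(x)                                    # standardize case
  x <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", x)  # trim non-alphanumerics
  gsub("[^[:alnum:]]+", "_", x)                      # collapse the rest to "_"
}

std_out <- simple_std(c("United STATES", "  Mungindu-II (Sud) "))
```

Because it takes a vector of strings and returns a vector of transformed strings, a function of this shape could be supplied through the std_fn argument.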
Standardizes strings prior to performing a match, using the following transformations:
standardize case (base::tolower)
remove sequences of non-alphanumeric characters at start or end of string
replace remaining sequences of non-alphanumeric characters with "_"
remove diacritics (stringi::stri_trans_general)
(optional) convert roman numerals (I, II, ..., XLIX) to arabic (1, 2, ..., 49)
string_std(x, convert_roman = FALSE)
x |
a string |
convert_roman |
logical indicating whether to convert roman numerals (I, II, ..., XLIX) to arabic (1, 2, ..., 49) |
The standardized version of x
string_std("United STATES")
string_std("R\u00e9publique d\u00e9mocratique du Congo")

# convert roman numerals to arabic
string_std("Mungindu-II (Sud)")
string_std("Mungindu-II (Sud)", convert_roman = TRUE)

# note the conversion only works if the numeral is separated from other
# alphanumeric characters by punctuation or space characters
string_std("MunginduII", convert_roman = TRUE) # roman numeral not recognized