Package 'dbc' reference manual

Title:	Dictionary-Based Cleaning
Description:	Tools for dictionary-based data cleaning.
Authors:	Patrick Barks [aut, cre]
Maintainer:	Patrick Barks <[email protected]>
License:	MIT + file LICENSE
Version:	0.0.0.9000
Built:	2025-03-23 05:55:01 UTC
Source:	https://github.com/epicentre-msf/dbc

Produce a dictionary of non-valid categorical values within a dataset, for use in subsequent data cleaning

Description

Values are compared against a user-provided dictionary specifying the allowed values of each categorical variable, after text standardization to account for minor differences in character case, spacing, and punctuation.

The resulting cleaning dictionary can then be manually reviewed to fill in appropriate replacement values for each non-valid categorical value, or a missing-value keyword indicating that the value should be converted to NA.

Usage

check_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  fn = std_text,
  allow_na = TRUE,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)
check_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  fn = std_text,
  allow_na = TRUE,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)

Arguments

`x`	A data frame with one or more columns to check
`dict_allowed`	Dictionary of allowed values for each variable of interest. Must include columns for "variable" and "value" (the names of which can be modified with args `col_allowed_var` and `col_allowed_value`).
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. from a previous run of this function). Must include columns "variable", "value", "replacement", and, if specified as an argument, all columns specified by `vars_id`.
`vars_id`	Optional vector of one or more ID columns within `x` on which corrections should be conditional. If not specified the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.
`col_allowed_var`	Name of column in `dict_allowed` giving variable name (defaults to "variable")
`col_allowed_value`	Name of column in `dict_allowed` giving allowed values (defaults to "value")
`fn`	Function to standardize raw values in both the dataset and dictionary prior to comparing, to account for minor variation in character case, spacing, punctuation, etc. Defaults to `std_text`. To omit the standardization step can use e.g. `as.character` or an identity function `function(x) x`.
`allow_na`	Logical indicating whether missing values should always be treated as 'allowed' even if not explicitly specified in `dict_allowed`. Defaults to `TRUE`.
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.
`populate_na`	Logical indicating whether to pre-populate column "replacement" with values specified by keyword `na`. If most non-valid values in `x` are non-correctable, pre-populating the keyword `na` can save time during the manual verification/correction phase. Defaults to `FALSE`.
`return_all`	Logical indicating whether to return all non-valid values including those already specified in argument `dict_clean` (if specified) (`TRUE`), or only the new non-valid entries not already specified in `dict_clean` (`FALSE`). Defaults to `FALSE`.

Value

Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:

columns specified in vars_id, if given
variable: column name of variable within x
value: non-valid value
replacement: correct value that should replace a given non-valid value
new: logical indicating whether the entry is new (TRUE) or already specified in argument dict_clean (⁠<NA>⁠)

Examples

# load example dataset, and dictionary of allowed categorical values
data(ll1)
data(dict_categ1)

# basic output
check_categorical(ll1, dict_allowed = dict_categ1)

# load example dataset, and dictionary of allowed categorical values
data(ll1)
data(dict_categ1)

# basic output
check_categorical(ll1, dict_allowed = dict_categ1)

Produce a dictionary of non-valid date values within a dataset, for use in subsequent data cleaning

Description

The resulting cleaning dictionary can be manually reviewed to fill in appropriate replacement values for each non-valid date value, or a missing-value keyword indicating that the value should be converted to NA, and then used with function clean_dates.

Similar to check_numeric, values are considered 'non-valid' if they cannot be coerced using a given function. The default date-coercing function is parse_dates, which can handle a wide variety of date formats, but the user could alternatively specify a simpler function like as.Date. The user may also specify additional expressions that would indicate a non-valid date value. For example, the expression date_admit > Sys.Date() could be used to check for admission dates in the future.

Usage

check_dates(
  x,
  vars,
  vars_id,
  queries = list(),
  dict_clean = NULL,
  fn = parse_dates,
  na = ".na",
  populate_na = FALSE
)
check_dates(
  x,
  vars,
  vars_id,
  queries = list(),
  dict_clean = NULL,
  fn = parse_dates,
  na = ".na",
  populate_na = FALSE
)

Arguments

`x`	A data frame with one or more columns to check
`vars`	Names of date columns within `x` to check
`vars_id`	Vector of one or more ID columns within `x` on which corrections should be conditional.
`queries`	Optional list of expressions to check for non-valid dates. May include a `.x` selector which is a stand-in for any of the date variables specified in argument `vars`. E.g. list( date_admit > date_exit, # admission later than exit .x > Sys.Date() # any date in future )
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. produced by a prior run of `check_dates`). Must include columns "variable", "value", "replacement", and all columns specified by `vars_id`.
`fn`	Function to parse raw date values. Defaults to `parse_dates`. Any value not coercible by `fn` will be flagged as a "Non-valid date".
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.
`populate_na`	Logical indicating whether to pre-populate column "replacement" with values specified by keyword `na`, for queries of type "Non-valid date". If most non-valid dates in `x` are non-correctable, pre-populating the keyword `na` can save time during the manual verification/correction phase. Defaults to `FALSE`.

Value

Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:

columns specified in vars_id
variable: column name of date variable within x
value: raw date value
date: parsed date value
replacement: correct value that should replace a given non-valid value
query: which query was triggered by the given raw date value (if any)

Note that, unlike functions check_numeric and check_categorical, which only return rows corresponding to non-valid values, this function returns all date values corresponding to any observation (i.e. row) with at least one non-valid date value. This is to provide context for the non-valid value and aid in making the appropriate correction.

Examples

# load example dataset
data(ll1)

# basic output
check_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id"
)

# add additional queries to evaluate
check_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id",
  queries = list(
    date_onset > date_admit,
    date_admit > date_exit,
    .x > as.Date("2021-01-01")
  )
)

# load example dataset
data(ll1)

# basic output
check_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id"
)

# add additional queries to evaluate
check_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id",
  queries = list(
    date_onset > date_admit,
    date_admit > date_exit,
    .x > as.Date("2021-01-01")
  )
)

Produce a dictionary of non-valid numeric values within a dataset, for use in subsequent data cleaning

Description

The resulting cleaning dictionary can then be manually reviewed to fill in appropriate replacement values for each non-valid numeric value, or a missing-value keyword indicating that the value should be converted to NA.

Usage

check_numeric(
  x,
  vars,
  vars_id = NULL,
  queries = list(),
  dict_clean = NULL,
  fn = as.numeric,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)
check_numeric(
  x,
  vars,
  vars_id = NULL,
  queries = list(),
  dict_clean = NULL,
  fn = as.numeric,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)

Arguments

`x`	A data frame with one or more columns to check
`vars`	Names of columns within `x` to check
`vars_id`	Optional vector of one or more ID columns within `x` on which corrections should be conditional. If not specified the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.
`queries`	Optional list of expressions to check for non-valid values. May include a `.x` selector which is a stand-in for any of the numeric variables specified in argument `vars`. E.g. list( age > 110, # age greater than 110 .x < 0 # any numeric value less than 0 )
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. from a previous run of this function). Must include columns "variable", "value", "replacement", and, if specified as an argument, all columns specified by `vars_id`.
`fn`	Function to convert values to numeric. Defaults to `as.numeric`.
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.
`populate_na`	Logical indicating whether to pre-populate column "replacement" with values specified by keyword `na`. If most non-valid values in `x` are non-correctable, pre-populating the keyword `na` can save time during the manual verification/correction phase. Defaults to `FALSE`.
`return_all`	Logical indicating whether to return all non-valid values including those already specified in argument `dict_clean` (if specified) (`TRUE`), or only the new non-valid entries not already specified in `dict_clean` (`FALSE`). Defaults to `FALSE`.

Value

Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:

columns specified in vars_id, if given
variable: column name of variable within x
value: non-valid value
replacement: correct value that should replace a given non-valid value
new: logical indicating whether the entry is new (TRUE) or already specified in argument dict_clean (⁠<NA>⁠)

Examples

# load example dataset
data(ll1)
data(clean_num1)

# basic output
check_numeric(ll1, c("age", "contacts"))

# include id var "id"
check_numeric(ll1, c("age", "contacts"), vars_id = "id")

# add custom query
check_numeric(ll1, c("age", "contacts"), vars_id = "id", queries = list(age > 90))

# prepopulate column 'replacement'
check_numeric(ll1, c("age", "contacts"), vars_id = "id", populate_na = TRUE)

# use dictionary of pre-specified corrections
check_numeric(ll1, c("age", "contacts"), dict_clean = clean_num1)

# load example dataset
data(ll1)
data(clean_num1)

# basic output
check_numeric(ll1, c("age", "contacts"))

# include id var "id"
check_numeric(ll1, c("age", "contacts"), vars_id = "id")

# add custom query
check_numeric(ll1, c("age", "contacts"), vars_id = "id", queries = list(age > 90))

# prepopulate column 'replacement'
check_numeric(ll1, c("age", "contacts"), vars_id = "id", populate_na = TRUE)

# use dictionary of pre-specified corrections
check_numeric(ll1, c("age", "contacts"), dict_clean = clean_num1)

A cleaning dictionary for categorical variables in example dataset `ll1`

Description

A cleaning dictionary for categorical variables in example dataset ll1

Usage

clean_categ1
clean_categ1

Format

A data.frame with 7 rows and 4 variables:

variable: Column name within dataset
value: Non-valid numeric value
replacement: Replacement value for given non-valid value
new: Logical indicating whether the dictionary entry is new

Clean categorical variables within a dataset based on a dictionary of value-replacement pairs

Description

Applies a dictionary of value-replacement pairs to clean and standardize values of categorical variables. Includes options for text standardization to standardize minor differences in character case, spacing, and punctuation.

Usage

clean_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  non_allowed_to_missing = TRUE,
  fn = std_text,
  na = ".na"
)
clean_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  non_allowed_to_missing = TRUE,
  fn = std_text,
  na = ".na"
)

Arguments

`x`	A data frame with one or more columns to clean
`dict_allowed`	Dictionary of allowed values for each variable of interest. Must include columns for "variable" and "value" (the names of which can be modified with args `col_allowed_var` and `col_allowed_value`).
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. produced by `check_categorical`). Must include columns "variable", "value", "replacement", and, if specified as an argument, all columns specified by `vars_id`.
`vars_id`	Optional vector of one or more ID columns within `x` on which corrections should be conditional. If not specified the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.
`col_allowed_var`	Name of column in `dict_allowed` giving variable name (defaults to "variable")
`col_allowed_value`	Name of column in `dict_allowed` giving allowed values (defaults to "value")
`non_allowed_to_missing`	Logical indicating whether to replace values that remain non-allowed, even after cleaning and standardization, to NA. Defaults to TRUE. If no dictionary is provided, will simply standardize columns to match allowed values specified in `dict_allowed`.
`fn`	Function to standardize raw values in both the dataset and dictionary prior to comparing, to account for minor variation in character case, spacing, punctuation, etc. Defaults to `std_text`. To omit the standardization step can use e.g. `as.character` or an identity function `function(x) x`.
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.

Value

The original data frame x but with cleaned versions of the categorical variables specified in argument dict_allowed

Examples

# load example dataset, dictionary of allowed categorical values, and
# cleaning dictionary
data(ll1)
data(dict_categ1)
data(clean_categ1)

# dictionary-based corrections to categorical vars
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1
)

# require exact matching, including character case
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1,
  fn = identity
)

# apply standardization to dict_allowed but no additional dict-based cleaning
clean_categorical(
  ll1,
  dict_allowed = dict_categ1
)

# load example dataset, dictionary of allowed categorical values, and
# cleaning dictionary
data(ll1)
data(dict_categ1)
data(clean_categ1)

# dictionary-based corrections to categorical vars
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1
)

# require exact matching, including character case
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1,
  fn = identity
)

# apply standardization to dict_allowed but no additional dict-based cleaning
clean_categorical(
  ll1,
  dict_allowed = dict_categ1
)

Clean date variables within a dataset based on a dictionary of value-replacement pairs

Description

Applies a dictionary of value-replacement pairs and a conversion function (defaults to parse_dates) to clean and standardize values of date variables. To use this approach the date columns of the original dataset should generally be imported as type "text" or "character" so that non-valid values are not automatically coerced to missing values on import.

Usage

clean_dates(x, vars, vars_id, dict_clean = NULL, fn = parse_dates, na = ".na")
clean_dates(x, vars, vars_id, dict_clean = NULL, fn = parse_dates, na = ".na")

Arguments

`x`	A data frame with one or more date columns to clean
`vars`	Names of date columns within `x` to clean
`vars_id`	Vector of one or more ID columns within `x` on which corrections should be conditional.
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. produced by a prior run of `check_dates`). Must include columns "variable", "value", "replacement", and all columns specified by `vars_id`.
`fn`	Function to parse raw date values. Defaults to `parse_dates`.
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.

Value

The original data frame x but with cleaned versions of the date variables specified in argument vars

Examples

# load example dataset and cleaning dictionary
data(ll1)
data(clean_dates1)

# clean dates using only date coercion function
clean_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id"
)

# clean dates using dictionary and coercion function
clean_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id",
  dict_clean = clean_dates1
)

# load example dataset and cleaning dictionary
data(ll1)
data(clean_dates1)

# clean dates using only date coercion function
clean_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id"
)

# clean dates using dictionary and coercion function
clean_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id",
  dict_clean = clean_dates1
)

A cleaning dictionary for categorical variables in example dataset `ll1`

Description

A cleaning dictionary for categorical variables in example dataset ll1

Usage

clean_dates1
clean_dates1

Format

A data.frame with 12 rows and 6 variables:

id: ID column within raw dataset
variable: Column name within raw dataset
value: Raw date value
date: Parsed date value
replacement: Replacement value for given non-valid value
query: Which query was triggered by the given raw date value (if any)

A cleaning dictionary for numeric variables in example dataset `ll1`

Description

A cleaning dictionary for numeric variables in example dataset ll1

Usage

clean_num1
clean_num1

Format

A data.frame with 4 rows and 4 variables:

variable: Column name within dataset
value: Non-valid numeric value
replacement: Replacement value for given non-valid value
new: Logical indicating whether the dictionary entry is new

Clean numeric variables within a dataset based on a dictionary of value-replacement pairs

Description

Applies a dictionary of value-replacement pairs and a conversion function (defaults to as.numeric) to clean and standardize values of numeric variables. To use this approach the numeric columns of the original dataset should generally be imported as type "text" or "character" so that non-valid values are not automatically coerced to missing values on import.

Usage

clean_numeric(
  x,
  vars,
  vars_id = NULL,
  dict_clean = NULL,
  fn = as.numeric,
  na = ".na"
)
clean_numeric(
  x,
  vars,
  vars_id = NULL,
  dict_clean = NULL,
  fn = as.numeric,
  na = ".na"
)

Arguments

`x`	A data frame with one or more columns to clean
`vars`	Names of columns within `x` to clean
`vars_id`	Optional vector of one or more ID columns within `x` on which corrections should be conditional. If not specified the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. produced by `check_numeric`). If provided, must include columns "variable", "value", "replacement", and, if specified as an argument, all columns specified by `vars_id`. If no dictionary is provided, will simply apply the conversion function to all columns specified in `vars`.
`fn`	Function to convert values to numeric. Defaults to `as.numeric`.
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.

Value

The original data frame x but with cleaned versions of columns vars

Examples

# load example dataset and dictionary of value-replacement pairs
data(ll1)
data(clean_num1)

# dictionary-based corrections to numeric vars 'age' and 'contacts'
clean_numeric(
  ll1,
  vars = c("age", "contacts"),
  dict_clean = clean_num1
)

# apply standardization with as.integer() rather than default as.numeric()
clean_numeric(
  ll1,
  vars = c("age", "contacts"),
  dict_clean = clean_num1,
  fn = as.integer
)

# apply standardization but no dictionary-based cleaning
clean_numeric(
  ll1,
  vars = c("age", "contacts")
)

# load example dataset and dictionary of value-replacement pairs
data(ll1)
data(clean_num1)

# dictionary-based corrections to numeric vars 'age' and 'contacts'
clean_numeric(
  ll1,
  vars = c("age", "contacts"),
  dict_clean = clean_num1
)

# apply standardization with as.integer() rather than default as.numeric()
clean_numeric(
  ll1,
  vars = c("age", "contacts"),
  dict_clean = clean_num1,
  fn = as.integer
)

# apply standardization but no dictionary-based cleaning
clean_numeric(
  ll1,
  vars = c("age", "contacts")
)

A dictionary of allowed values for categorical variables in example dataset `ll1`

Description

A dictionary of allowed values for categorical variables in example dataset ll1

Usage

dict_categ1
dict_categ1

Format

A data.frame with 13 rows and 2 variables:

variable: Column name within dataset
value: Allowed categorical values for given column

An example messy dataset to clean

Description

An example messy dataset to clean

Usage

ll1
ll1

Format

A data.frame with 7 rows and 10 variables:

id: Patient identifier
age: Age value
age_unit: Age units
sex: Patient sex
status: Patient status
contacts: Number of epidemiological contacts
date_onset: Date of symptom onset
date_admit: Date of admission to hospital
date_exit: Date of exit from hospital
exit_status: Patient outcome status

An example messy dataset to clean, an extension of `ll1`

Description

An example messy dataset to clean, an extension of ll1

Usage

ll2
ll2

Format

A data.frame with 10 rows and 10 variables:

id: Patient identifier
age: Age value
age_unit: Age units
sex: Patient sex
status: Patient status
contacts: Number of epidemiological contacts
date_onset: Date of symptom onset
date_admit: Date of admission to hospital
date_exit: Date of exit from hospital
exit_status: Patient outcome status

Parse dates

Description

Parse dates

Usage

parse_dates(
  x,
  convert_excel = TRUE,
  orders = c("Ymd", "dmY", "dmy", "mdY", "Ymd HMS")
)
parse_dates(
  x,
  convert_excel = TRUE,
  orders = c("Ymd", "dmY", "dmy", "mdY", "Ymd HMS")
)

Arguments

`x`	A character or numeric vector of dates
`convert_excel`	Logical indicating whether to convert Excel-encoded date values (e.g. "42370") into dates, using janitor::excel_numeric_to_date
`orders`	a character vector of date-time formats. Each order string is a series of formatting characters as listed in `base::strptime()` but might not include the `"%"` prefix. For example, "ymd" will match all the possible dates in year, month, day order. Formatting orders might include arbitrary separators. These are discarded. See details for the implemented formats. If multiple order strings are supplied, they are applied in turn for `parse_date_time2()` and `fast_strptime()`. For `parse_date_time()` the order of applied formats is determined by `select_formats` parameter.

Value

A vector of class "Date". Values that cannot be converted to valid dates will be returned as ⁠<NA>⁠.

Examples

x <- c("44087", "12//02/2019", "2020_05_14", "2021-01-30 14:00:04")
parse_dates(x)

x <- c("44087", "12//02/2019", "2020_05_14", "2021-01-30 14:00:04")
parse_dates(x)

Standardize text prior to matching to account for minor variation in character case, spacing, punctuation, or use of accents

Description

Implements the following transformations:

standardize case (base::tolower)
remove diacritic/accent characters (stringi::stri_trans_general)
remove sequences of space or punctuation characters at start or end of string
replace repeated whitespace characters with a single space

Usage

std_text(x)
std_text(x)

Arguments

`x`	A vector of strings

Value

The standardized version of x

Examples

std_text(c("CONFIRMED", "Conf.", "confirmed"))
std_text(c("R\u00e9publique d\u00e9mocratique du  Congo", "Nigeria_"))
std_text(c("CONFIRMED", "Conf.", "confirmed"))
std_text(c("R\u00e9publique d\u00e9mocratique du  Congo", "Nigeria_"))

Package 'dbc'

Help Index

Produce a dictionary of non-valid categorical values within a dataset, for use in subsequent data cleaning

Description

Usage

Arguments

Value

Examples

Produce a dictionary of non-valid date values within a dataset, for use in subsequent data cleaning

Description

Usage

Arguments

Value

Examples

Produce a dictionary of non-valid numeric values within a dataset, for use in subsequent data cleaning

Description

Usage

Arguments

Value

Examples

A cleaning dictionary for categorical variables in example dataset ll1

Description

Usage

Format

Clean categorical variables within a dataset based on a dictionary of value-replacement pairs

Description

Usage

Arguments

Value

Examples

Clean date variables within a dataset based on a dictionary of value-replacement pairs

Description

Usage

Arguments

Value

Examples

A cleaning dictionary for categorical variables in example dataset ll1

Description

Usage

Format

A cleaning dictionary for numeric variables in example dataset ll1

Description

Usage

Format

Clean numeric variables within a dataset based on a dictionary of value-replacement pairs

Description

Usage

Arguments

Value

Examples

A dictionary of allowed values for categorical variables in example dataset ll1

Description

Usage

Format

An example messy dataset to clean

Description

Usage

Format

An example messy dataset to clean, an extension of ll1

Description

Usage

Format

Parse dates

Description

Usage

Arguments

Value

Examples

Standardize text prior to matching to account for minor variation in character case, spacing, punctuation, or use of accents

Description

Usage

Arguments

Value

Examples

A cleaning dictionary for categorical variables in example dataset `ll1`

A cleaning dictionary for categorical variables in example dataset `ll1`

A cleaning dictionary for numeric variables in example dataset `ll1`

A dictionary of allowed values for categorical variables in example dataset `ll1`

An example messy dataset to clean, an extension of `ll1`