Package 'queryr' reference manual

Title:	Data Validation Queries With Tidy Output
Description:	Data validation queries with tidy, stackable output.
Authors:	Patrick Barks [aut, cre]
Maintainer:	Patrick Barks <[email protected]>
License:	MIT + file LICENSE
Version:	0.1.0.9000
Built:	2025-03-23 05:55:01 UTC
Source:	https://github.com/epicentre-msf/queryr

Example dataset, an epidemiological linelist from a treatment centre

Description

Example dataset, an epidemiological linelist from a treatment centre

Usage

ll
ll

Format

A data.frame with 12 rows and 10 variables:

id: Patient identifier
site: Site identifier
age: Patient age in years
status: Patient status
date_onset: Date of symptom onset
date_admit: Date of admission to treatment centre
date_lab: Date of laboratory test
lab_result: Result of laboratory test
date_exit: Date of exit from treatment centre
outcome: Patient outcome

Example set of queries to run on `ll` using `query_vec`

Description

Example set of queries to run on ll using query_vec

Usage

ll_queries
ll_queries

Format

A data.frame with 5 rows and 2 variables:

query_id: Query IDs
query: Query expressions in string format

Data validation queries with tidy, stackable output

Description

Find observations within a data frame matching a given query (a logical expression relating to one or more variables), and return tidy output that can be stacked across different queries on different variables. Stackability is achieved by pivoting the columns indicated in the query expression to long-form, e.g. "variable1", "value1", "variable2", "value2", ...

The query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (see section Using a dot-selector). If both selectors are used in a given query expression, the sets of variables they respectively match can either be "crossed" such that all combinations are evaluated, or evaluated in parallel.

By default, only the data columns referenced in the query expression are returned, but additional columns can optionally be added with argument cols_base.

Usage

query(
  data,
  cond,
  cols_dotx,
  cols_doty,
  crossed = FALSE,
  cols_base,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE,
  count = FALSE
)
query(
  data,
  cond,
  cols_dotx,
  cols_doty,
  crossed = FALSE,
  cols_base,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE,
  count = FALSE
)

Arguments

`data`	A data frame
`cond`	An expression to evaluate with respect to variables within `data`. Can specify multiple variables using a dot-selector ("`.x`" and "`.y`") within the expression (e.g. `.x > 0`) and then separately specifying the columns that the selector refers to with arguments `cols_dotx`/`cols_doty`.
`cols_dotx`, `cols_doty`	Tidy-selection of one or more columns represented by a .x or .y selector. Only used if `cond` contains the relevant selector. See section Using a dot-selector below.
`crossed`	if `cond` contains both a .x and .y selector, should the variables matched by `cols_dotx` and `cols_doty` be "crossed" such that all combinations are evaluated (`TRUE`), or should they be evaluated in parallel (`FALSE`). The latter requires that the number of variables matched by `cols_dotx` and `cols_doty` is the same. Defaults to `FALSE`.
`cols_base`	(Optional) Tidy-selection of other columns within `data` to retain in the output. Can optionally be set for an entire session using option "queryr_cols_base", e.g. `options(queryr_cols_base = quote(id:site))`.
`pivot_long`	Logical indicating whether to pivot the variables referenced within `cond` to a long (i.e. stackable) format, with default column names "variable1", "value1", "variable2", "value2", ... Defaults to `TRUE`.
`pivot_var`	Prefix for pivoted variable column(s). Defaults to "variable". Only used if `pivot_long = TRUE`.
`pivot_val`	Prefix for pivoted value column(s). Defaults to "value". Only used if `pivot_long = TRUE`.
`as_chr`	Logical indicating whether to coerce the columns referenced in the query expression `cond` to character prior to returning. This enables row-binding multiple queries with variables of different classes, but is only important if `pivot_long = TRUE`. Defaults to `TRUE`.
`count`	Logical indicating whether to summarize the output by counting the number of unique combinations across all returned columns (with count column "n"). Defaults to `FALSE`.

Value

A data frame reflecting the rows of data that match the given query. Returned columns include:

(optional) columns matched by argument cols_base
columns referenced within the query expression (pivoted to long form by default)
(optional) count column "n" (if count = TRUE)

Using a dot-selector

A query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (arguments cols_dotx and cols_doty).

When cond contains a .x selector, the query expression is evaluated repeatedly with each relevant variable from cols_dotx individually substituted into the .x position of the expression. The results of these multiple 'subqueries' are then combined with dplyr::bind_rows.

If cond contains both a .x and .y selector, the sets of variables matched by cols_dotx and cols_doty respectively can either be "crossed" such that all combinations are evaluated, or evaluated in parallel. Evaluating in parallel requires that the number of variables matched by cols_dotx and cols_doty is the same.

Consider a hypothetical query checking that, if a patient has a particular symptom, the date of onset of that symptom is not missing. E.g.
cond = .x == "Yes" & is.na(.y)
cols_dotx = c(symptom_fever, symptom_headache)
cols_doty = c(date_symptom_fever, date_symptom_headache)

If argument crossed is FALSE, the relevant variables from cols_dotx and cols_doty will be evaluated in parallel, as in:
has_symptom_fever == "Yes" & is.na(date_symptom_fever)
has_symptom_headache == "Yes" & is.na(date_symptom_headache)

Conversely, if argument crossed is TRUE, all combinations of the relevant variables will be evaluated, which for this particular query wouldn't make sense:
symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_fever == "Yes" & is.na(date_symptom_headache) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_fever) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_headache)

Note that if a dot-selector is used with argument pivot_long = FALSE, the row-binding of multiple subqueries may result in a sparse output with respect to the variables represented by the dot-selector, because for each subquery only the columns matched by expression cond are returned.

Examples

# load example dataset, an epidemiological 'linelist'
data(ll)

# find observations where date_exit is earlier than date_admit
query(
  ll,
  date_exit < date_admit,
  cols_base = id:site
)

# find any date value in the future using a .x column selector
query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = id:site
)

# incorporate an external object into the query expression
lab_result_valid <- c("Positive", "Negative", "Inc.", NA)

query(
  ll,
  !lab_result %in% lab_result_valid,
  cols_base = id:site,
)

# load example dataset, an epidemiological 'linelist'
data(ll)

# find observations where date_exit is earlier than date_admit
query(
  ll,
  date_exit < date_admit,
  cols_base = id:site
)

# find any date value in the future using a .x column selector
query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = id:site
)

# incorporate an external object into the query expression
lab_result_valid <- c("Positive", "Negative", "Inc.", NA)

query(
  ll,
  !lab_result %in% lab_result_valid,
  cols_base = id:site,
)

Data validation queries across a list of data frames

Description

Find observations matching a query that concerns one or more data frames within a list of data frames, and return tidy, stackable output. Like query but enables query expressions that reference variables in multiple data frames.

If the query expression references variables from data frames (i.e. list elements) other than the focal element, the relevant variable(s) will be joined to the focal element before the query expression is evaluated, see arguments join_type and join_by below.

Usage

query_list(
  x,
  cond,
  element,
  cols_base,
  join_type = "left",
  join_by,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE
)
query_list(
  x,
  cond,
  element,
  cols_base,
  join_type = "left",
  join_by,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE
)

Arguments

`x`	A list of data frames
`cond`	Expression to evaluate with respect to one or more variables in one or more of the data frames within `x`.
`element`	Name or integer index of the focal list element of `x` for the given query. If the query expression `cond` references variables from list elements apart from `element`, the relevant variable(s) will be joined to `x[[element]]` before the query expression is evaluated, based on the `join_type` and `join_by` arguments described below.
`cols_base`	(Optional) Tidy-selection of other columns within `data` to retain in the output. Can optionally be set for an entire session using option "queryr_cols_base", e.g. `options(queryr_cols_base = quote(id:site))`.
`join_type`	If `cond` references variables within elements of `x` apart from `x[[element]]`, what type of join should be used to join the relevant elements? Options are "left" (the default) and "inner". Based on dplyr `join` types.
`join_by`	A character vector of variables to join by. If the join key columns have different names in `x[[element]]` and `x[[other]]`, use a named vector. For example, `join_by = c("a" = "b")` will match `x[[element]]$a` to `x[[other]]$b`.
`pivot_long`	Logical indicating whether to pivot the variables referenced within `cond` to a long (i.e. stackable) format, with default column names "variable1", "value1", "variable2", "value2", ... Defaults to `TRUE`.
`pivot_var`	Prefix for pivoted variable column(s). Defaults to "variable". Only used if `pivot_long = TRUE`.
`pivot_val`	Prefix for pivoted value column(s). Defaults to "value". Only used if `pivot_long = TRUE`.
`as_chr`	Logical indicating whether to coerce the columns referenced in the query expression `cond` to character prior to returning. This enables row-binding multiple queries with variables of different classes, but is only important if `pivot_long = TRUE`. Defaults to `TRUE`.

Value

A data frame reflecting the rows of x[[element]] that match the given query. Returned columns include:

(optional) columns matched by argument cols_base
columns referenced within the query expression (pivoted to long form by default)

Data validation queries vectorized over multiple query expressions

Description

Data validation queries with query or query_list, but vectorized over a set of query expressions in string format (and optionally a corresponding vector of query names/IDs). Results of the multiple queries are stacked and returned in a single tidy data frame, with columns referenced in the query expressions pivoted to long-form (e.g. "variable1", "value1", "variable2", "value2", ...).

Usage

query_vec(
  x,
  cond,
  element,
  name,
  cols_base,
  name_col = "query_id",
  join_type = "left",
  join_by = NULL,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE
)
query_vec(
  x,
  cond,
  element,
  name,
  cols_base,
  name_col = "query_id",
  join_type = "left",
  join_by = NULL,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE
)

Arguments

`x`	A data frame or a list of data frames to query. If a single data frame will vectorize with `query`, whereas given a list of data frames will use `query_list`.
`cond`	Character vector of expressions to evaluate with respect to variables within `x`.
`element`	If `x` is a list of data frames, the names or integer indexes of the focal list element of `x` corresponding to each query expression (i.e. each element of `cond`). Only used if `x` is a list of data frames (see `query_list`).
`name`	(Optional) Character vector giving query names/IDs for each of the expressions within `cond`. If missing the expressions themselves (in string format) are used as names.
`cols_base`	(Optional) Tidy-selection of other columns within `x` (or `x[[element]]`) to retain in the final output. Can be set for an entire session using option "queryr_cols_base", e.g. `options(queryr_cols_base = quote(id:site))`.
`name_col`	Column name for the query names/IDs. Defaults to "query_id".
`join_type`	If `x` is a list of data frames and `cond` references variables within elements of `x` apart from `x[[element]]`, what type of join should be used to join the relevant elements? Options are "left" (the default) and "inner". Based on dplyr `join` types. Can specify different join types for different query expressions by passing a vector the same length as `cond`.
`join_by`	A character vector of variables to join by, or list of vectors the same length as `cond`. If the join key columns have different names in `x[[element]]` and `x[[other]]`, use a named vector. For example, `join_by = c("a" = "b")` will match `x[[element]]$a` to `x[[other]]$b`. Can specify different join columns for different query expressions by passing a list of vectors the same length as `cond`.
`pivot_var`	Prefix for pivoted variable column(s). Defaults to "variable".
`pivot_val`	Prefix for pivoted value column(s). Defaults to "value".
`as_chr`	Logical indicating whether to coerce the columns referenced in the query expression(s) to character prior to returning. This enables row-binding multiple queries with variables of different classes. Defaults to `TRUE`.

Value

A data frame reflecting the rows of data that match the given queries. Returned columns include:

query name/ID column (name taken from argument name_col)
(optional) columns matched by argument cols_base
columns referenced within the query expressions, pivoted to long form

Examples

data(ll)          # example dataset, an epidemiological linelist
data(ll_queries)  # example data frame defining queries to run on ll

# run all queries defined in ll_queries
query_vec(
  ll,
  cond = ll_queries$query,
  name = ll_queries$query_id,
  cols_base = c(id, site)
)

data(ll)          # example dataset, an epidemiological linelist
data(ll_queries)  # example data frame defining queries to run on ll

# run all queries defined in ll_queries
query_vec(
  ll,
  cond = ll_queries$query,
  name = ll_queries$query_id,
  cols_base = c(id, site)
)

Data validation queries across two data frames

Description

Find observations matching a query that concerns two data frames, and return tidy, stackable output. Entails three steps:

separately query each of the two data frames using query
combine the resulting query outputs based on a given join type (semi, anti, left, or inner)
execute a third query on the joined output

Each of the query steps is optional — unspecified query expressions are replaced with TRUE such that all rows of the relevant input are returned.

Usage

query2(
  data1,
  data2,
  cond1,
  cond2,
  cols_base1,
  cols_base2,
  join_type,
  join_by,
  cond3,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE
)
query2(
  data1,
  data2,
  cond1,
  cond2,
  cols_base1,
  cols_base2,
  join_type,
  join_by,
  cond3,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE
)

Arguments

`data1`	Data frame to query (#1)
`data2`	Data frame to query (#2)
`cond1`	(Optional) Expression to evaluate with respect to `data1`. If missing will be set to `TRUE` to select all rows.
`cond2`	(Optional) Expression to evaluate with respect to `data2`. If missing will be set to `TRUE` to select all rows.
`cols_base1`	(Optional) Tidy-selection of other columns within `data1` to retain in the final output. Can be set for an entire session using option "queryr_cols_base", e.g. `options(queryr_cols_base = quote(id:site))`.
`cols_base2`	(Optional) Tidy-selection of other columns within `data2` to retain in the final output.
`join_type`	How to join the output from the two initial queries ("semi", "anti", "left", or "inner"). Based on dplyr `join` types.
`join_by`	A character vector of variables to join by. If the join key columns have different names in `data1` and `data2`, use a named vector. For example, `by = c("a" = "b")` will match `data1$a` to `data2$b`.
`cond3`	(Optional) Expression to evaluate with respect to the joined output of the two initial queries. If missing will be set to `TRUE` to select all rows. Note that if `join_type` is a filtering join ("anti" or "semi"), only variables from `data1` can be referenced in `cond3` (referencing a variable that only exists in `data2` will result in an error). If `join_type` is instead a mutating join ("left" or "inner"), all variables from `data1` and `data2` will be available to `cond3`, even if not otherwise referenced with `cond1`/`cond2` or `cols_base1`/`cols_base2`.
`pivot_long`	Logical indicating whether to pivot the variables referenced within the query expression(s) to a long (i.e. stackable) format, with default column names "variable1", "value1", "variable2", "value2", ... Defaults to `TRUE`. If `cond3` is specified and `pivot_long` is `TRUE`, the pivot happens only in the final query (i.e. `cond3`).
`pivot_var`	Prefix for pivoted variable column(s). Defaults to "variable". Only used if `pivot_long = TRUE`.
`pivot_val`	Prefix for pivoted value column(s). Defaults to "value". Only used if `pivot_long = TRUE`.
`as_chr`	Logical indicating whether to coerce the columns referenced in the query expression(s) to character prior to returning. This enables row-binding multiple queries with variables of different classes, but is only important if `pivot_long = TRUE`. Defaults to `TRUE`.

Value

A data frame reflecting the rows of data1 that match the given query. Returned columns include:

Columns matched by argument cols_base1
Columns matched by argument cols_base2 (only if join type is "left" or "inner")
Columns referenced within the relevant condition statements (pivoted to long form by default).

If the join type is a mutating join ("left" or "inner"), variables from data1 or data2 referenced in any of the condition statements (cond1, cond2, or cond3) will appear in the output. However, with a filtering join ("anti" or "semi") only variables from data1 will appear in the output.

Examples

# example datasets: two related epidemiological linelists
data(ll)  # ll from treatment center (all cases, confirmed and non-confirmed)
data(sll) # summary linelist (only confirmed/probable cases)

# find patients in ll that don't appear in sll
query2(
  ll,
  sll,
  cols_base1 = c(id, site, status),
  join_type = "anti",
  join_by = c("id" = "tc_id")
)

# find patients with different outcome status in ll vs sll
query2(
  ll,
  sll,
  cols_base1 = id:site,
  join_type = "inner",
  join_by = c("id" = "tc_id"),
  cond3 = status != sll_status
)

# example datasets: two related epidemiological linelists
data(ll)  # ll from treatment center (all cases, confirmed and non-confirmed)
data(sll) # summary linelist (only confirmed/probable cases)

# find patients in ll that don't appear in sll
query2(
  ll,
  sll,
  cols_base1 = c(id, site, status),
  join_type = "anti",
  join_by = c("id" = "tc_id")
)

# find patients with different outcome status in ll vs sll
query2(
  ll,
  sll,
  cols_base1 = id:site,
  join_type = "inner",
  join_by = c("id" = "tc_id"),
  cond3 = status != sll_status
)

Example dataset, a summary epidemiological linelist containing only confirmed/probable cases

Description

Example dataset, a summary epidemiological linelist containing only confirmed/probable cases

Usage

sll
sll

Format

A data.frame with 10 rows and 8 variables:

sll_id: Patient identifier in the summary linelist
tc_admit: Was patient admitted to a treatment centre?
tc_id: Patient ID at treatment centre
tc_site: Site of treatment centre
sll_age: Date of symptom onset
sll_status: Date of admission to hospital
sll_date_outcome: Date of laboratory test
sll_outcome: Patient outcome

Package 'queryr'

Help Index

Example dataset, an epidemiological linelist from a treatment centre

Description

Usage

Format

Example set of queries to run on `ll` using `query_vec`

Description

Usage

Format

Data validation queries with tidy, stackable output

Description

Usage

Arguments

Value

Using a dot-selector

Examples

Data validation queries across a list of data frames

Description

Usage

Arguments

Value

Data validation queries vectorized over multiple query expressions

Description

Usage

Arguments

Value

See Also

Examples

Data validation queries across two data frames

Description

Usage

Arguments

Value

Examples

Example dataset, a summary epidemiological linelist containing only confirmed/probable cases

Description

Usage

Format

Package 'queryr'

Help Index

Example dataset, an epidemiological linelist from a treatment centre

Description

Usage

Format

Example set of queries to run on ll using query_vec

Description

Usage

Format

Data validation queries with tidy, stackable output

Description

Usage

Arguments

Value

Using a dot-selector

Examples

Data validation queries across a list of data frames

Description

Usage

Arguments

Value

Data validation queries vectorized over multiple query expressions

Description

Usage

Arguments

Value

See Also

Examples

Data validation queries across two data frames

Description

Usage

Arguments

Value

Examples

Example dataset, a summary epidemiological linelist containing only confirmed/probable cases

Description

Usage

Format

Example set of queries to run on `ll` using `query_vec`