Title: | Data Validation Queries With Tidy Output |
---|---|
Description: | Data validation queries with tidy, stackable output. |
Authors: | Patrick Barks [aut, cre] |
Maintainer: | Patrick Barks <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0.9000 |
Built: | 2024-10-24 16:07:24 UTC |
Source: | https://github.com/epicentre-msf/queryr |
Example dataset, an epidemiological linelist from a treatment centre
ll
ll
A data.frame with 12 rows and 10 variables:
Patient identifier
Site identifier
Patient age in years
Patient status
Date of symptom onset
Date of admission to treatment centre
Date of laboratory test
Result of laboratory test
Date of exit from treatment centre
Patient outcome
ll
using query_vec
Example set of queries to run on ll
using query_vec
ll_queries
ll_queries
A data.frame with 5 rows and 2 variables:
Query IDs
Query expressions in string format
Find observations within a data frame matching a given query (a logical expression relating to one or more variables), and return tidy output that can be stacked across different queries on different variables. Stackability is achieved by pivoting the columns indicated in the query expression to long-form, e.g. "variable1", "value1", "variable2", "value2", ...
The query expression can optionally incorporate up to two dot-selectors
(".x
" and ".y
"), which each refer to a set of variables specified
separately using tidy-selection (see section Using a dot-selector). If
both selectors are used in a given query expression, the sets of variables
they respectively match can either be "crossed" such that all combinations
are evaluated, or evaluated in parallel.
By default, only the data columns referenced in the query expression are
returned, but additional columns can optionally be added with argument
cols_base
.
query( data, cond, cols_dotx, cols_doty, crossed = FALSE, cols_base, pivot_long = TRUE, pivot_var = "variable", pivot_val = "value", as_chr = TRUE, count = FALSE )
query( data, cond, cols_dotx, cols_doty, crossed = FALSE, cols_base, pivot_long = TRUE, pivot_var = "variable", pivot_val = "value", as_chr = TRUE, count = FALSE )
data |
A data frame |
cond |
An expression to evaluate with respect to variables within
|
cols_dotx , cols_doty
|
Tidy-selection of one or more columns represented
by a .x or .y selector. Only used if |
crossed |
if |
cols_base |
(Optional) Tidy-selection of other columns within |
pivot_long |
Logical indicating whether to pivot the variables
referenced within |
pivot_var |
Prefix for pivoted variable column(s). Defaults to
"variable". Only used if |
pivot_val |
Prefix for pivoted value column(s). Defaults to "value".
Only used if |
as_chr |
Logical indicating whether to coerce the columns referenced in
the query expression |
count |
Logical indicating whether to summarize the output by counting
the number of unique combinations across all returned columns (with count
column "n"). Defaults to |
A data frame reflecting the rows of data
that match the given query.
Returned columns include:
(optional) columns matched by argument cols_base
columns referenced within the query expression (pivoted to long form by default)
(optional) count column "n" (if count
= TRUE)
A query expression can optionally incorporate up to two dot-selectors
(".x
" and ".y
"), which each refer to a set of variables specified
separately using tidy-selection (arguments cols_dotx
and cols_doty
).
When cond
contains a .x
selector, the query expression is evaluated
repeatedly with each relevant variable from cols_dotx
individually
substituted into the .x
position of the expression. The results of these
multiple 'subqueries' are then combined with
dplyr::bind_rows
.
If cond
contains both a .x and .y selector, the sets of variables matched
by cols_dotx
and cols_doty
respectively can either be "crossed" such that
all combinations are evaluated, or evaluated in parallel. Evaluating in
parallel requires that the number of variables matched by cols_dotx
and
cols_doty
is the same.
Consider a hypothetical query checking that, if a patient has a particular
symptom, the date of onset of that symptom is not missing. E.g. cond = .x == "Yes" & is.na(.y)
cols_dotx = c(symptom_fever, symptom_headache)
cols_doty = c(date_symptom_fever, date_symptom_headache)
If argument crossed
is FALSE
, the relevant variables from cols_dotx
and
cols_doty
will be evaluated in parallel, as in: has_symptom_fever == "Yes" & is.na(date_symptom_fever)
has_symptom_headache == "Yes" & is.na(date_symptom_headache)
Conversely, if argument crossed
is TRUE
, all combinations of the relevant
variables will be evaluated, which for this particular query wouldn't make
sense: symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_fever == "Yes" & is.na(date_symptom_headache) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_fever) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_headache)
Note that if a dot-selector is used with argument pivot_long = FALSE
, the
row-binding of multiple subqueries may result in a sparse output with respect
to the variables represented by the dot-selector, because for each subquery
only the columns matched by expression cond
are returned.
# load example dataset, an epidemiological 'linelist' data(ll) # find observations where date_exit is earlier than date_admit query( ll, date_exit < date_admit, cols_base = id:site ) # find any date value in the future using a .x column selector query( ll, .x > Sys.Date(), cols_dotx = starts_with("date"), cols_base = id:site ) # incorporate an external object into the query expression lab_result_valid <- c("Positive", "Negative", "Inc.", NA) query( ll, !lab_result %in% lab_result_valid, cols_base = id:site, )
# load example dataset, an epidemiological 'linelist' data(ll) # find observations where date_exit is earlier than date_admit query( ll, date_exit < date_admit, cols_base = id:site ) # find any date value in the future using a .x column selector query( ll, .x > Sys.Date(), cols_dotx = starts_with("date"), cols_base = id:site ) # incorporate an external object into the query expression lab_result_valid <- c("Positive", "Negative", "Inc.", NA) query( ll, !lab_result %in% lab_result_valid, cols_base = id:site, )
Find observations matching a query that concerns one or more data frames
within a list of data frames, and return tidy, stackable output. Like
query
but enables query expressions that reference variables in multiple
data frames.
If the query expression references variables from data frames (i.e. list
elements) other than the focal element, the relevant variable(s) will be
joined to the focal element before the query expression is evaluated, see
arguments join_type
and join_by
below.
query_list( x, cond, element, cols_base, join_type = "left", join_by, pivot_long = TRUE, pivot_var = "variable", pivot_val = "value", as_chr = TRUE )
query_list( x, cond, element, cols_base, join_type = "left", join_by, pivot_long = TRUE, pivot_var = "variable", pivot_val = "value", as_chr = TRUE )
x |
A list of data frames |
cond |
Expression to evaluate with respect to one or more variables in
one or more of the data frames within |
element |
Name or integer index of the focal list element of |
cols_base |
(Optional) Tidy-selection of other columns within |
join_type |
If |
join_by |
A character vector of variables to join by. If the join key
columns have different names in |
pivot_long |
Logical indicating whether to pivot the variables
referenced within |
pivot_var |
Prefix for pivoted variable column(s). Defaults to
"variable". Only used if |
pivot_val |
Prefix for pivoted value column(s). Defaults to "value".
Only used if |
as_chr |
Logical indicating whether to coerce the columns referenced in
the query expression |
A data frame reflecting the rows of x[[element]]
that match the given
query. Returned columns include:
(optional) columns matched by argument cols_base
columns referenced within the query expression (pivoted to long form by default)
Data validation queries with query
or query_list
, but vectorized over
a set of query expressions in string format (and optionally a corresponding
vector of query names/IDs). Results of the multiple queries are stacked and
returned in a single tidy data frame, with columns referenced in the query
expressions pivoted to long-form (e.g. "variable1", "value1", "variable2",
"value2", ...).
query_vec( x, cond, element, name, cols_base, name_col = "query_id", join_type = "left", join_by = NULL, pivot_var = "variable", pivot_val = "value", as_chr = TRUE )
query_vec( x, cond, element, name, cols_base, name_col = "query_id", join_type = "left", join_by = NULL, pivot_var = "variable", pivot_val = "value", as_chr = TRUE )
x |
A data frame or a list of data frames to query. If a single data
frame will vectorize with |
cond |
Character vector of expressions to evaluate with respect to
variables within |
element |
If |
name |
(Optional) Character vector giving query names/IDs for each of
the expressions within |
cols_base |
(Optional) Tidy-selection of other columns within |
name_col |
Column name for the query names/IDs. Defaults to "query_id". |
join_type |
If |
join_by |
A character vector of variables to join by, or list of vectors
the same length as |
pivot_var |
Prefix for pivoted variable column(s). Defaults to "variable". |
pivot_val |
Prefix for pivoted value column(s). Defaults to "value". |
as_chr |
Logical indicating whether to coerce the columns referenced in
the query expression(s) to character prior to returning. This enables
row-binding multiple queries with variables of different classes. Defaults
to |
A data frame reflecting the rows of data
that match the given queries.
Returned columns include:
query name/ID column (name taken from argument name_col
)
(optional) columns matched by argument cols_base
columns referenced within the query expressions, pivoted to long form
data(ll) # example dataset, an epidemiological linelist data(ll_queries) # example data frame defining queries to run on ll # run all queries defined in ll_queries query_vec( ll, cond = ll_queries$query, name = ll_queries$query_id, cols_base = c(id, site) )
data(ll) # example dataset, an epidemiological linelist data(ll_queries) # example data frame defining queries to run on ll # run all queries defined in ll_queries query_vec( ll, cond = ll_queries$query, name = ll_queries$query_id, cols_base = c(id, site) )
Find observations matching a query that concerns two data frames, and return tidy, stackable output. Entails three steps:
separately query each of the two data frames using
query
combine the resulting query outputs based on a given join type (semi, anti, left, or inner)
execute a third query on the joined output
Each of the query steps is optional — unspecified query expressions are
replaced with TRUE
such that all rows of the relevant input are returned.
query2( data1, data2, cond1, cond2, cols_base1, cols_base2, join_type, join_by, cond3, pivot_long = TRUE, pivot_var = "variable", pivot_val = "value", as_chr = TRUE )
query2( data1, data2, cond1, cond2, cols_base1, cols_base2, join_type, join_by, cond3, pivot_long = TRUE, pivot_var = "variable", pivot_val = "value", as_chr = TRUE )
data1 |
Data frame to query (#1) |
data2 |
Data frame to query (#2) |
cond1 |
(Optional) Expression to evaluate with respect to |
cond2 |
(Optional) Expression to evaluate with respect to |
cols_base1 |
(Optional) Tidy-selection of other columns within |
cols_base2 |
(Optional) Tidy-selection of other columns within |
join_type |
How to join the output from the two initial queries ("semi", "anti",
"left", or "inner"). Based on dplyr |
join_by |
A character vector of variables to join by. If the join key
columns have different names in |
cond3 |
(Optional) Expression to evaluate with respect to the joined
output of the two initial queries. If missing will be set to Note that if If |
pivot_long |
Logical indicating whether to pivot the variables
referenced within the query expression(s) to a long (i.e. stackable)
format, with default column names "variable1", "value1", "variable2",
"value2", ... Defaults to |
pivot_var |
Prefix for pivoted variable column(s). Defaults to
"variable". Only used if |
pivot_val |
Prefix for pivoted value column(s). Defaults to "value".
Only used if |
as_chr |
Logical indicating whether to coerce the columns referenced in
the query expression(s) to character prior to returning. This enables
row-binding multiple queries with variables of different classes, but is
only important if |
A data frame reflecting the rows of data1
that match the given
query. Returned columns include:
Columns matched by argument cols_base1
Columns matched by argument cols_base2
(only if join type is "left" or
"inner")
Columns referenced within the relevant condition statements (pivoted to
long form by default).
If the join type is a mutating join ("left" or "inner"), variables from
data1
or data2
referenced in any of the condition statements
(cond1
, cond2
, or cond3
) will appear in the output. However, with a
filtering join ("anti" or "semi") only variables from data1
will appear
in the output.
# example datasets: two related epidemiological linelists data(ll) # ll from treatment center (all cases, confirmed and non-confirmed) data(sll) # summary linelist (only confirmed/probable cases) # find patients in ll that don't appear in sll query2( ll, sll, cols_base1 = c(id, site, status), join_type = "anti", join_by = c("id" = "tc_id") ) # find patients with different outcome status in ll vs sll query2( ll, sll, cols_base1 = id:site, join_type = "inner", join_by = c("id" = "tc_id"), cond3 = status != sll_status )
# example datasets: two related epidemiological linelists data(ll) # ll from treatment center (all cases, confirmed and non-confirmed) data(sll) # summary linelist (only confirmed/probable cases) # find patients in ll that don't appear in sll query2( ll, sll, cols_base1 = c(id, site, status), join_type = "anti", join_by = c("id" = "tc_id") ) # find patients with different outcome status in ll vs sll query2( ll, sll, cols_base1 = id:site, join_type = "inner", join_by = c("id" = "tc_id"), cond3 = status != sll_status )
Example dataset, a summary epidemiological linelist containing only confirmed/probable cases
sll
sll
A data.frame with 10 rows and 8 variables:
Patient identifier in the summary linelist
Was patient admitted to a treatment centre?
Patient ID at treatment centre
Site of treatment centre
Date of symptom onset
Date of admission to hospital
Date of laboratory test
Patient outcome