Using pointblank for panel / repeated measures data #297

emilyriederer · 2021-03-12T01:08:42Z

I'm curious what strategies people use when using pointblank with panel or repeated measures data? For example, if one is analyzing data for a set of customers, it could be useful to specify checks within observations, such as:

Uniqueness of a column within but not between groups (e.g. YEAR is unique given COUNTRY but different COUNTRYs can have the same YEAR)
Values increasing within each group (but can decrease between)
The values of some column within each group contain all values of a prespecified set, so the check can fail if all values exist in the column overall but do not exist within each group

pointblank has great strategies for doing these checks for a full dataset (or, equivalently, a single group), but I cannot think of a good approach for running them for a large number of groups besides using a nest-map paradigm and creating a separate agent per group. However, this feels inelegant, challenging to combine into a single final report, and harder to implement on remote data sources (e.g. running against a database).

Thanks for any thoughts!
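
For reference, the nest-map approach mentioned above might look roughly like the following. This is only a sketch on a made-up `panel` table (the data and column names are assumptions, not from the thread): it nests the data by `country`, builds one agent per group, and checks that `year` is unique inside each group.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(pointblank)

# Made-up panel data: country "B" repeats the year 2020
panel <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019, 2020, 2021, 2019, 2020, 2020),
  value   = c(1, 2, 3, 10, 11, 12)
)

by_country <- panel %>%
  # One row per country, with that country's rows in a list-column
  nest(data = -country) %>%
  mutate(
    # One agent per group: `year` must be unique within the group
    agent = map(data, function(d) {
      create_agent(tbl = d) %>%
        rows_distinct(columns = vars(year)) %>%
        interrogate()
    }),
    # Collapse each agent to a single pass/fail flag
    passed = map_lgl(agent, all_passed)
  )

by_country %>% select(country, passed)
```

This shows the drawbacks described in the question: the result is a list of separate agents rather than one report, and the data has to be pulled into memory before it can be nested.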

rich-iannone · 2021-03-14T20:46:05Z

Maintainer

Hi Emily, thanks for bringing this up. I sketched out some code/ideas on how to solve this but ran into some dead ends. The problems are fixable, but I think it would be great to share the code and see which direction might be best. Here's the exploratory R script:

# devtools::install_github("rich-iannone/intendo")

library(intendo)
library(pointblank)
library(tidyverse)

# Let's take `intendo::sj_all_revenue` and modify it so that Germany didn't
# come online until 2015-02-01
all_revenue_modified <- 
  intendo::sj_all_revenue %>%
  filter(!(country == "Germany" & 
             session_start < lubridate::ymd_hms("2015-02-01 00:00:00"))) %>%
  filter(country %in% c("United States", "Canada", "Germany", "Spain"))

# New function in pointblank stores table-prep formulas; it's low on features
# right now but you can at least materialize the table by stored name (LHS of
# formula) or get the table-prep formulas (used in the `read_fn` arg)
tbls <-
  tbl_store(
    all_revenue ~ all_revenue_modified,
    all_revenue_us ~ all_revenue_modified %>% dplyr::filter(country == "United States"),
    all_revenue_ca ~ all_revenue_modified %>% dplyr::filter(country == "Canada"),
    all_revenue_de ~ all_revenue_modified %>% dplyr::filter(country == "Germany"),
    all_revenue_es ~ all_revenue_modified %>% dplyr::filter(country == "Spain")
  )

# Note on the above: it would be nice to refer to the name of a
# table that is being mutated, like this
#' tbls <-
#'   tbl_store(
#'     all_revenue ~ all_revenue_modified,
#'     all_revenue_us ~ ..all_revenue.. %>% dplyr::filter(country == "United States"),
#'     ...
#'   )

# Then the `tbl_store` function could just dynamically complete the sequence at
# request time (i.e., through `tbl_get()` or `tbl_source`)

# Materializing tables from `tbl_get()` calls 
tbl_get("all_revenue", tbls)
tbl_get("all_revenue_us", tbls)
tbl_get("all_revenue_ca", tbls)
tbl_get("all_revenue_de", tbls)
tbl_get("all_revenue_es", tbls)

# Extracting the table-prep formulas
tbl_source("all_revenue", tbls)
tbl_source("all_revenue_us", tbls)
tbl_source("all_revenue_ca", tbls)
tbl_source("all_revenue_de", tbls)
tbl_source("all_revenue_es", tbls)

# Underneath, these are two-sided formulas
unclass(tbl_source("all_revenue", tbls)) # `all_revenue ~ all_revenue_modified`
rlang::is_formula(unclass(tbl_source("all_revenue", tbls))) # TRUE

# Here's a plot that shows us that sessions in Germany begin in February 
ggplot(all_revenue_modified) +
  geom_point(aes(x = session_start, y = item_revenue), alpha = 0.25) +
  facet_wrap(~country)
  
# Validate by group; I want this to work, but, it doesn't because
# `preconditions` only accepts a functional sequence (`. %>% filter(...)`)
# It would be great if this did work though
create_agent(
  read_fn = ~ intendo::sj_all_revenue,
  tbl_name = "all_revenue",
  label = "All Revenue for 2015",
  actions = action_levels(warn_at = 0.01, stop_at = 0.05)
) %>%
  col_vals_gte(
    "session_start", lubridate::ymd_hms("2015-01-01 00:00:00"),
    preconditions = tbl_source("all_revenue_us", tbls)
  ) %>%
  col_vals_gte(
    "session_start", lubridate::ymd_hms("2015-01-01 00:00:00"),
    preconditions = tbl_source("all_revenue_ca", tbls)
  ) %>%
  col_vals_gte(
    "session_start", lubridate::ymd_hms("2015-02-01 00:00:00"),
    preconditions = tbl_source("all_revenue_de", tbls)
  ) %>%
  col_vals_gte(
    "session_start", lubridate::ymd_hms("2015-02-01 00:00:00"),
    preconditions = tbl_source("all_revenue_es", tbls)
  ) %>%
  interrogate()
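
For comparison, here is a version that should run as-is, using the functional-sequence form of `preconditions` mentioned in the comment above. It is a minimal sketch on a small made-up `sessions` table rather than the intendo data, with `Date` values in place of date-times, so the table contents are assumptions.

```r
library(dplyr)
library(pointblank)

# Made-up stand-in for the revenue table: Germany starts in February
sessions <- tibble(
  country       = c("United States", "United States", "Germany", "Germany"),
  session_start = as.Date(c("2015-01-03", "2015-01-20", "2015-02-02", "2015-02-15"))
)

agent <-
  create_agent(
    tbl = sessions,
    tbl_name = "sessions",
    label = "Launch dates by country"
  ) %>%
  # Each step filters to one country before the check is applied
  col_vals_gte(
    vars(session_start), as.Date("2015-01-01"),
    preconditions = ~ . %>% dplyr::filter(country == "United States")
  ) %>%
  col_vals_gte(
    vars(session_start), as.Date("2015-02-01"),
    preconditions = ~ . %>% dplyr::filter(country == "Germany")
  ) %>%
  interrogate()

all_passed(agent)
```

The cost is one hand-written validation step per group, which is the repetition the ideas below try to remove.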

Some of my ideas (all requiring some changes, but nothing too substantial) are to:

(1) improve tbl_store() so that you can easily create variants of a data table, calling them in by name in preconditions using tbl_source()
(2) include a group arg in many validation functions
(3) include a utility function to help create a named list of functional sequences

I'll continue to explore these. Maybe the third idea might be an acceptable workaround (or the best solution? have to try it!).
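
A rough sketch of what the third idea could look like. `group_filters()` is a hypothetical helper, not a pointblank function, and it returns plain functions rather than magrittr functional sequences, relying on `preconditions` also accepting a function. The `sessions` table is made up for the example.

```r
library(dplyr)
library(pointblank)

# Hypothetical helper: build a named list with one filtering function per
# value of a grouping column
group_filters <- function(values, column) {
  stats::setNames(
    lapply(values, function(value) {
      force(value)  # fix the value for this closure
      function(tbl) dplyr::filter(tbl, .data[[column]] == value)
    }),
    values
  )
}

# Made-up data for illustration
sessions <- tibble(
  country       = c("United States", "United States", "Germany", "Germany"),
  session_start = as.Date(c("2015-01-03", "2015-01-20", "2015-02-02", "2015-02-15"))
)

filters <- group_filters(c("United States", "Germany"), "country")

# Add the same validation step once per group, then interrogate
agent <-
  Reduce(
    function(agent, group_filter) {
      col_vals_gte(
        agent, vars(session_start), as.Date("2015-01-01"),
        preconditions = group_filter
      )
    },
    filters,
    init = create_agent(tbl = sessions, tbl_name = "sessions")
  ) %>%
  interrogate()

all_passed(agent)
```

Everything stays in one agent and one report, but the group values have to be known up front and the step labels do not say which group they belong to.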

I think this is an important use case to solve for, so, if any development is required I'd be happy to take that on.

2 replies

rich-iannone Mar 14, 2021
Maintainer

Looking again at this, I missed the main point that validation rules are exactly the same but applied across groups. Maybe a group argument is necessary and then pointblank would split interrogations across groups. I don’t think there’s a good workaround to this. Could you please make a new issue?

rich-iannone Mar 14, 2021
Maintainer

Another idea: see if the incoming data has groups with dplyr::group_vars(). This is done in {gt}. Then the API doesn’t change much; during interrogate() the grouping vars are inspected and several sub-interrogations would be run. A drawback is that this makes the feature pretty {dplyr}-centric and I wonder how to express the same with {data.table} tables (not that we fully support that yet).
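
To make the idea concrete, here is a user-level sketch of "inspect the grouping vars, then run sub-interrogations". `interrogate_by_group()` is a hypothetical wrapper written for this example; it is not how pointblank works internally, and the `sessions` data is made up.

```r
library(dplyr)
library(pointblank)

# Hypothetical wrapper: run the same validation steps once per group found
# via dplyr::group_vars()
interrogate_by_group <- function(tbl, add_steps) {
  grouping <- dplyr::group_vars(tbl)

  # No grouping variables: fall back to a single interrogation
  if (length(grouping) == 0) {
    return(list(all = interrogate(add_steps(create_agent(tbl = tbl)))))
  }

  # Label every row with its group key, then split into per-group tables
  keys   <- dplyr::group_keys(tbl)
  labels <- do.call(paste, c(as.list(keys), sep = " / "))
  pieces <- split(dplyr::ungroup(tbl), labels[dplyr::group_indices(tbl)])

  lapply(pieces, function(piece) {
    interrogate(add_steps(create_agent(tbl = piece)))
  })
}

# Made-up data: Germany has one session before the cutoff date
sessions <- tibble(
  country       = c("Canada", "Canada", "Germany", "Germany"),
  session_start = as.Date(c("2015-01-05", "2015-01-09", "2014-12-31", "2015-02-03"))
)

results <- sessions %>%
  group_by(country) %>%
  interrogate_by_group(function(agent) {
    col_vals_gte(agent, vars(session_start), as.Date("2015-01-01"))
  })

passed <- vapply(results, all_passed, logical(1))
passed
```

The validation steps are written once and the grouping comes entirely from the table, which is what would keep the API small. The sketch still returns a list of agents, so merging the results into one report remains an open question.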

emilyriederer · 2021-03-14T23:30:07Z

Author

Thanks for all of the thoughts on this @rich-iannone ! I think your example makes a really good point that there are a lot of different potential uses for groups:

running checks that should be evaluated only within a group (my premise)
running checks that need to differ by group (your example)
running the same checks across groups but perhaps looking at output at the group level to identify errors (random concept that just occurs to me)

Even if all of those are implemented, it's good to realize how many different ways people could interpret a "group" function!
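
For the first case in the list above (checks evaluated only within a group), some checks can be expressed in a single agent by pushing the grouping into `preconditions`. The sketch below is only one possible workaround, on a made-up `panel` table whose column names are assumptions; it has not been tested against a database.

```r
library(dplyr)
library(pointblank)

# Made-up panel data: two countries, three years each
panel <- tibble(
  country = rep(c("A", "B"), each = 3),
  year    = rep(2019:2021, times = 2),
  value   = c(1, 2, 3, 10, 11, 12)
)

agent <-
  create_agent(tbl = panel, tbl_name = "panel") %>%
  # (1) `year` is unique within `country`: the pair must be distinct
  rows_distinct(columns = vars(country, year)) %>%
  # (2) `value` increases within each country: the within-group difference
  #     must be positive (the first row of each group is NA and is passed)
  col_vals_gt(
    vars(step), 0, na_pass = TRUE,
    preconditions = ~ . %>%
      dplyr::group_by(country) %>%
      dplyr::arrange(year, .by_group = TRUE) %>%
      dplyr::mutate(step = value - dplyr::lag(value)) %>%
      dplyr::ungroup()
  ) %>%
  # (3) each country covers the required years 2019 to 2021: count the
  #     required years present per country and compare with 3
  col_vals_equal(
    vars(n_required), 3,
    preconditions = ~ . %>%
      dplyr::filter(year %in% 2019:2021) %>%
      dplyr::group_by(country) %>%
      dplyr::summarise(n_required = dplyr::n_distinct(year))
  ) %>%
  interrogate()

all_passed(agent)
```

One caveat for check (3): a country with none of the required years drops out of the summary entirely, so it would not be flagged. The results are also reported per step rather than per group, so this does not cover the third case in the list.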

I do like the idea of using group_vars() to keep the API simple. Since dplyr::group_by() works by adding a groups attribute (which group_vars() then reads), I wonder if for non-dplyr data structures you could simply add a helper function that adds a similar attribute to other types of objects in a non-disruptive way? 🤔
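
A minimal sketch of such a helper pair, assuming a custom attribute name (`pb_group_vars`) so it does not collide with dplyr's own `groups` attribute. Both functions are hypothetical and not part of pointblank or dplyr.

```r
# Hypothetical helper: record the grouping columns on any data-frame-like
# object without changing its class or contents
set_validation_groups <- function(tbl, ...) {
  cols <- c(...)
  stopifnot(is.character(cols), all(cols %in% names(tbl)))
  attr(tbl, "pb_group_vars") <- cols
  tbl
}

# Hypothetical helper: read the grouping columns back, preferring dplyr's
# grouping when the object is a grouped_df
get_validation_groups <- function(tbl) {
  if (inherits(tbl, "grouped_df")) {
    # dplyr stores group keys in the "groups" attribute, next to a `.rows` column
    return(setdiff(names(attr(tbl, "groups")), ".rows"))
  }
  cols <- attr(tbl, "pb_group_vars", exact = TRUE)
  if (is.null(cols)) character(0) else cols
}

# Made-up example with a plain data frame
df     <- data.frame(country = c("A", "A", "B"), year = c(2019, 2020, 2019))
tagged <- set_validation_groups(df, "country")

get_validation_groups(tagged)
get_validation_groups(df)
```

The attribute is only metadata, so existing code that does not look for it is unaffected. A practical risk is that some operations may drop custom attributes, so the tag would need to be applied just before validation.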

I'll open a general issue to keep discussing more!
