Best way to check for duplicates of one column grouped by another? #394

Aariq · 2022-02-09T19:01:21Z


What's the best way to check, for example, that ID numbers are not duplicated within years? I suspect there is a better way than what I'm trying to do:

library(tidyverse)

#> Warning: package 'tidyr' was built under R version 4.0.5

df <- tibble(year = rep(2001:2003, each =3),
       ID = c("A1", "A2", "A3",
              "A1", "A1", "A3",
              "A1", "A2", "A3"))
df

#> # A tibble: 9 × 2
#>    year ID   
#>   <int> <chr>
#> 1  2001 A1   
#> 2  2001 A2   
#> 3  2001 A3   
#> 4  2002 A1   
#> 5  2002 A1   
#> 6  2002 A3   
#> 7  2003 A1   
#> 8  2003 A2   
#> 9  2003 A3

#row 5 should fail test

library(pointblank)

#I think this worked in a previous version, but doesn't anymore?
create_agent(df) %>% 
  col_vals_lte(vars(n), 1,
               preconditions = ~ . %>%
                 group_by(ID, year) %>%
                 count()) %>% 
  interrogate()

#> Warning in agent$validation_set$n[idx] <- row_count: number of items to replace
#> is not a multiple of replacement length

#> Warning in agent$validation_set$n_passed[idx] <- n_passed: number of items to
#> replace is not a multiple of replacement length

#> Warning in n_passed/row_count: longer object length is not a multiple of shorter
#> object length

#> Warning in agent$validation_set$f_passed[idx] <- round((n_passed/row_count), :
#> number of items to replace is not a multiple of replacement length

#> Warning in agent$validation_set$f_failed[idx] <- round((n_failed/row_count), :
#> number of items to replace is not a multiple of replacement length
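The warnings above suggest that the row count reached the agent as a vector rather than a single number. One possible cause, which is an assumption and not something confirmed in this thread, is that count() called on a grouped table returns a result that is still grouped. A minimal dplyr-only sketch of that difference:

```r
library(dplyr)

df <- tibble(
  year = rep(2001:2003, each = 3),
  ID = c("A1", "A2", "A3",
         "A1", "A1", "A3",
         "A1", "A2", "A3")
)

# group_by() followed by count() keeps the grouping on the result
grouped_counts <- df %>% group_by(ID, year) %>% count()
is_grouped_df(grouped_counts)

# count() with the columns passed directly returns an ungrouped table
flat_counts <- df %>% count(year, ID)
is_grouped_df(flat_counts)
```

If grouping is indeed the culprit, ending the precondition with ungroup() or using count(year, ID) might avoid the warnings. That has not been verified here.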

Answered by rich-iannone

Feb 9, 2022


rich-iannone · 2022-02-09T19:09:02Z

rich-iannone
Feb 9, 2022
Maintainer

Hey Eric, try using the relatively new segmentation feature. Here's an example:

library(pointblank)
library(tidyverse)

df <- 
  tibble(
    year = rep(2001:2003, each = 3),
    ID = c("A1", "A2", "A3",
           "A1", "A1", "A3",
           "A1", "A2", "A3"
    )
  )

agent <-
  create_agent(
    tbl = df,
    actions = action_levels(warn_at = 1)
  ) %>%
  rows_distinct(segments = vars(year)) %>%
  interrogate()

agent

Here's a screen capture of the reporting: (image not preserved)
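Conceptually, segmenting by year means the same distinctness check runs once per year's slice of the table. A rough base-R analogy follows. It is only an illustration of the idea, not how pointblank implements segments, and the helper name has_dup_rows is made up for this sketch.

```r
df <- data.frame(
  year = rep(2001:2003, each = 3),
  ID = c("A1", "A2", "A3",
         "A1", "A1", "A3",
         "A1", "A2", "A3")
)

# One slice of the table per year, similar in spirit to segments = vars(year)
slices <- split(df, df$year)

# Hypothetical helper: TRUE when a slice contains at least one fully repeated row
has_dup_rows <- function(slice) anyDuplicated(slice) > 0

dup_by_year <- vapply(slices, has_dup_rows, logical(1))
dup_by_year
```

Because whole rows are compared, an extra column whose values differ between the two 2002/A1 rows would hide the repeated ID from this kind of check.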

2 replies

Aariq Feb 9, 2022
Author

Ah, ok, this works except I'm not interested in entirely duplicated rows, just duplicates of one column. For example, each ID number should only have one measurement of height per year. If there are multiples, then someone entered the ID number wrong. Here's a slightly different example where row 5 should still fail.

library(tidyverse)
#> Warning: package 'tidyr' was built under R version 4.0.5
df <- tibble(year = rep(2001:2003, each =3),
       ID = c("A1", "A2", "A3",
              "A1", "A1", "A3",
              "A1", "A2", "A3"),
       height = rnorm(9, 100)
       )
df
#> # A tibble: 9 × 3
#>    year ID    height
#>   <int> <chr>  <dbl>
#> 1  2001 A1     101. 
#> 2  2001 A2     102. 
#> 3  2001 A3     100. 
#> 4  2002 A1      99.3
#> 5  2002 A1     101. 
#> 6  2002 A3     100. 
#> 7  2003 A1     101. 
#> 8  2003 A2      99.2
#> 9  2003 A3      99.1
#row 5 should fail test

library(pointblank)
#> Warning: package 'pointblank' was built under R version 4.0.5

agent <-
  create_agent(
    tbl = df,
    actions = action_levels(warn_at = 1)
  ) %>%
  rows_distinct(segments = vars(year)) %>%
  interrogate()

Created on 2022-02-09 by the reprex package (v2.0.1)
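To make the requirement concrete outside of pointblank, here is a small dplyr sketch that lists the rows whose (year, ID) pair is repeated, regardless of height. The seed and the helper column row are arbitrary choices added for this illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the random heights are reproducible
df <- tibble(
  year = rep(2001:2003, each = 3),
  ID = c("A1", "A2", "A3",
         "A1", "A1", "A3",
         "A1", "A2", "A3"),
  height = rnorm(9, 100)
)

# Keep every row whose (year, ID) pair appears more than once
offending <- df %>%
  mutate(row = row_number()) %>%  # remember the original row position
  group_by(year, ID) %>%
  filter(n() > 1) %>%
  ungroup()

offending
```

Both members of a repeated pair are returned (rows 4 and 5 here), since the data alone cannot say which of the two entries carries the mistyped ID.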

Aariq Feb 9, 2022
Author

something like a col_vals_distinct()

rich-iannone · 2022-02-09T19:23:02Z

rich-iannone
Feb 9, 2022
Maintainer

Eric, I gotcha there too! With rows_distinct() you can focus on a subset of columns:

library(pointblank)
library(tidyverse)

df <- 
  tibble(
    year = rep(2001:2003, each = 3),
    ID = c("A1", "A2", "A3",
           "A1", "A1", "A3",
           "A1", "A2", "A3"
    )
  )

agent <-
  create_agent(
    tbl = df,
    actions = action_levels(warn_at = 1)
  ) %>%
  rows_distinct(columns = "ID", segments = vars(year)) %>%
  interrogate()

agent

This yields the following report: (image not preserved)
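A variation that is not from the thread: if one validation step for the whole table is preferred over one step per year, the pair of columns can be treated as the key. This is a sketch that assumes rows_distinct() accepts several columns through vars(), and that all_passed() reports FALSE when any step has failing rows.

```r
library(pointblank)
library(dplyr)

df <- tibble(
  year = rep(2001:2003, each = 3),
  ID = c("A1", "A2", "A3",
         "A1", "A1", "A3",
         "A1", "A2", "A3")
)

# Treat year + ID as the combination that must be unique, in a single step
agent <-
  create_agent(
    tbl = df,
    actions = action_levels(warn_at = 1)
  ) %>%
  rows_distinct(columns = vars(year, ID)) %>%
  interrogate()

# Quick programmatic check, as an alternative to reading the report
all_passed(agent)
```

The segmented version above gives a per-year breakdown in the report, while this form keeps the report to a single row. Which one is preferable depends on how the results will be read.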

2 replies

Aariq Feb 9, 2022
Author

Oh, thanks! I missed that in the lengthy list of arguments. Perfect!

rich-iannone Feb 9, 2022
Maintainer

Awesome!
