Merge pull request #105 from UrbanInstitute/version0.0.4
Version0.0.4
Showing 88 changed files with 4,342 additions and 351 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,11 @@ | ||
^renv$ | ||
^renv\.lock$ | ||
^syntheval\.Rproj$ | ||
^\.Rproj\.user$ | ||
^LICENSE\.md$ | ||
^README\.Rmd$ | ||
^README\.qmd$ | ||
^README_files$ | ||
^data-raw$ | ||
^test.R | ||
disriminators.qmd | ||
^project-standards.md$ |
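The entries above look like `.Rbuildignore` patterns, which `R CMD build` treats as Perl-style regular expressions matched case-insensitively against paths relative to the package root. The sketch below approximates that matching with `grepl()` to show why anchoring and escaping matter: `^test.R` has an unescaped dot and no trailing `$`, and `disriminators.qmd` has no anchors at all. The helper `is_ignored()` and the example paths are hypothetical and only for illustration; this is not the code `R CMD build` itself runs.

```r
# A subset of the patterns from the diff above, written as R strings.
patterns <- c(
  "^renv$",
  "^renv\\.lock$",
  "^README\\.qmd$",
  "^test.R",            # unescaped dot, no trailing anchor
  "disriminators.qmd"   # no anchors: matches anywhere in the path
)

# Hypothetical helper: TRUE if any pattern matches the path.
is_ignored <- function(path, patterns) {
  any(vapply(
    patterns,
    function(p) grepl(p, path, perl = TRUE, ignore.case = TRUE),
    logical(1)
  ))
}

# Made-up paths, relative to the package root.
paths <- c(
  "renv",
  "renv.lock",
  "README.qmd",
  "test.R",
  "testXR-notes.md",
  "vignettes/disriminators.qmd",
  "R/data.R"
)

ignored <- vapply(paths, is_ignored, logical(1), patterns = patterns)
print(ignored)
```

Under these assumptions, `testXR-notes.md` is also flagged, because `.` matches any character and the pattern is not anchored at the end; writing `^test\.R$` would restrict the match to the single file.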
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,52 @@ | ||
.Rproj.user | ||
# History files | ||
.Rhistory | ||
.DS_Store | ||
.Rapp.history | ||
|
||
# Mac system file | ||
.DS_Store | ||
|
||
# Session Data files | ||
.RData | ||
|
||
# Example code in package build process | ||
*-Ex.R | ||
|
||
# Output files from R CMD build | ||
/*.tar.gz | ||
|
||
# Output files from R CMD check | ||
/*.Rcheck/ | ||
|
||
# RStudio files | ||
.Rproj.user/ | ||
|
||
# produced vignettes | ||
vignettes/*.html | ||
vignettes/*.pdf | ||
|
||
# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 | ||
.httr-oauth | ||
|
||
# knitr and R markdown default cache directories | ||
/*_cache/ | ||
/cache/ | ||
|
||
# Temporary files created by R markdown | ||
*.utf8.md | ||
*.knit.md | ||
|
||
# Shiny token, see https://shiny.rstudio.com/articles/shinyapps.html | ||
rsconnect/ | ||
|
||
# log files | ||
*.log | ||
*\.html | ||
|
||
# renv environment files | ||
renv/ | ||
|
||
# README build files | ||
README_files/ | ||
|
||
# Plot outputs from unit tests | ||
tests/testthat/Rplots.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,16 @@ | ||
Package: syntheval | ||
Title: A set of tools for evaluating synthetic data utility and disclosure risk | ||
-Version: 0.0.3 | ||
+Version: 0.0.4 | ||
Authors@R: c( | ||
-person(given = "Aaron R.", family = "Williams", | ||
+person(given = "Aaron R.", | ||
+family = "Williams", | ||
email = "[email protected]", role = c("aut", "cre"), | ||
comment = c(ORCID = "0000-0001-5564-1938")), | ||
+person(given = "Jeremy", | ||
+family = "Seeman", | ||
+email = "[email protected]", | ||
+role = "aut", | ||
+comment = c(ORCID = "0000-0003-3526-3209")), | ||
person("Gabe", "Morrison", , "[email protected]", role = "ctb"), | ||
person("Elyse", "McFalls", , "[email protected]", role = "ctb") | ||
) | ||
|
@@ -16,24 +22,31 @@ License: AGPL (>= 3) | |
BugReports: https://github.com/UI-Research/syntheval/issues | ||
Encoding: UTF-8 | ||
Roxygen: list(markdown = TRUE) | ||
RoxygenNote: 7.2.3 | ||
RoxygenNote: 7.3.2 | ||
Suggests: | ||
forcats, | ||
stringr, | ||
testthat (>= 3.0.0) | ||
Config/testthat/edition: 3 | ||
Imports: | ||
broom, | ||
dplyr, | ||
ggplot2, | ||
gower, | ||
gridExtra, | ||
Hmisc, | ||
magrittr, | ||
parsnip, | ||
pillar, | ||
purrr, | ||
recipes, | ||
rlang, | ||
rsample, | ||
tibble, | ||
tidyr, | ||
tidyselect, | ||
tune, | ||
twosamples, | ||
workflows, | ||
yardstick | ||
Suggestions: | ||
|
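The fields above are in Debian control file (DCF) format, which base R can parse with `read.dcf()`. As a minimal sketch, the snippet below reads a shortened, inline copy of the file (only the `Package`, `Version`, and `Suggests` fields, retyped here for illustration) and checks the version bump from 0.0.3 to 0.0.4. In a real checkout one would pass the path to `DESCRIPTION` instead of a text connection.

```r
# Shortened, inline stand-in for the DESCRIPTION file shown in the diff.
desc_text <- c(
  "Package: syntheval",
  "Version: 0.0.4",
  "Suggests:",
  "    forcats,",
  "    stringr,",
  "    testthat (>= 3.0.0)"
)

con <- textConnection(desc_text)
desc <- read.dcf(con)
close(con)

# Compare versions as version objects rather than as strings.
version <- package_version(desc[1, "Version"])
is_newer <- version > package_version("0.0.3")

# Split the comma-separated Suggests field into individual entries.
suggests <- trimws(strsplit(desc[1, "Suggests"], ",")[[1]])

print(version)
print(is_newer)
print(suggests)
```

Comparing `package_version` objects avoids the pitfalls of string comparison, where, for example, "0.0.10" would sort before "0.0.4".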
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,132 @@ | ||
#' American Community Survey confidential microdata | ||
#' | ||
#' An extract constructed from the 2019 American Community Survey containing a | ||
#' random sample of n = 1000 Nebraska respondents. | ||
#' | ||
#' Original data source: | ||
#' Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, | ||
#' Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. | ||
#' IPUMS USA: Version 15.0 \[dataset\]. Minneapolis, MN: IPUMS, 2024. | ||
#' https://doi.org/10.18128/D010.V15.0 | ||
#' | ||
#' @format ## `acs_conf` | ||
#' A data frame with 1,000 rows and 11 columns: | ||
#' \describe{ | ||
#' \item{county}{fct, county} | ||
#' \item{gq}{fct, group quarter kind} | ||
#' \item{sex}{fct, sex} | ||
#' \item{marst}{fct, marital status} | ||
#' \item{hcovany}{fct, health insurance status} | ||
#' \item{empstat}{fct, employment status} | ||
#' \item{classwkr}{fct, employment kind (e.g., self-employed)} | ||
#' \item{age}{dbl, age (in years)} | ||
#' \item{famsize}{dbl, household/family size} | ||
#' \item{transit_time}{dbl, transit time to work (in minutes)} | ||
#' \item{inctot}{dbl, annual income} | ||
#' } | ||
#' @source <https://usa.ipums.org/usa/> | ||
"acs_conf" | ||
|
||
#' American Community Survey holdout microdata | ||
#' | ||
#' An extract constructed from the 2019 American Community Survey containing a | ||
#' random sample of n = 1000 Nebraska respondents. This sample is distinct from | ||
#' `acs_conf` and is not used in producing the synthetic data available in this | ||
#' package. | ||
#' | ||
#' Original data source: | ||
#' Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, | ||
#' Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. | ||
#' IPUMS USA: Version 15.0 \[dataset\]. Minneapolis, MN: IPUMS, 2024. | ||
#' https://doi.org/10.18128/D010.V15.0 | ||
#' | ||
#' @format ## `acs_holdout` | ||
#' A data frame with 1,000 rows and 11 columns: | ||
#' \describe{ | ||
#' \item{county}{fct, county} | ||
#' \item{gq}{fct, group quarter kind} | ||
#' \item{sex}{fct, sex} | ||
#' \item{marst}{fct, marital status} | ||
#' \item{hcovany}{fct, health insurance status} | ||
#' \item{empstat}{fct, employment status} | ||
#' \item{classwkr}{fct, employment kind (e.g., self-employed)} | ||
#' \item{age}{dbl, age (in years)} | ||
#' \item{famsize}{dbl, household/family size} | ||
#' \item{transit_time}{dbl, transit time to work (in minutes)} | ||
#' \item{inctot}{dbl, annual income} | ||
#' } | ||
#' @source <https://usa.ipums.org/usa/> | ||
"acs_holdout" | ||
|
||
#' American Community Survey lower-risk synthetic data | ||
#' | ||
#' A list of 30 samples of synthetic data derived from `acs_conf`, | ||
#' generated using noise infusion for both categorical and numeric random variables. | ||
#' These are referred to as "lower-risk" relative to the "higher-risk" synthetic data | ||
#' also available in this package; the synthetic data is purely for testing purposes. | ||
#' | ||
#' Categorical random variables are selected by resampling from a mixture of the | ||
#' original multivariate cell proportions and a uniform distribution. Numeric random | ||
#' variables are first modelled using regression trees, and new sampled values | ||
#' each have additional discrete two-sided geometric noise added to them. | ||
#' | ||
#' Original data source: | ||
#' Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, | ||
#' Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. | ||
#' IPUMS USA: Version 15.0 \[dataset\]. Minneapolis, MN: IPUMS, 2024. | ||
#' https://doi.org/10.18128/D010.V15.0 | ||
#' | ||
#' @format ## `acs_lr_synths` | ||
#' A list of 30 data frames with 1,000 rows and 11 columns: | ||
#' \describe{ | ||
#' \item{county}{fct, county} | ||
#' \item{gq}{fct, group quarter kind} | ||
#' \item{sex}{fct, sex} | ||
#' \item{marst}{fct, marital status} | ||
#' \item{hcovany}{fct, health insurance status} | ||
#' \item{empstat}{fct, employment status} | ||
#' \item{classwkr}{fct, employment kind (e.g., self-employed)} | ||
#' \item{age}{dbl, age (in years)} | ||
#' \item{famsize}{dbl, household/family size} | ||
#' \item{transit_time}{dbl, transit time to work (in minutes)} | ||
#' \item{inctot}{dbl, annual income} | ||
#' } | ||
#' @source <https://usa.ipums.org/usa/> | ||
"acs_lr_synths" | ||
|
||
|
||
#' American Community Survey higher-risk synthetic data | ||
#' | ||
#' A list of 30 samples of partially synthetic data derived from `acs_conf`, | ||
#' generated using models that intentionally overfit to the confidential data. | ||
#' These are referred to as "higher-risk" relative to the "lower-risk" synthetic | ||
#' data also available in this package; the synthetic data is purely for testing purposes. | ||
#' | ||
#' Categorical variables are primarily kept "as-is" in this partially synthetic data, | ||
#' with a small proportion of categorical records resampled from the data. Numeric | ||
#' variables are resampled from decision tree models that are intentionally designed | ||
#' to overfit to the confidential data. | ||
#' | ||
#' Original data source: | ||
#' Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, | ||
#' Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. | ||
#' IPUMS USA: Version 15.0 \[dataset\]. Minneapolis, MN: IPUMS, 2024. | ||
#' https://doi.org/10.18128/D010.V15.0 | ||
#' | ||
#' @format ## `acs_hr_synths` | ||
#' A list of 30 data frames with 1,000 rows and 11 columns: | ||
#' \describe{ | ||
#' \item{county}{fct, county} | ||
#' \item{gq}{fct, group quarter kind} | ||
#' \item{sex}{fct, sex} | ||
#' \item{marst}{fct, marital status} | ||
#' \item{hcovany}{fct, health insurance status} | ||
#' \item{empstat}{fct, employment status} | ||
#' \item{classwkr}{fct, employment kind (e.g., self-employed)} | ||
#' \item{age}{dbl, age (in years)} | ||
#' \item{famsize}{dbl, household/family size} | ||
#' \item{transit_time}{dbl, transit time to work (in minutes)} | ||
#' \item{inctot}{dbl, annual income} | ||
#' } | ||
#' @source <https://usa.ipums.org/usa/> | ||
"acs_hr_synths" |
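The documentation for the lower-risk synthetic data describes two noise-infusion steps: resampling categorical values from a mixture of the observed proportions and a uniform component, and adding discrete two-sided geometric noise to numeric values. The sketch below illustrates both ideas on toy data. It is not the package's generation code: the helper names, the `uniform_weight` and `prob` parameters, and the toy vectors are all assumptions, the categorical step is done per variable rather than on multivariate cells, and the regression-tree modelling step is skipped entirely.

```r
set.seed(20240101)

# Hypothetical helper: resample a categorical variable from a mixture of
# its observed level proportions and a uniform distribution over levels.
resample_categorical <- function(x, uniform_weight = 0.1) {
  x <- factor(x)
  counts <- as.numeric(table(x))
  observed <- counts / sum(counts)
  uniform <- rep(1 / nlevels(x), nlevels(x))
  mixed <- (1 - uniform_weight) * observed + uniform_weight * uniform
  draws <- sample(levels(x), size = length(x), replace = TRUE, prob = mixed)
  factor(draws, levels = levels(x))
}

# Hypothetical helper: the difference of two independent geometric draws
# follows a two-sided geometric (discrete Laplace) distribution.
add_two_sided_geometric_noise <- function(x, prob = 0.5) {
  x + rgeom(length(x), prob) - rgeom(length(x), prob)
}

# Toy stand-ins for two of the documented columns.
marst <- c("married", "single", "single", "divorced", "married", "single")
age <- c(34, 51, 27, 45, 62, 19)

synthetic <- data.frame(
  marst = resample_categorical(marst),
  age = add_two_sided_geometric_noise(age)
)
print(synthetic)
```

In this sketch, a larger `uniform_weight` or a smaller `prob` injects more noise, which trades utility for lower disclosure risk; the values used here are arbitrary.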