Merge branch 'main' of github.com:UBC-DSCI/reproducible-and-trustworthy-workflows-for-data-science
ttimbers committed Nov 28, 2024
2 parents 13459ad + 45379af commit c0cb35d
Showing 3 changed files with 84 additions and 7 deletions.
2 changes: 1 addition & 1 deletion book/lectures/100-containerization-1.qmd
@@ -63,7 +63,7 @@ Next, try to use Docker to run a container
that contains the RStudio server web-application installed:

```bash
docker run --rm -p 8787:8787 -e PASSWORD="apassword" rocker/rstudio:4.3.2
docker run --rm -p 8787:8787 -e PASSWORD="apassword" rocker/rstudio:4.4.2
```

Then visit a web browser on your computer and type: <http://localhost:8787>
8 changes: 4 additions & 4 deletions book/lectures/110-containerization-2.qmd
@@ -141,7 +141,7 @@ docker run \
--rm \
-p 8787:8787 \
-e PASSWORD="apassword" \
rocker/rstudio:4.3.2
rocker/rstudio:4.4.2
```
Then, to access the web app, we point a browser at `http://localhost:<HOST_PORT>`, where `<HOST_PORT>` is the port on our computer (the number before the colon in the `-p` option). In this case we would navigate to <http://localhost:8787> to use the RStudio server web app from the container.
@@ -154,7 +154,7 @@ docker run \
--rm \
-p 8788:8787 \
-e PASSWORD="apassword" \
rocker/rstudio:4.3.2
rocker/rstudio:4.4.2
```
When we do this, to run the app in a browser on our computer, we need to go to <http://localhost:8788> (instead of <http://localhost:8787>) to access this container as we mapped it to the `8788` port on our computer (and not `8787`).
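
To make the mapping rule concrete, here is a minimal Python sketch that turns a `-p` value into the URL we would visit. The function name `host_url_for_port_mapping` is hypothetical (it is not part of Docker or any library), and the sketch assumes the simple `HOST_PORT:CONTAINER_PORT` form used above; other forms that Docker accepts are not handled.

```python
def host_url_for_port_mapping(mapping: str, host: str = "localhost") -> str:
    """Return the browser URL for a `-p HOST_PORT:CONTAINER_PORT` mapping.

    Hypothetical helper for illustration only. It assumes the two-part
    form used in these notes and rejects anything else.
    """
    # Unpacking raises ValueError if there are not exactly two parts.
    host_port, container_port = mapping.split(":")
    if not (host_port.isdigit() and container_port.isdigit()):
        raise ValueError(f"Expected HOST_PORT:CONTAINER_PORT, got {mapping!r}")
    # The browser runs on our computer, so it connects to the host port;
    # the container port only matters inside the container.
    return f"http://{host}:{host_port}"


url_default = host_url_for_port_mapping("8787:8787")
url_remapped = host_url_for_port_mapping("8788:8787")
```

Only the number before the colon changes the URL, which is why the second command above is reached on port `8788` even though RStudio still listens on `8787` inside the container.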
@@ -216,7 +216,7 @@ we would run:
docker run \
--rm \
-it \
rocker/rstudio:4.3.2 \
rocker/rstudio:4.4.2 \
bash
```
@@ -351,7 +351,7 @@ for use with the `rocker/rstudio` container image:
```yaml
services:
analysis-env:
image: rocker/rstudio:4.3.2
image: rocker/rstudio:4.4.2
ports:
- "8787:8787"
volumes:
81 changes: 79 additions & 2 deletions book/lectures/130-data-validation.qmd
@@ -47,6 +47,25 @@ into functions also allows this code to be tested to ensure it is correct,
and that invalid data is handled as intended
(more on this in the testing chapter later in this book).

One note of caution about where to perform data validation checks:
in analyses that require data splitting
(e.g., splitting data into a training and test set to answer a predictive question),
make sure that the data validation checks
do not cause any data leakage between the split data sets.
For example,
when answering a predictive question,
a check for anomalous correlations between the target/response variable
and the features/explanatory variables
should not use the entire data set.
Using the entire data set for such a check
could inadvertently reveal patterns, distributions,
or relationships from the test set,
which may influence the analyst's choices
during feature and model selection.
Such data validation checks should therefore initially be run only on the training set.
It may also make sense to apply the same check to the test set,
but only after feature and model selection have been finalized.
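
As a rough illustration of this ordering (split first, then validate only the training portion),
here is a minimal sketch that uses only the Python standard library.
The helper names (`split_train_test`, `check_feature_target_correlation`),
the toy data, and the 0.9 threshold are all assumptions made for this example;
they are not taken from any package discussed in this chapter.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def split_train_test(rows, test_fraction=0.25, seed=123):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def check_feature_target_correlation(train_rows, feature, target, threshold=0.9):
    """Raise ValueError if |correlation| on the training rows exceeds threshold."""
    xs = [row[feature] for row in train_rows]
    ys = [row[target] for row in train_rows]
    corr = pearson(xs, ys)
    if abs(corr) > threshold:
        raise ValueError(
            f"Correlation between {feature!r} and {target!r} is {corr:.2f}, "
            f"above the allowed threshold of {threshold}."
        )
    return corr


# Toy data (an assumption for this sketch): two unrelated numeric columns.
rng = random.Random(0)
rows = [{"feature": rng.gauss(0, 1), "target": rng.gauss(0, 1)} for _ in range(200)]

# Split FIRST, then validate using only the training rows.
train_rows, test_rows = split_train_test(rows)
train_corr = check_feature_target_correlation(train_rows, "feature", "target")
```

The test rows are never passed to the check,
so nothing about the relationship between feature and target in the test set
can influence decisions made during feature and model selection.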

## Data validation checks

What kind of data validation, or checks,
@@ -527,10 +546,68 @@ We list the others below:

**Python:**

- Pandera: <https://pandera.readthedocs.io>
- Great Expectations: <https://docs.greatexpectations.io>
- Deep Checks: <https://docs.deepchecks.com>
- Great Expectations: <https://docs.greatexpectations.io>
- Pandera: <https://pandera.readthedocs.io>
- Pydantic: <https://docs.pydantic.dev/latest/>


**R:**

- pointblank: <https://rstudio.github.io/pointblank>

### Deep Checks

In particular, the [Deep Checks](https://docs.deepchecks.com) package is quite useful
because it provides high-level abstractions for several machine learning data validation checks
that you would have to code manually if you used something like `Pandera` for them.
Examples from the checklist above include:

- [ ] No anomalous correlations between target/response variable and features/explanatory variables^1^
- [ ] No anomalous correlations between features/explanatory variables^1^

To use this, we first have to create a Deep Checks Dataset object
(specifying the data set, the target/response variable, and any categorical features):

```python
from deepchecks.tabular import Dataset


cancer_train_ds = Dataset(cancer_train, label="class", cat_features=[])
```

Once we have that, we can use the `FeatureLabelCorrelation()` check,
set the maximum threshold we'll allow (here 0.9),
and run the check:

```python
from deepchecks.tabular.checks import FeatureLabelCorrelation


check_feat_lab_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.9)
check_feat_lab_corr_result = check_feat_lab_corr.run(dataset=cancer_train_ds)
```

Finally, we can check whether the `FeatureLabelCorrelation()` validation has failed.
If it has (i.e., the correlation is above the acceptable threshold),
we can act on it, for example by raising a `ValueError` with an appropriate error message:

```python
# Stop the analysis if the correlation condition was not met
if not check_feat_lab_corr_result.passed_conditions():
    raise ValueError("Feature-Label correlation exceeds the maximum acceptable threshold.")
```

:::{.callout-note}
Notice the names of the data frame and the Deep Checks data set above?
Both contain the word "train".
This is important!
Some data validation checks can cause data leakage
if we perform them on the entire data set
before finalizing feature and model selection.
Be conscientious about your data validation checks
to ensure they do not introduce data leakage.
:::

Deep Checks has a nice gallery of different data validation checks
for which it has high-level functions:
<https://docs.deepchecks.com/stable/tabular/auto_checks/data_integrity/index.html>
