
Commit

Merge pull request #15 from pvcy/jc/update-readme
Update README
john-craft authored Oct 27, 2023
2 parents 88c85d9 + eea10fc commit 7a1a21a
Showing 4 changed files with 20 additions and 78 deletions.
18 changes: 0 additions & 18 deletions .github/workflows/anonymize.yml

This file was deleted.

38 changes: 0 additions & 38 deletions .github/workflows/snapshot-data.yml

This file was deleted.

36 changes: 14 additions & 22 deletions README.md
@@ -2,30 +2,22 @@

This application demonstrates how to validate that database migrations will succeed before they are run in production. It uses real, anonymized production data to verify that migrations work and don't fail on outlier data.

## A reference architecture
The reference architecture we propose is designed specifically for ephemeral development environments. These environments often contain subsets of production data, making it essential to safeguard Personally Identifiable Information (PII) and other sensitive details.
This application is based on the reference app at [pvcy/anonymize-demo](https://github.com/pvcy/anonymize-demo).

### Key components:
## Overview
This sample app runs a GitHub workflow that verifies database migrations in a pull request will succeed before they reach production.
The app coordinates everything with two GitHub workflows, `build-and-push.yml` and `test-migration.yml`. The first, `build-and-push.yml`, builds and stores a container image from the `/db` directory that loads data from a GCS bucket at startup. The second, `test-migration.yml`, detects when a pull request contains migration changes and runs the database container with a full copy of production data against which the migrations can be tested. The success or failure of the migration is added as a comment to the pull request.
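A minimal sketch of what `test-migration.yml` might look like — the trigger paths, image name, and step details here are assumptions for illustration, not copied from the repo:

```yaml
# Hypothetical sketch of test-migration.yml -- names and paths are assumptions.
name: test-migration
on:
  pull_request:
    paths:
      - 'users-api/src/migrations/**'   # run only when migration files change

jobs:
  test-migration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Start the prebuilt db image, which restores the GCS backup at startup.
      - run: >
          docker run -d -p 5432:5432
          -e DB_BACKUP_URL=${{ vars.DB_BACKUP_URL }}
          ghcr.io/example/db:latest
      # Run the Sequelize migrations against the restored production copy.
      - run: npx sequelize-cli db:migrate
        working-directory: users-api/src
```

A final step (omitted here) would post the migration result as a pull-request comment, as described above.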

1. Source data. This component represents the data that needs to be anonymized. It could be a database, file storage, or any other data repository. In this example, the source data is a PostgreSQL container seeded from a synthetically generated JSON document.
2. Anonymization engine. The anonymization engine is the core component responsible for processing and transforming data to remove sensitive information. Privacy Dynamics serves as the anonymization engine in this example.
3. Anonymized data. An anonymized copy of the source data that is used for development and testing purposes while protecting individual privacy.
## Assumptions
* This app assumes there is a `pg_dump` from PostgreSQL stored in a Google Cloud Storage (GCS) bucket.
* This is a Node app and uses [Sequelize](https://sequelize.org/) to run the database migration.
* The workflow variable `DB_BACKUP_URL` must point to the bucket containing the backup.
* You must have [Docker](https://www.docker.com/) installed to run the app locally.
* This app loads the database backup into a PostgreSQL container when the workflow `test-migration` runs. If the data is exceptionally large, it may not load in time.

The application contains two services, a front-end and back-end, both written in Python. The back-end contains a fake dataset of users to represent a production database.

For the demonstration, two instances of the app should be started: one representing a production application and a second representing a development or preview environment. Privacy Dynamics connects to the first instance, anonymizes data in memory, and writes the anonymized copy to the second instance.

![](docs/Basic%20Anonymizing%20Data%20for%20Dev%20and%20Test%20Evironments.jpg)

## Leveraging Privacy Dynamics
The demonstration relies on the [Privacy Dynamics Software-as-a-Service (SaaS)](https://www.privacydynamics.io) platform to anonymize the dataset. In order for Privacy Dynamics to work, it must be able to connect to both the source and destination databases. In most cases, networking adjustments must be made to make the databases accessible by Privacy Dynamics.

![](docs/Anonymizing%20Data%20for%20Dev%20and%20Test%20Evironments.jpg)

Once connectivity to the databases is established, a base environment can be used to store an anonymized copy of production data, refreshed on an hourly, daily, or weekly schedule. Database snapshots are used to copy and replicate the anonymized data to _n_ number of remote environments.

## Getting started
You can run a single instance of the application locally. All you need to get started is [Docker](https://www.docker.com/) installed on your local machine.

1. Start the application with `docker compose up --build`. This will seed the PostgreSQL database with data from the JSON file.
2. Navigate to `http://localhost:5000` in your browser.
1. Create a backup of your production database with `pg_dump` and store it in a GCS bucket.
2. Run the PostgreSQL database locally with the command `docker compose up --build`. This will launch the three containers defined in `/docker-compose.yml` and load the sample data (`/users-api/data/users.json`) into the database container.
3. From the `users-api/src` directory, run the migration with `npx sequelize-cli db:migrate`. The migration should pass locally.
4. Create a migration
6 changes: 6 additions & 0 deletions users-api/package-lock.json

