rake task for dumping, restoring and anonymizing user data #1013

Open
wants to merge 7 commits into
base: master

dkuku
Collaborator

@dkuku dkuku commented Jan 5, 2019

Created a rake task for anonymizing user data. Fixes #987.

Description

This PR also adds tasks for dumping and restoring the development database.
To use it, first dump your data with rake db:dump and back up the file db/backend.dump.
Then run rake db:anonymize_user, which modifies the current user table in place.
Now run rake db:dump again to dump the anonymized data.
To restore the original data, move the backed-up file back to db/ and run rake db:restore.
Suggestions for extending this are welcome.

Motivation and Context

To comply with current European law, we can't give developers access to user data. This rake task anonymizes the user table.

How Has This Been Tested?

Tested locally on seeded data. It might need some adjustments, since I anonymized only the data I currently have access to.

Alternative solution using shell and a temporary database

http://www.michaelkrenz.de/2012/08/05/how-to-anonymize-data-in-a-postgresql-database/

dump data

pg_dump database > original_datadump

create temp database

createdb tempDB

import data to temp

psql tempDB < ./original_datadump

run the anonymize script (shown at the bottom)

psql tempDB < ./anonymize_db.sql

dump anonymized data

pg_dump tempDB > anon_dump.sql

delete the temporary database

dropdb tempDB
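
The shell steps above could also be driven from Ruby, for example from inside a rake task. The following is only a sketch under assumptions: the helper name `anonymize_commands` and its keyword arguments are hypothetical and not part of this PR, while the file and database names are the ones used in the steps above. It only builds and prints the commands (a dry run) and does not touch any database.

```ruby
require 'shellwords'

# Hypothetical helper (not part of this PR): returns the shell commands of the
# temp-database workflow in order, without executing any of them.
def anonymize_commands(source_db:, temp_db: 'tempDB',
                       script: './anonymize_db.sql',
                       original_dump: './original_datadump',
                       anon_dump: 'anon_dump.sql')
  # Escape every name so that unusual characters cannot break the shell command.
  src, tmp, sql, orig, anon =
    [source_db, temp_db, script, original_dump, anon_dump].map { |s| Shellwords.escape(s) }

  [
    "pg_dump #{src} > #{orig}", # dump data
    "createdb #{tmp}",          # create temp database
    "psql #{tmp} < #{orig}",    # import data to temp
    "psql #{tmp} < #{sql}",     # run the anonymize script
    "pg_dump #{tmp} > #{anon}", # dump anonymized data
    "dropdb #{tmp}"             # delete the temporary database
  ]
end

# Dry run: print the commands instead of running them.
anonymize_commands(source_db: 'database').each { |cmd| puts cmd }
```

To actually run the workflow, each command could be passed to `system` and the loop aborted on the first failure, so that a failed import never leads to dumping a half-filled temporary database.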

/*delete field content
    update users set encrypted_password = NULL;
    */
    update users set encrypted_password = 'asdf';

/*anonymize emails, lat and lng*/
    update users set email = 'user' || id || '@rundfunk.com';
    update users set latitude = 51 + 0.001 * id;
    update users set longitude = 22 - 0.001 * id;

/*anonymize other data*/
    update city set
    city = translate(message, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'aaaaaaaaaaaaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAA');
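
For comparison, here is a rough pure-Ruby mirror of the SQL rules above, applied to plain hashes instead of database rows. The helper names `anonymize_user` and `mask_letters` are hypothetical and only for illustration. The constants are copied from the SQL script, and the sketch assumes a user is represented as a Hash with an `:id` key.

```ruby
# Hypothetical helper: applies the same rules as the SQL script to one user Hash.
def anonymize_user(user)
  id = user.fetch(:id)
  user.merge(
    encrypted_password: 'asdf',       # overwrite the password hash
    email: "user#{id}@rundfunk.com",  # derive the email from the id
    latitude: 51 + 0.001 * id,        # deterministic fake coordinates
    longitude: 22 - 0.001 * id
  )
end

# Hypothetical helper: masks ASCII letters like the translate(...) call does.
# Lowercase letters become 'a', uppercase letters become 'A'; everything else
# (digits, spaces, non-ASCII characters) is left unchanged.
def mask_letters(text)
  text.tr('a-z', 'a').tr('A-Z', 'A')
end

user = { id: 7, email: 'jane@example.org', latitude: 52.52, longitude: 13.4 }
p anonymize_user(user)
p mask_letters('Berlin 10115')
```

Note that every replacement value is derived from the id, so the anonymized rows stay linkable to the original ids. That is the concern raised in the review below.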

Owner

@roschaefer roschaefer left a comment


@dkuku looks good! Just the usual nitpicks. I'm gonna try it out on my machine now and send you an anonymized dump.

One question: Don't you think that keeping the ids will allow for de-anonymization? Maybe we should shuffle the user ids?

backend/lib/tasks/db.rake (outdated)

user.email = "user#{user.id}@rundfunk.com"
user.save!
end
Owner


@dkuku you can probably update all the users in one SQL statement: https://apidock.com/rails/ActiveRecord/Base/update_all/class

User.update_all(encrypted_password: 'xxxxxx')
User.update_all('latitude = 50 + 0.001 * users.id')
# ... etc

@roschaefer
Owner

@dkuku there are very similar rake tasks for dumping and restoring already 😲

[Screenshot: 2019-01-09-174651_1920x1080_scrot]

@roschaefer
Owner

@dkuku okay 👍 I ran the task on my machine. The dump task seems to be a duplicate of the dump task from https://github.com/sgruhier/capistrano-db-tasks. Unfortunately their dump task fails:

robert@e480 ~/D/r/backend> bin/rails db:data:dump
rails aborted!
TypeError: no implicit conversion of Pathname into String
/home/robert/.gem/ruby/2.5.1/gems/activerecord-5.1.6/lib/active_record/tasks/postgresql_database_tasks.rb:108:in `system'
/home/robert/.gem/ruby/2.5.1/gems/activerecord-5.1.6/lib/active_record/tasks/postgresql_database_tasks.rb:108:in `run_cmd'
/home/robert/.gem/ruby/2.5.1/gems/chrono_model-0.12.1/lib/active_record/tasks/chronomodel_database_tasks.rb:30:in `data_dump'
/home/robert/.gem/ruby/2.5.1/gems/chrono_model-0.12.1/lib/chrono_model/railtie.rb:31:in `block (2 levels) in <class:Railtie>'
/home/robert/.gem/ruby/2.5.1/gems/railties-5.1.6/lib/rails/commands/rake/rake_command.rb:21:in `block in perform'
/home/robert/.gem/ruby/2.5.1/gems/railties-5.1.6/lib/rails/commands/rake/rake_command.rb:18:in `perform'
/home/robert/.gem/ruby/2.5.1/gems/railties-5.1.6/lib/rails/command.rb:46:in `invoke'
/home/robert/.gem/ruby/2.5.1/gems/railties-5.1.6/lib/rails/commands.rb:16:in `<top (required)>'
bin/rails:4:in `require'
bin/rails:4:in `<main>'
Tasks: TOP => db:data:dump
(See full trace by running task with --trace)
robert@e480 ~/D/r/backend> ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]

Apparently IFAD's chronomodel does not like the db tasks? 😢

@dkuku shall I send you the dump via Slack?

@dkuku
Collaborator Author

dkuku commented Jan 9, 2019 via email

@roschaefer
Owner

@dkuku yes, but all foreign keys would need to be shuffled, too. The correct way to say it is "It breaks referential integrity", from here.

I think it does not need to be done now.
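
To illustrate the point about referential integrity: if user ids are shuffled, every foreign key has to be rewritten with the same mapping. The sketch below works on plain hashes, not on the real tables. The helper names `shuffled_id_map` and `remap_ids` are hypothetical, and `creator_id` is assumed to be the only foreign key pointing at users.

```ruby
# Hypothetical helper: builds a random one-to-one mapping old_id => new_id
# (a permutation of the existing ids).
def shuffled_id_map(ids, random: Random.new)
  ids.zip(ids.shuffle(random: random)).to_h
end

# Hypothetical helper: applies the same mapping to the primary keys and to the
# foreign keys, so each broadcast still points at the same (renumbered) user.
def remap_ids(users, broadcasts, id_map)
  new_users = users.map { |u| u.merge(id: id_map.fetch(u[:id])) }
  new_broadcasts = broadcasts.map { |b| b.merge(creator_id: id_map.fetch(b[:creator_id])) }
  [new_users, new_broadcasts]
end

users = [{ id: 1, email: 'a' }, { id: 2, email: 'b' }, { id: 3, email: 'c' }]
broadcasts = [{ id: 10, creator_id: 2 }]

id_map = shuffled_id_map(users.map { |u| u[:id] })
new_users, new_broadcasts = remap_ids(users, broadcasts, id_map)
p id_map, new_users, new_broadcasts
```

In a real database the same idea would have to work around primary key and foreign key constraints while the ids are being rewritten. A random permutation can also leave some ids unchanged, so this alone does not settle whether the ids allow de-anonymization.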

@dkuku
Collaborator Author

dkuku commented Jan 9, 2019

@roschaefer you can send it to me via Slack.
I'll look into the data shuffling too. I found a gem that might solve this:
https://github.com/sunitparekh/data-anonymization


@user_ids = User.ids.shuffle
Broadcast.find_each do |broadcast|
broadcast.creator_id = @user_ids.sample
Owner


Hm, I would not do that because it gets confusing. E.g. @ciremoussadia is working on a PR where we update the user role if you create a broadcast.

Owner


I don't know how to do proper anonymization 😆 maybe you can do some research if the user id itself is a means of de-anonymization and if yes, what are the counter-measures?

Maybe it's not necessary after all? 🤷‍♂️

@roschaefer
Owner

@dkuku that gem looks OK to me at first glance. The last commit is not too old, it has a couple of stars, and the use case is focused. Does it do something to the id? It does not seem so, does it?

@dkuku
Collaborator Author

dkuku commented Jan 10, 2019 via email

@roschaefer
Owner

roschaefer commented Jan 10, 2019

@dkuku the foreign key constraint comes from the database migrations. Ugh, good luck getting rid of them. Maybe Postgres allows dumping the data without foreign key constraints? But that will not help you, because you need to get rid of the constraints earlier.

@roschaefer
Owner

@dkuku could you please run git fetch and git merge origin/master (assuming origin is the name of this remote)?

@roschaefer
Owner

We could namespace the rake tasks as db:anonymized:dump, no?
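
A minimal sketch of what such a nested namespace could look like with the Rake DSL. The task body is a placeholder and not the actual implementation from this PR. In the Rails app the definition would live in backend/lib/tasks/db.rake; here it is written so that it also loads as a plain Ruby script.

```ruby
require 'rake'

# A Rakefile gets the DSL methods automatically; extending explicitly keeps
# this snippet usable as a plain script too.
extend Rake::DSL

namespace :db do
  namespace :anonymized do
    desc 'Dump anonymized data (placeholder body, for illustration only)'
    task :dump do
      puts 'would anonymize the user table and then dump the database'
    end
  end
end

# The nested namespaces produce the task name suggested in the comment above.
puts Rake::Task['db:anonymized:dump'].name
```

With this layout the task would be invoked as rake db:anonymized:dump, which keeps it apart from any existing db:dump or db:data:dump tasks.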

Successfully merging this pull request may close these issues.

Dump anonymized data
2 participants