-
Notifications
You must be signed in to change notification settings - Fork 636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DYN-6146 Removing PII data from a JSON workspace #14471
DYN-6146 Removing PII data from a JSON workspace #14471
Conversation
src/DynamoUtilities/PIIDetector.cs
Outdated
static string creditCardPattern = @"(\d{4}[-, ]\d{4})"; | ||
static string ssnPattern = @"\d{3}[- ]\d{2}[- ]\d{4}"; | ||
static string ipPattern = @"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"; | ||
static string datePattern = @"(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this account for different data formats like DDMMYY, DDMMYYYY, MMDDYY, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this account for different data formats like DDMMYY, DDMMYYYY, MMDDYY, etc.
Done, I have added a Unit Test too for validate more cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this amounting to modifying the original graph authored by a user? What if users deliberately want to insert such data in their graphs for certain specific workflows, without which their graphs might not be useful?
In other words, do we need to care about removing PII data that users are deliberately of their own volition introducing in Dynamo?
Hi @aparajit-pratap The change to |
return new Tuple<string, JObject>(uuid, jObjectResult); | ||
} | ||
|
||
static string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this best practice? where are these regexes pulled from? I don't know how others feel, but this seems very hard to understand, reason about, or maintain - especially because there are no tests in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of note https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html https://microsoft.github.io/presidio/analyzer/
I have added an Unit Test for it so We can discuss in detail the used regexes and their scope / granularity, I have already used them in the past for others projects (As we know this feature is cross project/company).
About your references I have already considered similar options in the Spike from this task : https://jira.autodesk.com/browse/DYN-5964 . I considered the PII Helper class as the main option but its opened to be extended or changed as the Team decide.
I have removed the functionality from the Save method and create a Unit Test instead of. |
src/DynamoUtilities/PIIDetector.cs
Outdated
/// <summary> | ||
/// Helper Class for removing PII Data from a JSON workspace | ||
/// </summary> | ||
public static class PIIDetector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep this stuff internal, this should not be an API IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/DynamoUtilities/PIIDetector.cs
Outdated
return property.Value; | ||
} | ||
|
||
public static bool ContainsEmail(string value) { return new Regex(emailPattern).Match(value).Success; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be conservative as possible with access modifiers, none of this should be something we are providing as APIs - right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/DynamoUtilities/PIIDetector.cs
Outdated
/// </summary> | ||
/// <param name="jsonObject"></param> | ||
/// <returns></returns> | ||
public static JObject RemovePIIData(JObject jsonObject) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code in here seems kind of fragile to changes - it uses a bunch of hardcoded string names which could easily break at runtime.
I think for the null checking if possible you should move to using the null coalescing operator for ease of readability:
https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/null-coalescing-operator
or for properties
In addition I don't see this function used, who will use it? Will it be called often?
I would add a try catch around this entire call internally with a catch that reports the error, or return a bool if the removal fails so downstream consumers can decide if they still wish to use the possibly tainted file.
I would probably also take all the strings out and make them consts - ideally we would use NameOf
but I understand you want this in a project that should probably not reference the types you would need to do that. I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mjkkirschner I have applied your suggestions, the function will be used to get a serialized workspace without PII data (https://jira.autodesk.com/browse/DYN-5964), based on a well formatted and confident json object, it will be called when an user close a workspace / graph (potentially) after checking if it's valid for send to Forge. Yes, since the project doesn't have reference to the types, the comparisons are performed to explicit values.
@jesusalvino it looks good, but the build checks are failing, any idea if thats a fluke or something real? |
@jesusalvino Please double check this branch for building, I always trust the checks. It could be that updating this branch with master branch caused side effects or errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @mjkkirschner and @aparajit-pratap for the review. One thing I was not thrilled about was the hardcoded Workspace related property names, e.g. const string Nodes = "Nodes";
. I later realized that these need to be hard coded because the code is in Utility class with a recursive implementation, I do not see any work around yet other than us moving these code back so we can use nameof
. But I think for now, since these properties are always going to be valid and we have been stable about our file format, I am OK with that.
I missed commit the unit test / merged with the latest master, all should be clean now |
sure, waiting for PR checks to pass |
The regression should be fixed by #14566 already. Merging |
Purpose
Removing the PII data from a JSON workspace object to implement the task https://jira.autodesk.com/browse/DYN-6146
Declarations
Check these if you believe they are true
*.resx
filesReviewers
@QilongTang
@reddyashish
FYIs
@RobertGlobant20
@Enzo707