Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DYN-6146 Removing PII data from a JSON workspace #14471

Merged
merged 6 commits into from
Nov 6, 2023
Merged

DYN-6146 Removing PII data from a JSON workspace #14471

merged 6 commits into from
Nov 6, 2023

Conversation

jesusalvino
Copy link
Contributor

Purpose

Removing the PII data from a JSON workspace object to implement the task https://jira.autodesk.com/browse/DYN-6146

Declarations

Check these if you believe they are true

  • The codebase is in a better state after this PR
  • Is documented according to the standards
  • The level of testing this PR includes is appropriate
  • User facing strings, if any, are extracted into *.resx files
  • All tests pass using the self-service CI.
  • Snapshot of UI changes, if any.
  • Changes to the API follow Semantic Versioning and are documented in the API Changes document.
  • This PR modifies some build requirements and the readme is updated
  • This PR contains no files larger than 50 MB

Reviewers

@QilongTang
@reddyashish

FYIs

@RobertGlobant20
@Enzo707

@QilongTang QilongTang requested a review from varvaratou October 11, 2023 03:53
@QilongTang QilongTang added this to the 3.0 milestone Oct 11, 2023
static string creditCardPattern = @"(\d{4}[-, ]\d{4})";
static string ssnPattern = @"\d{3}[- ]\d{2}[- ]\d{4}";
static string ipPattern = @"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
static string datePattern = @"(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this account for different data formats like DDMMYY, DDMMYYYY, MMDDYY, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this account for different data formats like DDMMYY, DDMMYYYY, MMDDYY, etc.

Done, I have added a Unit Test too for validate more cases.

Copy link
Contributor

@aparajit-pratap aparajit-pratap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this amounting to modifying the original graph authored by a user? What if users deliberately want to insert such data in their graphs for certain specific workflows, without which their graphs might not be useful?
In other words, do we need to care about removing PII data that users are deliberately of their own volition introducing in Dynamo?

@QilongTang
Copy link
Contributor

QilongTang commented Oct 13, 2023

Isn't this amounting to modifying the original graph authored by a user? What if users deliberately want to insert such data in their graphs for certain specific workflows, without which their graphs might not be useful? In other words, do we need to care about removing PII data that users are deliberately of their own volition introducing in Dynamo?

Hi @aparajit-pratap The change to save function itself is unintended, we intend to instrument graphs without PII as the purpose. @jesusalvino I would suggest not injecting the function as part of the save process for clarity.

return new Tuple<string, JObject>(uuid, jObjectResult);
}

static string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
Copy link
Member

@mjkkirschner mjkkirschner Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this best practice? where are these regexes pulled from? I don't know how others feel, but this seems very hard to understand, reason about, or maintain - especially because there are no tests in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of note https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html https://microsoft.github.io/presidio/analyzer/

I have added an Unit Test for it so We can discuss in detail the used regexes and their scope / granularity, I have already used them in the past for others projects (As we know this feature is cross project/company).

About your references I have already considered similar options in the Spike from this task : https://jira.autodesk.com/browse/DYN-5964 . I considered the PII Helper class as the main option but its opened to be extended or changed as the Team decide.

@avidit avidit changed the title Removing PII data from a JSON workspace DYN-6146 Removing PII data from a JSON workspace Oct 17, 2023
@jesusalvino
Copy link
Contributor Author

Isn't this amounting to modifying the original graph authored by a user? What if users deliberately want to insert such data in their graphs for certain specific workflows, without which their graphs might not be useful? In other words, do we need to care about removing PII data that users are deliberately of their own volition introducing in Dynamo?

Hi @aparajit-pratap The change to save function itself is unintended, we intend to instrument graphs without PII as the purpose. @jesusalvino I would suggest not injecting the function as part of the save process for clarity.

I have removed the functionality from the Save method and create a Unit Test instead of.

/// <summary>
/// Helper Class for removing PII Data from a JSON workspace
/// </summary>
public static class PIIDetector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep this stuff internal, this should not be an API IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return property.Value;
}

public static bool ContainsEmail(string value) { return new Regex(emailPattern).Match(value).Success; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be conservative as possible with access modifiers, none of this should be something we are providing as APIs - right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/// </summary>
/// <param name="jsonObject"></param>
/// <returns></returns>
public static JObject RemovePIIData(JObject jsonObject)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code in here seems kind of fragile to changes - it uses a bunch of hardcoded string names which could easily break at runtime.

I think for the null checking if possible you should move to using the null coalescing operator for ease of readability:
https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/null-coalescing-operator

or for properties

https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/member-access-operators#null-conditional-operators--and-

In addition I don't see this function used, who will use it? Will it be called often?

I would add a try catch around this entire call internally with a catch that reports the error, or return a bool if the removal fails so downstream consumers can decide if they still wish to use the possibly tainted file.

I would probably also take all the strings out and make them consts - ideally we would use NameOf but I understand you want this in a project that should probably not reference the types you would need to do that. I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjkkirschner I have applied your suggestions, the function will be used to get a serialized workspace without PII data (https://jira.autodesk.com/browse/DYN-5964), based on a well formatted and confident json object, it will be called when an user close a workspace / graph (potentially) after checking if it's valid for send to Forge. Yes, since the project doesn't have reference to the types, the comparisons are performed to explicit values.

@mjkkirschner
Copy link
Member

@jesusalvino it looks good, but the build checks are failing, any idea if thats a fluke or something real?

@QilongTang
Copy link
Contributor

@jesusalvino Please double check this branch for building, I always trust the checks. It could be that updating this branch with master branch caused side effects or errors.

Copy link
Contributor

@QilongTang QilongTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @mjkkirschner and @aparajit-pratap for the review. One thing I was not thrilled about was the hardcoded Workspace related property names, e.g. const string Nodes = "Nodes";. I later realized that these need to be hard coded because the code is in Utility class with a recursive implementation, I do not see any work around yet other than us moving these code back so we can use nameof. But I think for now, since these properties are always going to be valid and we have been stable about our file format, I am OK with that.

@jesusalvino
Copy link
Contributor Author

@jesusalvino it looks good, but the build checks are failing, any idea if thats a fluke or something real?

I missed commit the unit test / merged with the latest master, all should be clean now

@QilongTang
Copy link
Contributor

@jesusalvino it looks good, but the build checks are failing, any idea if thats a fluke or something real?

I missed commit the unit test / merged with the latest master, all should be clean now

sure, waiting for PR checks to pass

@QilongTang
Copy link
Contributor

The regression should be fixed by #14566 already. Merging

@QilongTang QilongTang merged commit 4e0cf2e into DynamoDS:master Nov 6, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants