-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PII data file #828
PII data file #828
Conversation
1243feb
to
7282f1a
Compare
@touma-I This is a very different notebook than the templates we provide with the other transforms. However, it does the job of showing the effectiveness of PII in an example. It pip installs all the packages that it needs. Since it is in the "examples" folder, I wonder if we should approve it as is. |
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
transform => transforms Signed-off-by: Pooja Holkar <[email protected]>
syntax issues Signed-off-by: Pooja Holkar <[email protected]>
A few syntax changes Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
The notebook is a Kickstarter for using PII redaction transform Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Pooja Holkar <[email protected]>
@shahrokhDaijavad it is running on my local as well. am using python version 3.11.10 |
@PoojaHolkar Did you close the PR by accident? I reopened it and tried the latest version. It works perfectly on Google Colab. I am sending you a picture of the error I get on my local machine. It seems that the "wget" command is not found and as a result, invoice.pdf is not found. The importing of the transform (PIIRedactorTransform) is working correctly. I assume it is because you have wget installed on your machine, and I don't. Update: I installed wget on my laptop and now the notebook works perfectly on my laptop! |
@touma-I I will approve this PR. I think all the pip installs at the top of the notebook are needed for the Google Colab and the Google Colab button will work, when this fork is merged and wget is a non-issue in the Colab environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comments.
@touma-I Checking the requirements.txt file for the transform and what @PoojaHolkar has in the notebook: data-prep-toolkit>=0.2.3.dev0 In the notebook: !pip install data-prep-toolkit==0.2.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pip install block should not have [all] and should not have the transitive dependencies presidio-analyzer, presidio-anonymizer nor flair. It should look like this:
%%capture logpip --no-stderr
!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'
!pip install pdfplumber
!pip install spacy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider using pdf2Parquet transform (instead pdfplumber) in order to ingest the pdf document. It might be a bit more cumbersome to use in its current release but we are actually making improvements to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should not be using the private method _redact_pii(). Instead you should use the pdf2Parquet transform to create a parquet file and then use the transform() method to redact the content.
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
"source": [ | ||
"if RUNNING_IN_COLAB:\n", | ||
" !mkdir -p 'input-data'\n", | ||
" !wget -O 'input-data/Invoice.pdf' 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change .../examples/notebooks/PII/invoicedata/Invoice.pdf to .../examples/notebooks/PII/input-data/Invoice.pdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PoojaHolkar I tested the Google colab notebook, after making the change manually and it worked correctly.
Update: I pushed the change to the repo.
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve this from functionality point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PoojaHolkar why do you need if __name__ == "__main__"
: in step 4 ?__name__
is always true in a notebook. When doing copy/paste from the scripts, please try to adapt the code you are copying to your use case and runtime environment. cc: @shahrokhDaijavad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PoojaHolkar In step 5, I don't see the need for redactor = PIIRedactorTransform(config)
. It do not see where it is being used. If additional configurations were needed, those should have been done before launcher.launch() in step 4. cc: @shahrokhDaijavad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am using redactor here :
'for text in data["contents"]:
redacted_text, detected_entities = redactor._redact_pii(text)
redacted_texts.append(redacted_text)
detected_entities_list.append(detected_entities)
for displaying the redacted information in the output .
Let me change calling it before launcher.launch() as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better than before but still needs additional work to be correct. Please see associated comments with the notebook.
@touma-I please verify the changes, I have pushed the needed. |
@PoojaHolkar and @touma-I From a functionality point of view, I just tested the latest notebook both in my local venv and in Google Colab and verified that the notebook runs successfully to the end in both environments. |
@touma-I have made the changes for pulling PII transform and not its internal functions, the developer @SowmyaLR explained the variations needed to specify in appending the column names else the transform will fail due to the duplicity in column names. Please have a look, I reckon this will now adhere with the transform guidelines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do one more pass and clean up unnecessary pip install. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PoojaHolkar Do you still need ?
!pip install pdfplumber
!pip install spacy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrected and pushed.
Why are these changes needed?
This is PII data file needed to run the PII reactor notebook
Related issue number (if any).
Fixes issue #842