Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PII data file #828

Merged
merged 54 commits into from
Dec 27, 2024
Merged

PII data file #828

merged 54 commits into from
Dec 27, 2024

Conversation

PoojaHolkar
Copy link
Contributor

@PoojaHolkar PoojaHolkar commented Nov 25, 2024

Why are these changes needed?

This is PII data file needed to run the PII reactor notebook

Related issue number (if any).

Fixes issue #842

@shahrokhDaijavad
Copy link
Member

@touma-I This is a very different notebook than the templates we provide with the other transforms. However, it does the job of showing the effectiveness of PII in an example. It pip installs all the packages that it needs. Since it is in the "examples" folder, I wonder if we should approve it as is.

shahrokhDaijavad and others added 21 commits December 4, 2024 17:42
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
transform => transforms

Signed-off-by: Pooja Holkar <[email protected]>
syntax issues

Signed-off-by: Pooja Holkar <[email protected]>
A few syntax changes

Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
The notebook is a Kickstarter for using PII redaction transform

Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Pooja Holkar <[email protected]>
@PoojaHolkar
Copy link
Contributor Author

@shahrokhDaijavad it is running on my local as well. am using python version 3.11.10

@shahrokhDaijavad
Copy link
Member

shahrokhDaijavad commented Dec 8, 2024

@PoojaHolkar Did you close the PR by accident? I reopened it and tried the latest version. It works perfectly on Google Colab. I am sending you a picture of the error I get on my local machine. It seems that the "wget" command is not found and as a result, invoice.pdf is not found. The importing of the transform (PIIRedactorTransform) is working correctly.

I assume it is because you have wget installed on your machine, and I don't.
image

Update: I installed wget on my laptop and now the notebook works perfectly on my laptop!

@shahrokhDaijavad
Copy link
Member

@touma-I I will approve this PR. I think all the pip installs at the top of the notebook are needed for the Google Colab and the Google Colab button will work, when this fork is merged and wget is a non-issue in the Colab environment.

Copy link
Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments.

@shahrokhDaijavad
Copy link
Member

@touma-I Checking the requirements.txt file for the transform and what @PoojaHolkar has in the notebook:
requirements.txt:

data-prep-toolkit>=0.2.3.dev0
presidio-analyzer>=2.2.355
presidio-anonymizer>=2.2.355
flair>=0.14.0
pandas>=2.2.2

In the notebook:

!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[all]==0.2.2'
!pip install pdfplumber
!pip install flair
!pip install spacy
!pip install presidio_analyzer
!pip install presidio_anonymizer==2.2.355

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pip install block should not have [all] and should not have the transitive dependencies presidio-analyzer, presidio-anonymizer nor flair. It should look like this:

%%capture logpip --no-stderr
!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'
!pip install pdfplumber
!pip install spacy

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using pdf2Parquet transform (instead pdfplumber) in order to ingest the pdf document. It might be a bit more cumbersome to use in its current release but we are actually making improvements to this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not be using the private method _redact_pii(). Instead you should use the pdf2Parquet transform to create a parquet file and then use the transform() method to redact the content.

"source": [
"if RUNNING_IN_COLAB:\n",
" !mkdir -p 'input-data'\n",
" !wget -O 'input-data/Invoice.pdf' 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change .../examples/notebooks/PII/invoicedata/Invoice.pdf to .../examples/notebooks/PII/input-data/Invoice.pdf

Copy link
Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PoojaHolkar I tested the Google colab notebook, after making the change manually and it worked correctly.

Update: I pushed the change to the repo.

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Copy link
Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve this from functionality point of view.

Copy link
Collaborator

@touma-I touma-I Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PoojaHolkar why do you need if __name__ == "__main__": in step 4 ?__name__ is always true in a notebook. When doing copy/paste from the scripts, please try to adapt the code you are copying to your use case and runtime environment. cc: @shahrokhDaijavad

Copy link
Collaborator

@touma-I touma-I Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PoojaHolkar In step 5, I don't see the need for redactor = PIIRedactorTransform(config). It do not see where it is being used. If additional configurations were needed, those should have been done before launcher.launch() in step 4. cc: @shahrokhDaijavad

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am using redactor here :

'for text in data["contents"]:
redacted_text, detected_entities = redactor._redact_pii(text)
redacted_texts.append(redacted_text)
detected_entities_list.append(detected_entities)

for displaying the redacted information in the output .

Let me change calling it before launcher.launch() as you suggested.

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better than before but still needs additional work to be correct. Please see associated comments with the notebook.

@PoojaHolkar
Copy link
Contributor Author

@touma-I please verify the changes, I have pushed the needed.

@shahrokhDaijavad
Copy link
Member

@PoojaHolkar and @touma-I From a functionality point of view, I just tested the latest notebook both in my local venv and in Google Colab and verified that the notebook runs successfully to the end in both environments.

@PoojaHolkar
Copy link
Contributor Author

@touma-I have made the changes for pulling PII transform and not its internal functions, the developer @SowmyaLR explained the variations needed to specify in appending the column names else the transform will fail due to the duplicity in column names. Please have a look, I reckon this will now adhere with the transform guidelines.

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do one more pass and clean up unnecessary pip install. Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PoojaHolkar Do you still need ?

!pip install pdfplumber
!pip install spacy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected and pushed.

@touma-I touma-I self-requested a review December 27, 2024 16:48
@touma-I touma-I merged commit 6a06d87 into IBM:dev Dec 27, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants