PII data file #828

PoojaHolkar · 2024-11-25T14:18:13Z

Why are these changes needed?

This is the PII data file needed to run the PII redactor notebook.

Related issue number (if any).

Fixes issue #842

shahrokhDaijavad · 2024-12-03T00:21:06Z

@touma-I This is a very different notebook than the templates we provide with the other transforms. However, it does the job of showing the effectiveness of PII in an example. It pip installs all the packages that it needs. Since it is in the "examples" folder, I wonder if we should approve it as is.

PoojaHolkar · 2024-12-08T12:31:15Z

@shahrokhDaijavad It is running on my local machine as well. I am using Python version 3.11.10.

shahrokhDaijavad · 2024-12-08T20:01:26Z

@PoojaHolkar Did you close the PR by accident? I reopened it and tried the latest version. It works perfectly on Google Colab. I am sending you a picture of the error I get on my local machine. It seems that the "wget" command is not found and as a result, invoice.pdf is not found. The importing of the transform (PIIRedactorTransform) is working correctly.

I assume it is because you have wget installed on your machine, and I don't.

Update: I installed wget on my laptop and now the notebook works perfectly on my laptop!
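One way to avoid depending on a system-level wget is to download the file with the Python standard library. The following is a minimal sketch, not what the notebook does: the helper name fetch_if_missing is hypothetical, and the caller supplies the URL and the destination path.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, dest: str) -> Path:
    """Download url to dest with the standard library; skip if dest exists.

    Hypothetical helper: it stands in for the notebook's `!wget -O` call.
    """
    target = Path(dest)
    if target.exists():
        # Nothing to do: a previous run already fetched the file.
        return target
    # Same effect as `mkdir -p` on the destination folder.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling fetch_if_missing with the raw URL of Invoice.pdf and 'input-data/Invoice.pdf' as the destination would then behave the same on a laptop without wget and on Google Colab.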

shahrokhDaijavad · 2024-12-08T20:28:11Z

@touma-I I will approve this PR. I think all the pip installs at the top of the notebook are needed for Google Colab. The Google Colab button will work once this fork is merged, and wget is a non-issue in the Colab environment.

shahrokhDaijavad

Please see my comments.

shahrokhDaijavad · 2024-12-09T16:59:30Z

@touma-I Checking the requirements.txt file for the transform and what @PoojaHolkar has in the notebook:
requirements.txt:

data-prep-toolkit>=0.2.3.dev0
presidio-analyzer>=2.2.355
presidio-anonymizer>=2.2.355
flair>=0.14.0
pandas>=2.2.2

In the notebook:

!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[all]==0.2.2'
!pip install pdfplumber
!pip install flair
!pip install spacy
!pip install presidio_analyzer
!pip install presidio_anonymizer==2.2.355

touma-I

The pip install block should not have [all] and should not list the transitive dependencies presidio-analyzer, presidio-anonymizer, or flair. It should look like this:

%%capture logpip --no-stderr
!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'
!pip install pdfplumber
!pip install spacy
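The comparison above can also be done mechanically. The sketch below is only illustrative: the helper names are hypothetical, and the TRANSITIVE set is taken from the three packages named in this review. It scans the notebook's pip lines and reports the ones that are already transitive dependencies. Names are normalized the way PEP 503 describes, so presidio_analyzer and presidio-analyzer are treated as the same package.

```python
import re

# Packages this review names as transitive dependencies of the transform.
TRANSITIVE = {"presidio-analyzer", "presidio-anonymizer", "flair"}


def normalize(name: str) -> str:
    """Normalize a distribution name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def redundant_installs(lines):
    """Return normalized names of pip-installed packages that are transitive."""
    found = []
    for line in lines:
        # Capture the package name, ignoring extras and version pins.
        match = re.match(r"\s*[!%]?pip\s+install\s+['\"]?([A-Za-z0-9._-]+)", line)
        if match and normalize(match.group(1)) in TRANSITIVE:
            found.append(normalize(match.group(1)))
    return found


# The install cell quoted from the notebook earlier in this thread.
notebook_cell = [
    "!pip install data-prep-toolkit==0.2.2",
    "!pip install 'data-prep-toolkit-transforms[all]==0.2.2'",
    "!pip install pdfplumber",
    "!pip install flair",
    "!pip install spacy",
    "!pip install presidio_analyzer",
    "!pip install presidio_anonymizer==2.2.355",
]
print(redundant_installs(notebook_cell))
# → ['flair', 'presidio-analyzer', 'presidio-anonymizer']
```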

touma-I · 2024-12-09T19:28:54Z

examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb

Please consider using the pdf2Parquet transform (instead of pdfplumber) to ingest the PDF document. It might be a bit more cumbersome to use in its current release, but we are actively making improvements to it.

touma-I · 2024-12-09T19:31:53Z

examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb

You should not be using the private method _redact_pii(). Instead you should use the pdf2Parquet transform to create a parquet file and then use the transform() method to redact the content.

shahrokhDaijavad · 2024-12-12T20:39:12Z

examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb

+   "source": [
+    "if RUNNING_IN_COLAB:\n",
+    "  !mkdir -p 'input-data'\n",
+    "  !wget -O 'input-data/Invoice.pdf' 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'\n",


Please change .../examples/notebooks/PII/invoicedata/Invoice.pdf to .../examples/notebooks/PII/input-data/Invoice.pdf
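For readers unfamiliar with the RUNNING_IN_COLAB guard in the diff above, here is a rough sketch of how such a flag and the input-data folder could be set up in plain Python. Checking sys.modules for google.colab is a common but unofficial detection idiom, and the helper name input_pdf_path is hypothetical. The notebook may define the flag differently.

```python
import sys
from pathlib import Path

# Assumption: Colab preloads the google.colab package, so its presence in
# sys.modules is used here as a runtime check (a common, unofficial idiom).
RUNNING_IN_COLAB = "google.colab" in sys.modules


def input_pdf_path(root: str = ".") -> Path:
    """Return <root>/input-data/Invoice.pdf, creating the folder if needed."""
    folder = Path(root) / "input-data"
    # Equivalent to `mkdir -p 'input-data'` in the notebook cell.
    folder.mkdir(parents=True, exist_ok=True)
    return folder / "Invoice.pdf"
```

Keeping the local folder name (input-data) identical to the folder name in the repository URL is what the requested path change achieves.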

shahrokhDaijavad

@PoojaHolkar I tested the Google Colab notebook after making the change manually, and it worked correctly.

Update: I pushed the change to the repo.

shahrokhDaijavad

I approve this from a functionality point of view.

touma-I · 2024-12-19T14:23:57Z

examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb

@PoojaHolkar Why do you need if __name__ == "__main__": in step 4? That condition is always true in a notebook. When copying and pasting from the scripts, please adapt the code to your use case and runtime environment. cc: @shahrokhDaijavad

touma-I · 2024-12-19T14:26:38Z

examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb

@PoojaHolkar In step 5, I don't see the need for redactor = PIIRedactorTransform(config). I do not see where it is being used. If additional configuration was needed, it should have been done before launcher.launch() in step 4. cc: @shahrokhDaijavad

I am using redactor here:

for text in data["contents"]:
    redacted_text, detected_entities = redactor._redact_pii(text)
    redacted_texts.append(redacted_text)
    detected_entities_list.append(detected_entities)

to display the redacted information in the output.

Let me change it so that it is called before launcher.launch(), as you suggested.
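To illustrate the review point about calling a public transform() method instead of a private helper such as _redact_pii(), here is a toy sketch. ToyRedactor is a hypothetical stand-in and is not the real PIIRedactorTransform. The regex-based email masking, the dict-of-columns table, and the redacted_contents column name are all assumptions made for the example. Only the contents column name comes from the loop quoted above.

```python
import re


class ToyRedactor:
    """Hypothetical stand-in for a table transform; NOT the real PIIRedactorTransform."""

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def _redact(self, text: str) -> str:
        # Private helper: an implementation detail that callers should not rely on.
        return self._EMAIL.sub("<EMAIL_ADDRESS>", text)

    def transform(self, table: dict) -> dict:
        # Public entry point: takes a column-oriented table and returns a new
        # table with one extra column, leaving the input table unchanged.
        redacted = [self._redact(text) for text in table["contents"]]
        return {**table, "redacted_contents": redacted}


table = {"contents": ["Invoice for jane@example.com", "No PII here"]}
result = ToyRedactor().transform(table)
print(result["redacted_contents"])
# → ['Invoice for <EMAIL_ADDRESS>', 'No PII here']
```

Notebook code that goes through transform() keeps working if the private helper is later renamed or has its return value changed.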

touma-I

This is better than before but still needs additional work to be correct. Please see associated comments with the notebook.

PoojaHolkar · 2024-12-19T18:37:18Z

@touma-I Please verify the changes. I have pushed the needed updates.

shahrokhDaijavad · 2024-12-19T19:12:44Z

@PoojaHolkar and @touma-I From a functionality point of view, I just tested the latest notebook both in my local venv and in Google Colab and verified that the notebook runs successfully to the end in both environments.

PoojaHolkar · 2024-12-22T13:31:29Z

@touma-I I have changed the notebook to call the PII transform itself rather than its internal functions. The developer, @SowmyaLR, explained the variations that need to be specified when appending the column names; otherwise the transform fails because of duplicate column names. Please have a look. I reckon this now adheres to the transform guidelines.

touma-I

Please do one more pass and clean up the unnecessary pip installs. Thanks.

touma-I · 2024-12-26T16:29:50Z

examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb

@PoojaHolkar Do you still need these?

!pip install pdfplumber
!pip install spacy

Corrected and pushed.

PoojaHolkar closed this Nov 25, 2024

PoojaHolkar reopened this Nov 25, 2024

PoojaHolkar force-pushed the dev branch 4 times, most recently from 1243feb to 7282f1a Compare November 27, 2024 07:09

agoyal26 requested a review from touma-I November 27, 2024 08:25

shahrokhDaijavad mentioned this pull request Dec 2, 2024

[Feature] PII example for kickstart #842

shahrokhDaijavad and others added 21 commits December 4, 2024 17:42

Update README.md

c59bc6c

Signed-off-by: Pooja Holkar <[email protected]>

Update README.md

9c58af3

Signed-off-by: Pooja Holkar <[email protected]>

Update README.md

501570c

Signed-off-by: Pooja Holkar <[email protected]>

Update README.md

30c8a19

transform => transforms Signed-off-by: Pooja Holkar <[email protected]>

Update README inweb2parquet

ebfe95e

syntax issues Signed-off-by: Pooja Holkar <[email protected]>

Update README.md for the web2parquet

eb5d0ad

A few syntax changes Signed-off-by: Pooja Holkar <[email protected]>

Update README-list.md

af8cdd8

Signed-off-by: Pooja Holkar <[email protected]>

Update README-list.md

3738e51

Signed-off-by: Pooja Holkar <[email protected]>

Update README.md

d87c992

Signed-off-by: Pooja Holkar <[email protected]>

Update README.md

b0beaf5

Signed-off-by: Pooja Holkar <[email protected]>

Create test

a00380b

Signed-off-by: Pooja Holkar <[email protected]>

PII input file

7be8cb1

Signed-off-by: Pooja Holkar <[email protected]>

PII_redactor code example

2aceec1

Signed-off-by: Pooja Holkar <[email protected]>

invoice data

d16f0f7

Signed-off-by: Pooja Holkar <[email protected]>

upload data

ee735cd

Signed-off-by: Pooja Holkar <[email protected]>

upload data

d825e8b

Signed-off-by: Pooja Holkar <[email protected]>

Delete examples/notebooks/PII/Invoice.pdf

22ec3fd

Signed-off-by: Pooja Holkar <[email protected]>

Delete examples/notebooks/PII/invoicedata/test.py

a1965c2

Signed-off-by: Pooja Holkar <[email protected]>

notebook recipe for PII redaction code

ea9d692

The notebook is a Kickstarter for using PII redaction transform Signed-off-by: Pooja Holkar <[email protected]>

update pdf2parquet README

f47b45f

Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Pooja Holkar <[email protected]>

add data_files_to_use

1f1764e

Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Pooja Holkar <[email protected]>

shahrokhDaijavad reopened this Dec 8, 2024

shahrokhDaijavad approved these changes Dec 8, 2024

touma-I requested changes Dec 9, 2024

pholkar1 and others added 5 commits December 12, 2024 23:04

parquet inout uodate

d13bb5c

colab version notebook

fcf9aa8

Signed-off-by: Pooja Holkar <[email protected]>

Merge branch 'IBM:dev' into dev

e520a95

Delete examples/notebooks/PII/invoicedata directory

a8b226c

update repo

colab version updated notebook

c6d2e29

Signed-off-by: Pooja Holkar <[email protected]>

shahrokhDaijavad reviewed Dec 12, 2024

shahrokhDaijavad requested changes Dec 12, 2024

fixed the path to the Invoice.odf

2f2f0ca

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>

shahrokhDaijavad approved these changes Dec 17, 2024

touma-I reviewed Dec 19, 2024

touma-I requested changes Dec 19, 2024

required code changes

896713b

edited code to pull full transform

e8318b0

touma-I reviewed Dec 26, 2024

removed unneccessary installs

8bdcf7c

touma-I self-requested a review December 27, 2024 16:48

touma-I approved these changes Dec 27, 2024

touma-I merged commit 6a06d87 into IBM:dev Dec 27, 2024
1 check passed
