-
Hi, checksum does raise the score to 1.0. Could you please provide some examples of false positives that pass a checksum?
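As a rough illustration of how such false positives can arise, here is a minimal sketch that estimates how often an arbitrary 9-digit number passes a weighted mod-11 check. The weights are the commonly published ones for the TFN check and are assumed here, not copied from Presidio's code; `passes_tfn_checksum` and `estimate_pass_rate` are hypothetical helper names.

```python
import random

# Assumption: commonly published weights for the 9-digit TFN check.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def passes_tfn_checksum(digits: str) -> bool:
    """Return True if a 9-digit string satisfies the weighted mod-11 check."""
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS))
    return total % 11 == 0


def estimate_pass_rate(samples: int = 100_000, seed: int = 0) -> float:
    """Estimate the share of random 9-digit strings that pass the check."""
    rng = random.Random(seed)
    hits = sum(
        passes_tfn_checksum(f"{rng.randrange(10**9):09d}") for _ in range(samples)
    )
    return hits / samples


if __name__ == "__main__":
    print(f"Estimated pass rate: {estimate_pass_rate():.3f}")
```

Because the check is a single mod-11 test, roughly one in eleven arbitrary 9-digit numbers should pass it, which is consistent with seeing matches in spreadsheets and logs that are full of numbers.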
-
Then this would do the trick:
-
Wouldn't this just remove TFN detection? We still need it.
-
Oh, I misread your previous comment. I am not an expert in Australian codes, but I would consider removing the TFN recognizer and creating a new recognizer class that inherits from it with slightly different logic. Alternatively, since Presidio does not remove multiple results for the same substring, I would consider a post-processing step that checks whether a string is detected as both ABN and TFN (as an example). That, however, would only help in specific cases like the ones you've given.
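The post-processing idea could be sketched roughly as follows. This is only an illustration: `Result` is a hypothetical stand-in for an analyzer result (entity type, span, score) rather than Presidio's own class, and `cap_ambiguous_scores`, the entity names and the 0.5 cap are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for an analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def cap_ambiguous_scores(
    results: Iterable[Result],
    ambiguous: FrozenSet[str] = frozenset({"AU_ABN", "AU_TFN"}),
    cap: float = 0.5,
) -> List[Result]:
    """Cap the score of spans flagged as more than one of the ambiguous types."""
    results = list(results)

    # Collect every entity type reported for each exact (start, end) span.
    types_by_span = defaultdict(set)
    for r in results:
        types_by_span[(r.start, r.end)].add(r.entity_type)

    capped = []
    for r in results:
        overlap = types_by_span[(r.start, r.end)] & ambiguous
        if r.entity_type in ambiguous and len(overlap) > 1:
            # Same substring matched several ambiguous types: lower confidence.
            r = replace(r, score=min(r.score, cap))
        capped.append(r)
    return capped
```

Results on spans that match only one of the listed entity types are returned unchanged, so this only helps for substrings that several recognizers agree on.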
-
This will be a long reply, so apologies for the size! I know you have already replied about the custom recognizer, thank you for that. I thought I would add some info in case you were interested. This gets picked up because it passes both the pattern and the checksum validation, even though it does not have any matching context words. TEXT: Here is the log output.
-
Checksum validation automatically boosts the confidence to 1.0 regardless of context words, assuming that checksum should have a very specific logic for which very few FPs are expected. If you would like to have a lower score for a string that passed the checksum validation, you could configure it this way:

    from presidio_analyzer import EntityRecognizer
    EntityRecognizer.MAX_SCORE = 0.5

Then, all checksum validations would result in a boost to 0.5, and context words could potentially boost this even further. Hope this helps!
-
Hi all,
We are getting a lot of false positives with our recognizers, mainly with TFN and PCI, and with other recognizers that use checksums.
We get many false matches in spreadsheets and even logs, because of the probability that an arbitrary number passes the checksum.
It seems the code sets the confidence to 1 if the checksum passes, and the context words are only checked after this step.
So the code does not seem to take context words into account?
Is there a way we can make these recognizers more accurate, for example by taking the context words into account?
Cheers
Chris