fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR #3822

dhdaines · 2024-12-12T02:43:03Z

Verified on my very large documents that it doesn't unnecessarily and unsuccessfully "repair" them.

You may or may not wish to keep the version check in patch_psparser. Since you're pinning the version of pdfminer.six, and since it isn't guaranteed that the bug in question will be fixed in the next pdfminer.six release (though it is rather serious, so I should hope so), perhaps you just want to patch it unconditionally. On second look, it seems that version pinning is only in effect when running from Docker (good!), so never mind: keep that version check!
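For readers weighing the two options, the difference between a version-gated patch and an unconditional one can be sketched roughly as below. This is only an illustration under assumptions, not the code in this PR: `FakeParser`, `AFFECTED_VERSIONS`, `patched_next_token`, and `apply_patch` are hypothetical names, and the version strings are placeholders rather than real pdfminer.six releases.

```python
from typing import Optional

# Hypothetical: versions of the dependency assumed to contain the bug.
AFFECTED_VERSIONS = {"1.0.0"}


class FakeParser:
    """Stand-in for a third-party class whose method we want to patch."""

    def next_token(self) -> str:
        return "original"


def patched_next_token(self) -> str:
    """Replacement method installed by the patch."""
    return "patched"


def apply_patch(installed_version: Optional[str], unconditional: bool = False) -> bool:
    """Replace FakeParser.next_token if the installed version is affected.

    Returns True when the patch was applied, False otherwise.
    """
    if unconditional or installed_version in AFFECTED_VERSIONS:
        FakeParser.next_token = patched_next_token
        return True
    return False
```

The gated form leaves unaffected (for example, newer and already fixed) versions untouched, which matters when the dependency is not pinned everywhere; the unconditional form is simpler but only safe when the installed version is known.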

Also corrected an import so that if you do feel like using a newer version of pdfminer.six, it won't break on you.
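One generic way to keep an import working across releases that move a name between modules is to try several candidate modules in order. The helper below is a sketch under assumptions, not what this PR does: `import_first` is a hypothetical name, and the modules in the usage line are standard-library stand-ins.

```python
import importlib
from typing import Any, Sequence


def import_first(attr: str, module_names: Sequence[str]) -> Any:
    """Return `attr` from the first module in `module_names` that provides it."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Module does not exist in this installation; try the next one.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of {list(module_names)}")


# Usage with stand-in modules: the first candidate is missing, the second works.
sqrt = import_first("sqrt", ["no_such_module_xyz", "math"])
```

For PSSyntaxError, the candidate list would be the pdfminer modules that export it; which modules those are in a given release should be checked against the installed version.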

PhorstenkampFuzzy · 2025-01-18T23:19:52Z

Any update on this?

qued

LGTM!

Here's what I checked:

- Ran the tests with this branch to verify that invalid-pdf-structure-pdfminer-entire-doc.pdf is being repaired while invalid-pdf-structure-pdfminer-one-page.pdf is not.
- Processed invalid-pdf-structure-pdfminer-one-page.pdf (unrepaired) using the fast strategy with the new code and looked over the results to ensure the doc is being processed properly.
- Processed invalid-pdf-structure-pdfminer-entire-doc.pdf (repaired) using the fast strategy with the new code and looked over the results to ensure the doc is being processed properly.
- Processed the linked doc with both this branch and the main branch to verify that the doc fails before the fix and succeeds afterwards.

qued · 2025-01-24T19:52:57Z

@dhdaines Thanks for fixing this!

dhdaines and others added 6 commits December 11, 2024 20:15

fix: correctly patch EOF handling in pdfminer (fixes: Unstructured-IO…

e0f464a

…#3815)

chore: add missing newline

1637377

docs: clarify exactly what we are patching here

7d87840

fix: correct the import of PSSyntaxError

39b2472

docs: document what patch_psparser does

99b1c61

chore: ruff

6cba88a

dhdaines changed the title ~~Fix the fix to pdfminer~~ Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR Dec 12, 2024

chore: changelog

1b0c7f7

Merge branch 'main' into fix_the_fix_to_pdfminer

b974720

qued self-requested a review January 24, 2025 16:23

qued approved these changes Jan 24, 2025

qued and others added 2 commits January 24, 2025 13:34

Merge branch 'main' into fix_the_fix_to_pdfminer

8264d73

Update version

897ffad

qued merged commit 9e5ff22 into Unstructured-IO:main Jan 24, 2025
41 checks passed
