PDF misidentified as scanned (tesseract used) #16

jwdevantier · 2025-02-18T11:48:51Z

I have a technical PDF which I bought a while ago. I tried submitting it to kreuzberg, and my computer slowed to a crawl while tesseract spent more than 10 minutes trying to parse the document.

The thing is, this document is very much not scanned. The text is selectable and libraries like PyMuPDF have no issue extracting the text.

Since I cannot provide you with the PDF (copyright reasons), can you tell me how I might investigate what went wrong? Also, I do believe you should allow tesseract to be an optional choice.

Goldziher · 2025-02-18T11:51:31Z

Hi, certainly - you can run the code with a debugger and see what result is coming from pypdfium2.

Also, what's your OS and tesseract version?

As for making tesseract optional, this should go into a discussion or another issue. Let's not mix potatoes with tomatoes.
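
For anyone who wants to check what pypdfium2 returns without stepping through a debugger, here is a minimal sketch. It assumes a pypdfium2 release where `PdfDocument`, `get_textpage()` and `get_text_range()` are available. The `looks_scanned` helper and its `min_chars` threshold are hypothetical and only illustrate the kind of check involved; this is not kreuzberg's actual detection logic.

```python
from typing import Iterable, List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page via pypdfium2, without any OCR."""
    # Imported lazily so looks_scanned() can be used without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    for index in range(len(pdf)):
        textpage = pdf[index].get_textpage()
        texts.append(textpage.get_text_range())
    return texts


def looks_scanned(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Hypothetical heuristic (an assumption, not kreuzberg's logic):
    report True only if no page has at least min_chars non-whitespace characters."""
    for text in page_texts:
        visible = sum(1 for ch in text if not ch.isspace())
        if visible >= min_chars:
            return False
    return True
```

Calling `looks_scanned(extract_page_texts("your.pdf"))` on the problematic file, and comparing the per-page character counts with what PyMuPDF extracts, should show whether pypdfium2 is returning little or no text for a document whose text is selectable.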

Goldziher · 2025-03-01T08:08:44Z

I had another report of a similar issue. I'll have to broaden the testing on my end to catch this somehow.

Do you have any other examples you can share with me?

Goldziher · 2025-03-01T12:46:13Z

@jwdevantier The issue should be fixed in v2.1.1. Could you check on your end?

P.S. I will keep this issue open for another week if you do not reply, and then close it.
