PDF misidentified as scanned (tesseract used) #16

Open
jwdevantier opened this issue Feb 18, 2025 · 3 comments

Comments

@jwdevantier

I have a technical PDF which I bought a while ago. I tried submitting it to kreuzberg, and my computer slowed to a crawl while tesseract spent more than 10 minutes trying to parse the document.

The thing is, this document is very much not scanned. The text is selectable and libraries like PyMuPDF have no issue extracting the text.

Since I cannot provide the PDF (copyright reasons), can you tell me how I might investigate what went wrong? Also, I do believe tesseract should be an optional choice.

@Goldziher
Owner

Goldziher commented Feb 18, 2025

Hi, certainly. You can run the code with a debugger and see what result is coming back from pypdfium2.

Also, what are your OS and tesseract version?

As for making tesseract optional, this should go into a discussion or another issue. Let's not mix potatoes with tomatoes.
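
For the debugging step above, a minimal sketch of inspecting what pypdfium2 returns for each page might look like the following. It assumes pypdfium2 version 4 or later, where `PdfTextPage.get_text_range()` is available. The helper names `extract_page_texts` and `summarize_text_layer` are hypothetical and are not part of kreuzberg; this is only a way to look at the raw text layer, not a reproduction of kreuzberg's detection logic.

```python
from __future__ import annotations


def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the embedded text of each page as pypdfium2 reports it (no OCR)."""
    # Imported lazily so the summary helper below works without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    for index in range(len(pdf)):
        textpage = pdf[index].get_textpage()
        # With no arguments, get_text_range() returns all text on the page.
        texts.append(textpage.get_text_range())
    return texts


def summarize_text_layer(page_texts: list[str]) -> dict[str, int]:
    """Count pages, characters after trimming whitespace, and pages with no text."""
    # Hypothetical helper for illustration only.
    stripped = [text.strip() for text in page_texts]
    return {
        "pages": len(stripped),
        "chars": sum(len(text) for text in stripped),
        "empty_pages": sum(1 for text in stripped if not text),
    }
```

Calling `summarize_text_layer(extract_page_texts("your-file.pdf"))` on the affected document, and pasting the resulting counts here, would show whether pypdfium2 sees a text layer at all without sharing the copyrighted content. If `chars` is large and `empty_pages` is small, the text layer is readable and the open question is why OCR was triggered anyway.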

@Goldziher
Owner

I had another report of a similar issue. I'll have to broaden the testing on my end to catch this somehow.

Do you have any other examples you can share with me?

@Goldziher
Owner

@jwdevantier The issue should be fixed in v2.1.1. Could you check on your end?

P.S. If I do not hear back from you, I will keep this issue open for another week and then close it.
