PDF misidentified as scanned (tesseract used) #16
Comments
Hi, certainly. You can run the code with a debugger and see what result is coming back from pypdfium2. Also, what are your OS and tesseract version? As for making tesseract optional, that should go into a discussion or a separate issue; let's not mix potatoes with tomatoes.
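For anyone wanting to try this without a debugger, here is a minimal sketch of how one might inspect what pypdfium2 returns for each page. The helper names (`looks_scanned`, `extract_page_texts`) and the 20-characters-per-page threshold are assumptions made for illustration; this is not Kreuzberg's actual detection logic, which may differ.

```python
import sys


def looks_scanned(page_texts, min_chars_per_page=20):
    """Hypothetical heuristic: treat a PDF as scanned when its text layer
    yields fewer than `min_chars_per_page` characters per page on average.
    The threshold is an arbitrary assumption, not a library default."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def extract_page_texts(path):
    """Return the text layer of every page as reported by pypdfium2."""
    # Imported lazily so the heuristic above works without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(path)
    try:
        texts = []
        for index in range(len(pdf)):
            textpage = pdf[index].get_textpage()
            texts.append(textpage.get_text_range())
        return texts
    finally:
        pdf.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    page_texts = extract_page_texts(sys.argv[1])
    for number, text in enumerate(page_texts, start=1):
        # repr() makes empty or garbled extraction results easy to spot.
        print(f"page {number}: {len(text.strip())} chars, starts {text[:60]!r}")
    print("looks scanned:", looks_scanned(page_texts))
```

Running it as `python check_pdf.py document.pdf` (the script name is arbitrary) prints a per-page character count. If the counts are healthy but OCR is still triggered, the problem is more likely in the detection step than in the text extraction itself.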
I had another report of a similar issue. I'll have to broaden the testing on my end to catch this somehow. Do you have any other examples you can share with me?
@jwdevantier The issue should be fixed in v2.1.1; could you check on your end? P.S. I will keep this issue open for another week, and close it then if there is no reply.
I have a technical PDF which I bought a while ago. I tried submitting it to kreuzberg, and my computer slowed to a crawl while tesseract spent more than 10 minutes trying to parse the document.
The thing is, this document is very much not scanned. The text is selectable and libraries like PyMuPDF have no issue extracting the text.
Since I cannot provide you with the PDF (copyright reasons), can you tell me how I might investigate what went wrong? Also, I do believe you should make tesseract an optional choice.