Feature upgrade apache pdfbox #4450

Merged (5 commits) on Jan 17, 2025

buchen (Member) commented Jan 3, 2025

First draft for #4449

It switches to PDFBox 3.0.3 and, in case of errors, falls back to PDFBox 1.8.
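The fallback idea can be sketched roughly as follows. This is a minimal illustration, not the code of this pull request: the class `FallbackExtractor` and its parameters are hypothetical, and the two PDFBox code paths are replaced by plain `Function<byte[], String>` stand-ins so that the example runs without any PDF library. Real extraction code would likely throw checked exceptions that need to be wrapped or caught explicitly.

```java
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: try the new extractor first and fall back to the old one on failure. */
final class FallbackExtractor {
    private static final Logger LOG = Logger.getLogger(FallbackExtractor.class.getName());

    private FallbackExtractor() {}

    /**
     * @param pdf      raw bytes of the PDF document
     * @param primary  hypothetical stand-in for the PDFBox 3.0.3 extraction path
     * @param fallback hypothetical stand-in for the PDFBox 1.8 extraction path
     * @return the text produced by the primary extractor, or by the fallback if the primary failed
     */
    static String extract(byte[] pdf,
                          Function<byte[], String> primary,
                          Function<byte[], String> fallback) {
        try {
            return primary.apply(pdf);
        } catch (RuntimeException e) {
            // The new version failed on this document: log it and retry with the old one.
            LOG.log(Level.WARNING, "Primary extractor failed, using fallback", e);
            return fallback.apply(pdf);
        }
    }
}
```

With this shape, an exception from the fallback itself still propagates to the caller, so a document that neither version can read is reported as an error rather than silently producing empty text.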

What I am still thinking about: how to learn whether and where the text extracted by the two PDFBox versions differs.

Right now, I am printing a log message. But users will most likely not notice it, let alone inform us.

I wonder if we should create both versions of the extracted text and, if there are differences, put both into the debug text. Then at least we would see which documents differ.
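A rough sketch of that proposal, under stated assumptions: the class `ExtractionComparison`, the section headers, and the line-ending normalization are all hypothetical choices made for illustration and are not taken from this pull request. The inputs are simply the two extracted texts.

```java
/** Sketch: put both extracted texts into the debug text when they differ. */
final class ExtractionComparison {
    private ExtractionComparison() {}

    /** Normalizes line endings so that "\r\n" versus "\n" alone is not reported as a difference (an assumption). */
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Returns true if the two texts differ after line-ending normalization. */
    static boolean differs(String newText, String oldText) {
        return !normalize(newText).equals(normalize(oldText));
    }

    /** Returns only the new text when both agree, otherwise both versions under hypothetical headers. */
    static String debugText(String newText, String oldText) {
        if (!differs(newText, oldText)) {
            return normalize(newText);
        }
        return "=== extracted text differs between versions ===\n"
                + "--- new version ---\n" + normalize(newText) + "\n"
                + "--- old version ---\n" + normalize(oldText) + "\n";
    }
}
```

A user who submits the debug text for a failing document would then automatically submit both versions, without having to notice a log message first.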

buchen requested a review from Nirus2000 on January 3, 2025, 11:06
ZfT2 (Contributor) commented Jan 4, 2025

Maybe just provide a separate "beta" version of PP with (only) PDFBox 3, post it in the forums, and let the community test it?
If no (bigger) failures are reported after some time, upgrade the main release as well...

Advantage: we will (hopefully) only have to deal with one PDFBox version in the future.

buchen force-pushed the feature_upgrade_apache_pdfbox branch from 05ea668 to aeea8dd on January 6, 2025, 10:27
buchen (Member, Author) commented Jan 6, 2025

> Advantage: we will (hopefully) only deal with one pdfbox Version in future.

Yes. That is the goal.

I have now made two more changes to help us learn about differences.

The extracted debug text now indicates whether the two versions produce different output. At least we will learn if there are differences.

[Screenshot 2025-01-06 at 11:21:28]

I also print the diff to the log file, in case we want to better understand what the differences are. The differences seen so far are not material (better extraction of special characters, additional line breaks in addresses, etc.).

[Screenshot 2025-01-06 at 11:10:08]
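To illustrate what a line-based diff of two extracted texts can look like, here is a small self-contained sketch based on the longest common subsequence of lines. The class `LineDiff` and the `- ` / `+ ` prefixes are assumptions for illustration only; the diff routine actually used in this pull request may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: line diff of two texts via the longest common subsequence (LCS). */
final class LineDiff {
    private LineDiff() {}

    /** Returns lines prefixed with "  " (unchanged), "- " (only in old) or "+ " (only in new). */
    static List<String> diff(String oldText, String newText) {
        String[] a = oldText.split("\n", -1);
        String[] b = newText.split("\n", -1);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both texts from the start, emitting common lines and the deviations between them.
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add("  " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a[i++]);
            } else {
                out.add("+ " + b[j++]);
            }
        }
        while (i < a.length) {
            out.add("- " + a[i++]);
        }
        while (j < b.length) {
            out.add("+ " + b[j++]);
        }
        return out;
    }
}
```

The table needs memory proportional to the product of the two line counts, which should be acceptable for typical bank documents of a few pages but is not meant for very large texts.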

buchen added the pdf label on Jan 6, 2025
stoeggich commented

Could you make it possible to check many files at once? Then I would run all of my documents through it myself.

buchen (Member, Author) commented Jan 10, 2025

I added a menu option to create diffs for multiple files.
It is only visible if you have activated the "experimental features" in the settings.

[Screenshot 2025-01-10 at 10:07:59]
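Checking many files at once boils down to a loop like the one below. This is only a sketch under assumptions: `BatchComparison` and its method are hypothetical names, and the two extraction paths are again passed in as `Function<byte[], String>` stand-ins rather than real PDFBox calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch: run two extractors over many files and collect the files whose output differs. */
final class BatchComparison {
    private BatchComparison() {}

    /**
     * @param files        the documents to check
     * @param newExtractor hypothetical stand-in for the new extraction path
     * @param oldExtractor hypothetical stand-in for the old extraction path
     * @return the files for which the two extractors produce different text, in input order
     */
    static List<Path> filesWithDifferences(List<Path> files,
                                           Function<byte[], String> newExtractor,
                                           Function<byte[], String> oldExtractor) {
        List<Path> differing = new ArrayList<>();
        for (Path file : files) {
            byte[] content;
            try {
                content = Files.readAllBytes(file);
            } catch (IOException e) {
                throw new UncheckedIOException("Cannot read " + file, e);
            }
            String newText = newExtractor.apply(content);
            String oldText = oldExtractor.apply(content);
            if (!newText.equals(oldText)) {
                differing.add(file);
            }
        }
        return differing;
    }
}
```

Only the files returned here would need a closer look, and their extracted text would still have to be anonymized before it is posted.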

What it does not support: anonymizing the data via mouse click. You might have to do that manually.

Please post meaningful deltas in issue #4449.

So far, I think the diffs are manageable. They do not look like material differences that have an impact on the relevant regular expressions.

Nirus2000 (Member) commented

Perhaps you should also add “Experimental” to the menu item, as in the XML document menu item.
By the way, why is “XML documents” grayed out, although “marked as experimental” is activated?

buchen (Member, Author) commented Jan 14, 2025

> BTW. why is “XML documents” grayed out, although the “marked as experimental” is activated?

Because it is a "headline": all supported XML documents should come after it. There is only one at the moment. Menus do not support headlines, only deactivated items.

The item is only visible when the experimental features are activated. I hope that is enough labelling. We'll remove it in a couple of weeks anyway. (or maybe in years... who knows ;-))

buchen merged commit 6abc8d0 into master on Jan 17, 2025
2 checks passed