-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of FileSystemFontProvider.scanFonts() #189
base: 3.0
Are you sure you want to change the base?
Conversation
Thanks for the contribution, looks promising at least at a first sight. And yes, please split up the patch im smaller parts so that it is easier to check the changes. We won't add breaking changes to 3.x, those are limited to the trunk. But maybe we find a way to backport at least some of them without breaking the api. |
Hi, any update on this? Looks like a worthwhile enhancement (as far as I can tell). |
I'm sorry, I didn't have much free time, but refactoring is in progress. I am shrinking changes to only apply to LoadOnlyHeaders case. As for those changes that affect both LoadOnlyHeaders and non-LoadOnlyHeaders parsers - I will push them separately, for easier testing. When done, I will post performance numbers for this subset of changes. |
5ec36b8
to
10d0913
Compare
Force-pushed the new implementation, based on 3.0 branch (rev 5c4a09a). Also I edited the 1st post to reflect changes Performance numbers: first number is a re-run (font files are cached by the OS), second is a cold run Windows (296 fonts, Java 8 x32):
macOS (792 fonts, Java 22 arm64):
|
Reasoning behind using a separate class
|
10d0913
to
836a39e
Compare
Could you please rename "LoadOnlyHeaders" to something that isn't a verb? E.g. "FontHeaders" or whatever. Logging should be done like it's done in the code, not with "Logger.getLogger". I'd also like to separate refactorings and bug fixes from the main PR, I'll do some small fixes. Please rebase your PR when you notice it. |
Re "fixed a memory leak", this is tricky: your change makes sense for the |
I've done your refactorings of NamingTable. I've kept the same sequence so I hope you don't have too much work. Ideally you could just keep your own code and the diff to the existing would now be much smaller. |
Why is readExact() needed under the hood? Assuming that your changes in this PR read a selection of the data that is read in the production code, the error handling on truncated / corrupted files can stay the same. Unless you think that the error handling is poor, then it should be changed everywhere. The reason I ask this is because your read() has a different behavior than other read() methods, and because I'd like to keep the code as small as possible. |
Thank you, I will fix the issues today. Re I initially saw that For the future refactoring, here are the places where
No-loop calls (may be wrong if implementations are not required to behave like
|
836a39e
to
e390fe1
Compare
Thanks; The memory leak part has been fixed in https://issues.apache.org/jira/browse/PDFBOX-5845 . I've removed one of the constructors because it's no longer needed, please update your PR. |
e390fe1
to
a29a003
Compare
I applied the patch for the first time, I do get differences in the Here's the comicz file: However it might also be my fault due to my own changes, so I'll investigate. |
Yes, I have the same |
Found it: |
25f858d
to
0a6ae75
Compare
Thanks that works now. |
Why the setException() / getException() thing? It feels weird to me because parse() might still throw an exception so why swallow and store it if it happens later? Same weird feeling that an exception for "font.getNaming()" is completely ignored, I don't understand the comment. |
0a6ae75
to
4be17cf
Compare
Re exception: 2 calls should've been Re "font.getNaming()" - don't remember, looks weird, removed. Found a potential bug in Also, simplified |
Thanks; |
While I'm at it, I think it's better to replace "calling getter just for side-effects" font.getNaming(); // calls NamingTable.readTable();
font.getHeader(); // calls HeaderTable.readTable();
((OpenTypeFont) font).getCFF(); // calls CFFTable.readTable(); with explicit calls font.readTableHeaders(*.TAG, outHeaders); // -> TTFTable.readHeaders(..., FontHeaders) Then The changeset will become a bit larger (like, duplicated |
…ng arrays [ADD] RandomAccessReadUncachedDataStream that doesn't read input stream to byte[]
…by adding 'only headers' mode to TTF parser: * only read tables needed for FSFontInfo ('name', 'head', 'OS/2', 'CFF ', 'gcid') * 'CFF ' and 'head' table parsers finish as soon as it has all needed headers
4be17cf
to
54c4eb0
Compare
Or did I over-do it with |
I'm wondering if the |
When So it's slightly more computing in the 1st loop ( Of course, to please branch predictor, we can extract
Which 3 lines do you mean? |
As for |
Please ignore all I wrote today for now, I overlooked the readstring part. Don't change anything. |
Thank you for your work and your patience. I'm planning to commit all next week, as soon as time permits. Ideally I'll commit to the trunk and all branches. From what I see you didn't break the API. However there's one more thing I have to investigate, I'm getting sporadic parsing errors with a specific file, when the cache is getting rebuilt. Most likely this isn't because of the changes but I'll check that first. Update: not because of the changes, problem happens in 3.0.2 release too. |
trunk and 3.0 have been committed. 2.0 is still todo. Builds will freeze, this is a problem with the OWASP plugin, search for "owasp" in the top pom.xml, then below "" add "true". The people maintaining this plugin are aware of the problem and I'll update the version as soon as their release is done. https://issues.apache.org/jira/browse/PDFBOX-5847 update: OWASP update has been released, change has been committed to the pom.xml |
I worked on 2.0 for a few hours and I was only able to do 4 files in that time, and not the big ones. In 2.0 we read the fonts into memory using the Opinions? |
Thank you. I will look at 2.0 on the weekend or maybe next week |
On one machine with 792 fonts (~800MB) scanning takes around 5.5s, after these changes time is down to 0.7s (without checksumming), or 2.2s (with checksums).
In this PR, TTF parsers have an "only headers" mode where each table reads as little information as possible:
byte[]
, parser usesRandomAccessRead
directly, because most of the file is skippedFSFontInfo
(name
,head
,OS/2
,CFF
,gcid
)pdfbox.fontcache.skipchecksums
system property, for backward compatibility): checksumming 800MB takes 1.5s and parsing headers takes only 0.7sAdditional fixes:
RandomAccessRead
passed toTrueTypeCollection
constructor was never freed.NamingTable
: use sorted list instead of multilevelHashMap
, delay-load Strings (for non-"only headers" mode)TTFSubsetter
: avoid bytes->string->bytes conversionstreamline I/O: replace readByte() with read(array)consolidateread(buf, offset, len)
loops intoreadNBytes()
(allows underflow) andreadExact()
(throwing)Breaking changes:
pdfbox.fontcache.skipchecksums
property is set,.pdfbox.cache
file has dashes instead of checksums.pdfbox.cache
file differs for corrupted fonts (likeLastResort.otf
that throwsIOException: Invalid character code 0xD800
):*skipexception*|OTF||0|0|0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000
LastResort|OTF||||0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000
because we simply do not parse it as extensively to recognize that it's corruptedNameRecord.getString()
is now package-private and lazy, renamed togetStringLazy()
new abstract method(used a default implementation instead)TTFDataStream.createSubView()
Only tested with 3.0 branch, all tests pass, resulting file is the same (except for checksums field, of course, and some
*skipexception*
s).Should I break it into smaller commits?