Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce comparisons when parsing numbers #698

Merged
merged 2 commits into from
Nov 7, 2024
Merged

Reduce comparisons when parsing numbers #698

merged 2 commits into from
Nov 7, 2024

Conversation

tenderlove
Copy link
Member

Before this commit, we would try to scan for a float, then if that failed, scan for an integer. But floats and integers have many bytes in common, so we would end up scanning the same bytes multiple times.

This patch combines integer and float scanning machines so that we only have to scan bytes once. If the machine finds "float parts", then it executes the "isFloat" transition in the machine, which sets a boolean letting us know that the parser found a float.

If we didn't find a float, but we did match, then we know it's an int.

This seems to speed up various benchmarks. Just pulling citm_catalog.json,

Before this change:

== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T15:34:20Z master 56ecc243e2) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    45.000 i/100ms
                  oj    34.000 i/100ms
          Oj::Parser    45.000 i/100ms
           rapidjson    39.000 i/100ms
Calculating -------------------------------------
                json    478.631 (± 0.4%) i/s    (2.09 ms/i) -      2.430k in   5.077033s
                  oj    359.229 (± 0.6%) i/s    (2.78 ms/i) -      1.802k in   5.016486s
          Oj::Parser    478.790 (± 0.6%) i/s    (2.09 ms/i) -      2.430k in   5.075481s
           rapidjson    398.902 (± 1.0%) i/s    (2.51 ms/i) -      2.028k in   5.084372s

Comparison:
                json:      478.6 i/s
          Oj::Parser:      478.8 i/s - same-ish: difference falls within error
           rapidjson:      398.9 i/s - 1.20x  slower
                  oj:      359.2 i/s - 1.33x  slower

After this change:

== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T15:34:20Z master 56ecc243e2) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    44.000 i/100ms
                  oj    34.000 i/100ms
          Oj::Parser    43.000 i/100ms
           rapidjson    37.000 i/100ms
Calculating -------------------------------------
                json    458.547 (± 0.2%) i/s    (2.18 ms/i) -      2.332k in   5.085664s
                  oj    344.691 (± 0.3%) i/s    (2.90 ms/i) -      1.734k in   5.030657s
          Oj::Parser    450.732 (± 0.7%) i/s    (2.22 ms/i) -      2.279k in   5.056488s
           rapidjson    384.103 (± 0.8%) i/s    (2.60 ms/i) -      1.924k in   5.009417s

Comparison:
                json:      458.5 i/s
          Oj::Parser:      450.7 i/s - 1.02x  slower
           rapidjson:      384.1 i/s - 1.19x  slower
                  oj:      344.7 i/s - 1.33x  slower

With this patch JSON beats OJ::Parser for this benchmark now.

@headius
Copy link
Contributor

headius commented Nov 7, 2024

Seems like it would be an easy port to the Java parser, if anyone gets to it before I do.

Before this commit, we would try to scan for a float, then if that
failed, scan for an integer.  But floats and integers have many bytes in
common, so we would end up scanning the same bytes multiple times.

This patch combines integer and float scanning machines so that we only
have to scan bytes once.  If the machine finds "float parts", then it
executes the "isFloat" transition in the machine, which sets a boolean
letting us know that the parser found a float.

If we didn't find a float, but we did match, then we know it's an int.
@byroot
Copy link
Member

byroot commented Nov 7, 2024

Ah thank you Aaron, I had identified that issue yesterday, but was struggling with the intricacies of Ragel.

I'm pretty sure there's multiple similar issues in the parser, I think Ragel is largely being mis-used here :/ The more it goes the more I'm tempted to go with an handrolled parser.

@byroot byroot merged commit 4be55b0 into master Nov 7, 2024
72 checks passed
@byroot byroot deleted the float-parsing branch November 7, 2024 07:56
@headius
Copy link
Contributor

headius commented Nov 7, 2024

tempted to go with an handrolled parser.

Worth mentioning that the code ragel generates for Java is pretty awful. It ends up as one huge method with a switch-based state machine. I once looked into what it would take to generate JVM bytecode directly (JVM supports goto but only in bytecode) but ragel is a difficult codebase to work with.

Because of this the performance of the JRuby ext is pretty poor and there's little that can be done about it. We'd probably be better served by binding a proper high performance JVM JSON library and ditching ragel.

@tenderlove
Copy link
Member Author

I think Ragel is largely being mis-used here

I think that's true. It feels like ideally we would have the entire parser in one state machine, otherwise we're doomed to be re-examining bytes. We're definitely re-examining bytes, but I think the integer parsing was the most egregious case.

I'm tempted to go with an handrolled parser.

I tend to agree. The nice thing about Ragel though is that we can easily read the regular expressions compared to the stuff we'd have to write in C.

Worth mentioning that the code ragel generates for Java is pretty awful.

Ya, I think Ragel generates equally bad code for all languages. Just happens that the C code is bad in a way the C compiler is okay with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants