Reduce comparisons when parsing numbers #698

tenderlove · 2024-11-07T01:22:50Z

Before this commit, we would try to scan for a float, then if that failed, scan for an integer. But floats and integers have many bytes in common, so we would end up scanning the same bytes multiple times.

This patch combines integer and float scanning machines so that we only have to scan bytes once. If the machine finds "float parts", then it executes the "isFloat" transition in the machine, which sets a boolean letting us know that the parser found a float.

If we didn't find a float, but we did match, then we know it's an int.

This seems to speed up various benchmarks. Just pulling citm_catalog.json,

Before this change:

== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T15:34:20Z master 56ecc243e2) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    45.000 i/100ms
                  oj    34.000 i/100ms
          Oj::Parser    45.000 i/100ms
           rapidjson    39.000 i/100ms
Calculating -------------------------------------
                json    478.631 (± 0.4%) i/s    (2.09 ms/i) -      2.430k in   5.077033s
                  oj    359.229 (± 0.6%) i/s    (2.78 ms/i) -      1.802k in   5.016486s
          Oj::Parser    478.790 (± 0.6%) i/s    (2.09 ms/i) -      2.430k in   5.075481s
           rapidjson    398.902 (± 1.0%) i/s    (2.51 ms/i) -      2.028k in   5.084372s

Comparison:
                json:      478.6 i/s
          Oj::Parser:      478.8 i/s - same-ish: difference falls within error
           rapidjson:      398.9 i/s - 1.20x  slower
                  oj:      359.2 i/s - 1.33x  slower

After this change:

== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T15:34:20Z master 56ecc243e2) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    44.000 i/100ms
                  oj    34.000 i/100ms
          Oj::Parser    43.000 i/100ms
           rapidjson    37.000 i/100ms
Calculating -------------------------------------
                json    458.547 (± 0.2%) i/s    (2.18 ms/i) -      2.332k in   5.085664s
                  oj    344.691 (± 0.3%) i/s    (2.90 ms/i) -      1.734k in   5.030657s
          Oj::Parser    450.732 (± 0.7%) i/s    (2.22 ms/i) -      2.279k in   5.056488s
           rapidjson    384.103 (± 0.8%) i/s    (2.60 ms/i) -      1.924k in   5.009417s

Comparison:
                json:      458.5 i/s
          Oj::Parser:      450.7 i/s - 1.02x  slower
           rapidjson:      384.1 i/s - 1.19x  slower
                  oj:      344.7 i/s - 1.33x  slower

With this patch JSON beats OJ::Parser for this benchmark now.

headius · 2024-11-07T01:56:49Z

Seems like it would be an easy port to the Java parser, if anyone gets to it before I do.

Before this commit, we would try to scan for a float, then if that failed, scan for an integer. But floats and integers have many bytes in common, so we would end up scanning the same bytes multiple times. This patch combines integer and float scanning machines so that we only have to scan bytes once. If the machine finds "float parts", then it executes the "isFloat" transition in the machine, which sets a boolean letting us know that the parser found a float. If we didn't find a float, but we did match, then we know it's an int.

byroot · 2024-11-07T07:56:07Z

Ah thank you Aaron, I had identified that issue yesterday, but was struggling with the intricacies of Ragel.

I'm pretty sure there's multiple similar issues in the parser, I think Ragel is largely being mis-used here :/ The more it goes the more I'm tempted to go with an handrolled parser.

headius · 2024-11-07T12:32:33Z

tempted to go with an handrolled parser.

Worth mentioning that the code ragel generates for Java is pretty awful. It ends up as one huge method with a switch-based state machine. I once looked into what it would take to generate JVM bytecode directly (JVM supports goto but only in bytecode) but ragel is a difficult codebase to work with.

Because of this the performance of the JRuby ext is pretty poor and there's little that can be done about it. We'd probably be better served by binding a proper high performance JVM JSON library and ditching ragel.

tenderlove · 2024-11-07T16:43:24Z

I think Ragel is largely being mis-used here

I think that's true. It feels like ideally we would have the entire parser in one state machine, otherwise we're doomed to be re-examining bytes. We're definitely re-examining bytes, but I think the integer parsing was the most egregious case.

I'm tempted to go with an handrolled parser.

I tend to agree. The nice thing about Ragel though is that we can easily read the regular expressions compared to the stuff we'd have to write in C.

Worth mentioning that the code ragel generates for Java is pretty awful.

Ya, I think Ragel generates equally bad code for all languages. Just happens that the C code is bad in a way the C compiler is okay with.

byroot force-pushed the float-parsing branch from 42d9d45 to 0c0e093 Compare November 7, 2024 07:48

Rename parse_float into parse_number

e51e796

byroot merged commit 4be55b0 into master Nov 7, 2024
72 checks passed

byroot deleted the float-parsing branch November 7, 2024 07:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce comparisons when parsing numbers #698

Reduce comparisons when parsing numbers #698

tenderlove commented Nov 7, 2024

headius commented Nov 7, 2024

byroot commented Nov 7, 2024

headius commented Nov 7, 2024

tenderlove commented Nov 7, 2024

Reduce comparisons when parsing numbers #698

Reduce comparisons when parsing numbers #698

Conversation

tenderlove commented Nov 7, 2024

headius commented Nov 7, 2024

byroot commented Nov 7, 2024

headius commented Nov 7, 2024

tenderlove commented Nov 7, 2024