Handle "large integers" #167

crazyxman · 2019-05-14T09:40:12Z

The following json parsing failed because of the number "38793000000000000001". Can I give an option when parsing? Convert to a string when parsing an integer overflow.
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793000000000000001]
}
}

lemire · 2019-05-14T12:26:11Z

Treating it as a string would violate the JSON specification as far as I can tell and once you reserialize, you'd end up with "3879..." (a string value) instead of 3879 (an integer value). JavaScript and some C++ parsers might convert it to a floating-point number but this is not ideal either since it is a lossy process (you won't be able to retrieve the original integer). We could map it to a "big integer" class of some kind... but that's inconvenient for the users.

What is your purpose with such a document?

You can't, portably, represent such larger integers in C/C++ on any system I know of.

crazyxman · 2019-05-14T12:58:33Z

Thank you. Maybe it's wrong to do so.
This idea comes from the fourth parameter of the PHP function json_decode() when options=JSON_BIGINT_AS_STRING
Https://www.php.net/manual/en/function.json-decode.php

lemire · 2019-05-14T13:01:31Z

I am reopening because I don't think this is "wrong".

lemire · 2019-05-14T13:02:43Z

Of course, it would help to have a good motivation for this problem.

plokhotnyuk · 2019-05-14T13:18:58Z

It can be decimal128 numbers with up to 34-digit mantissas, bank account numbers or IDs in decimal representation, etc.

Packing them to a string is one of safest options. Some implementations that just parse the input to a big number can be vulnerable on a malicious input.

lemire · 2019-05-14T13:56:27Z

Packing them to a string is one of safest options.

Yes. I would still like to have them flagged as "bigint" or something and not stored as strings per se.

So we could have a type "bigint" that actually returns a string when you query the value. It would be up to the user to figure out what do with the string.

lemire · 2019-09-02T16:00:07Z

Note that we now a support for 64-bit unsigned integers.

Of course, we still cannot represent big integers like 38793000000000000001.

hedgefund · 2020-04-25T13:31:25Z

Yes. I would still like to have them flagged as "bigint" or something and not stored as strings per se.

So we could have a type "bigint" that actually returns a string when you query the value. It would be up to the user to figure out what do with the string.

It would be useful to be able to direct simdjson to treat all numbers like that (double and integer) regardless of their actual precision in the input with some compile-time flag.

A common use case that comes up a lot when dealing with JSON libraries is a JSON market data stream of cryptocurrency exchanges; some of them send the price/quantity as doubles instead of strings and you want to be able to parse that directly into a decimal class without losing precision (since getting the decimal precision correct with currency data is critical). RapidJSON for example has an optional parser flag kParseNumbersAsStringsFlag which does exactly this.

lemire · 2020-04-25T19:12:20Z

Yes. Offering the option of parsing numbers as strings is entirely doable.

hedgefund · 2020-04-26T18:34:20Z

Yes. Offering the option of parsing numbers as strings is entirely doable.

Is this something that one could temporarily hack locally with small modifications to the master version of simdjson until official support is available, or does it require some fundamental internal changes? It's the thing we miss most about RapidJSON when switching to simdjson.

lemire · 2020-04-26T18:39:53Z

@hedgefund

Our code base is "advanced C++" but it is super clean. It is designed to be maintainable.

See these lines...

https://github.com/simdjson/simdjson/blob/master/src/generic/stage2_build_tape.h#L165-L179

Basically, rewiring "parse_number" so that it handles the number as a string... is what you want to do... Not hard.

It won't take 5 minutes. But it won't take 5 months. Might take a few days... depending on your C++ skills.

If you are pretty good at C++, you might even be able to do a PR that we could integrate soon.

Look for the HACKING.md file first.

hedgefund · 2020-08-17T12:02:04Z

@lemire

Sorry for revisiting this issue after so much time. I have a working version of this change on my local repo that took me about 30 minutes to conjure up; it was much simpler than I imagined it would be, especially after the refactoring to how stage1/2 and the tape work in the past months.

While writing this I ran into another issue; we need a way to pass compile-time flags to the dom parser on a per-parse basis, so that you can do something like parser.parse<dom::flags::numbers_as_strings>(json); while in other parts of your app you might just want the default behavior so you won't pass anything for the flags. Currently I have implemented this as just an #ifdef for my own use but I want to submit a PR for this so I want to integrate the compile-time flag passing as well.

lemire · 2020-08-17T22:23:36Z

Yes. It sounds reasonable. We probably want to pass a template that defines how numbers are handled.

jkeiser · 2021-01-25T07:28:30Z

On demand could add support for this via a get_decimal and get_bigint methods (in a way that would be hard to do with the dom). I don't think it blocks 1.0, however.

hedgefund · 2021-02-06T14:53:02Z

On demand could add support for this via a get_decimal and get_bigint methods (in a way that would be hard to do with the dom). I don't think it blocks 1.0, however.

I think those are too specific; the ideal solution would be a way to get a [pointer, size] tuple to the raw string for any non-string value and let the user parse it as they see fit. Something like to get_raw_string() or get_as_string().

Slightly related to this; I've been running with my own patch for months now, albeit not with the ondemand API, which basically changes number parsing to just store raw strings on the tape (controlled by preprocessor macro) and then afterwards I parse them myself using my own bignumber/decimal parsers as needed during the DOM traversal.

jkeiser · 2021-02-07T19:22:59Z

get_raw_json() would absolutely be worth doing. You should be able to do your own kind of parsing if you want. For efficiency's sake, it would probably give you a string_view that included any spaces after the value as well. And it would give you a guarantee (unless you were getting the raw json from a document) that there will be spaces or structurals ([]{},:) after the value, and you therefore don't need to worry about the length, just stop when the number stops.

I still think having a decimal/bigint in the library would be a good idea, though! It's a reasonable thing to want, I think.

hedgefund · 2021-02-07T19:43:45Z

get_raw_json() would absolutely be worth doing

I'm eagerly looking forward to that. Should be easy enough to add and opens a lot of new possibilities (and I'll finally be able to get rid of my fork).

I still think having a decimal/bigint in the library would be a good idea, though! It's a reasonable thing to want, I think.

The thing about that is that unlike binary floating point where it is standardized and ubiquitous, different applications/projects call for different types of bigints and decimals, and people might want to use various parsers that have different tradeoffs. E.g. For some applications you might want to use a decimal floating point type and for others you might want a fixed decimal type constrained to 64 or 128 bits and the same goes for bigints.

hedgefund · 2021-05-10T08:40:56Z

@jkeiser was any progress made with this feature / get_raw_json()?

lemire · 2021-05-11T14:05:37Z

@hedgefund Would you be willing to work on a pull request for this feature?

jkeiser · 2021-05-11T16:20:13Z

There is now a raw_json_token method on all JSON values, yep! Let us know if it meets your needs. We haven't done anything around specific bigint or decimal representation, and it's not in the high priority list for 1.0 given raw JSON support.

lemire · 2021-05-11T19:56:30Z

@hedgefund Here is a code sample of what is possible right now...

    simdjson::ondemand::parser parser;
    simdjson::padded_string docdata =  R"({"value":12321323213213213213213213213211223})"_padded;
    simdjson::ondemand::document doc = parser.iterate(docdata);
    simdjson::ondemand::object obj = doc.get_object();
    string_view token = obj["value"].raw_json_token();
    std::cout << token << std::endl;
    // token == "12321323213213213213213213213211223"

Users who want support for big integers 'now' or 'soon', should be willing to contribute pull requests.

AlexeyDmitriev · 2021-07-01T22:48:02Z

Yes. It sounds reasonable. We probably want to pass a template that defines how numbers are handled.

I think that it doesn't make much sense to do it a global setting for a parser, often you know that most of the numbers are small numbers and some specific fields of json you need to parse in your custom class (for example decimal).

AlexeyDmitriev · 2021-07-01T23:01:24Z

Actually, I think raw_json_token() provides most of the value for me. I can make an informed decision, how to parse the number if we get it from specific API.

What your library could provide is faster parsing because it's known to be fast to parse other things. The question of what you can provide to parse the class that I will implement tomorrow is non-trivial.
One thing that comes to mind is you can pre-parse things like sign, integer and fractional part, exponent part(because you can probably do that for doubles anyway) and provide them as strings and as integers.

Also, you can make an API a bit nicer for these cases by allowing to get arbitrary types if they provide some API (or registered sonehow). So that I can write something along the lines of static_cast<MyDecimal>(json) or json.get_number<MyDecimal>()

lemire · 2021-07-02T17:27:05Z

Also, you can make an API a bit nicer for these cases by allowing to get arbitrary types if they provide some API (or registered sonehow). So that I can write something along the lines of static_cast(json) or json.get_number()

Pull request invited!

mbasmanova · 2023-12-15T10:39:11Z

Seeing similar issue in Velox with 233897314173811950000: facebookincubator/velox#8066

lemire · 2023-12-15T13:33:23Z

@mbasmanova Reproducing my answer here...

In standard C++, we have 64-bit, 32-bit, 16-bit and 8-bit integers and well as 64-bit, 32-bit, and 16-bit floating-point integers. None of them can represent the value provided.

We can try with a 64-bit integer (uint64_t). The number233897314173811950000 which is 0xcadfa3000000001b0, is much larger than what an unsigned 64-bit integer can store, which is 0xffffffffffffffff, so no luck there. There are 128-bit integers as extension of some systems (e.g., GCC), but they are not standard (e.g., won't work with Visual Studio).

You can also parse it as a double (64-bit floating point integer):

#include "simdjson.h"
#include <iostream>

using namespace simdjson;

int main(void) {
  padded_string json = R"( [ 233897314173811950000 ])"_padded;
  ondemand::parser parser;
  ondemand::document doc;
  if(auto error = parser.iterate(json).get(doc); error) { std::cout << error << std::endl; return EXIT_FAILURE; }
  double val;
  if(auto error = doc.at(0).get_double().get(val)) { std::cout << error << std::endl; return EXIT_FAILURE; }
  printf("%.0f\n", val);
  return EXIT_SUCCESS;
}

This will print 233897314173811949568 or 26591046*2**43.

From the string "233897314173811950000", you parse the integer 233897314173811949568. Is that what you wanted? The simdjson library will do it for you if that's what you want, as demonstrated above...

You have to decide what you want to do with it. If you want to store it using your own type, then you can get access to the raw type and then provide your own parser (e.g., to MyDecimalType or to MyBigIntegerType). The simdjson library provides this functionality as well.

What the simdjson library does not provide is a custom big-integer or decimal type, because that's not standard. We do, however, make it easy for you to do your own parsing if you so desire. Please see the relevant section of our documentation.

Capture d’écran, le 2023-12-15 à 08 29 31

richarddd · 2024-01-10T10:02:07Z

I'd like to suggest that handling integers exceeding 64 bits by default as floats, unless explicitly opted out, could be a more accommodating approach. It's worth considering the essence of JSON, which stands for JavaScript Object Notation.

It's notable that every JavaScript engine currently treats very large numbers as floats unless a dedicated BigInt type is specified. For instance, when executing 2341341341234134112341343 in V8, SpiderMonkey, or JavaScriptCore, the result is 2.3413413412341342e+24. To prevent data loss, one would need to use 2341341341234134112341343n, which is not part of the JSON spec and cannot be serialized or deserialized. Conversely, attempting JSON.parse("2341341341234134112341343") would yield the same float value.

lemire · 2024-01-10T15:12:29Z

@richarddd I think that's what @Yuhta proposed in facebookincubator/velox#8066 (comment)

PR invited.

richarddd · 2024-01-10T21:48:18Z

@richarddd I think that's what @Yuhta proposed in facebookincubator/velox#8066 (comment)

PR invited.

I would if I could. My c++ skills are virtually non-existent. I've created a PR in the rust port that enable this behaviour via a feature flag: simd-lite/simd-json#363

lemire · 2024-01-31T16:52:14Z

I really like PR #2115 by @saleyn

lemire · 2024-02-26T19:01:26Z

Closed via #2139

This will be part of the next release.

crazyxman closed this as completed May 14, 2019

lemire reopened this May 14, 2019

lemire changed the title ~~Analyze integer overflow recommendations~~ Handle "large integers" May 14, 2019

lemire mentioned this issue Oct 18, 2019

simdjson jqlang/jq#1892

Open

This was referenced Jan 9, 2020

Option to get number in JSON as string in the DOM API #425

Closed

Discussion regarding major refactoring to decouple the tape production and index consumption from parsing (comments eagerly invited) #439

Closed

lemire added this to the 0.4 milestone Mar 29, 2020

jkeiser modified the milestones: 0.4, 1.0 Jun 10, 2020

jkeiser added the enhancement New feature or request label Sep 22, 2020

jkeiser modified the milestones: 1.0, 2.0 Jan 25, 2021

jkeiser mentioned this issue Jan 25, 2021

parser.on_invalid_number #560

Open

damilgra mentioned this issue Mar 6, 2021

Replace ujson Kinto/kinto#2737

Merged

michaeleisel mentioned this issue Aug 27, 2022

Add support for parsing arbitrary-length numbers in the DOM API #1886

Closed

MegaRedHand mentioned this issue Jul 21, 2023

Use simdjson to deserialize struct lambdaclass/cairo-vm#1295

Closed

mbasmanova mentioned this issue Dec 15, 2023

json_parse: NUMBER_ERROR: Problem while parsing a number facebookincubator/velox#8066

Closed

lemire mentioned this issue Jan 8, 2024

Implement an ability to parse integers that exceed 64 bits #2108

Closed

lemire closed this as completed Feb 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle "large integers" #167

Handle "large integers" #167

crazyxman commented May 14, 2019

lemire commented May 14, 2019

crazyxman commented May 14, 2019

lemire commented May 14, 2019

lemire commented May 14, 2019

plokhotnyuk commented May 14, 2019 •

edited

Loading

lemire commented May 14, 2019

lemire commented Sep 2, 2019

hedgefund commented Apr 25, 2020

lemire commented Apr 25, 2020

hedgefund commented Apr 26, 2020

lemire commented Apr 26, 2020

hedgefund commented Aug 17, 2020 •

edited

Loading

lemire commented Aug 17, 2020

jkeiser commented Jan 25, 2021

hedgefund commented Feb 6, 2021 •

edited

Loading

jkeiser commented Feb 7, 2021

hedgefund commented Feb 7, 2021 •

edited

Loading

hedgefund commented May 10, 2021

lemire commented May 11, 2021

jkeiser commented May 11, 2021

lemire commented May 11, 2021 •

edited

Loading

AlexeyDmitriev commented Jul 1, 2021 •

edited

Loading

AlexeyDmitriev commented Jul 1, 2021

lemire commented Jul 2, 2021

mbasmanova commented Dec 15, 2023

lemire commented Dec 15, 2023

richarddd commented Jan 10, 2024

lemire commented Jan 10, 2024

richarddd commented Jan 10, 2024

lemire commented Jan 31, 2024

lemire commented Feb 26, 2024

Handle "large integers" #167

Handle "large integers" #167

Comments

crazyxman commented May 14, 2019

lemire commented May 14, 2019

crazyxman commented May 14, 2019

lemire commented May 14, 2019

lemire commented May 14, 2019

plokhotnyuk commented May 14, 2019 • edited Loading

lemire commented May 14, 2019

lemire commented Sep 2, 2019

hedgefund commented Apr 25, 2020

lemire commented Apr 25, 2020

hedgefund commented Apr 26, 2020

lemire commented Apr 26, 2020

hedgefund commented Aug 17, 2020 • edited Loading

lemire commented Aug 17, 2020

jkeiser commented Jan 25, 2021

hedgefund commented Feb 6, 2021 • edited Loading

jkeiser commented Feb 7, 2021

hedgefund commented Feb 7, 2021 • edited Loading

hedgefund commented May 10, 2021

lemire commented May 11, 2021

jkeiser commented May 11, 2021

lemire commented May 11, 2021 • edited Loading

AlexeyDmitriev commented Jul 1, 2021 • edited Loading

AlexeyDmitriev commented Jul 1, 2021

lemire commented Jul 2, 2021

mbasmanova commented Dec 15, 2023

lemire commented Dec 15, 2023

richarddd commented Jan 10, 2024

lemire commented Jan 10, 2024

richarddd commented Jan 10, 2024

lemire commented Jan 31, 2024

lemire commented Feb 26, 2024

plokhotnyuk commented May 14, 2019 •

edited

Loading

hedgefund commented Aug 17, 2020 •

edited

Loading

hedgefund commented Feb 6, 2021 •

edited

Loading

hedgefund commented Feb 7, 2021 •

edited

Loading

lemire commented May 11, 2021 •

edited

Loading

AlexeyDmitriev commented Jul 1, 2021 •

edited

Loading