Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle "large integers" #167

Closed
crazyxman opened this issue May 14, 2019 · 31 comments
Closed

Handle "large integers" #167

crazyxman opened this issue May 14, 2019 · 31 comments
Labels
enhancement New feature or request

Comments

@crazyxman
Copy link
Contributor

The following json parsing failed because of the number "38793000000000000001". Can I give an option when parsing? Convert to a string when parsing an integer overflow.
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793000000000000001]
}
}

@lemire
Copy link
Member

lemire commented May 14, 2019

Treating it as a string would violate the JSON specification as far as I can tell and once you reserialize, you'd end up with "3879..." (a string value) instead of 3879 (an integer value). JavaScript and some C++ parsers might convert it to a floating-point number but this is not ideal either since it is a lossy process (you won't be able to retrieve the original integer). We could map it to a "big integer" class of some kind... but that's inconvenient for the users.

What is your purpose with such a document?

You can't, portably, represent such larger integers in C/C++ on any system I know of.

@crazyxman
Copy link
Contributor Author

Thank you. Maybe it's wrong to do so.
This idea comes from the fourth parameter of the PHP function json_decode() when options=JSON_BIGINT_AS_STRING
Https://www.php.net/manual/en/function.json-decode.php

@lemire
Copy link
Member

lemire commented May 14, 2019

I am reopening because I don't think this is "wrong".

@lemire lemire reopened this May 14, 2019
@lemire lemire changed the title Analyze integer overflow recommendations Handle "large integers" May 14, 2019
@lemire
Copy link
Member

lemire commented May 14, 2019

Of course, it would help to have a good motivation for this problem.

@plokhotnyuk
Copy link

plokhotnyuk commented May 14, 2019

It can be decimal128 numbers with up to 34-digit mantissas, bank account numbers or IDs in decimal representation, etc.

Packing them to a string is one of safest options. Some implementations that just parse the input to a big number can be vulnerable on a malicious input.

@lemire
Copy link
Member

lemire commented May 14, 2019

Packing them to a string is one of safest options.

Yes. I would still like to have them flagged as "bigint" or something and not stored as strings per se.

So we could have a type "bigint" that actually returns a string when you query the value. It would be up to the user to figure out what do with the string.

@lemire
Copy link
Member

lemire commented Sep 2, 2019

Note that we now a support for 64-bit unsigned integers.

Of course, we still cannot represent big integers like 38793000000000000001.

@hedgefund
Copy link

Yes. I would still like to have them flagged as "bigint" or something and not stored as strings per se.

So we could have a type "bigint" that actually returns a string when you query the value. It would be up to the user to figure out what do with the string.

It would be useful to be able to direct simdjson to treat all numbers like that (double and integer) regardless of their actual precision in the input with some compile-time flag.

A common use case that comes up a lot when dealing with JSON libraries is a JSON market data stream of cryptocurrency exchanges; some of them send the price/quantity as doubles instead of strings and you want to be able to parse that directly into a decimal class without losing precision (since getting the decimal precision correct with currency data is critical). RapidJSON for example has an optional parser flag kParseNumbersAsStringsFlag which does exactly this.

@lemire
Copy link
Member

lemire commented Apr 25, 2020

Yes. Offering the option of parsing numbers as strings is entirely doable.

@hedgefund
Copy link

Yes. Offering the option of parsing numbers as strings is entirely doable.

Is this something that one could temporarily hack locally with small modifications to the master version of simdjson until official support is available, or does it require some fundamental internal changes? It's the thing we miss most about RapidJSON when switching to simdjson.

@lemire
Copy link
Member

lemire commented Apr 26, 2020

@hedgefund

Our code base is "advanced C++" but it is super clean. It is designed to be maintainable.

See these lines...

https://github.com/simdjson/simdjson/blob/master/src/generic/stage2_build_tape.h#L165-L179

Basically, rewiring "parse_number" so that it handles the number as a string... is what you want to do... Not hard.

It won't take 5 minutes. But it won't take 5 months. Might take a few days... depending on your C++ skills.

If you are pretty good at C++, you might even be able to do a PR that we could integrate soon.

Look for the HACKING.md file first.

@jkeiser jkeiser modified the milestones: 0.4, 1.0 Jun 10, 2020
@hedgefund
Copy link

hedgefund commented Aug 17, 2020

@lemire

Sorry for revisiting this issue after so much time. I have a working version of this change on my local repo that took me about 30 minutes to conjure up; it was much simpler than I imagined it would be, especially after the refactoring to how stage1/2 and the tape work in the past months.

While writing this I ran into another issue; we need a way to pass compile-time flags to the dom parser on a per-parse basis, so that you can do something like parser.parse<dom::flags::numbers_as_strings>(json); while in other parts of your app you might just want the default behavior so you won't pass anything for the flags. Currently I have implemented this as just an #ifdef for my own use but I want to submit a PR for this so I want to integrate the compile-time flag passing as well.

@lemire
Copy link
Member

lemire commented Aug 17, 2020

Yes. It sounds reasonable. We probably want to pass a template that defines how numbers are handled.

@jkeiser jkeiser added the enhancement New feature or request label Sep 22, 2020
@jkeiser
Copy link
Member

jkeiser commented Jan 25, 2021

On demand could add support for this via a get_decimal and get_bigint methods (in a way that would be hard to do with the dom). I don't think it blocks 1.0, however.

@jkeiser jkeiser modified the milestones: 1.0, 2.0 Jan 25, 2021
@hedgefund
Copy link

hedgefund commented Feb 6, 2021

On demand could add support for this via a get_decimal and get_bigint methods (in a way that would be hard to do with the dom). I don't think it blocks 1.0, however.

I think those are too specific; the ideal solution would be a way to get a [pointer, size] tuple to the raw string for any non-string value and let the user parse it as they see fit. Something like to get_raw_string() or get_as_string().

Slightly related to this; I've been running with my own patch for months now, albeit not with the ondemand API, which basically changes number parsing to just store raw strings on the tape (controlled by preprocessor macro) and then afterwards I parse them myself using my own bignumber/decimal parsers as needed during the DOM traversal.

@jkeiser
Copy link
Member

jkeiser commented Feb 7, 2021

get_raw_json() would absolutely be worth doing. You should be able to do your own kind of parsing if you want. For efficiency's sake, it would probably give you a string_view that included any spaces after the value as well. And it would give you a guarantee (unless you were getting the raw json from a document) that there will be spaces or structurals ([]{},:) after the value, and you therefore don't need to worry about the length, just stop when the number stops.

I still think having a decimal/bigint in the library would be a good idea, though! It's a reasonable thing to want, I think.

@hedgefund
Copy link

hedgefund commented Feb 7, 2021

get_raw_json() would absolutely be worth doing

I'm eagerly looking forward to that. Should be easy enough to add and opens a lot of new possibilities (and I'll finally be able to get rid of my fork).

I still think having a decimal/bigint in the library would be a good idea, though! It's a reasonable thing to want, I think.

The thing about that is that unlike binary floating point where it is standardized and ubiquitous, different applications/projects call for different types of bigints and decimals, and people might want to use various parsers that have different tradeoffs. E.g. For some applications you might want to use a decimal floating point type and for others you might want a fixed decimal type constrained to 64 or 128 bits and the same goes for bigints.

@hedgefund
Copy link

@jkeiser was any progress made with this feature / get_raw_json()?

@lemire
Copy link
Member

lemire commented May 11, 2021

@hedgefund Would you be willing to work on a pull request for this feature?

@jkeiser
Copy link
Member

jkeiser commented May 11, 2021

There is now a raw_json_token method on all JSON values, yep! Let us know if it meets your needs. We haven't done anything around specific bigint or decimal representation, and it's not in the high priority list for 1.0 given raw JSON support.

@lemire
Copy link
Member

lemire commented May 11, 2021

@hedgefund Here is a code sample of what is possible right now...

    simdjson::ondemand::parser parser;
    simdjson::padded_string docdata =  R"({"value":12321323213213213213213213213211223})"_padded;
    simdjson::ondemand::document doc = parser.iterate(docdata);
    simdjson::ondemand::object obj = doc.get_object();
    string_view token = obj["value"].raw_json_token();
    std::cout << token << std::endl;
    // token == "12321323213213213213213213213211223"

Users who want support for big integers 'now' or 'soon', should be willing to contribute pull requests.

@AlexeyDmitriev
Copy link

AlexeyDmitriev commented Jul 1, 2021

Yes. It sounds reasonable. We probably want to pass a template that defines how numbers are handled.

I think that it doesn't make much sense to do it a global setting for a parser, often you know that most of the numbers are small numbers and some specific fields of json you need to parse in your custom class (for example decimal).

@AlexeyDmitriev
Copy link

Actually, I think raw_json_token() provides most of the value for me. I can make an informed decision, how to parse the number if we get it from specific API.

What your library could provide is faster parsing because it's known to be fast to parse other things. The question of what you can provide to parse the class that I will implement tomorrow is non-trivial.
One thing that comes to mind is you can pre-parse things like sign, integer and fractional part, exponent part(because you can probably do that for doubles anyway) and provide them as strings and as integers.

Also, you can make an API a bit nicer for these cases by allowing to get arbitrary types if they provide some API (or registered sonehow). So that I can write something along the lines of static_cast<MyDecimal>(json) or json.get_number<MyDecimal>()

@lemire
Copy link
Member

lemire commented Jul 2, 2021

Also, you can make an API a bit nicer for these cases by allowing to get arbitrary types if they provide some API (or registered sonehow). So that I can write something along the lines of static_cast(json) or json.get_number()

Pull request invited!

@mbasmanova
Copy link

Seeing similar issue in Velox with 233897314173811950000: facebookincubator/velox#8066

@lemire
Copy link
Member

lemire commented Dec 15, 2023

@mbasmanova Reproducing my answer here...

In standard C++, we have 64-bit, 32-bit, 16-bit and 8-bit integers and well as 64-bit, 32-bit, and 16-bit floating-point integers. None of them can represent the value provided.

We can try with a 64-bit integer (uint64_t). The number233897314173811950000 which is 0xcadfa3000000001b0, is much larger than what an unsigned 64-bit integer can store, which is 0xffffffffffffffff, so no luck there. There are 128-bit integers as extension of some systems (e.g., GCC), but they are not standard (e.g., won't work with Visual Studio).

You can also parse it as a double (64-bit floating point integer):

#include "simdjson.h"
#include <iostream>

using namespace simdjson;

int main(void) {
  padded_string json = R"( [ 233897314173811950000 ])"_padded;
  ondemand::parser parser;
  ondemand::document doc;
  if(auto error = parser.iterate(json).get(doc); error) { std::cout << error << std::endl; return EXIT_FAILURE; }
  double val;
  if(auto error = doc.at(0).get_double().get(val)) { std::cout << error << std::endl; return EXIT_FAILURE; }
  printf("%.0f\n", val);
  return EXIT_SUCCESS;
}

This will print 233897314173811949568 or 26591046*2**43.

From the string "233897314173811950000", you parse the integer 233897314173811949568. Is that what you wanted? The simdjson library will do it for you if that's what you want, as demonstrated above...

You have to decide what you want to do with it. If you want to store it using your own type, then you can get access to the raw type and then provide your own parser (e.g., to MyDecimalType or to MyBigIntegerType). The simdjson library provides this functionality as well.

What the simdjson library does not provide is a custom big-integer or decimal type, because that's not standard. We do, however, make it easy for you to do your own parsing if you so desire. Please see the relevant section of our documentation.

Capture d’écran, le 2023-12-15 à 08 29 31

@richarddd
Copy link

I'd like to suggest that handling integers exceeding 64 bits by default as floats, unless explicitly opted out, could be a more accommodating approach. It's worth considering the essence of JSON, which stands for JavaScript Object Notation.

It's notable that every JavaScript engine currently treats very large numbers as floats unless a dedicated BigInt type is specified. For instance, when executing 2341341341234134112341343 in V8, SpiderMonkey, or JavaScriptCore, the result is 2.3413413412341342e+24. To prevent data loss, one would need to use 2341341341234134112341343n, which is not part of the JSON spec and cannot be serialized or deserialized. Conversely, attempting JSON.parse("2341341341234134112341343") would yield the same float value.

@lemire
Copy link
Member

lemire commented Jan 10, 2024

@richarddd I think that's what @Yuhta proposed in facebookincubator/velox#8066 (comment)

PR invited.

@richarddd
Copy link

@richarddd I think that's what @Yuhta proposed in facebookincubator/velox#8066 (comment)

PR invited.

I would if I could. My c++ skills are virtually non-existent. I've created a PR in the rust port that enable this behaviour via a feature flag: simd-lite/simd-json#363

@lemire
Copy link
Member

lemire commented Jan 31, 2024

I really like PR #2115 by @saleyn

@lemire
Copy link
Member

lemire commented Feb 26, 2024

Closed via #2139

This will be part of the next release.

@lemire lemire closed this as completed Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

8 participants