-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle "large integers" #167
Comments
Treating it as a string would violate the JSON specification as far as I can tell and once you reserialize, you'd end up with "3879..." (a string value) instead of 3879 (an integer value). JavaScript and some C++ parsers might convert it to a floating-point number but this is not ideal either since it is a lossy process (you won't be able to retrieve the original integer). We could map it to a "big integer" class of some kind... but that's inconvenient for the users. What is your purpose with such a document? You can't, portably, represent such larger integers in C/C++ on any system I know of. |
Thank you. Maybe it's wrong to do so. |
I am reopening because I don't think this is "wrong". |
Of course, it would help to have a good motivation for this problem. |
It can be decimal128 numbers with up to 34-digit mantissas, bank account numbers or IDs in decimal representation, etc. Packing them to a string is one of safest options. Some implementations that just parse the input to a big number can be vulnerable on a malicious input. |
Yes. I would still like to have them flagged as "bigint" or something and not stored as strings per se. So we could have a type "bigint" that actually returns a string when you query the value. It would be up to the user to figure out what do with the string. |
Note that we now a support for 64-bit unsigned integers. Of course, we still cannot represent big integers like 38793000000000000001. |
It would be useful to be able to direct simdjson to treat all numbers like that (double and integer) regardless of their actual precision in the input with some compile-time flag. A common use case that comes up a lot when dealing with JSON libraries is a JSON market data stream of cryptocurrency exchanges; some of them send the price/quantity as doubles instead of strings and you want to be able to parse that directly into a decimal class without losing precision (since getting the decimal precision correct with currency data is critical). RapidJSON for example has an optional parser flag |
Yes. Offering the option of parsing numbers as strings is entirely doable. |
Is this something that one could temporarily hack locally with small modifications to the master version of simdjson until official support is available, or does it require some fundamental internal changes? It's the thing we miss most about RapidJSON when switching to simdjson. |
Our code base is "advanced C++" but it is super clean. It is designed to be maintainable. See these lines... https://github.com/simdjson/simdjson/blob/master/src/generic/stage2_build_tape.h#L165-L179 Basically, rewiring "parse_number" so that it handles the number as a string... is what you want to do... Not hard. It won't take 5 minutes. But it won't take 5 months. Might take a few days... depending on your C++ skills. If you are pretty good at C++, you might even be able to do a PR that we could integrate soon. Look for the HACKING.md file first. |
Sorry for revisiting this issue after so much time. I have a working version of this change on my local repo that took me about 30 minutes to conjure up; it was much simpler than I imagined it would be, especially after the refactoring to how stage1/2 and the tape work in the past months. While writing this I ran into another issue; we need a way to pass compile-time flags to the dom parser on a per-parse basis, so that you can do something like |
Yes. It sounds reasonable. We probably want to pass a template that defines how numbers are handled. |
On demand could add support for this via a get_decimal and get_bigint methods (in a way that would be hard to do with the dom). I don't think it blocks 1.0, however. |
I think those are too specific; the ideal solution would be a way to get a [pointer, size] tuple to the raw string for any non-string value and let the user parse it as they see fit. Something like to Slightly related to this; I've been running with my own patch for months now, albeit not with the ondemand API, which basically changes number parsing to just store raw strings on the tape (controlled by preprocessor macro) and then afterwards I parse them myself using my own bignumber/decimal parsers as needed during the DOM traversal. |
I still think having a decimal/bigint in the library would be a good idea, though! It's a reasonable thing to want, I think. |
I'm eagerly looking forward to that. Should be easy enough to add and opens a lot of new possibilities (and I'll finally be able to get rid of my fork).
The thing about that is that unlike binary floating point where it is standardized and ubiquitous, different applications/projects call for different types of bigints and decimals, and people might want to use various parsers that have different tradeoffs. E.g. For some applications you might want to use a decimal floating point type and for others you might want a fixed decimal type constrained to 64 or 128 bits and the same goes for bigints. |
@jkeiser was any progress made with this feature / |
@hedgefund Would you be willing to work on a pull request for this feature? |
There is now a |
@hedgefund Here is a code sample of what is possible right now... simdjson::ondemand::parser parser;
simdjson::padded_string docdata = R"({"value":12321323213213213213213213213211223})"_padded;
simdjson::ondemand::document doc = parser.iterate(docdata);
simdjson::ondemand::object obj = doc.get_object();
string_view token = obj["value"].raw_json_token();
std::cout << token << std::endl;
// token == "12321323213213213213213213213211223" Users who want support for big integers 'now' or 'soon', should be willing to contribute pull requests. |
I think that it doesn't make much sense to do it a global setting for a parser, often you know that most of the numbers are small numbers and some specific fields of json you need to parse in your custom class (for example decimal). |
Actually, I think raw_json_token() provides most of the value for me. I can make an informed decision, how to parse the number if we get it from specific API. What your library could provide is faster parsing because it's known to be fast to parse other things. The question of what you can provide to parse the class that I will implement tomorrow is non-trivial. Also, you can make an API a bit nicer for these cases by allowing to get arbitrary types if they provide some API (or registered sonehow). So that I can write something along the lines of |
Pull request invited! |
Seeing similar issue in Velox with 233897314173811950000: facebookincubator/velox#8066 |
@mbasmanova Reproducing my answer here... In standard C++, we have 64-bit, 32-bit, 16-bit and 8-bit integers and well as 64-bit, 32-bit, and 16-bit floating-point integers. None of them can represent the value provided. We can try with a 64-bit integer ( You can also parse it as a double (64-bit floating point integer): #include "simdjson.h"
#include <iostream>
using namespace simdjson;
int main(void) {
padded_string json = R"( [ 233897314173811950000 ])"_padded;
ondemand::parser parser;
ondemand::document doc;
if(auto error = parser.iterate(json).get(doc); error) { std::cout << error << std::endl; return EXIT_FAILURE; }
double val;
if(auto error = doc.at(0).get_double().get(val)) { std::cout << error << std::endl; return EXIT_FAILURE; }
printf("%.0f\n", val);
return EXIT_SUCCESS;
} This will print 233897314173811949568 or From the string You have to decide what you want to do with it. If you want to store it using your own type, then you can get access to the raw type and then provide your own parser (e.g., to MyDecimalType or to MyBigIntegerType). The simdjson library provides this functionality as well. What the simdjson library does not provide is a custom big-integer or decimal type, because that's not standard. We do, however, make it easy for you to do your own parsing if you so desire. Please see the relevant section of our documentation. |
I'd like to suggest that handling integers exceeding 64 bits by default as floats, unless explicitly opted out, could be a more accommodating approach. It's worth considering the essence of JSON, which stands for JavaScript Object Notation. It's notable that every JavaScript engine currently treats very large numbers as floats unless a dedicated BigInt type is specified. For instance, when executing |
@richarddd I think that's what @Yuhta proposed in facebookincubator/velox#8066 (comment) PR invited. |
I would if I could. My c++ skills are virtually non-existent. I've created a PR in the rust port that enable this behaviour via a feature flag: simd-lite/simd-json#363 |
Closed via #2139 This will be part of the next release. |
The following json parsing failed because of the number "38793000000000000001". Can I give an option when parsing? Convert to a string when parsing an integer overflow.
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793000000000000001]
}
}
The text was updated successfully, but these errors were encountered: