Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ujson #2737

Merged
merged 15 commits into from
Jul 2, 2021
Merged

Replace ujson #2737

merged 15 commits into from
Jul 2, 2021

Conversation

damilgra
Copy link
Contributor

@damilgra damilgra commented Mar 4, 2021

Fixes #2736

damilgra added 6 commits March 2, 2021 17:58
Added a unit test for issue Kinto#2736 .

Modified the memory storage backend to use the native json library as the added unit test fails with the memory storage if simdjson is employed. simdjson cannot *deserialize* strings representing numbers > 64 bits. On the other hand, it can serialize a > 64-bit number.

So, currently we have:
Memory Storage - uses native json
PostgreSQL Storage - uses simdjson
…Memory storage backend

Resolve Kinto#2736 by replacing ujson which cannot serialize or deserialize numbers > 64 bits.

simdjson, a performant json library is used for the PostgreSQL backend while native json is used for the memory storage backend in order to ensure that the unit test demonstrating the issue passes.

Note 1: there is no unit test which verifies that this works with PostgreSQL - I manually modified the test configuration setup however to confirm that it does in fact pass.
@@ -13,7 +13,8 @@
StorageBase,
exceptions,
)
from kinto.core.utils import COMPARISON, find_nested_value, json
from kinto.core.utils import COMPARISON, find_nested_value
import json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you import json from kinto.core.utils here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, kinto.core.utils redefines json as simdjson, so no. Currently I have:
simdjson for PostgreSQL, and native json for Memory

Not sure whether or not a more performant (than native) json library is needed but that's the rationale. Alternatively it's possible simdjson could be used for both storages but the way records are added via memory storage would need to be modified which is a far more invasive change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage of json in the memory backend is rather straightforward. What issues would we have if we would use simdjson there too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the unit test I introduced will raise an exception here (found in core/storage/memory.py):

obj = json.loads(json.dumps(obj))

The reason is that simdjson cannot deserialize strings representing numbers > 64 bits, as per simdjson/simdjson#167. On the other hand, it can serialize a > 64-bit number.

Copy link
Contributor

@leplatrem leplatrem Mar 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, gotcha.

And then I would suggest that once we merge this, you create an issue to keep track of this so that we can use simdjson everywhere once this is fixed upstream.

(we could also do something like obj = simdjson.loads(json.dumps(obj)) but I don't think it brings a lot of value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. I'll create an issue once it's merged and I've subscribed to the simdjson issue to track progress there.

@TkTech
Copy link

TkTech commented Mar 9, 2021

The reason is that simdjson cannot deserialize strings representing numbers > 64 bits, as per simdjson/simdjson#167. On the other hand, it can serialize a > 64-bit number.

Funny enough, this is because pysimdjson's dump() and dumps() are just aliases to Python's built in json module. The underlying simdjson library doesn't do serialization at all (yet).

@damilgra
Copy link
Contributor Author

Funny enough, this is because pysimdjson's dump() and dumps() are just aliases to Python's built in json module. The underlying simdjson library doesn't do serialization at all (yet).

Thanks very much, good to know. I'm going to investigate other options.

Replaced pysimdjson with python-rapidjson.

As pysimdjson uses stdlib json to serialize json, serialization performance is subpar compared to ujson. With rapidjson, and the selected options used, performance for serialization should be comparable to that of ujson and simdjson, whereas deserialization is slower than that of ujson but better than that of ujson, according to benchmarks found here:
https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html

For serialization, we do not employ the "number_mode=NM_NATIVE" setting which improves performance, in favour of correctness and in order to resolve Kinto#2677.
return json.dumps(v, escape_forward_slashes=False)
class json:
def dumps(v, **kw):
return rapidjson.dumps(v, bytes_mode=rapidjson.BM_NONE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping the kw will introduce breaking changes... See

canonical_json = json.dumps(record, sort_keys=True, separators=(",", ":"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I'm misunderstanding but the file you reference imports stdlib json so it shouldn't be affected. I will make a change however to respect passed in kw, if any; grepping the source I don't see any instances where this happens right now.

import json

def record_size(record):
# We cannot use ultrajson here, since the separator option is not available.
canonical_json = json.dumps(record, sort_keys=True, separators=(",", ":"))
return len(canonical_json)

kinto/core/utils.py Outdated Show resolved Hide resolved
@leplatrem
Copy link
Contributor

@damilgra Are you still motivated to push this to the finish line?

@damilgra
Copy link
Contributor Author

damilgra commented Jun 7, 2021

@damilgra Are you still motivated to push this to the finish line?

Sure, anything else I need to do? I'm trying to get a tox environment set up, but if memory serves there were some failing tests unrelated to changes I made. Please let me know if you have any other changes you'd like me to make.

@leplatrem leplatrem merged commit 951dd25 into Kinto:master Jul 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decoding record with > 64-bit integer fails
3 participants