Vectorize string parsing #1161
Conversation
This is not backed by benchmarks, but it seems reasonable that we'd be more starved for cache than CPU in IO-bound tasks. It also simplifies code a bit and frees up some memory, which is probably a good thing.
Also, just to make sure we're on the same page: as simd-json seems to be serde-compatible, what is the main focus of this crate? Is it supposed to be a simple canonical implementation, or are more complicated performance optimizations welcome here?
Thank you, this is terrific work.
I was able to come up with 2 small variations that outperform this PR on my machine, even without `inline(always)`.

Variation 1: in Mycroft's algorithm, move the `& (ONE_BYTES << 7)` and `!= 0` out of the loop. #1167

Variation 2: skip the memchr. Mycroft's algorithm works for finding any specific byte, by xor-ing the original chunk and then testing for a zero byte. #1168 (A sketch of this trick follows below.)
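For reference, a minimal sketch of that zero-byte trick (illustrative names and shape, not the PR's actual code):

```rust
/// 0x0101...01: the byte 0x01 broadcast across the word.
const ONE_BYTES: usize = usize::MAX / 255;

/// Mycroft's trick: returns true if any byte of `chunk` equals `needle`.
/// Xor-ing with the broadcast needle maps matching bytes to 0x00, and
/// `(x - 0x01..01) & !x & 0x80..80` is nonzero iff `x` has a zero byte.
fn contains_byte(chunk: usize, needle: u8) -> bool {
    let xored = chunk ^ (ONE_BYTES * needle as usize);
    xored.wrapping_sub(ONE_BYTES) & !xored & (ONE_BYTES << 7) != 0
}
```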
I was measuring using the following program, whose performance is dominated by `SliceRead::ignore_str`:
```rust
const ALIGNMENT_OFFSET: usize = 0;

// `offset` shifts `json` relative to the 8-byte alignment boundary,
// so the parser's sensitivity to input alignment can be measured.
#[repr(C, align(8))]
struct Static<T: ?Sized = [u8]> {
    offset: [u8; ALIGNMENT_OFFSET],
    json: T,
}

static STATIC: &Static = &Static {
    offset: [0u8; ALIGNMENT_OFFSET],
    json: *include_bytes!("twitter.json"), // https://github.com/serde-rs/json-benchmark/tree/master/data
};

fn main() {
    let iterations = 10000;
    let json = std::str::from_utf8(&STATIC.json).unwrap();
    let begin = std::time::Instant::now();
    for _ in 0..iterations {
        serde_json::from_str::<serde::de::IgnoredAny>(json).unwrap();
    }
    let bytes = iterations * json.len();
    let throughput = bytes as f64 / 1024.0 / 1024.0 / begin.elapsed().as_secs_f64();
    println!("{:.02} MB/sec", throughput);
}
```
I am interested in whether you find either of the variations promising to incorporate into this implementation, whether you can reproduce them being any faster, and whether they mitigate the small-string slowdown at all (which I didn't test).
Yes, this is fine.
I think this is fine. The set of three benchmark files you already cited above is chosen to be representative enough of real-world performance-sensitive workflows, including with regard to the distribution of string lengths they contain. Before merging, I would want to try benchmarking another file that disproportionately contains small strings, to get an upper bound on how much such data would be pessimized, but I expect it won't be much.
Performance optimizations are welcome. The relevant constraint is that the public API needs to remain good enough and approachable for 99% of JSON use cases. Simd-json's API is not this, because so many optimizations have leaked into the public API.
```
error: adding items after statements is confusing, since items exist from the start of the scope
   --> src/read.rs:446:9
    |
446 |         const STEP: usize = mem::size_of::<usize>();
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#items_after_statements
    = note: `-D clippy::items-after-statements` implied by `-D clippy::pedantic`
    = help: to override `-D clippy::pedantic` add `#[allow(clippy::items_after_statements)]`
```

Feel free to add the allow.
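(For illustration only, not the PR's code: the lint is satisfied either by the allow attribute above or by hoisting the item before the first statement.)

```rust
use std::mem;

fn skip_chunks(slice: &[u8]) -> usize {
    // Item first, before any statements, as the lint prefers.
    const STEP: usize = mem::size_of::<usize>();
    let mut index = 0;
    while index + STEP <= slice.len() {
        index += STEP;
    }
    index
}
```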
I'm incredibly happy with the direction this is going. I'll see if I can incorporate your patches and maybe optimize stuff some more soon.
Force-pushed from fce646f to a60f1e1.
I've pushed a rewrite of #1168 that seems cleaner to me and also seems to produce better x86 code. Compared to the previous version of the PR, the benchmark result is:
...so there are no regressions. This was confusing to me at first, because memchr2 is supposed to be faster than the SWAR code. But memory reads are slow, so perhaps doing two passes over the data is slower than just one. I think this change will also make performance more even by avoiding a performance cliff when the chunk doesn't fit in L1.
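For context, a hedged sketch of the two-pass shape being dropped (names and structure are mine, not the crate's): a memchr2 pass locates the next `"` or `\`, and the skipped-over bytes are then re-checked for control characters, so the data is read twice.

```rust
use memchr::memchr2;

/// Two-pass scan (illustrative): find the next quote or backslash with
/// memchr2, then re-scan the bytes before it for control characters.
fn find_special_two_pass(haystack: &[u8]) -> Option<usize> {
    let end = memchr2(b'"', b'\\', haystack).unwrap_or(haystack.len());
    if let Some(ctrl) = haystack[..end].iter().position(|&b| b < 0x20) {
        return Some(ctrl);
    }
    (end < haystack.len()).then_some(end)
}
```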
I also ran a more precise comparison to the original naive code:

I'm a bit surprised that it's fast on short strings too. That's probably because we no longer perform an indirect call in …

Some commentary on the code:
On the flip side, here's the comparison with the "factor out the check" variation:
canada DOM is a bona fide regression, but still better than naive code. citm DOM is noise. I think we should just bite the bullet and drop #1167.
For completeness, here's the comparison for the benchmark in #1161 (review) (5 runs performed):
Force-pushed from a60f1e1 to a95d6df.
I think this PR is ready for merging (bar squashing; I'll be happy to squash it if you want). Unrolling the loop manually twice (this is notably distinct from …). We'd probably have more luck with the `wide` crate for further improvements; I can work on that if you'd like me to, but I think it's better to land this PR in the meantime anyway. (A sketch of what manual two-chunk unrolling means follows below.)
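A minimal sketch of manual two-chunk unrolling, with a hypothetical `has_special` standing in for the PR's combined SWAR test (my illustration, not the PR's code):

```rust
const ONE_BYTES: usize = usize::MAX / 255;
const STEP: usize = core::mem::size_of::<usize>();

/// Hypothetical combined SWAR test: does the chunk contain `"`, `\`,
/// or a control character (< 0x20)?
fn has_special(chars: usize) -> bool {
    let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
    let quote = chars ^ (ONE_BYTES * b'"' as usize);
    let contains_quote = quote.wrapping_sub(ONE_BYTES) & !quote;
    let backslash = chars ^ (ONE_BYTES * b'\\' as usize);
    let contains_backslash = backslash.wrapping_sub(ONE_BYTES) & !backslash;
    (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7) != 0
}

/// Advance two chunks per iteration; the two tests are independent,
/// so the CPU can overlap their execution.
fn skip_plain(bytes: &[u8], mut i: usize) -> usize {
    while i + 2 * STEP <= bytes.len() {
        let a = usize::from_ne_bytes(bytes[i..i + STEP].try_into().unwrap());
        let b = usize::from_ne_bytes(bytes[i + STEP..i + 2 * STEP].try_into().unwrap());
        if has_special(a) || has_special(b) {
            break; // a scalar tail pins down the exact byte
        }
        i += 2 * STEP;
    }
    i
}
```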
Data:
This is a 20% regression. Data:
This is a 10% improvement. Data (strings …
Overall, this PR regresses strings of perfectly predictable, very short lengths (the break-even point seems to be around 6 characters). So the main troublemakers are objects with (a) few, (b) short, (c) fixed keys, and short values. I'm not sure how common that is in practice. I cherry-picked some datasets that I expected would be regressed, but none were:
I'm not happy about this, but if this regresses no realistic datasets, we might as well merge it and see who hollers.
I just found a regression. We shouldn't really be looking for short strings as counterexamples; we should be looking for empty strings! We also effectively treat the gaps between consecutive Unicode escapes as empty strings, and those are very common in Unicode text. For JSON-encoded War and Peace in Russian, I get:
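To illustrate (my example, not from the thread): in non-ASCII text every character becomes a `\uXXXX` escape, so each escape is immediately followed by another, and the literal segment scanned between them has length zero.

```rust
fn main() {
    // "Война и мир" ("War and Peace") with the Cyrillic characters
    // \u-escaped: the string scanner sees an empty literal run before
    // almost every backslash.
    let json = r#""\u0412\u043e\u0439\u043d\u0430 \u0438 \u043c\u0438\u0440""#;
    let s: String = serde_json::from_str(json).unwrap();
    assert_eq!(s, "Война и мир");
}
```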
Force-pushed from 175cde9 to e43da5e.
I've committed a fix that should offset the performance loss from -17% to -1.7%:
I have not noticed any significant regressions on json-benchmark (there are some improvements too, but I believe it's all just noise). I don't believe this deoptimization is a concern anymore.
Thanks!
```rust
let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
let chars_quote = chars ^ (ONE_BYTES * Chunk::from(b'"'));
let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars_quote;
let chars_backslash = chars ^ (ONE_BYTES * Chunk::from(b'\\'));
let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars_backslash;
let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
```
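A quick sanity check of this combined mask (my sketch, not part of the PR): on a little-endian chunk with a quote at byte 3, the mask's only set bit is the high bit of that byte.

```rust
fn main() {
    type Chunk = u64;
    const ONE_BYTES: Chunk = Chunk::MAX / 255;
    let chars = Chunk::from_le_bytes(*b"abc\"defg");
    let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
    let chars_quote = chars ^ (ONE_BYTES * Chunk::from(b'"'));
    let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars_quote;
    let chars_backslash = chars ^ (ONE_BYTES * Chunk::from(b'\\'));
    let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars_backslash;
    let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
    // First special byte is at index 3 (the '"').
    assert_eq!(masked.trailing_zeros() / 8, 3);
}
```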
A fun followup would be to throw this arithmetic expression into a superoptimizer and see if there is some shorter expression that it is equivalent to. It seems very likely to me that the same quantity can be computed in fewer operations than written here.
Some observations from https://rust.godbolt.org/z/66fGEqY6c:
LLVM already rewrites this as:
```diff
- let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars_quote;
- let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars_backslash;
+ let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars;
+ let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars;
  let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
```
because it has proven that `A & !(chars ^ B) & C` is equivalent to `A & !chars & C` whenever `B & C == 0`, i.e. the `^` only affects bits that are later erased by the second `&`.

Then it factored out the three `& !chars` into one:
```diff
- let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
- let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars;
- let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars;
- let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
+ let tmp_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20);
+ let tmp_quote = chars_quote.wrapping_sub(ONE_BYTES);
+ let tmp_backslash = chars_backslash.wrapping_sub(ONE_BYTES);
+ let masked = (tmp_ctrl | tmp_quote | tmp_backslash) & !chars & (ONE_BYTES << 7);
```
A superoptimizer would do this kind of thing, but better.
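As a spot check of the side condition driving that rewrite (my sketch, not from the thread): both broadcast constants, `0x22...22` and `0x5c...5c`, have every byte's high bit clear, so `B & C == 0` holds against the `0x80...80` mask, and the identity can be verified on random words:

```rust
fn main() {
    const ONE_BYTES: u64 = u64::MAX / 255;
    let c = ONE_BYTES << 7; // 0x8080...80
    for b in [ONE_BYTES * u64::from(b'"'), ONE_BYTES * u64::from(b'\\')] {
        assert_eq!(b & c, 0); // the side condition
        let mut x: u64 = 0x9e37_79b9_7f4a_7c15;
        for _ in 0..100_000 {
            // xorshift64 to generate arbitrary test words
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            let a = x.wrapping_mul(0x2545_f491_4f6c_dd1d);
            let chars = x;
            assert_eq!(a & !(chars ^ b) & c, a & !chars & c);
        }
    }
}
```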
I've implemented your suggestion here, and I need a bit of your input regarding my approach.

What we really want is to find the first character such that `ch == b'"' || ch == b'\\' || ch < 0x20`. memchr3 supports searches of the kind `ch == a || ch == b || ch == c`, which is almost what we need. SIMD comparisons for equality and for inequality are not significantly different performance-wise or implementation-wise, so memchr could theoretically help us out. I've filed BurntSushi/memchr#157, but chances are this isn't going to be implemented anytime soon.
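As a plain scalar statement of that predicate (for reference only; the whole point of the PR is to vectorize it):

```rust
/// Index of the first byte that ends a plain run inside a JSON string:
/// a closing quote, the start of an escape, or a control character.
fn find_special(haystack: &[u8]) -> Option<usize> {
    haystack
        .iter()
        .position(|&ch| ch == b'"' || ch == b'\\' || ch < 0x20)
}
```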
"
or\
and then find escapes in an ad-hoc way. However, it turns out thatch < 0x20
in a hot loop is not any faster thanESCAPES[ch]
, on benchmarks at least. The reason is likely that the table is brought into L1 cache and the string isn't, so the speed of string iteration shadows the concerns about table access.This forces us to use SIMD in serde-json directly. Luckily, there's a way to implement what we need in an architecture-agnostic way, and it's faster than using intrinsics on my benchmarks. This unfortunately increases code complexity a bit and requires an
#[inline(always)]
annotation, but this is what we get in return:I'd say this is a good result overall, but the citm pessimization bothers me a bit. It seems like the new implementation is slower when most of the strings are very short (less than 16 characters, perhaps?). In practice, this happens when the data contains lots of objects with many keys but small values, e.g.
{"areaId": 205706007, "blockIds": []}
x 10k in citm.So the questions are: