
Performance regression hamming #188

Open · maxbachmann opened this issue Apr 20, 2023 · 13 comments

@maxbachmann (Contributor)

The new Rust backend appears to cause a pretty steep performance regression in the hamming implementation:

Old

[benchmark plot: hamming_old]

New

[benchmark plot: hamming]

I am sure this should be fairly simple to fix, given the low complexity of the Hamming similarity.

@jamesturk (Owner) commented Apr 20, 2023 via email

@jamesturk (Owner) commented Apr 20, 2023 via email

@maxbachmann (Contributor · Author)

> can you provide the benchmark code you are using, and details on the machine? this is not in line with what I'm seeing in my benchmarks, and I'm wondering if it's an apples-to-apples comparison, since the unicode normalization is 90 percent of the runtime

I do not see why you would need any Unicode normalization: since you receive Unicode strings from Python, this feels like a waste of time. The whole benchmark script is a bit longer, since it allows things like visualization and the generation of a random dataset.

> One other question, if you're able to answer it: some Hamming implementations refuse strings of dissimilar length, which is the only way you'd get O(1) performance. Is that what this other library is doing? The old C implementation of Hamming is about as concise as can be, so I am really curious what's going on in the original bench

This other library does the same as jellyfish:

  • it accepts strings with any number of differing characters
  • it accepts length differences and counts them as insertions/deletions (a minimal sketch of these semantics follows below)
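
For illustration, here is a minimal Python sketch of these Hamming semantics (my own sketch, not either library's actual code): the length difference is added to the count of differing characters.

def hamming(s1: str, s2: str) -> int:
    """Hamming distance where a length difference counts as one
    insertion/deletion per extra character."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # make s1 the longer string
    # zip() stops at the shorter string; the unmatched suffix of the
    # longer string is covered by the length difference.
    return (len(s1) - len(s2)) + sum(a != b for a, b in zip(s1, s2))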

I am really stunned by the performance difference as well. Did the old implementation copy the string for the comparison? In rapidfuzz I always work with a view of the internal representation inside the Python string.

This is a very simple benchmark (I can share the full benchmark script mentioned above if that helps as well). These are very long sequences, so the Python call overhead becomes less relevant.

from timeit import timeit

setup="""
from jellyfish import hamming_distance as jellyfish
from rapidfuzz.distance.Hamming import distance as rapidfuzz
a="a"*200000
b="b"*200000
"""

print(timeit("jellyfish(a, b)", setup=setup, number=1000))
print(timeit("jellyfish(a, a)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, b)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, a)", setup=setup, number=1000))

This results in:
Old library version:

0.386396555997635
0.4072674199996982
0.03347623199806549
0.0331174099992495

New library version:

4.556943121999211
4.313729999998031
0.033683902001939714
0.03330945900233928

@jamesturk (Owner) commented Apr 20, 2023 via email

@maxbachmann (Contributor · Author) commented Apr 20, 2023

> This is just PyO3/Rust overhead, and the unicode normalization matters because going character by character isn't the right way to handle unicode strings, since there are multi-length chars.

Hm, out of interest, does Python not perform the same kind of normalization? As in, does

[x for x in my_string]

lead to any incorrect results? (An example of what can go wrong is sketched below.)
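
For context, a short example of what goes wrong without normalization (my illustration): iterating a Python str yields code points, not grapheme clusters, so two canonically equivalent strings can compare as unequal character by character.

import unicodedata

composed = "caf\u00e9"     # "café" with a precomposed é (U+00E9)
decomposed = "cafe\u0301"  # "café" as "e" + combining acute accent (U+0301)

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

# After NFC normalization both strings iterate identically:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True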

@maxbachmann (Contributor · Author) commented Apr 20, 2023

Hm, I just had a look at the old implementation, which calls unicodedata_normalize as well. Maybe I should do this too. Sounds like it would be worth it in terms of correctness 👍

For most metrics this should not really have a big impact on performance.

@maxbachmann (Contributor · Author) commented Apr 20, 2023

Hm, especially for ASCII strings (which I benchmarked) this should not have any performance impact, since they are already normalized. So it should be possible to avoid the overhead in this case, as in the sketch below.
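
A minimal sketch of that fast path (my own sketch; the NFC form and the str.isascii check are assumptions, not the library's actual code):

import unicodedata

def normalize_if_needed(s: str) -> str:
    # ASCII strings are unchanged by Unicode normalization,
    # so the comparatively expensive call can be skipped.
    if s.isascii():
        return s
    return unicodedata.normalize("NFC", s)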

jamesturk reopened this Apr 20, 2023
@jamesturk (Owner) commented Apr 20, 2023

note to self: Alternate Rust implementation to benchmark

pub fn vec_hamming_distance<T: PartialEq>(s1: &FastVec<T>, s2: &FastVec<T>) -> usize {
    // order the inputs so that `longer` is at least as long as `shorter`
    let (longer, shorter) = if s1.len() > s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };

    // distance is the difference in length + the number of differing chars;
    // zip() stops at the shorter sequence, so the unmatched suffix of the
    // longer one is covered by `len_diff`
    let len_diff = longer.len() - shorter.len();
    let differing_chars: usize = shorter
        .iter()
        .zip(longer.iter())
        .map(|(a, b)| (a != b) as usize)
        .sum();

    len_diff + differing_chars
}

Profiling the code shows that a lot of the time is spent there whether it is necessary or not, but I'm fine with that in the name of correctness at edge cases (of which this library had plenty more before adding this normalization years ago). We're talking about nearly immeasurable amounts of time in practice, so the correctness/safety is worth it for these use cases.

https://github.com/jamesturk/jellyfish/blob/main/benchmarks/compare.ipynb

The current average difference from the C version is 1.5x in practice. I see that with a 200k-character string you get different results, but I can't think of a realistic use case anywhere near that long 😄

@maxbachmann (Contributor · Author) commented Apr 20, 2023

> The current average difference from the C version is 1.5x in practice. I see that with a 200k-character string you get different results, but I can't think of a realistic use case anywhere near that long 😄

This was mostly an odd occurrence I noticed when benchmarking my other metrics. My first thought was that my benchmarks had to be broken somehow. I do, however, know of a user of my library using the Levenshtein distance + Levenshtein edit-operation backtracing with sequences of over 100k characters (to compare OCR results of archive documents) 😉

@jamesturk (Owner)

Gotcha, I'll keep this open until I get a chance to prod at the alternate implementation above. I definitely appreciate you raising it in case there is some low-hanging fruit, but I don't want to micro-optimize and lose clarity/correctness for this one.

@maxbachmann (Contributor · Author)

Yes, I completely understand this. It looks pretty damning in the benchmarks, but given the low overall times it is not of much importance in practice.

The implementations of Levenshtein and Jaro used in jellyfish are the ones with the largest potential for optimization, but those optimizations are a lot more complex as well 😓

[benchmark plots: levenshtein, jaro]

@jamesturk (Owner)

These are interesting for sure; I'll need to add some length-based benchmarks on my end to see how this changes.

A tradeoff I made with the Rust version is the choice of a vector type that can hop between the stack & the heap; it's entirely predicated on the idea that most people are passing short strings. I realize I don't know if that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.

Have you heard much from folks sending strings >30 characters in on a regular basis? I'm wondering what use cases I'm missing.

@maxbachmann (Contributor · Author) commented Apr 20, 2023

> A tradeoff I made with the Rust version is the choice of a vector type that can hop between the stack & the heap

Yes, that's a pretty cool optimization. In C++ the same idea is used in e.g. the std::string class, as the short string optimization.

> it's entirely predicated on the idea that most people are passing short strings. I realize I don't know if that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.

Most of my users have relatively short strings, but tons of them (as in millions of strings).

> Have you heard much from folks sending strings >30 characters in on a regular basis? I'm wondering what use cases I'm missing.

For Jaro / JaroWinkler I have only seen relatively short sequences used. This is probably because the metric was designed to compare sequences with spelling mistakes.
For the Levenshtein distance I have absolutely seen people using longer strings. The example I was talking about earlier was https://github.com/qurator-spk/dinglehopper, and it's used a lot in computational biology as well.

Optimizations

I do tons of different optimizations behind the scenes, while trying to provide the user with a relatively simple interface. I provide an API to compare single sequences, similar to jellyfish, but also one to compare multiple sequences in parallel. The API comparing single sequences is a lot more limited in its possible optimizations.

For these functions I use the following things:

  • I use specialized implementations based on string length + character width. If available for a given metric, these are usually bitparallel implementations, which allow comparing 64 characters in parallel
  • I allow the user to provide a score cutoff, which allows me to exit early once it is reached. In addition I provide a score hint, which is used to select an implementation that is slower if the score is lower, but can be significantly faster if the score is at least as good (see the usage sketch below)
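
As a usage sketch from the Python side (my example; score_cutoff is part of the rapidfuzz API, and score_hint is available in recent rapidfuzz versions):

from rapidfuzz.distance import Levenshtein

# plain distance
print(Levenshtein.distance("levenshtein", "levenstein"))  # 1

# exit early once the distance exceeds the cutoff;
# score_cutoff + 1 is returned in that case
print(Levenshtein.distance("levenshtein", "frankenstein", score_cutoff=2))  # 3

# hint at the expected distance so a faster specialized
# implementation can be chosen when the hint holds
print(Levenshtein.distance("levenshtein", "levenstein", score_hint=2))  # 1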

When comparing multiple sequences at once I do the following things:

  • I do not call back into Python for each comparison. Instead I hide a PyCapsule in the function attributes, which allows me to directly call the corresponding C++ function.
  • I cache some parts between string comparisons, e.g. bitvectors for bitparallel algorithms can often be created ahead of time. Especially for short sequences this saves a lot of time.
  • for some metrics I use SIMD implementations to compare multiple short strings in parallel (e.g. compare 1 string to 32 strings with <= 8 characters using AVX2)
  • the user can enable multithreading (see the batch sketch below)
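
For illustration, the batch API looks roughly like this from the Python side (my example; process.cdist and its workers parameter are part of the rapidfuzz API):

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

queries = ["parrot", "carrot"]
choices = ["parol", "carat", "arrows"]

# computes the full distance matrix without calling back into Python
# for each pair; workers=-1 uses all available CPU cores
matrix = process.cdist(queries, choices, scorer=Levenshtein.distance, workers=-1)
print(matrix)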

Algorithms

These are the complexities of my implementations of the algorithms used in jellyfish:

algorithm            memory complexity   time complexity
Levenshtein          O(N)                O(⌈M/64⌉*N)
Jaro                 O(N)                O(⌈M/64⌉*N)
JaroWinkler          O(N)                O(⌈M/64⌉*N)
DamerauLevenshtein   O(N)                O(M*N)

These algorithms have the following complexities in jellyfish:

algorithm            memory complexity   time complexity
Levenshtein          O(N)                O(M*N)
Jaro                 O(N)                O(M*N)
JaroWinkler          O(N)                O(M*N)
DamerauLevenshtein   O(M*N)              O(M*N)
