
Speedup poly_Rq_inv in avx2-hps2048509 #5

Merged

Conversation

OussamaDanba (Contributor)

The commit message has a detailed explanation, but it is the same method as used in avx2-hrss701.
This pull request brings roughly a 13× speedup to poly_Rq_inv, which means it is now about 105× faster than the reference implementation.

Also renames a mask in poly_R2_mul in avx2-hrss701 since it had the wrong name (already fixed in the avx2-hps2048509 version).

This should be one of the last larger optimizations for hps2048509 (poly_lift can still be optimized, if I remember correctly).

…Rq_inv

Explanation:
In order to speed up poly_Rq_inv further we need to speed up poly_R2_inv.
We set out to compute f^(2^508 - 2) = f^(-1) mod (2, Phi_509).
This can be done using an addition chain, which results in 12 multiplications
and 13 multi-squarings. This is particularly efficient in our case
since our squaring operation can be expressed as a bit permutation.
A multi-squaring is a sequence of bit permutations, which collapses into a single
combined bit permutation. There are two implementations of multi-squaring:
one using the patience sorting algorithm and the other using byte-wise shuffling.
For smaller numbers of squarings patience sorting is preferred; otherwise
byte-wise shuffling is used.
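
As an illustration of why a multi-squaring collapses into one permutation, here is a minimal scalar C sketch. This is not the vectorized code from this PR; the one-byte-per-coefficient layout, the function name, and working modulo x^509 - 1 (of which Phi_509 is a factor) are assumptions made for clarity.

```c
#include <stdint.h>

#define N 509 /* working modulo x^509 - 1, of which Phi_509 is a factor */

/* Apply k squarings of a binary polynomial in a single pass.
 * Over GF(2), squaring sends coefficient i to position (2*i) mod N,
 * so k squarings send it to (2^k * i) mod N: one combined bit permutation. */
static void poly_R2_multisquare(uint8_t out[N], const uint8_t in[N], unsigned k)
{
    uint64_t step = 1;
    for (unsigned j = 0; j < k; j++)
        step = (2 * step) % N;            /* step = 2^k mod N */

    for (uint64_t i = 0; i < N; i++)
        out[(i * step) % N] = in[i];      /* coefficient i lands at (2^k * i) mod N */
}
```

The AVX2 code stores one coefficient per bit instead, so the permutation has to be realized with vector instructions; the patience-sorting and byte-wise-shuffling variants mentioned above are two ways of doing exactly that.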

The multiplications are also efficient due to the CLMUL instruction set extension
(giving us vpclmulqdq for carry-less multiplication of 64-bit polynomials).
In order to multiply two 512-bit polynomials we use two levels of Karatsuba
followed by one schoolbook multiplication (implemented using vpclmulqdq).
This gives 36 multiplications of 64-bit polynomials. Afterwards we do the reduction.
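
As a sketch of the bottom-level schoolbook step, here is the same idea written with the SSE intrinsic for the carry-less multiply instruction rather than the hand-written AVX2 assembly of this PR; the function name is made up for illustration.

```c
#include <wmmintrin.h> /* carry-less multiply intrinsics; compile with -mpclmul */

/* Schoolbook product of two 128-bit binary polynomials a and b,
 * returned as a 256-bit result split into (hi, lo),
 * using four 64x64-bit carry-less multiplications. */
static void clmul128_schoolbook(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
{
    __m128i a0b0 = _mm_clmulepi64_si128(a, b, 0x00); /* a_lo * b_lo */
    __m128i a1b1 = _mm_clmulepi64_si128(a, b, 0x11); /* a_hi * b_hi */
    __m128i mid  = _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x10),  /* a_lo * b_hi */
                                 _mm_clmulepi64_si128(a, b, 0x01)); /* a_hi * b_lo */

    /* result = a1b1 * x^128 + mid * x^64 + a0b0 */
    *lo = _mm_xor_si128(a0b0, _mm_slli_si128(mid, 8)); /* low half of mid into bits 64..127 */
    *hi = _mm_xor_si128(a1b1, _mm_srli_si128(mid, 8)); /* high half of mid into bits 128..191 */
}
```

Two Karatsuba levels split a 512-bit operand into 3 × 3 = 9 products of 128-bit pieces, and each of those costs 4 carry-less multiplications as above, which accounts for the 9 × 4 = 36 count.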

The result is a very fast poly_R2_inv. This is the same strategy as used
in avx2-hrss701. See the paper "High-speed key encapsulation from NTRU"
for more details on it.

Results:
Note that poly_Rq_mul was already optimized, which had the side effect of
partially optimizing poly_Rq_inv, since poly_R2_inv_to_Rq_inv makes
8 poly_Rq_mul calls.
On an Intel i5-8250u using gcc 8.3.0 the previous poly_Rq_inv took
about 413000 cycles on average whereas the current poly_Rq_inv takes
about 30600 cycles on average. This is about a 13 times speedup.
Compared to the reference poly_Rq_inv (3200000 cycles) we're about
105 times faster.
Regarding the mask rename in poly_R2_mul: previously the name implied that only the
lowest quadword is kept, but in reality the lowest three quadwords are kept.
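
For illustration only (this is a generic AVX2 snippet, not the PR's code, and the names are made up): a constant that keeps the lowest three quadwords of a 256-bit vector looks like this, whereas the old name suggested that only quadword 0 survives the AND.

```c
#include <immintrin.h>

/* Zeroes the highest 64-bit lane and keeps the lowest three;
 * a name implying "lowest quadword only" would be misleading. */
static inline __m256i keep_low_three_quadwords(__m256i v)
{
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    return _mm256_and_si256(v, mask);
}
```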
@jschanck jschanck merged commit 08529b6 into jschanck:master Jun 2, 2019
@OussamaDanba OussamaDanba deleted the avx2-hps2048509_optimizations branch June 2, 2019 22:01
@OussamaDanba OussamaDanba restored the avx2-hps2048509_optimizations branch June 2, 2019 22:08