
Speedup poly_Rq_inv in avx2-hps2048509 #5

Merged

Conversation

OussamaDanba (Contributor)

The commit message has a detailed explanation, but it is the same method as used in avx2-hrss701.
This pull request brings roughly a 13× speedup to poly_Rq_inv, which means it is now about 105× faster than the reference implementation.

Also renames a mask in poly_R2_mul in avx2-hrss701 since it had the wrong name (already fixed in the avx2-hps2048509 version).

This should be one of the last larger optimizations for hps2048509 (poly_lift can still be optimized, if I remember correctly).

…Rq_inv

Explanation:
In order to speed up poly_Rq_inv further we need to speed up poly_R2_inv.
We set out to compute f^(2^508 - 2) = f^(-1) mod (2, Phi_509).
This can be done using an addition chain, which results in 12 multiplications
and 13 multi-squarings. This is particularly efficient in our case
since our squaring operation can be expressed as a bit permutation.
A multi-squaring is a sequence of bit permutations, which collapses into a single
combined bit permutation. There are two implementations of multi-squaring:
one using the patience sorting algorithm and the other using byte-wise shuffling.
For smaller numbers of squarings patience sorting is preferred; otherwise
byte-wise shuffling is used.
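
As an illustration of why a multi-squaring collapses into one permutation, here is a minimal scalar C sketch. This is not the vectorized code from this PR; the one-byte-per-coefficient layout, the function name, and working modulo x^509 - 1 (of which Phi_509 is a factor) are assumptions made for clarity.

```c
#include <stdint.h>

#define N 509 /* working modulo x^509 - 1, of which Phi_509 is a factor */

/* Apply k squarings of a binary polynomial in a single pass.
 * Over GF(2), squaring sends coefficient i to position (2*i) mod N,
 * so k squarings send it to (2^k * i) mod N: one combined bit permutation. */
static void poly_R2_multisquare(uint8_t out[N], const uint8_t in[N], unsigned k)
{
    uint64_t step = 1;
    for (unsigned j = 0; j < k; j++)
        step = (2 * step) % N;            /* step = 2^k mod N */

    for (uint64_t i = 0; i < N; i++)
        out[(i * step) % N] = in[i];      /* coefficient i lands at (2^k * i) mod N */
}
```

The AVX2 code stores one coefficient per bit instead, so the permutation has to be realized with vector instructions; the patience-sorting and byte-wise-shuffling variants mentioned above are two ways of doing exactly that.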

The multiplications are also efficient due to the CLMUL instruction set extension
(giving us vpclmulqdq for carry-less multiplication of 64-bit polynomials).
In order to multiply two 512-bit polynomials we use two levels of Karatsuba
followed by one schoolbook multiplication (implemented using vpclmulqdq).
This gives 36 multiplications of 64-bit polynomials. Afterwards we do the reduction.
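
As a sketch of the bottom-level schoolbook step, here is the same idea written with the SSE intrinsic for the carry-less multiply instruction rather than the hand-written AVX2 assembly of this PR; the function name is made up for illustration.

```c
#include <wmmintrin.h> /* carry-less multiply intrinsics; compile with -mpclmul */

/* Schoolbook product of two 128-bit binary polynomials a and b,
 * returned as a 256-bit result split into (hi, lo),
 * using four 64x64-bit carry-less multiplications. */
static void clmul128_schoolbook(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
{
    __m128i a0b0 = _mm_clmulepi64_si128(a, b, 0x00); /* a_lo * b_lo */
    __m128i a1b1 = _mm_clmulepi64_si128(a, b, 0x11); /* a_hi * b_hi */
    __m128i mid  = _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x10),  /* a_lo * b_hi */
                                 _mm_clmulepi64_si128(a, b, 0x01)); /* a_hi * b_lo */

    /* result = a1b1 * x^128 + mid * x^64 + a0b0 */
    *lo = _mm_xor_si128(a0b0, _mm_slli_si128(mid, 8)); /* low half of mid into bits 64..127 */
    *hi = _mm_xor_si128(a1b1, _mm_srli_si128(mid, 8)); /* high half of mid into bits 128..191 */
}
```

Two Karatsuba levels split a 512-bit operand into 3 × 3 = 9 products of 128-bit pieces, and each of those costs 4 carry-less multiplications as above, which accounts for the 9 × 4 = 36 count.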

The result is a very fast poly_R2_inv. This is the same strategy as used
in avx2-hrss701. See the paper "High-speed key encapsulation from NTRU"
for more details on it.

Results:
Note that poly_Rq_mul was already optimized, which had the side effect of
partially optimizing poly_Rq_inv, since poly_R2_inv_to_Rq_inv makes
8 poly_Rq_mul calls.
On an Intel i5-8250u using gcc 8.3.0 the previous poly_Rq_inv took
about 413000 cycles on average whereas the current poly_Rq_inv takes
about 30600 cycles on average. This is about a 13 times speedup.
Compared to the reference poly_Rq_inv (3200000 cycles) we're about
105 times faster.
Regarding the mask rename in poly_R2_mul: previously the name implied that only the
lowest quadword is kept, but in reality the lowest three quadwords are kept.
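
For illustration only (this is a generic AVX2 snippet, not the PR's code, and the names are made up): a constant that keeps the lowest three quadwords of a 256-bit vector looks like this, whereas the old name suggested that only quadword 0 survives the AND.

```c
#include <immintrin.h>

/* Zeroes the highest 64-bit lane and keeps the lowest three;
 * a name implying "lowest quadword only" would be misleading. */
static inline __m256i keep_low_three_quadwords(__m256i v)
{
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    return _mm256_and_si256(v, mask);
}
```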
@jschanck jschanck merged commit 08529b6 into jschanck:master Jun 2, 2019
@OussamaDanba OussamaDanba deleted the avx2-hps2048509_optimizations branch June 2, 2019 22:01
@OussamaDanba OussamaDanba restored the avx2-hps2048509_optimizations branch June 2, 2019 22:08