More avx2 optimizations for hps2048509 #3

OussamaDanba · 2019-05-15T14:35:00Z

This pull request introduces avx2 implementations of:

poly_S3_mul (77 times speedup)
poly_Rq_mul_x_minus_1 (13 times speedup)
crypto_sort (9 times speedup)
poly_S3_inv (65 times speedup)

Explanations and results are given per commit message. poly_Rq_inv is still to be done.

Explanation: poly_S3_mul reuses poly_Rq_mul to do the multiplication. The MODQ in poly_Rq_mul has no effect here because the coefficients never exceed 2036 (509*4: each coefficient is a sum of 509 products of at most 2*2), which is below q = 2048. After the multiplication the last coefficient is fetched, multiplied by two, and added to all the other coefficients. Finally all coefficients are reduced modulo 3. Results: On an Intel i5-8250u using gcc 8.3.0 the reference poly_S3_mul takes about 287000 cycles on average whereas the avx2 version takes about 3700 cycles on average. This is about a 77 times speedup.
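The three steps above can be sketched in scalar C. This is only an illustrative model, not the avx2 code from the commit: the name `s3_mul_sketch`, the flat `uint16_t` array layout, and the schoolbook cyclic product standing in for poly_Rq_mul are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model: r = a * b in Z_3[x]/(1 + x + ... + x^(n-1)).
 * Inputs have coefficients in {0, 1, 2}; r must not alias a or b. */
static void s3_mul_sketch(uint16_t *r, const uint16_t *a,
                          const uint16_t *b, size_t n)
{
    assert(n > 0 && r != a && r != b);

    /* Step 1: cyclic product in Z[x]/(x^n - 1), standing in for poly_Rq_mul.
     * Each coefficient is a sum of n products of at most 2*2, so it stays
     * at or below 4n (2036 for n = 509) and no reduction mod 2048 is needed. */
    for (size_t k = 0; k < n; k++) {
        uint16_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            size_t j = (k + n - i) % n;   /* i + j == k (mod n) */
            acc = (uint16_t)(acc + a[i] * b[j]);
        }
        r[k] = acc;
    }

    /* Step 2: fetch the last coefficient and double it. Subtracting the last
     * coefficient from every coefficient reduces modulo 1 + x + ... + x^(n-1),
     * and modulo 3 subtracting c is the same as adding 2c. */
    uint16_t twice_last = (uint16_t)(2 * r[n - 1]);

    /* Step 3: add and reduce every coefficient modulo 3. The last coefficient
     * becomes (c + 2c) % 3 == 0. */
    for (size_t k = 0; k < n; k++) {
        r[k] = (uint16_t)((r[k] + twice_last) % 3);
    }
}
```

For example, with n = 3, multiplying x by x gives x^2, which reduces to 2 + 2x modulo 1 + x + x^2 over Z_3.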

Explanation: poly_Rq_mul_x_minus_1 is a straightforward conversion of the C function, using avx2 to operate on 16 coefficients at a time rather than one. Some special handling is done for the first coefficient (since it needs the last coefficient) and for the following 15 coefficients (which can no longer be processed as a full block of 16). Results: On an Intel i5-8250u using gcc 8.3.0 the reference poly_Rq_mul_x_minus_1 takes about 530 cycles on average whereas the avx2 version takes about 40 cycles on average. This is about a 13 times speedup.
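A minimal sketch of the 16-at-a-time idea is shown below. It is not the code from the commit: the function name `mul_x_minus_1_sketch`, the `SKETCH_Q` constant, the use of unaligned loads, and the scalar tail loop are assumptions chosen to keep the example short, and the block handling differs from the special-casing described above. The intrinsics path is only compiled when the compiler defines `__AVX2__` (for example with `-mavx2`); otherwise the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

#define SKETCH_Q 2048u  /* q for hps2048509; a power of two, so mod q is a mask */

/* Hypothetical sketch: r = a * (x - 1) in Z_q[x]/(x^n - 1), i.e.
 * r[i] = a[i-1] - a[i] (mod q) with r[0] = a[n-1] - a[0]. r must not alias a. */
static void mul_x_minus_1_sketch(uint16_t *r, const uint16_t *a, size_t n)
{
    assert(n > 0 && r != a);
    const uint16_t mask = SKETCH_Q - 1;
    size_t i = 1;

    /* Coefficient 0 wraps around and needs the last coefficient. */
    r[0] = (uint16_t)(a[n - 1] - a[0]) & mask;

#if defined(__AVX2__)
    /* Full blocks of 16 coefficients: subtract lane-wise, then mask to mod q.
     * 16-bit wraparound is harmless because q divides 65536. */
    const __m256i vmask = _mm256_set1_epi16((short)mask);
    for (; i + 16 <= n; i += 16) {
        __m256i prev = _mm256_loadu_si256((const __m256i *)(a + i - 1));
        __m256i cur  = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i diff = _mm256_and_si256(_mm256_sub_epi16(prev, cur), vmask);
        _mm256_storeu_si256((__m256i *)(r + i), diff);
    }
#endif

    /* Scalar tail (or the whole polynomial when avx2 is not enabled). */
    for (; i < n; i++) {
        r[i] = (uint16_t)(a[i - 1] - a[i]) & mask;
    }
}
```

The output buffer is kept separate from the input because the ascending loop reads `a[i - 1]` after earlier positions have been written; an in-place variant would have to walk from the top down, as a scalar loop can.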

Results: On an Intel i5-8250u using gcc 8.3.0 the reference sample_fixed_type (which spends the majority of its cycles in crypto_sort) takes about 28000 cycles on average whereas using the avx2 version of djbsort for crypto_sort takes about 3000 cycles on average. This is about a 9 times speedup. As a result ntru_encaps is about 1.5 times faster.

Explanation: The method is described in the paper "Fast constant-time gcd computation and modular inversion". There are no large changes from the case study described in that paper, except that every polynomial is stored as four 256-bit vectors rather than six, since a polynomial fits within that space for hps2048509. This reduces the number of vector operations in some steps. Results: On an Intel i5-8250u using gcc 8.3.0 the reference poly_S3_inv takes about 1526800 cycles on average whereas the avx2 version takes about 23250 cycles on average. This is about a 65 times speedup.
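As a quick check of the storage claim, assuming each coefficient modulo 3 occupies two bits (an assumption about the layout, which the commit message does not spell out), the 509 coefficients need

$$509 \cdot 2 = 1018 \le 4 \cdot 256 = 1024$$

bits, so four 256-bit vectors are enough.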

OussamaDanba added 5 commits May 1, 2019 16:03

Include missing files in Makefile-NIST

3d726b1

jschanck merged commit 9707e9c into jschanck:master May 15, 2019

OussamaDanba deleted the avx2-hps2048509_optimizations branch May 15, 2019 20:21
