More avx2 optimizations for hps2048509 #3

Merged
merged 5 commits into from
May 15, 2019

@OussamaDanba OussamaDanba commented May 15, 2019

This pull request introduces avx2 for:

  • poly_S3_mul (77 times speedup)
  • poly_Rq_mul_x_minus_1 (13 times speedup)
  • crypto_sort (9 times speedup)
  • poly_S3_inv (65 times speedup)

Explanations and results are given per commit message, reproduced below. poly_Rq_inv is still to be done.

poly_S3_mul

Explanation:
poly_Rq_mul is reused to do the multiplication. The MODQ in poly_Rq_mul has
no effect because the coefficients never exceed 2036 (509 terms of at most
2*2 = 4 each), which is below q = 2048.
After the multiplication, the last coefficient is fetched, multiplied by
two, and added to all the other coefficients. Finally,
all coefficients are reduced modulo 3.
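
The final step described above can be sketched in scalar C. This is only an illustration, not the avx2 code from the commit: the name `s3_finalize` is hypothetical, and the `%` operator stands in for whatever constant-time mod-3 reduction a real implementation would use. Adding twice the last coefficient is the same as subtracting it modulo 3, so the last coefficient itself ends up as zero.

```c
#include <stddef.h>
#include <stdint.h>

#define N 509 /* number of coefficients for hps2048509 */

/* Hypothetical helper: r holds the N coefficients of a product of two
 * polynomials with coefficients in {0,1,2}, each value at most 2036.
 * On return every coefficient is in {0,1,2} and r[N-1] is 0. */
static void s3_finalize(uint16_t r[N])
{
    /* Read the last coefficient once, before the loop overwrites it. */
    uint16_t twice_last = (uint16_t)(2 * r[N - 1]);

    for (size_t i = 0; i < N; i++) {
        /* At most 2036 + 4072 = 6108, so no 16-bit overflow.
         * NOTE: '%' is not guaranteed constant time; sketch only. */
        r[i] = (uint16_t)((r[i] + twice_last) % 3);
    }
}
```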

Results:
On an Intel i5-8250u using gcc 8.3.0 the reference poly_S3_mul takes
about 287000 cycles on average whereas the avx2 version takes about
3700 cycles on average. This is about a 77 times speedup.

poly_Rq_mul_x_minus_1

Explanation:
Straightforward conversion of the C function but using avx2
to operate on 16 coefficients at a time rather than one.
Some special handling is done for the first coefficient
(since the last coefficient is needed for that one) and
the following 15 coefficients (can't operate on 16 coefficients anymore).
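
For reference, the scalar computation being vectorized can be sketched as follows. The function name and the fixed `N` and `Q` are assumptions for hps2048509, and inputs are assumed to be already reduced to the range 0..Q-1. The avx2 version would handle 16 of these 16-bit coefficients per 256-bit vector instead of one per loop iteration.

```c
#include <stdint.h>

#define N 509  /* number of coefficients for hps2048509 */
#define Q 2048 /* modulus; a power of two, so masking reduces mod Q */

/* Sketch of r = (x - 1) * a in Z_Q[x] / (x^N - 1).
 * Coefficient i of x*a is a[i-1], with a[N-1] wrapping around to i = 0. */
static void rq_mul_x_minus_1(uint16_t r[N], const uint16_t a[N])
{
    /* Save the last coefficient first so that r may alias a. */
    uint16_t last = a[N - 1];

    for (int i = N - 1; i > 0; i--) {
        r[i] = (uint16_t)((a[i - 1] + (Q - a[i])) & (Q - 1));
    }
    /* The first coefficient needs the wrapped-around last coefficient. */
    r[0] = (uint16_t)((last + (Q - a[0])) & (Q - 1));
}
```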

Results:
On an Intel i5-8250u using gcc 8.3.0 the reference poly_Rq_mul_x_minus_1
takes about 530 cycles on average whereas the avx2 version takes about
40 cycles on average. This is about a 13 times speedup.

crypto_sort

Results:
On an Intel i5-8250u using gcc 8.3.0 the reference sample_fixed_type
(which spends most of its cycles in crypto_sort) takes about
28000 cycles on average whereas using the avx2 version of djbsort for crypto_sort
takes about 3000 cycles on average. This is about a 9 times speedup.
As a result ntru_encaps is about 1.5 times faster.
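
As a rough sketch of why sorting dominates sample_fixed_type, assume the usual sort-based shuffle: each random word gets a 2-bit coefficient tag in its low bits, the words are sorted, and the tags are read back in sorted order. Everything here is illustrative: the function name, the `WEIGHT` value of 254 and the use of `qsort` are assumptions. `qsort` is not constant time and only stands in for crypto_sort.

```c
#include <stdint.h>
#include <stdlib.h>

#define N 509      /* number of coefficients for hps2048509 */
#define WEIGHT 254 /* assumed number of nonzero coefficients */

static int cmp_u32(const void *x, const void *y)
{
    uint32_t a = *(const uint32_t *)x, b = *(const uint32_t *)y;
    return (a > b) - (a < b);
}

/* rnd: N-1 random words; only the low 30 bits of each are used. */
static void sample_fixed_type_sketch(uint16_t r[N], const uint32_t rnd[N - 1])
{
    uint32_t s[N - 1];

    /* Random bits go in the high 30 bits; the low 2 bits start as tag 0. */
    for (int i = 0; i < N - 1; i++) s[i] = rnd[i] << 2;
    /* Tag WEIGHT/2 entries with 1 and WEIGHT/2 entries with 2. */
    for (int i = 0; i < WEIGHT / 2; i++) s[i] |= 1;
    for (int i = WEIGHT / 2; i < WEIGHT; i++) s[i] |= 2;

    /* Sorting by the random high bits shuffles the tags. */
    qsort(s, N - 1, sizeof s[0], cmp_u32);

    for (int i = 0; i < N - 1; i++) r[i] = (uint16_t)(s[i] & 3);
    r[N - 1] = 0;
}
```

Since nearly all of the work in this sketch is the sort call, a faster sort translates almost directly into a faster sampler.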

poly_S3_inv

Explanation:
An explanation of the method can be found in the paper "Fast constant-time
gcd computation and modular inversion". There are no large changes from
the case study described in that paper, except that every polynomial is
stored as four 256-bit vectors rather than six, since a polynomial fits
within that space for hps2048509. This reduces the number of some vector
operations.
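
The four-versus-six vector count can be checked with a little arithmetic. The sketch below assumes a bit-sliced layout with two bit planes per polynomial and one bit per coefficient in each plane. The commit message does not spell the layout out, so this is an assumption, as is the helper name. The value 701 is assumed to be the degree parameter of the paper's case study.

```c
#include <stddef.h>

/* Hypothetical helper: number of 256-bit vectors needed to hold one
 * polynomial, assuming two bit planes of one bit per coefficient. */
static size_t vectors_per_poly(size_t coeffs)
{
    size_t per_plane = (coeffs + 255) / 256; /* round up to whole vectors */
    return 2 * per_plane;
}
```

Under these assumptions `vectors_per_poly(509)` is 4 and `vectors_per_poly(701)` is 6, which matches the counts quoted above.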

Results:
On an Intel i5-8250u using gcc 8.3.0 the reference poly_S3_inv takes
about 1526800 cycles on average whereas the avx2 version takes about
23250 cycles on average. This is about a 65 times speedup.
@jschanck jschanck merged commit 9707e9c into jschanck:master May 15, 2019
@OussamaDanba OussamaDanba deleted the avx2-hps2048509_optimizations branch May 15, 2019 20:21