Implement avx2 version of poly_Rq_to_S3 for hps2048509 #6

Merged

OussamaDanba
Contributor

About an 11x speedup for poly_Rq_to_S3.

With this we've reached the point where polynomial operations can't really be optimized much more (without some very different approach). Sampling/packing is now taking up a good chunk of the processing time.

poly_S3_mul could be implemented using the alternative approach mentioned in poly_s3_mul.py, but that would bring only a modest speedup while adding quite a bit of code (and a few weeks of work). It is used exclusively in decryption, which is already the fastest operation, and would likely speed decryption up by 5-10%.

Explanation:
Straightforward conversion of the C function but using avx2
to operate on 16 coefficients at a time rather than one.
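
To illustrate the "16 coefficients at a time" idea, here is a minimal sketch and not the code from this PR. It assumes the per-coefficient work is: reduce mod q = 2048, move to the centered representative, and reduce mod 3 without branches. It leaves out any later reduction modulo the cyclotomic factor. The names `to_mod3_scalar`, `to_mod3_wide`, `LANES`, and `PADDED_N` are hypothetical, and padding 509 coefficients to 512 is an assumption made so the length is a multiple of 16. When the compiler defines `__AVX2__` (e.g. with `-mavx2`) the wide version uses intrinsics; otherwise a portable loop with the same steps stands in for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

#define LOGQ 11                      /* q = 2^11 = 2048 */
#define Q_MASK ((uint16_t)((1u << LOGQ) - 1))
#define LANES 16                     /* 16 x 16-bit lanes in a 256-bit register */
#define PADDED_N 512                 /* assumption: 509 coefficients padded to 512 */

/* Scalar reference: one coefficient per iteration. */
void to_mod3_scalar(uint16_t *r, const uint16_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t c = (uint16_t)(a[i] & Q_MASK);
        /* c >= q/2 stands for c - q, and -2048 = 1 (mod 3), so add 1 */
        c = (uint16_t)(c + (c >> (LOGQ - 1)));
        r[i] = (uint16_t)(c % 3);
    }
}

#if defined(__AVX2__)
/* Wide version: the same steps applied to 16 coefficients per iteration. */
void to_mod3_wide(uint16_t *r, const uint16_t *a, size_t n)
{
    const __m256i qmask = _mm256_set1_epi16((short)Q_MASK);
    const __m256i m8 = _mm256_set1_epi16(0xff);
    const __m256i m4 = _mm256_set1_epi16(0x0f);
    const __m256i m2 = _mm256_set1_epi16(0x03);
    const __m256i three = _mm256_set1_epi16(3);
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        __m256i c = _mm256_loadu_si256((const __m256i *)(a + i));
        c = _mm256_and_si256(c, qmask);
        c = _mm256_add_epi16(c, _mm256_srli_epi16(c, LOGQ - 1));
        /* fold: 2^8, 2^4 and 2^2 are all 1 (mod 3), so high + low keeps the residue */
        c = _mm256_add_epi16(_mm256_srli_epi16(c, 8), _mm256_and_si256(c, m8));
        c = _mm256_add_epi16(_mm256_srli_epi16(c, 4), _mm256_and_si256(c, m4));
        c = _mm256_add_epi16(_mm256_srli_epi16(c, 2), _mm256_and_si256(c, m2));
        c = _mm256_add_epi16(_mm256_srli_epi16(c, 2), _mm256_and_si256(c, m2));
        /* c is now in 0..5: keep c where c - 3 is negative, else take c - 3 */
        __m256i t = _mm256_sub_epi16(c, three);
        __m256i neg = _mm256_srai_epi16(t, 15);
        c = _mm256_blendv_epi8(t, c, neg);
        _mm256_storeu_si256((__m256i *)(r + i), c);
    }
}
#else
/* Portable stand-in: same branch-free steps, written lane by lane. */
void to_mod3_wide(uint16_t *r, const uint16_t *a, size_t n)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t j = 0; j < LANES; j++) {
            uint16_t c = (uint16_t)(a[i + j] & Q_MASK);
            c = (uint16_t)(c + (c >> (LOGQ - 1)));
            c = (uint16_t)((c >> 8) + (c & 0xff));
            c = (uint16_t)((c >> 4) + (c & 0x0f));
            c = (uint16_t)((c >> 2) + (c & 0x03));
            c = (uint16_t)((c >> 2) + (c & 0x03));
            uint16_t neg = (uint16_t)(0u - (unsigned)(c < 3)); /* all ones if c < 3 */
            r[i + j] = (uint16_t)((neg & c) | (~neg & (uint16_t)(c - 3)));
        }
    }
}
#endif
```

The mod-3 step uses shift-and-add folding rather than `%` because AVX2 has no integer division or remainder instruction, so each step has to map onto a lane-wise shift, mask, add, or blend.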

Results:
On an Intel i5-8250u using gcc 8.3.0 the reference poly_Rq_to_S3
takes about 2300 cycles on average whereas the avx2 version takes about
205 cycles on average. This is about an 11 times speedup.

jschanck merged commit d2560ca into jschanck:master on Jun 3, 2019