Implement avx2 version of poly_Rq_to_S3 for hps2048509 #6

OussamaDanba · 2019-06-03T12:18:28Z

About an 11 times speedup over the reference implementation.

With this we've reached the point where polynomial operations can't really be optimized much more (without some very different approach). Sampling/packing is now taking up a good chunk of the processing time.

poly_S3_mul could be implemented using the alternative approach mentioned in poly_s3_mul.py, but that would bring only a modest speedup while adding quite a bit of code (and a few weeks of work). It is used exclusively in decryption, which is already the fastest operation, and would likely speed decryption up by 5-10%.

Explanation: Straightforward conversion of the C function, but using avx2 to operate on 16 coefficients at a time rather than one.

Results: On an Intel i5-8250U using gcc 8.3.0, the reference poly_Rq_to_S3 takes about 2300 cycles on average, whereas the avx2 version takes about 205 cycles on average. This is about an 11 times speedup.
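To illustrate the idea of handling 16 coefficients per step, here is a minimal sketch in C with AVX2 intrinsics. It is not the code from this PR. The names (`mod3`, `mod3_x16`, `rq_to_s3_ref`, `rq_to_s3_x16`), the padding to 512 coefficients, and the exact arithmetic are assumptions made for illustration: each coefficient mod q = 2048 is lifted to its centered representative, reduced mod 3 by digit folding, and then the last coefficient is subtracted to reduce mod Phi_n. Without AVX2 enabled at compile time, the sketch falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

#define N     509   /* hps2048509 */
#define LOGQ  11    /* q = 2048 */
#define N_PAD 512   /* assumption: arrays padded to a multiple of 16 lanes */

/* Scalar mod 3 by digit folding (256, 16 and 4 are all 1 mod 3).
   Four folds bring any 16-bit input into [0, 5]; then subtract 3 if needed. */
static uint16_t mod3(uint16_t a)
{
    uint16_t r = (uint16_t)((a >> 8) + (a & 0xff));
    r = (uint16_t)((r >> 4) + (r & 0xf));
    r = (uint16_t)((r >> 2) + (r & 0x3));
    r = (uint16_t)((r >> 2) + (r & 0x3));
    uint16_t t = (uint16_t)(r - 3);
    uint16_t c = (uint16_t)(0 - (t >> 15));   /* all-ones if r < 3 */
    return (uint16_t)((c & r) | (~c & t));
}

/* Scalar sketch: one coefficient at a time. */
static void rq_to_s3_ref(uint16_t *r, const uint16_t *a)
{
    for (size_t i = 0; i < N; i++) {
        uint16_t v = (uint16_t)(a[i] & ((1u << LOGQ) - 1));
        uint16_t flag = (uint16_t)(v >> (LOGQ - 1));        /* 1 if v >= q/2 */
        /* add (-q) mod 3 when v >= q/2, i.e. use the centered representative */
        r[i] = mod3((uint16_t)(v + (flag << (1 - (LOGQ & 1)))));
    }
    uint16_t last2 = (uint16_t)(2 * r[N - 1]);              /* -r[N-1] mod 3 */
    for (size_t i = 0; i < N; i++)
        r[i] = mod3((uint16_t)(r[i] + last2));
    for (size_t i = N; i < N_PAD; i++)
        r[i] = 0;                                           /* clear padding */
}

#ifdef __AVX2__
/* Same folding as mod3(), applied to 16 lanes of 16 bits at once. */
static __m256i mod3_x16(__m256i v)
{
    const __m256i m8 = _mm256_set1_epi16(0xff);
    const __m256i m4 = _mm256_set1_epi16(0xf);
    const __m256i m2 = _mm256_set1_epi16(0x3);
    v = _mm256_add_epi16(_mm256_srli_epi16(v, 8), _mm256_and_si256(v, m8));
    v = _mm256_add_epi16(_mm256_srli_epi16(v, 4), _mm256_and_si256(v, m4));
    v = _mm256_add_epi16(_mm256_srli_epi16(v, 2), _mm256_and_si256(v, m2));
    v = _mm256_add_epi16(_mm256_srli_epi16(v, 2), _mm256_and_si256(v, m2));
    __m256i t = _mm256_sub_epi16(v, _mm256_set1_epi16(3));
    __m256i c = _mm256_srai_epi16(t, 15);     /* all-ones where v < 3 */
    return _mm256_blendv_epi8(t, v, c);       /* v where v < 3, else v - 3 */
}
#endif

/* Vectorized sketch: 16 coefficients per iteration when AVX2 is available. */
static void rq_to_s3_x16(uint16_t *r, const uint16_t *a)
{
    assert(N_PAD % 16 == 0 && N <= N_PAD);
#ifdef __AVX2__
    const __m256i qmask = _mm256_set1_epi16((1 << LOGQ) - 1);
    for (size_t i = 0; i < N_PAD; i += 16) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
        v = _mm256_and_si256(v, qmask);
        __m256i flag = _mm256_srli_epi16(v, LOGQ - 1);
        v = _mm256_add_epi16(v, _mm256_slli_epi16(flag, 1 - (LOGQ & 1)));
        _mm256_storeu_si256((__m256i *)(r + i), mod3_x16(v));
    }
    const __m256i last2 = _mm256_set1_epi16((short)(2 * r[N - 1]));
    for (size_t i = 0; i < N_PAD; i += 16) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(r + i));
        _mm256_storeu_si256((__m256i *)(r + i),
                            mod3_x16(_mm256_add_epi16(v, last2)));
    }
    for (size_t i = N; i < N_PAD; i++)
        r[i] = 0;                             /* clear padding lanes */
#else
    rq_to_s3_ref(r, a);                       /* portable fallback */
#endif
}
```

Building with `-mavx2` (gcc or clang) enables the vector path. Both functions expect buffers of `N_PAD` coefficients, and comparing their outputs on the same input is a simple way to check the vector path against the scalar one.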

jschanck merged commit d2560ca into jschanck:master Jun 3, 2019
