Use BMI2 when the CPU supports it for certain field element operations #38

Yawning · 2021-02-09T15:02:53Z

Square is almost not worth it, though it may be worth investigating if it is possible to use ADCX/ADOX to make both Mul and Square faster (Isn't registry pressure fun?). Unfortunately the stub ends up being somewhat ugly, but having to fight with the inliner shouldn't come as a surprise at this point.

name \ time/op  /tmp/baseline  /tmp/mul-mulx  /tmp/square-mulx
WideMultCall-8    3.04ns ± 0%
Add-8             14.5ns ± 0%
Mul-8             36.8ns ± 0%    31.6ns ± 0%
Mul32-8           14.0ns ± 0%
Square-8          29.0ns ± 2%                      26.5ns ± 0%

``` name old time/op new time/op delta Mul-8 36.8ns ± 0% 31.6ns ± 0% -14.04% (p=0.008 n=5+5) ```

This requires changing the registers used for feSquare, but it removes a gigantic block of functionally equivalent assembly.

``` name old time/op new time/op delta Square-8 29.0ns ± 2% 26.5ns ± 0% -8.63% (p=0.008 n=5+5) ```

FiloSottile · 2021-02-09T15:49:34Z

Hi @Yawning! Two quick questions.

First, we did a bunch of work on the field implementation getting it into filippo.io/edwards25519, and eventually I am hoping to replace the guts of this package with that. Would you consider contributing these changes there? I am trying to keep that repository CLA-clean to upstream it into Go, as well, so it would involve signing the Go CLA and agreeing to submit this code as a contribution to Go.

Second, do you have any higher level benchmarks, ideally through benchstat, so we can evaluate if the extra code complexity is worth it?

Yawning · 2021-02-09T17:24:40Z

it would involve signing the Go CLA and agreeing to submit this code as a contribution to Go.

I would need to check with my employer, since this is lifted out of code that I wrote for them.

Second, do you have any higher level benchmarks, ideally through benchstat, so we can evaluate if the extra code complexity is worth it?

I assume ed25519 benchmarks will suffice as a high level comparison. The host is a i7-10510U, with turbo disabled via /sys/devices/system/cpu/intel_pstate/no_turbo.

name                       old time/op    new time/op    delta
VerifyBatch/1-8               177µs ± 0%     165µs ± 0%  -6.46%  (p=0.008 n=5+5)
VerifyBatch/2-8               249µs ± 0%     232µs ± 0%  -6.84%  (p=0.016 n=5+4)
VerifyBatch/4-8               395µs ± 0%     367µs ± 0%  -6.94%  (p=0.008 n=5+5)
VerifyBatch/8-8               684µs ± 0%     635µs ± 0%  -7.29%  (p=0.008 n=5+5)
VerifyBatch/16-8             1.26ms ± 0%    1.17ms ± 0%  -7.12%  (p=0.008 n=5+5)
VerifyBatch/32-8             2.42ms ± 1%    2.25ms ± 1%  -7.12%  (p=0.008 n=5+5)
VerifyBatch/64-8             4.74ms ± 0%    4.40ms ± 1%  -7.13%  (p=0.008 n=5+5)
VerifyBatch/128-8            8.76ms ± 0%    8.06ms ± 0%  -8.03%  (p=0.008 n=5+5)
VerifyBatch/256-8            15.9ms ± 0%    14.7ms ± 0%  -7.71%  (p=0.008 n=5+5)
VerifyBatch/384-8            22.8ms ± 0%    21.1ms ± 1%  -7.52%  (p=0.008 n=5+5)
VerifyBatch/512-8            29.6ms ± 0%    27.2ms ± 0%  -7.83%  (p=0.008 n=5+5)
VerifyBatch/768-8            42.4ms ± 0%    39.1ms ± 0%  -7.77%  (p=0.008 n=5+5)
VerifyBatch/1024-8           55.3ms ± 0%    50.9ms ± 0%  -8.04%  (p=0.008 n=5+5)
KeyGeneration/voi-8          42.9µs ± 1%    40.2µs ± 1%  -6.30%  (p=0.008 n=5+5)
KeyGeneration/stdlib-8        102µs ± 0%     102µs ± 1%    ~     (p=0.421 n=5+5)
Signing/voi-8                46.5µs ± 0%    43.8µs ± 1%  -5.93%  (p=0.008 n=5+5)
Signing/stdlib-8              104µs ± 0%     104µs ± 1%    ~     (p=0.841 n=5+5)
Verification/voi-8            132µs ± 0%     122µs ± 0%  -6.86%  (p=0.008 n=5+5)
Verification/voi_stdlib-8     129µs ± 1%     120µs ± 0%  -7.08%  (p=0.008 n=5+5)
Verification/stdlib-8         277µs ± 0%     277µs ± 0%    ~     (p=1.000 n=5+5)

Note: Benchmarks taken with the code I lifted the assembly out of. The only major difference would be that the original code implements fe^(2^k) in-place rather than square. If it really matters, I could go benchmark ed25519consensus or something, the next time I have some spare time, but a faster multiply is a faster multiply (and the operations are what I would describe as "common", especially Mul, when doing anything useful).

Yawning added 4 commits February 9, 2021 14:17

internal/radix51: Add a benchmark for FieldElement.Square

d0ea73c

internal/radix51: Use BMI2 to accelerate FieldElement.Mul

e283be7

``` name old time/op new time/op delta Mul-8 36.8ns ± 0% 31.6ns ± 0% -14.04% (p=0.008 n=5+5) ```

internal/radix51: Use a macro for the reduction/carry propagation

a8296fe

This requires changing the registers used for feSquare, but it removes a gigantic block of functionally equivalent assembly.

internal/radix51: Use BMI2 to accelerate FieldElement.Square

dbb69fe

``` name old time/op new time/op delta Square-8 29.0ns ± 2% 26.5ns ± 0% -8.63% (p=0.008 n=5+5) ```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use BMI2 when the CPU supports it for certain field element operations #38

Use BMI2 when the CPU supports it for certain field element operations #38

Yawning commented Feb 9, 2021

FiloSottile commented Feb 9, 2021

Yawning commented Feb 9, 2021 •

edited

Loading

Use BMI2 when the CPU supports it for certain field element operations #38

Are you sure you want to change the base?

Use BMI2 when the CPU supports it for certain field element operations #38

Conversation

Yawning commented Feb 9, 2021

FiloSottile commented Feb 9, 2021

Yawning commented Feb 9, 2021 • edited Loading

Yawning commented Feb 9, 2021 •

edited

Loading