Expose lincomb_generic (or: Generic Linear Combination) #921

Closed
ycscaly opened this issue Aug 20, 2023 · 31 comments · Fixed by #955

@ycscaly
Contributor

ycscaly commented Aug 20, 2023

The LinearCombination trait allows for an optimized implementation of the linear combination of two points and scalars.

This is later generalized for k256 with lincomb_generic(), which allows for an optimized implementation of a linear combination between any group of N points and scalars, but is unfortunately private.

Here I am stressing that users such as myself may benefit from such a generic optimized linear combination implementation, and I wish that we either expose lincomb_generic() or, preferably, think of a suitable extension to the LinearCombination trait.

@tarcieri

@tarcieri
Member

See #380 for some past discussion about this.

I thought there was some more discussion about upstreaming such a trait to the group crate, but all I can find is zkcrypto/group#27.

I agree it would be nice to have a trait with the ability to operate over more than just pairs of scalars/points. I'm not sure the API lincomb_generic provides is the right one.

Ideally I would like to see a solution for this upstream in the group crate. The current LinearCombination trait is really just a stopgap for the immediate needs of implementing ECDSA generically.

@tarcieri
Member

Found the group crate issue I was thinking of: zkcrypto/group#25

@ycscaly
Contributor Author

ycscaly commented Aug 20, 2023

Thanks. I'm not sure about the intra-organization politics here, but it seemed that there isn't much interest from group to add such a trait. There is a definite use-case on my end, however, for k256 specifically, and it is already implemented.

I am actually fine with the slices API of lincomb_generic.
Honestly, I would prefer the trait not even be Group-restricted; i.e., instead of

pub trait LinearCombination: Group {
    fn lincomb(x: &Self, k: &Self::Scalar, y: &Self, l: &Self::Scalar) -> Self {
        (*x * k) + (*y * l)
    }
}

I would prefer to have something along the lines of:

pub trait LinearCombination<T> where Self: Mul<T, Output = Self> {
    fn lincomb<const N: usize>(points: &[Self; N], scalars: &[T; N]) -> Self;
}

as I am not working with the Group trait (which actually isn't a general group trait at all; it is a known-order cyclic abelian group trait, and that doesn't fit all use-cases), so it would save me from introducing that trait myself.

That being said, exposing lincomb_generic() as-is would be good for me as well
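For concreteness, here is a toy sketch of how such a Group-free trait could look, with an integer stand-in for the point type (all names here are illustrative, not taken from any crate):

```rust
use core::ops::{Add, Mul};

// Hypothetical Group-free linear-combination trait (illustrative names).
pub trait LinearCombination<T: Copy>: Copy + Add<Output = Self> + Mul<T, Output = Self> {
    fn lincomb<const N: usize>(points: &[Self; N], scalars: &[T; N]) -> Self;
}

// Stand-in "point": integers under addition.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pt(i64);

impl Add for Pt {
    type Output = Pt;
    fn add(self, o: Pt) -> Pt { Pt(self.0 + o.0) }
}
impl Mul<i64> for Pt {
    type Output = Pt;
    fn mul(self, k: i64) -> Pt { Pt(self.0 * k) }
}

impl LinearCombination<i64> for Pt {
    // Naive fallback; a real impl would substitute an optimized algorithm.
    fn lincomb<const N: usize>(points: &[Pt; N], scalars: &[i64; N]) -> Pt {
        points.iter().zip(scalars).fold(Pt(0), |acc, (&p, &s)| acc + p * s)
    }
}

fn main() {
    // 2·3 + 5·4 = 26
    assert_eq!(Pt::lincomb(&[Pt(2), Pt(5)], &[3, 4]), Pt(26));
}
```

The const generic N keeps the points/scalars arrays matched in length at compile time, with no alloc requirement.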

@tarcieri
Member

there isn't much interest from group to add such a trait.

I think it's just an issue that went by the wayside. I left a comment to hopefully revive it.

Regarding:

pub trait LinearCombination<T> where Self: Mul<T, Output = Self> {
    fn lincomb<const N: usize>(points: &[Self; N], scalars: &[T; N]) -> Self;
}

I'm not sure what your use cases for such a generic trait are. Perhaps you can simply define it yourself for your own purposes and add a blanket impl for the version in elliptic-curve?

In the context of elliptic-curve it's very much meant to be in terms of Scalar and ProjectivePoint for a given curve, and using Group it can leverage the existing type taxonomy via associated types.

Regarding lincomb_generic, I'm not wild about the name at the very least. There are also lots of potential alternative API designs. I think using const generic arrays is a bit constraining... it could operate over a slice of tuples as an alternative, which avoids the need to have matched-size arrays and permits more potential data structures as inputs to the function, e.g. Vec<_> or Box<[_]>.
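As a rough sketch of that slice-of-tuples shape (hypothetical names, and only a naive fallback implementation rather than any optimized algorithm):

```rust
use core::ops::{Add, Mul};

// Hypothetical slice-of-tuples trait shape (illustrative names only).
pub trait LinComb<S: Copy>: Copy + Add<Output = Self> + Mul<S, Output = Self> {
    const IDENTITY: Self;

    // Naive fallback; real impls would override with an optimized algorithm.
    fn lincomb(pairs: &[(Self, S)]) -> Self {
        pairs.iter().fold(Self::IDENTITY, |acc, &(p, s)| acc + p * s)
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Pt(i64); // stand-in "point": integers under addition

impl Add for Pt {
    type Output = Pt;
    fn add(self, o: Pt) -> Pt { Pt(self.0 + o.0) }
}
impl Mul<i64> for Pt {
    type Output = Pt;
    fn mul(self, k: i64) -> Pt { Pt(self.0 * k) }
}
impl LinComb<i64> for Pt {
    const IDENTITY: Pt = Pt(0);
}

fn main() {
    // 2·3 + 5·4 + 1·7 = 33
    let pairs = vec![(Pt(2), 3), (Pt(5), 4), (Pt(1), 7)];
    assert_eq!(Pt::lincomb(&pairs), Pt(33));
}
```

Because lincomb takes &[(Self, S)], arrays, Vec, and Box<[_]> inputs all coerce to it, which is the flexibility being argued for here.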

@ycscaly
Contributor Author

ycscaly commented Aug 21, 2023

there isn't much interest from group to add such a trait.

I think it's just an issue that went by the wayside. I left a comment to hopefully revive it.

Regarding:

pub trait LinearCombination<T> where Self: Mul<T, Output = Self> {
    fn lincomb<const N: usize>(points: &[Self; N], scalars: &[T; N]) -> Self;
}

I'm not sure what your use cases for such a generic trait are. Perhaps you can simply define it yourself for your own purposes and add a blanket impl for the version in elliptic-curve?

Yes, I can definitely do that assuming a linear combination trait in group

In the context of elliptic-curve it's very much meant to be in terms of Scalar and ProjectivePoint for a given curve, and using Group it can leverage the existing type taxonomy via associated types.

Regarding lincomb_generic, I'm not wild about the name at the very least. There are also lots of potential alternative API designs. I think using const generic arrays is a bit constraining... it could operate over a slice of tuples as an alternative, which avoids the need to have matched-size arrays and permits more potential data structures as inputs to the function, e.g. Vec<_> or Box<[_]>.

Yes, I agree about these points.

So I guess we'll wait to see if group engages.

@tarcieri
Member

If there's no movement on the group side we can improve the trait in elliptic-curve in the next breaking release (and if there is an upstream solution in group added, ideally remove the trait from elliptic-curve).

I'd also be okay with exposing something specific to k256 but maybe reconsidering the API rather than just exposing it as-is.

@tarcieri
Copy link
Member

@fjarri curious if you have thoughts on a &[(Self, T)]-based API versus the const-generic-array-based one in lincomb_generic

@fjarri
Contributor

fjarri commented Aug 22, 2023

I think it's a good idea. I can't exactly remember why I went with the existing design; it eliminates the need for some copies, but that's about it. The slice-of-tuples approach is definitely safer.

@ycscaly
Contributor Author

ycscaly commented Oct 5, 2023

Since the group issue does not seem active, can we handle this internally?

@tarcieri
Member

tarcieri commented Oct 5, 2023

I'd be okay with exposing an inherent method for it in k256, although if there aren't any objections I think I'd prefer it take a slice of tuples rather than const generic arrays.

If that ends up being more awkward in the code, though, we can reconsider.

@ycscaly
Contributor Author

ycscaly commented Oct 10, 2023

So I started working on this, and I was surprised both to not understand the logic of the implemented algorithm at all and to see that it relies upon internal design details of the secp256k1 curve. So I dug into PR #380, and I still didn't find an explanation for the algorithm, so I would love to have a reference @fjarri.

I was also surprised that mul() is implemented using lincomb_generic(), and that the speedup was about 1.33x in the reported benchmarking results. I added a benchmark for various sizes of N, and it seems it gets up to 1.68x for combining 8 points and scalars.

Now I'll explain my surprise; from my understanding, multi-exponentiation (or linear combination, depending on notation) is a rather "uninteresting" area of research; and I say uninteresting not to discredit its relevance but to signify that, much like multiplication algorithms, the low-hanging fruits have been collected long ago, and we have pretty much standardized and generalized those results by now.

The algorithm I know of, after internal research that was originally done to optimize our threshold Paillier library, is a rather straightforward and generic one that relies upon lookup tables. I have also made a draft PR to crypto-bigint (that was ignored until now for some reason) to introduce those changes. This is also the approach taken by curve25519-dalek with their MultiscalarMul trait.

Regarding performance, my implementation of multi-exponentiation over DynResidue has achieved a 4x speedup (curve25519-dalek got a similar 3.3x improvement in benchmarking I ran), which, to my surprise, is considerably larger than the one reported here.

I do not write this comment to discredit the work; on the contrary, I am confessing two things:

  1. I do not understand the reasoning behind the choice of this specific algorithm, nor do I understand its correctness. I would love to learn more about it.
  2. There was a possible expectation mismatch between my original intention for a generic linear combination and my request to expose lincomb_generic() specifically. I guess that what I need is something like the above-mentioned lookup tables, and I understand that this requires alloc, which would at the very least prohibit its use in mul().

Sorry for the confusion, and would love to hear both of your thoughts before proceeding.

@fjarri
Contributor

fjarri commented Oct 10, 2023

This is also the approach taken by curve25519-dalek with their MultiscalarMul trait.

I think I lifted the algorithm exactly from curve25519-dalek you're referring to, namely from here: https://github.com/dalek-cryptography/curve25519-dalek/blob/main/curve25519-dalek/src/window.rs#L34 . It uses a fixed-window decomposition of the power into chunks in range [-8, 7), so you can use point negation instead of looking up in half of the table (or calculating it). The algorithm in RustCrypto/crypto-bigint#248 uses a simple fixed-window decomposition into chunks in range [0; 16) (which is what pow_montgomery_form() already does, just for one value). The negative range wouldn't work for integers since inversion is expensive, unlike negation for curve points. I hope that clears up the choices made; although I am not by any means claiming that these are the best choices, and any speed-up (without sacrifices in other areas) is appreciated.
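To illustrate the fixed-window idea, here is a standalone sketch of signed radix-16 recoding (illustrative only, not the actual k256 or dalek code): the value is rewritten as a sum of digits times powers of 16, with digits folded into a signed range so a table of point multiples only needs the positive half, the negative half coming from cheap point negation.

```rust
// Signed radix-16 recoding sketch: rewrite n = Σ dᵢ·16^i with dᵢ ∈ [-8, 7].
fn signed_radix16(mut n: u64) -> Vec<i8> {
    let mut digits = Vec::new();
    loop {
        let mut d = (n % 16) as i8; // raw digit in [0, 15]
        n /= 16;
        if d >= 8 {
            d -= 16; // fold into [-8, -1]...
            n += 1;  // ...and carry into the next digit
        }
        digits.push(d);
        if n == 0 {
            break;
        }
    }
    digits // least-significant digit first
}

fn main() {
    let digits = signed_radix16(0xAB); // 171
    // Reconstruct to verify the decomposition.
    let mut acc: i64 = 0;
    for &d in digits.iter().rev() {
        acc = acc * 16 + i64::from(d);
    }
    assert_eq!(acc, 0xAB);
    println!("{digits:?}"); // [-5, -5, 1]
}
```

An unsigned variant would simply skip the fold-and-carry step, which matches the [0, 16) decomposition described for the integer case.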

and to see that it relies upon internal design details of the secp256k1 curve.

Technically it relies on the existence of endomorphism, not on secp256k1 specifically, but you are right in general. At the moment of writing I was not particularly interested in other curves, so I decided that whoever is interested in them would generalize it further. Since fast multiplication for secp256k1 uses endomorphism, that's what lincomb_mul() necessarily uses too. Perhaps the radix decomposition and the endomorphism use can be decoupled somehow.

I have also made a draft PR to crypto-bigint (that was ignored until now for some reason)

I cannot speak for Tony, but my guess would be that it's ignored because it's marked as a draft, and does not pass CI.

Regarding performance, my implementation of multi-exponentiation over DynResidue has achieved a 4x speedup (curve25519-dalek got a similar 3.3x improvement in benchmarking I ran), which, to my surprise, is considerably larger than the one reported here.

Could you elaborate on how you measure the speedup? Running the benchmarks on your PR, I get:

Montgomery arithmetic/modpow, U256^U256
                        time:   [11.999 µs 12.015 µs 12.032 µs]
Montgomery arithmetic/multi_exponentiate for 1 bases, U256^U256
                        time:   [12.212 µs 12.224 µs 12.235 µs]
Montgomery arithmetic/multi_exponentiate for 2 bases, U256^U256
                        time:   [17.789 µs 17.820 µs 17.851 µs]

I see that pow() is as fast as multi_exponentiate() for 1 element (which stands to reason since they're effectively the same, save for the Vec() usage), and multi_exponentiate() for 2 elements is about 25% faster than 2 separate calls to pow(), which is consistent with the results in #380.

@fjarri
Contributor

fjarri commented Oct 11, 2023

P.S.

I guess that what I need is something like the above-mentioned lookup tables, and I understand that this requires alloc, which would at the very least prohibit its use in mul()

lincomb_generic() from #380 does not need alloc, at the price of taking array refs as arguments.

@ycscaly
Contributor Author

ycscaly commented Oct 11, 2023

This is also the approach taken by curve25519-dalek with their MultiscalarMul trait.

I think I lifted the algorithm exactly from curve25519-dalek you're referring to, namely from here: https://github.com/dalek-cryptography/curve25519-dalek/blob/main/curve25519-dalek/src/window.rs#L34 . It uses a fixed-window decomposition of the power into chunks in range [-8, 7), so you can use point negation instead of looking up in half of the table (or calculating it). The algorithm in RustCrypto/crypto-bigint#248 uses a simple fixed-window decomposition into chunks in range [0; 16) (which is what pow_montgomery_form() already does, just for one value). The negative range wouldn't work for integers since inversion is expensive, unlike negation for curve points. I hope that clears up the choices made; although I am not by any means claiming that these are the best choices, and any speed-up (without sacrifices in other areas) is appreciated.

and to see that it relies upon internal design details of the secp256k1 curve.

Technically it relies on the existence of endomorphism, not on secp256k1 specifically, but you are right in general. At the moment of writing I was not particularly interested in other curves, so I decided that whoever is interested in them would generalize it further. Since fast multiplication for secp256k1 uses endomorphism, that's what lincomb_mul() necessarily uses too. Perhaps the radix decomposition and the endomorphism use can be decoupled somehow.

Thanks for the detailed answer. I think I got lost in the secp256k1-specific details, which obscured the bigger picture; I will try to re-read that code with this background before commenting further.

I have also made a draft PR to crypto-bigint (that was ignored until now for some reason)

I cannot speak for Tony, but my guess would be that it's ignored because it's marked as a draft, and does not pass CI.

I see; perhaps I'm just new to open-source contributions. My code is production-ready in Tiresias, but I marked it as draft so design choices regarding crypto-bigint could be discussed prior to review. @tarcieri if you're interested in doing so, let me know; I can also shift it to ready for review and let you review the code and answer my questions as you go, I just thought it would be more expensive for you to do so.

Regarding performance, my implementation of multi-exponentiation over DynResidue has achieved a 4x speedup (curve25519-dalek got a similar 3.3x improvement in benchmarking I ran), which, to my surprise, is considerably larger than the one reported here.

Could you elaborate on how you measure the speedup? Running the benchmarks on your PR, I get:

Montgomery arithmetic/modpow, U256^U256
                        time:   [11.999 µs 12.015 µs 12.032 µs]
Montgomery arithmetic/multi_exponentiate for 1 bases, U256^U256
                        time:   [12.212 µs 12.224 µs 12.235 µs]
Montgomery arithmetic/multi_exponentiate for 2 bases, U256^U256
                        time:   [17.789 µs 17.820 µs 17.851 µs]

I see that pow() is as fast as multi_exponentiate() for 1 element (which stands to reason since they're effectively the same, save for the Vec() usage), and multi_exponentiate() for 2 elements is about 25% faster than 2 separate calls to pow(), which is consistent with the results in #380.

Regarding benchmarking, I have run the benchmarks again for the multiexp code in Tiresias, as I'm not sure how up to date my PR was (again, it was more for discussion). The results are as follows, first for a 4096-bit DynResidue:

multi-exponentiation/multi_exponentiate() for 1 bases,
            PaillierModulusSizedNumber^Pailli...
                        time:   [33.980 ms 34.040 ms 34.155 ms]

multi-exponentiation/multi_exponentiate() for 2 bases,
            PaillierModulusSizedNumber^Pailli...
                        time:   [42.476 ms 42.514 ms 42.577 ms]
             
multi-exponentiation/multi_exponentiate() for 100 bases,
            PaillierModulusSizedNumber^Pail...
                        time:   [885.46 ms 886.43 ms 887.45 ms]

The 1-base case is identical to pow(), 2 bases yield a 34/(42/2)=1.6x improvement, and peak optimization is 8.86ms per exponentiation in the 100-bases case, which yields a 3.8x improvement.

For 256-bit, the results are:

multi-exponentiation/multi_exponentiate() for 1 bases,
            U256^U256
                        time:   [11.880 µs 11.886 µs 11.893 µs]
multi-exponentiation/multi_exponentiate() for 2 bases,
            U256^U256
                        time:   [16.105 µs 16.137 µs 16.206 µs]
multi-exponentiation/multi_exponentiate() for 10 bases,
            U256^U256
                        time:   [50.759 µs 50.903 µs 51.066 µs]
multi-exponentiation/multi_exponentiate() for 100 bases,
            U256^U256
                        time:   [485.80 µs 493.85 µs 506.97 µs]

So for 2 bases we have a 11.88/(16.1/2)=1.475x improvement, and for 10 bases we already have 11.886/(50/10)=2.37x, which is very close to the peak optimization at the 100-bases case. Compared to the 10-bases lincomb result, which has a 37.3/(221/10)=1.68x improvement, there is still a considerable difference, although it is not as stark as in the comparison to the 4096-bit case that I had in mind.

If the underlying algorithm is the same, and we're just using secp256k1 optimizations that make it even better, how can these findings be explained?

I suggest we resolve this issue before I continue on implementing it, so we are certain we are implementing the right algorithm.

@tarcieri
Member

I marked it as draft so design choices regarding crypto-bigint could be discussed prior to review. @tarcieri if you're interested in doing so, let me know; I can also shift it to ready for review and let you review the code and answer my questions as you go, I just thought it would be more expensive for you to do so.

@ycscaly sure, that's fine, though there are also some test failures that need to be addressed as well

@fjarri
Contributor

fjarri commented Oct 11, 2023

(using the multiplication/exponentiation terminology for curve points here, for simplicity)

So, in a generalized windowed exponentiation most of the time is spent in the following places:

t_single = table_creation + squaring + mul_by_table_value

(here mul_by_table_value includes both table lookup and multiplication by the running product)

In a multi-exponentiation we use what's called Shamir's trick to go through only one squaring stage for all the bases. So for N bases, we have

t_N = N * table_creation + squaring + N * mul_by_table_value

Therefore when you compare the two, the speedup you measure is

speedup = N * t_single / t_N

In the limit of large N this reduces to

speedup ~ 1 + squaring / (table_creation + mul_by_table_value)

That is, the more time is spent in squaring compared to the other two parts of the code, the more speedup you have. You're comparing 4096-bit integers with 256-bit curve points, so there may be multiple factors at play. But I would not be surprised if squaring is relatively more expensive for integers - because it's not much different than regular multiplication (not much to optimize there), while for curve points it is significantly simpler than multiplication.

This is just an educated guess, of course; if this discrepancy worries you, I suggest you actually profile multi-exponentiation for points and integers and see if it is really the case.
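The shared-squaring structure described above can be sketched with toy integer "points" (illustrative only; `acc *= 2` stands in for point doubling, `acc += p` for point addition, and windows are simplified to single bits):

```rust
// Toy Shamir's-trick sketch: compute Σ eⱼ·pⱼ with a single shared
// doubling chain instead of one chain per base.
fn shamir_lincomb(pairs: &[(i64, u64)]) -> i64 {
    // Bit length of the largest exponent.
    let bits = pairs
        .iter()
        .map(|&(_, e)| 64 - e.leading_zeros())
        .max()
        .unwrap_or(0);
    let mut acc = 0i64;
    for i in (0..bits).rev() {
        acc *= 2; // one shared "doubling" per bit position, for all bases
        for &(p, e) in pairs {
            if (e >> i) & 1 == 1 {
                acc += p; // per-base "addition" when that exponent bit is set
            }
        }
    }
    acc
}

fn main() {
    // 3·5 + 7·2 = 29
    assert_eq!(shamir_lincomb(&[(3, 5), (7, 2)]), 29);
}
```

This makes the cost split visible: the squaring/doubling work is paid once regardless of N, while the per-base additions (and, in the windowed version, table creation and lookups) scale with N, which is exactly where the speedup formula above comes from.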

@ycscaly
Contributor Author

ycscaly commented Oct 11, 2023

(using the multiplication/exponentiation terminology for curve points here, for simplicity)

So, in a generalized windowed exponentiation most of the time is spent in the following places:

t_single = table_creation + squaring + mul_by_table_value

(here mul_by_table_value includes both table lookup and multiplication by the running product)

In a multi-exponentiation we use what's called Shamir's trick to go through only one squaring stage for all the bases. So for N bases, we have

t_N = N * table_creation + squaring + N * mul_by_table_value

Therefore when you compare the two, the speedup you measure is

speedup = N * t_single / t_N

In the limit of large N this reduces to

speedup ~ 1 + squaring / (table_creation + mul_by_table_value)

That is, the more time is spent in squaring compared to the other two parts of the code, the more speedup you have. You're comparing 4096-bit integers with 256-bit curve points, so there may be multiple factors at play. But I would not be surprised if squaring is relatively more expensive for integers - because it's not much different than regular multiplication (not much to optimize there), while for curve points it is significantly simpler than multiplication.

This is just an educated guess, of course; if this discrepancy worries you, I suggest you actually profile multi-exponentiation for points and integers and see if it is really the case.

Thanks for that. I can look through this next week, but for now just to reiterate that dalek's multiscalar_mul got a 3.3x improvement, so this is not something that is necessarily integer vs point; however, this might be explained by the unique properties of curve25519 vs secp256k1, so this is something that requires a deep dive. Hope to get there soon.

@fjarri
Contributor

fjarri commented Oct 11, 2023

Thanks for that. I can look through this next week, but for now just to reiterate that dalek's multiscalar_mul got a 3.3x improvement, so this is not something that is necessarily integer vs point

I'm willing to bet the speedup can still be explained by the formula above. Ed25519 has its own optimizations that could affect the results. Also make sure that you're measuring constant-time multiplication for it, since constant-time table lookup is a part of mul_by_table_value.

@ycscaly
Contributor Author

ycscaly commented Oct 29, 2023

Thanks for that. I can look through this next week, but for now just to reiterate that dalek's multiscalar_mul got a 3.3x improvement, so this is not something that is necessarily integer vs point

I'm willing to bet the speedup can still be explained by the formula above. Ed25519 has its own optimizations that could affect the results. Also make sure that you're measuring constant-time multiplication for it, since constant-time table lookup is a part of mul_by_table_value.

We have revisited your previous comment and the code, and we share your view; specifically, Ed25519 has the special property that doubling a point uses the same formula as adding two distinct points, which explains the difference in performance.

I will now continue on the effort of taking your code and exposing it. Thanks.

@ycscaly
Contributor Author

ycscaly commented Oct 29, 2023

I'd be okay with exposing an inherent method for it in k256, although if there aren't any objections I think I'd prefer it take a slice of tuples rather than const generic arrays.

If that ends up being more awkward in the code, though, we can reconsider.

It is a problem to take a slice of tuples: if that slice isn't sized (i.e. via a const generic parameter), the only way to keep the current efficient design is to use alloc.

The reason is that we loop twice, with the result of the first loop fed as input to the second, and there is a shared expensive setup phase which we would need to run twice if we can't save its output (and how could we save it without allocating memory? We can't allocate it on the stack if the size is not known at compile time).

So I can either expose it as-is, or transform it to use a slice of tuples but with a const generic parameter, which kinda defeats the purpose.

That being said, there are definitely use-cases for which we'd want to have a linear combination of dynamically known size, for which alloc is an obvious requirement.

Perhaps I can expose lincomb_generic as-is, and add another method that requires alloc and drops the const generics? That would result in code duplication, though...

@tarcieri @fjarri thoughts?

@tarcieri
Member

there is a shared expensive setup phase which we would need to run twice if we can't save its output (and how could we save it without allocating memory? We can't allocate it on the stack if the size is not known at compile time)

This seems like a good reason to keep the const generic parameter.

I would still suggest changing the function name, removing _generic.

@ycscaly
Contributor Author

ycscaly commented Nov 7, 2023

there is a shared expensive setup phase which we would need to run twice if we can't save its output (and how could we save it without allocating memory? We can't allocate it on the stack if the size is not known at compile time)

This seems like a good reason to keep the const generic parameter.

I would still suggest changing the function name, removing _generic.

So just change the name and expose it? If so, I'll do a PR now.

@ycscaly
Contributor Author

ycscaly commented Nov 9, 2023

Thanks!

@ycscaly
Contributor Author

ycscaly commented Nov 13, 2023

I'd like to add a _vec API here too similar to RustCrypto/traits#1376

@ycscaly
Contributor Author

ycscaly commented Nov 14, 2023

I'd like to add a _vec API here too similar to RustCrypto/traits#1376

@tarcieri created a new issue #973 for this
Also created a pull request for the multiexp code in RustCrypto/crypto-bigint#248

Let's try to get these into the release as well, please

@tarcieri
Member

Perhaps an approach like this might also work here: https://github.com/RustCrypto/traits/pull/1376/files#r1393273429

I should probably open a tracking issue for redesigning the LinearCombination trait in the next breaking release of elliptic-curve: https://docs.rs/elliptic-curve/latest/elliptic_curve/ops/trait.LinearCombination.html

@ycscaly
Contributor Author

ycscaly commented Nov 14, 2023

Perhaps an approach like this might also work here: https://github.com/RustCrypto/traits/pull/1376/files#r1393273429

I should probably open a tracking issue for redesigning the LinearCombination trait in the next breaking release of elliptic-curve: https://docs.rs/elliptic-curve/latest/elliptic_curve/ops/trait.LinearCombination.html

Aren't we doing a breaking change now? And yeah, I can do that, but what should I name it? The appropriate name is already taken...

@tarcieri
Member

There haven't been any breaking changes yet, and k256 hasn't been released with lincomb exposed yet so you're still free to change it.

I plan on doing breaking changes to elliptic-curve after we're done with changes to the symmetric @RustCrypto crates.

@ycscaly
Contributor Author

ycscaly commented Nov 14, 2023

There haven't been any breaking changes yet, and k256 hasn't been released with lincomb exposed yet so you're still free to change it.

I plan on doing breaking changes to elliptic-curve after we're done with changes to the symmetric @RustCrypto crates.

Ok so how about we put such a trait in k256 for now, and in the next breaking change we'll replace the existing elliptic-curve one?

@tarcieri
Copy link
Member

Sure, sounds good

@ycscaly
Contributor Author

ycscaly commented Nov 14, 2023

Sure, sounds good

Done #974
