Auth: OIDC login flow fails in an LXD cluster if different members handle different stages of the flow #13644
Comments
@markylaing when you get a moment I'd appreciate your thoughts on how we could approach this one. Could we encode (& encrypt) the necessary info in the cookie/return URL of the flow such that it's available to the other cluster member?
I think @mas-who's suggestion of encoding it in the state parameter sounds reasonable. That said, we could also recommend a load-balancer configuration in our documentation to enable e.g. sticky sessions.
Sounds good, thanks!
I've been thinking a little more on this. A better approach might be to use the same cookie encryption mechanism as with browser cookies. We can use the session ID in the state parameter as it is unencrypted. I'm not sure this will work with the …
We can't use the session ID. The cookie block and hash keys are for the relying party and need to persist, otherwise we would be performing OIDC discovery on every request. As we can't use the session ID, there would be no way to provide a salt when deriving keys from the cluster private key. We would need some cluster-wide mechanism of rotating a shared salt for this to work.
Another idea for this:

- Set up a background task in LXD to generate a UUID or token and store it in the database. The UUID should be overwritten on a configurable rotation, and only the dqlite leader should perform this task.
- The leader should send an internal request to other members indicating that the key has been updated.
- The UUID should be passed into the OIDC verifier. The verifier can derive a private key from the cluster private key using the UUID as a shared salt, and use this for the hash and block keys of the relying-party cookie configuration (a rough sketch of this derivation follows below).
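A minimal sketch of what that derivation could look like, assuming Go and `golang.org/x/crypto/hkdf`; the package, function name, key sizes, and info string are illustrative assumptions rather than LXD's actual implementation:

```go
// Sketch only: derive the relying-party cookie keys from the cluster private
// key, using the cluster-wide rotating UUID as the HKDF salt. Names, key
// sizes, and the info string are assumptions, not LXD's real code.
package oidc

import (
	"crypto/sha512"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveCookieKeys returns a 64-byte hash key and a 32-byte block key
// (AES-256) suitable for gorilla/securecookie, derived with HKDF-SHA512.
func deriveCookieKeys(clusterPrivateKey []byte, saltUUID string) (hashKey, blockKey []byte, err error) {
	kdf := hkdf.New(sha512.New, clusterPrivateKey, []byte(saltUUID), []byte("lxd-oidc-relying-party-cookies"))

	hashKey = make([]byte, 64)
	if _, err := io.ReadFull(kdf, hashKey); err != nil {
		return nil, nil, err
	}

	blockKey = make([]byte, 32)
	if _, err := io.ReadFull(kdf, blockKey); err != nil {
		return nil, nil, err
	}

	return hashKey, blockKey, nil
}
```

Since every member holds the same cluster private key and reads the same UUID from the database, they all derive identical keys, so any member can decode relying-party cookies and state set by another member.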
This sounds like a good approach. I see two open issues:
A possible solution is to try to decode with both the old and new salt for some grace period during/after the salt update. Additionally, do not send unreliable "salt was updated" messages between cluster nodes, but reload the UUID from the database every n minutes. Make the grace period 2*n minutes.
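For illustration only, a sketch of that grace-period fallback using gorilla/securecookie and the hypothetical `deriveCookieKeys` helper from the earlier sketch (the reply below ends up preferring not to accept the old salt at all):

```go
// Sketch only: during the grace period, accept values encoded under either
// the new or the previous salt by trying key pairs derived from both.
package oidc

import "github.com/gorilla/securecookie"

// decodeWithGracePeriod tries the keys derived from the new salt first and
// falls back to the keys derived from the old salt. deriveCookieKeys is the
// hypothetical helper from the previous sketch.
func decodeWithGracePeriod(clusterPrivateKey []byte, newSalt, oldSalt, name, value string, dst interface{}) error {
	var codecs []securecookie.Codec

	for _, salt := range []string{newSalt, oldSalt} {
		hashKey, blockKey, err := deriveCookieKeys(clusterPrivateKey, salt)
		if err != nil {
			return err
		}
		codecs = append(codecs, securecookie.New(hashKey, blockKey))
	}

	// DecodeMulti succeeds if any of the provided codecs can decode the value.
	return securecookie.DecodeMulti(name, value, dst, codecs...)
}
```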
For the first point, we already rotate cookie encryption keys for the OIDC verifier every 5 minutes and haven't encountered this issue. I'd say the configurable duration should be no less than 5 minutes, and we should not accept the old salt (more secure). The worst case is that someone is redirected back to the login page to start again.

For the second point, we send internal updates this way in a lot of cases (updating the identity cache is one). This way we can ensure the new UUID is written and replicated before the other members refresh it. If a member is offline, it will load the new UUID when it comes back online (as is currently done for the identity cache). One case we haven't encountered yet is a network blip causing the notification to fail when a member is actually online. In that case we may need to implement an exponential backoff, but since other mechanisms also rely on this, we'll implement it in a general way in a separate PR (if it becomes an issue).
We could even include it in the heartbeat payload if needed, which is sent regularly to all members from the leader.
I'm not a security expert, but the session ID is used as a salt when deriving a secure key from the cluster private key using HKDF. An actor must have both the cluster private key and the session ID to derive a key which can be used to decrypt the …
Never mind, after looking into it again I realize that I misunderstood how the code works.
Issue description
The OIDC login will fail in an LXD cluster if different members handle different stages of the flow. This is likely to occur if load balancing is deployed in front of the cluster.

When a user initiates the login flow, the request reaches `/oidc/login` and a `state` parameter is set in the redirect request to the IdP auth endpoint. Upon successful user authentication, the IdP makes a request to the `/oidc/callback` endpoint, where the `state` parameter needs to be matched (data can be extracted from the `state` parameter). When load balancing is active, the cluster member that receives the `/oidc/login` request is likely to be different from the cluster member that receives the `/oidc/callback` request, causing `state` to be mismatched, which in turn causes the login flow to fail.

Some potential ideas on how to resolve this issue:

- … the `/oidc/callback` endpoint.
- Encode the cluster member address in the `state` parameter so that we can extract this information in the `/oidc/callback` endpoint. If the extracted member address is not the same as the current member address, then we can forward the request to the correct member based on the extracted info (see the sketch after this list).
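A rough sketch of the second idea in Go, with entirely hypothetical names (`wrapState`, `callbackHandler`, the `|` separator); a real implementation would also need to use the cluster's internal TLS configuration when forwarding between members:

```go
// Sketch only: encode the member address into the state parameter on
// /oidc/login and, on /oidc/callback, forward the request to that member if
// it is not the local one. All names here are hypothetical.
package oidc

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// stateSeparator is assumed never to appear in the random state value itself
// (true for UUID or base64url encoded states).
const stateSeparator = "|"

// wrapState appends the local member address to the state value generated
// during /oidc/login.
func wrapState(state, localAddress string) string {
	return state + stateSeparator + url.QueryEscape(localAddress)
}

// callbackHandler extracts the member address from the state parameter and,
// if it differs from the local member, proxies the request to the member
// that started the flow so that its stored state can be matched.
func callbackHandler(localAddress string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		state := r.URL.Query().Get("state")
		parts := strings.SplitN(state, stateSeparator, 2)
		if len(parts) == 2 {
			member, err := url.QueryUnescape(parts[1])
			if err == nil && member != localAddress {
				// A real implementation would configure the proxy transport
				// with the cluster's internal TLS certificates.
				target := &url.URL{Scheme: "https", Host: member}
				httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
				return
			}
		}

		// The flow started on this member (or the state carries no address).
		next.ServeHTTP(w, r)
	})
}
```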
Steps to reproduce
Test with an LXD cluster:

… `/oidc/callback` of lxd-node-2 I get an error message: "failed to get state: securecookie: the value is not valid"