
Recovery documentation #5796

Open
kfox1111 opened this issue Jan 19, 2025 · 10 comments
Assignees
Labels
triage/in-progress Issue triage is in progress

Comments

@kfox1111
Contributor

If your server is down for too long, how do you recover?

The server fails to start, with an error like:

ERRO[0000] Fatal run error                               error="invalid server X509-SVID: invalid X509-SVID: already expired as of 2024-12-23T16:48:19Z"
ERRO[0000] Server crashed                                error="invalid server X509-SVID: invalid X509-SVID: already expired as of 2024-12-23T16:48:19Z"
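Before fixing anything, it can help to confirm which certificate actually expired. A quick check against any PEM file the server has on disk or exported (a sketch; openssl is assumed available and bundle.pem is an illustrative filename):

```shell
# Print the expiry date, then check it against "now":
# -checkend 0 exits 0 if the certificate is still valid at this moment,
# non-zero if it has already expired.
openssl x509 -noout -enddate -in bundle.pem
openssl x509 -noout -checkend 0 -in bundle.pem && echo "still valid" || echo "already expired"
```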
@kfox1111
Contributor Author

Is removing keys.json enough on the server? It seems to start. Anything to do with the SQL DB?
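For reference, the reset being asked about is just removing the disk KeyManager's key file before restarting the server. A minimal sketch, assuming the keys_path used elsewhere in this thread (whether this alone is sufficient, or whether the SQL datastore also needs attention, is exactly the open question):

```shell
# keys.json holds the private keys managed by the KeyManager "disk" plugin.
# Removing it forces the server to generate fresh key material on restart.
# The path is an assumption; match it to your keys_path setting.
rm ./data/server/keys.json
```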

@kfox1111
Contributor Author

The agent won't connect in that case either...

ERRO[0030] Agent crashed                                 error="create attestation client: failed to dial dns:///localhost:8081: context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: x509svid: could not get X509 bundle: x509bundle: no X.509 bundle found for trust domain: \\\"example.com\\\"\""

@kfox1111
Contributor Author

For the agent, the trust bundle seems to be in agent-data.json?

I deleted keys.json as part of trying to fix it, but I'm not sure that step is required, as it was an unrelated issue...
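The agent's persisted state is plain JSON, so one way to check what it actually stores is to list its fields (a sketch; jq is assumed available, the path is illustrative, and field names vary by SPIRE version):

```shell
# List the top-level fields the agent has persisted, e.g. to see whether
# a cached trust bundle is among them.
jq 'keys' ./data/agent/agent-data.json
```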

@sorindumitru
Contributor

sorindumitru commented Jan 19, 2025

ERRO[0000] Fatal run error error="invalid server X509-SVID: invalid X509-SVID: already expired as of 2024-12-23T16:48:19Z"

Was it stuck in a crash loop with this error? That sounds more like a bug. Do you happen to have the rest of the logs?

@kfox1111
Contributor Author

I think so, but it was long enough ago that I don't want to say for sure.

Basically, I had a spire-server I left offline long enough that all its CAs expired. You should be able to reproduce it by starting up a new spire-server with an extremely short CA TTL, shutting it off for a little while until the CAs expire, then starting it back up.
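As a sketch, the reproduction described above might use server settings like these (values are illustrative, not taken from the original setup):

```hcl
server {
    trust_domain = "example.org"
    data_dir     = "./data/server"
    # Short-lived CA: shut the server down for longer than ca_ttl,
    # then start it again to hit the expired-CA path.
    ca_ttl                = "5m"
    default_x509_svid_ttl = "2m"
}
```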

@MarcosDY MarcosDY added the triage/in-progress Issue triage is in progress label Jan 21, 2025
@sorindumitru
Contributor

If you manage to reproduce this again, logs and/or details of how to reproduce it (definitely the configuration and, if possible, the steps) would be greatly appreciated.

Did this happen in a nested deployment? Maybe this would happen if the upstream is unavailable when the downstream restarts.

Generic documentation on recovery is likely going to be hard to write, since a lot of it depends on specific configurations. Maybe we can start with a troubleshooting/FAQ doc and see how it goes from there. There are definitely some repeating questions on Slack; it would be nice to be able to point people somewhere.

@kfox1111
Contributor Author

Nothing complicated or unexpected... no nesting, no federation.

I downloaded a version of SPIRE a while ago to my laptop, started it, and had it working. When done, I shut it down.

The next time I started it back up (weeks later), it failed to start, as all its CAs had expired while it was shut down. This is something that could happen to others, so I was curious how to recover from this situation; I hadn't found documentation on it, and others will need it too if it ever happens to them.

@sorindumitru
Contributor

I tried to reproduce this by setting ca_ttl and default_x509_svid_ttl to small values (5m and 2m in this case) and it seems to recover here:

DEBU[0000] Found a CA journal record that matches with a local X509 authority ID  ca_journal_id=2 local_authority_id=5e8033fda0df681a07795b5900f88227dc7218b1 subsystem_name=ca_manager
INFO[0000] Journal loaded                                jwt_keys=1 subsystem_name=ca_manager x509_cas=1
DEBU[0000] Preparing X509 CA                             slot=B subsystem_name=ca_manager
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=2 local_authority_id= subsystem_name=ca_manager
INFO[0000] X509 CA prepared                              expiration="2025-01-22 09:58:21 -0500 EST" issued_at="2025-01-22 09:53:21.791654938 -0500 EST" local_authority_id=562315292f280b540e459e4c8d6d8cf9f74b1c6a self_signed=true slot=B subsystem_name=ca_manager upstream_authority_id=
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=2 local_authority_id= subsystem_name=ca_manager
INFO[0000] X509 CA activated                             expiration="2025-01-22 09:58:21 -0500 EST" issued_at="2025-01-22 09:53:21.791654938 -0500 EST" local_authority_id=562315292f280b540e459e4c8d6d8cf9f74b1c6a slot=B subsystem_name=ca_manager upstream_authority_id=
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=2 local_authority_id=562315292f280b540e459e4c8d6d8cf9f74b1c6a subsystem_name=ca_manager
DEBU[0000] Successfully rotated X.509 CA                 subsystem_name=ca_manager trust_domain_id="spiffe://example.org" ttl=299.200592622
DEBU[0000] Preparing JWT key                             slot=B subsystem_name=ca_manager
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=2 local_authority_id=562315292f280b540e459e4c8d6d8cf9f74b1c6a subsystem_name=ca_manager
INFO[0000] JWT key prepared                              expiration="2025-01-22 09:58:21.799639382 -0500 EST" issued_at="2025-01-22 09:53:21.799639382 -0500 EST" local_authority_id=b9YnzF1NTZ4IqoXIS5mI5xWYa7buYW0i slot=B subsystem_name=ca_manager
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=2 local_authority_id=562315292f280b540e459e4c8d6d8cf9f74b1c6a subsystem_name=ca_manager
INFO[0000] JWT key activated                             expiration="2025-01-22 09:58:21.799639382 -0500 EST" issued_at="2025-01-22 09:53:21.799639382 -0500 EST" local_authority_id=b9YnzF1NTZ4IqoXIS5mI5xWYa7buYW0i slot=B subsystem_name=ca_manager
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=2 local_authority_id=562315292f280b540e459e4c8d6d8cf9f74b1c6a subsystem_name=ca_manager
DEBU[0000] Rotating server SVID                          subsystem_name=svid_rotator
DEBU[0000] Signed X509 SVID                              expiration="2025-01-22T14:55:21Z" spiffe_id="spiffe://example.org/spire/server" subsystem_name=svid_rotator

It likely depends a lot on the configuration, though. Do you happen to have the configuration for spire server at the time (e.g. KeyManager and/or UpstreamAuthority plugins)?

@kfox1111
Contributor Author

I think it was probably:

server {
    bind_address = "127.0.0.1"
    bind_port = "8081"
    jwt_issuer = "https://foo"
    trust_domain = "example.com"
    data_dir = "./data/server"
    log_level = "DEBUG"
    #ca_ttl = "10m"
    # default_x509_svid_ttl = "48h"
    #default_x509_svid_ttl = "2m"
}

plugins {
    DataStore "sql" {
        plugin_data {
            database_type = "sqlite3"
            connection_string = "./data/server/datastore.sqlite3"
        }
    }

    KeyManager "disk" {
        plugin_data {
            keys_path = "./data/server/keys.json"
        }
    }

    NodeAttestor "join_token" {
        plugin_data {}
    }

    NodeAttestor "tpm_attestor_server" {
        plugin_cmd = "/opt/spire-1.7.2/bin/tpm_attestor_server"
        plugin_data {
          hash_path = "/opt/spire-1.7.2/data/hash"
        }
    }
}

@sorindumitru
Contributor

I notice a version number in your config, 1.7.2. Does that match the version of spire-server?
