Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CONFIG-329][CONFIG-330] End to end snapshot integrity check #124

Merged
merged 36 commits into from
Oct 6, 2023

Conversation

zhou-hongyu
Copy link
Contributor

@zhou-hongyu zhou-hongyu commented Sep 19, 2023

Description

  • Having ctlstore-supervisor calculate the base64 encoded sha value of snapshot.db and upload to header object's Metadata field when uploading the snapshot
  • Having download script perform integrity check by validating if the calculated checksum matches between the file downloaded and the value uploaded by the supervisor
  • Refactor s3 sdk to v2 to accommodate the effort of upgrading aws sdk to v2 initiated by [CONFIG-325] Increase ldb download speeds #117

Why

Initiated by a recent novel error from ctlstore-reflector exec dml statement error: database disk image is malformed we suspect the culprit might be the s5cmd thing that it either:

  1. Errored but didn't exit with a non-zero code
  2. Succeeded but the ldb file is corrupted anyway

In both cases k8s's retry mechanism of the init container won't get triggered. So we introduce an end-to-end integrity check to have the container bail out with a failure exit code.

Testing

Testing completed successfully in stage-euw1:eu-west-1:centrifuge-destinations/ctlstore:

  • Happy path:
(stage-euw1-write/stage-euw1:eu-west-1:centrifuge-destinations/ctlstore) hzhou@HX19VT0CH2 eks-configuration % kc logs ctlstore-reflector-v2-beta-kbm6q -c copy-snapshot
Downloading head object from s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db
Remote checksum: LBQyDZLyguuVaZS1FMEe2HT6+eaXqVCL9pm1tNNC51w=
Remote version: KoLpw8TjDsrbpajexFzuKMV0XDenSkux
Downloading snapshot from s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db with VersionID: KoLpw8TjDsrbpajexFzuKMV0XDenSkux
cp s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db snapshot.db
Local snapshot checksum: LBQyDZLyguuVaZS1FMEe2HT6+eaXqVCL9pm1tNNC51w=
Checksum matches
ldb.db ready in 8 seconds
{"startTime": 8, "downloaded": "true", "compressed": "false"}
  • Unhappy path:

By enforcing a false negative, the init container will exit with 1. But because if a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds

...
(stage-euw1-write/stage-euw1:eu-west-1:centrifuge-destinations/ctlstore) hzhou@HX19VT0CH2 eks-configuration % kc logs ctlstore-reflector-v2-beta-hr6hr -c copy-snapshot -f
Downloading head object from s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db
Remote checksum: 53ZmxxO1f3qQTHklA/cWiycnT5jLTkMuh5BWkTVzvIw=
Remote version: y0D3T0Wl8IbJOfPdXwZb1Wjt6X_IiZvH
Downloading snapshot from s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db with VersionID: y0D3T0Wl8IbJOfPdXwZb1Wjt6X_IiZvH
cp s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db snapshot.db
Local snapshot checksum: 53ZmxxO1f3qQTHklA/cWiycnT5jLTkMuh5BWkTVzvIw=
Checksum does not match
Failed to download intact snapshot
  • Test checksum info n/a:
(stage-euw1-write/stage-euw1:eu-west-1:centrifuge-destinations/ctlstore) hzhou@HX19VT0CH2 eks-configuration % kc logs ctlstore-reflector-v2-beta-xpt6v -c copy-snapshot -f
Downloading head object from s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db
Remote checksum:
Remote version: lRMKUOMPcIT4QiZjaYKHyTmZH_L3BfBJ
Downloading snapshot from s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db with VersionID: lRMKUOMPcIT4QiZjaYKHyTmZH_L3BfBJ
cp s3://segment-ctlstore-snapshots-stage-euw1/snapshot.db snapshot.db
Remote checksum is null, skipping checksum validation
ldb.db ready in 5 seconds
{"startTime": 5, "downloaded": "true", "compressed": "false"}
  • Test in prod to benchmark the extra overhead of running validation before merging:
│ Downloading head object from s3://segment-ctlstore-snapshots-production/snapshot.db                                                                                │
│ Remote checksum in sha1: JAiESzwPGnyMx1YwtUgQ9we1oJ8=                                                                                                              │
│ Remote version: vwF4jBWuPECP_7ECo._PhOFc5jEahDYB                                                                                                                   │
│ Downloading snapshot from s3://segment-ctlstore-snapshots-production/snapshot.db with VersionID: vwF4jBWuPECP_7ECo._PhOFc5jEahDYB                                  │
│ cp s3://segment-ctlstore-snapshots-production/snapshot.db snapshot.db                                                                                              │
│ Local snapshot checksum in sha1: JAiESzwPGnyMx1YwtUgQ9we1oJ8=                                                                                                      │
│ Checksum matches                                                                                                                                                   │
│ Local checksum calculation took 13 seconds                                                                                                                         │
│ ldb.db ready in 68 seconds                                                                                                                                         │
│ {"startTime": 68, "downloaded": "true", "compressed": "false"}

@zhou-hongyu zhou-hongyu force-pushed the refactor-supervisor-to-use-s3-sdk-v2 branch from 370ddb7 to a09dc8d Compare September 19, 2023 20:12
@zhou-hongyu zhou-hongyu changed the title refactor supervisor to use sdk v2 and add checksum [CONFIG-329][CONFIG-330] End to end snapshot integrity check Sep 26, 2023
scripts/download.sh Outdated Show resolved Hide resolved
scripts/download.sh Outdated Show resolved Hide resolved
@zhou-hongyu zhou-hongyu marked this pull request as ready for review September 26, 2023 21:12
@zhou-hongyu zhou-hongyu requested a review from a team as a code owner September 26, 2023 21:12
scripts/download.sh Outdated Show resolved Hide resolved
scripts/download.sh Outdated Show resolved Hide resolved
@zhou-hongyu zhou-hongyu requested review from erikdw and apacker October 3, 2023 23:05
@zhou-hongyu
Copy link
Contributor Author

Testing section in the PR description is updated with code changes.

Copy link

@apacker apacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 🔥

scripts/download.sh Outdated Show resolved Hide resolved
Copy link
Collaborator

@erikdw erikdw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Super minor nit about CheckSum -> Checksum, otherwise, let's 🚢 🇮🇹

Dockerfile Show resolved Hide resolved
pkg/supervisor/archived_snapshot.go Outdated Show resolved Hide resolved
pkg/supervisor/archived_snapshot.go Outdated Show resolved Hide resolved
@zhou-hongyu zhou-hongyu merged commit 146e400 into master Oct 6, 2023
@zhou-hongyu zhou-hongyu deleted the refactor-supervisor-to-use-s3-sdk-v2 branch October 6, 2023 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants