[DPE-5371]: Storage reuse on a different cluster #488

Open

Gu1nness wants to merge 3 commits into 6/edge from DPE-5371-reuse-storage

Conversation

@Gu1nness (Contributor) commented Sep 17, 2024

Issue

  • Storage reuse doesn't work if the storage is attached to a new application

Solution

There are multiple problems to face when reusing storage in a new application:

  • We need to be able to detect that we are using a new application and not adding units back to the same replica set.
  • We cannot assume that we have kept all the user passwords, so we need to change them.
  • We might have to change the replica set name.
  • In any case, we need to reconfigure the replica set.

The solution chosen here is the following:

  • We store a file containing a random string on the storage volume (see the detection sketch after this list).
  • This string is also stored as an app secret (app secrets are accessible in the install event, which is not the case for the app peer databag).
  • If the string on disk doesn't match the secret, the storage comes from another application. In that case, after installing Charmed MongoDB we start mongod in a degraded mode: no replica set, no auth validation. This allows us to patch the deployment (also sketched below).
  • Then we initialise the replica set; here we need to use authentication to make sure it works, because the users already exist in this case.
  • We add one more optional reconfigure: in at least two situations, all of the IPs can change at once. This leaves the replica set fully broken, with no host able to find the other members. To fix this, a new method forcefully reconfigures the replica set in one specific case: none of the IPs in the MongoDB replica set config matches the IPs recorded in the databag. This is achieved by opening a standalone connection to the node (which doesn't require any server selection in the replica set) and reading the config (not the status, which does require being connected to the other nodes). See the reconfiguration sketch after this list.
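
A minimal sketch of the marker-file detection described in the first three bullets. The path and the get_app_secret/set_app_secret helpers are hypothetical stand-ins for the charm's secret handling, not its actual API:

# Sketch only: detect storage that was initialised by another application.
# MARKER_PATH, get_app_secret and set_app_secret are hypothetical names.
import secrets
from pathlib import Path

MARKER_PATH = Path("/var/lib/mongodb/.charm_storage_marker")  # lives on the reused volume


def storage_belongs_to_another_app(get_app_secret, set_app_secret) -> bool:
    """Return True when the attached volume was initialised by a different application."""
    expected = get_app_secret("storage-marker")  # app secrets are readable in the install event

    if not MARKER_PATH.exists():
        # Fresh volume: record the application's marker (creating one if needed).
        marker = expected or secrets.token_hex(16)
        if expected is None:
            set_app_secret("storage-marker", marker)
        MARKER_PATH.write_text(marker)
        return False

    # A mismatch (or a missing secret) means the volume comes from another application.
    return expected is None or MARKER_PATH.read_text().strip() != expected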
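
The degraded start amounts to launching mongod without a replica set name and without auth. A rough illustration using standard mongod flags; the real charm renders a mongod.conf, which is not reproduced here:

# Illustration only: the "degraded" start is mongod without --replSet and without --auth,
# so users, passwords and the replica set configuration can be patched directly.
def mongod_args(degraded: bool, replset_name: str = "mongodb") -> list[str]:
    args = ["mongod", "--bind_ip_all", "--dbpath", "/var/lib/mongodb"]
    if degraded:
        return args  # standalone, unauthenticated
    return args + ["--replSet", replset_name, "--auth"]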
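
And a sketch of the forced reconfiguration from the last bullet, using pymongo. The host list and connection options are placeholders; only the commands themselves (replSetGetConfig, replSetReconfig with force) are standard MongoDB:

# Sketch of the forced replica set reconfiguration described above.
# Hosts and connection options are placeholders; error handling is omitted.
from pymongo import MongoClient


def force_reconfig_if_no_ip_matches(local_host: str, databag_hosts: list[str], **client_opts):
    # directConnection=True talks to this node only, so no replica set
    # server selection happens even if every other member is unreachable.
    client = MongoClient(local_host, directConnection=True, **client_opts)

    # replSetGetConfig returns the locally stored config; unlike replSetGetStatus
    # it does not require reaching the other members.
    config = client.admin.command("replSetGetConfig")["config"]
    configured_hosts = {member["host"] for member in config["members"]}

    if configured_hosts & set(databag_hosts):
        return  # at least one address still matches: no forced reconfigure needed

    # None of the configured addresses matches the databag: rewrite the member
    # list with the current unit addresses and force the new config in.
    config["version"] += 1
    config["members"] = [{"_id": i, "host": host} for i, host in enumerate(databag_hosts)]
    client.admin.command({"replSetReconfig": config, "force": True})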

The (not so big) drawback:

  • Some of the most recent data might be lost: WiredTiger writes a checkpoint to disk every 60 seconds, so writes made after the last checkpoint can be lost.
  • I tried different approaches (updating the users and renaming the replica set only) to avoid that, which ended in a core dump of mongod (yes…) or MongoDB restarting in a loop because some config collections were broken. I haven't been able to figure out what the issue was, so I decided to go with this in-between solution.
  • I find this an acceptable solution given that:
    • If we're reusing storage in a new application willingly, we can leave time for the checkpoint to be written to disk (see the sketch after this list).
    • If we're reusing storage after a crash, losing the very latest data is an acceptable loss.
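
For the planned case, one way to make sure the latest writes are on disk before detaching the storage is to force a checkpoint with the standard fsync admin command. This is only a hedged illustration, not something this PR does; the URI is a placeholder:

# Optional mitigation (not part of this PR): force a checkpoint so the latest
# writes are flushed to disk before the storage is detached and reused.
from pymongo import MongoClient

MONGO_URI = "mongodb://operator:password@10.1.2.3:27017/admin"  # placeholder credentials/host

client = MongoClient(MONGO_URI)
client.admin.command("fsync")  # returns once pending writes have been flushed to disk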

Implements

  • Storage reuse on the same cluster (scale down, scale to zero).
  • Storage reuse on a different cluster with same app name.
  • Storage reuse on a different cluster with different app name.
  • Restore after a full cluster crash.

Suggestions for the future

  • A big part of our issues comes from the fact that we use IPs and not DNS names inside the cluster.
    Having DNS in the cluster would help: a network crash or a machine restart wouldn't change the name of each unit, and the cluster would survive IP changes.
  • An integration with external secrets could make this easier: deploying a new application connected to external secrets would avoid losing the secrets of the first application, which could then restart as before, with "only" a big reconfiguration of hosts.

@Gu1nness force-pushed the DPE-5371-reuse-storage branch from 09ec58a to 10a97e0 on September 17, 2024 at 20:44
Features:
 * Storage reuse on the same cluster (scale down, scale to zero)
 * Storage reuse on a different cluster with same app name
 * Storage reuse on a different cluster with different app name
@Gu1nness force-pushed the DPE-5371-reuse-storage branch from 10a97e0 to b1c47a1 on September 18, 2024 at 07:02
@MiaAltieri (Contributor) left a comment

Excellent work Neha.

You have added huge value to our project! I left some questions, requested changes, and nits.

lib/charms/mongodb/v0/mongo.py (resolved review thread)
Comment on lines +289 to +291
def drop_local_database(self):
    """DANGEROUS: Drops the local database."""
    self.client.drop_database("local")
A contributor commented:

Is there something we can calculate / store in app databag to verify whether it is safe to drop the local db? then we could add a guardrail by raising a custom exception

@Gu1nness (author) replied:

I'm not sure.
TBH dropping the local database is not really something that should be done, but it does work.
The big issue with dropping the local DB is that it also stores the oplog.
I can try to drop less than that in this call in order to be safer, but I'm unsure it will work.
Should I give it a try?
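
As an illustration of the guardrail idea floated above, one option would be a flag in the app peer databag plus a custom exception. The key name, the exception, and the app_peer_data attribute below are hypothetical, not part of the charm:

# Hypothetical guardrail for drop_local_database(): only proceed when the app
# databag explicitly records that storage reuse was detected.
class NotSafeToDropLocalError(Exception):
    """Raised when dropping the local database has not been explicitly allowed."""


def drop_local_database(self):
    """DANGEROUS: Drops the local database (which also holds the oplog)."""
    if self.app_peer_data.get("storage-reuse-detected") != "true":
        raise NotSafeToDropLocalError(
            "the local database may only be dropped while handling storage reuse"
        )
    self.client.drop_database("local")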

lib/charms/mongodb/v1/helpers.py (3 outdated review threads, resolved)
src/lock_hash.py (resolved review thread)
tests/integration/ha_tests/test_storage.py (3 outdated review threads, resolved)
assert sorted(storage_ids.values()) == sorted(new_storage_ids), "Storage IDs mismatch"

actual_writes = await count_writes(ops_test, app_name=new_app_name)
assert writes_results["number"] == actual_writes
A contributor commented:

same checks as suggested above

@Gu1nness (author) replied:

I don't think you have suggested a check above; maybe a comment got lost?
