Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lxd: Make the LXD shutdown sequence more concurrent to avoid long-running operations blocking unrelated instances shutdown #15016

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

gabrielmougard
Copy link
Contributor

@gabrielmougard
Copy link
Contributor Author

@tomponline we can start discussing the approach. I mark it as draft as I'm stillin the process to test it but just wanted to let you know about my idea. @simondeziel what was the test setup you used so that I can reproduce a bunch of long running operations and start testing this appraoch.

@simondeziel
Copy link
Member

@simondeziel what was the test setup you used so that I can reproduce a bunch of long running operations and start testing this appraoch.

It's been a while and it seems it was an image refresh taking longer than usual. I think mimic something akin with:

$ lxc launch ubuntu-minimal-daily:24.04 c1
Launching c1
$ lxc exec c1 -- sleep inifinity

Then sudo systemctl stop snap.lxd.daemon.

@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch from cc831d8 to 34c2e89 Compare February 25, 2025 17:36
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShellCheck found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch from 34c2e89 to 258618c Compare February 26, 2025 09:16
@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch 18 times, most recently from 2542323 to 42ae29a Compare February 27, 2025 08:53
@@ -78,6 +80,140 @@ const (
ClusterHeal
)

// StringToOperationType converts a string (case-insensitive) to an Operation Type.
// Returns Unknown if the string doesn't match any known type.
func StringToOperationType(opTypeString string) Type {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markylaing is there an existing/better way to do this so we dont forget to add this mapping for new entity types? Seems similar to what we've done before.

@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch from 42ae29a to 8cc26b7 Compare February 27, 2025 08:58
@gabrielmougard
Copy link
Contributor Author

gabrielmougard commented Feb 27, 2025

If one wants to reproduce the tests: this should be working https://paste.ubuntu.com/p/MGxxJz35N7/ (still fuguring out why this is different in the CI)

The OperationTracker and the tracking groups contained in it, is a handy way to keep track
of operation objects by associating them with their entity URLs. This operation tracker
is meant to track instances and volumes related entities in a separate fashion so that
we don't hold a global mutex to take a look to an entity if we want to look an an other entity of a different type at the same type.

Signed-off-by: Gabriel Mougard <[email protected]>
@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch 16 times, most recently from 3400433 to 21aaf3f Compare February 28, 2025 15:30
…perationTracker aware

This enables `instancesShutdown` to react upon operation changes (from busy to idle) and to context cancellation
while keeping the shutdown order correct if instances have `boot.stop.priority` set.

Signed-off-by: Gabriel Mougard <[email protected]>
…rationTracker and context

Signed-off-by: Gabriel Mougard <[email protected]>
…ss` utility methods

This is needed to create dummy operations using the internal API in the tests.

Signed-off-by: Gabriel Mougard <[email protected]>
For the testing phase, we need a convenient way to create dummy operations in LXD with a certain duration.

Signed-off-by: Gabriel Mougard <[email protected]>
@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch from 21aaf3f to 4437875 Compare February 28, 2025 16:03
@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch 2 times, most recently from 162fc79 to 37c2dfa Compare February 28, 2025 16:26
@gabrielmougard gabrielmougard force-pushed the fix/improve-shutdown-sequence branch from 37c2dfa to 5755fc6 Compare February 28, 2025 17:09
@gabrielmougard gabrielmougard marked this pull request as ready for review March 1, 2025 15:05
@gabrielmougard
Copy link
Contributor Author

@tomponline the tests are passing now

Copy link
Member

@simondeziel simondeziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many test scenarios which is good for coverage but this test is pretty expansive in terms of runtime (~500s so >8 minutes). For comparison, the second longest test finishes in ~135s (2m15s).

I suggested switching from alpine to busybox but I don't know how much of a difference it will make. If you have other ideas to speed it up, please let me know.

Comment on lines +86 to +87
opTypeString = strings.ReplaceAll(opTypeString, " ", "")
opTypeString = strings.ToLower(opTypeString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
opTypeString = strings.ReplaceAll(opTypeString, " ", "")
opTypeString = strings.ToLower(opTypeString)
opTypeString = strings.ReplaceAll(strings.ToLower(opTypeString), " ", "")

Comment on lines +89 to +96
switch opTypeString {
case "clusterbootstrap":
return ClusterBootstrap
case "clusterjoin":
return ClusterJoin
case "backupcreate":
return BackupCreate
case "backuprename":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific logic to the order of the case? If not, could we sort them alphabetically?

// and to avoid the situation where a single very long running operation can block the shutdown of unrelated instances and the unmount of unrelated storage volumes.
if opAPI.Resources != nil {
for resourceName, resourceEntries := range opAPI.Resources {
if resourceName == "instances" || resourceName == "storage_volumes" || resourceName == "storage_volume_snapshots" || resourceName == "backups" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reversing the if to do an early continue would reduce the indentation and make it more readable IMHO.

Comment on lines +47 to +48
opClassString = strings.ReplaceAll(opClassString, " ", "")
opClassString = strings.ToLower(opClassString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
opClassString = strings.ReplaceAll(opClassString, " ", "")
opClassString = strings.ToLower(opClassString)
opClassString = strings.ReplaceAll(strings.ToLower(opClassString), " ", "")


local all_found=true
for message in "${expected_messages[@]}"; do
if grep -q "$message" "$logfile"; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if grep -q "$message" "$logfile"; then
if grep -qF "$message" "$logfile"; then

@@ -0,0 +1,671 @@
test_shutdown() {
ensure_import_testimage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be used (yet).

delete_instances 5
rm "$scenario_name.log"

# The following scenarios are only relevant for LXD with storage backend other than ceph.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand the comment to explain why ceph isn't relevant here.

lxc storage volume delete mypool backups_volume
lxc storage volume delete mypool images_volume
lxc storage delete mypool
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing \n.

Comment on lines +595 to +599
lxc config set i1 boot.stop.priority 0
lxc config set i2 boot.stop.priority 1
lxc config set i3 boot.stop.priority 2
lxc config set i4 boot.stop.priority 2
lxc config set i5 boot.stop.priority 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a scenario where some of the instances have a stop prio while others don't have any? I don't know if that'd be a relevant thing to test for though.


for i in $(seq 1 "$n"); do
echo "Creating instance i$i..."
lxc launch images:alpine/3.18 "i$i"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the regular busybox image called testimage with a small root disk instead?

Suggested change
lxc launch images:alpine/3.18 "i$i"
lxc launch testimage "i${i}" -d "${SMALL_ROOT_DISK}"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stopping LXD should signal instances early for shutdown
3 participants