-
Notifications
You must be signed in to change notification settings - Fork 937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lxd: Make the LXD shutdown sequence more concurrent to avoid long-running operations blocking unrelated instances shutdown #15016
base: main
Are you sure you want to change the base?
Conversation
@tomponline we can start discussing the approach. I mark it as draft as I'm stillin the process to test it but just wanted to let you know about my idea. @simondeziel what was the test setup you used so that I can reproduce a bunch of long running operations and start testing this appraoch. |
It's been a while and it seems it was an image refresh taking longer than usual. I think mimic something akin with:
Then |
cc831d8
to
34c2e89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShellCheck found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
34c2e89
to
258618c
Compare
2542323
to
42ae29a
Compare
@@ -78,6 +80,140 @@ const ( | |||
ClusterHeal | |||
) | |||
|
|||
// StringToOperationType converts a string (case-insensitive) to an Operation Type. | |||
// Returns Unknown if the string doesn't match any known type. | |||
func StringToOperationType(opTypeString string) Type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markylaing is there an existing/better way to do this so we dont forget to add this mapping for new entity types? Seems similar to what we've done before.
42ae29a
to
8cc26b7
Compare
If one wants to reproduce the tests: this should be working https://paste.ubuntu.com/p/MGxxJz35N7/ (still fuguring out why this is different in the CI) |
The OperationTracker and the tracking groups contained in it, is a handy way to keep track of operation objects by associating them with their entity URLs. This operation tracker is meant to track instances and volumes related entities in a separate fashion so that we don't hold a global mutex to take a look to an entity if we want to look an an other entity of a different type at the same type. Signed-off-by: Gabriel Mougard <[email protected]>
3400433
to
21aaf3f
Compare
…perationTracker aware This enables `instancesShutdown` to react upon operation changes (from busy to idle) and to context cancellation while keeping the shutdown order correct if instances have `boot.stop.priority` set. Signed-off-by: Gabriel Mougard <[email protected]>
…rationTracker and context Signed-off-by: Gabriel Mougard <[email protected]>
…ss` utility methods This is needed to create dummy operations using the internal API in the tests. Signed-off-by: Gabriel Mougard <[email protected]>
For the testing phase, we need a convenient way to create dummy operations in LXD with a certain duration. Signed-off-by: Gabriel Mougard <[email protected]>
21aaf3f
to
4437875
Compare
162fc79
to
37c2dfa
Compare
… tests Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
37c2dfa
to
5755fc6
Compare
@tomponline the tests are passing now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many test scenarios which is good for coverage but this test is pretty expansive in terms of runtime (~500s so >8 minutes). For comparison, the second longest test finishes in ~135s (2m15s).
I suggested switching from alpine to busybox but I don't know how much of a difference it will make. If you have other ideas to speed it up, please let me know.
opTypeString = strings.ReplaceAll(opTypeString, " ", "") | ||
opTypeString = strings.ToLower(opTypeString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opTypeString = strings.ReplaceAll(opTypeString, " ", "") | |
opTypeString = strings.ToLower(opTypeString) | |
opTypeString = strings.ReplaceAll(strings.ToLower(opTypeString), " ", "") |
switch opTypeString { | ||
case "clusterbootstrap": | ||
return ClusterBootstrap | ||
case "clusterjoin": | ||
return ClusterJoin | ||
case "backupcreate": | ||
return BackupCreate | ||
case "backuprename": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific logic to the order of the case
? If not, could we sort them alphabetically?
// and to avoid the situation where a single very long running operation can block the shutdown of unrelated instances and the unmount of unrelated storage volumes. | ||
if opAPI.Resources != nil { | ||
for resourceName, resourceEntries := range opAPI.Resources { | ||
if resourceName == "instances" || resourceName == "storage_volumes" || resourceName == "storage_volume_snapshots" || resourceName == "backups" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reversing the if
to do an early continue
would reduce the indentation and make it more readable IMHO.
opClassString = strings.ReplaceAll(opClassString, " ", "") | ||
opClassString = strings.ToLower(opClassString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opClassString = strings.ReplaceAll(opClassString, " ", "") | |
opClassString = strings.ToLower(opClassString) | |
opClassString = strings.ReplaceAll(strings.ToLower(opClassString), " ", "") |
|
||
local all_found=true | ||
for message in "${expected_messages[@]}"; do | ||
if grep -q "$message" "$logfile"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if grep -q "$message" "$logfile"; then | |
if grep -qF "$message" "$logfile"; then |
@@ -0,0 +1,671 @@ | |||
test_shutdown() { | |||
ensure_import_testimage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be used (yet).
delete_instances 5 | ||
rm "$scenario_name.log" | ||
|
||
# The following scenarios are only relevant for LXD with storage backend other than ceph. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand the comment to explain why ceph
isn't relevant here.
lxc storage volume delete mypool backups_volume | ||
lxc storage volume delete mypool images_volume | ||
lxc storage delete mypool | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing \n
.
lxc config set i1 boot.stop.priority 0 | ||
lxc config set i2 boot.stop.priority 1 | ||
lxc config set i3 boot.stop.priority 2 | ||
lxc config set i4 boot.stop.priority 2 | ||
lxc config set i5 boot.stop.priority 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a scenario where some of the instances have a stop prio while others don't have any? I don't know if that'd be a relevant thing to test for though.
|
||
for i in $(seq 1 "$n"); do | ||
echo "Creating instance i$i..." | ||
lxc launch images:alpine/3.18 "i$i" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use the regular busybox image called testimage
with a small root disk instead?
lxc launch images:alpine/3.18 "i$i" | |
lxc launch testimage "i${i}" -d "${SMALL_ROOT_DISK}" |
closes: #13460
JIRA ticket: https://warthogs.atlassian.net/browse/LXD-2600