-
Notifications
You must be signed in to change notification settings - Fork 935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Network: Fix OVN ChassisGroupChassisDelete to convert chassis_name #14904
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately this issue was introduced by myself in 1d98019 trying to workaround the issue that the ovn-nbctl ha-chassis-group-remove-chassis
command doesn't provide an --if-exists
option where if the chassis member doesn't exist in the group, then no error will be raised.
Previously if the chassis was a member of multiple chassis groups (OVN networks) then if one of the OVN networks couldn't
be started (and thus its chassis wasn't added to that network's chassis group) then attempting to remove unavailable OVN
network would fail because the previous implementation thought the chassis was in the chassis group, when infact it was
in another chassis group.
However this PR as it stands introduces another issue because it does not consider that the same chassis can have multiple ha_chassis
entries (one per ovn network). This is done to achieve the per-ovn network chassis failover priority, in order to distribute uplink traffic across the cluster.
Taking the example of a microcloud with 2 OVN networks:
ovn-nbctl find ha_chassis
_uuid : f139a82d-7d32-4799-80ce-6feecbb4b3df
chassis_name : micro01
external_ids : {}
priority : 26056
_uuid : 6fe7b2ed-d98b-4e64-9585-0eef369ba082
chassis_name : micro01
external_ids : {}
priority : 1138
_uuid : a1a8d8c2-2cc6-4806-8bd8-5a4ddc485890
chassis_name : micro03
external_ids : {}
priority : 15471
_uuid : a07811af-327d-4467-ae7e-3be7afc98098
chassis_name : micro02
external_ids : {}
priority : 10207
_uuid : a7efafc7-9f65-4383-aa9d-ab8cad5443ae
chassis_name : micro03
external_ids : {}
priority : 29049
_uuid : 701780b3-8930-4365-abe7-85ab0865356e
chassis_name : micro02
external_ids : {}
priority : 22819
ovn-nbctl find ha_chassis_group
_uuid : 2df9866e-03a8-4bb9-9447-5f928014f338
external_ids : {}
ha_chassis : [a07811af-327d-4467-ae7e-3be7afc98098, a1a8d8c2-2cc6-4806-8bd8-5a4ddc485890, f139a82d-7d32-4799-80ce-6feecbb4b3df]
name : lxd-net3
_uuid : 9cc51612-237f-43dd-bb32-aebf191bf3c6
external_ids : {}
ha_chassis : [6fe7b2ed-d98b-4e64-9585-0eef369ba082, 701780b3-8930-4365-abe7-85ab0865356e, a7efafc7-9f65-4383-aa9d-ab8cad5443ae]
name : lxd-net2
We see that micro01
has 2 ha_chassis
entries:
_uuid : f139a82d-7d32-4799-80ce-6feecbb4b3df
chassis_name : micro01
external_ids : {}
priority : 26056
and
_uuid : 6fe7b2ed-d98b-4e64-9585-0eef369ba082
chassis_name : micro01
external_ids : {}
priority : 1138
The f139a82d-7d32-4799-80ce-6feecbb4b3df
is used in the ha_chassis_group
for lxd-net3
network and 6fe7b2ed-d98b-4e64-9585-0eef369ba082
is used in the has_chassis_group
for lxd-net2
network respectively.
So the problem with the PR is that it takes the entries in the ha_chassis
table and matches the first entry that matches the cluster member's chassis ID.
Then it will use that UUID to match against the ha_chassis_group
members entries for the ha_chassis_group
entry for the associated network.
This is likely to not match correctly, at least some of the time.
The naive approach we could use is to simply lookup each chassis member using the UUID and convert it to a chassis ID/name via:
ovn-nbctl get ha_chassis <group member UUID> chassis_name
But because each ovn-nbctl
invocation is expensive, because it connects to the ovn cluster and loads part of the database, I think we should load all ha_chassis members and iterate through them.
Once we switch to libovsdb, we can switch to just doing a chassis name lookup for each chassis member like Incus do here:
https://github.com/lxc/incus/blob/main/internal/server/network/ovn/ovn_nb_actions.go#L2401-L2410
So for now I think we should do this:
- First, get chassis members from the
ha_chassis_group
table, using the existing command we are using. - Next, get list of
ha_chassis
entries, and see if there is an entry that matches the cluster member's chassis ID and where its UUID is inside the chassis members list for this network's chassis group.
We can also add a test to https://github.com/canonical/lxd/blob/main/test/suites/network_ovn.sh such that when lxd shutdown
is called, the chassis group member entry should be removed.
@escabo id like to get this into 5.21/candidate as a cherry-pick before we push 5.21.3 to 5.21/stable if timing works out. |
return err | ||
} | ||
|
||
err = os.Remove(mountPath) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
This path depends on a
user-provided value
This path depends on a
user-provided value
This path depends on a
user-provided value
@escabo I think you need to rebase ( |
7ec39db
to
40beb1b
Compare
lxd/network/openvswitch/ovn.go
Outdated
@@ -1613,22 +1613,47 @@ func (o *OVN) ChassisGroupChassisAdd(haChassisGroupName OVNChassisGroup, chassis | |||
|
|||
// ChassisGroupChassisDelete deletes a chassis ID from an HA chassis group. | |||
func (o *OVN) ChassisGroupChassisDelete(haChassisGroupName OVNChassisGroup, chassisID string) error { | |||
// Check if chassis group exists. ovn-nbctl doesn't provide an "--if-exists" option for this. | |||
output, err := o.nbctl("--no-headings", "--data=bare", "--colum=name,ha_chassis", "find", "ha_chassis_group", fmt.Sprintf("name=%s", string(haChassisGroupName))) | |||
// map uuids with chassis_names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically use capital letters and full stops.
// map uuids with chassis_names | |
// Map UUIDs with chassis_names. |
lxd/network/openvswitch/ovn.go
Outdated
@@ -574,7 +574,7 @@ func (o *OVN) LogicalRouterPortSetIPv6Advertisements(portName OVNRouterPort, opt | |||
fmt.Sprintf("ipv6_ra_configs:send_periodic=%t", opts.SendPeriodic), | |||
} | |||
|
|||
var removeRAConfigKeys []string | |||
removeRAConfigKeys := make([]string, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this seems to have silence the pre-alloc check, it would be best to provide a non-zero capacity.
removeRAConfigKeys := make([]string, 0) | |
removeRAConfigKeys := make([]string, 0, X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have time @escabo please can you separate the chassis change and the linter fixes into separate commits?
lxd/network/openvswitch/ovn.go
Outdated
// Check if chassis group exists. ovn-nbctl doesn't provide an "--if-exists" option for this. | ||
output, err := o.nbctl("--no-headings", "--data=bare", "--colum=name,ha_chassis", "find", "ha_chassis_group", fmt.Sprintf("name=%s", string(haChassisGroupName))) | ||
// Map UUIDs with chassis_names. | ||
output, err := o.nbctl("--format=csv", "--no-headings", "--data=bare", "--column=_uuid,chassis_name", "list", "ha_chassis") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've tended to use --data=bare
when we we know that the first col wont contain the space char that we can split the columns by (so in this case uuid wont contain space), as this avoids having to undo the quoting that csv output may add.
We also tend to use the find
command rather than list
, but i would need to check why we've used that more commonly.
lxd/network/openvswitch/ovn.go
Outdated
|
||
for _, line := range lines { | ||
// a74125a8-b580-4763-b389-11ce2c8c5509,node2 | ||
fields := strings.Split(line, ",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Irrespective of whether you're splitting by comma (for csv) or space (for bare) we should use either strings.Cut
or strings.SplitN(line, ",",2)
to make explicit we expect only 2 fields.
strings.Cut
here would also avoid the need for the len
check below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like this?
for _, line := range lines {
// a74125a8-b580-4763-b389-11ce2c8c5509,node2
key, value, match := strings.Cut(line, ",")
if match {
uuidToChassis[key] = value
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you pick 10 for pre allocation in some cases?
for occurrences where there wasn't a source to get the size from and we needed a handful of elements, I thought this was a nice round number :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to just add //nolint:prealloc
after the lines we cant easily get the required capacity by using len
and set the capacity to 0.
Either that or use a multiple of the number of arguments that are going to be appended the slice.
For example in AddressSetRemove
you've set the capacity as len(addresses)
but because we add up to 7 elements to the args
slice for each address, it would be better to do len(adresses)*7
in my view.
Please can you rebase as ive merged the ovn ptr pr now and there are some conflicts, then this should be ready to merge and ill follow up with some tests. Thanks |
…o fix later comparison Signed-off-by: Eric Gelinas <[email protected]> better implementation for parsing a74125a8-b580-4763-b389-11ce2c8c5509,node2 Signed-off-by: Eric Gelinas <[email protected]>
done, thank you! |
Got a couple of whitespace linter errors there |
Signed-off-by: Eric Gelinas <[email protected]> nolint instead of 10 Signed-off-by: Eric Gelinas <[email protected]> empty lines Signed-off-by: Eric Gelinas <[email protected]>
Yep, fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Follow on from #14904 I suspected here #14904 (comment) the call to get `ha_chassis` list requires both `--format=csv` to get them onto one line per record, *and* `--data=bare` to avoid the output being (double in this case) quoted. This was preventing the previous fix from actually working. Also moved the zero-length initialised slices back to uninitialised ones, as nil slices can be appended to, and is more in keeping with how we use slices elsewhere. Also added a test for chassis removal on LXD shutdown.
Now gets the list of ha_chassis first to convert the received chassis_name into a _uuid for later comparison.
Fixes #14902.