Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network: Fix OVN ChassisGroupChassisDelete to convert chassis_name #14904

Merged
merged 2 commits into from
Feb 11, 2025

Conversation

escabo
Copy link
Contributor

@escabo escabo commented Jan 31, 2025

Now gets the list of ha_chassis first to convert the received chassis_name into a _uuid for later comparison.
Fixes #14902.

@escabo escabo marked this pull request as draft January 31, 2025 21:12
Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately this issue was introduced by myself in 1d98019 trying to workaround the issue that the ovn-nbctl ha-chassis-group-remove-chassis command doesn't provide an --if-exists option where if the chassis member doesn't exist in the group, then no error will be raised.

Previously if the chassis was a member of multiple chassis groups (OVN networks) then if one of the OVN networks couldn't
be started (and thus its chassis wasn't added to that network's chassis group) then attempting to remove unavailable OVN
network would fail because the previous implementation thought the chassis was in the chassis group, when infact it was
in another chassis group.

However this PR as it stands introduces another issue because it does not consider that the same chassis can have multiple ha_chassis entries (one per ovn network). This is done to achieve the per-ovn network chassis failover priority, in order to distribute uplink traffic across the cluster.

Taking the example of a microcloud with 2 OVN networks:

ovn-nbctl find ha_chassis
_uuid               : f139a82d-7d32-4799-80ce-6feecbb4b3df
chassis_name        : micro01
external_ids        : {}
priority            : 26056

_uuid               : 6fe7b2ed-d98b-4e64-9585-0eef369ba082
chassis_name        : micro01
external_ids        : {}
priority            : 1138

_uuid               : a1a8d8c2-2cc6-4806-8bd8-5a4ddc485890
chassis_name        : micro03
external_ids        : {}
priority            : 15471

_uuid               : a07811af-327d-4467-ae7e-3be7afc98098
chassis_name        : micro02
external_ids        : {}
priority            : 10207

_uuid               : a7efafc7-9f65-4383-aa9d-ab8cad5443ae
chassis_name        : micro03
external_ids        : {}
priority            : 29049

_uuid               : 701780b3-8930-4365-abe7-85ab0865356e
chassis_name        : micro02
external_ids        : {}
priority            : 22819
ovn-nbctl find ha_chassis_group
_uuid               : 2df9866e-03a8-4bb9-9447-5f928014f338
external_ids        : {}
ha_chassis          : [a07811af-327d-4467-ae7e-3be7afc98098, a1a8d8c2-2cc6-4806-8bd8-5a4ddc485890, f139a82d-7d32-4799-80ce-6feecbb4b3df]
name                : lxd-net3

_uuid               : 9cc51612-237f-43dd-bb32-aebf191bf3c6
external_ids        : {}
ha_chassis          : [6fe7b2ed-d98b-4e64-9585-0eef369ba082, 701780b3-8930-4365-abe7-85ab0865356e, a7efafc7-9f65-4383-aa9d-ab8cad5443ae]
name                : lxd-net2

We see that micro01 has 2 ha_chassis entries:

_uuid               : f139a82d-7d32-4799-80ce-6feecbb4b3df
chassis_name        : micro01
external_ids        : {}
priority            : 26056

and

_uuid               : 6fe7b2ed-d98b-4e64-9585-0eef369ba082
chassis_name        : micro01
external_ids        : {}
priority            : 1138

The f139a82d-7d32-4799-80ce-6feecbb4b3df is used in the ha_chassis_group for lxd-net3 network and 6fe7b2ed-d98b-4e64-9585-0eef369ba082 is used in the has_chassis_group for lxd-net2 network respectively.

So the problem with the PR is that it takes the entries in the ha_chassis table and matches the first entry that matches the cluster member's chassis ID.

Then it will use that UUID to match against the ha_chassis_group members entries for the ha_chassis_group entry for the associated network.

This is likely to not match correctly, at least some of the time.

The naive approach we could use is to simply lookup each chassis member using the UUID and convert it to a chassis ID/name via:

ovn-nbctl get ha_chassis <group member UUID> chassis_name

But because each ovn-nbctl invocation is expensive, because it connects to the ovn cluster and loads part of the database, I think we should load all ha_chassis members and iterate through them.

Once we switch to libovsdb, we can switch to just doing a chassis name lookup for each chassis member like Incus do here:

https://github.com/lxc/incus/blob/main/internal/server/network/ovn/ovn_nb_actions.go#L2401-L2410

So for now I think we should do this:

  1. First, get chassis members from the ha_chassis_group table, using the existing command we are using.
  2. Next, get list of ha_chassis entries, and see if there is an entry that matches the cluster member's chassis ID and where its UUID is inside the chassis members list for this network's chassis group.

We can also add a test to https://github.com/canonical/lxd/blob/main/test/suites/network_ovn.sh such that when lxd shutdown is called, the chassis group member entry should be removed.

@tomponline
Copy link
Member

@escabo id like to get this into 5.21/candidate as a cherry-pick before we push 5.21.3 to 5.21/stable if timing works out.

@github-actions github-actions bot added Documentation Documentation needs updating API Changes to the REST API labels Feb 7, 2025
return err
}

err = os.Remove(mountPath)

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a
user-provided value
.
This path depends on a
user-provided value
.
This path depends on a
user-provided value
.
This path depends on a
user-provided value
.
@simondeziel
Copy link
Member

@escabo I think you need to rebase (git rebase -i main) to avoid pushing back old commits that are already in main.

@github-actions github-actions bot removed Documentation Documentation needs updating API Changes to the REST API labels Feb 9, 2025
@escabo escabo force-pushed the fix_14902 branch 8 times, most recently from 7ec39db to 40beb1b Compare February 9, 2025 19:26
@@ -1613,22 +1613,47 @@ func (o *OVN) ChassisGroupChassisAdd(haChassisGroupName OVNChassisGroup, chassis

// ChassisGroupChassisDelete deletes a chassis ID from an HA chassis group.
func (o *OVN) ChassisGroupChassisDelete(haChassisGroupName OVNChassisGroup, chassisID string) error {
// Check if chassis group exists. ovn-nbctl doesn't provide an "--if-exists" option for this.
output, err := o.nbctl("--no-headings", "--data=bare", "--colum=name,ha_chassis", "find", "ha_chassis_group", fmt.Sprintf("name=%s", string(haChassisGroupName)))
// map uuids with chassis_names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically use capital letters and full stops.

Suggested change
// map uuids with chassis_names
// Map UUIDs with chassis_names.

@@ -574,7 +574,7 @@ func (o *OVN) LogicalRouterPortSetIPv6Advertisements(portName OVNRouterPort, opt
fmt.Sprintf("ipv6_ra_configs:send_periodic=%t", opts.SendPeriodic),
}

var removeRAConfigKeys []string
removeRAConfigKeys := make([]string, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this seems to have silence the pre-alloc check, it would be best to provide a non-zero capacity.

Suggested change
removeRAConfigKeys := make([]string, 0)
removeRAConfigKeys := make([]string, 0, X)

@tomponline tomponline marked this pull request as ready for review February 11, 2025 13:06
simondeziel
simondeziel previously approved these changes Feb 11, 2025
Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have time @escabo please can you separate the chassis change and the linter fixes into separate commits?

// Check if chassis group exists. ovn-nbctl doesn't provide an "--if-exists" option for this.
output, err := o.nbctl("--no-headings", "--data=bare", "--colum=name,ha_chassis", "find", "ha_chassis_group", fmt.Sprintf("name=%s", string(haChassisGroupName)))
// Map UUIDs with chassis_names.
output, err := o.nbctl("--format=csv", "--no-headings", "--data=bare", "--column=_uuid,chassis_name", "list", "ha_chassis")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've tended to use --data=bare when we we know that the first col wont contain the space char that we can split the columns by (so in this case uuid wont contain space), as this avoids having to undo the quoting that csv output may add.

We also tend to use the find command rather than list, but i would need to check why we've used that more commonly.


for _, line := range lines {
// a74125a8-b580-4763-b389-11ce2c8c5509,node2
fields := strings.Split(line, ",")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Irrespective of whether you're splitting by comma (for csv) or space (for bare) we should use either strings.Cut or strings.SplitN(line, ",",2) to make explicit we expect only 2 fields.

strings.Cut here would also avoid the need for the len check below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like this?

	for _, line := range lines {
		// a74125a8-b580-4763-b389-11ce2c8c5509,node2
		key, value, match := strings.Cut(line, ",")
		if match {
			uuidToChassis[key] = value
		}
	}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly.

Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you pick 10 for pre allocation in some cases?

@escabo
Copy link
Contributor Author

escabo commented Feb 11, 2025

Why did you pick 10 for pre allocation in some cases?

for occurrences where there wasn't a source to get the size from and we needed a handful of elements, I thought this was a nice round number :-)

Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to just add //nolint:prealloc after the lines we cant easily get the required capacity by using len and set the capacity to 0.

Either that or use a multiple of the number of arguments that are going to be appended the slice.

For example in AddressSetRemove you've set the capacity as len(addresses) but because we add up to 7 elements to the args slice for each address, it would be better to do len(adresses)*7 in my view.

@tomponline tomponline changed the title Modified ChassisGroupChassisDelete to convert chassis_name Network: Fix OVN ChassisGroupChassisDelete to convert chassis_name Feb 11, 2025
@tomponline
Copy link
Member

tomponline commented Feb 11, 2025

Please can you rebase as ive merged the ovn ptr pr now and there are some conflicts, then this should be ready to merge and ill follow up with some tests. Thanks

…o fix later comparison

Signed-off-by: Eric Gelinas <[email protected]>

better implementation for parsing a74125a8-b580-4763-b389-11ce2c8c5509,node2

Signed-off-by: Eric Gelinas <[email protected]>
@escabo
Copy link
Contributor Author

escabo commented Feb 11, 2025

Please can you rebase as ive merged the ovn ptr pr now and there are some conflicts, then this should be ready to merge and ill follow up with some tests. Thanks

done, thank you!

@tomponline
Copy link
Member

Got a couple of whitespace linter errors there

Signed-off-by: Eric Gelinas <[email protected]>

nolint instead of 10

Signed-off-by: Eric Gelinas <[email protected]>

empty lines

Signed-off-by: Eric Gelinas <[email protected]>
@escabo
Copy link
Contributor Author

escabo commented Feb 11, 2025

Got a couple of whitespace linter errors there

Yep, fixed.

Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@tomponline tomponline merged commit 1633a4a into canonical:main Feb 11, 2025
26 checks passed
@escabo escabo deleted the fix_14902 branch February 11, 2025 22:45
tomponline added a commit that referenced this pull request Feb 12, 2025
Follow on from #14904

I suspected here
#14904 (comment) the
call to get `ha_chassis` list requires both `--format=csv` to get them
onto one line per record, *and* `--data=bare` to avoid the output being
(double in this case) quoted.

This was preventing the previous fix from actually working. 

Also moved the zero-length initialised slices back to uninitialised
ones, as nil slices can be appended to, and is more in keeping with how
we use slices elsewhere.

Also added a test for chassis removal on LXD shutdown.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OVN chassis stick around when evacuating a cluster member
3 participants