refine error messages about not having a propolis address #7263

iximeow · 2024-12-17T19:57:45Z

the chronology here seems to be: Nexus learned to talk directly to Propolis to communicate with an instance's serial console, that "instance to Propolis address" translation got outlined into propolis_addr_for_instance, then for region and region snapshot replacment sagas we needed to also talk to the relevant Propolis. at that point, propolis_client_for_instance used
propolis_addr_for_instance too.

this is all fine, but propolis_addr_for_instance kept a few serial console-specific error messages, which would show up in places like region replacement if an instance is stopped or shut down at an inopportune time.

in the process, i discovered that a request for an instance's serial console history with incorrect parameters manifests as a 500 rather than a 400. Propolis even returns a 400, but the failure to establish a websocket console is presumed an internal error in Nexus, and so it's wrapped as a 500 whose contents talk about a 400! fix that to just return a 400.

iximeow · 2024-12-17T20:03:54Z

nexus/src/app/instance.rs

+                return Err(Error::invalid_request(
+                    "Exactly one of 'from_start' \
+                    or 'most_recent' must be specified.",
+                ));


don't super love this, and maybe @hawkw's new dropshot error type stuff can help? this reproduces a check that's done in both Propolis and propolis-sim. when they see invalid parameters, they respond with a 400 whose message is the same as here. my thinking is, the connection to Propolis could error for other reasons, and for any other reason it would be correctly categorized as a 500. and i didn't want to just forward any 400 error from Propolis out to users here because Propolis errors aren't necessarily intended for direct end-user consumption...

if parameter validation gets more strict on the Propolis side, though, things checked there and not here will be 500s..

Yeah, hm. I think once #7196 we could have Propolis return a structured error here. But, since we are constructing the request in Nexus, I think it's reasonable-ish to say that any invalid request sent to Propolis is arguably an "internal server error" and we should be validating it here beforehand. On the other hand, duplicating the validation also feels bad. I dunno...

suppose i never replied here but, my thought primarily is that a 500 because a client didn't send parameters we didn't like really ought to be a 400 (ideally with a hint about what to do better next time). either duplicating the validation or matching on the error from Propolis, i don't really have an opinion at that point. but! not being familiar with the structured dropshot error stuff, i've backed this out and i'll stuff it in a draft + issue for the time being

iximeow · 2024-12-17T20:12:50Z

nexus/src/app/instance.rs

                    Err(Error::invalid_request(format!(
-                        "cannot connect to serial console of instance in state \"{}\"",


i tried to make this a stronger error type than Error here, so we could have nice "serial console" errors for the serial console path, and "propolis" errors for the region{, snapshot}_replacement saga path. but this runs into a bunch of new wrinkles like, instance_lookup.lookup_for() can fail in an Error-y way, db_datastore.instance_fetch_with_vmm() can fail in an Error-y way.

so after spinning a little too long on error plumbing here, i decided to take the different approach of picking the lowest common denominator message that makes sense for any caller of propolis_addr_for_instance. maybe "cannot connect to admin socket" or something would be more accurate, but i don't want to be confusing in a user-facing error message - there's no secret backdoor socket in your instance! we're just talking to the VMM!

yeah, I think just making it generic is fine. 's fair. a slightly different wording is something like "this operation cannot be performed on instances in state ...", which is also generic but suggests that the specific thing you tried to do can't be done in that state. I worry a little that "administer an instance" is broad enough that it does include operations you can perform in these states --- i.e., is deleting an instance "administering" it?

i was reminded of this PR today when @jmpesp tripped over this and incidentally has a good case where "administer" is probably too broad - the idea we were talking about was moving the state matching here to something chosen by an enum that describes what properties you want from a client here.

do you want a client for which a VMM is running? probably true for serial console APIs. do you want a client where Propolis is running? that's more permissive, and reasonable if you're trying to send a VCR replacement.

SO: if James wants to base off of this PR, we'll probably get better error messages from that enum. and if not, well, that commit wherever it goes will do a better thing here.

the idea we were talking about was moving the state matching here to something chosen by an enum that describes what properties you want from a client here

I'm a latecomer here, but I'm not the biggest fan of this approach. Because the active Propolis's state can change (or the active VMM can disappear!) as soon as this field is read, callers of this function who then create a Propolis client still have to account for the full range of errors they can receive from their calls to that client (including "no server at that address"). Given that, I would just make this routine return the active VMM address if there is one, have each caller make whatever Propolis call it wants to make, and then have each caller decide how it wants to handle the errors those calls can return. (In the serial console case, you'd detect "no server at that address" and "Propolis rejected the request" and turn them into something like "instance's serial console is currently unavailable"; in the region replacement case you'd turn those errors into some internal error that says to try the replacement again later.)

If we do still want to allow callers to reason about VMM states before calling into Propolis (e.g. so they can return more descriptive errors if a client asks to do something that's obviously not going to work), then I might turn this approach on its head a bit:

remove propolis_addr_for_instance entirely and turn it into a method on InstanceAndActiveVmm

add whatever other helper functions to InstanceAndActiveVmm seem appropriate to answer the questions callers might want to ask (e.g. have a fn has_reached_running that returns true for Running, Rebooting, and Migrating, or a fn has_launched_propolis that also returns true if the VMM is Starting)

make clients look up/fetch their own instance data and then combine these functions in whatever way they think is most appropriate

oh, yeah, having predicates on the InstanceAndActiveVmm we've loaded from the database makes a lot of sense to me.

my main motivator here is that we can have errors indicate if it's a situation that can be resolved by waiting and trying again in a bit, or if Propolis is gone and will continue to be gone. so "instance's serial console is currently unavailable" would be a pretty unfortunate message, but

if !instance.has_reached_running() { "instance is still starting" } else if instance.is_going_away() { "serial console is unavailable" } else { // we suspect Propolis is in a serial console-friendly state, have at it }

seems pretty workable? (that said, i'm not sure what to call the set of states that are Stopping | Stopped | Failed | Creating | Destroyed | SagaUnwound, and is_going_away is the least bad i could come up with for an example.)

v2, without also messing with serial console parameters along the way

iximeow commented Dec 17, 2024

View reviewed changes

iximeow force-pushed the ixi/propolis-addr-err-messages branch from 365b3f0 to 1214a47 Compare December 17, 2024 20:37

hawkw self-requested a review December 17, 2024 21:02

iximeow force-pushed the ixi/propolis-addr-err-messages branch from 1214a47 to 8784bd4 Compare February 28, 2025 23:18

refine error messages about not having a propolis address

a565ec3

v2, without also messing with serial console parameters along the way

iximeow force-pushed the ixi/propolis-addr-err-messages branch from 8784bd4 to a565ec3 Compare February 28, 2025 23:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refine error messages about not having a propolis address #7263

refine error messages about not having a propolis address #7263

iximeow commented Dec 17, 2024

iximeow Dec 17, 2024 •

edited

Loading

hawkw Dec 17, 2024

iximeow Feb 28, 2025

iximeow Dec 17, 2024

hawkw Dec 18, 2024

iximeow Feb 28, 2025

gjcolombo Mar 3, 2025

iximeow Mar 3, 2025

		Err(Error::invalid_request(format!(
		"cannot connect to serial console of instance in state \"{}\"",

refine error messages about not having a propolis address #7263

Are you sure you want to change the base?

refine error messages about not having a propolis address #7263

Conversation

iximeow commented Dec 17, 2024

iximeow Dec 17, 2024 • edited Loading

Choose a reason for hiding this comment

hawkw Dec 17, 2024

Choose a reason for hiding this comment

iximeow Feb 28, 2025

Choose a reason for hiding this comment

iximeow Dec 17, 2024

Choose a reason for hiding this comment

hawkw Dec 18, 2024

Choose a reason for hiding this comment

iximeow Feb 28, 2025

Choose a reason for hiding this comment

gjcolombo Mar 3, 2025

Choose a reason for hiding this comment

iximeow Mar 3, 2025

Choose a reason for hiding this comment

iximeow Dec 17, 2024 •

edited

Loading