refine error messages about not having a propolis address #7263

Open · wants to merge 1 commit into base: main
23 changes: 12 additions & 11 deletions nexus/src/app/instance.rs
@@ -1676,29 +1676,30 @@ impl super::Nexus {
 match vmm.runtime.state {
     DbVmmState::Running
     | DbVmmState::Rebooting
-    | DbVmmState::Migrating => {
-        Ok((vmm.clone(), SocketAddr::new(vmm.propolis_ip.ip(), vmm.propolis_port.into())))
-    }
+    | DbVmmState::Migrating => Ok((
+        vmm.clone(),
+        SocketAddr::new(
+            vmm.propolis_ip.ip(),
+            vmm.propolis_port.into(),
+        ),
+    )),

     DbVmmState::Starting
     | DbVmmState::Stopping
     | DbVmmState::Stopped
     | DbVmmState::Failed
-    | DbVmmState::Creating => {
+    | DbVmmState::Creating
+    | DbVmmState::Destroyed
+    | DbVmmState::SagaUnwound => {
         Err(Error::invalid_request(format!(
-            "cannot connect to serial console of instance in state \"{}\"",
Member Author:
i tried to make this a stronger error type than Error here, so we could have nice "serial console" errors for the serial console path, and "propolis" errors for the region{, snapshot}_replacement saga path. but this runs into a bunch of new wrinkles: instance_lookup.lookup_for() can fail in an Error-y way, and db_datastore.instance_fetch_with_vmm() can fail in an Error-y way.

so after spinning a little too long on error plumbing here, i decided to take the different approach of picking the lowest common denominator message that makes sense for any caller of propolis_addr_for_instance. maybe "cannot connect to admin socket" or something would be more accurate, but i don't want to be confusing in a user-facing error message - there's no secret backdoor socket in your instance! we're just talking to the VMM!

Member:

yeah, I think just making it generic is fine; that's fair. a slightly different wording is something like "this operation cannot be performed on instances in state ...", which is also generic but suggests that the specific thing you tried to do can't be done in that state. I worry a little that "administer an instance" is broad enough that it does include operations you can perform in these states --- i.e., is deleting an instance "administering" it?

Member Author:

i was reminded of this PR today when @jmpesp tripped over this; he incidentally has a good case where "administer" is probably too broad. the idea we were talking about was moving the state matching here to something chosen by an enum that describes what properties you want from a client.

do you want a client for which a VMM is running? probably true for serial console APIs. do you want a client where Propolis is running? that's more permissive, and reasonable if you're trying to send a VCR replacement.

SO: if James wants to base off of this PR, we'll probably get better error messages from that enum. and if not, well, that commit wherever it goes will do a better thing here.
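The "enum that describes what properties you want from a client" idea could be sketched roughly like this. This is a hypothetical illustration, not omicron code: the WantedClient name and the state_allows helper are made up here, and the state enum is a simplified copy of the DbVmmState variants visible in the diff above.

```rust
// Simplified copy of the DbVmmState variants from the diff above.
#[derive(Clone, Copy)]
enum DbVmmState {
    Starting,
    Running,
    Rebooting,
    Migrating,
    Stopping,
    Stopped,
    Failed,
    Creating,
    Destroyed,
    SagaUnwound,
}

// Hypothetical selector enum: callers say what they need the client for,
// and the acceptable-state check follows from that.
enum WantedClient {
    // e.g. serial console APIs: the VMM itself must be running
    VmmRunning,
    // e.g. sending a VCR replacement: Propolis just has to be up
    PropolisRunning,
}

fn state_allows(wanted: WantedClient, state: DbVmmState) -> bool {
    match wanted {
        WantedClient::VmmRunning => matches!(
            state,
            DbVmmState::Running | DbVmmState::Rebooting | DbVmmState::Migrating
        ),
        // strictly more permissive: everything above, plus Starting
        WantedClient::PropolisRunning => matches!(
            state,
            DbVmmState::Starting
                | DbVmmState::Running
                | DbVmmState::Rebooting
                | DbVmmState::Migrating
        ),
    }
}
```

With this shape, each caller picks the variant matching its needs and gets an error message scoped to that variant, rather than one lowest-common-denominator string.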

Contributor:

> the idea we were talking about was moving the state matching here to something chosen by an enum that describes what properties you want from a client here

I'm a latecomer here, but I'm not the biggest fan of this approach. Because the active Propolis's state can change (or the active VMM can disappear!) as soon as this field is read, callers of this function who then create a Propolis client still have to account for the full range of errors they can receive from their calls to that client (including "no server at that address"). Given that, I would just make this routine return the active VMM address if there is one, have each caller make whatever Propolis call it wants to make, and then have each caller decide how it wants to handle the errors those calls can return. (In the serial console case, you'd detect "no server at that address" and "Propolis rejected the request" and turn them into something like "instance's serial console is currently unavailable"; in the region replacement case you'd turn those errors into some internal error that says to try the replacement again later.)
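The per-caller error mapping described in this comment might look roughly like the following. The ClientError type here is a made-up stand-in (the real Propolis client is progenitor-generated and richer); the two message strings are the ones suggested above.

```rust
// Hypothetical stand-in for the failures a Propolis client call can
// produce; the real client error type is progenitor-generated.
#[derive(Debug)]
enum ClientError {
    ConnectionRefused,    // "no server at that address"
    RejectedRequest(u16), // Propolis returned an error response
}

// Serial-console path: collapse both failure modes into one user-facing
// message, since the console is unavailable either way.
fn serial_console_error(e: ClientError) -> String {
    match e {
        ClientError::ConnectionRefused | ClientError::RejectedRequest(_) => {
            "instance's serial console is currently unavailable".to_string()
        }
    }
}

// Region-replacement path: the same failures become an internal,
// retryable error instead of a user-facing one.
fn replacement_error(e: ClientError) -> String {
    match e {
        ClientError::ConnectionRefused | ClientError::RejectedRequest(_) => {
            "replacement target unavailable; retry later".to_string()
        }
    }
}
```

The point is that the mapping lives with each caller, which already has to handle these errors anyway because the VMM can vanish between the address lookup and the client call.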

If we do still want to allow callers to reason about VMM states before calling into Propolis (e.g. so they can return more descriptive errors if a client asks to do something that's obviously not going to work), then I might turn this approach on its head a bit:

  • remove propolis_addr_for_instance entirely and turn it into a method on InstanceAndActiveVmm
  • add whatever other helper functions to InstanceAndActiveVmm seem appropriate to answer the questions callers might want to ask (e.g. have a fn has_reached_running that returns true for Running, Rebooting, and Migrating, or a fn has_launched_propolis that also returns true if the VMM is Starting)
  • make clients look up/fetch their own instance data and then combine these functions in whatever way they think is most appropriate
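The helper predicates proposed in the bullets above could be sketched like this, assuming a simplified stand-in for InstanceAndActiveVmm (the real omicron type carries much more than a state field):

```rust
// Simplified copy of the DbVmmState variants from the diff above.
#[derive(Clone, Copy)]
enum DbVmmState {
    Starting,
    Running,
    Rebooting,
    Migrating,
    Stopping,
    Stopped,
    Failed,
    Creating,
    Destroyed,
    SagaUnwound,
}

// Simplified stand-in for omicron's InstanceAndActiveVmm: just the state
// of the active VMM, if there is one.
struct InstanceAndActiveVmm {
    vmm_state: Option<DbVmmState>,
}

impl InstanceAndActiveVmm {
    /// True once the VMM has reached a running state.
    fn has_reached_running(&self) -> bool {
        matches!(
            self.vmm_state,
            Some(
                DbVmmState::Running
                    | DbVmmState::Rebooting
                    | DbVmmState::Migrating
            )
        )
    }

    /// More permissive: also true while the VMM is Starting, since
    /// Propolis has been launched by then even if the instance isn't up.
    fn has_launched_propolis(&self) -> bool {
        self.has_reached_running()
            || matches!(self.vmm_state, Some(DbVmmState::Starting))
    }
}
```

Callers then fetch their own instance data and combine these predicates however their endpoint needs.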

Member Author:

oh, yeah, having predicates on the InstanceAndActiveVmm we've loaded from the database makes a lot of sense to me.

my main motivator here is that errors can indicate whether it's a situation that can be resolved by waiting and trying again in a bit, or whether Propolis is gone and will continue to be gone. so "instance's serial console is currently unavailable" would be a pretty unfortunate message, but

if !instance.has_reached_running() {
    return Err(Error::invalid_request("instance is still starting"));
} else if instance.is_going_away() {
    return Err(Error::invalid_request("serial console is unavailable"));
} else {
    // we suspect Propolis is in a serial console-friendly state, have at it
}

seems pretty workable? (that said, i'm not sure what to call the set of states that are Stopping | Stopped | Failed | Creating | Destroyed | SagaUnwound, and is_going_away is the least bad i could come up with for an example.)
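That is_going_away set could be written down as a predicate like this (sketch only: the state list is exactly the one given above, and the name is the author's "least bad" placeholder, not settled API):

```rust
// Simplified copy of the DbVmmState variants from the diff above.
#[derive(Clone, Copy)]
enum DbVmmState {
    Starting,
    Running,
    Rebooting,
    Migrating,
    Stopping,
    Stopped,
    Failed,
    Creating,
    Destroyed,
    SagaUnwound,
}

// "Going away": states where Propolis is absent or on its way out, so
// retrying in a moment will not help.
fn is_going_away(state: DbVmmState) -> bool {
    matches!(
        state,
        DbVmmState::Stopping
            | DbVmmState::Stopped
            | DbVmmState::Failed
            | DbVmmState::Creating
            | DbVmmState::Destroyed
            | DbVmmState::SagaUnwound
    )
}
```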

+            "cannot administer instance in state \"{}\"",
             state.effective_state(),
         )))
     }

-    DbVmmState::Destroyed | DbVmmState::SagaUnwound => Err(Error::invalid_request(
-        "cannot connect to serial console of instance in state \"Stopped\"",
-    )),
 }
 } else {
     Err(Error::invalid_request(format!(
-        "instance is {} and has no active serial console \
-         server",
+        "instance is {} and cannot be administered",
         state.effective_state(),
     )))
 }
50 changes: 50 additions & 0 deletions nexus/tests/integration_tests/instances.rs
@@ -5394,6 +5394,9 @@ async fn test_instance_serial(cptestctx: &ControlPlaneTestContext) {
// Make sure we get a 404 if we try to access the serial console before creation.
let instance_serial_url =
get_instance_url(format!("{}/serial-console", instance_name).as_str());
let instance_serial_stream_url = get_instance_url(
format!("{}/serial-console/stream", instance_name).as_str(),
);
let error: HttpErrorResponseBody = NexusRequest::expect_failure(
client,
StatusCode::NOT_FOUND,
@@ -5491,6 +5494,53 @@ async fn test_instance_serial(cptestctx: &ControlPlaneTestContext) {
> instance.runtime.time_run_state_updated
);

// As the instance is now stopping, we can't connect to its serial console
// anymore. We also can't get its cached data; the (simulated) Propolis is
// going away.

let builder = RequestBuilder::new(
client,
http::Method::GET,
&instance_serial_stream_url,
)
.expect_status(Some(http::StatusCode::BAD_REQUEST));

let error = NexusRequest::new(builder)
.authn_as(AuthnMode::PrivilegedUser)
.execute()
.await
.unwrap()
.parsed_body::<dropshot::HttpErrorResponseBody>()
.unwrap();

assert_eq!(
error.message,
"cannot administer instance in state \"stopping\"",
);

// Have to pass some offset for the cached data request otherwise we'll
// error out of Nexus early on while validating parameters, before
// discovering the serial console is unreachable.
let builder = RequestBuilder::new(
client,
http::Method::GET,
&format!("{}&from_start=0", instance_serial_url),
)
.expect_status(Some(http::StatusCode::BAD_REQUEST));

let error = NexusRequest::new(builder)
.authn_as(AuthnMode::PrivilegedUser)
.execute()
.await
.unwrap()
.parsed_body::<dropshot::HttpErrorResponseBody>()
.unwrap();

assert_eq!(
error.message,
"cannot administer instance in state \"stopping\"",
);

let instance = instance_next;
instance_simulate(nexus, &instance_id).await;
let instance_next =