Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(request-response): Avoid hanging at capacity and on dial IO errors #5419

Closed
wants to merge 1 commit into from

Conversation

oblique
Copy link
Contributor

@oblique oblique commented May 25, 2024

Description

This is an alternative fix of #5417 that does not introduce a breaking change.

The idea is to reschedule the request but also avoid a potential infinite rescheduling by applying a timeout on it.

This fixes also a potential infinite rescheduling issue on dial IO errors:

StreamUpgradeError::Io(e) => {
tracing::debug!(
"outbound stream for request {} failed: {e}, retrying",
message.request_id
);
self.requested_outbound.push_back(message);
}

Notes & open questions

Due to the nature of rescheduling and timeout, it is very hard to create a unit test. I couldn't find a reliable way of doing it.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@thomaseizinger
Copy link
Contributor

I don't think rescheduling like that is a good idea. Causing more load on an overloaded system is not good. Any form of retry should have an (exponential) backoff.

So far, we've avoided retries and left them to the user. Only they know whether a message is safe to retry upon an IO or other error.

HTTP can do automated retries for the user but only because it is extremely rich in semantics. We don't have that, we just ship bytes around.

@oblique
Copy link
Contributor Author

oblique commented May 25, 2024

Do you prefer the #5417 then? But without the new variant?

The retry on dial failure was there from before. Should I remove it or keep the fix for that?

@zvolin
Copy link
Contributor

zvolin commented May 26, 2024

Maybe we could skip retrying here and just keep the hard timeout + io error on max streams opened

@thomaseizinger
Copy link
Contributor

Do you prefer the #5417 then? But without the new variant?

I think so yes. Unless we have a requirement to specifically match on that error, I would say using the Io error variant is preferable.

The retry on dial failure was there from before. Should I remove it or keep the fix for that?

Hmm, I didn't remember that. Don't really have an opinion here. I'll defer to @jxs.

@oblique
Copy link
Contributor Author

oblique commented May 29, 2024

Closing in favor of #5417 and #5429

@oblique oblique closed this May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants