fix(request-response): Avoid hanging at capacity and on dial IO errors #5419

oblique · 2024-05-25T11:25:12Z

Description

This is an alternative to #5417 that does not introduce a breaking change.

The idea is to reschedule the request, but to apply a timeout to it so that it cannot be rescheduled indefinitely.

This also fixes a potential infinite rescheduling issue on dial IO errors:

rust-libp2p/protocols/request-response/src/handler.rs

Lines 238 to 244 in 94fef37

    
    StreamUpgradeError::Io(e) => {
        tracing::debug!(
            "outbound stream for request {} failed: {e}, retrying",
            message.request_id
        );
        self.requested_outbound.push_back(message);
    }
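As a rough illustration of the idea (not the actual diff in this PR), rescheduling can be bounded by attaching a deadline to each queued request and refusing to push it back once that deadline has passed. `Pending`, `Rescheduled`, and `reschedule` are hypothetical names invented for this sketch; the real handler's types and fields may differ.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the handler's outbound message type.
#[derive(Debug, PartialEq)]
struct Pending {
    request_id: u64,
    /// Point in time after which the request must not be rescheduled again.
    deadline: Instant,
}

impl Pending {
    /// Create a request that may be rescheduled until `now + timeout`.
    fn new(request_id: u64, now: Instant, timeout: Duration) -> Self {
        Pending { request_id, deadline: now + timeout }
    }
}

/// Outcome of trying to reschedule a request after a failed stream upgrade.
#[derive(Debug, PartialEq)]
enum Rescheduled {
    /// The request was pushed back onto the queue.
    Queued,
    /// The deadline has passed; the caller should report a timeout for this id.
    TimedOut(u64),
}

/// Push the request back onto the queue unless its deadline has passed.
fn reschedule(queue: &mut VecDeque<Pending>, msg: Pending, now: Instant) -> Rescheduled {
    if now >= msg.deadline {
        Rescheduled::TimedOut(msg.request_id)
    } else {
        queue.push_back(msg);
        Rescheduled::Queued
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside `reschedule`, is only a choice made here to keep the sketch easy to exercise deterministically.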

Notes & open questions

Due to the nature of rescheduling and timeout, it is very hard to create a unit test. I couldn't find a reliable way of doing it.

Change checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
A changelog entry has been made in the appropriate crates

thomaseizinger · 2024-05-25T22:30:50Z

I don't think rescheduling like that is a good idea. Causing more load on an overloaded system is not good. Any form of retry should have an (exponential) backoff.

So far, we've avoided retries and left them to the user. Only they know whether a message is safe to retry upon an IO or other error.

HTTP can do automated retries for the user but only because it is extremely rich in semantics. We don't have that, we just ship bytes around.
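For illustration, a user-side retry with exponential backoff could look roughly like the sketch below. `backoff_delay` and `retry_with_backoff` are hypothetical helpers written for this example and are not part of rust-libp2p; the `sleep` callback is injected so that the caller decides how to wait.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Run `op` up to `max_attempts` times (always at least once), waiting with
/// exponential backoff between failed attempts. The last error is returned
/// if every attempt fails.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: give up and surface the error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
        }
    }
}
```

Whether a given request is safe to pass through such a helper at all remains the caller's decision, which is the point made above.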

oblique · 2024-05-25T23:13:37Z

Do you prefer the #5417 then? But without the new variant?

The retry on dial failure was there from before. Should I remove it or keep the fix for that?

zvolin · 2024-05-26T01:14:36Z

Maybe we could skip retrying here and just keep the hard timeout plus an IO error when the maximum number of streams is open.
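A minimal sketch of that alternative, assuming a simple per-connection counter: when the limit is reached, the attempt fails immediately with an IO error instead of being queued again. `OutboundStreams` and `try_open` are hypothetical names for this example, and the error message is made up; the merged fix may be structured differently.

```rust
use std::io;

/// Hypothetical bookkeeping for the outbound streams of one connection.
struct OutboundStreams {
    open: usize,
    max: usize,
}

impl OutboundStreams {
    /// Reserve a slot for a new outbound stream, or fail right away with an
    /// IO error when the limit is reached (no rescheduling, no retry).
    fn try_open(&mut self) -> io::Result<()> {
        if self.open >= self.max {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "max outbound streams reached",
            ));
        }
        self.open += 1;
        Ok(())
    }
}
```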

thomaseizinger · 2024-05-26T23:43:38Z

> Do you prefer the #5417 then? But without the new variant?

I think so, yes. Unless we have a requirement to specifically match on that error, I would say using the Io error variant is preferable.

> The retry on dial failure was there from before. Should I remove it or keep the fix for that?

Hmm, I didn't remember that. Don't really have an opinion here. I'll defer to @jxs.

oblique · 2024-05-29T10:42:26Z

Closing in favor of #5417 and #5429

fix(request-response): Avoid hanging at capacity and on dial IO errors

ca69a02

oblique force-pushed the fix/timeout-on-rescheduled branch from 7fea44b to ca69a02 Compare May 25, 2024 11:27

oblique mentioned this pull request May 25, 2024

fix(request-response): Report failure when streams are at capacity #5417

Merged

4 tasks

oblique closed this May 29, 2024
