
fix(dcutr): handle empty holepunch_candidates #5583

Open · stormshield-frb wants to merge 5 commits into master
Conversation

@stormshield-frb (Contributor) commented Aug 30, 2024

Description

A few months ago, we were experiencing occasional, hard-to-reproduce DCUtR failures. After some investigation, the cause turned out to be a race condition: identify is sometimes a little slow, and the DCUtR handler is created before any identify event has been received. On its own, this is not necessarily a problem. But if the race happens while DCUtR's list of hole-punch candidates is empty, DCUtR will always fail for that connection. When a new relayed connection is established, DCUtR creates a Handler for it, which is responsible for the hole punching. However, the candidates the Handler uses are the ones known at instantiation time, so any later NewExternalAddrCandidate updates are never forwarded to existing Handlers.
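The shape of the fix can be sketched with std types standing in for the libp2p machinery (the `Command` enum, `Handler` struct, and method names below are illustrative, not the actual dcutr API): the behaviour forwards each new external address candidate to already-instantiated handlers, instead of freezing the candidate list at Handler construction.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Illustrative stand-in for a multiaddress.
type Multiaddr = String;

// Hypothetical command forwarded from the behaviour to a live handler.
enum Command {
    NewExternalAddrCandidate(Multiaddr),
}

struct Handler {
    candidates: Vec<Multiaddr>,
    commands: Receiver<Command>,
}

impl Handler {
    // Before the fix, the candidates were fixed here, at instantiation time.
    fn new(initial: Vec<Multiaddr>, commands: Receiver<Command>) -> Self {
        Handler { candidates: initial, commands }
    }

    // After the fix, identify results arriving late still reach the handler.
    fn poll_commands(&mut self) {
        while let Ok(Command::NewExternalAddrCandidate(addr)) = self.commands.try_recv() {
            if !self.candidates.contains(&addr) {
                self.candidates.push(addr);
            }
        }
    }
}

fn main() {
    let (tx, rx): (Sender<Command>, Receiver<Command>) = channel();
    // Relayed connection established before identify completed: empty list.
    let mut handler = Handler::new(Vec::new(), rx);
    // identify finishes later and reports an observed address.
    tx.send(Command::NewExternalAddrCandidate("/ip4/1.2.3.4/tcp/4001".into()))
        .unwrap();
    handler.poll_commands();
    assert_eq!(handler.candidates.len(), 1);
}
```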

This PR upstreams the fix we made several months ago; we have not encountered any particular problem with it since.

Timetable of the problem

| Peer 1 | Time | Peer 2 |
| --- | --- | --- |
| Connected to relay | 0.416 | |
| Receive identify from relay | 0.420 | |
| | 0.646 | Connected to relay |
| | 0.650 | Dial Peer 1 through relay |
| | 0.655 | Connected to Peer 1 through relay |
| | 0.655 | Create DCUtR handler with no hole-punch candidates |
| Connected to Peer 2 through relay | 0.657 | |
| | 0.659 | Receive identify from relay (first `observed_addr`) |
| DCUtR fail `OutboundError(NoAddress)` | 0.663 | |
| | 0.664 | DCUtR fail `InboundError(UnexpectedEof)` |

Notes & open questions

  1. I have put some TODOs about potentially merging the `self.attempts += 1` statements. On the inbound side, `self.attempts` is incremented when starting a handshake; on the outbound side, it is incremented at the "new outbound substream" request. Previously I don't think this was a problem, but now that we do not necessarily trigger a handshake when there are no hole-punch candidates, I think we might want to increment `self.attempts` only when a handshake is effectively started. What do you think?

  2. When starting a new handshake, if the corresponding stream (`inbound_stream` or `outbound_stream`) is not empty, a warn-level log states `New inbound/outbound connect stream while still upgrading previous one. Replacing previous with new.` However, reading the code of `FuturesSet::try_push` and the `FuturesMap::try_push` method it uses internally, the pushed future never replaces an old one when capacity is reached; it just returns an error. So what do you think should be done? Should we actually replace the old with the new, as the log says? Or should we keep the old one and update the log to say that the new one was dropped?
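To illustrate point 2, here is a std-only model of the bounded-set semantics described above (this `BoundedSet` is not the real futures-bounded API, just a sketch of its `try_push` behavior): when capacity is reached, the new item is rejected and handed back, and the old one stays, contrary to what the warn log claims.

```rust
// Minimal model of try_push-on-a-full-set semantics: reject the NEW item,
// keep the old one.
struct BoundedSet<T> {
    capacity: usize,
    items: Vec<T>,
}

// The rejected item is handed back to the caller.
#[derive(Debug, PartialEq)]
struct PushError<T>(T);

impl<T> BoundedSet<T> {
    fn new(capacity: usize) -> Self {
        BoundedSet { capacity, items: Vec::new() }
    }

    fn try_push(&mut self, item: T) -> Result<(), PushError<T>> {
        if self.items.len() >= self.capacity {
            return Err(PushError(item)); // new item dropped, old item kept
        }
        self.items.push(item);
        Ok(())
    }
}

fn main() {
    let mut set = BoundedSet::new(1);
    assert!(set.try_push("first handshake").is_ok());
    // Second push fails: "first handshake" is NOT replaced.
    assert!(set.try_push("second handshake").is_err());
    assert_eq!(set.items, vec!["first handshake"]);
}
```

Under these semantics, the log message is the part that is wrong: the new stream is dropped, not the previous one.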

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@dariusc93 dariusc93 requested a review from jxs August 30, 2024 18:53
@elenaf9 elenaf9 self-requested a review December 11, 2024 11:59

mergify bot commented Dec 11, 2024

This pull request has merge conflicts. Could you please resolve them @stormshield-frb? 🙏

@elenaf9 (Contributor) left a comment


Thank you for upstreaming this fix @stormshield-frb!

I am not sure if buffering a pending stream is the best solution, see my below comment about affected RTT measurements.

I was wondering, did you also consider the alternative of:

  • adding the Command::NewExternalAddrCandidate that this PR introduces
  • In the Behaviour: retry in case of an Event::OutboundConnectFailed if the error is NoAddresses. I would think that by the second or third attempt the remote is likely to have completed the identify exchange and updated its external address list.
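That retry alternative could look roughly like this (std-only sketch; the error variant, constant, and function names are illustrative, not the dcutr API): on an outbound-connect failure caused by missing addresses, schedule a bounded number of retries instead of buffering the stream.

```rust
// Illustrative failure classification for a hole-punch attempt.
#[derive(Debug, PartialEq)]
enum HolePunchError {
    NoAddresses,
    Other,
}

// Hypothetical retry budget; the real value would be a tuning decision.
const MAX_RETRIES: u8 = 3;

// Returns true if the behaviour should schedule another attempt.
fn should_retry(error: &HolePunchError, attempts: u8) -> bool {
    matches!(error, HolePunchError::NoAddresses) && attempts < MAX_RETRIES
}

fn main() {
    // First failure with no known candidates: retry, identify may finish soon.
    assert!(should_retry(&HolePunchError::NoAddresses, 1));
    // Retry budget exhausted: give up.
    assert!(!should_retry(&HolePunchError::NoAddresses, 3));
    // Unrelated errors are not retried.
    assert!(!should_retry(&HolePunchError::Other, 1));
}
```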

Comment on lines +73 to +74
```rust
/// All relayed connections.
relayed_connections: HashMap<PeerId, HashSet<ConnectionId>>,
```

This was confusing me a bit when reviewing, because it gives the impression that the logic that was previously applied to direct_connections is now applied to relayed_connections.
However, if I understand it correctly, it's actually that:

  • direct_connections was already unused prior to this PR, and can be removed independently
  • relayed_connections is required by this PR

If so, would you mind splitting the removal of direct_connections out of this PR and doing it in a follow-up PR yourself? I can do the follow-up PR as well.

Comment on lines +393 to +394
```rust
self.inner.push(address.clone(), ());
Some(address)
```

Suggested change

```diff
-self.inner.push(address.clone(), ());
-Some(address)
+match self.inner.push(address.clone(), ()) {
+    Some((addr, ())) if addr == address => None,
+    _ => Some(address),
+}
```

Only return Some if the address wasn't already in the cache?
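The dedup semantics of that suggestion can be modeled with std types (this `Cache` is illustrative, not the actual cache used by dcutr; the key property mirrored from the suggestion is that `push` returns the entry it displaced, so a re-pushed address "displaces" itself):

```rust
use std::collections::VecDeque;

struct Cache {
    inner: VecDeque<String>,
}

impl Cache {
    // Pushing an address that is already present returns that same address,
    // mimicking a cache whose push hands back the displaced entry.
    fn push(&mut self, addr: String) -> Option<String> {
        if let Some(pos) = self.inner.iter().position(|a| *a == addr) {
            return self.inner.remove(pos);
        }
        self.inner.push_back(addr);
        None
    }

    // The pattern from the suggested change: only emit the address if it
    // was not already in the cache.
    fn on_candidate(&mut self, address: String) -> Option<String> {
        match self.push(address.clone()) {
            Some(addr) if addr == address => None, // already cached: suppress
            _ => Some(address),                    // new address: emit it
        }
    }
}

fn main() {
    let mut cache = Cache { inner: VecDeque::new() };
    let addr = "/ip4/1.2.3.4/tcp/4001".to_string();
    // First sighting is emitted, the duplicate is suppressed.
    assert_eq!(cache.on_candidate(addr.clone()), Some(addr.clone()));
    assert_eq!(cache.on_candidate(addr), None);
}
```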

Comment on lines -103 to +153
```rust
if self
    .inbound_stream
    .try_push(inbound::handshake(
        stream,
        self.holepunch_candidates.clone(),
    ))
    .is_err()
{
    tracing::warn!(
        "New inbound connect stream while still upgrading previous one. Replacing previous with new.",
    );
}
self.attempts += 1;
}
future::Either::Left(stream) => self.set_stream(StreamType::Inbound, stream),
```

Won't buffering the inbound stream affect the RTT that the remote measures?
Since we do the buffering after the protocol negotiation succeeds, I think the remote will already have sent a Connect message and started measuring the RTT.
So a subsequent hole-punching attempt would be likely to fail because the timing is off?
