
fix(dcutr): handle empty holepunch_candidates #5583

Open · stormshield-frb wants to merge 5 commits into master
Conversation

@stormshield-frb (Contributor) commented Aug 30, 2024

Description

A few months ago, we were experiencing occasional, hard-to-reproduce DCUtR failures. After some investigation, the cause turned out to be a race condition: identify is sometimes a little slow, and the DCUtR handler is created before any identify event has been received. On its own, this is not necessarily a problem. But if the race happens while DCUtR's list of hole-punch candidates is empty, DCUtR will always fail for that connection. When a new relayed connection is established, DCUtR creates a Handler for it, which is responsible for the hole punching. However, the candidates the Handler uses are the ones known at instantiation time, so any later NewExternalAddrCandidate updates are never forwarded to existing Handlers.
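The shape of the fix can be sketched with std types standing in for the libp2p machinery (the `Command` enum, `Handler` struct, and method names below are illustrative, not the actual dcutr API): the behaviour forwards each new external address candidate to already-instantiated handlers, instead of freezing the candidate list at Handler construction.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Illustrative stand-in for a multiaddress.
type Multiaddr = String;

// Hypothetical command forwarded from the behaviour to a live handler.
enum Command {
    NewExternalAddrCandidate(Multiaddr),
}

struct Handler {
    candidates: Vec<Multiaddr>,
    commands: Receiver<Command>,
}

impl Handler {
    // Before the fix, the candidates were fixed here, at instantiation time.
    fn new(initial: Vec<Multiaddr>, commands: Receiver<Command>) -> Self {
        Handler { candidates: initial, commands }
    }

    // After the fix, identify results arriving late still reach the handler.
    fn poll_commands(&mut self) {
        while let Ok(Command::NewExternalAddrCandidate(addr)) = self.commands.try_recv() {
            if !self.candidates.contains(&addr) {
                self.candidates.push(addr);
            }
        }
    }
}

fn main() {
    let (tx, rx): (Sender<Command>, Receiver<Command>) = channel();
    // Relayed connection established before identify completed: empty list.
    let mut handler = Handler::new(Vec::new(), rx);
    // identify finishes later and reports an observed address.
    tx.send(Command::NewExternalAddrCandidate("/ip4/1.2.3.4/tcp/4001".into()))
        .unwrap();
    handler.poll_commands();
    assert_eq!(handler.candidates.len(), 1);
}
```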

This PR upstreams the fix we made several months ago; we have not encountered any particular problem with it since.

Timetable of the problem

| Peer 1 | Time | Peer 2 |
| --- | --- | --- |
| Connected to relay | 0.416 | |
| Receive identify from relay | 0.420 | |
| | 0.646 | Connected to relay |
| | 0.650 | Dial Peer 1 through relay |
| | 0.655 | Connected to Peer 1 through relay |
| | 0.655 | Create DCUtR handler with no hole-punch candidates |
| Connected to Peer 2 through relay | 0.657 | |
| | 0.659 | Receive identify from relay (first `observed_addr`) |
| DCUtR fail `OutboundError(NoAddress)` | 0.663 | |
| | 0.664 | DCUtR fail `InboundError(UnexpectedEof)` |

Notes & open questions

  1. I have put some TODOs about potentially merging the `self.attempts += 1` statements. On the inbound side, `self.attempts` is incremented when starting a handshake; on the outbound side, it is incremented at the "new outbound substream" request. Previously I don't think this was a problem, but now that we do not necessarily trigger a handshake when there are no hole-punch candidates, I think we might want to increment `self.attempts` only when a handshake is effectively started. What do you think?

  2. When starting a new handshake, if the corresponding stream (`inbound_stream` or `outbound_stream`) is not empty, a warn-level log states `New inbound/outbound connect stream while still upgrading previous one. Replacing previous with new.` However, reading the code of `FuturesSet::try_push` and the `FuturesMap::try_push` method it uses internally, the pushed future never replaces an old one when capacity is reached; it just returns an error. So what do you think should be done? Should we actually replace the old with the new, as the log says? Or should we keep the old one and update the log to say that the new one was dropped?
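To illustrate point 2, here is a std-only model of the bounded-set semantics described above (this `BoundedSet` is not the real futures-bounded API, just a sketch of its `try_push` behavior): when capacity is reached, the new item is rejected and handed back, and the old one stays, contrary to what the warn log claims.

```rust
// Minimal model of try_push-on-a-full-set semantics: reject the NEW item,
// keep the old one.
struct BoundedSet<T> {
    capacity: usize,
    items: Vec<T>,
}

// The rejected item is handed back to the caller.
#[derive(Debug, PartialEq)]
struct PushError<T>(T);

impl<T> BoundedSet<T> {
    fn new(capacity: usize) -> Self {
        BoundedSet { capacity, items: Vec::new() }
    }

    fn try_push(&mut self, item: T) -> Result<(), PushError<T>> {
        if self.items.len() >= self.capacity {
            return Err(PushError(item)); // new item dropped, old item kept
        }
        self.items.push(item);
        Ok(())
    }
}

fn main() {
    let mut set = BoundedSet::new(1);
    assert!(set.try_push("first handshake").is_ok());
    // Second push fails: "first handshake" is NOT replaced.
    assert!(set.try_push("second handshake").is_err());
    assert_eq!(set.items, vec!["first handshake"]);
}
```

Under these semantics, the log message is the part that is wrong: the new stream is dropped, not the previous one.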

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@dariusc93 dariusc93 requested a review from jxs August 30, 2024 18:53
@elenaf9 elenaf9 self-requested a review December 11, 2024 11:59

mergify bot commented Dec 11, 2024

This pull request has merge conflicts. Could you please resolve them @stormshield-frb? 🙏

@elenaf9 (Contributor) left a comment


Thank you for upstreaming this fix @stormshield-frb!

I am not sure if buffering a pending stream is the best solution, see my below comment about affected RTT measurements.

I was wondering, did you also consider the alternative of:

  • adding the Command::NewExternalAddrCandidate that this PR introduces
  • In the Behaviour: retry in case of an Event::OutboundConnectFailed if the error is NoAddresses. I would think that by the second or third attempt the remote is likely to have completed the identify exchange and updated its external address list.
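That retry alternative could look roughly like this (std-only sketch; the error variant, constant, and function names are illustrative, not the dcutr API): on an outbound-connect failure caused by missing addresses, schedule a bounded number of retries instead of buffering the stream.

```rust
// Illustrative failure classification for a hole-punch attempt.
#[derive(Debug, PartialEq)]
enum HolePunchError {
    NoAddresses,
    Other,
}

// Hypothetical retry budget; the real value would be a tuning decision.
const MAX_RETRIES: u8 = 3;

// Returns true if the behaviour should schedule another attempt.
fn should_retry(error: &HolePunchError, attempts: u8) -> bool {
    matches!(error, HolePunchError::NoAddresses) && attempts < MAX_RETRIES
}

fn main() {
    // First failure with no known candidates: retry, identify may finish soon.
    assert!(should_retry(&HolePunchError::NoAddresses, 1));
    // Retry budget exhausted: give up.
    assert!(!should_retry(&HolePunchError::NoAddresses, 3));
    // Unrelated errors are not retried.
    assert!(!should_retry(&HolePunchError::Other, 1));
}
```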

Comment on lines +73 to +74
```rust
/// All relayed connections.
relayed_connections: HashMap<PeerId, HashSet<ConnectionId>>,
```

This was confusing me a bit when reviewing, because it gives the impression that the logic that was previously applied to direct_connections is now applied to relayed_connections.
However, if I understand it correctly, it's actually that:

  • direct_connections was already unused prior to this PR, and can be removed independently
  • relayed_connections is required by this PR

If so, would you mind splitting the removal of direct_connections out of this PR and doing it in a follow-up PR yourself? I can do the follow-up PR as well.

Comment on lines +393 to +394
```rust
self.inner.push(address.clone(), ());
Some(address)
```

Suggested change

```diff
-self.inner.push(address.clone(), ());
-Some(address)
+match self.inner.push(address.clone(), ()) {
+    Some((addr, ())) if addr == address => None,
+    _ => Some(address),
+}
```

Only return Some if the address wasn't already in the cache?
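The dedup semantics of that suggestion can be modeled with std types (this `Cache` is illustrative, not the actual cache used by dcutr; the key property mirrored from the suggestion is that `push` returns the entry it displaced, so a re-pushed address "displaces" itself):

```rust
use std::collections::VecDeque;

struct Cache {
    inner: VecDeque<String>,
}

impl Cache {
    // Pushing an address that is already present returns that same address,
    // mimicking a cache whose push hands back the displaced entry.
    fn push(&mut self, addr: String) -> Option<String> {
        if let Some(pos) = self.inner.iter().position(|a| *a == addr) {
            return self.inner.remove(pos);
        }
        self.inner.push_back(addr);
        None
    }

    // The pattern from the suggested change: only emit the address if it
    // was not already in the cache.
    fn on_candidate(&mut self, address: String) -> Option<String> {
        match self.push(address.clone()) {
            Some(addr) if addr == address => None, // already cached: suppress
            _ => Some(address),                    // new address: emit it
        }
    }
}

fn main() {
    let mut cache = Cache { inner: VecDeque::new() };
    let addr = "/ip4/1.2.3.4/tcp/4001".to_string();
    // First sighting is emitted, the duplicate is suppressed.
    assert_eq!(cache.on_candidate(addr.clone()), Some(addr.clone()));
    assert_eq!(cache.on_candidate(addr), None);
}
```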

Comment on lines -103 to +153
```rust
if self
    .inbound_stream
    .try_push(inbound::handshake(
        stream,
        self.holepunch_candidates.clone(),
    ))
    .is_err()
{
    tracing::warn!(
        "New inbound connect stream while still upgrading previous one. Replacing previous with new.",
    );
}
self.attempts += 1;
}
future::Either::Left(stream) => self.set_stream(StreamType::Inbound, stream),
```

Won't buffering the inbound stream affect the RTT that the remote measures?
Since we do the buffering after the protocol negotiation succeeds, I think the remote will already have sent a Connect message and started measuring the RTT.
So a subsequent hole-punching attempt would be likely to fail because the timing is off?
