fix(transport): various tcp transport races #1095

Merged
merged 6 commits into from
May 14, 2024
Changes from 4 commits
12 changes: 7 additions & 5 deletions libp2p/dialer.nim
@@ -81,16 +81,18 @@ proc dialAndUpgrade(
if dialed.dir != dir:
dialed.dir = dir
await transport.upgrade(dialed, peerId)
except CancelledError as exc:
Contributor

Could you please elaborate on why this is necessary and why we don't need to call close?

Contributor Author

Oops, close should stay (ed6be85). The point was to avoid returning nil, which makes the dialer try the next address instead of aborting the dial.
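
To illustrate the difference between returning nil and re-raising, here is a minimal sketch in plain Nim. It is not the nim-libp2p dialer: `Conn`, `DialCancelled`, `tryDial` and `dial` are hypothetical names, and the code is synchronous rather than chronos-based. A nil result moves the loop on to the next address, while a raised exception aborts the whole dial.

```nim
type
  Conn = ref object
    address: string
  # Hypothetical stand-in for cancellation; not chronos' CancelledError.
  DialCancelled = object of CatchableError

proc tryDial(address: string): Conn {.raises: [DialCancelled].} =
  ## Pretend to dial one address.
  if address == "cancel":
    # Cancellation propagates to the caller instead of being swallowed.
    raise newException(DialCancelled, "dial cancelled")
  if address == "bad":
    # An ordinary failure: nil tells the caller to try the next address.
    return nil
  Conn(address: address)

proc dial(addrs: openArray[string]): Conn {.raises: [DialCancelled].} =
  ## Try each address in turn and return the first connection.
  for a in addrs:
    let conn = tryDial(a)
    if not conn.isNil:
      return conn
  return nil

when isMainModule:
  echo dial(["bad", "good"]).address
  try:
    discard dial(["cancel", "good"])
  except DialCancelled as exc:
    echo "aborted: ", exc.msg
```

In the second call the "good" address is never attempted, which is the behaviour the change is after for cancellation.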

Contributor

Could we catch only LPError instead?

CancelledError, LPError], raw: true).} =

Contributor Author
@arnetheduck, May 9, 2024

We could, but for that to be safe we would also need to annotate dialAndUpgrade with raises annotations, so that the compiler catches changes in upgrade. This would significantly increase the scope of this PR.

Contributor
@diegomrsantos, May 9, 2024

The code catches and re-raises CancelledError, so I thought this would be equivalent to catching only LPError instead of CatchableError. upgrade, the proc in the try/except, only raises those two errors. But thinking more about it, you said we still want to call close when catching CancelledError, so my suggestion probably doesn't improve anything. Btw, did you forget to call close?

Contributor

Ah, I see what you mean now, didn't think about it. So even if the called proc is annotated, it isn't safe to trust it if the caller isn't annotated as well.
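
A small sketch of that point, using Nim's synchronous `raises` tracking rather than chronos async procs. `UpgradeError`, `upgrade`, `dialUnchecked` and `dialChecked` are hypothetical names, not the nim-libp2p API. Both callers behave the same today; the difference is what happens if `upgrade` later adds another exception to its `raises` list.

```nim
# Hypothetical stand-in for LPError.
type UpgradeError = object of CatchableError

proc upgrade(fail: bool) {.raises: [UpgradeError].} =
  if fail:
    raise newException(UpgradeError, "upgrade failed")

# Unannotated caller: if upgrade's raises list grows, this still compiles
# and the new exception silently escapes the narrow handler.
proc dialUnchecked(fail: bool): bool =
  try:
    upgrade(fail)
    result = true
  except UpgradeError:
    result = false

# Annotated caller: if upgrade's raises list grows, the compiler rejects
# this proc until the new exception is handled or declared.
proc dialChecked(fail: bool): bool {.raises: [].} =
  try:
    upgrade(fail)
    result = true
  except UpgradeError:
    result = false

when isMainModule:
  echo dialUnchecked(true), " ", dialChecked(true)
```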

Contributor

But thinking more about it, all errors in this project should inherit from LPError, shouldn't they? In theory, it should be fine then.

Contributor

Maybe it's even more correct, as LPError should represent all the expected errors we thought about and are fine ignoring in this case. All the others probably shouldn't be swallowed here.

Contributor Author

Broadly, it's best when every abstraction has its own hierarchy. For example, a failure to decode a multihash is not related to a transport being closed in any meaningful way, so deriving both from LPError has no underlying motivation except that they happen to be implemented in the same codebase.

As such, it's usually best if exceptions are mapped to the abstraction layer that they're operating at, and each layer translates the exceptions coming from other layers to their own level - ie "socket closed" becomes "transport closed" becomes "peer disconnected" as it travels through the layers.
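
The layering described above can be sketched in plain Nim as follows. The exception types and procs are hypothetical and do not exist in nim-libp2p; each layer catches only the error type of the layer below and re-raises its own, keeping the original as the parent exception.

```nim
# Hypothetical per-layer exception types, one hierarchy per abstraction.
type
  SocketError = object of CatchableError
  TransportError = object of CatchableError
  PeerError = object of CatchableError

proc readSocket() {.raises: [SocketError].} =
  raise newException(SocketError, "socket closed")

proc readTransport() {.raises: [TransportError].} =
  try:
    readSocket()
  except SocketError as exc:
    # Translate to this layer's error and keep the cause as the parent.
    raise newException(TransportError, "transport closed: " & exc.msg, exc)

proc readPeer() {.raises: [PeerError].} =
  try:
    readTransport()
  except TransportError as exc:
    raise newException(PeerError, "peer disconnected: " & exc.msg, exc)

when isMainModule:
  try:
    readPeer()
  except PeerError as exc:
    echo exc.msg
```

Because every proc declares its own `raises` list, a caller of `readPeer` only ever has to handle `PeerError`, whatever the lower layers do internally.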

Contributor Author

As noted though, changing exception types is usually a breaking change, so it needs to be done carefully. In this particular case, it needs to be done across the entire transport hierarchy at the same time.

await dialed.close()
raise exc
except CatchableError as exc:
# If we failed to establish the connection through one transport,
# we won't succeeded through another - no use in trying again
await dialed.close()
debug "Connection upgrade failed", err = exc.msg, peerId = peerId.get(default(PeerId))
if exc isnot CancelledError:
if dialed.dir == Direction.Out:
libp2p_failed_upgrades_outgoing.inc()
else:
libp2p_failed_upgrades_incoming.inc()
if dialed.dir == Direction.Out:
libp2p_failed_upgrades_outgoing.inc()
else:
libp2p_failed_upgrades_incoming.inc()

# Try other address
return nil
9 changes: 0 additions & 9 deletions libp2p/errors.nim
@@ -44,12 +44,3 @@ macro checkFutures*[F](futs: seq[F], exclude: untyped = []): untyped =
# We still don't abort but warn
debug "A future has failed, enable trace logging for details", error=exc.name
trace "Exception details", msg=exc.msg

template tryAndWarn*(message: static[string]; body: untyped): untyped =
try:
body
except CancelledError as exc:
raise exc
except CatchableError as exc:
debug "An exception has ocurred, enable trace logging for details", name = exc.name, msg = message
trace "Exception details", exc = exc.msg
19 changes: 7 additions & 12 deletions libp2p/switch.nim
@@ -273,6 +273,7 @@ proc accept(s: Switch, transport: Transport) {.async.} = # noraises
except CancelledError as exc:
trace "releasing semaphore on cancellation"
upgrades.release() # always release the slot
return
except CatchableError as exc:
error "Exception in accept loop, exiting", exc = exc.msg
upgrades.release() # always release the slot
@@ -288,6 +289,12 @@ proc stop*(s: Switch) {.async, public.} =

s.started = false

try:
# Stop accepting incoming connections
await allFutures(s.acceptFuts.mapIt(it.cancelAndWait())).wait(1.seconds)
except CatchableError as exc:
debug "Cannot cancel accepts", error = exc.msg

for service in s.services:
discard await service.stop(s)

@@ -302,18 +309,6 @@ proc stop*(s: Switch) {.async, public.} =
except CatchableError as exc:
warn "error cleaning up transports", msg = exc.msg

try:
await allFutures(s.acceptFuts)
.wait(1.seconds)
except CatchableError as exc:
trace "Exception while stopping accept loops", exc = exc.msg

# check that all futures were properly
# stopped and otherwise cancel them
for a in s.acceptFuts:
if not a.finished:
a.cancel()

for service in s.services:
discard await service.stop(s)
