Potential file descriptor leak #529
Comments
From reading the code, I found that Lines 283 to 286 in 74719ba are probably missed for mycelium/mycelium/src/peer_manager.rs Lines 953 to 965 in 74719ba.
But I'm not really sure about this because:
And even if this is not the source of this issue, I think the …
It could be useful to put a call to died there as well, but it's indeed not the cause of the issue. The general idea is that a peer which dies injects itself on a channel for the router, which eventually ends up in handle_dead_peer.
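A minimal sketch of that flow, using hypothetical type and channel names rather than the actual mycelium code, might look like this:

```rust
// Sketch only: a dying peer notifies the router over a channel, and the
// router performs cleanup in a handler like handle_dead_peer.
use tokio::sync::mpsc;

#[derive(Clone, Debug)]
struct Peer {
    id: u64,
    dead_tx: mpsc::Sender<u64>,
}

impl Peer {
    /// Called when the connection task decides the peer is gone.
    async fn died(&self) {
        // Inject ourselves on the channel; the router picks this up later.
        let _ = self.dead_tx.send(self.id).await;
    }
}

struct Router {
    dead_rx: mpsc::Receiver<u64>,
}

impl Router {
    async fn run(&mut self) {
        while let Some(peer_id) = self.dead_rx.recv().await {
            self.handle_dead_peer(peer_id);
        }
    }

    fn handle_dead_peer(&mut self, peer_id: u64) {
        // Remove routes, drop the Peer handle, close the connection, etc.
        println!("cleaning up dead peer {peer_id}");
    }
}
```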
I found something interesting in the logs (couldn't download them, so I sent a screenshot). I searched for …, which means there are 750 connect attempts to that node in one second, which is super high if it is not a public node. @coesensbert @LeeSmet
After a bunch of debugging over the last couple of days, the issue is in the task which is spawned in Peer::new to manage the connection. It seems this can get stuck when (presumably) forwarding packets to a peer, if the remote exited (also presumably), while there is still a bunch of data in the sockets. This was verified by taking a handle to the task and calling handle::abort() in Peer::died (while also ensuring proper cleanup in the router). In general, using a select with a cancellation method and then awaiting inside a select branch is a poor idea, as the select can only cancel the actual branches while they are awaited. At this point, I think it's probably best to rewrite this task as a manual future implementation/state machine.
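To illustrate the pitfall, here is a simplified sketch (not the real connection task; the oneshot cancellation signal and mpsc packet queue are assumptions):

```rust
// tokio::select! can only cancel futures it is currently polling. Once a
// branch body is entered and *it* awaits something (here the write), the
// cancellation branch is no longer polled, so the task can hang forever if
// the write never completes (e.g. the remote is gone but the socket buffer
// is full).
use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio::sync::mpsc;

async fn connection_task<W>(
    mut socket: W,
    mut packets: mpsc::Receiver<Vec<u8>>,
    mut cancel: tokio::sync::oneshot::Receiver<()>,
) where
    W: AsyncWrite + Unpin,
{
    loop {
        tokio::select! {
            _ = &mut cancel => {
                // Only reachable while the select! itself is being polled.
                break;
            }
            Some(packet) = packets.recv() => {
                // BUG-PRONE: this await happens *inside* a branch body, so
                // the cancel branch above cannot interrupt it. If the peer
                // stops reading, this write can block the whole task.
                if socket.write_all(&packet).await.is_err() {
                    break;
                }
            }
        }
    }
}
```

Aborting the task via its JoinHandle, as described above, sidesteps this because abort cancels the task at its next yield point regardless of which branch it is stuck in.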
@LeeSmet and how does it relate to the above logs?
It simulates the … And this is the expected scenario for the loop to get stuck if …
The task I mentioned has a …
Before doing a big rewrite, can't we solve it using …?
Oh, I forgot that it is an async socket. Found https://docs.rs/tokio-io-timeout/latest/tokio_io_timeout/ but not really sure about this.
@LeeSmet
Before going further by rewriting things, I think we could try to add a timeout to the suspected stuck process. If I understand correctly, the one that could get stuck is Lines 102 to 193 in 78dcb1a.
I think we can wrap it with … We also do that in 0-stor.
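A rough sketch of that idea, assuming the stuck await is a socket write (the timeout value and helper name here are made up):

```rust
// Wrap the potentially stuck await in tokio::time::timeout so the task can
// notice a dead remote instead of blocking forever on a full socket buffer.
use std::time::Duration;
use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio::time::timeout;

/// Hypothetical write timeout; the real value would need tuning.
const WRITE_TIMEOUT: Duration = Duration::from_secs(30);

async fn write_packet<W>(socket: &mut W, packet: &[u8]) -> std::io::Result<()>
where
    W: AsyncWrite + Unpin,
{
    match timeout(WRITE_TIMEOUT, socket.write_all(packet)).await {
        Ok(res) => res,
        // Timeout elapsed: treat the peer as dead so cleanup (Peer::died) can run.
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "write to peer timed out",
        )),
    }
}
```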
It seems that some running nodes can leak file descriptors, seemingly due to connections not being closed properly. While we are observing growing file descriptor usage on public nodes (and a growing number of connections), it should be noted that these connections don't seem to be added as peers. A very small number of nodes has even managed to exhaust their allowed file descriptor limit, causing them to go into a hot loop of errors when (trying to) connect to a public node.
It feels like something in the router is going wrong when detecting whether a peer is dead and during the subsequent cleanup. That is, the router presumably thinks the peer is dead and marks it as such, causing a reconnect by the peer manager. However, under the hood the peer connection is not properly cleaned up. This might be possible because the connection itself is managed in the peer Inner type, which is wrapped by an Arc, so as long as one Peer instance is not properly cleaned up, the Inner stays alive.
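A simplified illustration of that suspicion, with hypothetical types:

```rust
// If the connection lives in an Arc-wrapped Inner, any forgotten Peer clone
// keeps Inner (and therefore the socket / file descriptor) alive even after
// the router has marked the peer dead and reconnected.
use std::sync::Arc;
use tokio::net::TcpStream;

struct Inner {
    // The actual connection; its file descriptor is only released when the
    // last Arc<Inner> is dropped.
    _stream: TcpStream,
}

#[derive(Clone)]
struct Peer {
    inner: Arc<Inner>,
}

impl Peer {
    fn strong_count(&self) -> usize {
        // If this stays above 1 after handle_dead_peer, something still holds
        // a clone and the descriptor leaks.
        Arc::strong_count(&self.inner)
    }
}
```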
Alternatively, there might be an issue in the peer manager logic.
We should also check if there is a panic somewhere causing a task to exit, preventing cleanup.