
Potential file descriptor leak #529

Open
LeeSmet opened this issue Dec 23, 2024 · 13 comments
Assignees
Labels
type_bug Something isn't working

Comments

@LeeSmet
Contributor

LeeSmet commented Dec 23, 2024

It seems that some running nodes can leak file descriptors, seemingly because connections are not being closed properly. While we are observing growing file descriptor usage on public nodes (and a growing number of connections), it should be noted that these connections don't seem to be added as peers. A very small number of nodes have even managed to exhaust their allowed file descriptor limit, causing them to go into a hot loop of errors when (trying to) connect to a public node.

It feels like something in the router is going wrong when detecting whether a peer is dead, and in the subsequent cleanup. That is, the router presumably thinks the peer is dead and marks it as such, causing a reconnect by the peer manager. However, under the hood the peer connection is not properly cleaned up. This is possible because the connection itself is managed in the peer's Inner type, which is wrapped in an Arc, so as long as one Peer instance is not properly cleaned up, the Inner stays alive.
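
To illustrate the ownership pattern (the type and field names below are assumptions, not the actual mycelium definitions):

use std::net::TcpStream;
use std::sync::Arc;

// Assumed shape of the Peer/Inner split described above: the connection, and
// therefore the OS file descriptor, lives inside Inner.
struct Inner {
    connection: TcpStream,
}

#[derive(Clone)]
struct Peer {
    inner: Arc<Inner>,
}

// The TcpStream (and its file descriptor) is only released once every clone of
// the Arc is dropped, so a single Peer instance that is never cleaned up (for
// example one still owned by a stuck task) keeps the socket open indefinitely.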

Alternatively, there might be an issue in the peer manager logic.

We should also check if there is a panic somewhere causing a task to exit, preventing cleanup.

@LeeSmet LeeSmet added the type_bug Something isn't working label Dec 23, 2024
@iwanbk iwanbk self-assigned this Dec 23, 2024
@iwanbk
Member

iwanbk commented Dec 25, 2024

From reading the code, I found that died is probably missing in one place.
The died definition:

pub fn died(&self) {
    self.inner.alive.store(false, Ordering::Relaxed);
    self.inner.death_notifier.notify_one();
}

It is probably missing for this:

// If we have a new peer notify insert the new one in the router, then notify it that
// the old one is dead.
if let Some(p) = peer {
    let router = self.router.lock().unwrap();
    router.add_peer_interface(p);
    if let Some(old_peer) = old_peer_info
        .expect("We already checked the entry was occupied so this is always Some; qed")
        .pr
        .upgrade()
    {
        router.handle_dead_peer(old_peer);
    }
}

But I'm not really sure about this because:

  • I still don't have a way to reproduce this issue.
  • From the code comment, that code path is for the QUIC case, but the logs from @coesensbert only mention TCP.

Even if this is not the source of the issue, I think the died usage should be improved, because it is currently scattered across a few places.
From what I can see right now, a few code paths rely on Router::handle_dead_peer to do cleanup, so perhaps we could put died in that function as well, as in the sketch below.
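
A minimal sketch of that idea, with stand-in types (the Router internals and the existing cleanup steps are only hinted at; Peer is the type whose died method is shown above):

struct Router { /* routing tables, peer list, metrics, ... */ }

impl Router {
    fn handle_dead_peer(&self, peer: Peer) {
        // Always mark the peer as dead when the router cleans it up, so no
        // caller has to remember to call died before handing the peer over.
        peer.died();

        // ... existing cleanup: retract routes learned via this peer, remove
        // it from the peer list, update metrics, drop the reference ...
    }
}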

@LeeSmet
Contributor Author

LeeSmet commented Dec 26, 2024

It could be useful to put a call to died there as well, but it's indeed not the cause of the issue.

The general idea is that a peer which dies injects itself into a channel for the router, which eventually ends up in handle_dead_peer. died is a utility function which was added because not every connection type (looking at you, QUIC) can detect that it has been terminated. In general, this also leads to the current problem where peers can be removed and counted as "dead" twice in the metrics.
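
Roughly, that flow looks like the sketch below; the channel and function names are assumptions (reusing the stand-in Peer and Router types from the sketches above), not the actual mycelium identifiers:

use tokio::sync::mpsc;

// Assumed shape of the flow: a peer task notices its connection is gone and
// reports itself to the router, which eventually runs handle_dead_peer.
async fn peer_connection_task(peer: Peer, dead_peer_tx: mpsc::Sender<Peer>) {
    // ... drive the connection until it errors out or is closed ...

    // Hand ourselves to the router so it can clean up and reconnect.
    let _ = dead_peer_tx.send(peer).await;
}

async fn router_dead_peer_loop(router: Router, mut dead_peer_rx: mpsc::Receiver<Peer>) {
    while let Some(peer) = dead_peer_rx.recv().await {
        router.handle_dead_peer(peer);
    }
}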

@iwanbk
Member

iwanbk commented Dec 27, 2024

I found something interesting in the logs.

(couldn't download it, so I took a screenshot)
[screenshot of the log output]

I searched for mycelium: 2024-12-27T03:35:01 in the logs and found 757 occurrences.
(I tried other time spans and got a similar number, around ~750.)

Which means there are roughly 750 connection attempts to that node in one second, which is extremely high if it is not a public node.
Considering that the mycelium client only reconnects after 5 seconds, it means there are probably 750 * 5 = ~3750 clients trying to connect to that node, which doesn't look normal at all, no?

@coesensbert @LeeSmet
maybe you could help to check:

  1. whether that huge number is expected
  2. the possibility that it is not the mycelium client trying to connect to that node.

@LeeSmet
Contributor Author

LeeSmet commented Dec 31, 2024

After a bunch of debugging over the last couple of days, the issue is in the task which is spawned in Peer::new to manage the connection. It seems this can get stuck when (presumably) forwarding packets to a peer, if the remote exited (also presumably) while there is still a bunch of data in the sockets. This was verified by taking a handle to the task and calling handle::abort() in Peer::died (while also ensuring proper cleanup in the router).

In general, using a select with a cancellation branch and then awaiting inside a select branch body is a poor idea, as the select can only cancel the actual branch futures while they are being awaited: once a branch body is running, the other branches (including the cancellation one) are not polled.

At this point, I think it's probably best to rewrite this task as a manual future implementation/state machine.
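
A minimal sketch of that pitfall, with made-up channel and socket names rather than the actual mycelium code:

use std::sync::Arc;
use tokio::io::{AsyncWriteExt, WriteHalf};
use tokio::net::TcpStream;
use tokio::select;
use tokio::sync::{mpsc, Notify};

// Hypothetical task body showing how a cancellation branch can be missed.
async fn peer_loop(
    mut outgoing: mpsc::Receiver<Vec<u8>>,
    mut socket: WriteHalf<TcpStream>,
    death_watcher: Arc<Notify>,
) {
    loop {
        select! {
            Some(packet) = outgoing.recv() => {
                // This await runs *after* the select! has already picked this
                // branch. If it stays pending (e.g. the remote is gone but the
                // send buffer is full), the loop never reaches the select!
                // again, so the death_watcher branch below is never polled and
                // the task is stuck here forever.
                if socket.write_all(&packet).await.is_err() {
                    break;
                }
            }
            _ = death_watcher.notified() => {
                // Only reachable while the select! itself is being awaited.
                break;
            }
        }
    }
}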

@iwanbk
Member

iwanbk commented Dec 31, 2024

@LeeSmet and how does this relate to the above logs?

@iwanbk
Member

iwanbk commented Dec 31, 2024

This was verified by taking a handle to the task and calling handle::abort() in Peer::died

Does that simulate Peer::died never being called?

And isn't the loop getting stuck exactly the expected scenario if Peer::died is never called?

@LeeSmet
Contributor Author

LeeSmet commented Dec 31, 2024

The task I mentioned has a Framed type with a Connection embedded in it. This Connection is a trait which represents either a TcpStream or a QUIC connection. Since every TcpStream instance uses up a file descriptor, the problem is that the task is stuck, so it never exits. As a result the Framed never gets dropped, which in turn means the TcpStream never gets dropped, so from the perspective of the OS it stays open and the file descriptor stays in use. If this goes on long enough, the process hits the allowed file descriptor limit, causing it to spam these errors.
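
Roughly, the ownership chain looks like this sketch (the codec here is a stand-in, not the actual mycelium packet codec):

use tokio::net::TcpStream;
use tokio_util::codec::{Framed, LengthDelimitedCodec};

// Hypothetical: the task owns the Framed, the Framed owns the stream, and the
// stream owns the file descriptor. Nothing is released until the task ends.
async fn connection_task(stream: TcpStream) {
    let mut framed = Framed::new(stream, LengthDelimitedCodec::new());

    // ... select! loop driving `framed` ...
    //
    // If this function never returns (the loop is stuck awaiting a write that
    // can no longer complete), `framed` is never dropped, the TcpStream inside
    // it is never closed, and the OS-level file descriptor leaks.
    let _ = &mut framed;
}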

@LeeSmet
Contributor Author

LeeSmet commented Dec 31, 2024

Peer::died is called in the router, but right now it just notifies a Notify. In the select, the last branch awaits this Notify. If it completes, the select loop exits and the task cleans up before being dropped. But because the task is stuck inside some other select branch body, the select is not evaluated, so it doesn't notice the Notify being ready, and is thus stuck forever.
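
For reference, a sketch of the abort-based workaround mentioned earlier (keeping the JoinHandle of the task spawned in Peer::new and aborting it from died); the field name and locking are assumptions:

use std::sync::Mutex;
use tokio::task::JoinHandle;

// Assumed extension of the peer's inner state: remember the connection task's
// handle so died() can forcibly stop it.
struct InnerWithHandle {
    task_handle: Mutex<Option<JoinHandle<()>>>,
}

impl InnerWithHandle {
    fn died(&self) {
        // Aborting the task drops everything it owns, including the Framed and
        // its underlying TcpStream, which closes the file descriptor.
        if let Some(handle) = self.task_handle.lock().unwrap().take() {
            handle.abort();
        }
    }
}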

@iwanbk
Member

iwanbk commented Dec 31, 2024

It seems this can get stuck when (presumably) forwarding packets to a peer, if the remote exited (also presumably) while there is still a bunch of data in the sockets

Before doing a big rewrite, can't we solve it using a timeout mechanism?
Whatever the flow, I think socket reads/writes with a timeout are a must.

At this point, I think it's probably best to rewrite this task as a manual future implementation/state machine

@iwanbk
Member

iwanbk commented Dec 31, 2024

Before doing a big rewrite, can't we solve it using a timeout mechanism?
Whatever the flow, I think socket reads/writes with a timeout are a must.

Oh, I forgot that these are async sockets.
It won't be as straightforward as with a sync API like Go's.

I found https://docs.rs/tokio-io-timeout/latest/tokio_io_timeout/ but I'm not really sure about it.

@iwanbk
Member

iwanbk commented Jan 6, 2025

@LeeSmet
I checked it again today.

In general, using a select with a cancellation branch and then awaiting inside a select branch body is a poor idea, as the select can only cancel the actual branch futures while they are being awaited: once a branch body is running, the other branches (including the cancellation one) are not polled.

At this point, I think it's probably best to rewrite this task as a manual future implementation/state machine.

Before going further and rewriting things, I think we could try to add a timeout to the operations suspected of getting stuck.

If I understand correctly, the calls that could get stuck are framed.next or framed.flush in:

select! {
    // Received over the TCP stream
    frame = framed.next() => {
        match frame {
            Some(Ok(packet)) => {
                match packet {
                    Packet::DataPacket(packet) => {
                        // An error here means the receiver is dropped/closed,
                        // this is not recoverable.
                        if let Err(error) = router_data_tx.send(packet).await {
                            error!("Error sending to to_routing_data: {}", error);
                            break
                        }
                    }
                    Packet::ControlPacket(packet) => {
                        if let Err(error) = router_control_tx.send((packet, peer.clone())) {
                            // An error here means the receiver is dropped/closed,
                            // this is not recoverable.
                            error!("Error sending to to_routing_control: {}", error);
                            break
                        }
                    }
                }
            }
            Some(Err(e)) => {
                error!("Frame error from {}: {e}", peer.connection_identifier());
                break;
            },
            None => {
                info!("Stream to {} is closed", peer.connection_identifier());
                break;
            }
        }
    }
    Some(packet) = from_routing_data.recv() => {
        if let Err(e) = framed.feed(Packet::DataPacket(packet)).await {
            error!("Failed to feed data packet to connection: {e}");
            break
        }
        for _ in 1..PACKET_COALESCE_WINDOW {
            // There can be 2 cases of errors here, empty channel and no more
            // senders. In both cases we don't really care at this point.
            if let Ok(packet) = from_routing_data.try_recv() {
                if let Err(e) = framed.feed(Packet::DataPacket(packet)).await {
                    error!("Failed to feed data packet to connection: {e}");
                    break
                }
                trace!("Instantly queued ready packet to transfer to peer");
            } else {
                // No packets ready, flush currently buffered ones
                break
            }
        }
        if let Err(e) = framed.flush().await {
            error!("Failed to flush buffered peer connection data packets: {e}");
            break
        }
    }
    Some(packet) = from_routing_control.recv() => {
        if let Err(e) = framed.feed(Packet::ControlPacket(packet)).await {
            error!("Failed to feed control packet to connection: {e}");
            break
        }
        for _ in 1..PACKET_COALESCE_WINDOW {
            // There can be 2 cases of errors here, empty channel and no more
            // senders. In both cases we don't really care at this point.
            if let Ok(packet) = from_routing_control.try_recv() {
                if let Err(e) = framed.feed(Packet::ControlPacket(packet)).await {
                    error!("Failed to feed data packet to connection: {e}");
                    break
                }
            } else {
                // No packets ready, flush currently buffered ones
                break
            }
        }
        if let Err(e) = framed.flush().await {
            error!("Failed to flush buffered peer connection control packets: {e}");
            break
        }
    }
    _ = death_watcher.notified() => {
        break;
    }
}

I think we can wrap them with tokio::time::timeout (https://docs.rs/tokio/latest/tokio/time/fn.timeout.html) and see what happens.

We also do that in 0-stor.
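
For example, a minimal sketch of wrapping a flush with a timeout; the helper name, the generic bounds, and the idea of treating a timeout as a dead connection are assumptions, not existing mycelium code:

use std::time::Duration;
use futures::{Sink, SinkExt};
use tokio::time::timeout;

// Hypothetical helper: flush a framed sink, but treat the connection as dead
// if the flush makes no progress within the window instead of awaiting forever.
// Returns true only if the flush succeeded in time.
async fn flush_with_timeout<S, I>(framed: &mut S, window: Duration) -> bool
where
    S: Sink<I> + Unpin,
{
    match timeout(window, framed.flush()).await {
        Ok(Ok(())) => true,     // flushed in time
        Ok(Err(_)) => false,    // sink error: the connection is unusable
        Err(_elapsed) => false, // timed out: the connection is presumably stuck
    }
}

In the select! above, the framed.flush().await (and framed.feed(...).await) calls could then break out of the loop on timeout, letting the task end so the Framed and its file descriptor get dropped.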
