
Potential file descriptor leak #529

Open
LeeSmet opened this issue Dec 23, 2024 · 13 comments
Assignees
Labels
type_bug Something isn't working

Comments

@LeeSmet
Contributor

LeeSmet commented Dec 23, 2024

It seems that some running nodes can leak file descriptors, seemingly because connections are not being closed properly. While we are observing growing file descriptor usage on public nodes (and a growing number of connections), it should be noted that these connections don't seem to be added as peers. A very small number of nodes have even managed to exhaust their allowed file descriptor limit, causing them to go into a hot loop of errors when (trying to) connect to a public node.

It feels like something in the router is going wrong when detecting whether a peer is dead, and in the subsequent cleanup. That is, the router presumably thinks the peer is dead and marks it as such, causing a reconnect by the peer manager. However, under the hood the peer connection is not properly cleaned up. This is possible because the connection itself is managed in the peer's Inner type, which is wrapped in an Arc, so as long as one Peer instance is not properly cleaned up, the Inner stays alive.
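
To illustrate the ownership pattern (the type and field names below are assumptions, not the actual mycelium definitions):

use std::net::TcpStream;
use std::sync::Arc;

// Assumed shape of the Peer/Inner split described above: the connection, and
// therefore the OS file descriptor, lives inside Inner.
struct Inner {
    connection: TcpStream,
}

#[derive(Clone)]
struct Peer {
    inner: Arc<Inner>,
}

// The TcpStream (and its file descriptor) is only released once every clone of
// the Arc is dropped, so a single Peer instance that is never cleaned up (for
// example one still owned by a stuck task) keeps the socket open indefinitely.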

Alternatively, there might be an issue in the peer manager logic.

We should also check if there is a panic somewhere causing a task to exit, preventing cleanup.

@LeeSmet LeeSmet added the type_bug Something isn't working label Dec 23, 2024
@iwanbk iwanbk self-assigned this Dec 23, 2024
@iwanbk
Member

iwanbk commented Dec 25, 2024

From reading the code, I found that died is probably missing in one place.
The died definition:

pub fn died(&self) {
    self.inner.alive.store(false, Ordering::Relaxed);
    self.inner.death_notifier.notify_one();
}

It is probably missing for this:

// If we have a new peer notify insert the new one in the router, then notify it that
// the old one is dead.
if let Some(p) = peer {
    let router = self.router.lock().unwrap();
    router.add_peer_interface(p);
    if let Some(old_peer) = old_peer_info
        .expect("We already checked the entry was occupied so this is always Some; qed")
        .pr
        .upgrade()
    {
        router.handle_dead_peer(old_peer);
    }
}

But I'm not really sure about this because:

  • I still don't have a way to reproduce this issue.
  • From the code comment, that code path is for the QUIC case, but the logs from @coesensbert only mention TCP.

Even if this is not the source of the issue, I think the died usage should be improved, because it is currently scattered across a few places.
From what I can see right now, a few code paths rely on Router::handle_dead_peer to do cleanup, so perhaps we could put died in that function as well, as in the sketch below.
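
A minimal sketch of that idea, with stand-in types (the Router internals and the existing cleanup steps are only hinted at; Peer is the type whose died method is shown above):

struct Router { /* routing tables, peer list, metrics, ... */ }

impl Router {
    fn handle_dead_peer(&self, peer: Peer) {
        // Always mark the peer as dead when the router cleans it up, so no
        // caller has to remember to call died before handing the peer over.
        peer.died();

        // ... existing cleanup: retract routes learned via this peer, remove
        // it from the peer list, update metrics, drop the reference ...
    }
}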

@LeeSmet
Contributor Author

LeeSmet commented Dec 26, 2024

It could be useful to put a call to died there as well, but it's indeed not the cause of the issue.

The general idea is that a peer which dies injects itself into a channel for the router, which eventually ends up in handle_dead_peer. died is a utility function which was added because not every connection type (looking at you, QUIC) can detect that it has been terminated. In general, this also leads to the current problem where peers can be removed and counted as "dead" twice in the metrics.
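
Roughly, that flow looks like the sketch below; the channel and function names are assumptions (reusing the stand-in Peer and Router types from the sketches above), not the actual mycelium identifiers:

use tokio::sync::mpsc;

// Assumed shape of the flow: a peer task notices its connection is gone and
// reports itself to the router, which eventually runs handle_dead_peer.
async fn peer_connection_task(peer: Peer, dead_peer_tx: mpsc::Sender<Peer>) {
    // ... drive the connection until it errors out or is closed ...

    // Hand ourselves to the router so it can clean up and reconnect.
    let _ = dead_peer_tx.send(peer).await;
}

async fn router_dead_peer_loop(router: Router, mut dead_peer_rx: mpsc::Receiver<Peer>) {
    while let Some(peer) = dead_peer_rx.recv().await {
        router.handle_dead_peer(peer);
    }
}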

@iwanbk
Member

iwanbk commented Dec 27, 2024

I found something interesting in the logs.

(couldn't download it, so I took a screenshot)
[screenshot of the log output]

I searched for mycelium: 2024-12-27T03:35:01 in the logs and found 757 occurrences.
(I tried other time spans and got a similar number, around ~750.)

Which means there are roughly 750 connection attempts to that node in one second, which is extremely high if it is not a public node.
Considering that the mycelium client only reconnects after 5 seconds, it means there are probably 750 * 5 = ~3750 clients trying to connect to that node, which doesn't look normal at all, no?

@coesensbert @LeeSmet
maybe you could help to check:

  1. whether that huge number is expected
  2. the possibility that it is not the mycelium client trying to connect to that node.

@LeeSmet
Contributor Author

LeeSmet commented Dec 31, 2024

After a bunch of debugging over the last couple of days, the issue is in the task which is spawned in Peer::new to manage the connection. It seems this can get stuck when (presumably) forwarding packets to a peer, if the remote exited (also presumably) while there is still a bunch of data in the sockets. This was verified by taking a handle to the task and calling handle::abort() in Peer::died (while also ensuring proper cleanup in the router).

In general, using a select with a cancellation branch and then awaiting inside a select branch body is a poor idea, as the select can only cancel the actual branch futures while they are being awaited: once a branch body is running, the other branches (including the cancellation one) are not polled.

At this point, I think it's probably best to rewrite this task as a manual future implementation/state machine.
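
A minimal sketch of that pitfall, with made-up channel and socket names rather than the actual mycelium code:

use std::sync::Arc;
use tokio::io::{AsyncWriteExt, WriteHalf};
use tokio::net::TcpStream;
use tokio::select;
use tokio::sync::{mpsc, Notify};

// Hypothetical task body showing how a cancellation branch can be missed.
async fn peer_loop(
    mut outgoing: mpsc::Receiver<Vec<u8>>,
    mut socket: WriteHalf<TcpStream>,
    death_watcher: Arc<Notify>,
) {
    loop {
        select! {
            Some(packet) = outgoing.recv() => {
                // This await runs *after* the select! has already picked this
                // branch. If it stays pending (e.g. the remote is gone but the
                // send buffer is full), the loop never reaches the select!
                // again, so the death_watcher branch below is never polled and
                // the task is stuck here forever.
                if socket.write_all(&packet).await.is_err() {
                    break;
                }
            }
            _ = death_watcher.notified() => {
                // Only reachable while the select! itself is being awaited.
                break;
            }
        }
    }
}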

@iwanbk
Member

iwanbk commented Dec 31, 2024

@LeeSmet and how does this relate to the above logs?

@iwanbk
Member

iwanbk commented Dec 31, 2024

This was verified by taking a handle to the task and calling handle::abort() in Peer::died

Does that simulate Peer::died never being called?

And isn't the loop getting stuck exactly the expected scenario if Peer::died is never called?

@LeeSmet
Contributor Author

LeeSmet commented Dec 31, 2024

The task I mentioned has a Framed type with a Connection embedded in it. This Connection is a trait which represents either a TcpStream or a QUIC connection. Since every TcpStream instance uses up a file descriptor, the problem is that the task is stuck, so it never exits. As a result the Framed never gets dropped, which in turn means the TcpStream never gets dropped, so from the perspective of the OS it stays open and the file descriptor stays in use. If this goes on long enough, the process hits the allowed file descriptor limit, causing it to spam these errors.
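
Roughly, the ownership chain looks like this sketch (the codec here is a stand-in, not the actual mycelium packet codec):

use tokio::net::TcpStream;
use tokio_util::codec::{Framed, LengthDelimitedCodec};

// Hypothetical: the task owns the Framed, the Framed owns the stream, and the
// stream owns the file descriptor. Nothing is released until the task ends.
async fn connection_task(stream: TcpStream) {
    let mut framed = Framed::new(stream, LengthDelimitedCodec::new());

    // ... select! loop driving `framed` ...
    //
    // If this function never returns (the loop is stuck awaiting a write that
    // can no longer complete), `framed` is never dropped, the TcpStream inside
    // it is never closed, and the OS-level file descriptor leaks.
    let _ = &mut framed;
}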

@LeeSmet
Contributor Author

LeeSmet commented Dec 31, 2024

Peer::died is called in the router, but right now it just notifies a Notify. In the select, the last branch awaits this Notify. If it completes, the select loop exits and the task cleans up before being dropped. But because the task is stuck inside some other select branch body, the select is not evaluated, so it doesn't notice the Notify being ready, and is thus stuck forever.
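
For reference, a sketch of the abort-based workaround mentioned earlier (keeping the JoinHandle of the task spawned in Peer::new and aborting it from died); the field name and locking are assumptions:

use std::sync::Mutex;
use tokio::task::JoinHandle;

// Assumed extension of the peer's inner state: remember the connection task's
// handle so died() can forcibly stop it.
struct InnerWithHandle {
    task_handle: Mutex<Option<JoinHandle<()>>>,
}

impl InnerWithHandle {
    fn died(&self) {
        // Aborting the task drops everything it owns, including the Framed and
        // its underlying TcpStream, which closes the file descriptor.
        if let Some(handle) = self.task_handle.lock().unwrap().take() {
            handle.abort();
        }
    }
}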

@iwanbk
Member

iwanbk commented Dec 31, 2024

It seems this can get stuck when (presumably) forwarding packets to a peer, if the remote exited (also presumably) while there is still a bunch of data in the sockets

Before doing a big rewrite, can't we solve it using a timeout mechanism?
Whatever the flow, I think socket reads/writes with a timeout are a must.

At this point, I think it's probably best to rewrite this task as a manual future implementation/state machine

@iwanbk
Member

iwanbk commented Dec 31, 2024

Before doing a big rewrite, can't we solve it using a timeout mechanism?
Whatever the flow, I think socket reads/writes with a timeout are a must.

Oh, I forgot that these are async sockets.
It won't be as straightforward as with a sync API like Go's.

I found https://docs.rs/tokio-io-timeout/latest/tokio_io_timeout/ but I'm not really sure about it.

@iwanbk
Member

iwanbk commented Jan 6, 2025

@LeeSmet
I checked it again today.

In general, using a select with a cancellation branch and then awaiting inside a select branch body is a poor idea, as the select can only cancel the actual branch futures while they are being awaited: once a branch body is running, the other branches (including the cancellation one) are not polled.

At this point, I think it's probably best to rewrite this task as a manual future implementation/state machine.

Before going further and rewriting things, I think we could try to add a timeout to the operations suspected of getting stuck.

If I understand correctly, the calls that could get stuck are framed.next or framed.flush in:

select! {
    // Received over the TCP stream
    frame = framed.next() => {
        match frame {
            Some(Ok(packet)) => {
                match packet {
                    Packet::DataPacket(packet) => {
                        // An error here means the receiver is dropped/closed,
                        // this is not recoverable.
                        if let Err(error) = router_data_tx.send(packet).await {
                            error!("Error sending to to_routing_data: {}", error);
                            break
                        }
                    }
                    Packet::ControlPacket(packet) => {
                        if let Err(error) = router_control_tx.send((packet, peer.clone())) {
                            // An error here means the receiver is dropped/closed,
                            // this is not recoverable.
                            error!("Error sending to to_routing_control: {}", error);
                            break
                        }
                    }
                }
            }
            Some(Err(e)) => {
                error!("Frame error from {}: {e}", peer.connection_identifier());
                break;
            },
            None => {
                info!("Stream to {} is closed", peer.connection_identifier());
                break;
            }
        }
    }
    Some(packet) = from_routing_data.recv() => {
        if let Err(e) = framed.feed(Packet::DataPacket(packet)).await {
            error!("Failed to feed data packet to connection: {e}");
            break
        }
        for _ in 1..PACKET_COALESCE_WINDOW {
            // There can be 2 cases of errors here, empty channel and no more
            // senders. In both cases we don't really care at this point.
            if let Ok(packet) = from_routing_data.try_recv() {
                if let Err(e) = framed.feed(Packet::DataPacket(packet)).await {
                    error!("Failed to feed data packet to connection: {e}");
                    break
                }
                trace!("Instantly queued ready packet to transfer to peer");
            } else {
                // No packets ready, flush currently buffered ones
                break
            }
        }
        if let Err(e) = framed.flush().await {
            error!("Failed to flush buffered peer connection data packets: {e}");
            break
        }
    }
    Some(packet) = from_routing_control.recv() => {
        if let Err(e) = framed.feed(Packet::ControlPacket(packet)).await {
            error!("Failed to feed control packet to connection: {e}");
            break
        }
        for _ in 1..PACKET_COALESCE_WINDOW {
            // There can be 2 cases of errors here, empty channel and no more
            // senders. In both cases we don't really care at this point.
            if let Ok(packet) = from_routing_control.try_recv() {
                if let Err(e) = framed.feed(Packet::ControlPacket(packet)).await {
                    error!("Failed to feed data packet to connection: {e}");
                    break
                }
            } else {
                // No packets ready, flush currently buffered ones
                break
            }
        }
        if let Err(e) = framed.flush().await {
            error!("Failed to flush buffered peer connection control packets: {e}");
            break
        }
    }
    _ = death_watcher.notified() => {
        break;
    }
}

I think we can wrap them with tokio::time::timeout (https://docs.rs/tokio/latest/tokio/time/fn.timeout.html) and see what happens.

We also do that in 0-stor.
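
For example, a minimal sketch of wrapping a flush with a timeout; the helper name, the generic bounds, and the idea of treating a timeout as a dead connection are assumptions, not existing mycelium code:

use std::time::Duration;
use futures::{Sink, SinkExt};
use tokio::time::timeout;

// Hypothetical helper: flush a framed sink, but treat the connection as dead
// if the flush makes no progress within the window instead of awaiting forever.
// Returns true only if the flush succeeded in time.
async fn flush_with_timeout<S, I>(framed: &mut S, window: Duration) -> bool
where
    S: Sink<I> + Unpin,
{
    match timeout(window, framed.flush()).await {
        Ok(Ok(())) => true,     // flushed in time
        Ok(Err(_)) => false,    // sink error: the connection is unusable
        Err(_elapsed) => false, // timed out: the connection is presumably stuck
    }
}

In the select! above, the framed.flush().await (and framed.feed(...).await) calls could then break out of the loop on timeout, letting the task end so the Framed and its file descriptor get dropped.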
