RPC connectivity stops for good in high traffic #9594

Open
moneromooo-monero opened this issue Nov 25, 2024 · 17 comments

@moneromooo-monero
Collaborator

moneromooo-monero commented Nov 25, 2024

I've been debugging it on and off in Townforge for quite a long time, as I thought it was specific to my changes, but I can actually get it to happen in Monero reliably. Townforge has quite heavy TF-specific functional tests, which trigger it reliably, and I got Monero to trigger it reliably by simply calling an RPC over and over, with this patch:

diff --git a/tests/functional_tests/daemon_info.py b/tests/functional_tests/daemon_info.py
index 9d645330d..94ef57c5f 100755
--- a/tests/functional_tests/daemon_info.py
+++ b/tests/functional_tests/daemon_info.py
@@ -50,7 +50,8 @@ class DaemonGetInfoTest():
         print('Test hard_fork_info')
 
         daemon = Daemon()
-        res = daemon.hard_fork_info()
+        while True:
+            res = daemon.hard_fork_info()
 
         # hard_fork version should be set at height 1
         assert 'earliest_height' in res.keys()
diff --git a/tests/functional_tests/functional_tests_rpc.py b/tests/functional_tests/functional_tests_rpc.py
index e483352a4..449512339 100755
--- a/tests/functional_tests/functional_tests_rpc.py
+++ b/tests/functional_tests/functional_tests_rpc.py
@@ -52,7 +52,7 @@ WALLET_DIRECTORY = builddir + "/functional-tests-directory"
 FUNCTIONAL_TESTS_DIRECTORY = builddir + "/tests/functional_tests"
 DIFFICULTY = 10
 
-monerod_base = [builddir + "/bin/monerod", "--regtest", "--fixed-difficulty", str(DIFFICULTY), "--no-igd", "--p2p-bind-port", "monerod_p2p_port", "--rpc-bind-port", "monerod_rpc_port", "--zmq-rpc-bind-port", "monerod_zmq_port", "--zmq-pub", "monerod_zmq_pub", "--non-interactive", "--disable-dns-checkpoints", "--check-updates", "disabled", "--rpc-ssl", "disabled", "--data-dir", "monerod_data_dir", "--log-level", "1"]
+monerod_base = [builddir + "/bin/monerod", "--regtest", "--fixed-difficulty", str(DIFFICULTY), "--no-igd", "--p2p-bind-port", "monerod_p2p_port", "--rpc-bind-port", "monerod_rpc_port", "--zmq-rpc-bind-port", "monerod_zmq_port", "--zmq-pub", "monerod_zmq_pub", "--non-interactive", "--disable-dns-checkpoints", "--check-updates", "disabled", "--rpc-ssl", "disabled", "--data-dir", "monerod_data_dir", "--log-level", "3"]
 monerod_extra = [
   ["--offline"],
   ["--rpc-payment-address", "44SKxxLQw929wRF6BA9paQ1EWFshNnKhXM3qz6Mo3JGDE2YG3xyzVutMStEicxbQGRfrYvAAYxH6Fe8rnD56EaNwUiqhcwR", "--rpc-payment-difficulty", str(DIFFICULTY), "--rpc-payment-credits", "5000", "--offline"],


Note that setting log level to 3 is needed here. Running Monero with log level 1 will not trigger it. In Townforge, log level 1 is fine, and log level 2 will trigger it fairly quickly. Monero with log level 3 will trigger it pretty much at once.
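
For anyone who wants to hammer the RPC without patching the functional tests, here is a minimal standalone sketch of the same loop. It assumes a daemon started with the options above and answering JSON-RPC at 127.0.0.1:18180 (the port is an assumption, adjust it to your setup). `rpc_call` and `hammer` are hypothetical helper names, not part of the Monero test framework.

```python
import json
import urllib.request

# Assumed endpoint: change host/port to wherever your monerod RPC is bound.
RPC_URL = "http://127.0.0.1:18180/json_rpc"

def rpc_call(method, url=RPC_URL, timeout=10):
    """Send one JSON-RPC request and return the decoded response."""
    body = json.dumps({"jsonrpc": "2.0", "id": "0", "method": method}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def hammer(call, max_calls=None):
    """Call `call()` repeatedly.

    Returns (number of successful calls, first connection error or None).
    """
    done = 0
    while max_calls is None or done < max_calls:
        try:
            call()
        except OSError as e:  # URLError, timeouts and socket errors are OSError
            return done, e
        done += 1
    return done, None
```

Calling `hammer(lambda: rpc_call("hard_fork_info"))` runs until the first connection error or timeout and reports how many calls succeeded before it.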

Once triggered, it never recovers. I tried adding recovery code in Townforge, to no avail (that may be because the underlying issue is not what I vaguely expect it to be).

The symptom is an exception in handle_accept, where a syscall returns EBADF. The socket is valid at the start of the function, and becomes invalid somewhere along the execution of handle_accept. AFAICT this is not a case of the connection being destroyed by another thread, but I'd be happy to be shown to be wrong there since it's the obvious inference.
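
For readers less familiar with the symptom, the sketch below shows how setting an option on a socket fails with EBADF once its descriptor has been closed underneath it. It does not reproduce the monerod bug and makes no claim about its cause. The helper name is hypothetical, the option used (SO_KEEPALIVE) is arbitrary, and the code assumes a POSIX system.

```python
import errno
import os
import socket

def setsockopt_after_close():
    """Close a socket's descriptor behind its back, then set an option on it.

    Returns the errno raised by setsockopt, or None if no error occurred.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        os.close(sock.fileno())  # simulate something else closing the fd
        try:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        except OSError as e:
            return e.errno
        return None
    finally:
        # Forget the stale fd so the socket object does not close it again.
        sock.detach()

if setsockopt_after_close() == errno.EBADF:
    print("setsockopt on a closed descriptor failed with EBADF")
```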

I've spent days on this over the months, I hope someone with more networking chops can have a try at it.

Note that there have been reports of RPC connectivity going down over the years; that's probably the same thing.

@0xFFFC0000
Collaborator

I can confirm this is happening. After (briefly) testing it, this is the call stack that consumes most of the computation time:

[image: screenshot of the call stack]

P.S. Take this information with a grain of salt. I will profile / debug this tomorrow.

@moneromooo-monero
Collaborator Author

To be clear, the issue isn't performance degradation due to heavy logging; it is the server no longer accepting connections after this:

ERROR net contrib/epee/include/net/abstract_tcp_server2.inl:1528 Exception in boosted_tcp_server<t_protocol_handler>::handle_accept: set_option: Bad file descriptor

Note that if you trace around, the EBADF might come from another function, set_option is just the most likely to get whacked.
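
To see which call gets hit in a given run, one option is to tally the function name that follows `handle_accept:` in the log. This is a rough sketch that assumes the log line format shown above; `ebadf_sources` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "...::handle_accept: set_option: Bad file descriptor"
EBADF_RE = re.compile(r"::handle_accept: (\w+): Bad file descriptor")

def ebadf_sources(lines):
    """Count which call reported EBADF inside handle_accept."""
    counts = Counter()
    for line in lines:
        m = EBADF_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

It accepts any iterable of lines, so an open log file can be passed directly, for example `ebadf_sources(open(path, errors="replace"))` with `path` pointing at your daemon log.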

@0xFFFC0000
Collaborator

In that case: I left it running for about 10 minutes, but I don't have any

Exception in boosted_tcp_server<t_protocol_handler>::handle_accept: set_option: Bad file descriptor

in my logs.

How long does it usually take for the exception to show up?

@moneromooo-monero
Collaborator Author

In three attempts: about 5 seconds, 5 seconds, and maybe 20 seconds, counted from when the servers were up and running. This is on master from a1dc85c.

@moneromooo-monero
Collaborator Author

Running:

./tests/functional_tests/functional_tests_rpc.py /usr/bin/python tests/functional_tests/ build/Linux/master/release/ daemon_info

@0xFFFC0000
Collaborator

I am hitting the infinite while loop correctly, but after 15 minutes of running I haven't been able to reproduce the Bad file descriptor error.

I will update you if anything comes up, and I will try it on a bare-metal machine too. I am running it in a VM right now.

@moneromooo-monero
Collaborator Author

I'm running on an old Fedora VM. I'll try setting up a more recent one later; it might be a dependency issue if you can't get it to happen.

@0xFFFC0000
Collaborator

I tried on a VM:

 $ >> cat /etc/os-release 
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"

@moneromooo-monero
Collaborator Author

Also happens pretty much instantly on Fedora 41, GCC 14.2.1.

@moneromooo-monero
Collaborator Author

Also Debian 12, GCC 12.2.0.

All of this is running in Qubes OS, so there might be something weird to do with Xen I guess, though it does seem a bit unlikely.

@vtnerd
Contributor

vtnerd commented Jan 6, 2025

#9459 might fix this

@tankf33der
Contributor

I wrote a short program in vlang to bomb (!) the RPC: I see something like a short DDoS, but without any crashes or stops.

@moneromooo-monero
Collaborator Author

> #9459 might fix this

It does not, though it does seem to last a bit longer before it dies.

@tankf33der
Contributor

While debugging and reproducing the huge wallet from #9405, I found that I cannot send huge transfer_split txs without --rpc-payment-allow-free-loopback on monerod.

@moneromooo-monero - you could try adding this flag to monerod_base and running it again.

@moneromooo-monero
Collaborator Author

This problem occurs without using the RPC payment system.

@tankf33der
Contributor

I reproduced the issue in my environment: Ubuntu 24.04.1 LTS under Docker, with all settings out of the box.
I manually compiled recent Monero master in ASAN mode.
daemon_info.py looped and crashed (?) with a long ASAN trace for Boost.
I won't show it for now.
The exception line from my log:

./monerod0/bitmonero.log:2025-01-14 13:09:00.717        [RPC1]  ERROR   net     contrib/epee/include/net/abstract_tcp_server2.inl:1528  Exception in boosted_tcp_server<t_protocol_handler>::handle_accept: local_endpoint: Bad file descriptor [system:9 at /usr/include/boost/asio/detail/reactive_socket_service.hpp:202 in function 'local_endpoint']

P.S.
Later, while writing this comment, I also experimented with launching, and the launch always fails with an ASAN error on Ctrl-C, though of a different type.

@jeffro256
Contributor

jeffro256 commented Jan 22, 2025

@moneromooo-monero Would you be willing to share the last 1000 lines of the monerod0.log file once it triggers please (or the full log)? By the way, I was not able to reproduce the error on bare-metal Linux Mint 21.3 after running it for a couple hours.
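
If the full log is too large to share, a small helper along these lines could cut out the lines leading up to the first trigger. This is only a sketch of one reading of the request: `lines_up_to_trigger` is a hypothetical name, and the marker string is taken from the error quoted earlier in this thread.

```python
from collections import deque

MARKER = "Exception in boosted_tcp_server"

def lines_up_to_trigger(lines, context=1000, marker=MARKER):
    """Return the last `context` lines ending at the first line with `marker`.

    Returns an empty list if the marker never appears.
    """
    window = deque(maxlen=context)  # keeps only the most recent lines
    for line in lines:
        window.append(line)
        if marker in line:
            return list(window)
    return []
```

Passing an open log file as `lines` streams it line by line, so the whole file never has to fit in memory.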
