When using shared memory communication, ucp_am_send_nbx hangs and callback not invoked #10370

Open
ivanallen opened this issue Dec 12, 2024 · 15 comments

@ivanallen

Describe the bug

We have written an RPC framework based on UCX active messages (AM). The client of the RPC framework uses ucp_am_send_nbx to send the request message; after the server processes the request, it sends the response back to the client, also via ucp_am_send_nbx.
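
For reference, the client's send path looks roughly like this (a minimal sketch; send_cb, rpc_send, and AM_ID_RPC are our own names, not part of UCX):

#include <ucp/api/ucp.h>

static void send_cb(void *request, ucs_status_t status, void *user_data)
{
    /* the completion callback that never fires when the hang occurs */
    ucp_request_free(request);
}

static ucs_status_t rpc_send(ucp_ep_h ep, unsigned am_id,
                             const void *header, size_t header_size,
                             const void *body, size_t body_size)
{
    ucp_request_param_t param = {0};

    param.op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK |
                         UCP_OP_ATTR_FIELD_FLAGS;
    param.cb.send      = send_cb;
    param.flags        = UCP_AM_SEND_FLAG_REPLY;

    ucs_status_ptr_t sp = ucp_am_send_nbx(ep, am_id, header, header_size,
                                          body, body_size, &param);
    if (UCS_PTR_IS_ERR(sp)) {
        return UCS_PTR_STATUS(sp); /* immediate failure */
    }
    /* sp == NULL: completed in place (send_cb is not invoked);
     * otherwise completion is reported through send_cb */
    return UCS_OK;
}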

While testing with UCX_TLS=shm,tcp using rpc_press (a benchmarking tool for the RPC framework), the tool could hang if I interrupted it with Ctrl-C and then ran rpc_press again.

At this point, if I use another simple echo client to send a request to the server, it also hangs and gets no response.

After analysis, we found that the client's ucp_am_send_nbx callback was likewise never invoked.

With UCX_TLS=tcp it works; with UCX_TLS=shm,tcp it does not.

To recover, the server must be restarted.

Steps to Reproduce

  • Command line

xrpc_server is a simple echo server based on UCX.

UCX_PROTO_INFO=y UCX_TCP_CM_REUSEADDR=y UCX_MM_ERROR_HANDLING=y ./build/linux/x86_64/releasedbg/xrpc_server --attachment_size=0 --xrpc_mem_resource=default --log_level info --attachment_size 0 --num_workers=8 --enable_wakeup_mode=false

rpc_press starts 32 threads that connect to xrpc_server and send requests.

UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=info ./build/linux/x86_64/releasedbg/rpc_press -proto ./examples/xrpc/demo/echo.proto -method=example.EchoService.Echo -server 10.16.26.223:8123 -input ./input.json -qps 0 --duration 190 -attachment_size 4096 --xrpc_mem_resource=default --thread_num 32 --qps 0

output:

[2024-12-12 11:28:33.623] [info] [channel.cc:629] desc:Connected to worker worker:7 host:10.16.26.223 port:8123
[1734002913.623916] [node223:41158:32]   +-------------------------------+----------------------------------------------------------------------------------+
[1734002913.623925] [node223:41158:32]   | xrpc_channel intra-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from iov[2] host memory    |
[1734002913.623930] [node223:41158:32]   +-------------------------------+--------------------------------------------------------------------+-------------+
[1734002913.623934] [node223:41158:32]   |                       0..8240 | copy-in                                                            | sysv/memory |
[1734002913.623939] [node223:41158:32]   |                  8241..593437 | multi-frag copy-in                                                 | sysv/memory |
[1734002913.623943] [node223:41158:32]   |                   593438..inf | (?) rendezvous fragmented copy-in copy-out                         | sysv/memory |
[1734002913.623948] [node223:41158:32]   +-------------------------------+--------------------------------------------------------------------+-------------+
2024/12/12-11:28:34     sent:1532179   success:1532205   error:0     total_error:0         total_sent:1839265
2024/12/12-11:28:35     sent:2371318   success:2371316   error:0     total_error:0         total_sent:4210583
2024/12/12-11:28:36     sent:2404221   success:2404222   error:0     total_error:0         total_sent:6614804
2024/12/12-11:28:37     sent:2400176   success:2400176   error:0     total_error:0         total_sent:9014980
2024/12/12-11:28:38     sent:2397321   success:2397320   error:0     total_error:0         total_sent:11412301
2024/12/12-11:28:39     sent:2330092   success:2330096   error:0     total_error:0         total_sent:13742393
[Latency]
  avg             23 us
  50%             12 us
  70%             13 us
  90%             15 us
  95%             16 us
  97%             17 us
  99%             20 us
  99.9%           31 us
  99.99%         549 us
  max        4586555 us
^C

Start rpc_press again, and it hangs!

UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=info ./build/linux/x86_64/releasedbg/rpc_press -proto ./examples/xrpc/demo/echo.proto -method=example.EchoService.Echo -server 10.16.26.223:8123 -input ./input.json -qps 0 --duration 190 -attachment_size 4096 --xrpc_mem_resource=default --thread_num 32 --qps 0

Using xrpc_client to send a request hangs too!

UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=data ./build/linux/x86_64/releasedbg/xrpc_client --host 10.16.26.223 --port 8123
  • UCX version used: UCX 1.17

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
Linux node223 4.18.0-305.3.1.el8_4.x86_64 #1 SMP Thu Jun 17 07:52:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

ucx log:
ucx_log.txt

@ivanallen ivanallen added the Bug label Dec 12, 2024
@yosefe
Contributor

yosefe commented Dec 12, 2024

Hi, in order to fully support Ctrl-C, the UCX connection should be established using an IP address. For example, see https://github.com/openucx/ucx/blob/master/examples/ucp_client_server.c and:

void UcxConnection::connect(const struct sockaddr *src_saddr,
                            const struct sockaddr *dst_saddr,
                            socklen_t addrlen,
                            UcxCallback *callback)
{
    set_log_prefix(dst_saddr, addrlen);

    ucp_ep_params_t ep_params;
    ep_params.field_mask       = UCP_EP_PARAM_FIELD_FLAGS |
                                 UCP_EP_PARAM_FIELD_SOCK_ADDR;
    ep_params.flags            = UCP_EP_PARAMS_FLAGS_CLIENT_SERVER;
    ep_params.sockaddr.addr    = dst_saddr;
    ep_params.sockaddr.addrlen = addrlen;

    if (src_saddr != NULL) {
        ep_params.field_mask            |= UCP_EP_PARAM_FIELD_LOCAL_SOCK_ADDR;
        ep_params.local_sockaddr.addr    = src_saddr;
        ep_params.local_sockaddr.addrlen = addrlen;
    }

    if (_context._client_id != UcxContext::CLIENT_ID_UNDEFINED) {
        ep_params.flags |= UCP_EP_PARAMS_FLAGS_SEND_CLIENT_ID;
    }

    char sockaddr_str[UCS_SOCKADDR_STRING_LEN];
    UCX_CONN_LOG << "Connecting to "
                 << ucs_sockaddr_str(dst_saddr, sockaddr_str,
                                     UCS_SOCKADDR_STRING_LEN);

    connect_common(ep_params, callback);
}

Is this what is being done in the RPC framework?

@ivanallen
Author

ivanallen commented Dec 13, 2024

@yosefe Thank you for your attention to my question.

Strictly speaking, I use the same method as ucp_hello_world and ucx_perftest.

First, the client uses an extra out-of-band (OOB) TCP connection to request the addresses of all the server's workers, and then uses them to create all the ucp_ep objects:

_params.field_mask |= UCP_EP_PARAM_FIELD_REMOTE_ADDRESS;
_params.address     = worker_address;

The client establishes a full mesh of connections with all the server workers so that it can hash each request to a specific worker thread for processing.
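
In full, the endpoint creation looks roughly like this (a sketch; worker and worker_address come from our OOB exchange, and error handling is trimmed):

ucp_ep_params_t ep_params = {0};

ep_params.field_mask = UCP_EP_PARAM_FIELD_REMOTE_ADDRESS;
ep_params.address    = (const ucp_address_t*)worker_address;

ucp_ep_h ep;
ucs_status_t status = ucp_ep_create(worker, &ep_params, &ep);
if (status != UCS_OK) {
    /* connection setup failed */
}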

[image: diagram of the client workers fully connected to the server workers]

Does this method not support using Ctrl-C to interrupt the process?
If not, should I create ucp_listeners on different ports for all threads?

@ivanallen ivanallen changed the title When using shared memory communication, ucp_am_send_nbx hangs and no callback not invoked When using shared memory communication, ucp_am_send_nbx hangs and callback not invoked Dec 13, 2024
@yosefe
Contributor

yosefe commented Dec 13, 2024

Does this method not support using Ctrl-C to interrupt the process?
If not, should I create ucp_listeners on different ports for all threads?

No, it doesn't.
In the above use case, you could still create a single listener object and, when it accepts a new conn_request, create the server endpoint on a worker of your choice.
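
Roughly like this (a sketch; server_ctx_t and pick_worker are assumed application names, not UCX API):

static void conn_handler(ucp_conn_request_h conn_request, void *arg)
{
    server_ctx_t *ctx    = arg;              /* hypothetical app context */
    ucp_worker_h  worker = pick_worker(ctx); /* e.g. round-robin choice  */

    ucp_ep_params_t ep_params = {0};
    ep_params.field_mask   = UCP_EP_PARAM_FIELD_CONN_REQUEST;
    ep_params.conn_request = conn_request;

    ucp_ep_h ep;
    if (ucp_ep_create(worker, &ep_params, &ep) != UCS_OK) {
        ucp_listener_reject(ctx->listener, conn_request);
    }
}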

@ivanallen
Author

Does this method not support using Ctrl-C to interrupt the process?
If not, should I create ucp_listeners on different ports for all threads?

No, it doesn't. In the above use case, you could still create a single listener object and, when it accepts a new conn_request, create the server endpoint on a worker of your choice.

Hi @yosefe, thanks for answering.

But by doing so, how do I hash a request to a specific worker? I cannot know which server worker corresponds to the ep created on the client. Do you have any good ideas?

@yosefe
Contributor

yosefe commented Dec 13, 2024

@ivanallen Two possible options are to extract the client address or the client_id from the connection request; see https://github.com/openucx/ucx/blob/fa01cca77754edb7dd510190640c5feb9fb2b366/src/ucp/api/ucp.h#L1520C3-L1520C26
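
For example, inside the listener's connection callback (a sketch):

ucp_conn_request_attr_t attr;

attr.field_mask = UCP_CONN_REQUEST_ATTR_FIELD_CLIENT_ADDR |
                  UCP_CONN_REQUEST_ATTR_FIELD_CLIENT_ID;
if (ucp_conn_request_query(conn_request, &attr) == UCS_OK) {
    /* attr.client_address holds the client's sockaddr, and
     * attr.client_id holds the value the client set in
     * ucp_ep_params_t::client_id (sent when the client passes
     * UCP_EP_PARAMS_FLAGS_SEND_CLIENT_ID) */
}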

@ivanallen
Author

@ivanallen Two possible options are to extract the client address or the client_id from the connection request; see https://github.com/openucx/ucx/blob/fa01cca77754edb7dd510190640c5feb9fb2b366/src/ucp/api/ucp.h#L1520C3-L1520C26

Hi @yosefe

Yes, I know this API.
But this only tells the server which client worker a request is coming from. As a client, I still don't know how to send a request to a specific worker on the server.

For example, in the following picture, worker0 in the client would like to send a request to worker0 in the server.
[image: diagram of client worker0 targeting server worker0]

@yosefe
Contributor

yosefe commented Dec 13, 2024

I see. So, to give the client precise control, there can be multiple listeners on different ports.

@ivanallen
Author

I see. So, to give the client precise control, there can be multiple listeners on different ports.

@yosefe Thank you very much.

Finally, I would like to ask why Ctrl-C is not supported for connections that use the worker address. Is there a specific concern behind this? Or is there any material I can refer to?

@yosefe
Contributor

yosefe commented Dec 13, 2024

I see. So, to give the client precise control, there can be multiple listeners on different ports.

@yosefe Thank you very much.

Finally, I would like to ask why Ctrl-C is not supported for connections that use the worker address. Is there a specific concern behind this? Or is there any material I can refer to?

In order to respond to Ctrl-C, there must be a kernel-based "channel" through which a FIN message is sent, letting the other side detect that the connection is closed. By the current design, that kind of channel is created in UCX only with IP/port-based connection establishment.
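
For reference, the peer-failure detection this channel enables is hooked up through the endpoint's error handler (a sketch; err_cb is an application callback):

static void err_cb(void *arg, ucp_ep_h ep, ucs_status_t status)
{
    /* invoked e.g. with UCS_ERR_CONNECTION_RESET when the peer goes
     * away, which is how a Ctrl-C on the other side becomes visible */
}

ucp_ep_params_t ep_params = {0};

ep_params.field_mask      = UCP_EP_PARAM_FIELD_FLAGS |
                            UCP_EP_PARAM_FIELD_SOCK_ADDR |
                            UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE |
                            UCP_EP_PARAM_FIELD_ERR_HANDLER;
ep_params.flags           = UCP_EP_PARAMS_FLAGS_CLIENT_SERVER;
ep_params.err_mode        = UCP_ERR_HANDLING_MODE_PEER;
ep_params.err_handler.cb  = err_cb;
ep_params.err_handler.arg = NULL;
/* plus sockaddr.addr / sockaddr.addrlen as in the snippet above */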

@ivanallen
Author

ivanallen commented Dec 17, 2024

Hi @yosefe

I have updated the code and upgraded to 1.18.0-rc2, but the problem still exists, with one difference.

Now the ucp_am_send_nbx callback on the client side does get executed, but the AM handler on the server side is never triggered.

Among the server's eight workers, there are always some that cannot receive requests. This only occurs at extremely high QPS.

The following is the log (see the attachment for the UCX log). You can see that the client's request_id:1 is sent successfully, but request_id:2 gets no response.

request_id:1 goes to worker0 on the server; that channel is normal. But request_id:2 uses worker1 on the server, and that channel does not respond. (We use round-robin to send requests to different server workers.)
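
For context, the server registers its AM handler roughly like this (a sketch; on_rpc_request, AM_ID_RPC, and worker_ctx are our names):

static ucs_status_t on_rpc_request(void *arg, const void *header,
                                   size_t header_length, void *data,
                                   size_t length,
                                   const ucp_am_recv_param_t *param)
{
    /* this handler is never invoked on the stuck worker */
    return UCS_OK;
}

ucp_am_handler_param_t am_param = {0};

am_param.field_mask = UCP_AM_HANDLER_PARAM_FIELD_ID |
                      UCP_AM_HANDLER_PARAM_FIELD_CB |
                      UCP_AM_HANDLER_PARAM_FIELD_ARG;
am_param.id         = AM_ID_RPC;
am_param.cb         = on_rpc_request;
am_param.arg        = worker_ctx;

ucp_worker_set_am_recv_handler(worker, &am_param);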

client log:

[2024-12-17 04:15:16.065] [66723] [] [debug] channel.cc:238 desc:Received AM RPC message header_size:16 data_size:15
[2024-12-17 04:15:16.066] [66723] [] [debug] channel.cc:276 desc:Received RPC message body_size:15 is_rndv:false rpc_meta:response {
}
correlation_id: 1

[2024-12-17 04:15:16.066] [66723] [] [info] client.cc:104 request_id:1 sync test latency:740707µs actual:740836181ns
[2024-12-17 04:15:16.066] [66723] [] [info] client.cc:112 request_id:1 Received response:Hello, world!
[2024-12-17 04:15:16.066] [66723] [] [debug] channel.cc:427 desc:Sending RPC message rpc_header:RpcHeader{magic:1129467992, meta_size:53, body_size:41} rpc_meta:request {
  service_name: "example.EchoService"
  method_name: "example.EchoService.Echo"
}
correlation_id: 2
attachment_size: 26

[2024-12-17 04:15:18.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:2s
[2024-12-17 04:15:19.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:3s
[2024-12-17 04:15:20.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:4s

server log:

[2024-12-17 04:15:15.325] [66632] [] [info] message_server.cc:42 desc:session come in remote:10.16.26.223:49326
[2024-12-17 04:15:15.325] [66632] [] [info] server.cc:129 desc:Handshake message received msg:hello xrpc
[2024-12-17 04:15:15.473] [66634] [] [info] worker.cc:101 desc:connection request received client:[email protected]:39326
[2024-12-17 04:15:15.480] [66637] [] [info] worker.cc:101 desc:connection request received client:[email protected]:49514
[2024-12-17 04:15:15.495] [66636] [] [info] worker.cc:101 desc:connection request received client:[email protected]:49790
[2024-12-17 04:15:15.510] [66638] [] [info] worker.cc:101 desc:connection request received client:[email protected]:36860
[2024-12-17 04:15:15.524] [66635] [] [info] worker.cc:101 desc:connection request received client:[email protected]:33430
[2024-12-17 04:15:15.539] [66633] [] [info] worker.cc:101 desc:connection request received client:[email protected]:41722
[2024-12-17 04:15:15.558] [66640] [] [info] worker.cc:101 desc:connection request received client:[email protected]:57184
[2024-12-17 04:15:15.578] [66639] [] [info] worker.cc:101 desc:connection request received client:[email protected]:38456
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:194 desc:Received RPC message data_length:41 data_ptr:0x7fbe9036fec8 is_rndv:false rpc_meta:request {
  service_name: "example.EchoService"
  method_name: "example.EchoService.Echo"
}
correlation_id: 1
attachment_size: 26

[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:362 desc:return RPC message rpc_header:RpcHeader{magic:1129467992, meta_size:4, body_size:15} worker_id:0 request_id:1 states:RECV,RECV_DONE rpc_meta:response {
}
correlation_id: 1

[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:422 desc:client specified send flags send_flags:0
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:412 desc:RPC message return success worker_id:0 request_id:1 states:SEND,RECV,SEND_DONE,RECV_DONE
[2024-12-17 04:15:21.041] [66639] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:38456
[2024-12-17 04:15:21.041] [66640] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:57184
[2024-12-17 04:15:21.041] [66635] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:33430
[2024-12-17 04:15:21.041] [66634] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:39326
[2024-12-17 04:15:21.041] [66636] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:49790
[2024-12-17 04:15:21.041] [66637] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:49514
[2024-12-17 04:15:21.041] [66638] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:36860
[2024-12-17 04:15:21.041] [66633] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:41722
[2024-12-17 04:15:21.041] [66632] [] [info] message_server.cc:38 desc:session close remote:10.16.26.223:49326

client ucx log:
ucx.log2.txt

@ivanallen
Author

ivanallen commented Dec 28, 2024

(gdb) p iface->read_index_elem[0]
$23 = {
  flags = 1 '\001',
  am_id = 26 '\032',
  length = 4193,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5265344
  },
  desc_data = 0x7f2d9a5057c0
}
(gdb) p iface->read_index_elem[1]
$24 = {
  flags = 0 '\000',
  am_id = 0 '\000',
  length = 0,
  desc = {
    seg_id = 0,
    seg_size = 0,
    offset = 0
  },
  desc_data = 0x0
}
(gdb) p iface->recv_fifo_ctl[0]
$25 = {
  head = 239249,
  signal_addrlen = 8,
  signal_sockaddr = {
    sun_family = 1,
    sun_path = "\000\060\062c9b", '\000' <repeats 101 times>
  },
  pad0 = "\000\000\000\000\000",
  tail = 239104,
  pid = 87383
}
(gdb) p iface->read_index
$26 = 239204
(gdb)
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)(((iface->read_index & iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$28 = {
  flags = 1 '\001',
  am_id = 26 '\032',
  length = 4193,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5265344
  },
  desc_data = 0x7f2d9a5057c0
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 1)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$29 = {
  flags = 0 '\000',
  am_id = 1 '\001',
  length = 77,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 4904832
  },
  desc_data = 0x7f2d9a4ad780
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 2)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$30 = {
  flags = 0 '\000',
  am_id = 1 '\001',
  length = 30,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5055744
  },
  desc_data = 0x7f2d9a4d2500
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 3)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$31 = {
  flags = 0 '\000',
  am_id = 26 '\032',
  length = 4191,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 6103744
  },
  desc_data = 0x7f2d9a5d22c0
}
(gdb) p iface[0]
$32 = {
  super = {
    super = {
      super = {
        ops = {
          ep_put_short = 0x5407a0 <uct_sm_ep_put_short>,
          ep_put_bcopy = 0x5407d0 <uct_sm_ep_put_bcopy>,
          ep_put_zcopy = 0x0,
          ep_get_short = 0x0,
          ep_get_bcopy = 0x5407e0 <uct_sm_ep_get_bcopy>,
          ep_get_zcopy = 0x0,
          ep_am_short = 0x543250 <uct_mm_ep_am_short>,
          ep_am_short_iov = 0x543470 <uct_mm_ep_am_short_iov>,
          ep_am_bcopy = 0x543750 <uct_mm_ep_am_bcopy>,
          ep_am_zcopy = 0x0,
          ep_atomic_cswap64 = 0x540a50 <uct_sm_ep_atomic_cswap64>,
          ep_atomic_cswap32 = 0x540a70 <uct_sm_ep_atomic_cswap32>,
          ep_atomic32_post = 0x540810 <uct_sm_ep_atomic32_post>,
          ep_atomic64_post = 0x540880 <uct_sm_ep_atomic64_post>,
          ep_atomic32_fetch = 0x5409a0 <uct_sm_ep_atomic32_fetch>,
          ep_atomic64_fetch = 0x5408f0 <uct_sm_ep_atomic64_fetch>,
          ep_tag_eager_short = 0x0,
          ep_tag_eager_bcopy = 0x0,
          ep_tag_eager_zcopy = 0x0,
          ep_tag_rndv_zcopy = 0x0,
          ep_tag_rndv_cancel = 0x0,
          ep_tag_rndv_request = 0x0,
          iface_tag_recv_zcopy = 0x0,
          iface_tag_recv_cancel = 0x0,
          ep_pending_add = 0x543a10 <uct_mm_ep_pending_add>,
          ep_pending_purge = 0x543b40 <uct_mm_ep_pending_purge>,
          ep_flush = 0x543ba0 <uct_mm_ep_flush>,
          ep_fence = 0x540c00 <uct_sm_ep_fence>,
          ep_check = 0x543bf0 <uct_mm_ep_check>,
          ep_create = 0x543170 <uct_mm_ep_t_new>,
          ep_connect = 0x0,
          ep_disconnect = 0x0,
          cm_ep_conn_notify = 0x0,
          ep_destroy = 0x543220 <uct_mm_ep_t_delete>,
          ep_get_address = 0x0,
          ep_connect_to_ep = 0x0,
          iface_accept = 0x0,
          iface_reject = 0x0,
          iface_flush = 0x540c60 <uct_mm_iface_flush>,
          iface_fence = 0x540bf0 <uct_sm_iface_fence>,
          iface_progress_enable = 0x53dfa0 <uct_base_iface_progress_enable>,
          iface_progress_disable = 0x53dfb0 <uct_base_iface_progress_disable>,
          iface_progress = 0x541390 <uct_mm_iface_progress>,
          iface_event_fd_get = 0x540c80 <uct_mm_iface_event_fd_get>,
          iface_event_arm = 0x5410a0 <uct_mm_iface_event_fd_arm>,
          iface_close = 0x541070 <uct_mm_iface_t_delete>,
          iface_query = 0x540ea0 <uct_mm_iface_query>,
          iface_get_device_address = 0x540bc0 <uct_sm_iface_get_device_address>,
          iface_get_address = 0x540c30 <uct_mm_iface_get_address>,
          iface_is_reachable = 0x53dc20 <uct_base_iface_is_reachable>
        }
      },
      internal_ops = 0x85c020 <uct_mm_iface_internal_ops>,
--Type <RET> for more, q to quit, c to continue without paging--
      md = 0x25f7280,
      worker = 0x27fabb0,
      am = {{
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x0,
          flags = 4
        }, {
          cb = 0x50b7b0 <ucp_wireup_msg_handler>,
          arg = 0x305a000,
          flags = 4
        }, {
          cb = 0x4d4dc0 <ucp_eager_only_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d5ee0 <ucp_eager_first_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d7230 <ucp_eager_middle_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x5,
          flags = 4
        }, {
          cb = 0x4d5650 <ucp_eager_sync_only_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d6880 <ucp_eager_sync_first_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d4460 <ucp_eager_sync_ack_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4b9f20 <ucp_rndv_rts_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4bfe40 <ucp_rndv_ats_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4c3c10 <ucp_rndv_rtr_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4bf150 <ucp_rndv_data_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0xd,
          flags = 4
--Type <RET> for more, q to quit, c to continue without paging--
        }, {
          cb = 0x4d8320 <ucp_eager_offload_sync_ack_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0xf,
          flags = 4
        }, {
          cb = 0x4bc5f0 <ucp_rndv_atp_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4a0b30 <ucp_put_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x49e810 <ucp_get_req_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x49fe80 <ucp_get_rep_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x14,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x15,
          flags = 4
        }, {
          cb = 0x49eb50 <ucp_rma_cmpl_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x446300 <ucp_am_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x445530 <ucp_am_long_first_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4492a0 <ucp_am_long_middle_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x448b90 <ucp_am_handler_reply>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1b,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
--Type <RET> for more, q to quit, c to continue without paging--
          arg = 0x1c,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1d,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1e,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1f,
          flags = 4
        }},
      am_tracer = 0x46cd40 <ucp_worker_am_tracer>,
      am_tracer_arg = 0x305a000,
      err_handler = 0x477b70 <ucp_worker_iface_error_handler>,
      err_handler_arg = 0x305a000,
      err_handler_flags = 4,
      prog = {
        id = 1,
        refcount = 0
      },
      progress_flags = 3,
      config = {
        num_alloc_methods = 3,
        alloc_methods = {UCT_ALLOC_METHOD_MD, UCT_ALLOC_METHOD_MMAP, UCT_ALLOC_METHOD_HEAP, UCT_ALLOC_METHOD_THP, UCT_ALLOC_METHOD_THP},
        failure_level = UCS_LOG_LEVEL_DIAG,
        max_num_eps = 18446744073709551615
      }
    },
    config = {
      bandwidth = 16106127360
    }
  },
  recv_fifo_mem = {
    address = 0x7f2d9b214000,
    length = 36864,
    method = UCT_ALLOC_METHOD_MD,
    mem_type = UCS_MEMORY_TYPE_HOST,
    md = 0x25f7280,
    memh = 0x25f6680
  },
  recv_fifo_ctl = 0x7f2d9b214000,
  recv_fifo_elems = 0x7f2d9b2140c0,
  read_index_elem = 0x7f2d9b2172c0,
  read_index = 239204,
  fifo_shift = 8 '\b',
  fifo_mask = 255,
  fifo_release_factor_mask = 127,
  fifo_poll_count = 1,
  fifo_prev_wnd_cons = 0,
  recv_desc_mp = {
    freelist = 0x7f2d9a588750,
    data = 0x261b420
  },
  last_recv_desc = 0x7f2d9a4c6018,
--Type <RET> for more, q to quit, c to continue without paging--
  signal_fd = 53,
  rx_headroom = 80,
  arbiter = {
    list = {
      prev = 0x2655790,
      next = 0x2655790
    }
  },
  release_desc = {
    cb = 0x540df0 <uct_mm_iface_release_desc>
  },
  config = {
    fifo_size = 256,
    fifo_elem_size = 128,
    seg_size = 8256,
    fifo_max_poll = 16,
    extra_cap_flags = 274877906944,
    overhead = {
      send = {
        am_short = 1e-08,
        am_bcopy = 1e-08
      },
      recv = {
        am_short = 1e-08,
        am_bcopy = 1e-08
      }
    }
  }
}
(gdb)

(gdb) p iface->config
$50 = {
  fifo_size = 256,
  fifo_elem_size = 128,
  seg_size = 8256,
  fifo_max_poll = 16,
  extra_cap_flags = 274877906944,
  overhead = {
    send = {
      am_short = 1e-08,
      am_bcopy = 1e-08
    },
    recv = {
      am_short = 1e-08,
      am_bcopy = 1e-08
    }
  }
}
(gdb) p payload_offset
No symbol "payload_offset" in current context.
(gdb) p iface->rx_headroom
$51 = 80
(gdb) p iface->recv_fifo_mem
$52 = {
  address = 0x7f2d9b214000,
  length = 36864,
  method = UCT_ALLOC_METHOD_MD,
  mem_type = UCS_MEMORY_TYPE_HOST,
  md = 0x25f7280,
  memh = 0x25f6680
}
(gdb) p iface->recv_fifo_mem.memh
$53 = (uct_mem_h) 0x25f6680
(gdb) p (uct_mm_seg_t*)iface->recv_fifo_mem.memh
$54 = (uct_mm_seg_t *) 0x25f6680
(gdb) p *(uct_mm_seg_t*)iface->recv_fifo_mem.memh
$55 = {
  seg_id = 9732101,
  address = 0x7f2d9b214000,
  length = 36864
}
(gdb)
[root@node223 xrpc]# ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 9732099    root       660        36864      1          dest
0x00000000 9732100    root       660        6291456    1          dest
0x00000000 9732101    root       660        36864      1          dest
0x00000000 9732102    root       660        6291456    1          dest
0x00000000 9732103    root       660        36864      1          dest
0x00000000 9732104    root       660        6291456    1          dest
0x00000000 9732105    root       660        36864      1          dest
0x00000000 9732106    root       660        6291456    1          dest
0x00000000 9732107    root       660        36864      1          dest
0x00000000 9732108    root       660        6291456    1          dest
0x00000000 9732109    root       660        36864      1          dest
0x00000000 9732110    root       660        6291456    1          dest
0x00000000 9732111    root       660        36864      1          dest
0x00000000 9732112    root       660        6291456    1          dest
0x00000000 9732113    root       660        36864      1          dest
0x00000000 9732114    root       660        6291456    1          dest

@ivanallen
Copy link
Author

ivanallen commented Dec 28, 2024

Hi @yosefe, I've got some new clues.

I printed the server's FIFO data through gdb and found that uct_mm_iface_fifo_has_new_data evaluates to false for the element at read_index, so new requests that later entered the queue can never be processed: for this lap of the ring the reader expects an owner bit of (239204 >> 8) & 1 = 0, but the element's flags still carry the previous lap's owner bit 1, so the reader keeps treating the slot as not yet written.

(gdb) p iface->read_index_elem[0]
$23 = {
  flags = 1 '\001',
  am_id = 26 '\032',
  length = 4193,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5265344
  },
  desc_data = 0x7f2d9a5057c0
}

(gdb) p iface->read_index
$26 = 239204
static UCS_F_ALWAYS_INLINE int
uct_mm_iface_fifo_has_new_data(uct_mm_iface_t *iface)
{
    // ((iface->read_index >> iface->fifo_shift) & 1) => (239204 >> 8) & 1 =>  0
    // (iface->read_index_elem->flags & 1) => 1 & 1 => 1
    // return false
    return (((iface->read_index >> iface->fifo_shift) & 1) ==
            (iface->read_index_elem->flags & 1));
}

@ivanallen
Author

ivanallen commented Dec 28, 2024

Hi @yosefe

I think I have found the root cause of the problem.

if (head & iface->config.fifo_size) {
elem_flags |= UCT_MM_FIFO_ELEM_FLAG_OWNER;
}
elem->flags = elem_flags;

When the client crashes before line 378 executes, the statement elem->flags = elem_flags; never runs, so the slot keeps the owner bit from the previous lap and is left as an uncommitted request.
Therefore, when a new client continues to submit elements after it, the server does not respond.

Should UCX add a check to detect that the process that owns the element no longer exists, so that processing of the element can be skipped? Or is there some other solution?
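
To make the idea concrete, something along these lines (purely a sketch; the current uct_mm_fifo_element_t does not record its writer's pid, so writer_pid is a hypothetical field):

#include <errno.h>
#include <signal.h>

/* hypothetical: requires the sender to stamp its pid into the slot */
static int elem_writer_died(pid_t writer_pid)
{
    return (kill(writer_pid, 0) != 0) && (errno == ESRCH);
}

/* in the receive progress path, when the owner-bit check keeps failing: */
if (!uct_mm_iface_fifo_has_new_data(iface) &&
    elem_writer_died(elem->writer_pid /* hypothetical field */)) {
    iface->read_index++; /* skip the slot that was never committed
                          * (the cached read_index_elem would need to
                          * be advanced accordingly) */
}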


I can reproduce the problem reliably with fault injection.

  • start server
UCX_TCP_CM_REUSEADDR=y UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -c am -s 4096
  • start client with gdb
UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y /opt/rh/gcc-toolset-12/root/bin/gdb --args ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096

(gdb) b mm_ep.c:378 if elem->am_id == 23
No source file named mm_ep.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mm_ep.c:367 if elem->am_id == 23) pending.
(gdb) r
Thread 1 "ucp_client_serv" hit Breakpoint 1, uct_mm_ep_am_common_send (send_op=UCT_MM_SEND_AM_BCOPY, ep=0x525bb0, iface=0x41fd60, am_id=23 '\027', length=4104, header=0, payload=0x0, pack_cb=0x7ffff7d4b313 <ucp_am_eager_single_bcopy_pack>, arg=0x43b080, iov=0x0, iovcnt=0, flags=0) at sm/mm/base/mm_ep.c:378
378         elem->flags = elem_flags;
(gdb) quit
A debugging session is active.

        Inferior 1 [process 2553049] will be killed.

Quit anyway? (y or n) y
  • start client again without gdb
UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096

[root@10-16-153-78 ucx-1.18.0]# UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096
[1735531853.469238] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------+
[1735531853.469275] [10-16-153-78:2553055:0]   | client_server inter-node cfg#0 | active message by ucp_am_send* from host memory    |
[1735531853.469278] [10-16-153-78:2553055:0]   +--------------------------------+-------------------------------------------------+--+
[1735531853.469281] [10-16-153-78:2553055:0]   +--------------------------------+-------------------------------------------------+--+
[1735531853.469314] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------+
[1735531853.469321] [10-16-153-78:2553055:0]   | client_server inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory    |
[1735531853.469324] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------------------------------+--+
[1735531853.469343] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------------------------------+--+
[1735531853.474304] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------+
[1735531853.474333] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* from host memory    |
[1735531853.474337] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------+----------+
[1735531853.474340] [10-16-153-78:2553055:0]   |                        0..8184 | short                                   | tcp/ens4 |
[1735531853.474343] [10-16-153-78:2553055:0]   |                      8185..inf | multi-frag copy-in                      | tcp/ens4 |
[1735531853.474346] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------+----------+
[1735531853.475635] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------------------+
[1735531853.475656] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send*(fast-completion) from host memory    |
[1735531853.475659] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------+----------+
[1735531853.475663] [10-16-153-78:2553055:0]   |                        0..8184 | short                                                    | tcp/ens4 |
[1735531853.475666] [10-16-153-78:2553055:0]   |                   8185..262143 | multi-frag copy-in                                       | tcp/ens4 |
[1735531853.475681] [10-16-153-78:2553055:0]   |                      256K..inf | (?) rendezvous fragmented copy-in copy-out               | tcp/ens4 |
[1735531853.475685] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------+----------+
[1735531853.478350] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------------------------+
[1735531853.478376] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send*(multi) from host memory    |
[1735531853.478380] [10-16-153-78:2553055:0]   +--------------------------------+------------------------------------------------+----------+
[1735531853.478382] [10-16-153-78:2553055:0]   |                        0..8184 | short                                          | tcp/ens4 |
[1735531853.478385] [10-16-153-78:2553055:0]   |                    8185..65528 | zero-copy                                      | tcp/ens4 |
[1735531853.478388] [10-16-153-78:2553055:0]   |                     65529..inf | multi-frag copy-in                             | tcp/ens4 |
[1735531853.478391] [10-16-153-78:2553055:0]   +--------------------------------+------------------------------------------------+----------+
[1735531853.479714] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------+
[1735531853.479744] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag from host memory    |
[1735531853.479747] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------+----------+
[1735531853.479750] [10-16-153-78:2553055:0]   |                        0..8176 | short                                                   | tcp/ens4 |
[1735531853.479753] [10-16-153-78:2553055:0]   |                      8177..inf | multi-frag copy-in                                      | tcp/ens4 |
[1735531853.479759] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------+----------+
[1735531853.481060] [10-16-153-78:2553055:0]   +--------------------------------+-------------------------------------------------------------------------------------+
[1735531853.481081] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag(fast-completion) from host memory    |
[1735531853.481084] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------------+----------+
[1735531853.481087] [10-16-153-78:2553055:0]   |                        0..8176 | short                                                                    | tcp/ens4 |
[1735531853.481094] [10-16-153-78:2553055:0]   |                   8177..262143 | multi-frag copy-in                                                       | tcp/ens4 |
[1735531853.481103] [10-16-153-78:2553055:0]   |                      256K..inf | (?) rendezvous fragmented copy-in copy-out                               | tcp/ens4 |
[1735531853.481110] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------------+----------+
[1735531853.482531] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------------------------+
[1735531853.482553] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag(multi) from host memory    |
[1735531853.482556] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------------+----------+
[1735531853.482559] [10-16-153-78:2553055:0]   |                         0..594 | short                                                          | tcp/ens4 |
[1735531853.482564] [10-16-153-78:2553055:0]   |                       595..inf | multi-frag copy-in                                             | tcp/ens4 |
[1735531853.482579] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------------+----------+

Hang!

@yosefe
Contributor

yosefe commented Jan 1, 2025

@ivanallen Currently the shared memory transport does not support detecting a remote process crash; setting UCX_MM_ERROR_HANDLING=y disregards this limitation and enables selecting the shared memory transport.
I am not sure it would be enough to add a check before "elem->flags = elem_flags", since there are other places in the code that may suffer from a similar problem. Also, adding such a check in a naïve way may add performance overhead.

@ivanallen
Author

ivanallen commented Jan 2, 2025

Hi @yosefe, thank you for your reply.

If that's the case, then it seems the shared memory transport cannot be used in a production environment; the server has no self-healing path other than being restarted.

I still hope that UCX can resolve this issue, though it seems quite challenging at the moment.

We have related experience from our production environment: maintain a separate FIFO queue for each producer (an SPSC FIFO); if a producer fails, the server removes the FIFO associated with that producer. The drawback is that you may need to maintain a large number of FIFO queues, but in our environment this has not been much of a problem so far.
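
To illustrate, a minimal sketch of that design (not UCX code; spsc_fifo_t and producer_alive are illustrative names):

#include <errno.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>

typedef struct {
    pid_t            producer_pid; /* recorded by the producer at attach */
    _Atomic uint64_t head;         /* advanced only by the producer      */
    _Atomic uint64_t tail;         /* advanced only by the consumer      */
    /* fixed-size element array follows in the shared segment */
} spsc_fifo_t;

/* Consumer side: kill(pid, 0) probes process existence without sending
 * a signal; ESRCH means the producer is gone, so the server can drop
 * this ring (and any half-committed element) instead of waiting on it. */
static int producer_alive(const spsc_fifo_t *fifo)
{
    return (kill(fifo->producer_pid, 0) == 0) || (errno == EPERM);
}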
