When using shared memory communication, ucp_am_send_nbx hangs and callback not invoked #10370

Open
ivanallen opened this issue Dec 12, 2024 · 15 comments

@ivanallen

Describe the bug

We have written an RPC framework based on UCX active messages (AM). The client of the RPC framework uses ucp_am_send_nbx to send the request message; after the server processes the request, it sends the response back to the client, also via ucp_am_send_nbx.
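
For reference, the client's send path looks roughly like this (a minimal sketch; send_cb, rpc_send, and AM_ID_RPC are our own names, not part of UCX):

#include <ucp/api/ucp.h>

static void send_cb(void *request, ucs_status_t status, void *user_data)
{
    /* the completion callback that never fires when the hang occurs */
    ucp_request_free(request);
}

static ucs_status_t rpc_send(ucp_ep_h ep, unsigned am_id,
                             const void *header, size_t header_size,
                             const void *body, size_t body_size)
{
    ucp_request_param_t param = {0};

    param.op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK |
                         UCP_OP_ATTR_FIELD_FLAGS;
    param.cb.send      = send_cb;
    param.flags        = UCP_AM_SEND_FLAG_REPLY;

    ucs_status_ptr_t sp = ucp_am_send_nbx(ep, am_id, header, header_size,
                                          body, body_size, &param);
    if (UCS_PTR_IS_ERR(sp)) {
        return UCS_PTR_STATUS(sp); /* immediate failure */
    }
    /* sp == NULL: completed in place (send_cb is not invoked);
     * otherwise completion is reported through send_cb */
    return UCS_OK;
}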

While testing with UCX_TLS=shm,tcp using rpc_press (a benchmarking tool for the RPC framework), the tool could hang if I interrupted it with Ctrl-C and then ran rpc_press again.

At this point, if I use another simple echo client to send a request to the server, it also hangs and gets no response.

After analysis, we found that the client's ucp_am_send_nbx callback was likewise never invoked.

With UCX_TLS=tcp it works; with UCX_TLS=shm,tcp it does not.

To recover, the server must be restarted.

Steps to Reproduce

  • Command line

xrpc_server is a simple echo server based on UCX.

UCX_PROTO_INFO=y UCX_TCP_CM_REUSEADDR=y UCX_MM_ERROR_HANDLING=y ./build/linux/x86_64/releasedbg/xrpc_server --attachment_size=0 --xrpc_mem_resource=default --log_level info --attachment_size 0 --num_workers=8 --enable_wakeup_mode=false

rpc_press starts 32 threads that connect to xrpc_server and send requests.

UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=info ./build/linux/x86_64/releasedbg/rpc_press -proto ./examples/xrpc/demo/echo.proto -method=example.EchoService.Echo -server 10.16.26.223:8123 -input ./input.json -qps 0 --duration 190 -attachment_size 4096 --xrpc_mem_resource=default --thread_num 32 --qps 0

output:

[2024-12-12 11:28:33.623] [info] [channel.cc:629] desc:Connected to worker worker:7 host:10.16.26.223 port:8123
[1734002913.623916] [node223:41158:32]   +-------------------------------+----------------------------------------------------------------------------------+
[1734002913.623925] [node223:41158:32]   | xrpc_channel intra-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from iov[2] host memory    |
[1734002913.623930] [node223:41158:32]   +-------------------------------+--------------------------------------------------------------------+-------------+
[1734002913.623934] [node223:41158:32]   |                       0..8240 | copy-in                                                            | sysv/memory |
[1734002913.623939] [node223:41158:32]   |                  8241..593437 | multi-frag copy-in                                                 | sysv/memory |
[1734002913.623943] [node223:41158:32]   |                   593438..inf | (?) rendezvous fragmented copy-in copy-out                         | sysv/memory |
[1734002913.623948] [node223:41158:32]   +-------------------------------+--------------------------------------------------------------------+-------------+
2024/12/12-11:28:34     sent:1532179   success:1532205   error:0     total_error:0         total_sent:1839265
2024/12/12-11:28:35     sent:2371318   success:2371316   error:0     total_error:0         total_sent:4210583
2024/12/12-11:28:36     sent:2404221   success:2404222   error:0     total_error:0         total_sent:6614804
2024/12/12-11:28:37     sent:2400176   success:2400176   error:0     total_error:0         total_sent:9014980
2024/12/12-11:28:38     sent:2397321   success:2397320   error:0     total_error:0         total_sent:11412301
2024/12/12-11:28:39     sent:2330092   success:2330096   error:0     total_error:0         total_sent:13742393
[Latency]
  avg             23 us
  50%             12 us
  70%             13 us
  90%             15 us
  95%             16 us
  97%             17 us
  99%             20 us
  99.9%           31 us
  99.99%         549 us
  max        4586555 us
^C

Start rpc_press again, and it hangs!

UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=info ./build/linux/x86_64/releasedbg/rpc_press -proto ./examples/xrpc/demo/echo.proto -method=example.EchoService.Echo -server 10.16.26.223:8123 -input ./input.json -qps 0 --duration 190 -attachment_size 4096 --xrpc_mem_resource=default --thread_num 32 --qps 0

Using xrpc_client to send a request hangs too!

UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=data ./build/linux/x86_64/releasedbg/xrpc_client --host 10.16.26.223 --port 8123
  • UCX version used: UCX 1.17

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
Linux node223 4.18.0-305.3.1.el8_4.x86_64 #1 SMP Thu Jun 17 07:52:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

ucx log:
ucx_log.txt

@ivanallen ivanallen added the Bug label Dec 12, 2024
@yosefe
Contributor

yosefe commented Dec 12, 2024

Hi, in order to fully support Ctrl-C, the UCX connection should be established using an IP address. For example, see https://github.com/openucx/ucx/blob/master/examples/ucp_client_server.c and:

void UcxConnection::connect(const struct sockaddr *src_saddr,
                            const struct sockaddr *dst_saddr,
                            socklen_t addrlen,
                            UcxCallback *callback)
{
    set_log_prefix(dst_saddr, addrlen);

    ucp_ep_params_t ep_params;
    ep_params.field_mask       = UCP_EP_PARAM_FIELD_FLAGS |
                                 UCP_EP_PARAM_FIELD_SOCK_ADDR;
    ep_params.flags            = UCP_EP_PARAMS_FLAGS_CLIENT_SERVER;
    ep_params.sockaddr.addr    = dst_saddr;
    ep_params.sockaddr.addrlen = addrlen;

    if (src_saddr != NULL) {
        ep_params.field_mask            |= UCP_EP_PARAM_FIELD_LOCAL_SOCK_ADDR;
        ep_params.local_sockaddr.addr    = src_saddr;
        ep_params.local_sockaddr.addrlen = addrlen;
    }

    if (_context._client_id != UcxContext::CLIENT_ID_UNDEFINED) {
        ep_params.flags |= UCP_EP_PARAMS_FLAGS_SEND_CLIENT_ID;
    }

    char sockaddr_str[UCS_SOCKADDR_STRING_LEN];
    UCX_CONN_LOG << "Connecting to "
                 << ucs_sockaddr_str(dst_saddr, sockaddr_str,
                                     UCS_SOCKADDR_STRING_LEN);

    connect_common(ep_params, callback);
}

Is this what is being done in the RPC framework?

@ivanallen
Author

ivanallen commented Dec 13, 2024

@yosefe Thank you for your attention to my question.

Strictly speaking, I use the same method as ucp_hello_world and ucx_perftest.

First, the client uses an extra out-of-band (OOB) TCP connection to request the addresses of all the server's workers, and then uses them to create all the ucp_ep objects:

_params.field_mask |= UCP_EP_PARAM_FIELD_REMOTE_ADDRESS;
_params.address     = worker_address;

The client establishes a full mesh of connections with all the server workers so that it can hash each request to a specific worker thread for processing.
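
In full, the endpoint creation looks roughly like this (a sketch; worker and worker_address come from our OOB exchange, and error handling is trimmed):

ucp_ep_params_t ep_params = {0};

ep_params.field_mask = UCP_EP_PARAM_FIELD_REMOTE_ADDRESS;
ep_params.address    = (const ucp_address_t*)worker_address;

ucp_ep_h ep;
ucs_status_t status = ucp_ep_create(worker, &ep_params, &ep);
if (status != UCS_OK) {
    /* connection setup failed */
}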

[image: diagram of the client workers fully connected to the server workers]

Does this method not support using Ctrl-C to interrupt the process?
If not, should I create ucp_listeners on different ports for all threads?

@ivanallen ivanallen changed the title When using shared memory communication, ucp_am_send_nbx hangs and no callback not invoked When using shared memory communication, ucp_am_send_nbx hangs and callback not invoked Dec 13, 2024
@yosefe
Contributor

yosefe commented Dec 13, 2024

Does this method not support using Ctrl-C to interrupt the process?
If not, should I create ucp_listeners on different ports for all threads?

No, it doesn't.
In the above use case, you could still create a single listener object and, when it accepts a new conn_request, create the server endpoint on a worker of your choice.
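
Roughly like this (a sketch; server_ctx_t and pick_worker are assumed application names, not UCX API):

static void conn_handler(ucp_conn_request_h conn_request, void *arg)
{
    server_ctx_t *ctx    = arg;              /* hypothetical app context */
    ucp_worker_h  worker = pick_worker(ctx); /* e.g. round-robin choice  */

    ucp_ep_params_t ep_params = {0};
    ep_params.field_mask   = UCP_EP_PARAM_FIELD_CONN_REQUEST;
    ep_params.conn_request = conn_request;

    ucp_ep_h ep;
    if (ucp_ep_create(worker, &ep_params, &ep) != UCS_OK) {
        ucp_listener_reject(ctx->listener, conn_request);
    }
}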

@ivanallen
Author

Does this method not support using Ctrl-C to interrupt the process?
If not, should I create ucp_listeners on different ports for all threads?

No, it doesn't. In the above use case, you could still create a single listener object and, when it accepts a new conn_request, create the server endpoint on a worker of your choice.

Hi @yosefe, thanks for answering.

But by doing so, how do I hash a request to a specific worker? I cannot know which server worker corresponds to the ep created on the client. Do you have any good ideas?

@yosefe
Contributor

yosefe commented Dec 13, 2024

@ivanallen Two possible options are to extract the client address or the client_id from the connection request; see https://github.com/openucx/ucx/blob/fa01cca77754edb7dd510190640c5feb9fb2b366/src/ucp/api/ucp.h#L1520C3-L1520C26
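
For example, inside the listener's connection callback (a sketch):

ucp_conn_request_attr_t attr;

attr.field_mask = UCP_CONN_REQUEST_ATTR_FIELD_CLIENT_ADDR |
                  UCP_CONN_REQUEST_ATTR_FIELD_CLIENT_ID;
if (ucp_conn_request_query(conn_request, &attr) == UCS_OK) {
    /* attr.client_address holds the client's sockaddr, and
     * attr.client_id holds the value the client set in
     * ucp_ep_params_t::client_id (sent when the client passes
     * UCP_EP_PARAMS_FLAGS_SEND_CLIENT_ID) */
}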

@ivanallen
Author

@ivanallen Two possible options are to extract the client address or the client_id from the connection request; see https://github.com/openucx/ucx/blob/fa01cca77754edb7dd510190640c5feb9fb2b366/src/ucp/api/ucp.h#L1520C3-L1520C26

Hi @yosefe

Yes, I know this API.
But this only tells the server which client worker a request is coming from. As a client, I still don't know how to send a request to a specific worker on the server.

For example, in the following picture, worker0 in the client would like to send a request to worker0 in the server.
[image: diagram of client worker0 targeting server worker0]

@yosefe
Contributor

yosefe commented Dec 13, 2024

I see. So, to give the client precise control, there can be multiple listeners on different ports.

@ivanallen
Author

I see. So, to give the client precise control, there can be multiple listeners on different ports.

@yosefe Thank you very much.

Finally, I would like to ask why Ctrl-C is not supported for connections that use the worker address. Is there a specific concern behind this? Or is there any material I can refer to?

@yosefe
Contributor

yosefe commented Dec 13, 2024

I see. So, to give the client precise control, there can be multiple listeners on different ports.

@yosefe Thank you very much.

Finally, I would like to ask why Ctrl-C is not supported for connections that use the worker address. Is there a specific concern behind this? Or is there any material I can refer to?

In order to respond to Ctrl-C, there must be a kernel-based "channel" through which a FIN message is sent, letting the other side detect that the connection is closed. By the current design, that kind of channel is created in UCX only with IP/port-based connection establishment.
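
For reference, the peer-failure detection this channel enables is hooked up through the endpoint's error handler (a sketch; err_cb is an application callback):

static void err_cb(void *arg, ucp_ep_h ep, ucs_status_t status)
{
    /* invoked e.g. with UCS_ERR_CONNECTION_RESET when the peer goes
     * away, which is how a Ctrl-C on the other side becomes visible */
}

ucp_ep_params_t ep_params = {0};

ep_params.field_mask      = UCP_EP_PARAM_FIELD_FLAGS |
                            UCP_EP_PARAM_FIELD_SOCK_ADDR |
                            UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE |
                            UCP_EP_PARAM_FIELD_ERR_HANDLER;
ep_params.flags           = UCP_EP_PARAMS_FLAGS_CLIENT_SERVER;
ep_params.err_mode        = UCP_ERR_HANDLING_MODE_PEER;
ep_params.err_handler.cb  = err_cb;
ep_params.err_handler.arg = NULL;
/* plus sockaddr.addr / sockaddr.addrlen as in the snippet above */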

@ivanallen
Author

ivanallen commented Dec 17, 2024

Hi @yosefe

I have updated the code and upgraded to 1.18.0-rc2, but the problem still exists, with one difference.

Now the ucp_am_send_nbx callback on the client side does get executed, but the AM handler on the server side is never triggered.

Among the server's eight workers, there are always some that cannot receive requests. This only occurs at extremely high QPS.

The following is the log (see the attachment for the UCX log). You can see that the client's request_id:1 is sent successfully, but request_id:2 gets no response.

request_id:1 goes to worker0 on the server; that channel is normal. But request_id:2 uses worker1 on the server, and that channel does not respond. (We use round-robin to send requests to different server workers.)
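
For context, the server registers its AM handler roughly like this (a sketch; on_rpc_request, AM_ID_RPC, and worker_ctx are our names):

static ucs_status_t on_rpc_request(void *arg, const void *header,
                                   size_t header_length, void *data,
                                   size_t length,
                                   const ucp_am_recv_param_t *param)
{
    /* this handler is never invoked on the stuck worker */
    return UCS_OK;
}

ucp_am_handler_param_t am_param = {0};

am_param.field_mask = UCP_AM_HANDLER_PARAM_FIELD_ID |
                      UCP_AM_HANDLER_PARAM_FIELD_CB |
                      UCP_AM_HANDLER_PARAM_FIELD_ARG;
am_param.id         = AM_ID_RPC;
am_param.cb         = on_rpc_request;
am_param.arg        = worker_ctx;

ucp_worker_set_am_recv_handler(worker, &am_param);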

client log:

[2024-12-17 04:15:16.065] [66723] [] [debug] channel.cc:238 desc:Received AM RPC message header_size:16 data_size:15
[2024-12-17 04:15:16.066] [66723] [] [debug] channel.cc:276 desc:Received RPC message body_size:15 is_rndv:false rpc_meta:response {
}
correlation_id: 1

[2024-12-17 04:15:16.066] [66723] [] [info] client.cc:104 request_id:1 sync test latency:740707µs actual:740836181ns
[2024-12-17 04:15:16.066] [66723] [] [info] client.cc:112 request_id:1 Received response:Hello, world!
[2024-12-17 04:15:16.066] [66723] [] [debug] channel.cc:427 desc:Sending RPC message rpc_header:RpcHeader{magic:1129467992, meta_size:53, body_size:41} rpc_meta:request {
  service_name: "example.EchoService"
  method_name: "example.EchoService.Echo"
}
correlation_id: 2
attachment_size: 26

[2024-12-17 04:15:18.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:2s
[2024-12-17 04:15:19.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:3s
[2024-12-17 04:15:20.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:4s

server log:

[2024-12-17 04:15:15.325] [66632] [] [info] message_server.cc:42 desc:session come in remote:10.16.26.223:49326
[2024-12-17 04:15:15.325] [66632] [] [info] server.cc:129 desc:Handshake message received msg:hello xrpc
[2024-12-17 04:15:15.473] [66634] [] [info] worker.cc:101 desc:connection request received client:[email protected]:39326
[2024-12-17 04:15:15.480] [66637] [] [info] worker.cc:101 desc:connection request received client:[email protected]:49514
[2024-12-17 04:15:15.495] [66636] [] [info] worker.cc:101 desc:connection request received client:[email protected]:49790
[2024-12-17 04:15:15.510] [66638] [] [info] worker.cc:101 desc:connection request received client:[email protected]:36860
[2024-12-17 04:15:15.524] [66635] [] [info] worker.cc:101 desc:connection request received client:[email protected]:33430
[2024-12-17 04:15:15.539] [66633] [] [info] worker.cc:101 desc:connection request received client:[email protected]:41722
[2024-12-17 04:15:15.558] [66640] [] [info] worker.cc:101 desc:connection request received client:[email protected]:57184
[2024-12-17 04:15:15.578] [66639] [] [info] worker.cc:101 desc:connection request received client:[email protected]:38456
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:194 desc:Received RPC message data_length:41 data_ptr:0x7fbe9036fec8 is_rndv:false rpc_meta:request {
  service_name: "example.EchoService"
  method_name: "example.EchoService.Echo"
}
correlation_id: 1
attachment_size: 26

[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:362 desc:return RPC message rpc_header:RpcHeader{magic:1129467992, meta_size:4, body_size:15} worker_id:0 request_id:1 states:RECV,RECV_DONE rpc_meta:response {
}
correlation_id: 1

[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:422 desc:client specified send flags send_flags:0
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:412 desc:RPC message return success worker_id:0 request_id:1 states:SEND,RECV,SEND_DONE,RECV_DONE
[2024-12-17 04:15:21.041] [66639] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:38456
[2024-12-17 04:15:21.041] [66640] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:57184
[2024-12-17 04:15:21.041] [66635] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:33430
[2024-12-17 04:15:21.041] [66634] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:39326
[2024-12-17 04:15:21.041] [66636] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:49790
[2024-12-17 04:15:21.041] [66637] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:49514
[2024-12-17 04:15:21.041] [66638] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:36860
[2024-12-17 04:15:21.041] [66633] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:41722
[2024-12-17 04:15:21.041] [66632] [] [info] message_server.cc:38 desc:session close remote:10.16.26.223:49326

client ucx log:
ucx.log2.txt

@ivanallen
Author

ivanallen commented Dec 28, 2024

(gdb) p iface->read_index_elem[0]
$23 = {
  flags = 1 '\001',
  am_id = 26 '\032',
  length = 4193,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5265344
  },
  desc_data = 0x7f2d9a5057c0
}
(gdb) p iface->read_index_elem[1]
$24 = {
  flags = 0 '\000',
  am_id = 0 '\000',
  length = 0,
  desc = {
    seg_id = 0,
    seg_size = 0,
    offset = 0
  },
  desc_data = 0x0
}
(gdb) p iface->recv_fifo_ctl[0]
$25 = {
  head = 239249,
  signal_addrlen = 8,
  signal_sockaddr = {
    sun_family = 1,
    sun_path = "\000\060\062c9b", '\000' <repeats 101 times>
  },
  pad0 = "\000\000\000\000\000",
  tail = 239104,
  pid = 87383
}
(gdb) p iface->read_index
$26 = 239204
(gdb)
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)(((iface->read_index & iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$28 = {
  flags = 1 '\001',
  am_id = 26 '\032',
  length = 4193,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5265344
  },
  desc_data = 0x7f2d9a5057c0
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 1)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$29 = {
  flags = 0 '\000',
  am_id = 1 '\001',
  length = 77,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 4904832
  },
  desc_data = 0x7f2d9a4ad780
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 2)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$30 = {
  flags = 0 '\000',
  am_id = 1 '\001',
  length = 30,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5055744
  },
  desc_data = 0x7f2d9a4d2500
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 3)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$31 = {
  flags = 0 '\000',
  am_id = 26 '\032',
  length = 4191,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 6103744
  },
  desc_data = 0x7f2d9a5d22c0
}
(gdb) p iface[0]
$32 = {
  super = {
    super = {
      super = {
        ops = {
          ep_put_short = 0x5407a0 <uct_sm_ep_put_short>,
          ep_put_bcopy = 0x5407d0 <uct_sm_ep_put_bcopy>,
          ep_put_zcopy = 0x0,
          ep_get_short = 0x0,
          ep_get_bcopy = 0x5407e0 <uct_sm_ep_get_bcopy>,
          ep_get_zcopy = 0x0,
          ep_am_short = 0x543250 <uct_mm_ep_am_short>,
          ep_am_short_iov = 0x543470 <uct_mm_ep_am_short_iov>,
          ep_am_bcopy = 0x543750 <uct_mm_ep_am_bcopy>,
          ep_am_zcopy = 0x0,
          ep_atomic_cswap64 = 0x540a50 <uct_sm_ep_atomic_cswap64>,
          ep_atomic_cswap32 = 0x540a70 <uct_sm_ep_atomic_cswap32>,
          ep_atomic32_post = 0x540810 <uct_sm_ep_atomic32_post>,
          ep_atomic64_post = 0x540880 <uct_sm_ep_atomic64_post>,
          ep_atomic32_fetch = 0x5409a0 <uct_sm_ep_atomic32_fetch>,
          ep_atomic64_fetch = 0x5408f0 <uct_sm_ep_atomic64_fetch>,
          ep_tag_eager_short = 0x0,
          ep_tag_eager_bcopy = 0x0,
          ep_tag_eager_zcopy = 0x0,
          ep_tag_rndv_zcopy = 0x0,
          ep_tag_rndv_cancel = 0x0,
          ep_tag_rndv_request = 0x0,
          iface_tag_recv_zcopy = 0x0,
          iface_tag_recv_cancel = 0x0,
          ep_pending_add = 0x543a10 <uct_mm_ep_pending_add>,
          ep_pending_purge = 0x543b40 <uct_mm_ep_pending_purge>,
          ep_flush = 0x543ba0 <uct_mm_ep_flush>,
          ep_fence = 0x540c00 <uct_sm_ep_fence>,
          ep_check = 0x543bf0 <uct_mm_ep_check>,
          ep_create = 0x543170 <uct_mm_ep_t_new>,
          ep_connect = 0x0,
          ep_disconnect = 0x0,
          cm_ep_conn_notify = 0x0,
          ep_destroy = 0x543220 <uct_mm_ep_t_delete>,
          ep_get_address = 0x0,
          ep_connect_to_ep = 0x0,
          iface_accept = 0x0,
          iface_reject = 0x0,
          iface_flush = 0x540c60 <uct_mm_iface_flush>,
          iface_fence = 0x540bf0 <uct_sm_iface_fence>,
          iface_progress_enable = 0x53dfa0 <uct_base_iface_progress_enable>,
          iface_progress_disable = 0x53dfb0 <uct_base_iface_progress_disable>,
          iface_progress = 0x541390 <uct_mm_iface_progress>,
          iface_event_fd_get = 0x540c80 <uct_mm_iface_event_fd_get>,
          iface_event_arm = 0x5410a0 <uct_mm_iface_event_fd_arm>,
          iface_close = 0x541070 <uct_mm_iface_t_delete>,
          iface_query = 0x540ea0 <uct_mm_iface_query>,
          iface_get_device_address = 0x540bc0 <uct_sm_iface_get_device_address>,
          iface_get_address = 0x540c30 <uct_mm_iface_get_address>,
          iface_is_reachable = 0x53dc20 <uct_base_iface_is_reachable>
        }
      },
      internal_ops = 0x85c020 <uct_mm_iface_internal_ops>,
--Type <RET> for more, q to quit, c to continue without paging--
      md = 0x25f7280,
      worker = 0x27fabb0,
      am = {{
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x0,
          flags = 4
        }, {
          cb = 0x50b7b0 <ucp_wireup_msg_handler>,
          arg = 0x305a000,
          flags = 4
        }, {
          cb = 0x4d4dc0 <ucp_eager_only_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d5ee0 <ucp_eager_first_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d7230 <ucp_eager_middle_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x5,
          flags = 4
        }, {
          cb = 0x4d5650 <ucp_eager_sync_only_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d6880 <ucp_eager_sync_first_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4d4460 <ucp_eager_sync_ack_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4b9f20 <ucp_rndv_rts_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4bfe40 <ucp_rndv_ats_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4c3c10 <ucp_rndv_rtr_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4bf150 <ucp_rndv_data_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0xd,
          flags = 4
--Type <RET> for more, q to quit, c to continue without paging--
        }, {
          cb = 0x4d8320 <ucp_eager_offload_sync_ack_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0xf,
          flags = 4
        }, {
          cb = 0x4bc5f0 <ucp_rndv_atp_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4a0b30 <ucp_put_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x49e810 <ucp_get_req_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x49fe80 <ucp_get_rep_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x14,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x15,
          flags = 4
        }, {
          cb = 0x49eb50 <ucp_rma_cmpl_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x446300 <ucp_am_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x445530 <ucp_am_long_first_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x4492a0 <ucp_am_long_middle_handler>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x448b90 <ucp_am_handler_reply>,
          arg = 0x305a000,
          flags = 0
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1b,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
--Type <RET> for more, q to quit, c to continue without paging--
          arg = 0x1c,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1d,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1e,
          flags = 4
        }, {
          cb = 0x53cd60 <uct_iface_stub_am_handler>,
          arg = 0x1f,
          flags = 4
        }},
      am_tracer = 0x46cd40 <ucp_worker_am_tracer>,
      am_tracer_arg = 0x305a000,
      err_handler = 0x477b70 <ucp_worker_iface_error_handler>,
      err_handler_arg = 0x305a000,
      err_handler_flags = 4,
      prog = {
        id = 1,
        refcount = 0
      },
      progress_flags = 3,
      config = {
        num_alloc_methods = 3,
        alloc_methods = {UCT_ALLOC_METHOD_MD, UCT_ALLOC_METHOD_MMAP, UCT_ALLOC_METHOD_HEAP, UCT_ALLOC_METHOD_THP, UCT_ALLOC_METHOD_THP},
        failure_level = UCS_LOG_LEVEL_DIAG,
        max_num_eps = 18446744073709551615
      }
    },
    config = {
      bandwidth = 16106127360
    }
  },
  recv_fifo_mem = {
    address = 0x7f2d9b214000,
    length = 36864,
    method = UCT_ALLOC_METHOD_MD,
    mem_type = UCS_MEMORY_TYPE_HOST,
    md = 0x25f7280,
    memh = 0x25f6680
  },
  recv_fifo_ctl = 0x7f2d9b214000,
  recv_fifo_elems = 0x7f2d9b2140c0,
  read_index_elem = 0x7f2d9b2172c0,
  read_index = 239204,
  fifo_shift = 8 '\b',
  fifo_mask = 255,
  fifo_release_factor_mask = 127,
  fifo_poll_count = 1,
  fifo_prev_wnd_cons = 0,
  recv_desc_mp = {
    freelist = 0x7f2d9a588750,
    data = 0x261b420
  },
  last_recv_desc = 0x7f2d9a4c6018,
--Type <RET> for more, q to quit, c to continue without paging--
  signal_fd = 53,
  rx_headroom = 80,
  arbiter = {
    list = {
      prev = 0x2655790,
      next = 0x2655790
    }
  },
  release_desc = {
    cb = 0x540df0 <uct_mm_iface_release_desc>
  },
  config = {
    fifo_size = 256,
    fifo_elem_size = 128,
    seg_size = 8256,
    fifo_max_poll = 16,
    extra_cap_flags = 274877906944,
    overhead = {
      send = {
        am_short = 1e-08,
        am_bcopy = 1e-08
      },
      recv = {
        am_short = 1e-08,
        am_bcopy = 1e-08
      }
    }
  }
}
(gdb)

(gdb) p iface->config
$50 = {
  fifo_size = 256,
  fifo_elem_size = 128,
  seg_size = 8256,
  fifo_max_poll = 16,
  extra_cap_flags = 274877906944,
  overhead = {
    send = {
      am_short = 1e-08,
      am_bcopy = 1e-08
    },
    recv = {
      am_short = 1e-08,
      am_bcopy = 1e-08
    }
  }
}
(gdb) p payload_offset
No symbol "payload_offset" in current context.
(gdb) p iface->rx_headroom
$51 = 80
(gdb) p iface->recv_fifo_mem
$52 = {
  address = 0x7f2d9b214000,
  length = 36864,
  method = UCT_ALLOC_METHOD_MD,
  mem_type = UCS_MEMORY_TYPE_HOST,
  md = 0x25f7280,
  memh = 0x25f6680
}
(gdb) p iface->recv_fifo_mem.memh
$53 = (uct_mem_h) 0x25f6680
(gdb) p (uct_mm_seg_t*)iface->recv_fifo_mem.memh
$54 = (uct_mm_seg_t *) 0x25f6680
(gdb) p *(uct_mm_seg_t*)iface->recv_fifo_mem.memh
$55 = {
  seg_id = 9732101,
  address = 0x7f2d9b214000,
  length = 36864
}
(gdb)
[root@node223 xrpc]# ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 9732099    root       660        36864      1          dest
0x00000000 9732100    root       660        6291456    1          dest
0x00000000 9732101    root       660        36864      1          dest
0x00000000 9732102    root       660        6291456    1          dest
0x00000000 9732103    root       660        36864      1          dest
0x00000000 9732104    root       660        6291456    1          dest
0x00000000 9732105    root       660        36864      1          dest
0x00000000 9732106    root       660        6291456    1          dest
0x00000000 9732107    root       660        36864      1          dest
0x00000000 9732108    root       660        6291456    1          dest
0x00000000 9732109    root       660        36864      1          dest
0x00000000 9732110    root       660        6291456    1          dest
0x00000000 9732111    root       660        36864      1          dest
0x00000000 9732112    root       660        6291456    1          dest
0x00000000 9732113    root       660        36864      1          dest
0x00000000 9732114    root       660        6291456    1          dest

@ivanallen
Copy link
Author

ivanallen commented Dec 28, 2024

Hi @yosefe, I've got some new clues.

I printed the server's FIFO data through gdb and found that uct_mm_iface_fifo_has_new_data evaluates to false for the element at read_index, so new requests that later entered the queue can never be processed: for this lap of the ring the reader expects an owner bit of (239204 >> 8) & 1 = 0, but the element's flags still carry the previous lap's owner bit 1, so the reader keeps treating the slot as not yet written.

(gdb) p iface->read_index_elem[0]
$23 = {
  flags = 1 '\001',
  am_id = 26 '\032',
  length = 4193,
  desc = {
    seg_id = 9732102,
    seg_size = 6291456,
    offset = 5265344
  },
  desc_data = 0x7f2d9a5057c0
}

(gdb) p iface->read_index
$26 = 239204
static UCS_F_ALWAYS_INLINE int
uct_mm_iface_fifo_has_new_data(uct_mm_iface_t *iface)
{
    // ((iface->read_index >> iface->fifo_shift) & 1) => (239204 >> 8) & 1 =>  0
    // (iface->read_index_elem->flags & 1) => 1 & 1 => 1
    // return false
    return (((iface->read_index >> iface->fifo_shift) & 1) ==
            (iface->read_index_elem->flags & 1));
}

@ivanallen
Author

ivanallen commented Dec 28, 2024

Hi @yosefe

I think I have found the root cause of the problem.

if (head & iface->config.fifo_size) {
elem_flags |= UCT_MM_FIFO_ELEM_FLAG_OWNER;
}
elem->flags = elem_flags;

When the client crashes before line 378 executes, the statement elem->flags = elem_flags; never runs, so the slot keeps the owner bit from the previous lap and is left as an uncommitted request.
Therefore, when a new client continues to submit elements after it, the server does not respond.

Should UCX add a check to detect that the process that owns the element no longer exists, so that processing of the element can be skipped? Or is there some other solution?
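
To make the idea concrete, something along these lines (purely a sketch; the current uct_mm_fifo_element_t does not record its writer's pid, so writer_pid is a hypothetical field):

#include <errno.h>
#include <signal.h>

/* hypothetical: requires the sender to stamp its pid into the slot */
static int elem_writer_died(pid_t writer_pid)
{
    return (kill(writer_pid, 0) != 0) && (errno == ESRCH);
}

/* in the receive progress path, when the owner-bit check keeps failing: */
if (!uct_mm_iface_fifo_has_new_data(iface) &&
    elem_writer_died(elem->writer_pid /* hypothetical field */)) {
    iface->read_index++; /* skip the slot that was never committed
                          * (the cached read_index_elem would need to
                          * be advanced accordingly) */
}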


I can reproduce the problem reliably with fault injection.

  • start server
UCX_TCP_CM_REUSEADDR=y UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -c am -s 4096
  • start client with gdb
UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y /opt/rh/gcc-toolset-12/root/bin/gdb --args ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096

(gdb) b mm_ep.c:378 if elem->am_id == 23
No source file named mm_ep.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mm_ep.c:367 if elem->am_id == 23) pending.
(gdb) r
Thread 1 "ucp_client_serv" hit Breakpoint 1, uct_mm_ep_am_common_send (send_op=UCT_MM_SEND_AM_BCOPY, ep=0x525bb0, iface=0x41fd60, am_id=23 '\027', length=4104, header=0, payload=0x0, pack_cb=0x7ffff7d4b313 <ucp_am_eager_single_bcopy_pack>, arg=0x43b080, iov=0x0, iovcnt=0, flags=0) at sm/mm/base/mm_ep.c:378
378         elem->flags = elem_flags;
(gdb) quit
A debugging session is active.

        Inferior 1 [process 2553049] will be killed.

Quit anyway? (y or n) y
  • start client again without gdb
UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096

[root@10-16-153-78 ucx-1.18.0]# UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096
[1735531853.469238] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------+
[1735531853.469275] [10-16-153-78:2553055:0]   | client_server inter-node cfg#0 | active message by ucp_am_send* from host memory    |
[1735531853.469278] [10-16-153-78:2553055:0]   +--------------------------------+-------------------------------------------------+--+
[1735531853.469281] [10-16-153-78:2553055:0]   +--------------------------------+-------------------------------------------------+--+
[1735531853.469314] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------+
[1735531853.469321] [10-16-153-78:2553055:0]   | client_server inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory    |
[1735531853.469324] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------------------------------+--+
[1735531853.469343] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------------------------------+--+
[1735531853.474304] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------+
[1735531853.474333] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* from host memory    |
[1735531853.474337] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------+----------+
[1735531853.474340] [10-16-153-78:2553055:0]   |                        0..8184 | short                                   | tcp/ens4 |
[1735531853.474343] [10-16-153-78:2553055:0]   |                      8185..inf | multi-frag copy-in                      | tcp/ens4 |
[1735531853.474346] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------+----------+
[1735531853.475635] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------------------+
[1735531853.475656] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send*(fast-completion) from host memory    |
[1735531853.475659] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------+----------+
[1735531853.475663] [10-16-153-78:2553055:0]   |                        0..8184 | short                                                    | tcp/ens4 |
[1735531853.475666] [10-16-153-78:2553055:0]   |                   8185..262143 | multi-frag copy-in                                       | tcp/ens4 |
[1735531853.475681] [10-16-153-78:2553055:0]   |                      256K..inf | (?) rendezvous fragmented copy-in copy-out               | tcp/ens4 |
[1735531853.475685] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------+----------+
[1735531853.478350] [10-16-153-78:2553055:0]   +--------------------------------+-----------------------------------------------------------+
[1735531853.478376] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send*(multi) from host memory    |
[1735531853.478380] [10-16-153-78:2553055:0]   +--------------------------------+------------------------------------------------+----------+
[1735531853.478382] [10-16-153-78:2553055:0]   |                        0..8184 | short                                          | tcp/ens4 |
[1735531853.478385] [10-16-153-78:2553055:0]   |                    8185..65528 | zero-copy                                      | tcp/ens4 |
[1735531853.478388] [10-16-153-78:2553055:0]   |                     65529..inf | multi-frag copy-in                             | tcp/ens4 |
[1735531853.478391] [10-16-153-78:2553055:0]   +--------------------------------+------------------------------------------------+----------+
[1735531853.479714] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------+
[1735531853.479744] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag from host memory    |
[1735531853.479747] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------+----------+
[1735531853.479750] [10-16-153-78:2553055:0]   |                        0..8176 | short                                                   | tcp/ens4 |
[1735531853.479753] [10-16-153-78:2553055:0]   |                      8177..inf | multi-frag copy-in                                      | tcp/ens4 |
[1735531853.479759] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------+----------+
[1735531853.481060] [10-16-153-78:2553055:0]   +--------------------------------+-------------------------------------------------------------------------------------+
[1735531853.481081] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag(fast-completion) from host memory    |
[1735531853.481084] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------------+----------+
[1735531853.481087] [10-16-153-78:2553055:0]   |                        0..8176 | short                                                                    | tcp/ens4 |
[1735531853.481094] [10-16-153-78:2553055:0]   |                   8177..262143 | multi-frag copy-in                                                       | tcp/ens4 |
[1735531853.481103] [10-16-153-78:2553055:0]   |                      256K..inf | (?) rendezvous fragmented copy-in copy-out                               | tcp/ens4 |
[1735531853.481110] [10-16-153-78:2553055:0]   +--------------------------------+--------------------------------------------------------------------------+----------+
[1735531853.482531] [10-16-153-78:2553055:0]   +--------------------------------+---------------------------------------------------------------------------+
[1735531853.482553] [10-16-153-78:2553055:0]   | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag(multi) from host memory    |
[1735531853.482556] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------------+----------+
[1735531853.482559] [10-16-153-78:2553055:0]   |                         0..594 | short                                                          | tcp/ens4 |
[1735531853.482564] [10-16-153-78:2553055:0]   |                       595..inf | multi-frag copy-in                                             | tcp/ens4 |
[1735531853.482579] [10-16-153-78:2553055:0]   +--------------------------------+----------------------------------------------------------------+----------+

Hang!

@yosefe
Contributor

yosefe commented Jan 1, 2025

@ivanallen Currently the shared memory transport does not support detecting a remote process crash; setting UCX_MM_ERROR_HANDLING=y disregards this limitation and enables selecting the shared memory transport.
I am not sure it would be enough to add a check before "elem->flags = elem_flags", since there are other places in the code that may suffer from a similar problem. Also, adding such a check in a naïve way may add performance overhead.

@ivanallen
Author

ivanallen commented Jan 2, 2025

Hi @yosefe, thank you for your reply.

If that's the case, then it seems the shared memory transport cannot be used in a production environment; the server has no self-healing path other than being restarted.

I still hope that UCX can resolve this issue, though it seems quite challenging at the moment.

We have related experience from our production environment: maintain a separate FIFO queue for each producer (an SPSC FIFO); if a producer fails, the server removes the FIFO associated with that producer. The drawback is that you may need to maintain a large number of FIFO queues, but in our environment this has not been much of a problem so far.
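
To illustrate, a minimal sketch of that design (not UCX code; spsc_fifo_t and producer_alive are illustrative names):

#include <errno.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>

typedef struct {
    pid_t            producer_pid; /* recorded by the producer at attach */
    _Atomic uint64_t head;         /* advanced only by the producer      */
    _Atomic uint64_t tail;         /* advanced only by the consumer      */
    /* fixed-size element array follows in the shared segment */
} spsc_fifo_t;

/* Consumer side: kill(pid, 0) probes process existence without sending
 * a signal; ESRCH means the producer is gone, so the server can drop
 * this ring (and any half-committed element) instead of waiting on it. */
static int producer_alive(const spsc_fifo_t *fifo)
{
    return (kill(fifo->producer_pid, 0) == 0) || (errno == EPERM);
}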
