
mlx5 connect on mlx5_1 failed: Connection timed out #9971

Open
shinoharakazuya opened this issue Jun 24, 2024 · 15 comments

@shinoharakazuya

Describe the bug

I'm running NGC's HPL benchmark from Slurm. When I ran HPL in the HPL container on two servers with 8 GPUs per node, I encountered a UCX error.

Steps to Reproduce

  • Command line: Please see log file.
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v): Please see log file.
  • Any UCX environment variables used

Setup and versions

  • OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...): Please see log file.
    • cat /etc/issue or cat /etc/redhat-release + uname -a
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues: Please see log file.
    • Driver version:
      • rpm -q rdma-core or rpm -q libibverbs
      • or: MLNX_OFED version ofed_info -s
    • HW information from ibstat or ibv_devinfo -vv command
  • For GPU related issues:
    • GPU type: H100
    • CUDA:
      • Driver version: 12.2
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv : Please see log file.

Additional information (depending on the issue)

  • Open MPI version: 5.0.3
  • Output of ucx_info -d to show transports and devices recognized by UCX: Please see log file.
@shinoharakazuya
Author

logfile.txt

@yosefe
Contributor

yosefe commented Jun 29, 2024

@shinoharakazuya can you pls post the output of show_gids command, and check if setting UCX_IB_ROCE_LOCAL_SUBNET=y helps to resolve the issue?
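For anyone following along, a minimal sketch of applying that override before relaunching. Only the variable name `UCX_IB_ROCE_LOCAL_SUBNET` comes from this thread; the launch line is a placeholder, not the reporter's actual command:

```shell
# Sketch: export the suggested override in the job environment before
# relaunching. Only UCX_IB_ROCE_LOCAL_SUBNET itself comes from this
# thread; the launch command below is a placeholder.
export UCX_IB_ROCE_LOCAL_SUBNET=y   # prefer RoCE peers on the local IP subnet
echo "UCX_IB_ROCE_LOCAL_SUBNET=$UCX_IB_ROCE_LOCAL_SUBNET"
# srun ./run_hpl.sh ...             # placeholder for the original launch line
```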

@changchengx
Contributor

@jandres742 FYI

@yosefe
Contributor

yosefe commented Jun 30, 2024

NOTE: This issue happens on an Nvidia internal cluster

@ivanallen

@yosefe I have the same issue.

client:

UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc  UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152  -n 5000000 -e

server:

UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest -e
[1737007020.457942] [node13:967927:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737007020.457946] [node13:967927:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory                          |
[1737007020.457947] [node13:967927:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737007020.457950] [node13:967927:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737007020.457953] [node13:967927:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737007020.457955] [node13:967927:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_1:1 |
[1737007020.457958] [node13:967927:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737007021.503646] [node13:967927:a]       ib_device.c:1332 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::a288:c2ff:feb4:87d7 flow_label=0xffffffff sgid_index=1 traffic_class=106) for RC DEVX QP connect on mlx5_1 failed: Connection timed out
[1737007021.503771] [node13:967927:0]         libperf.c:1069 UCX  ERROR error handler called with status -80 (Endpoint timeout)
[root@node12 ucx-1.18.0]# show_gids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_0  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:87e6                 v1      ens2f0np0
mlx5_0  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:87e6                 v2      ens2f0np0
mlx5_0  1       2       0000:0000:0000:0000:0000:ffff:0a10:1d0c 10.16.29.12     v1      ens2f0np0
mlx5_0  1       3       0000:0000:0000:0000:0000:ffff:0a10:1d0c 10.16.29.12     v2      ens2f0np0
mlx5_1  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:87e7                 v1      ens2f1np1
mlx5_1  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:87e7                 v2      ens2f1np1
mlx5_2  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:a562                 v1      ens7f0np0
mlx5_2  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:a562                 v2      ens7f0np0
mlx5_2  1       2       0000:0000:0000:0000:0000:ffff:0a10:270c 10.16.39.12     v1      ens7f0np0
mlx5_2  1       3       0000:0000:0000:0000:0000:ffff:0a10:270c 10.16.39.12     v2      ens7f0np0
mlx5_3  1       0       fe80:0000:0000:0000:a288:c2ff:feb4:a563                 v1      ens7f1np1
mlx5_3  1       1       fe80:0000:0000:0000:a288:c2ff:feb4:a563                 v2      ens7f1np1
n_gids_found=12
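As a quick way to read a GID table like the one above: devices whose rows never have a filled IPv4 column only expose link-local GIDs. A small sketch of flagging those devices; the sample lines are trimmed from the output above, and detecting the IPv4 column by field count is my assumption about show_gids' layout:

```shell
# Sketch: list RDMA devices whose GID table has no IPv4 (routable RoCE v2)
# entry. The sample is trimmed from the show_gids output above; detecting
# the IPv4 column by field count is an assumption about the tool's layout.
cat > gids.txt <<'EOF'
mlx5_0  1  3  0000:0000:0000:0000:0000:ffff:0a10:1d0c  10.16.29.12  v2  ens2f0np0
mlx5_1  1  1  fe80:0000:0000:0000:a288:c2ff:feb4:87e7  v2  ens2f1np1
mlx5_2  1  3  0000:0000:0000:0000:0000:ffff:0a10:270c  10.16.39.12  v2  ens7f0np0
mlx5_3  1  1  fe80:0000:0000:0000:a288:c2ff:feb4:a563  v2  ens7f1np1
EOF
awk '$1 ~ /^mlx5_/ {
       seen[$1] = 1
       if (NF == 7) has_ip[$1] = 1   # 7 columns => IPv4 column is filled
     }
     END { for (d in seen) if (!(d in has_ip)) print d }' gids.txt | sort
```

On the sample above this prints mlx5_1 and mlx5_3, the two devices left unconfigured.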

@yosefe
Contributor

yosefe commented Jan 16, 2025

@ivanallen mlx5_1 does not have an IP address, is that expected?

@ivanallen

@ivanallen mlx5_1 does not have an IP address, is that expected?

Yes, that is expected. We don't configure mlx5_1 and mlx5_3.

@yosefe
Contributor

yosefe commented Jan 16, 2025

Yes, that is expected. We don't configure mlx5_1 and mlx5_3.

It seems like the test is being run on mlx5_1? Per the command above:

UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152 -n 5000000 -e

@ivanallen

Yes, that is expected. We don't configure mlx5_1 and mlx5_3.

It seems like the test is being run on mlx5_1? Per the command above:

UCX_PROTO_ENABLE=y UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_TLS=rc UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 2097152 -n 5000000 -e

@yosefe Do you mean using mlx5_2? mlx5_1 has no IP address. I have the same problem if I use UCX_NET_DEVICES=mlx5_0:1 and mlx5_1:1.

server:

[root@node13 ucx-1.18.0]# UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest -e
[1737024891.886842] [node13:2797867:0]        perftest.c:800  UCX  WARN  CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.16.29.12:52468
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                             |
| Test:         am bandwidth / message rate                                                                |
| Data layout:  (automatic)                                                                                |
| Send memory:  host                                                                                       |
| Recv memory:  host                                                                                       |
| Message size: 1048576                                                                                    |
| Window size:  32                                                                                         |
| AM header size: 0                                                                                        |
+----------------------------------------------------------------------------------------------------------+
[1737024893.671591] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.671602] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* from host memory                                                 |
[1737024893.671606] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671609] [node13:2797867:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.671612] [node13:2797867:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.671613] [node13:2797867:0]   |               8247..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.671616] [node13:2797867:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.671619] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671782] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.671786] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(fast-completion) from host memory                                |
[1737024893.671788] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.671791] [node13:2797867:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.671794] [node13:2797867:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.671796] [node13:2797867:0]   |               8247..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.671798] [node13:2797867:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024893.671801] [node13:2797867:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.671802] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672161] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672165] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(multi) from host memory                                          |
[1737024893.672166] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672171] [node13:2797867:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672173] [node13:2797867:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024893.672175] [node13:2797867:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672178] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672367] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672371] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory                                 |
[1737024893.672373] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672375] [node13:2797867:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672377] [node13:2797867:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.672378] [node13:2797867:0]   |               8239..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.672381] [node13:2797867:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672384] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672535] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672538] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(fast-completion) from host memory                |
[1737024893.672540] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672543] [node13:2797867:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672545] [node13:2797867:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024893.672548] [node13:2797867:0]   |               8239..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024893.672551] [node13:2797867:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024893.672554] [node13:2797867:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672556] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672739] [node13:2797867:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024893.672742] [node13:2797867:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory                          |
[1737024893.672744] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024893.672748] [node13:2797867:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024893.672752] [node13:2797867:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024893.672755] [node13:2797867:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024893.672756] [node13:2797867:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024894.680128] [node13:2797867:0]       ib_device.c:1332 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.16.39.12 flow_label=0xffffffff sgid_index=3 traffic_class=106) for RC DEVX QP connect on mlx5_2 failed: Connection timed out
[1737024894.680178] [node13:2797867:0]         libperf.c:1069 UCX  ERROR error handler called with status -80 (Endpoint timeout)
[root@node13 ucx-1.18.0]#

client:

[root@node12 ucx-1.18.0]# UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1 UCX_PROTO_ENABLE=y UCX_TLS=rc  UCX_PROTO_INFO=y ./install-release-mt/bin/ucx_perftest 10.16.29.13 -t ucp_am_bw -s 1048576  -n 5000000 -e
[1737024666.944378] [node13:2783427:0]        perftest.c:800  UCX  WARN  CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1737024667.122800] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.122811] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* from host memory                                                 |
[1737024667.122814] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122817] [node13:2783427:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.122819] [node13:2783427:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.122822] [node13:2783427:0]   |               8247..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.122825] [node13:2783427:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.122827] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122978] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.122982] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(fast-completion) from host memory                                |
[1737024667.122984] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.122987] [node13:2783427:0]   |                   0..2038 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.122990] [node13:2783427:0]   |                2039..8246 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.122993] [node13:2783427:0]   |               8247..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.122997] [node13:2783427:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024667.122999] [node13:2783427:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123001] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123351] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123355] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send*(multi) from host memory                                          |
[1737024667.123356] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123360] [node13:2783427:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123362] [node13:2783427:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024667.123364] [node13:2783427:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123368] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123534] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123537] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory                                 |
[1737024667.123541] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123543] [node13:2783427:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123545] [node13:2783427:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.123546] [node13:2783427:0]   |               8239..29420 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.123550] [node13:2783427:0]   |                29421..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123553] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123705] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123708] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(fast-completion) from host memory                |
[1737024667.123710] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123714] [node13:2783427:0]   |                   0..2030 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123716] [node13:2783427:0]   |                2031..8238 | copy-in                                   | rc_mlx5/mlx5_0:1                                    |
[1737024667.123718] [node13:2783427:0]   |               8239..22493 | multi-frag copy-in                        | rc_mlx5/mlx5_0:1                                    |
[1737024667.123720] [node13:2783427:0]   |             22494..262143 | multi-frag zero-copy                      | rc_mlx5/mlx5_0:1                                    |
[1737024667.123723] [node13:2783427:0]   |                 256K..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123727] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123900] [node13:2783427:0]   +---------------------------+-------------------------------------------------------------------------------------------------+
[1737024667.123904] [node13:2783427:0]   | perftest inter-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from host memory                          |
[1737024667.123906] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024667.123909] [node13:2783427:0]   |                    0..514 | short                                     | rc_mlx5/mlx5_0:1                                    |
[1737024667.123912] [node13:2783427:0]   |                 515..4844 | zero-copy                                 | rc_mlx5/mlx5_0:1                                    |
[1737024667.123914] [node13:2783427:0]   |                 4845..inf | (?) rendezvous zero-copy read from remote | 50% on rc_mlx5/mlx5_0:1 and 50% on rc_mlx5/mlx5_2:1 |
[1737024667.123917] [node13:2783427:0]   +---------------------------+-------------------------------------------+-----------------------------------------------------+
[1737024720.960016] [node13:2783427:0]         libperf.c:1069 UCX  ERROR error handler called with status -80 (Endpoint timeout)

@yosefe
Contributor

yosefe commented Jan 16, 2025

Can you try:
ping -I ens7f0np0 10.16.39.12 on node13?

@yosefe
Contributor

yosefe commented Jan 16, 2025

Also, can you try adding UCX_IB_ROCE_LOCAL_SUBNET=y (to both client and server)?

@ivanallen

@yosefe Sorry, it looks like a network failure. I'll look into it myself first.

Can you try: ping -I ens7f0np0 10.16.39.12 on node13?

[root@localhost network-scripts]# ping -I ens7f0np0 10.16.39.12
PING 10.16.39.12 (10.16.39.12) from 10.16.39.13 ens7f0np0: 56(84) bytes of data.
From 10.16.39.13 icmp_seq=1 Destination Host Unreachable
From 10.16.39.13 icmp_seq=2 Destination Host Unreachable
From 10.16.39.13 icmp_seq=3 Destination Host Unreachable
From 10.16.39.13 icmp_seq=4 Destination Host Unreachable
From 10.16.39.13 icmp_seq=5 Destination Host Unreachable
[root@localhost network-scripts]# ping 10.16.39.12
PING 10.16.39.12 (10.16.39.12) 56(84) bytes of data.
From 10.16.39.13 icmp_seq=1 Destination Host Unreachable
From 10.16.39.13 icmp_seq=2 Destination Host Unreachable
From 10.16.39.13 icmp_seq=3 Destination Host Unreachable

@ivanallen

@yosefe
When I configure UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1, it already works, but I get only 2×100 Gbps of bandwidth.

However, in my other environment (425 Gbps), it also works properly without limiting UCX_NET_DEVICES to mlx5_0:1,mlx5_2:1, and reaches the full 425 Gbps of bandwidth.

@yosefe
Contributor

yosefe commented Jan 19, 2025

@ivanallen what is the network speed of each NIC (can be checked by ibstat or ibv_devinfo)?
Does the other environment have more configured NICs?
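Besides ibstat and ibv_devinfo, the per-port link rate is also exposed through sysfs. A sketch, assuming the standard Linux RDMA sysfs layout and single-port devices on port 1 (as in the show_gids output earlier in the thread):

```shell
# Sketch: print the link rate of each RDMA device via sysfs
# (/sys/class/infiniband/<dev>/ports/1/rate). Assumes the standard
# Linux RDMA sysfs layout and port 1, as in this thread's devices.
found=0
for d in /sys/class/infiniband/*; do
  [ -e "$d/ports/1/rate" ] || continue
  found=1
  printf '%s: %s\n' "$(basename "$d")" "$(cat "$d/ports/1/rate")"
done
[ "$found" -eq 1 ] || echo "no RDMA devices found"
```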

@ivanallen

@ivanallen what is the network speed of each NIC (can be checked by ibstat or ibv_devinfo)? Does the other environment have more configured NICs?

Hi @yosefe, can we look at #10430 first? I suspect there is a problem with the conversion between bond and non-bond. For now, let's look at the bandwidth of the bond environment first.
