mlx5 connect on mlx5_1 failed: Connection timed out #9971
Comments
@shinoharakazuya can you pls post the output of |
@jandres742 FYI |
NOTE: This issue happens on Nvidia internal cluster |
@yosefe I have the same issue. client:
server:
|
@ivanallen mlx5_1 does not have an IP address, is that expected? |
Yes, that is expected. We don't configure mlx5_1 and mlx5_3. |
Seems like the test is being run on mlx5_1? Per the command above:
|
@yosefe Do you mean using mlx5_2? mlx5_1 has no ip address. I have the same problem if I use UCX_NET_DEVICES=mlx5_0:1 and mlx5_1:1. server:
client:
|
Can you try: |
Also, can you try adding UCX_IB_ROCE_LOCAL_SUBNET=y (to both client and server)? |
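For anyone following along, here is a minimal sketch of applying both suggestions on each node. It assumes the two configured ports are mlx5_0:1 and mlx5_2:1 (the devices named later in this thread) and uses ucx_perftest with the tag_bw test purely as an illustration; the helper names run_server and run_client are hypothetical and not part of UCX.

```shell
#!/bin/sh
# Sketch only: restrict UCX to the ports that have an IP address and
# enable local-subnet RoCE routing. Set the same values on client and server.
export UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1   # assumption: these are the configured ports
export UCX_IB_ROCE_LOCAL_SUBNET=y

# Hypothetical helper: start the perftest server (waits for a client).
run_server() {
    ucx_perftest -t tag_bw
}

# Hypothetical helper: start the client; $1 is the server host name or IP.
run_client() {
    ucx_perftest "$1" -t tag_bw
}
```

Run run_server on one node and run_client with the server's host name on the other. Because the variables are exported, both processes see the same device restriction.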
@yosefe Sorry, it looks like a network failure. I'll look into it myself first.
|
@yosefe However, in my other environment (425 Gbps), it also works properly without restricting UCX_NET_DEVICES to mlx5_0:1,mlx5_2:1, and reaches 425 Gbps bandwidth. |
@ivanallen what is the network speed of each NIC (can be checked by ibstat or ibv_devinfo)? |
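A small sketch for pulling the per-device link rate out of ibstat, assuming its usual text layout (a line starting with CA 'name' per device and a Rate: line per port). The function name nic_rates is hypothetical; it only filters text on stdin and does not query hardware itself.

```shell
#!/bin/sh
# Sketch: print "<device> <rate>" for every port found in ibstat-style output.
# Assumes each device block starts with: CA 'mlx5_0'
# and each port block contains a line such as: Rate: 100
nic_rates() {
    awk '
        /^CA /  { dev = $2; gsub("\047", "", dev) }  # strip the single quotes
        /Rate:/ { print dev, $2 }                    # rate as reported by ibstat
    '
}
```

Typical use would be ibstat | nic_rates, which gives one line per port and makes it easy to compare the sum of the rates with the measured bandwidth.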
Hi @yosefe, Can we look at #10430 first? I suspect there is a problem with the conversion between bond and non-bond. This time let's look at the bandwidth of the bond environment first. |
Describe the bug
I'm running NGC's HPL benchmark from Slurm. When I ran HPL in an HPL container on two servers with 8 GPUs per node, I encountered a UCX error.
Steps to Reproduce
ucx_info -v: Please see log file.
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ofed_info -s
ibstat or ibv_devinfo -vv command
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv: Please see log file.
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX: Please see log file.
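The commands listed in the template can be gathered into a single log file with a short script. This is a sketch under assumptions: the log file name ucx_diag.log and the helper names collect and collect_all are hypothetical, and commands that are not installed are simply noted rather than treated as errors.

```shell
#!/bin/sh
# Sketch: collect the diagnostics requested by the issue template into one file.
LOG="${LOG:-ucx_diag.log}"   # hypothetical default log file name

# Run a command if its executable exists and append its output to $LOG;
# otherwise record that it was not found.
collect() {
    printf '==== %s ====\n' "$*" >>"$LOG"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >>"$LOG" 2>&1 || printf '(exit status %s)\n' "$?" >>"$LOG"
    else
        printf '(%s not found)\n' "$1" >>"$LOG"
    fi
}

# Hypothetical wrapper covering the commands named in the template.
collect_all() {
    collect uname -a
    collect cat /etc/issue
    collect cat /etc/mlnx-release
    collect ofed_info -s
    collect ibv_devinfo -vv
    collect lsmod
    collect ucx_info -v
    collect ucx_info -d
}
```

Calling collect_all on each node and attaching the resulting file gives maintainers the same information as the "Please see log file" entries above.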