Hi, we have a simple program that runs successfully on NVIDIA A100 GPUs but fails on AMD MI250/MI210. The code uses a proxy channel over a CudaIPC connection. We wonder if the proxy channel is still buggy on AMD GPUs.

The code does the following:

Rank 0: Initialize a cp.zeros data buffer and a cp.ones scratch buffer. Perform data += scratch, and then putWithSignal the data buffer into rank 1's data buffer through the proxy channel.
Rank 1: Initialize a cp.zeros data buffer and wait for rank 0's signals.

The data is 1024*256 ints in total and is reduced and sent in chunks of 4*blockDim.x elements. We launch only one 128-thread threadblock. After the code finishes, both rank 0 and rank 1 should have all-ones data buffers. However, we noticed that rank 1's data buffer is still all zeros in data[66048...196607], while rank 0's data buffer is correctly all ones.
There is a __syncthreads() after the reduce, and a check that data[start] == 1 never fails. Somehow the putWithSignal afterwards still puts all-zero data for some chunks. The same program runs successfully on NVIDIA GPUs, or with a smaller nelem_total on AMD GPUs. We wonder if there is a bug in the proxy channel or ROCm.

Platform info:
863a59936084b0dd88c221185841b8c773d17446
0188dd8b16938fa835bcda797f70f9af2f8b4980

Attached (collapsed sections): Kernel code, Python code, Output.
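For concreteness, here is a minimal sketch of the setup described above (buffer allocation and chunk arithmetic only; the element type cp.int32 and the 4*blockDim.x-element chunk size are assumptions, and the actual kernel and channel setup are in the attached code):

```python
import cupy as cp

NELEM_TOTAL = 1024 * 256        # 262144 ints in total
BLOCK_DIM = 128                 # single 128-thread threadblock
CHUNK = 4 * BLOCK_DIM           # 512 ints reduced and sent per chunk (assumed)
NCHUNKS = NELEM_TOTAL // CHUNK  # 512 chunks

# Rank 0: data starts at zero, scratch is all ones.
data0 = cp.zeros(NELEM_TOTAL, dtype=cp.int32)
scratch0 = cp.ones(NELEM_TOTAL, dtype=cp.int32)
# Kernel (per chunk): data0[start:start+CHUNK] += scratch0[start:start+CHUNK],
# __syncthreads(), check data0[start] == 1, then putWithSignal the chunk into
# rank 1's data buffer through the proxy channel.

# Rank 1: data starts at zero and is only written by rank 0's putWithSignal.
data1 = cp.zeros(NELEM_TOTAL, dtype=cp.int32)

# Under the assumed chunk size, the region that stays zero on rank 1 is
# chunk-aligned: 66048 == 129 * CHUNK and 196607 == 384 * CHUNK - 1,
# i.e. chunks 129 through 383 arrive as zeros.
```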
@liangyuRain Please try out #423 and use mscclpp.utils.GpuBuffer instead of CuPy arrays for the buffers your communication happens on. The usage is the same as CuPy ndarrays. We still cannot guarantee it will work, because CuPy doesn't officially support ROCm 6.x yet and we haven't tried CuPy with AMD MI250/MI210. Please let us know if you still encounter any issues.
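For example, assuming GpuBuffer accepts a shape and dtype like a regular CuPy ndarray (per the comment above that the usage is the same), the buffer allocation would change roughly like this:

```python
import cupy as cp
from mscclpp.utils import GpuBuffer

nelem_total = 1024 * 256

# Before: plain CuPy allocations
# data = cp.zeros(nelem_total, dtype=cp.int32)
# scratch = cp.ones(nelem_total, dtype=cp.int32)

# After: allocate the communication buffers with GpuBuffer, then fill them.
# Everything else (kernel launch, registration, channel setup) stays the same,
# since GpuBuffer behaves like a CuPy ndarray.
data = GpuBuffer(nelem_total, dtype=cp.int32)
data[:] = 0
scratch = GpuBuffer(nelem_total, dtype=cp.int32)
scratch[:] = 1
```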
@chhwang Thanks a lot for fixing this! I believe the problem is resolved. However, there seems to be a separate issue with AMD GPUs, #439, that predates this fix.