[Issue]: tried to use nn.dataParallel however crashed #1421

Closed
jdgh000 opened this issue Nov 13, 2024 · 15 comments

@jdgh000

jdgh000 commented Nov 13, 2024

Problem Description

I ran the following example with small modifications, but it failed at run time:
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
The failure occurs only when I apply nn.DataParallel to the model; without it, the script works:
model = nn.DataParallel(model)

code:

import sys
sys.path.append('..')
# Model is defined in the local classes module (not shown here).
from classes import *

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    DEBUG = 0
    DEBUGL2 = 0

    def __init__(self, size, length):

        if self.DEBUG:
            print("GG: RandomDataset.__init__(size=", size, "length: ", length, ")")

        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):

        if self.DEBUGL2:
            print("GG: RandomDataset.__getitem__(index=", index, ")")

        return self.data[index]

    def __len__(self):

        if self.DEBUG:
            print("GG: RandomDataset.__len__() returning self.len: ", self.len)

        return len(self.data)

# Parameters and DataLoaders
input_size = 1000
output_size = 10

batch_size = 1000
data_size = 60000

if not torch.cuda.is_available():
    print("GPU is not detected.")
    quit(1)

device = torch.device("cuda:0")

# Create random data set: input size = 1k, data_size = 60k, batch_size: 1k.

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

model = Model(input_size, output_size)

if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(), "output_size", output.size())

[root@u488 dataparallellism]$ sudo python3 ex1.py
Let's use 8 GPUs!
Traceback (most recent call last):
  File "/root/pytorch/dataparallellism/ex1.py", line 41, in <module>
    output = model(input)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 103, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[root@u488 dataparallellism]$ nano -w "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py"
[root@u488 dataparallellism]$ cat /opt/rocm/.info/version
6.2.0-66

Operating System

rhel9

CPU

9500hx ryzen

GPU

mi250

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

Run the example code with nn.DataParallel (the actual code is pasted in the problem description):

https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@harkgill-amd

Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.

@zichguan-amd

zichguan-amd commented Nov 13, 2024

Hi @jdgh000, it looks like you are running on a laptop with integrated graphics. You can check whether rocminfo shows two graphics devices. Since integrated graphics are not supported, you can bypass them by setting the environment variable HIP_VISIBLE_DEVICES so that only the discrete GPU is used, as documented here: https://rocmdocs.amd.com/projects/HIP/en/develop/how-to/debugging.html#making-device-visible

@jdgh000
Author

jdgh000 commented Nov 13, 2024

> Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.

Thanks, let me know.

@harkgill-amd

As @zichguan-amd mentioned, this has to do with the example being run on your APU rather than a dedicated graphics card. Correct me if I'm wrong, but I believe you're running on a 5900HX. Could you try running directly on your dGPU by adding this at the top of your Python script?

import os
os.environ['HIP_VISIBLE_DEVICES'] = '0'
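
As a minimal sketch of the same idea, the variable can be set through a small helper before torch is imported, since device visibility is normally decided when the GPU runtime initializes. The helper name `restrict_visible_gpus` is hypothetical and only for illustration, and which index corresponds to the discrete GPU is an assumption that should be checked against rocminfo.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to HIP.

    Hypothetical helper: call it before importing torch so the
    setting is in place when the GPU runtime starts up.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["HIP_VISIBLE_DEVICES"] = value
    return value


# Assumption: index 0 is the discrete GPU on this machine.
restrict_visible_gpus([0])

# import torch  # import torch only after the variable has been set
```

Setting the variable inside the script also avoids depending on how the shell or sudo passes the environment along.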
 

@jdgh000
Author

jdgh000 commented Nov 14, 2024

This is not an APU, for sure; the CPU model I entered is wrong. The GPU is an MI250. Since the CPU model is not that important, I just typed in the suggested value.

@jdgh000
Author

jdgh000 commented Nov 14, 2024

Name: AMD EPYC 7763 64-Core Processor
Name: AMD EPYC 7763 64-Core Processor
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a

@zichguan-amd

In that case can you run with NCCL_DEBUG=INFO or NCCL_DEBUG=TRACE for details as suggested by the error message?

@jdgh000
Author

jdgh000 commented Nov 14, 2024

I saw the prompt and tried it a few times, but it does not seem to output any more than without it, with either TRACE or INFO:

sudo mkdir log ; NCCL_DEBUG=INFO sudo python3 ex1.py 2>&1 | sudo tee log/ex1-NCCL_DEBUG.INFO.log
mkdir: cannot create directory ‘log’: File exists
Let's use 8 GPUs!
Traceback (most recent call last):
  File "/root/pytorch/dataparallellism/1-dataparallellism/ex1.py", line 41, in <module>
    output = model(input)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 103, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)



@jdgh000
Author

jdgh000 commented Nov 14, 2024

It seems to be failing in one of these:
/usr/local/lib64/python3.9/site-packages/torch/_C/__init__.pyi:10823:def _broadcast_coalesced(
/usr/local/lib64/python3.9/site-packages/torch/_C/_distributed_c10d.pyi:619:def _broadcast_coalesced(
But I can only see the function prototype, not the body, so I cannot see what is going on in these calls.
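
For context, .pyi files are type stubs, and torch._C is a compiled extension module, so there is no Python body to read there. A minimal sketch of how to check whether any callable has Python source available, using only the standard library (`source_or_none` is a hypothetical helper name):

```python
import inspect


def source_or_none(obj):
    """Return the Python source of obj, or None if none is available.

    inspect.getsource raises TypeError for built-in (compiled)
    callables and OSError when the source file cannot be found.
    """
    try:
        return inspect.getsource(obj)
    except (TypeError, OSError):
        return None


# len is implemented in C, so there is no Python source to show.
print(source_or_none(len))  # prints None
```

A None result for a torch function would suggest that its body has to be read in the PyTorch source tree rather than in site-packages.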

@zichguan-amd

With sudo you need to use -E to preserve the environment variables. Also, can you upgrade to the latest ROCm 6.2.4 and PyTorch 2.5.1 and see if that fixes it?
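
To illustrate: in `NCCL_DEBUG=INFO sudo python3 ex1.py` the variable is set for sudo itself, and sudo typically resets the environment before starting the command, so the Python process may never see it. A minimal sketch that builds the command with `-E` and an environment containing NCCL_DEBUG; `build_sudo_command` is a hypothetical helper, and the script name is taken from the log above.

```python
import os
import shlex


def build_sudo_command(script, debug_level="INFO", preserve_env=True):
    """Build a sudo command line plus an environment with NCCL_DEBUG set.

    Hypothetical helper: it only constructs the arguments and does
    not run anything.
    """
    cmd = ["sudo"]
    if preserve_env:
        cmd.append("-E")  # ask sudo to keep the caller's environment
    cmd += ["python3", script]
    env = dict(os.environ, NCCL_DEBUG=debug_level)
    return cmd, env


cmd, env = build_sudo_command("ex1.py")
print(shlex.join(cmd))  # prints: sudo -E python3 ex1.py
```

The pair can then be passed to `subprocess.run(cmd, env=env)`. Whether `-E` is honored depends on the local sudoers policy.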

@jdgh000
Author

jdgh000 commented Nov 15, 2024

It is already torch 2.5.1 and ROCm 6.2.4:
torch 2.5.1+rocm6.2
torchaudio 2.5.1+rocm6.2
torchvision 0.20.1+rocm6.2

@jdgh000
Author

jdgh000 commented Nov 16, 2024

You said you reproduced it, so shouldn't you be able to look into this instead of poking around blindly? I am not able to do the experimental steps at this point. I have reported enough for you to see it on your side.

Secondly, your reasoning and logic are very weak on this. You have already seen it on your system, but later you attempt to attribute it to the APU; the claim that it is due to the APU is already negated by the fact that you were able to see it on your side. @zichguan-amd, please don't have me try fruitless steps, i.e. debug environment variables and upgrading; it is just spinning the wheels all the time. Please instead follow the reasoning and logic to address this issue!

@zichguan-amd

We were only able to reproduce this issue when using integrated graphics, so we kindly ask you to provide more details in order for us to help you find a fix. NCCL Error 1 can have different causes, including HW failure, see pytorch/pytorch#11756. You may also want to check if the error only occurs when using some specific GPUs.
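
One possible way to check individual GPUs is to probe them one at a time and record which ones raise. The sketch below is an illustration only: `find_failing_gpus` and `broadcast_probe` are hypothetical helpers, and the probe assumes that `comm.broadcast_coalesced` from torch/nn/parallel/comm.py (the call shown in the traceback) exercises the same path that fails under nn.DataParallel.

```python
def find_failing_gpus(device_ids, probe):
    """Run probe(device_id) for each id and collect the errors raised."""
    failures = {}
    for device_id in device_ids:
        try:
            probe(device_id)
        except Exception as exc:  # record the error and keep going
            failures[device_id] = repr(exc)
    return failures


def broadcast_probe(device_id, source=0):
    """Hypothetical probe: broadcast a small tensor from source to device_id."""
    # Imported lazily so find_failing_gpus itself does not need torch.
    import torch
    from torch.nn.parallel import comm

    tensor = torch.ones(4, device=f"cuda:{source}")
    comm.broadcast_coalesced([tensor], [source, device_id])
    torch.cuda.synchronize(device_id)


# Example usage on an 8-GPU machine (not executed here):
# failures = find_failing_gpus(range(1, 8), broadcast_probe)
# print(failures)
```

An empty result would suggest the problem is not tied to one device; a single failing id would point at that GPU or its link.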

@jdgh000
Author

jdgh000 commented Dec 1, 2024

It is not working on ROCm. On an NVIDIA RTX GPU it runs:

  return F.linear(input, self.weight, self.bias)
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
        In Model: input size torch.Size([500, 1000]) output size torch.Size([500, 10])
Outside: input size torch.Size([1000, 1000]) output_size torch.Size([1000, 10])

@jdgh000
Author

jdgh000 commented Dec 1, 2024

> We were only able to reproduce this issue when using integrated graphics, so we kindly ask you to provide more details in order for us to help you find a fix. NCCL Error 1 can have different causes, including HW failure, see pytorch/pytorch#11756. You may also want to check if the error only occurs when using some specific GPUs.

What made you think you were able to reproduce it only on IG? Nowhere above does it say that. I gave you all the relevant information: the GPU models and the ROCm version are above, and you just ignored those and asked again. What you say here makes absolutely no sense, because you just changed the story by saying it is only reproducible on IG. Could you paste the logs from IG and discrete? I don't think you can, because it makes no sense!
