CUDA error: operation not permitted when stream is capturing #691

acmilannesta · 2025-02-21T03:01:59Z

An asynchronous error occurs when training the MoE module on multiple devices. I also tried using wait stream, but it still raises the error.
```
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/trainer/ppytorch/utils/torchscript_converter.py", line 502, in forward
obj = torch.pad(_151, [0, int(torch.rsub(_153, 63))])
_154 = (dcn_layers).forward((masknet_layers).forward(obj, ), )
_155 = torch.pad((MoE).forward(_154, ), [1, 0])
~~~~~~~~~~~~ <--- HERE
_156 = torch.split(torch.sigmoid(_155), 1, 1)
_157, _158, _159, _160, _161, _162, _163, _164, _165, _166, _167, _168, _169, = _156
File "code/torch/trainer/ppytorch/mlenv/common/inference_utils/___torch_mangle_382.py", line 11, in forward
module = self.module
x = torch.to(argument_1, 15)
return torch.to((module).forward(x, ), 6)
~~~~~~~~~~~~~~~ <--- HERE
File "code/torch/trainer/ppytorch/mlenv/common/packageable_modules/deepseek_moe.py", line 77, in forward
_41 = annotate(List[Optional[Tensor]], [idx2])
y3 = torch.index_put(y2, _41, _40)
idx3, top3, = torch.where(torch.eq(_0, 39))
~~~~~~~~~~~ <--- HERE
_42 = annotate(List[Optional[Tensor]], [idx3])
_43 = torch.index(y3, _42)

RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
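A plausible reading of the traceback, not confirmed in this issue: the failing line calls single-argument `torch.where(cond)`. That call is equivalent to `torch.nonzero(cond, as_tuple=True)`. Its output size depends on the data, so on CUDA the match count has to be copied back to the host. That synchronization is not permitted while a stream is being captured into a CUDA graph, which would also explain why waiting on another stream does not help.

The sketch below compares a gather/scatter dispatch loop, similar in spirit to the one in `deepseek_moe.py`, with a static-shape alternative that uses three-argument `torch.where` (an elementwise select with no host sync). It makes several assumptions. The function names `moe_indexed` and `moe_dense` are hypothetical. Routing is simplified to one hypothetical expert id per token (`expert_ids`). Each expert is assumed to be an `nn.Linear`. The real module's code is not shown in the issue.

```python
import torch
from torch import nn


def moe_indexed(x: torch.Tensor, expert_ids: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
    """Hypothetical gather/scatter dispatch, similar in spirit to the failing loop."""
    y = x.new_zeros((x.shape[0], experts[0].out_features))
    for e, expert in enumerate(experts):
        # Single-argument torch.where == torch.nonzero(..., as_tuple=True).
        # The number of matches is data-dependent, so on CUDA the result size
        # must be read back on the host, which stream capture does not allow.
        (idx,) = torch.where(expert_ids == e)
        y = y.index_put((idx,), expert(x[idx]))
    return y


def moe_dense(x: torch.Tensor, expert_ids: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
    """Hypothetical static-shape dispatch: every expert sees every token."""
    y = x.new_zeros((x.shape[0], experts[0].out_features))
    for e, expert in enumerate(experts):
        # Three-argument torch.where is an elementwise select whose output
        # shape is fixed by its inputs, so no device-to-host sync is needed.
        mask = (expert_ids == e).unsqueeze(-1)
        y = torch.where(mask, expert(x), y)
    return y


if __name__ == "__main__":
    torch.manual_seed(0)
    num_tokens, d_in, d_out, num_experts = 16, 8, 4, 3
    experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(num_experts))
    x = torch.randn(num_tokens, d_in)
    # Hypothetical top-1 routing: one expert id per token.
    expert_ids = torch.randint(0, num_experts, (num_tokens,))
    with torch.no_grad():
        same = torch.allclose(
            moe_indexed(x, expert_ids, experts),
            moe_dense(x, expert_ids, experts),
            atol=1e-6,
        )
    print(same)
```

Both functions give the same result on CPU. Only the dense version keeps every shape independent of the routing decisions, which is what CUDA graph capture needs. The trade-off is compute: each expert runs on all tokens instead of only its routed tokens. Other capture-friendly options include fixed-capacity dispatch and keeping the MoE layer outside the captured region.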
