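Training has failed three times, each time at the 20th iteration. Could someone take a look? The first two runs failed with the following error: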
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bfb40d4d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4bfb3d736b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4b946cdb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1985457 (0x7f4b9696d457 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1d4b680 (0x7f4be3baa680 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x7f4be3bab812 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x7f4be481a7bf in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x7f4be3e9e2ab in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d2206b (0x7f4be4b8106b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5b453 (0x7f4be49ba453 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4015f9b (0x7f4be5e74f9b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x401641e (0x7f4be5e7541e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x7f4be43ee819 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x7f4be3e94e5b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2eeef81 (0x7f4be4d4df81 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x7f4be456d15e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x132 (0x7f4bfb869d22 in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #19: NeuralNetwork::infer() + 0xb6b (0x7f4bfb86777b in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #20: <unknown function> + 0x5972d (0x7f4bfb86872d in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #21: <unknown function> + 0x145a0 (0x7f4bfba115a0 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #22: <unknown function> + 0x8609 (0x7f4c1b7ff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f4c1b724133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
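For the third run, I followed the suggestion in the error message and set CUDA_LAUNCH_BLOCKING=1 before launching. That run ended with the following error: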
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/neural_network/___torch_mangle_1624.py", line 30, in forward
    p_conv = self.p_conv
    res_layers = self.res_layers
    _0 = (res_layers).forward(inputs, )
    ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _1 = (p_bn).forward((p_conv).forward(_0, ), )
    _2 = (relu).forward(_1, )
File "code/__torch__/torch/nn/modules/container/___torch_mangle_1613.py", line 16, in forward
    _1 = getattr(self, "1")
    _0 = getattr(self, "0")
    _4 = (_1).forward((_0).forward(inputs, ), )
    ~~~~~~~~~~~ <--- HERE
    return (_3).forward((_2).forward(_4, ), )
File "code/__torch__/neural_network/___torch_mangle_1594.py", line 25, in forward
    _1 = (conv2).forward((relu).forward(_0, ), )
    _2 = (bn2).forward(_1, )
    _3 = (downsample_bn).forward((downsample_conv).forward(inputs, ), )
    ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    input = torch.add_(_2, _3)
    return (relu).forward1(input, )
File "code/__torch__/torch/nn/modules/conv/___torch_mangle_1592.py", line 10, in forward
    inputs: Tensor) -> Tensor:
    weight = self.weight
    input = torch._convolution(inputs, weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
    ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input
Traceback of TorchScript, original code (most recent call last):
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(459): _conv_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(463): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(47): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py(217): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(84): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(1056): trace_module
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(794): trace
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(279): save_model
/root/autodl-tmp/alpha-zero-gomoku/test/../src/learner.py(114): learn
learner_test.py(17): <module>
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Aborted (core dumped)
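For anyone reproducing this, one way to set CUDA_LAUNCH_BLOCKING=1 for a single run is to launch the script from a small wrapper. This is only a sketch: the helper names `blocking_env` and `run_with_blocking_launches` are hypothetical and not part of this repository, and `learner_test.py` is taken from the traceback above. The variable has to be in the environment before the process initializes CUDA, which is why the wrapper starts a fresh interpreter.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    `base` defaults to the current process environment; the input is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(script, extra_args=()):
    """Run `script` in a fresh Python interpreter with synchronous CUDA launches.

    Hypothetical helper: it returns the CompletedProcess so the caller can
    inspect the return code of the training run.
    """
    command = [sys.executable, script, *extra_args]
    return subprocess.run(command, env=blocking_env(), check=False)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python run_blocking.py learner_test.py
    result = run_with_blocking_launches(sys.argv[1], sys.argv[2:])
    sys.exit(result.returncode)
```

With synchronous launches the error tends to surface at the call that actually failed, which is consistent with the second traceback pointing at a convolution rather than at the tensor copy in `NeuralNetwork::infer()`.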
Use CUDA 11.6, PyTorch 1.10, LibTorch 1.10 (Pre-cxx11 ABI), and SWIG 4.0.2.
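A quick way to compare the Python side of an environment against these versions is sketched below. The helper names `version_prefix_matches` and `check_versions` are made up for illustration. The script only inspects `torch.__version__` and `torch.version.cuda`; it does not check the LibTorch build used for the C++ library or the SWIG version, which would need to be verified separately.

```python
def version_prefix_matches(installed, wanted):
    """True if `installed` (e.g. '1.10.2+cu116') begins with the components of `wanted` (e.g. '1.10')."""
    core = installed.split("+", 1)[0]  # drop a local suffix such as '+cu116'
    wanted_parts = wanted.split(".")
    return core.split(".")[: len(wanted_parts)] == wanted_parts


def check_versions(torch_version, cuda_version, want_torch="1.10", want_cuda="11.6"):
    """Return a list of human-readable mismatches; an empty list means both match."""
    problems = []
    if not version_prefix_matches(torch_version, want_torch):
        problems.append(f"PyTorch is {torch_version}, expected {want_torch}.x")
    if cuda_version is None:
        problems.append("PyTorch was built without CUDA support")
    elif not version_prefix_matches(cuda_version, want_cuda):
        problems.append(f"PyTorch was built for CUDA {cuda_version}, expected {want_cuda}")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        issues = check_versions(str(torch.__version__), torch.version.cuda)
        print("\n".join(issues) if issues else "PyTorch and CUDA versions match")
```

Comparing by component rather than by string prefix avoids treating `1.1.0` as a match for `1.10`.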