
raise EOFError and RuntimeError: CUDA error: device-side assert triggered #254

Open
Studypython2016 opened this issue Mar 22, 2022 · 2 comments


@Studypython2016

Starting to train...
2022-03-22 18:49:55,234 [Trainer-0] Loading entity counts...
2022-03-22 18:49:55,235 [Trainer-0] Creating workers...
2022-03-22 18:49:55,305 [Trainer-0] Initializing global model...
2022-03-22 18:49:55,329 [Trainer-0] Creating GPU workers...
2022-03-22 18:49:55,368 [Trainer-0] Starting epoch 1 / 3, edge path 1 / 1, edge chunk 1 / 1
2022-03-22 18:49:55,368 [Trainer-0] Edge path: BigGraph/pbg_test/2/pbg_test_train_partitioned
2022-03-22 18:49:55,369 [Trainer-0] still in queue: 0
2022-03-22 18:49:55,369 [Trainer-0] Swapping partitioned embeddings None ( 0 , 0 )
2022-03-22 18:49:55,369 [Trainer-0] Loading partitioned embeddings from checkpoint
2022-03-22 18:50:02,021 [GPU #3] GPU subprocess 3 up and running
2022-03-22 18:50:02,575 [GPU #0] GPU subprocess 0 up and running
2022-03-22 18:50:02,781 [GPU #2] GPU subprocess 2 up and running
2022-03-22 18:50:02,801 [GPU #1] GPU subprocess 1 up and running
........
/opt/conda/conda-bld/pytorch_1614378098133/work/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [118,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
Process GPU #3:
Traceback (most recent call last):
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 181, in run
    lr=job.lr,
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 288, in do_one_job
    batch_size=batch_size, model=model, batch_processor=trainer, edges=gpu_edges
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/batching.py", line 122, in process_in_batches
    all_stats.append(batch_processor.process_one_batch(model, batch_edges))
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/batching.py", line 165, in process_one_batch
    return self._process_one_batch(model, batch_edges)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_cpu.py", line 90, in _process_one_batch
    scores, reg = model(batch_edges)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/model.py", line 691, in forward
    edges.get_relation_type(),
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/edgelist.py", line 117, in get_relation_type
    return self.get_relation_type_as_scalar()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/edgelist.py", line 108, in get_relation_type_as_scalar
    return int(self.rel)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "/home/wumeng/workspace/graph_embedding/src/pbg_test.py", line 107, in <module>
    main()
  File "/home/wumeng/workspace/graph_embedding/src/pbg_test.py", line 101, in main
    train(config, subprocess_init=subprocess_init)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train.py", line 42, in train
    coordinator.train()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_cpu.py", line 667, in train
    stats = self._coordinate_train(edges, eval_edge_idxs, epoch_idx)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 624, in _coordinate_train
    gpu_idx, result = self.gpu_pool.wait_for_next()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 365, in wait_for_next
    res = p.master_endpoint.recv()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Message from syslogd@localhost at Mar 22 18:50:38 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#212 stuck for 23s! [cuda-EvtHandlr:53047]

Message from syslogd@localhost at Mar 22 18:51:06 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [cuda-EvtHandlr:53047]

Message from syslogd@localhost at Mar 22 18:51:34 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [cuda-EvtHandlr:53047]

Message from syslogd@localhost at Mar 22 18:52:06 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [cuda-EvtHandlr:53047]

Message from syslogd@localhost at Mar 22 18:53:02 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#67 stuck for 23s! [cuda-EvtHandlr:53561]

Message from syslogd@localhost at Mar 22 18:53:30 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#67 stuck for 22s! [cuda-EvtHandlr:53561]

Message from syslogd@localhost at Mar 22 18:54:02 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#67 stuck for 23s! [cuda-EvtHandlr:53561]

Message from syslogd@localhost at Mar 22 18:54:54 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [cuda-EvtHandlr:53560]

Message from syslogd@localhost at Mar 22 18:55:22 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [cuda-EvtHandlr:53560]

Message from syslogd@localhost at Mar 22 18:55:54 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [cuda-EvtHandlr:53560]

Message from syslogd@localhost at Mar 22 18:56:42 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#80 stuck for 23s! [cuda-EvtHandlr:53560]

Message from syslogd@localhost at Mar 22 18:57:58 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#72 stuck for 22s! [cuda-EvtHandlr:53303]

Message from syslogd@localhost at Mar 22 18:58:26 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#72 stuck for 22s! [cuda-EvtHandlr:53303]

Message from syslogd@localhost at Mar 22 18:58:58 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#72 stuck for 22s! [cuda-EvtHandlr:53303]

Process finished with exit code 1
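
For context, `Assertion srcIndex < srcSelectDimSize failed` in `indexSelectLargeIndex` generally means some index passed to an embedding/`index_select` lookup is outside the table's range; the `RuntimeError: CUDA error: device-side assert triggered` that follows is just the generic surface of that asynchronous kernel assert. A minimal sketch (plain PyTorch, not PBG-specific) that reproduces the same failure mode:

```python
import torch

# 10-row embedding table; any index >= 10 is out of range.
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
idx = torch.tensor([3, 7, 12])  # 12 is invalid

# On CPU the same lookup raises a clear IndexError ("index out of range in self").
try:
    emb(idx)
except IndexError as e:
    print("CPU:", e)

# On GPU it trips the device-side assert seen in the log
# (indexSelectLargeIndex: srcIndex < srcSelectDimSize), and Python only
# sees "RuntimeError: CUDA error: device-side assert triggered" later.
if torch.cuda.is_available():
    out = emb.cuda()(idx.cuda())  # assert fires inside the kernel
    torch.cuda.synchronize()      # error surfaces here (or at the next sync)
```

Re-running with `CUDA_LAUNCH_BLOCKING=1`, or once on CPU, usually points at the exact offending lookup instead of the generic device-side assert.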

@Geongyu

Geongyu commented May 16, 2022

I have the same error. Has anyone found a fix?
[screenshot attached]

@avinabsaha

@Geongyu did you figure this out?
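
In case it helps whoever hits this next: since the failing frame is the relation-type / embedding lookup, a likely cause is an entity or relation index in the prepared edge files that is not below the configured counts (e.g. the entity counts or `relations` list in the config not matching the partitioned data). A rough, hypothetical sanity check, assuming the usual HDF5 layout with `lhs`/`rhs`/`rel` datasets (verify the actual file names, dataset keys, and counts against your own edge and entity paths):

```python
import h5py

# Hypothetical paths/values -- substitute your own edge file and the counts
# from your entity_count_*.txt files and config.relations.
EDGE_FILE = "pbg_test_train_partitioned/edges_0_0.h5"
NUM_LHS_ENTITIES = 1_000_000
NUM_RHS_ENTITIES = 1_000_000
NUM_RELATIONS = 5

with h5py.File(EDGE_FILE, "r") as f:
    lhs, rhs, rel = f["lhs"][:], f["rhs"][:], f["rel"][:]

print("max lhs:", lhs.max(), "limit:", NUM_LHS_ENTITIES)
print("max rhs:", rhs.max(), "limit:", NUM_RHS_ENTITIES)
print("max rel:", rel.max(), "limit:", NUM_RELATIONS)

# Every index must be strictly below its table size, otherwise the GPU
# lookup trips exactly this kind of device-side assert.
assert lhs.max() < NUM_LHS_ENTITIES
assert rhs.max() < NUM_RHS_ENTITIES
assert rel.max() < NUM_RELATIONS
```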
