Starting to train...
2022-03-22 18:49:55,234 [Trainer-0] Loading entity counts...
2022-03-22 18:49:55,235 [Trainer-0] Creating workers...
2022-03-22 18:49:55,305 [Trainer-0] Initializing global model...
2022-03-22 18:49:55,329 [Trainer-0] Creating GPU workers...
2022-03-22 18:49:55,368 [Trainer-0] Starting epoch 1 / 3, edge path 1 / 1, edge chunk 1 / 1
2022-03-22 18:49:55,368 [Trainer-0] Edge path: BigGraph/pbg_test/2/pbg_test_train_partitioned
2022-03-22 18:49:55,369 [Trainer-0] still in queue: 0
2022-03-22 18:49:55,369 [Trainer-0] Swapping partitioned embeddings None ( 0 , 0 )
2022-03-22 18:49:55,369 [Trainer-0] Loading partitioned embeddings from checkpoint
2022-03-22 18:50:02,021 [GPU #3] GPU subprocess 3 up and running
2022-03-22 18:50:02,575 [GPU #0] GPU subprocess 0 up and running
2022-03-22 18:50:02,781 [GPU #2] GPU subprocess 2 up and running
2022-03-22 18:50:02,801 [GPU #1] GPU subprocess 1 up and running
........
/opt/conda/conda-bld/pytorch_1614378098133/work/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [118,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Process GPU #3:
Traceback (most recent call last):
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 181, in run
    lr=job.lr,
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 288, in do_one_job
    batch_size=batch_size, model=model, batch_processor=trainer, edges=gpu_edges
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/batching.py", line 122, in process_in_batches
    all_stats.append(batch_processor.process_one_batch(model, batch_edges))
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/batching.py", line 165, in process_one_batch
    return self._process_one_batch(model, batch_edges)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_cpu.py", line 90, in _process_one_batch
    scores, reg = model(batch_edges)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/model.py", line 691, in forward
    edges.get_relation_type(),
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/edgelist.py", line 117, in get_relation_type
    return self.get_relation_type_as_scalar()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/edgelist.py", line 108, in get_relation_type_as_scalar
    return int(self.rel)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "/home/wumeng/workspace/graph_embedding/src/pbg_test.py", line 107, in <module>
    main()
  File "/home/wumeng/workspace/graph_embedding/src/pbg_test.py", line 101, in main
    train(config, subprocess_init=subprocess_init)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train.py", line 42, in train
    coordinator.train()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_cpu.py", line 667, in train
    stats = self._coordinate_train(edges, eval_edge_idxs, epoch_idx)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 624, in _coordinate_train
    gpu_idx, result = self.gpu_pool.wait_for_next()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/site-packages/torchbiggraph/train_gpu.py", line 365, in wait_for_next
    res = p.master_endpoint.recv()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/miniconda3/envs/torch_wf/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Message from syslogd@localhost at Mar 22 18:50:38 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#212 stuck for 23s! [cuda-EvtHandlr:53047]
Message from syslogd@localhost at Mar 22 18:51:06 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [cuda-EvtHandlr:53047]
Message from syslogd@localhost at Mar 22 18:51:34 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [cuda-EvtHandlr:53047]
Message from syslogd@localhost at Mar 22 18:52:06 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [cuda-EvtHandlr:53047]
Message from syslogd@localhost at Mar 22 18:53:02 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#67 stuck for 23s! [cuda-EvtHandlr:53561]
Message from syslogd@localhost at Mar 22 18:53:30 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#67 stuck for 22s! [cuda-EvtHandlr:53561]
Message from syslogd@localhost at Mar 22 18:54:02 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#67 stuck for 23s! [cuda-EvtHandlr:53561]
Message from syslogd@localhost at Mar 22 18:54:54 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [cuda-EvtHandlr:53560]
Message from syslogd@localhost at Mar 22 18:55:22 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [cuda-EvtHandlr:53560]
Message from syslogd@localhost at Mar 22 18:55:54 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [cuda-EvtHandlr:53560]
Message from syslogd@localhost at Mar 22 18:56:42 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#80 stuck for 23s! [cuda-EvtHandlr:53560]
Message from syslogd@localhost at Mar 22 18:57:58 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#72 stuck for 22s! [cuda-EvtHandlr:53303]
Message from syslogd@localhost at Mar 22 18:58:26 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#72 stuck for 22s! [cuda-EvtHandlr:53303]
Message from syslogd@localhost at Mar 22 18:58:58 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#72 stuck for 22s! [cuda-EvtHandlr:53303]
Process finished with exit code 1
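
For context on the trace above: the `srcIndex < srcSelectDimSize` assert from indexSelectLargeIndex means some index passed to an embedding / index_select lookup on the GPU is greater than or equal to the size of the table being indexed, i.e. an entity or relation index in the edge data is out of range for the counts the model was built with. Below is a minimal sanity-check sketch, not taken from this repo: the edge file name and the HDF5 dataset names ("lhs", "rhs", "rel") are assumptions about the partitioned edge format, and num_entities / num_relations are placeholders to fill in from the actual config.

```python
# Hedged sanity-check sketch: compare the largest indices stored in one
# partitioned edge file against the entity/relation counts from the config.
# File name and dataset names ("lhs", "rhs", "rel") are assumptions;
# num_entities and num_relations are placeholders, not values from this issue.
import h5py

EDGE_FILE = "BigGraph/pbg_test/2/pbg_test_train_partitioned/edges_0_0.h5"  # assumed name
num_entities = 1_000_000   # placeholder: entity count from the config / count files
num_relations = 10         # placeholder: len(config.relations)

with h5py.File(EDGE_FILE, "r") as hf:
    lhs = hf["lhs"][...]
    rhs = hf["rhs"][...]
    rel = hf["rel"][...]

# Every index must be strictly smaller than the corresponding table size,
# otherwise the GPU-side index_select trips `srcIndex < srcSelectDimSize`.
print("max lhs:", int(lhs.max()), "ok" if lhs.max() < num_entities else "OUT OF RANGE")
print("max rhs:", int(rhs.max()), "ok" if rhs.max() < num_entities else "OUT OF RANGE")
print("max rel:", int(rel.max()), "ok" if rel.max() < num_relations else "OUT OF RANGE")
```

Running the training script with CUDA_LAUNCH_BLOCKING=1 also makes the device-side assert surface at the launch that actually failed, rather than at a later unrelated call such as int(self.rel).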