
!sh experiments/COCOA/pcnet_m/train.sh # you may have to set --nproc_per_node=#YOUR_GPUS; I have modified nproc_per_node to 1. Thank you #45
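
For reference, a minimal sketch (not from the repository; the helper name is illustrative) of how one might pick the --nproc_per_node value from the GPUs that are actually visible, since the launcher expects one worker process per device:

```python
# Illustrative only: suggest an --nproc_per_node value for train.sh.
# Assumes nothing beyond an installed torch build; not part of the deocclusion code.
import torch

def suggested_nproc_per_node() -> int:
    """One launcher worker per visible CUDA device; fall back to 1 on CPU-only machines."""
    return max(torch.cuda.device_count(), 1)

if __name__ == "__main__":
    n = suggested_nproc_per_node()
    print(f"{torch.cuda.device_count()} CUDA device(s) visible")
    print(f"suggested launcher flag: --nproc_per_node={n}")
```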

Open
Lan2Kailee opened this issue Oct 1, 2021 · 0 comments


/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
"The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : main.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 8
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_muppckot/none_9kg5iq21
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/7/error.json
Each of the worker processes failed with the same traceback (the copies were interleaved in the console output):

Traceback (most recent call last):
  File "main.py", line 48, in <module>
    main(args)
  File "main.py", line 29, in main
    trainer = Trainer(args)
  File "/content/drive/MyDrive/deocclusion/trainer.py", line 61, in __init__
    args.model, load_pretrain=args.load_pretrain, dist_model=True)
  File "/content/drive/MyDrive/deocclusion/models/partial_completion_mask.py", line 16, in __init__
    super(PartialCompletionMask, self).__init__(params, dist_model)
  File "/content/drive/MyDrive/deocclusion/models/single_stage_model.py", line 16, in __init__
    self.model = utils.DistModule(self.model)
  File "/content/drive/MyDrive/deocclusion/utils/distributed_utils.py", line 16, in __init__
    broadcast_params(self.module)
  File "/content/drive/MyDrive/deocclusion/utils/distributed_utils.py", line 32, in broadcast_params
    dist.broadcast(p, 0)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 6 (pid: 1211) of binary: /usr/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/7/error.json
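
The ncclInvalidUsage failure above typically shows up when more worker processes are launched than there are visible GPUs, so several ranks end up sharing one device. A minimal per-worker sanity check, written in the torch.distributed.run style that the deprecation warning recommends (the init_worker helper is hypothetical and not from the deocclusion code base), could look like this:

```python
# Hypothetical per-worker check before any NCCL collective such as dist.broadcast.
# Assumes a single-node run launched by torch.distributed.run/launch, which exports
# LOCAL_RANK and WORLD_SIZE (plus MASTER_ADDR/MASTER_PORT) for every worker.
import os
import torch
import torch.distributed as dist

def init_worker() -> int:
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])  # equals nproc_per_node on a single node

    n_gpus = torch.cuda.device_count()
    if world_size > n_gpus:
        raise RuntimeError(
            f"Launched {world_size} processes but only {n_gpus} GPU(s) are visible; "
            "reduce --nproc_per_node to avoid NCCL 'invalid usage' errors."
        )

    # Give each rank its own device before creating the NCCL process group.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank
```

With the nproc_per_node: 8 shown in the log on a machine with fewer than eight GPUs (for example a single-GPU Colab runtime, which the /content/drive paths suggest), a check like this would fail fast instead of crashing inside dist.broadcast.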
