/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
"The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
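As a side note on the --use_env warning above: the launcher exports LOCAL_RANK to every worker process, so the script can read it from the environment instead of taking a --local_rank argument. main.py itself is not shown in this log, so the snippet below is only a minimal sketch of what the warning asks for, not the repo's actual code.

    import os
    import torch

    # torch.distributed.launch/run set LOCAL_RANK for each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    # Bind this process to its own GPU before any collective calls.
    torch.cuda.set_device(local_rank)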
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : main.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 8
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
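For reference, the deprecation notice suggests launching through torch.distributed.run. The original command line is not shown in this log, but based on the config printed above, an equivalent invocation would presumably look like the following, with the script's own arguments appended:

    python3 -m torch.distributed.run --nproc_per_node=8 main.py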
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_muppckot/none_9kg5iq21
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_0/7/error.json
The worker processes then crash with the same traceback, which the raw output interleaves; deduplicated, it reads:

Traceback (most recent call last):
  File "main.py", line 48, in <module>
    main(args)
  File "main.py", line 29, in main
    trainer = Trainer(args)
  File "/content/drive/MyDrive/deocclusion/trainer.py", line 61, in __init__
    args.model, load_pretrain=args.load_pretrain, dist_model=True)
  File "/content/drive/MyDrive/deocclusion/models/partial_completion_mask.py", line 16, in __init__
    super(PartialCompletionMask, self).__init__(params, dist_model)
  File "/content/drive/MyDrive/deocclusion/models/single_stage_model.py", line 16, in __init__
    self.model = utils.DistModule(self.model)
  File "/content/drive/MyDrive/deocclusion/utils/distributed_utils.py", line 16, in __init__
    broadcast_params(self.module)
  File "/content/drive/MyDrive/deocclusion/utils/distributed_utils.py", line 32, in broadcast_params
    dist.broadcast(p, 0)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
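Judging from the frames above, broadcast_params in utils/distributed_utils.py presumably just loops over the module's parameters and broadcasts each one from rank 0; the NCCL failure is raised inside that dist.broadcast call. A minimal sketch of that pattern, which may differ from the repo's actual implementation:

    import torch.distributed as dist

    def broadcast_params(module):
        # Broadcast every parameter from rank 0 so all workers start from
        # identical weights; this is the dist.broadcast(p, 0) call that fails above.
        for p in module.parameters():
            dist.broadcast(p, 0)

The broadcast loop itself is a standard pattern; since NCCL generally requires each rank in the group to use a distinct CUDA device, the invalid-usage error points at how the process group or devices are set up rather than at this loop.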
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 6 (pid: 1211) of binary: /usr/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_muppckot/none_9kg5iq21/attempt_1/7/error.json
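Not shown in the log: the launch config above requests nproc_per_node: 8, while a Colab runtime typically exposes a single GPU. If that is the case here, eight ranks would be mapped onto one device, which is a common way to hit ncclInvalidUsage; this is an assumption rather than something the log confirms. A quick pre-launch check:

    import torch

    # nproc_per_node should not exceed the number of visible GPUs on this node.
    print(torch.cuda.device_count())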