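Hello author,
I reproduced the CIFAR100-10-10 setting by launching your code with a different distributed training method, and the results are very close, although on two V100 GPUs each epoch takes about 51 s. When I launch train.sh with the command given on GitHub, however, I run into the following problem:
I tried to email you, but the message was returned with "address rejected". Please take a look whenever you have time; your research has been very helpful to me.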
root@cd98f4d76410:/code/DKT_source# bash train.sh 0,1 --options /code/DKT_source/options/data/cifar100_10-10.yaml /code/DKT_source/options/data/cifar100_order1.yaml /code/DKT_source/options/model/cifar_DKT.yaml --name DKT --data-path /data/Logic888/CIFAR100/cifar100 --output-basedir /data/Logic888/CIFAR100/DKT/save_checkpoints --memory-size 2000
Launching exp on 0,1...
/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/code/DKT_source/continual/robust_models_ImageNet.py:391: UserWarning: Overwriting rvt_tiny in registry with continual.robust_models_ImageNet.rvt_tiny. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_tiny(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:411: UserWarning: Overwriting rvt_tiny_plus in registry with continual.robust_models_ImageNet.rvt_tiny_plus. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_tiny_plus(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:433: UserWarning: Overwriting rvt_small in registry with continual.robust_models_ImageNet.rvt_small. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_small(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:453: UserWarning: Overwriting rvt_small_plus in registry with continual.robust_models_ImageNet.rvt_small_plus. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_small_plus(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:475: UserWarning: Overwriting rvt_base in registry with continual.robust_models_ImageNet.rvt_base. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_base(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:495: UserWarning: Overwriting rvt_base_plus in registry with continual.robust_models_ImageNet.rvt_base_plus. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_base_plus(pretrained, **kwargs):
usage: DKT training and evaluation script [-h] [--batch-size BATCH_SIZE] [--incremental-batch-size INCREMENTAL_BATCH_SIZE] [--epochs EPOCHS]
[--base-epochs BASE_EPOCHS] [--no-amp] [--model MODEL] [--input-size INPUT_SIZE] [--patch-size PATCH_SIZE]
[--embed-dim EMBED_DIM] [--depth DEPTH] [--num-heads NUM_HEADS] [--drop PCT] [--drop-path PCT]
[--norm {layer,scale}] [--opt OPTIMIZER] [--opt-eps EPSILON] [--opt-betas BETA [BETA ...]] [--clip-grad NORM]
[--momentum M] [--weight-decay WEIGHT_DECAY] [--sched SCHEDULER] [--lr LR] [--incremental-lr INCREMENTAL_LR]
[--lr-noise pct, pct [pct, pct ...]] [--lr-noise-pct PERCENT] [--lr-noise-std STDDEV] [--warmup-lr LR]
[--incremental-warmup-lr LR] [--min-lr LR] [--decay-epochs N] [--warmup-epochs N] [--cooldown-epochs N]
[--patience-epochs N] [--decay-rate RATE] [--color-jitter PCT] [--aa NAME] [--smoothing SMOOTHING]
[--train-interpolation TRAIN_INTERPOLATION] [--repeated-aug] [--no-repeated-aug] [--reprob PCT]
[--remode REMODE] [--recount RECOUNT] [--resplit] [--auto-kd] [--kd KD] [--distillation-tau DISTILLATION_TAU]
[--resnet] [--data-path DATA_PATH] [--data-set {CIFAR,IMNET,INAT,INAT19}]
[--data-path-subTrain DATA_PATH_SUBTRAIN] [--data-path-subVal DATA_PATH_SUBVAL]
[--inat-category {kingdom,phylum,class,order,supercategory,family,genus,name}] [--output-dir OUTPUT_DIR]
[--output-basedir OUTPUT_BASEDIR] [--device DEVICE] [--seed SEED] [--start_epoch N] [--eval] [--dist-eval]
[--num_workers NUM_WORKERS] [--pin-mem] [--no-pin-mem] [--initial-increment INITIAL_INCREMENT]
[--increment INCREMENT] [--class-order CLASS_ORDER [CLASS_ORDER ...]] [--eval-every EVAL_EVERY] [--debug]
[--retrain-scratch] [--max-task MAX_TASK] [--name NAME] [--options [OPTIONS ...]] [--DKT]
[--duplex-clf DUPLEX_CLF] [--memory-size MEMORY_SIZE] [--distributed-memory] [--global-memory]
[--oversample-memory OVERSAMPLE_MEMORY] [--oversample-memory-ft OVERSAMPLE_MEMORY_FT] [--rehearsal-test-trsf]
[--rehearsal-modes REHEARSAL_MODES] [--fixed-memory]
[--rehearsal {random,closest_token,closest_all,icarl_token,icarl_all,furthest_token,furthest_all}]
[--sep-memory] [--replay-memory REPLAY_MEMORY] [--finetuning {balanced}] [--finetuning-mode FINETUNING_MODE]
[--finetuning-lr FINETUNING_LR] [--finetuning-teacher] [--finetuning-resetclf] [--only-ft] [--ft-no-sampling]
[--freeze-task [FREEZE_TASK ...]] [--freeze-ft [FREEZE_FT ...]] [--freeze-eval] [--log-path LOG_PATH]
[--log-category LOG_CATEGORY] [--bce-loss] [--local_rank LOCAL_RANK] [--world_size WORLD_SIZE]
[--dist_url DIST_URL] [--resume RESUME] [--start-task START_TASK] [--start-epoch START_EPOCH]
[--save-every-epoch SAVE_EVERY_EPOCH] [--validation VALIDATION]
DKT training and evaluation script: error: unrecognized arguments: --local-rank=1
/code/DKT_source/continual/robust_models_ImageNet.py:391: UserWarning: Overwriting rvt_tiny in registry with continual.robust_models_ImageNet.rvt_tiny. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_tiny(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:411: UserWarning: Overwriting rvt_tiny_plus in registry with continual.robust_models_ImageNet.rvt_tiny_plus. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_tiny_plus(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:433: UserWarning: Overwriting rvt_small in registry with continual.robust_models_ImageNet.rvt_small. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_small(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:453: UserWarning: Overwriting rvt_small_plus in registry with continual.robust_models_ImageNet.rvt_small_plus. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_small_plus(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:475: UserWarning: Overwriting rvt_base in registry with continual.robust_models_ImageNet.rvt_base. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_base(pretrained, **kwargs):
/code/DKT_source/continual/robust_models_ImageNet.py:495: UserWarning: Overwriting rvt_base_plus in registry with continual.robust_models_ImageNet.rvt_base_plus. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
def rvt_base_plus(pretrained, **kwargs):
usage: DKT training and evaluation script [-h] [--batch-size BATCH_SIZE] [--incremental-batch-size INCREMENTAL_BATCH_SIZE] [--epochs EPOCHS]
[--base-epochs BASE_EPOCHS] [--no-amp] [--model MODEL] [--input-size INPUT_SIZE] [--patch-size PATCH_SIZE]
[--embed-dim EMBED_DIM] [--depth DEPTH] [--num-heads NUM_HEADS] [--drop PCT] [--drop-path PCT]
[--norm {layer,scale}] [--opt OPTIMIZER] [--opt-eps EPSILON] [--opt-betas BETA [BETA ...]] [--clip-grad NORM]
[--momentum M] [--weight-decay WEIGHT_DECAY] [--sched SCHEDULER] [--lr LR] [--incremental-lr INCREMENTAL_LR]
[--lr-noise pct, pct [pct, pct ...]] [--lr-noise-pct PERCENT] [--lr-noise-std STDDEV] [--warmup-lr LR]
[--incremental-warmup-lr LR] [--min-lr LR] [--decay-epochs N] [--warmup-epochs N] [--cooldown-epochs N]
[--patience-epochs N] [--decay-rate RATE] [--color-jitter PCT] [--aa NAME] [--smoothing SMOOTHING]
[--train-interpolation TRAIN_INTERPOLATION] [--repeated-aug] [--no-repeated-aug] [--reprob PCT]
[--remode REMODE] [--recount RECOUNT] [--resplit] [--auto-kd] [--kd KD] [--distillation-tau DISTILLATION_TAU]
[--resnet] [--data-path DATA_PATH] [--data-set {CIFAR,IMNET,INAT,INAT19}]
[--data-path-subTrain DATA_PATH_SUBTRAIN] [--data-path-subVal DATA_PATH_SUBVAL]
[--inat-category {kingdom,phylum,class,order,supercategory,family,genus,name}] [--output-dir OUTPUT_DIR]
[--output-basedir OUTPUT_BASEDIR] [--device DEVICE] [--seed SEED] [--start_epoch N] [--eval] [--dist-eval]
[--num_workers NUM_WORKERS] [--pin-mem] [--no-pin-mem] [--initial-increment INITIAL_INCREMENT]
[--increment INCREMENT] [--class-order CLASS_ORDER [CLASS_ORDER ...]] [--eval-every EVAL_EVERY] [--debug]
[--retrain-scratch] [--max-task MAX_TASK] [--name NAME] [--options [OPTIONS ...]] [--DKT]
[--duplex-clf DUPLEX_CLF] [--memory-size MEMORY_SIZE] [--distributed-memory] [--global-memory]
[--oversample-memory OVERSAMPLE_MEMORY] [--oversample-memory-ft OVERSAMPLE_MEMORY_FT] [--rehearsal-test-trsf]
[--rehearsal-modes REHEARSAL_MODES] [--fixed-memory]
[--rehearsal {random,closest_token,closest_all,icarl_token,icarl_all,furthest_token,furthest_all}]
[--sep-memory] [--replay-memory REPLAY_MEMORY] [--finetuning {balanced}] [--finetuning-mode FINETUNING_MODE]
[--finetuning-lr FINETUNING_LR] [--finetuning-teacher] [--finetuning-resetclf] [--only-ft] [--ft-no-sampling]
[--freeze-task [FREEZE_TASK ...]] [--freeze-ft [FREEZE_FT ...]] [--freeze-eval] [--log-path LOG_PATH]
[--log-category LOG_CATEGORY] [--bce-loss] [--local_rank LOCAL_RANK] [--world_size WORLD_SIZE]
[--dist_url DIST_URL] [--resume RESUME] [--start-task START_TASK] [--start-epoch START_EPOCH]
[--save-every-epoch SAVE_EVERY_EPOCH] [--validation VALIDATION]
DKT training and evaluation script: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 633) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/code/DKT_source/main.py FAILED
Failures:
[1]:
time : 2023-12-05_07:36:20
host : cd98f4d76410
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 634)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-12-05_07:36:20
host : cd98f4d76410
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 633)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
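The log above shows the launcher passing `--local-rank=<n>` while the usage text lists only `[--local_rank LOCAL_RANK]`, which is why argparse rejects the argument. A possible workaround is sketched below, assuming the parser in main.py can be edited; the actual parser code is not shown in this issue, so `build_parser` and `parse_local_rank` are hypothetical names used only for illustration. The sketch accepts both spellings and falls back to the `LOCAL_RANK` environment variable, as the FutureWarning suggests.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real parser in main.py (not shown in the issue).
    parser = argparse.ArgumentParser("DKT training and evaluation script")
    # Register both spellings so either launcher convention is accepted.
    # If neither flag is passed, fall back to the LOCAL_RANK environment
    # variable, and to 0 when that is unset as well.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


def parse_local_rank(argv):
    # parse_known_args ignores the other training flags in this reduced sketch.
    args, _unknown = build_parser().parse_known_args(argv)
    return args.local_rank
```

With this change both `--local-rank=1` and `--local_rank=1` would parse to the same `local_rank` value. Alternatively, launching with `torchrun` and reading only `os.environ['LOCAL_RANK']` would avoid the command-line flag altogether, though that also requires editing train.sh.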