Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openmpi launchers for deepseed should not set unsupported --num_gpus #3319

Open
chiragjn opened this issue Jan 2, 2025 · 0 comments
Open

Comments

@chiragjn
Copy link

chiragjn commented Jan 2, 2025

To reproduce:

accelerate launch --mixed_precision bf16 --use_deepspeed --deepspeed_multinode_launcher openmpi --deepspeed_hostfile /etc/mpi/hostfile --same_network --monitor_interval "30" --num_machines "2" --num_processes "2" --main_process_port 29500 --machine_rank "-1" train.py

Which translates to

deepspeed --no_local_rank --hostfile /etc/mpi/hostfile --launcher openmpi --master_port 29500 --num_gpus 1 train.py --deepspeed ./deepspeed_configs/5_ds_z3_config.json --model_name_or_path Qwen/Qwen2.5-0.5B --dataset_name trl-lib/Capybara --learning_rate 2.0e-5 --num_train_epochs 1 --packing --per_device_train_batch_size 2 --gradient_accumulation_steps 8 --gradient_checkpointing --logging_steps 25 --eval_strategy steps --eval_steps 100 --output_dir /mnt/shared/Qwen2-0.5B-SFT --use_liger

Which will lead to

openmpi backend does not support limiting num nodes/gpus

Deepspeed error reference:

https://github.com/microsoft/DeepSpeed/blob/3573858e7ce2c723b8c43231c6c6b0cf97dca2fc/deepspeed/launcher/multinode_runner.py#L141C44-L141C92


Accelerate should skip setting num gpus for openmpi based launchers in the else clause here:

if num_machines > 1 and args.deepspeed_multinode_launcher != DEEPSPEED_MULTINODE_LAUNCHERS[1]:
cmd = ["deepspeed", "--no_local_rank"]
cmd.extend(["--hostfile", str(args.deepspeed_hostfile), "--launcher", str(args.deepspeed_multinode_launcher)])
if args.deepspeed_exclusion_filter is not None:
cmd.extend(
[
"--exclude",
str(args.deepspeed_exclusion_filter),
]
)
elif args.deepspeed_inclusion_filter is not None:
cmd.extend(
[
"--include",
str(args.deepspeed_inclusion_filter),
]
)
else:
cmd.extend(["--num_gpus", str(args.num_processes // args.num_machines)])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant