
ValueError: The global train batch size (3 x 1) must be evenly divisible by the number of generations per prompt (8). #236

Closed
hellen9527 opened this issue Feb 8, 2025 · 4 comments


@hellen9527

hellen9527 commented Feb 8, 2025

I am using the latest training code, and both transformers and trl are the latest versions from the main branch.

I only have 4 L20 GPUs and want to try GRPO training on the qwen-1.5b model, but I ran into an error: 8 completions are sampled per prompt, and training cannot proceed with 3 GPUs. The issue is that with 4 GPUs, one must be reserved for vLLM sampling and inference, so only 3 GPUs are actually used for training. Similarly, with 8 GPUs only 7 are actually used for training. Does this mean I need to set num_generations=7? Isn't the default of 8 unreasonable, given that an 8-GPU machine cannot actually train on all 8 GPUs? Also, what if I need 64 GPUs? How can I set up multi-machine training?

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 240, in <module>
[rank1]:     main(script_args, training_args, model_args)
[rank1]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 175, in main
[rank1]:     trainer = GRPOTrainer(
[rank1]:   File "/home/jovyan/.local/lib/python3.10/site-packages/trl-0.15.0.dev0-py3.10.egg/trl/trainer/grpo_trainer.py", line 320, in __init__
[rank1]:     raise ValueError(
[rank1]: ValueError: The global train batch size (3 x 1) must be evenly divisible by the number of generations per prompt (8). Given the current train batch size, the valid values for the number of generations are: [3].
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 240, in <module>
[rank2]:     main(script_args, training_args, model_args)
[rank2]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 175, in main
[rank2]:     trainer = GRPOTrainer(
[rank2]:   File "/home/jovyan/.local/lib/python3.10/site-packages/trl-0.15.0.dev0-py3.10.egg/trl/trainer/grpo_trainer.py", line 320, in __init__
[rank2]:     raise ValueError(
[rank2]: ValueError: The global train batch size (3 x 1) must be evenly divisible by the number of generations per prompt (8). Given the current train batch size, the valid values for the number of generations are: [3].
@caijianwei1996

Set the num_generations parameter in GRPOConfig; the default is 8, change it to 3.
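Concretely, a minimal sketch of where the setting lives (this is not a full training script, and the output_dir path is hypothetical):

```python
from trl import GRPOConfig

# With 4 GPUs and one reserved for vLLM, 3 ranks train, so the global
# batch is 3 x per_device_train_batch_size = 3, and num_generations
# must divide it evenly.
training_args = GRPOConfig(
    output_dir="qwen-1.5b-grpo",    # hypothetical path
    per_device_train_batch_size=1,
    num_generations=3,              # default is 8
)
```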

@hellen9527

Set the num_generations parameter in GRPOConfig; the default is 8, change it to 3.

Thank you, setting it to 3 works. Does this mean the size of each group is 3? If I set it to a larger number, will it converge faster? And why can't it be a multiple of 3, like 6 or 9? That would only make the model generate a few more completions per step, which takes longer. Why does the constraint limit it to factors rather than multiples?

@jiaxiang-wu

It is related to a recent PR in the trl library.

This PR introduces a more flexible approach:
Instead of defining per_device_batch_size as the number of prompts per device, it now represents the number of generations per device.
This allows for much greater flexibility in choosing the number of generations (G) and the batch size per device.
The only constraint is that the global batch size (num_processes * per_device_batch_size) must be divisible by G.

Ref:
huggingface/trl#2776
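To make the constraint concrete, here is a small sketch (the function name is my own, not trl's) that lists the valid values of G for a given setup; trl's error message reports the divisors of the global batch size that are at least 2:

```python
def valid_num_generations(num_processes: int, per_device_batch_size: int) -> list:
    """Divisors (>= 2) of the global batch size: the values G may take."""
    global_batch = num_processes * per_device_batch_size
    return [g for g in range(2, global_batch + 1) if global_batch % g == 0]

# 3 training ranks, 1 sample per device: only G=3 is valid,
# matching the error in this issue.
print(valid_num_generations(3, 1))   # -> [3]

# 7 training ranks (8 GPUs, one reserved for vLLM), batch 2 per device:
print(valid_num_generations(7, 2))   # -> [2, 7, 14]
```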

@hellen9527

It is related to a recent PR in the trl library.

This PR introduces a more flexible approach:
Instead of defining per_device_batch_size as the number of prompts per device, it now represents the number of generations per device.
This allows for much greater flexibility in choosing the number of generations (G) and the batch size per device.
The only constraint is that the global batch size (num_processes * per_device_batch_size) must be divisible by G.

Ref: huggingface/trl#2776

Thank you, I understand now. The reason I could only set it to 3 before was that I had set per_device_batch_size=1, and only 3 GPUs were available for training. So when I increase per_device_batch_size, I can increase num_generations accordingly.
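The arithmetic behind this conclusion, as a quick check (the target of 6 generations is just an example):

```python
num_training_gpus = 3  # 4 GPUs total, one reserved for vLLM

# Which per-device batch sizes allow num_generations=6?
for per_device_batch in (1, 2, 4):
    global_batch = num_training_gpus * per_device_batch
    print(per_device_batch, global_batch, global_batch % 6 == 0)
# Raising the per-device batch size to 2 makes the global batch 6,
# so num_generations=6 becomes valid.
```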
