
ValueError: The global train batch size (3 x 1) must be evenly divisible by the number of generations per prompt (8). #236

Closed
hellen9527 opened this issue Feb 8, 2025 · 4 comments


@hellen9527

hellen9527 commented Feb 8, 2025

I am using the latest training code, and both transformers and trl are the latest versions from the main branch.

I only have 4 L20 GPUs and want to try GRPO training on the qwen-1.5b model, but I ran into an error: 8 completions are sampled per prompt, and training cannot proceed with 3 GPUs. The issue is that with 4 GPUs, one must be reserved for vLLM sampling and inference, so only 3 GPUs are actually used for training. Similarly, with 8 GPUs only 7 are actually used for training. Does this mean I need to set num_generations=7? Isn't the default of 8 unreasonable, given that an 8-GPU machine cannot actually train on all 8 GPUs? Also, what if I need 64 GPUs? How can I set up multi-machine training?

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 240, in <module>
[rank1]:     main(script_args, training_args, model_args)
[rank1]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 175, in main
[rank1]:     trainer = GRPOTrainer(
[rank1]:   File "/home/jovyan/.local/lib/python3.10/site-packages/trl-0.15.0.dev0-py3.10.egg/trl/trainer/grpo_trainer.py", line 320, in __init__
[rank1]:     raise ValueError(
[rank1]: ValueError: The global train batch size (3 x 1) must be evenly divisible by the number of generations per prompt (8). Given the current train batch size, the valid values for the number of generations are: [3].
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 240, in <module>
[rank2]:     main(script_args, training_args, model_args)
[rank2]:   File "/home/rainightli/PythonProjects/open-r1/src/open_r1/grpo.py", line 175, in main
[rank2]:     trainer = GRPOTrainer(
[rank2]:   File "/home/jovyan/.local/lib/python3.10/site-packages/trl-0.15.0.dev0-py3.10.egg/trl/trainer/grpo_trainer.py", line 320, in __init__
[rank2]:     raise ValueError(
[rank2]: ValueError: The global train batch size (3 x 1) must be evenly divisible by the number of generations per prompt (8). Given the current train batch size, the valid values for the number of generations are: [3].
@caijianwei1996

Set the num_generations parameter in GRPOConfig; the default is 8, change it to 3.
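Concretely, a minimal sketch of where the setting lives (this is not a full training script, and the output_dir path is hypothetical):

```python
from trl import GRPOConfig

# With 4 GPUs and one reserved for vLLM, 3 ranks train, so the global
# batch is 3 x per_device_train_batch_size = 3, and num_generations
# must divide it evenly.
training_args = GRPOConfig(
    output_dir="qwen-1.5b-grpo",    # hypothetical path
    per_device_train_batch_size=1,
    num_generations=3,              # default is 8
)
```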

@hellen9527

Set the num_generations parameter in GRPOConfig; the default is 8, change it to 3.

Thank you, setting it to 3 works. Does this mean the size of each group is 3? If I set it to a larger number, will it converge faster? And why can't it be a multiple of 3, like 6 or 9? That would only make the model generate a few more completions per step, which takes longer. Why does the constraint limit it to factors rather than multiples?

@jiaxiang-wu

It is related to a recent PR in the trl library.

This PR introduces a more flexible approach:
Instead of defining per_device_batch_size as the number of prompts per device, it now represents the number of generations per device.
This allows for much greater flexibility in choosing the number of generations (G) and the batch size per device.
The only constraint is that the global batch size (num_processes * per_device_batch_size) must be divisible by G.

Ref:
huggingface/trl#2776
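To make the constraint concrete, here is a small sketch (the function name is my own, not trl's) that lists the valid values of G for a given setup; trl's error message reports the divisors of the global batch size that are at least 2:

```python
def valid_num_generations(num_processes: int, per_device_batch_size: int) -> list:
    """Divisors (>= 2) of the global batch size: the values G may take."""
    global_batch = num_processes * per_device_batch_size
    return [g for g in range(2, global_batch + 1) if global_batch % g == 0]

# 3 training ranks, 1 sample per device: only G=3 is valid,
# matching the error in this issue.
print(valid_num_generations(3, 1))   # -> [3]

# 7 training ranks (8 GPUs, one reserved for vLLM), batch 2 per device:
print(valid_num_generations(7, 2))   # -> [2, 7, 14]
```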

@hellen9527

It is related to a recent PR in the trl library.

This PR introduces a more flexible approach:
Instead of defining per_device_batch_size as the number of prompts per device, it now represents the number of generations per device.
This allows for much greater flexibility in choosing the number of generations (G) and the batch size per device.
The only constraint is that the global batch size (num_processes * per_device_batch_size) must be divisible by G.

Ref: huggingface/trl#2776

Thank you, I understand now. The reason I could only set it to 3 before was that I had set per_device_batch_size=1, and only 3 GPUs were available for training. So when I increase per_device_batch_size, I can increase num_generations accordingly.
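The arithmetic behind this conclusion, as a quick check (the target of 6 generations is just an example):

```python
num_training_gpus = 3  # 4 GPUs total, one reserved for vLLM

# Which per-device batch sizes allow num_generations=6?
for per_device_batch in (1, 2, 4):
    global_batch = num_training_gpus * per_device_batch
    print(per_device_batch, global_batch, global_batch % 6 == 0)
# Raising the per-device batch size to 2 makes the global batch 6,
# so num_generations=6 becomes valid.
```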
