I was training Qwen2-VL-2B-Instruct on 8×80 GB GPUs (7 for training and 1 for vLLM). The training dataset is the authors' provided GEOQA_R1V_Train_8K dataset (8,031 samples in total).
I set per_device_train_batch_size=1, gradient_accumulation_steps=4, and num_train_epochs=1. In my understanding, the global train batch size would be 1 * 4 * 7 = 28, so the total number of training steps should be 8031 * 1 / 28 ≈ 286.82. But the training log reports a total of 2007 training steps. Is there something wrong with the training script, or did I misunderstand something?
Hi @SpursGoZmy, I found that the total steps should be computed as 8031 / 4 = 2007.75. Regarding the 7 generations: each card is assigned one generation of the same prompt, which means that during a single forward and backward pass the effective number of prompts is 1, not 7. This is the key detail I discovered when reading the code.
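For anyone else confused by this, here is a minimal sketch of the two calculations, assuming the behavior described above (the 7 training GPUs each process one of the 7 generations of the same prompt); all variable names are illustrative, not taken from the actual code:

```python
# Step-count arithmetic for the setup in this thread (hypothetical variable
# names; the one-generation-per-GPU behavior is taken from the comment above).
dataset_size = 8031
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 1
num_training_gpus = 7   # 8 GPUs total, 1 reserved for vLLM generation
num_generations = 7     # GRPO completions sampled per prompt

# Naive reading: every training GPU holds a *different* prompt.
naive_global_batch = (per_device_train_batch_size
                      * num_training_gpus
                      * gradient_accumulation_steps)
naive_steps = dataset_size * num_train_epochs / naive_global_batch
print(naive_steps)      # ≈ 286.82

# Actual behavior here: the 7 GPUs each hold one of the 7 generations of the
# *same* prompt, so a forward/backward pass consumes a single prompt, and one
# optimizer step consumes gradient_accumulation_steps prompts.
prompts_per_optimizer_step = per_device_train_batch_size * gradient_accumulation_steps
actual_steps = dataset_size * num_train_epochs / prompts_per_optimizer_step
print(actual_steps)     # 2007.75, logged as 2007
```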
This is gold. Thank you very much for your help! I will look into the code to understand it.