I'm getting gradient overflow and a skipped step every two or so steps while training the 13B LLaMA model on 7 A100s with a context window of 512. The command line I'm running is below. When I tried ZeRO stage 3, or tried to remove gradient accumulation steps, the GPUs ran out of memory while loading the model at the start of training. Any suggestions on how to get rid of the gradient overflow issue, or how to partition the model and load parts of it onto multiple GPUs at the start of training? I would be super grateful for help!
deepspeed --num_gpus=7 run_clm.py --deepspeed ds_config_stage2.json --model_name_or_path decapoda-research/llama-13b-hf --train_file train.csv --validation_file validation.csv --do_train --do_eval --bf16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned --num_train_epochs 1 --eval_steps 400 --gradient_accumulation_steps 3 --per_device_train_batch_size 2 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 1 --save_steps 400 --save_strategy steps --load_best_model_at_end=True --block_size=512 --report_to=wandb
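For reference, this is roughly the kind of ZeRO stage 3 config I would expect to need in place of `ds_config_stage2.json` to partition parameters across GPUs and offload to CPU. It's a sketch based on the DeepSpeed/Transformers docs, not the exact file I ran; the `auto` values assume the Hugging Face Trainer integration fills them in from the command-line arguments:

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto"
}
```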