Crashes during finetuning #131

Open
gameveloster opened this issue Jul 4, 2023 · 2 comments
gameveloster commented Jul 4, 2023

I am trying to finetune a 30B model on 2x 3090 using data parallel, and the process always crashes before 3 steps are completed. I run the finetune script in a screen session on a remote computer, and the screen session is gone when I reestablish the SSH connection after the crash.

This is the command I use to start finetuning:

torchrun --nproc_per_node=2 --master_port=1234 finetune.py ./testdocs.txt \
    --ds_type=txt \
    --lora_out_dir=./loras/ \
    --llama_q4_config_dir=./models/Neko-Institute-of-Science_LLaMA-30B-4bit-128g \
    --llama_q4_model=./Neko-Institute-of-Science_LLaMA-30B-4bit-128g/llama-30b-4bit-128g.safetensors \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --groupsize=-1 \
    --xformers \
    --backend=cuda \
    --grad_chckpt

Has anyone else gotten the same crashing problem?


ghost commented Jul 6, 2023

Probably because you ran out of VRAM. Try using a batch size of 1.
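
In terms of the command in the first post, that suggestion amounts to a single flag change, leaving everything else as-is (a sketch only; this assumes --batch_size is the total batch accumulated over micro-batches of --mbatch_size, as in the usual alpaca-lora style scripts):

# before: two sequences accumulated per optimizer step
    --batch_size=2 \
# after: one sequence per optimizer step
    --batch_size=1 \

If the out-of-memory theory holds and that is still not enough, a lower --cutoff_len would also shrink per-step activation memory, since it scales with sequence length.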

johnsmith0031 (Owner) commented

I think you should keep the SSH session open, or run the task in the background.
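
One way to do the latter (a minimal sketch; run_finetune.sh is a hypothetical wrapper script holding the exact torchrun command from the first post) is to detach the job with nohup and write its output to a log file, so the actual error message survives the SSH disconnect:

# launch detached from the terminal; stdout and stderr go to finetune.log
nohup bash run_finetune.sh > finetune.log 2>&1 &

# after reconnecting over SSH, follow the log to see how far training got and what killed it
tail -f finetune.log

tmux behaves like screen in that the session survives a dropped SSH connection; either way, it is the log file on disk that preserves the crash output.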
