I really appreciate you releasing this work.
I have been trying to do something similar with the original Starcoder finetuning code but have had a variety of issues.
Unfortunately, when I run this script on my own dataset (it's only around 6800 MOO verbs), I get a pretty rapid OOM on a machine with 8x A100 80GB cards.
At first I thought it was because I was trying to increase max_seq_size (I was hoping for 1024 tokens), but dropping it back to 512 gave me the same issue.
I then tried reducing the batch size to 1, but that also errored out with insufficient memory.
The only other thing I changed is the prompt, and only in minor ways: mostly swapping the language for my own and picking different columns out of my dataset.
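For context, my rough back-of-envelope estimate of the sharded model-state footprint per GPU is below. This is only an approximation: it assumes StarCoder's ~15.5B parameters and roughly 16 bytes per parameter (bf16 weights and grads plus fp32 Adam state), and it ignores activations entirely.

# Approximate full-shard FSDP model-state memory per GPU (ignores activations)
awk 'BEGIN { params = 15.5e9; bytes_per_param = 16; gpus = 8;
             printf "~%.0f GiB of model state per GPU\n", params * bytes_per_param / gpus / 2^30 }'

That comes out to roughly 29 GiB per GPU, which is why I'm surprised the run OOMs even at batch size 1 and sequence length 512.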
Here is my run.sh:
#! /usr/bin/env bash
set -e          # stop on first error
set -u          # stop if any variable is unbound
set -o pipefail # stop if any command in a pipe fails

LOG_FILE="output.log"
TRANSFORMERS_VERBOSITY=info

get_gpu_count() {
    local gpu_count
    gpu_count=$(nvidia-smi -L | wc -l)
    echo "$gpu_count"
}

gpu_count=$(get_gpu_count)
echo "Number of GPUs: $gpu_count"

train() {
    local script="$1"
    shift 1
    local script_args="$@"
    if [ -z "$script" ] || [ -z "$script_args" ]; then
        echo "Error: Missing arguments. Please provide the script and script_args."
        return 1
    fi
    { torchrun --nproc_per_node="$gpu_count" "$script" $script_args 2>&1; } | tee -a "$LOG_FILE"
}
train train.py \
--model_name_or_path "bigcode/starcoder" \
--data_path ./verbs_augmented/verbs_augmented.jsonl \
--bf16 True \
--output_dir moocoder \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard" \
--fsdp_transformer_layer_cls_to_wrap 'GPTBigCodeBlock' \
--tf32 True
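In case it helps, here is a variant of the launch I'm considering trying next. This is just a sketch I haven't run yet: I'm assuming my installed transformers version accepts --gradient_checkpointing and the FSDP "full_shard auto_wrap" policy.

# Untried sketch: same launch, but trade compute for memory by recomputing
# activations in the backward pass and letting FSDP auto-wrap GPTBigCodeBlock layers.
train train.py \
    --model_name_or_path "bigcode/starcoder" \
    --data_path ./verbs_augmented/verbs_augmented.jsonl \
    --bf16 True \
    --output_dir moocoder \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --gradient_checkpointing True \
    --learning_rate 2e-5 \
    --lr_scheduler_type "cosine" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'GPTBigCodeBlock' \
    --tf32 True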
Any idea what might be going wrong here? Is there any more info I can give you to help figure this out?