Where can we change the ratio for train dataset split? #822

Open
apple-1 opened this issue Dec 12, 2024 · 6 comments
@apple-1

apple-1 commented Dec 12, 2024

Is it possible to change the train split ratio? Right now, from 1400 rows in the train file, I get only 250 rows in the train dataset.

@abhishekkrthakur
Member

You could split the data yourself and upload both the training and validation splits :)
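For anyone who wants to try the manual split suggested above, here is a minimal sketch using only the Python standard library. The function names, the output file names, and the 20% default ratio are all hypothetical choices for illustration, not anything AutoTrain prescribes.

```python
import csv
import random


def split_rows(rows, valid_ratio=0.2, seed=42):
    """Shuffle rows and split them into (train, valid) lists."""
    if not 0.0 < valid_ratio < 1.0:
        raise ValueError("valid_ratio must be between 0 and 1")
    shuffled = list(rows)
    # A seeded Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, round(len(shuffled) * valid_ratio))
    return shuffled[n_valid:], shuffled[:n_valid]


def split_csv(src, train_out, valid_out, valid_ratio=0.2, seed=42):
    """Read src and write train/valid CSV files that both keep the header."""
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train, valid = split_rows(rows, valid_ratio, seed)
    for path, part in ((train_out, train), (valid_out, valid)):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(train), len(valid)
```

A call such as `split_csv("all_rows.csv", "data/train.csv", "data/validation.csv", valid_ratio=0.1)` would write both files. Pointing `valid_split` at the second file by its split name is an assumption based on how `train_split: train` maps to `train.csv` in this thread.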

@apple-1
Author

apple-1 commented Dec 12, 2024

I am using my local machine for training. I had placed a train file - train.csv - in the data folder with 1400 rows. After running the trainer, the trainer log includes this piece of info:

INFO | 2024-12-12 12:09:15 | autotrain.trainers.clm.utils:process_input_data:398 - Train data: Dataset({
features: ['text', 'Description'],
num_rows: 250

Does that mean it takes only 250 rows from the train file?
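One thing that can be checked locally, without assuming anything about AutoTrain internals, is whether the 1400 figure counts physical lines or parsed CSV records. A quoted field that contains line breaks spans several lines in the file but is a single row to a CSV parser. This is only one possible explanation and is not confirmed anywhere in this thread. The helper name below is hypothetical.

```python
import csv
import io


def count_lines_and_records(csv_text):
    """Return (physical_lines, parsed_records), both excluding the header."""
    physical_lines = len(csv_text.splitlines()) - 1
    # newline="" leaves line endings untouched so quoted newlines survive.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    next(reader)  # skip the header row
    parsed_records = sum(1 for _ in reader)
    return physical_lines, parsed_records


# Tiny illustrative sample: the first record spans two physical lines.
sample = 'text,Description\n"first line\nsecond line",a\nplain,b\n'
print(count_lines_and_records(sample))  # prints (3, 2)
```

To run it on the real file, pass in `open("train.csv", encoding="utf-8", newline="").read()`. If the two numbers differ, the row count in the log may simply reflect parsed records rather than a hidden split.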

I am new to ML. Kindly explain a bit.

@abhishekkrthakur
Member

What are you training? Please provide more details :)

@apple-1
Author

apple-1 commented Dec 13, 2024

Hi, I am training GPT2 locally.

My train set has 1400 rows; please see the attached file. I am also attaching a screenshot of the training log.
train.csv
data-rows

Config is as follows:

conf = f"""
task: llm-{trainer}
base_model: {model_name}
project_name: {project_name}
log: tensorboard
backend: local

data:
  path: /data
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text

params:
  block_size: {block_size}
  lr: {learning_rate}
  warmup_ratio: {warmup_ratio}
  weight_decay: {weight_decay}
  epochs: {num_epochs}
  batch_size: {batch_size}
  gradient_accumulation: {gradient_accumulation}
  mixed_precision: {mixed_precision}
  peft: {peft}
  quantization: {quantization}
  lora_r: {lora_r}
  lora_alpha: {lora_alpha}
  lora_dropout: {lora_dropout}
  unsloth: {unsloth}

hub:
  username: ${{HF_USERNAME}}
  token: ${{HF_TOKEN}}
  push_to_hub: {push_to_hub}
"""

@apple-1
Author

apple-1 commented Dec 14, 2024

The params I used are:

unsloth = False # @param ["False", "True"] {type:"raw"}
learning_rate = 2e-4 # @param {type:"number"}
num_epochs = 1 #@param {type:"number"}
batch_size = 1 # @param {type:"slider", min:1, max:32, step:1}
block_size = 256 # @param {type:"number"}
trainer = "sft" # @param ["generic", "sft"] {type:"string"}
warmup_ratio = 0.1 # @param {type:"number"}
weight_decay = 0.01 # @param {type:"number"}
gradient_accumulation = 2 # @param {type:"number"}
mixed_precision = "none" # @param ["fp16", "bf16", "none"] {type:"string"}
peft = True # @param ["False", "True"] {type:"raw"}
quantization = "int8" # @param ["int4", "int8", "none"] {type:"string"}
lora_r = 16 #@param {type:"number"}
lora_alpha = 32 #@param {type:"number"}
lora_dropout = 0.05 #@param {type:"number"}
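To put the row count in perspective with these params, here is a rough sketch of how many optimizer steps one epoch would take for 250 versus 1400 rows. It assumes a single device, one row per training sample (no packing), and that a final partial accumulation window still counts as a step. The function name is hypothetical.

```python
import math


def optimizer_steps_per_epoch(num_rows, batch_size, gradient_accumulation):
    """Estimate optimizer updates per epoch under the assumptions above."""
    batches = math.ceil(num_rows / batch_size)
    return math.ceil(batches / gradient_accumulation)


# batch_size = 1 and gradient_accumulation = 2, as in the params above.
for rows in (250, 1400):
    print(rows, optimizer_steps_per_epoch(rows, batch_size=1, gradient_accumulation=2))
```

Under those assumptions, 250 rows gives 125 steps and 1400 rows gives 700, so comparing the step count in the training log against these numbers is another way to see how many rows were actually used.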


This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Jan 13, 2025