[BUG] autotrain.trainers.common:wrapper:216 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence. #827
Labels
bug
Backend
Colab
Interface Used
UI
CLI Command
UI Screenshots & Parameters
Error Logs
INFO | 2024-12-16 12:18:37 | autotrain.app.ui_routes::31 - Starting AutoTrain...
INFO | 2024-12-16 12:18:41 | autotrain.app.ui_routes::315 - AutoTrain started successfully
INFO | 2024-12-16 12:18:41 | autotrain.app.app::13 - Starting AutoTrain...
INFO | 2024-12-16 12:18:41 | autotrain.app.app::23 - AutoTrain version: 0.8.33
INFO | 2024-12-16 12:18:41 | autotrain.app.app::24 - AutoTrain started successfully
INFO: Started server process [856]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:7860 (Press CTRL+C to quit)
INFO: 85.172.107.239:0 - "GET / HTTP/1.1" 307 Temporary Redirect
INFO | 2024-12-16 12:18:50 | autotrain.app.ui_routes:fetch_params:415 - Task: llm:sft
INFO: 85.172.107.239:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO | 2024-12-16 12:20:07 | autotrain.app.ui_routes:fetch_params:415 - Task: llm:sft
INFO | 2024-12-16 12:20:15 | autotrain.app.ui_routes:handle_form:540 - hardware: local-ui
INFO | 2024-12-16 12:20:15 | autotrain.backends.local:create:20 - Starting local training...
INFO | 2024-12-16 12:20:15 | autotrain.commands:launch_command:514 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-sac23-6qkr5/training_params.json']
INFO | 2024-12-16 12:20:15 | autotrain.commands:launch_command:515 - {'model': 'openai-community/gpt2', 'project_name': 'autotrain-sac23-6qkr5', 'data_path': 'stas/openwebtext-10k', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 2048, 'padding': 'none', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'lr': 3e-05, 'epochs': 3, 'batch_size': 2, 'warmup_ratio': 0.1, 'gradient_accumulation': 4, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': 'none', 'quantization': 'int4', 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': 'prompt', 'text_column': 'text', 'rejected_text_column': 'rejected_text', 'push_to_hub': True, 'username': 'AlexExon', 'token': '*****', 'unsloth': False, 'distributed_backend': 'ddp'}
INFO | 2024-12-16 12:20:15 | autotrain.backends.local:create:25 - Training PID: 1293
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
INFO | 2024-12-16 12:20:30 | autotrain.trainers.clm.train_clm_sft:train:11 - Starting SFT training...
Generating train split: 100%|██████████| 10000/10000 [00:00<00:00, 42671.26 examples/s]
Generating train split: 100%|██████████| 10000/10000 [00:00<00:00, 41129.94 examples/s]
INFO | 2024-12-16 12:20:32 | autotrain.trainers.clm.utils:process_input_data:550 - Train data: Dataset({
    features: ['text'],
    num_rows: 10000
})
INFO | 2024-12-16 12:20:32 | autotrain.trainers.clm.utils:process_input_data:551 - Valid data: None
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_logging_steps:671 - configuring logging steps
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_logging_steps:684 - Logging steps: 25
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_training_args:723 - configuring training args
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_block_size:801 - Using block size 1024
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:get_model:877 - Can use unsloth: False
WARNING | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:get_model:919 - Unsloth not available, continuing without it...
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:get_model:921 - loading model config...
INFO | 2024-12-16 12:20:35 | autotrain.trainers.clm.utils:get_model:929 - loading model...
`low_cpu_mem_usage` was None, now default to True since model is quantized.
INFO | 2024-12-16 12:20:39 | autotrain.trainers.clm.utils:get_model:960 - model dtype: torch.float16
INFO | 2024-12-16 12:20:39 | autotrain.trainers.clm.train_clm_sft:train:39 - creating trainer
Token indices sequence length is longer than the specified maximum sequence length for this model (1232 > 1024). Running this sequence through the model will result in indexing errors
Generating train split: 10982 examples [00:38, 281.78 examples/s]
ERROR | 2024-12-16 12:21:19 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1607, in _prepare_split_single
    for key, record in generator:
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/generator/generator.py", line 33, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 586, in data_generator
    yield from constant_length_iterator
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/utils.py", line 654, in __iter__
    tokenized_inputs = self.tokenizer(buffer, add_special_tokens=self.add_special_tokens, truncation=False)[
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3021, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3109, in _call_one
    return self.batch_encode_plus(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3311, in batch_encode_plus
    return self._batch_encode_plus(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 127, in _batch_encode_plus
    return super()._batch_encode_plus(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 589, in _prepare_packed_dataloader
    packed_dataset = Dataset.from_generator(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 1099, in from_generator
    ).read()
  File "/usr/local/lib/python3.10/dist-packages/datasets/io/generator.py", line 49, in read
    self.builder.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1648, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1000, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1486, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1643, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/clm/__main__.py", line 28, in train
    train_sft(config)
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/clm/train_clm_sft.py", line 46, in train
    trainer = SFTTrainer(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 368, in __init__
    train_dataset = self._prepare_dataset(
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 488, in _prepare_dataset
    return self._prepare_packed_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 593, in _prepare_packed_dataloader
    raise ValueError(
ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
ERROR | 2024-12-16 12:21:19 | autotrain.trainers.common:wrapper:216 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
INFO | 2024-12-16 12:21:25 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 1293
INFO | 2024-12-16 12:21:25 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 1293
Device 0: Tesla T4 - 3.00MiB/15360MiB
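For context on where this blows up: with `trainer: sft`, TRL packs the tokenized corpus into constant-length blocks of `block_size` tokens (1024 here, which is also GPT-2's maximum context, hence the "1232 > 1024" warning). The sketch below illustrates that packing scheme under the usual tokenize-concatenate-slice assumption; the `pack` helper and its error message are illustrative, not TRL's actual code, while the tokenizer calls are standard `transformers` APIs:

```python
from transformers import AutoTokenizer

def pack(texts, block_size=1024):
    """Illustrative constant-length packing: tokenize every example,
    concatenate with EOS separators, slice into block_size chunks."""
    tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
    ids = []
    for text in texts:
        # truncation=False mirrors the call in the traceback; GPT-2 then
        # warns whenever a single example exceeds its 1024-token context.
        ids.extend(tokenizer(text, truncation=False)["input_ids"])
        ids.append(tokenizer.eos_token_id)  # add_eos_token: True in the params
    # Only complete blocks survive; a corpus with fewer than block_size
    # tokens in total yields zero blocks, the case the ValueError names.
    blocks = [ids[i:i + block_size]
              for i in range(0, len(ids) - block_size + 1, block_size)]
    if not blocks:
        raise ValueError("not enough samples to yield one packed sequence")
    return blocks
```

That said, the log shows the packing generator had already produced 10982 sequences before the crash, and the inner IndexError (`tokens_and_encodings[0][0]`) points at an empty batch handed to the tokenizer, so for this run the "not enough samples" message looks like a misleading catch-all rather than the actual root cause.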
Additional Information
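A possible mitigation, assuming the empty-batch reading above is right (not verified against this run): filter out empty or whitespace-only rows before handing the dataset to AutoTrain, since those are the kind of input that can leave the packing buffer with nothing to tokenize. A minimal sketch; the column name "text" matches `text_column` in the parameters above, and the output filename is arbitrary:

```python
from datasets import load_dataset

# Load the same dataset used in the failing run.
ds = load_dataset("stas/openwebtext-10k", split="train")

# Drop rows whose text would tokenize to nothing.
ds = ds.filter(lambda ex: ex["text"] is not None and ex["text"].strip() != "")

# Write a cleaned copy to re-upload, then point data_path at it.
ds.to_json("train_clean.jsonl")
```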