[BUG] autotrain.trainers.common:wrapper:216 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence. #827

Open
coderrussia opened this issue Dec 16, 2024 · 0 comments
Labels: bug (Something isn't working)


Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Colab

Interface Used

UI

CLI Command

UI Screenshots & Parameters

[screenshot of the training parameters attached in the original issue]

Error Logs

INFO | 2024-12-16 12:18:37 | autotrain.app.ui_routes::31 - Starting AutoTrain...
INFO | 2024-12-16 12:18:41 | autotrain.app.ui_routes::315 - AutoTrain started successfully
INFO | 2024-12-16 12:18:41 | autotrain.app.app::13 - Starting AutoTrain...
INFO | 2024-12-16 12:18:41 | autotrain.app.app::23 - AutoTrain version: 0.8.33
INFO | 2024-12-16 12:18:41 | autotrain.app.app::24 - AutoTrain started successfully
INFO: Started server process [856]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:7860 (Press CTRL+C to quit)
INFO: 85.172.107.239:0 - "GET / HTTP/1.1" 307 Temporary Redirect
INFO | 2024-12-16 12:18:50 | autotrain.app.ui_routes:fetch_params:415 - Task: llm:sft
INFO: 85.172.107.239:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO | 2024-12-16 12:20:07 | autotrain.app.ui_routes:fetch_params:415 - Task: llm:sft
INFO | 2024-12-16 12:20:15 | autotrain.app.ui_routes:handle_form:540 - hardware: local-ui
INFO | 2024-12-16 12:20:15 | autotrain.backends.local:create:20 - Starting local training...
INFO | 2024-12-16 12:20:15 | autotrain.commands:launch_command:514 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-sac23-6qkr5/training_params.json']
INFO | 2024-12-16 12:20:15 | autotrain.commands:launch_command:515 - {'model': 'openai-community/gpt2', 'project_name': 'autotrain-sac23-6qkr5', 'data_path': 'stas/openwebtext-10k', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 2048, 'padding': 'none', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'lr': 3e-05, 'epochs': 3, 'batch_size': 2, 'warmup_ratio': 0.1, 'gradient_accumulation': 4, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': 'none', 'quantization': 'int4', 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': 'prompt', 'text_column': 'text', 'rejected_text_column': 'rejected_text', 'push_to_hub': True, 'username': 'AlexExon', 'token': '*****', 'unsloth': False, 'distributed_backend': 'ddp'}
INFO | 2024-12-16 12:20:15 | autotrain.backends.local:create:25 - Training PID: 1293
The following values were not passed to accelerate launch and had defaults used instead:
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
INFO | 2024-12-16 12:20:30 | autotrain.trainers.clm.train_clm_sft:train:11 - Starting SFT training...
Generating train split: 0%| | 0/10000 [00:00<?, ? examples/s]
Generating train split: 40%|████ | 4000/10000 [00:00<00:00, 33099.45 examples/s]
Generating train split: 100%|██████████| 10000/10000 [00:00<00:00, 42671.26 examples/s]
Generating train split: 100%|██████████| 10000/10000 [00:00<00:00, 41129.94 examples/s]
INFO | 2024-12-16 12:20:32 | autotrain.trainers.clm.utils:process_input_data:550 - Train data: Dataset({
    features: ['text'],
    num_rows: 10000
})
INFO | 2024-12-16 12:20:32 | autotrain.trainers.clm.utils:process_input_data:551 - Valid data: None
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_logging_steps:671 - configuring logging steps
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_logging_steps:684 - Logging steps: 25
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_training_args:723 - configuring training args
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:configure_block_size:801 - Using block size 1024
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:get_model:877 - Can use unsloth: False
WARNING | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:get_model:919 - Unsloth not available, continuing without it...
INFO | 2024-12-16 12:20:34 | autotrain.trainers.clm.utils:get_model:921 - loading model config...
INFO | 2024-12-16 12:20:35 | autotrain.trainers.clm.utils:get_model:929 - loading model...
low_cpu_mem_usage was None, now default to True since model is quantized.
INFO | 2024-12-16 12:20:39 | autotrain.trainers.clm.utils:get_model:960 - model dtype: torch.float16
INFO | 2024-12-16 12:20:39 | autotrain.trainers.clm.train_clm_sft:train:39 - creating trainer
Token indices sequence length is longer than the specified maximum sequence length for this model (1232 > 1024). Running this sequence through the model will result in indexing errors
Generating train split: 1 examples [00:03, 3.82s/ examples]
Generating train split: 314 examples [00:03, 113.01 examples/s]
Generating train split: 1000 examples [00:06, 187.30 examples/s]
Generating train split: 1495 examples [00:06, 356.68 examples/s]
Generating train split: 2539 examples [00:11, 251.78 examples/s]
Generating train split: 3000 examples [00:11, 358.83 examples/s]
Generating train split: 3382 examples [00:13, 266.29 examples/s]
Generating train split: 3856 examples [00:14, 386.11 examples/s]
Generating train split: 4386 examples [00:18, 195.29 examples/s]
Generating train split: 4691 examples [00:18, 262.79 examples/s]
Generating train split: 5310 examples [00:21, 214.40 examples/s]
Generating train split: 5734 examples [00:21, 326.33 examples/s]
Generating train split: 6000 examples [00:23, 219.05 examples/s]
Generating train split: 7000 examples [00:26, 270.77 examples/s]
Generating train split: 7600 examples [00:28, 259.81 examples/s]
Generating train split: 8443 examples [00:32, 220.36 examples/s]
Generating train split: 8682 examples [00:32, 265.40 examples/s]
Generating train split: 9284 examples [00:36, 199.69 examples/s]
Generating train split: 9908 examples [00:36, 348.17 examples/s]
Generating train split: 10456 examples [00:38, 286.75 examples/s]
Generating train split: 10982 examples [00:38, 281.78 examples/s]
ERROR | 2024-12-16 12:21:19 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1607, in _prepare_split_single
    for key, record in generator:
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/generator/generator.py", line 33, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 586, in data_generator
    yield from constant_length_iterator
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/utils.py", line 654, in __iter__
    tokenized_inputs = self.tokenizer(buffer, add_special_tokens=self.add_special_tokens, truncation=False)[
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3021, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3109, in _call_one
    return self.batch_encode_plus(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3311, in batch_encode_plus
    return self._batch_encode_plus(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 127, in _batch_encode_plus
    return super()._batch_encode_plus(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 589, in _prepare_packed_dataloader
    packed_dataset = Dataset.from_generator(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 1099, in from_generator
    ).read()
  File "/usr/local/lib/python3.10/dist-packages/datasets/io/generator.py", line 49, in read
    self.builder.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1648, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1000, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1486, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1643, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/clm/main.py", line 28, in train
    train_sft(config)
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/clm/train_clm_sft.py", line 46, in train
    trainer = SFTTrainer(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 368, in __init__
    train_dataset = self._prepare_dataset(
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 488, in _prepare_dataset
    return self._prepare_packed_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 593, in _prepare_packed_dataloader
    raise ValueError(
ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.

ERROR | 2024-12-16 12:21:19 | autotrain.trainers.common:wrapper:216 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
INFO | 2024-12-16 12:21:25 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 1293
INFO | 2024-12-16 12:21:25 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 1293
Device 0: Tesla T4 - 3.00MiB/15360MiB

Additional Information
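
For context on the ValueError above: as the traceback suggests, SFTTrainer with packing enabled feeds the train split through a ConstantLengthDataset generator and raises this error when that generator produces no sequence of block_size tokens. Below is a minimal sanity-check sketch (not the AutoTrain code path; dataset name, text column, and block size are taken from the parameters logged above) that counts tokens in the train split and confirms it can fill at least one packed block:

# Hypothetical sanity check, not part of AutoTrain: count tokens in the train
# split and verify it can fill at least one packed block of `block_size` tokens,
# which is the condition the ValueError refers to.
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 1024  # matches "Using block size 1024" in the logs
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
ds = load_dataset("stas/openwebtext-10k", split="train")

total_tokens = 0
empty_rows = 0
for row in ds:
    text = row["text"] or ""
    if not text.strip():
        empty_rows += 1
        continue
    total_tokens += len(tokenizer(text, truncation=False)["input_ids"])

print(f"rows={ds.num_rows} empty_rows={empty_rows} total_tokens={total_tokens}")
print("can yield at least one packed sequence:", total_tokens >= block_size)

If this check passes for stas/openwebtext-10k (which it presumably does, given 10,000 documents), the failure would seem to come from the chained IndexError in _batch_encode_plus (an empty buffer reaching the tokenizer) rather than from the dataset actually being too small to pack.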
