Learning to generate EOS tokens #1623

Closed · vwxyzjn opened this issue May 6, 2024 · 7 comments

Comments

@vwxyzjn (Contributor) commented May 6, 2024

@edbeeching and I noticed that the trained SFT models sometimes do not learn to stop their generations. In other words, the model never learns to generate EOS tokens.

Upon some digging, I noticed this is mainly an issue with the dataset preprocessing. In particular, if we simply pass a dataset like https://huggingface.co/datasets/timdettmers/openassistant-guanaco to the SFTTrainer, the trainer may not append an EOS token to the completion.

If we run for item1, item2 in zip(inputs["input_ids"][1], inputs["attention_mask"][1]): print(item1, item2) at https://github.com/huggingface/transformers/blob/91d155ea92da372b319a79dd4eef69533ee15170/src/transformers/trainer.py#L3207 with our SFT example, we get

python examples/scripts/sft.py \
    --model_name_or_path="facebook/opt-350m" \
    --report_to="wandb" \
    --learning_rate=1.41e-5 \
    --per_device_train_batch_size=2 \
    --gradient_accumulation_steps=16 \
    --output_dir="sft_openassistant-guanaco" \
    --logging_steps=1 \
    --num_train_epochs=3 \
    --max_steps=-1 \
    --push_to_hub \
    --gradient_checkpointing \
    --dataset_text_field text
[screenshot: tokenized input_ids and attention_mask for one sample]

Notice how the pad token / eos token corresponds to attention mask = 0.
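
This is easy to check outside the trainer as well. Below is a minimal sketch (the sample string is a made-up guanaco-style row, and the tokenizer just mirrors the command above; setting pad_token = eos_token reproduces the setup described in this report):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/opt-350m")
    tok.pad_token = tok.eos_token  # mirrors the setup described above
    sample = "### Human: Hi### Assistant: Hello!"  # hypothetical guanaco-style row

    # Nothing in the preprocessing appends an EOS token to the completion,
    # so the last real token is ordinary text, not EOS.
    ids = tok(sample)["input_ids"]
    print(ids[-1] == tok.eos_token_id)  # expected: False

    # With padding, EOS tokens only appear as padding, i.e. with
    # attention_mask = 0, which matches the screenshot above.
    enc = tok(sample, padding="max_length", max_length=16)
    for token, mask in zip(tok.convert_ids_to_tokens(enc["input_ids"]), enc["attention_mask"]):
        print(token, mask)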

Potential solution

This can be resolved if we add an EOS token to the dataset itself. For example, the chat template

"{% for message in messages %}{{' ' + message['content']}}{% endfor %}{{ eos_token }}"

always appends an EOS token to each tokenized example, and as a result we get

python examples/scripts/minimal/sft.py \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 5e-05 \
    --logging_steps 10 \
    --evaluation_strategy epoch \
    --max_seq_length 1024 \
    --num_train_epochs 5 \
    --output_dir models/minimal/sft
[screenshot: tokenized input_ids and attention_mask for one sample]

Notice how the first eos token corresponds to attention mask = 1.
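
The template fix can also be sanity-checked in isolation. A small sketch (the tokenizer and the single-message conversation are placeholders, not taken from the run above) sets the template and confirms the rendered text now ends with EOS, which is why the first EOS in the batch gets attention_mask = 1:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/opt-350m")  # placeholder tokenizer
    tok.chat_template = (
        "{% for message in messages %}{{' ' + message['content']}}"
        "{% endfor %}{{ eos_token }}"
    )

    messages = [{"role": "user", "content": "### Human: Hi### Assistant: Hello!"}]

    text = tok.apply_chat_template(messages, tokenize=False)
    print(text.endswith(tok.eos_token))  # True: the template always appends EOS

    ids = tok.apply_chat_template(messages, tokenize=True)
    print(ids[-1] == tok.eos_token_id)   # True: the EOS is a real, attended token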

@edbeeching (Collaborator)
As I mentioned on our internal Slack, we should probably add a line such as:

    if sft_config.packing is False:
        tokenizer.add_eos_token = True

This needs to be reverted before saving the model, as otherwise generation is broken:

    if sft_config.packing is False:
        # setting this as true breaks generation during evaluation
        tokenizer.add_eos_token = False

I tested these additions in h4 and they resolved many of the issues we saw with models trained with packing=False.
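
For concreteness, here is a sketch of how that toggle could be wired into a training script (hypothetical wiring, not current TRL code; add_eos_token exists on LLaMA-style tokenizers but not on every tokenizer, hence the hasattr guard):

    # Hypothetical wiring around an existing SFTTrainer setup (trainer, tokenizer
    # and sft_config come from the surrounding script).
    if sft_config.packing is False and hasattr(tokenizer, "add_eos_token"):
        tokenizer.add_eos_token = True   # every non-packed sample now ends with EOS

    trainer.train()

    if sft_config.packing is False and hasattr(tokenizer, "add_eos_token"):
        # revert before saving: leaving this enabled appends EOS to prompts too,
        # which breaks generation during evaluation
        tokenizer.add_eos_token = False

    trainer.save_model(sft_config.output_dir)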

@yananchen1989
Is this also an issue when packing=True? I also find that the generations from the SFT model are quite wordy.

@derekelewis commented May 9, 2024

@yananchen1989 I believe the answer is yes for both packing=True and packing=False. I'm seeing a lack of EOS prediction from SFTTrainer fine-tuned models when using chat templates. Still doing testing, but it doesn't seem to be an issue when not using chat templates and using formatting_func instead.

PEFT also seems to be a contributing factor: without PEFT, EOS is predicted correctly; with PEFT, it is not.

@vwxyzjn (Contributor, Author) commented May 15, 2024

Actually, I am not even sure setting tokenizer.pad_token = tokenizer.eos_token would work. Even if the dataset has an EOS token, what happens is that the attention_mask is set to 1, but the label is still set to -100, so the loss on the EOS token is still masked out.

for input_id, attention_mask, label in zip(inputs["input_ids"][0], inputs["attention_mask"][0], inputs["labels"][0]): print(f"{input_id=}, {attention_mask=}, {label=}")
input_id=tensor(15, device='cuda:0'), attention_mask=tensor(1, device='cuda:0'), label=tensor(15, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(1, device='cuda:0'), label=tensor(-100, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(0, device='cuda:0'), label=tensor(-100, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(0, device='cuda:0'), label=tensor(-100, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(0, device='cuda:0'), label=tensor(-100, device='cuda:0')
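
One hypothetical workaround (not something merged into TRL, just an illustration of the mechanics): since DataCollatorForLanguageModeling masks labels wherever the token equals the pad token, and pad_token == eos_token here, the genuine EOS gets -100 as well; a thin wrapper could restore labels at positions that are actually attended to.

    import torch
    from transformers import DataCollatorForLanguageModeling

    class KeepEosLabelsCollator(DataCollatorForLanguageModeling):
        # Hypothetical collator: restore labels wherever attention_mask == 1,
        # so a real EOS (which shares its id with the pad token) is not masked out.
        def torch_call(self, examples):
            batch = super().torch_call(examples)
            attended = batch["attention_mask"].bool()
            batch["labels"] = torch.where(attended, batch["input_ids"], batch["labels"])
            return batch

    # usage sketch: collator = KeepEosLabelsCollator(tokenizer=tokenizer, mlm=False)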

@yananchen1989
Yes, I agree that whether or not packing is set, the EOS token is not properly predicted, which causes lengthy outputs.

@vwxyzjn (Contributor, Author) commented May 21, 2024

@yananchen1989 FYI when packing is set this should not be a problem. See #1646 (comment).


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
