
Fix labels & eos_token for SFT #819

Merged 1 commit into microsoft:master on Sep 10, 2024

Conversation

li-plus (Contributor) commented Nov 28, 2023

Fixed two issues:

  • Padding should be ignored in training: its labels should be set to -100 so that CrossEntropyLoss skips those positions.
  • Append the correct eos_token to the response text; otherwise the hard-coded default <|endoftext|> is tokenized into multiple tokens. See the sketch after this list.
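To make both fixes concrete, here is a minimal sketch, not the PR's actual diff; it assumes a Hugging Face tokenizer, and the model name, prompt text, and max_length are placeholders:

```python
# Minimal sketch of both fixes (hypothetical, not the PR's diff).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

prompt = "Human: Please tell me about DeepSpeed.\nAssistant:"
response = " DeepSpeed is a deep learning optimization library."

# Fix 2: append the tokenizer's own eos_token, which encodes to a single
# token id, instead of a hard-coded "<|endoftext|>" string.
text = prompt + response + tokenizer.eos_token

enc = tokenizer(text, max_length=64, padding="max_length",
                truncation=True, return_tensors="pt")
input_ids = enc["input_ids"]

# Fix 1: labels start as a copy of input_ids, then padding positions are
# set to -100, the default ignore_index of torch.nn.CrossEntropyLoss.
labels = input_ids.clone()
labels[enc["attention_mask"] == 0] = -100
```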

li-plus (Contributor, Author) commented Sep 8, 2024

Just rebased and fixed the code format. Could anybody merge this PR, since it has already been approved? I don't have merge permission.

tjruwase (Contributor) commented Sep 9, 2024

@li-plus, apologies for the delay on this. Could you please share a bit more detail on the motivation for this PR? For example, what improvements did you observe in your workloads? Thanks!

li-plus (Contributor, Author) commented Sep 10, 2024

@tjruwase These are bugs that should be fixed. In the master code, labels for padding are not ignored, so the model is trained to generate padding tokens, which is unnecessary. Also, the eos token is not recognized as a single token: if you examine the input_ids in the training_scripts/opt/single_gpu/run_1.3b.sh example, the appended <|endoftext|> (together with the preceding period) is tokenized into [49069, 15483, 1397, 1116, 29015, 15483, 15698], i.e. ['.<', '|', 'end', 'of', 'text', '|', '>'], which is wrong. This PR fixes both bugs.
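A quick way to reproduce the mis-tokenization (a hypothetical snippet, not code from the repo; it assumes the OPT tokenizer used by that example script):

```python
# "<|endoftext|>" is not a special token for OPT (its eos_token is "</s>"),
# so the literal string is split into several subword tokens by BPE.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

ids = tokenizer.encode(".<|endoftext|>", add_special_tokens=False)
print(ids)                                   # several ids, not one
print(tokenizer.convert_ids_to_tokens(ids))  # the split pieces reported above

# The fix: use the tokenizer's own eos_token, which is a single id.
print(tokenizer.eos_token)
```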

tjruwase (Contributor) commented

@li-plus, got it. Thanks for this great contribution.

tjruwase merged commit a256c04 into microsoft:master on Sep 10, 2024
2 checks passed