DPO models generate multiple / corrupted responses #1025
Comments
I encounter a similar problem. |
Is it possible that you didn't have [...]? Also cc @kashif |
ok checking! |
I checked my code; I have correctly set the tokenizer's EOS token and padding side. I also tried adjusting various settings, and in the end I suspected that training had collapsed, so I lowered the learning rate, which resolved the issue. |
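For context, here is a minimal sketch of what "lowering the learning rate" looks like in a DPOTrainer setup. The checkpoint names, dataset file, and hyperparameters are placeholders, and the keyword arguments follow the TRL API current around the time of this thread; they may differ in other versions.

```python
# Hedged sketch: a conservative learning rate for DPO training.
# "my-sft-model" and "pairs.json" are hypothetical placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("my-sft-model")      # SFT checkpoint to optimize
ref_model = AutoModelForCausalLM.from_pretrained("my-sft-model")  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")
tokenizer.pad_token = tokenizer.eos_token  # make sure a pad token is set

train_dataset = load_dataset("json", data_files="pairs.json", split="train")  # prompt/chosen/rejected

training_args = TrainingArguments(
    output_dir="dpo-out",
    per_device_train_batch_size=4,
    learning_rate=5e-7,  # 1e-6 / 2e-6 triggered degeneration for some users in this thread
    max_steps=500,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```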
Among the various tests I have done, I too had already checked that the EOS token was correctly set, but despite this I have not had any positive results. @Ricardokevins, would you mind sharing some code snippets and the learning rate you used? Even after that test I could not solve the problem, unfortunately. |
Small update: I have done more testing after reducing the learning rate, and it seems to work much better than before. In any case, there are still outputs that take data from the train-set input (e.g., 'cat' instead of 'a', 'b', or 'c'); however, they are significantly fewer than before. |
Oh, I still have the problem if I increase the learning rate from 1e-6 to 2e-6. |
I think this issue may relate to the padding process in DPODataCollatorWithPadding. The padding side looks strange: chosen_input_ids and rejected_input_ids are right-padded while prompt_input_ids is left-padded, and the instruction model I am tuning uses left padding. |
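To make the padding-side point concrete, here is a small illustration (this is not TRL's collator code; "gpt2" is only a stand-in tokenizer) of how left vs. right padding changes where pad tokens land:

```python
# Hedged illustration: left-padded prompts vs. right-padded responses.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token

texts = ["Tell me a joke", "Hi"]

tok.padding_side = "left"   # prompts are usually left-padded so generation continues right after the prompt
left = tok(texts, padding=True, return_tensors="pt")

tok.padding_side = "right"  # responses (chosen/rejected) are typically right-padded
right = tok(texts, padding=True, return_tensors="pt")

print(left["input_ids"])    # pad ids appear at the start of the shorter sequence
print(right["input_ids"])   # pad ids appear at the end of the shorter sequence
```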
I checked another open-source project and observed a similar issue (repetitive generation): eric-mitchell/direct-preference-optimization#8 |
@Ricardokevins do you mind trying with the refactored PR #885 where we have done some more fixes for the padding etc. |
@kashif Yeah, I adopted the PR and tuned the model for 500 steps with a 2e-6 learning rate. @Devy99 You can also try this code. |
Thank you for the support! I'll try as soon as possible and I'll keep you updated. |
I observed the problem again with a new experiment (5e-7, 1700 steps), and I am still looking for new solutions. Based on my experiments, it appears that larger learning rates and longer update schedules are more likely to trigger this issue. Therefore, I wonder whether it is related to the DPO loss issue mentioned above in another open-source project (eric-mitchell/direct-preference-optimization#35). |
Sorry for the late reply, but in my case it seems that the problem is solved. Specifically, I have taken the following measures:
No changes were made to the data collator (I used the default one, as in the attached code). |
Congrats! I am a little confused about, and interested in, the root cause of the problem. I checked your code and noticed that you didn't specify the data_collator. However, in dpo_trainer the default data_collator is DPODataCollatorWithPadding. Did you use the code changes to DPODataCollatorWithPadding from the PR in trl/trainer/utils.py? I'm confused by the statement "No changes were made to the data collator (I used the default one, as in the attached code)." |
Exactly, I simply tested the above code with the changes made in the PR and the data collator was not specified ( hence, the default one is used according to the documentation ). |
Thank you! I haven't observed this problem in my subsequent experiments either. The previous case was likely a random occurrence. We can go ahead and close the issue for now. I may need to seek your assistance again if I encounter any further issues. |
@kashif Unfortunately, I have to re-open the issue. Now I am testing the script on a real and more complex dataset for a text generation task, and the problem seems to persist. In particular, increasing the number of steps deteriorates the performance of the model considerably, leading to checkpoints where a large number of repeated characters are generated. I also tried changing the learning rate several times, without getting any promising results. Of course, for all experiments I used the pull request code, without success. |
No worries @Devy99, I can test with my branch to check... the main issue being that your model is an enc-dec style model, while the code has mostly been tested on decoder-only style models... |
Thank you very much! Then, I look forward to hearing from you. |
Hi @kashif , sorry to bother you. Is there any update about this problem? P.S. @Ricardokevins did you solve your problem / find any solution? |
I am also facing a similar issue. |
Update: reducing the LR did help remove the redundancy. |
Hi @kashif @lvwerra @younesbelkada , sorry to ask again. Can you tell me if the DPO implementation will be further tested on encoder-decoder models like T5? I have tried to replace my model with a decoder-only, but an encoder-decoder like T5 would be ideal for the experiments I am doing. Can you tell me more about it? |
Unluckily, when I explored more base models for DPO, I encountered the problem again... I still think there might be some bugs in the code or in the algorithm itself. A lower learning rate might just slow the deterioration, and the problem might still exist. But I have no idea why... |
@Devy99 @Ricardokevins Hi, I met the problem, too. |
I didn't change this setting (it might be the default value). |
@Devy99 I think with the refactorings that happened, the enc-dec setup needs a closer look. I will take a look with a tiny T5-style model and report back. |
In my setting (my recent research on multilingual reasoning by preference optimization), I use 1e-6 for DPO training on LLaMA2 (7/13B)-based LLMs, and it works well on these decoder-only LLMs. If 1e-6 is not working, you can lower it to 2e-7. Meanwhile, you can put the bad sampled outputs with repetition into "rejected" to penalize this behavior, as sketched below. |
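A minimal sketch of that suggestion: treat model samples that degenerate into repetition as "rejected" and keep a reference answer as "chosen". The `detect_repetition` helper and the record fields are illustrative, not part of TRL.

```python
# Hedged sketch: build preference pairs that penalize repetitive samples.
def detect_repetition(text: str, max_repeats: int = 4) -> bool:
    """Flag outputs where a single token repeats many times in a row."""
    tokens = text.split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False

def build_pairs(prompts, model_samples, reference_answers):
    """Return prompt/chosen/rejected records, rejecting repetitive samples."""
    pairs = []
    for prompt, sample, reference in zip(prompts, model_samples, reference_answers):
        if detect_repetition(sample):
            pairs.append({"prompt": prompt, "chosen": reference, "rejected": sample})
    return pairs
```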
@kashif nice, thank you! I'll wait for your updates then 🙏 |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Up ( no stale ) |
Up ...! |
@Devy99 I think the trainers haven't been exercised with encoder-decoder style models, so I fear there might still be a bug with masking, tokenization, etc. Let me find some time to have a look at T5; can I use your example above to check? |
@kashif ok, thanks for your patience and your work! Actually, I don't have the dataset used for the example, since some time has passed since I opened this issue. But it can easily be regenerated from scratch, since I applied trivial heuristics to determine the output value. |
Hi! Any update on this issue? I also have a similar problem with the KTO Trainer and TinyMistral. |
Not yet, but @kashif is actually testing it for T5 |
Hello, I am planning to train DPO on a 100M GPT-2 model. Since this is just for testing purposes, I haven't formatted my training data specifically; I randomly selected a DPO dataset from Hugging Face for training. After 2 days of DPO training, I found that my GPT-2 model couldn't even generate coherent sentences, or rather, the model's performance degraded catastrophically. I am puzzled. Do I need to format my training data in a specific way like yours to see results? Is it normal for the model's performance to degrade with random data? Of course, I am more than willing to reproduce the issue you initially encountered, as long as it doesn't result in the model degrading to an untrained state. |
Up... facing the same problem with encoder-decoder style models (BART) on PPO. Having read the relevant thread(s), I suspect this might be a universal issue among the DPO and PPO trainers (?) |
The same problem here: after tuning a decoder-only model using SFT, the behaviour of the model is as expected, but after applying DPO training the model does not even generate coherent responses. As a guide, I have used the parameters from the huggingface/alignment-handbook. |
This blog post may be related to this problem. It looks like a fundamental issue with the DPO/KTO loss. https://kyunghyuncho.me/a-proper-preference-optimization-loss-and-its-gradient/ |
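For reference, this is the standard objective from the original DPO paper (stated here for context, not as a summary of the linked post's argument). Note that it only constrains the margin between the chosen and rejected log-ratios, so the absolute likelihood of the chosen response can still drop during training:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```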
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Up. Any news, @kashif ? |
I'm facing the same problem. Currently, I'm using in-distribution data sampled from model output and a low learning rate (5e-7). The repetition still exists. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Up |
I am currently working on a project that requires me to use the DPO trainer for the T5 model. I have noticed that the DPO trainer provided by Hugging Face has some issues with it, and unfortunately the development team does not seem to be paying much attention to this so far (I understand most people are now focusing more on decoder-only language models), judging from the code snippet in https://github.com/huggingface/trl/blob/main/tests/test_dpo_trainer.py, which states: "[...]". Anyway, I encountered a specific issue with the original DPO trainer when applying it to the T5 model, mainly concerning the padding value added in the decoder. After addressing these problems, the DPO trainer is now functioning correctly for my use case. Additionally, I have found that using a fine-tuned T5 model and a very small learning rate is crucial for stable training. |
@shengminp thank you for sharing your experience! Would you mind sharing the changes you made? Thanks in advance 🙏 |
@Devy99 Sure, I've made several modifications based on my specific needs, so I can only point you to the problematic code. I'm not sure if these details will help you solve your current issues, but here they are. The versions of the libraries are as follows:
The problematic code is primarily in DPODataCollatorWithPadding. In line 380 of https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py, you will find the following condition: [...]. Please note that my current changes are solely intended to ensure that the T5 model runs without issues. This means the code has not been tested with decoder-only models or other encoder-decoder models, so it is possible that my changes could cause errors in other situations. |
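For readers following along, here is an illustration of the kind of decoder-side padding an encoder-decoder model expects (this is not TRL's DPODataCollatorWithPadding code, just a generic sketch): label positions are padded with -100 so the loss ignores them, while input_ids are padded with the tokenizer's pad id.

```python
# Hedged illustration of seq2seq-style batch padding (not TRL code).
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_seq2seq_batch(input_id_lists, label_lists, pad_token_id, label_pad_token_id=-100):
    input_ids = pad_sequence(
        [torch.tensor(ids) for ids in input_id_lists],
        batch_first=True, padding_value=pad_token_id,
    )
    attention_mask = (input_ids != pad_token_id).long()
    # Decoder labels are padded with -100 so cross-entropy skips padded positions.
    labels = pad_sequence(
        [torch.tensor(ids) for ids in label_lists],
        batch_first=True, padding_value=label_pad_token_id,
    )
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```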
Let me share more details about my experience. I followed the xxx-tuning phase utilized in most of today's LM model training, which includes the following steps:
I am not sure if this experience can help you. At present, I am still debugging and modifying the code, so it is quite complicated. Please forgive me for not being able to provide too many details of the code at this time. |
@shengminp thanks! I'll give it a try 😄 |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
@Devy99 can you try now on the latest head with an encoder-decoder model? |
@kashif thanks for your work! Lately, I am quite busy, so I'll update you as soon as possible. Anyway, I also invite the others who experienced my same issue to check whether the problem is fixed or not. |
Hi, I am running some tests with DPOTrainer to see how it works, but I have encountered some problems during the inference phase of the generated model. In detail, this is the pipeline of operations I performed:
I pre-trained a T5 model from scratch on natural language (English). For this operation, I followed the instructions of the Hugging Face library. The tokenizer was trained using the sentencepiece library; the generated file (extension .model) was then used through the T5Tokenizer class, which allows using the .model file instead of a JSON file.
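A minimal sketch of that tokenizer step (file names and vocab size are placeholders, not taken from the original setup):

```python
# Hedged sketch: train a sentencepiece model, then load the .model file via T5Tokenizer.
import sentencepiece as spm
from transformers import T5Tokenizer

# Train a unigram sentencepiece model on a plain-text corpus (placeholder file).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spiece", vocab_size=32000, model_type="unigram"
)

# T5Tokenizer accepts the sentencepiece .model file directly (no tokenizer.json needed).
tokenizer = T5Tokenizer("spiece.model")
print(tokenizer.tokenize("The cat is orange"))
```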
I fine-tuned T5 using a very trivial dataset such as the following.
In summary, if there is no word 'the' in the input then the output will be 'a', if there is only one occurrence of 'the' then the output will be 'b', and so on... For fine-tuning, I did not use the SFTTrainer class but the classic Seq2SeqTrainer.
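A rough sketch of that fine-tuning step with Seq2SeqTrainer (not the author's script; the toy data mirrors the rule described above, and checkpoint paths and hyperparameters are placeholders):

```python
# Hedged sketch of the SFT step for T5 using Seq2SeqTrainer.
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration, T5Tokenizer)

tokenizer = T5Tokenizer("spiece.model")
model = T5ForConditionalGeneration.from_pretrained("my-pretrained-t5")  # hypothetical checkpoint

raw = Dataset.from_dict({
    "input": ["I love cats", "The cat is orange"],  # toy examples following the rule above
    "output": ["a", "b"],
})

def preprocess(batch):
    enc = tokenizer(batch["input"], truncation=True, max_length=64)
    enc["labels"] = tokenizer(batch["output"], truncation=True, max_length=8)["input_ids"]
    return enc

train_ds = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(output_dir="sft-t5", per_device_train_batch_size=8,
                                num_train_epochs=3, learning_rate=3e-4)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```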
Then, I performed DPO with the same inputs as the dataset above, but in JSON format. The code used is the same as the example in the repository; in this case, however, I used our fine-tuned T5 model and tokenizer (with the classes T5ForConditionalGeneration, T5Tokenizer, T5Config). You can find the JSON file and the full code at the end of this message.
The problem arises in the inference phase of the model generated by the DPOTrainer. In fact, for several instances the output generated by the model is 'a a a a a a', ' b b b b b b b', 'c c c c c c c c', and so on... (the number of repetitions of the class is variable). Moreover, this behavior becomes more pronounced as the number of steps increases. Also, as the number of steps increases, words that are part of the train set are generated in the output (e.g., 'aaacat' is generated).
I cannot figure out what could be the cause of this behavior. Running inference with the merely fine-tuned model produces the expected output (i.e., one of the classes 'a', 'b', 'c', or 'd'), so the problem is introduced during training with DPO. I also tried using the pre-trained 't5-small' model / tokenizer instead of the ones trained from scratch, but the problem still persists.
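For completeness, the kind of inference check described above looks roughly like this (checkpoint paths are placeholders, not the author's actual paths):

```python
# Hedged sketch of the inference check on a DPO-trained checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer("spiece.model")
model = T5ForConditionalGeneration.from_pretrained("dpo-t5/checkpoint-500")  # hypothetical DPO checkpoint

inputs = tokenizer("The cat is orange", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # expected "b"; the reported failure mode is "b b b b ..."
```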
I look forward to your feedback; let me know if more information or snippets of the code used are needed.
DPO dataset
```
[
  { 'prompt': 'I love cats', 'chosen': 'a', 'rejected': 'b', },
  { 'prompt': 'I love cats', 'chosen': 'a', 'rejected': 'c', },
  { 'prompt': 'I love cats', 'chosen': 'a', 'rejected': 'd', },
  { 'prompt': 'The cat is orange', 'chosen': 'b', 'rejected': 'a', },
  { 'prompt': 'The cat is orange', 'chosen': 'b', 'rejected': 'c', },
  { 'prompt': 'The cat is orange', 'chosen': 'b', 'rejected': 'd', },
  ...
]
```
DPO code
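The attached script itself did not survive in this copy of the thread. What follows is only a rough sketch of how the DPO step described above could be wired up with trl's DPOTrainer on a fine-tuned T5 checkpoint; paths, hyperparameters, and the pairs.json file are placeholders, and the keyword arguments (including max_target_length for encoder-decoder models) follow the TRL API current around the time of this thread.

```python
# Hedged sketch only -- not the author's original script.
from datasets import load_dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer, TrainingArguments
from trl import DPOTrainer

tokenizer = T5Tokenizer("spiece.model")
model = T5ForConditionalGeneration.from_pretrained("sft-t5")      # fine-tuned checkpoint
ref_model = T5ForConditionalGeneration.from_pretrained("sft-t5")  # frozen reference copy

train_dataset = load_dataset("json", data_files="pairs.json", split="train")  # prompt/chosen/rejected

training_args = TrainingArguments(
    output_dir="dpo-t5",
    per_device_train_batch_size=8,
    learning_rate=5e-7,
    max_steps=500,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=64,
    max_prompt_length=32,
    max_target_length=8,  # needed for encoder-decoder models with the default collator
)
trainer.train()
```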