Duplicated items in Sky-T1_data_17k.json #33
Thank you for your analysis! This is due to an oversight in the data preparation: we didn't check for duplicate items in STILL-2. Thank you for the replication! In the release we actually reported numbers with temperature=0.7 (third row); your numbers look higher than our released results.
We present the model fine-tuned on the deduplicated version of the dataset. After removing the duplicated science and puzzle data from the training set, the fine-tuned model exhibits a slight decline in performance on AIME, MATH500, GPQA Diamond, and MMLU when evaluated with temperature 0.7. However, there is a significant improvement on LiveCodeBench. There seems to be no strong correlation between task performance and the domain of the training data...
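For reference, the deduplication step described here can be sketched as follows. This is a minimal version: the exact JSON schema and file name are assumptions, so items are compared by their full canonical serialization.

```python
import json

def dedup(items):
    """Return items with exact duplicates removed, keeping first occurrences."""
    seen, out = set(), []
    for item in items:
        # Canonical serialization so dicts with the same content compare equal.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out

# Toy illustration; in practice the input would be the loaded
# Sky-T1_data_17k.json list (file name assumed from the issue title).
items = [{"id": 1, "q": "a"}, {"id": 1, "q": "a"}, {"id": 2, "q": "b"}]
print(len(dedup(items)))
```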
Thank you for sharing the results. Based on your results before deduplication, there seems to be a significant performance difference between temperatures, which also confuses me a little. For mathematical problems, the reasoning steps and conclusions should be rigorous, so it seems greedy decoding should be used to avoid randomness.
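The greedy-versus-temperature distinction raised here can be illustrated with a minimal sketch (toy logits standing in for real model outputs; this is not the evaluation code from the repo):

```python
import math
import random

def sample_token(logits, temperature):
    """Pick a token index: argmax when temperature == 0 (greedy decoding),
    otherwise sample from the temperature-scaled softmax distribution."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.5]
print(sample_token(logits, 0))    # greedy: always index 0
```

With temperature 0.7 the lower-scoring tokens retain nonzero probability, which is where the run-to-run variance in the benchmark numbers comes from.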
We use only math-domain data from STILL-2 with the same system prompt to fine-tune the base Qwen2.5-32B-Instruct model (STILL-2-Math-32B-SFT) and the Sky-T1-32B-SFT model (Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT). Here are the results.
Thank you for your results. It seems that "STILL-2-Math-32B-SFT (temp=0.7)", fine-tuned only on the math domain, has the best GPQA performance.
We also test these models on MMLU-Pro, GSM8K, and ARC-C. The full results are shown below (sampling with temperature 0.7).
Hello, SkyThought Team,
We have noticed that the Sky-T1_data_17k.json dataset contains duplicated items, which seem to originate from the 815 science and puzzle examples included from STILL-2. We would like to ask whether these items were deliberately sampled twice or whether this is an oversight in the dataset preparation.
To illustrate this observation, we have conducted a simple analysis:
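A simple check along these lines can be sketched as follows. The toy items stand in for the real JSON entries, and the actual schema is an assumption; items are therefore compared by their full canonical serialization.

```python
import json
from collections import Counter

def duplicate_counts(items):
    """Map each duplicated item (as a canonical JSON string) to its count."""
    keys = [json.dumps(item, sort_keys=True) for item in items]
    counts = Counter(keys)
    return {k: c for k, c in counts.items() if c > 1}

# Toy illustration; in practice the input would be
# json.load(open("Sky-T1_data_17k.json")).
items = [{"id": 1, "q": "a"}, {"id": 2, "q": "b"}, {"id": 1, "q": "a"}]
dups = duplicate_counts(items)
redundant = sum(dups.values()) - len(dups)
print(f"{len(dups)} distinct items duplicated, {redundant} redundant copies")
```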
Furthermore, we have replicated the training and evaluation procedures with the current dataset. The results are presented below:
We are currently conducting additional experiments with a deduplicated version of the dataset, and will provide updated results upon completion.
Sincerely,