Duplicated items in Sky-T1_data_17k.json #33

fanqiwan opened this issue Jan 18, 2025 · 6 comments
fanqiwan commented Jan 18, 2025

Hello, SkyThought Team,

We have noticed that the Sky-T1_data_17k.json dataset contains duplicated items, which seem to originate from the 815 science and puzzle examples taken from STILL-2. We would like to ask whether these items were deliberately sampled twice or whether this is an oversight in the dataset preparation.

To illustrate this observation, we have conducted a simple analysis:

```python
import json

# Load the released SFT dataset.
with open("Sky-T1_data_17k.json", "r") as f:
    sky_t1_data: list[dict] = json.load(f)

# Each record is a two-turn conversation: [user question, assistant answer].
sky_t1_data_questions: list[str] = [data["conversations"][0]["value"] for data in sky_t1_data]
sky_t1_data_answers: list[str] = [data["conversations"][1]["value"] for data in sky_t1_data]

print(len(sky_t1_data))                 # 16401
print(len(set(sky_t1_data_questions)))  # 15586
print(len(set(sky_t1_data_answers)))    # 15586
```

Furthermore, we have replicated the training and evaluation procedures using the current dataset. The results are presented below:

| Model | AIME24 | MATH500 | GPQA Diamond | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| Sky-T1-32B-Preview (official) | 43.3 | 86.4 | 56.8 | - | - |
| Sky-T1-32B-SFT (temperature=0.0) | 0.3333 | 0.862 | 0.5303 | 0.8319 | 0.4423 |
| Sky-T1-32B-SFT (temperature=0.7) | 0.5 | 0.89 | 0.5505 | 0.8324 | 0.4951 |

We are currently conducting additional experiments with a deduplicated version of the dataset, and will provide updated results upon completion.
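
For concreteness, here is a minimal sketch of the deduplication we are running, which simply keeps the first occurrence of each question string (the output filename is illustrative; the actual procedure may differ):

```python
import json

with open("Sky-T1_data_17k.json", "r") as f:
    sky_t1_data = json.load(f)

# Keep the first occurrence of each distinct question; drop exact repeats.
seen_questions = set()
deduped = []
for item in sky_t1_data:
    question = item["conversations"][0]["value"]
    if question not in seen_questions:
        seen_questions.add(question)
        deduped.append(item)

print(len(deduped))  # should match the 15586 unique questions counted above

# Illustrative output path.
with open("Sky-T1_data_17k_dedup.json", "w") as f:
    json.dump(deduped, f, ensure_ascii=False, indent=2)
```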

Sincerely,

@DachengLi1 (Collaborator)

Thank you for your analysis! This is due to an oversight in the data preparation; we did not check for duplicate items in STILL-2.

Thank you for your replication! In the release, we reported numbers with temperature=0.7 (third row). Your numbers look higher than our released results.

fanqiwan commented Jan 19, 2025

We present results for the model fine-tuned on the deduplicated version of the dataset. After removing the duplicated science and puzzle data from the training set, the fine-tuned model shows a slight decline on AIME, MATH500, GPQA Diamond, and MMLU when evaluated at temperature 0.7, but a significant improvement on LiveCodeBench.

It seems like there is no strong correlation between task performance and the domain of the training data...

| Model | AIME24 | MATH500 | GPQA Diamond | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| Sky-T1-32B-Preview (official) | 43.3 | 86.4 | 56.8 | - | - |
| Sky-T1-32B-SFT (temperature=0.0) | 0.3333 | 0.862 | 0.5303 | 0.8319 | 0.4423 |
| Sky-T1-32B-SFT (temperature=0.7) | 0.5 | 0.89 | 0.5505 | 0.8324 | 0.4951 |
| Sky-T1-Dedup-32B-SFT (temperature=0.0) | 0.3333 | 0.864 | 0.5354 | 0.8312 | 0.5108 |
| Sky-T1-Dedup-32B-SFT (temperature=0.7) | 0.3333 | 0.876 | 0.5404 | 0.8298 | 0.5225 |

LetheSec commented Jan 20, 2025

> We present results for the model fine-tuned on the deduplicated version of the dataset. After removing the duplicated science and puzzle data from the training set, the fine-tuned model shows a slight decline on AIME, MATH500, GPQA Diamond, and MMLU when evaluated at temperature 0.7, but a significant improvement on LiveCodeBench.
>
> It seems like there is no strong correlation between task performance and the domain of the training data...

Thank you for sharing the results.
Have you tried training with the same system prompt using the AIBOX/long_form_thought_data_5k dataset released by the STILL-2 team? Although the Sky-T1 dataset contains more mathematical data, I think the model performance should be relatively close. I agree with your last sentence.

Also, based on your results before deduplication, there seems to be a significant performance difference between the two temperatures, which makes me a little confused. For mathematical problems, the reasoning steps and conclusions should be rigorous, so it seems that greedy decoding should be used to avoid randomness.
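
To make the comparison concrete, here is a minimal sketch of the two decoding settings, using the Hugging Face transformers `generate` API (the model name and prompt are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NovaSky-AI/Sky-T1-32B-Preview"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Solve: 2x + 3 = 11.", return_tensors="pt").to(model.device)

# Greedy decoding (deterministic, corresponds to temperature=0.0 in the tables).
greedy_ids = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# Sampling with temperature=0.7 (stochastic; results vary across runs).
sampled_ids = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=512)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))
```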

@fanqiwan

We use only the math-domain data from STILL-2, with the same system prompt, to fine-tune the base Qwen2.5-32B-Instruct model (STILL-2-Math-32B-SFT) and the Sky-T1-32B-SFT model (Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT). Here are the results; a sketch of the domain filtering follows the table.

| Model | AIME24 | MATH500 | GPQA Diamond | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| Sky-T1-32B-Preview (official) | 43.3 | 86.4 | 56.8 | - | - |
| Sky-T1-32B-SFT (temperature=0.0) | 0.3333 | 0.862 | 0.5303 | 0.8319 | 0.4423 |
| Sky-T1-32B-SFT (temperature=0.7) | 0.5 | 0.89 | 0.5505 | 0.8324 | 0.4951 |
| Sky-T1-Dedup-32B-SFT (temperature=0.0) | 0.3333 | 0.864 | 0.5354 | 0.8312 | 0.5108 |
| Sky-T1-Dedup-32B-SFT (temperature=0.7) | 0.3333 | 0.876 | 0.5404 | 0.8298 | 0.5225 |
| STILL-2-Math-32B-SFT (temperature=0.0) | 0.4 | 0.878 | 0.5354 | 0.8099 | 0.4971 |
| STILL-2-Math-32B-SFT (temperature=0.7) | 0.4 | 0.9 | 0.5606 | 0.808 | 0.5088 |
| Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT (temperature=0.0) | 0.4667 | 0.888 | 0.4848 | 0.8243 | 0.4932 |
| Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT (temperature=0.7) | 0.4667 | 0.872 | 0.4848 | 0.8222 | 0.5342 |
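
For reference, a minimal sketch of the math-domain filtering, assuming the STILL-2 data can be loaded from the Hugging Face Hub and that each record carries a `domain` label (both the dataset ID and the field name are assumptions; check the actual schema before use):

```python
from datasets import load_dataset

# Dataset ID as referenced above; verify the exact Hub ID and schema.
still2 = load_dataset("AIBOX/long_form_thought_data_5k", split="train")

# Keep only math-domain examples for the math-only SFT run.
math_only = still2.filter(lambda example: example["domain"] == "math")

print(len(still2), len(math_only))
```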

@LetheSec

> We use only the math-domain data from STILL-2, with the same system prompt, to fine-tune the base Qwen2.5-32B-Instruct model (STILL-2-Math-32B-SFT) and the Sky-T1-32B-SFT model (Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT). Here are the results.

Thank you for your results. It seems that "STILL-2-Math-32B-SFT (temperature=0.7)", fine-tuned only on math-domain data, achieves the best GPQA performance.

@fanqiwan

We also tested these models on MMLU-Pro, GSM8K, and ARC-C. The full results are shown below.

**Sampling with temperature 0.7**

| Model | AIME24 | MATH500 | GSM8K | GPQA Diamond | ARC-C | MMLU-Pro | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Official results** | | | | | | | | |
| o1-preview (official) | 0.4 | 0.814 | | 0.752 | | | | |
| QwQ-32B-Preview (official) | 0.5 | 0.854 | | 0.525 | | | | |
| Sky-T1-32B-Preview (official) | 0.433 | 0.824 | | 0.568 | | | | |
| **Our results** | | | | | | | | |
| QwQ-32B-Preview | 0.4333 | 0.888 | 0.9553 | 0.5354 | 0.9471 | 0.6343 | 0.8487 | 0.5166 |
| Sky-T1-32B-Preview | 0.3 | 0.872 | 0.9545 | 0.5354 | 0.9556 | 0.6483 | 0.819 | 0.4834 |
| Qwen2.5-72B-Instruct | 0.1667 | 0.836 | 0.9416 | 0.5253 | 0.9556 | 0.5301 | 0.8083 | 0.4501 |
| Qwen2.5-32B-Instruct | 0.1667 | 0.806 | 0.9393 | 0.4444 | 0.9539 | 0.5679 | 0.7948 | 0.5049 |
| Qwen2.5-14B-Instruct | 0.1333 | 0.788 | 0.9287 | 0.4242 | 0.9258 | 0.5045 | 0.7634 | 0.3894 |
| Qwen2.5-7B-Instruct | 0.1 | 0.758 | 0.8961 | 0.3434 | 0.7014 | 0.4824 | 0.7212 | 0.3346 |
| STILL-2-Math-32B-SFT | 0.4 | 0.9 | 0.9591 | 0.5606 | 0.9556 | 0.5968 | 0.808 | 0.5088 |
| Sky-T1-32B-SFT | 0.5 | 0.89 | 0.9522 | 0.5505 | 0.9667 | 0.6678 | 0.8324 | 0.4951 |
| Sky-T1-Dedup-32B-SFT | 0.3333 | 0.876 | 0.9515 | 0.5404 | 0.9608 | 0.6634 | 0.8298 | 0.5225 |
| Sky-T1-14B-SFT | - | 0.792 | 0.9462 | 0.4545 | 0.9369 | 0.579 | 0.784 | 0.3112 |
| Sky-T1-7B-SFT | 0.0667 | 0.716 | 0.9105 | 0.404 | 0.9147 | 0.4975 | 0.7131 | 0.1957 |
| Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT | 0.4667 | 0.872 | 0.9575 | 0.4848 | 0.9642 | 0.6372 | 0.8222 | 0.5342 |
