Duplicated items in Sky-T1_data_17k.json #33

fanqiwan opened this issue Jan 18, 2025 · 6 comments
fanqiwan commented Jan 18, 2025

Hello, SkyThought Team,

We have noticed that the Sky-T1_data_17k.json dataset contains duplicated items, which seem to originate from the 815 science and puzzle examples taken from STILL-2. We would like to ask whether these items were deliberately sampled twice or whether this is an oversight in the dataset preparation.

To illustrate this observation, we have conducted a simple analysis:

```python
import json

# Load the released SFT dataset.
with open("Sky-T1_data_17k.json", "r") as f:
    sky_t1_data: list[dict] = json.load(f)

# Each record is a two-turn conversation: [user question, assistant answer].
sky_t1_data_questions: list[str] = [data["conversations"][0]["value"] for data in sky_t1_data]
sky_t1_data_answers: list[str] = [data["conversations"][1]["value"] for data in sky_t1_data]

print(len(sky_t1_data))                 # 16401
print(len(set(sky_t1_data_questions)))  # 15586
print(len(set(sky_t1_data_answers)))    # 15586
```

Furthermore, we have replicated the training and evaluation procedures using the current dataset. The results are presented below:

| Model | AIME24 | MATH500 | GPQA Diamond | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| Sky-T1-32B-Preview (official) | 43.3 | 86.4 | 56.8 | - | - |
| Sky-T1-32B-SFT (temperature=0.0) | 0.3333 | 0.862 | 0.5303 | 0.8319 | 0.4423 |
| Sky-T1-32B-SFT (temperature=0.7) | 0.5 | 0.89 | 0.5505 | 0.8324 | 0.4951 |

We are currently conducting additional experiments with a deduplicated version of the dataset, and will provide updated results upon completion.
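
For concreteness, here is a minimal sketch of the deduplication we are running, which simply keeps the first occurrence of each question string (the output filename is illustrative; the actual procedure may differ):

```python
import json

with open("Sky-T1_data_17k.json", "r") as f:
    sky_t1_data = json.load(f)

# Keep the first occurrence of each distinct question; drop exact repeats.
seen_questions = set()
deduped = []
for item in sky_t1_data:
    question = item["conversations"][0]["value"]
    if question not in seen_questions:
        seen_questions.add(question)
        deduped.append(item)

print(len(deduped))  # should match the 15586 unique questions counted above

# Illustrative output path.
with open("Sky-T1_data_17k_dedup.json", "w") as f:
    json.dump(deduped, f, ensure_ascii=False, indent=2)
```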

Sincerely,

@DachengLi1 (Collaborator)

Thank you for your analysis! This is due to an oversight in the data preparation; we did not check for duplicate items in STILL-2.

Thank you for your replication! In the release, we reported numbers with temperature=0.7 (third row). Your numbers look higher than our released results.

fanqiwan commented Jan 19, 2025

We present results for the model fine-tuned on the deduplicated version of the dataset. After removing the duplicated science and puzzle data from the training set, the fine-tuned model shows a slight decline on AIME, MATH500, GPQA Diamond, and MMLU when evaluated at temperature 0.7, but a significant improvement on LiveCodeBench.

It seems like there is no strong correlation between task performance and the domain of the training data...

| Model | AIME24 | MATH500 | GPQA Diamond | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| Sky-T1-32B-Preview (official) | 43.3 | 86.4 | 56.8 | - | - |
| Sky-T1-32B-SFT (temperature=0.0) | 0.3333 | 0.862 | 0.5303 | 0.8319 | 0.4423 |
| Sky-T1-32B-SFT (temperature=0.7) | 0.5 | 0.89 | 0.5505 | 0.8324 | 0.4951 |
| Sky-T1-Dedup-32B-SFT (temperature=0.0) | 0.3333 | 0.864 | 0.5354 | 0.8312 | 0.5108 |
| Sky-T1-Dedup-32B-SFT (temperature=0.7) | 0.3333 | 0.876 | 0.5404 | 0.8298 | 0.5225 |

LetheSec commented Jan 20, 2025

> We present results for the model fine-tuned on the deduplicated version of the dataset. After removing the duplicated science and puzzle data from the training set, the fine-tuned model shows a slight decline on AIME, MATH500, GPQA Diamond, and MMLU when evaluated at temperature 0.7, but a significant improvement on LiveCodeBench.
>
> It seems like there is no strong correlation between task performance and the domain of the training data...

Thank you for sharing the results.
Have you tried training with the same system prompt using the AIBOX/long_form_thought_data_5k dataset released by the STILL-2 team? Although the Sky-T1 dataset contains more mathematical data, I think the model performance should be relatively close. I agree with your last sentence.

Also, based on your results before deduplication, there seems to be a significant performance difference between the two temperatures, which makes me a little confused. For mathematical problems, the reasoning steps and conclusions should be rigorous, so it seems that greedy decoding should be used to avoid randomness.
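
To make the comparison concrete, here is a minimal sketch of the two decoding settings, using the Hugging Face transformers `generate` API (the model name and prompt are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NovaSky-AI/Sky-T1-32B-Preview"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Solve: 2x + 3 = 11.", return_tensors="pt").to(model.device)

# Greedy decoding (deterministic, corresponds to temperature=0.0 in the tables).
greedy_ids = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# Sampling with temperature=0.7 (stochastic; results vary across runs).
sampled_ids = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=512)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))
```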

@fanqiwan

We use only the math-domain data from STILL-2, with the same system prompt, to fine-tune the base Qwen2.5-32B-Instruct model (STILL-2-Math-32B-SFT) and the Sky-T1-32B-SFT model (Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT). Here are the results; a sketch of the domain filtering follows the table.

| Model | AIME24 | MATH500 | GPQA Diamond | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| Sky-T1-32B-Preview (official) | 43.3 | 86.4 | 56.8 | - | - |
| Sky-T1-32B-SFT (temperature=0.0) | 0.3333 | 0.862 | 0.5303 | 0.8319 | 0.4423 |
| Sky-T1-32B-SFT (temperature=0.7) | 0.5 | 0.89 | 0.5505 | 0.8324 | 0.4951 |
| Sky-T1-Dedup-32B-SFT (temperature=0.0) | 0.3333 | 0.864 | 0.5354 | 0.8312 | 0.5108 |
| Sky-T1-Dedup-32B-SFT (temperature=0.7) | 0.3333 | 0.876 | 0.5404 | 0.8298 | 0.5225 |
| STILL-2-Math-32B-SFT (temperature=0.0) | 0.4 | 0.878 | 0.5354 | 0.8099 | 0.4971 |
| STILL-2-Math-32B-SFT (temperature=0.7) | 0.4 | 0.9 | 0.5606 | 0.808 | 0.5088 |
| Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT (temperature=0.0) | 0.4667 | 0.888 | 0.4848 | 0.8243 | 0.4932 |
| Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT (temperature=0.7) | 0.4667 | 0.872 | 0.4848 | 0.8222 | 0.5342 |
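
For reference, a minimal sketch of the math-domain filtering, assuming the STILL-2 data can be loaded from the Hugging Face Hub and that each record carries a `domain` label (both the dataset ID and the field name are assumptions; check the actual schema before use):

```python
from datasets import load_dataset

# Dataset ID as referenced above; verify the exact Hub ID and schema.
still2 = load_dataset("AIBOX/long_form_thought_data_5k", split="train")

# Keep only math-domain examples for the math-only SFT run.
math_only = still2.filter(lambda example: example["domain"] == "math")

print(len(still2), len(math_only))
```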

@LetheSec

> We use only the math-domain data from STILL-2, with the same system prompt, to fine-tune the base Qwen2.5-32B-Instruct model (STILL-2-Math-32B-SFT) and the Sky-T1-32B-SFT model (Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT). Here are the results.

Thank you for your results. It seems that "STILL-2-Math-32B-SFT (temperature=0.7)", fine-tuned only on math-domain data, achieves the best GPQA performance.

@fanqiwan

We also tested these models on MMLU-Pro, GSM8K, and ARC-C. The full results are shown below.

**Sampling with temperature 0.7**

| Model | AIME24 | MATH500 | GSM8K | GPQA Diamond | ARC-C | MMLU-Pro | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Official results** | | | | | | | | |
| o1-preview (official) | 0.4 | 0.814 | | 0.752 | | | | |
| QwQ-32B-Preview (official) | 0.5 | 0.854 | | 0.525 | | | | |
| Sky-T1-32B-Preview (official) | 0.433 | 0.824 | | 0.568 | | | | |
| **Our results** | | | | | | | | |
| QwQ-32B-Preview | 0.4333 | 0.888 | 0.9553 | 0.5354 | 0.9471 | 0.6343 | 0.8487 | 0.5166 |
| Sky-T1-32B-Preview | 0.3 | 0.872 | 0.9545 | 0.5354 | 0.9556 | 0.6483 | 0.819 | 0.4834 |
| Qwen2.5-72B-Instruct | 0.1667 | 0.836 | 0.9416 | 0.5253 | 0.9556 | 0.5301 | 0.8083 | 0.4501 |
| Qwen2.5-32B-Instruct | 0.1667 | 0.806 | 0.9393 | 0.4444 | 0.9539 | 0.5679 | 0.7948 | 0.5049 |
| Qwen2.5-14B-Instruct | 0.1333 | 0.788 | 0.9287 | 0.4242 | 0.9258 | 0.5045 | 0.7634 | 0.3894 |
| Qwen2.5-7B-Instruct | 0.1 | 0.758 | 0.8961 | 0.3434 | 0.7014 | 0.4824 | 0.7212 | 0.3346 |
| STILL-2-Math-32B-SFT | 0.4 | 0.9 | 0.9591 | 0.5606 | 0.9556 | 0.5968 | 0.808 | 0.5088 |
| Sky-T1-32B-SFT | 0.5 | 0.89 | 0.9522 | 0.5505 | 0.9667 | 0.6678 | 0.8324 | 0.4951 |
| Sky-T1-Dedup-32B-SFT | 0.3333 | 0.876 | 0.9515 | 0.5404 | 0.9608 | 0.6634 | 0.8298 | 0.5225 |
| Sky-T1-14B-SFT | - | 0.792 | 0.9462 | 0.4545 | 0.9369 | 0.579 | 0.784 | 0.3112 |
| Sky-T1-7B-SFT | 0.0667 | 0.716 | 0.9105 | 0.404 | 0.9147 | 0.4975 | 0.7131 | 0.1957 |
| Sky-T1-32B-SFT-STILL-2-Math-Off-Policy-SFT | 0.4667 | 0.872 | 0.9575 | 0.4848 | 0.9642 | 0.6372 | 0.8222 | 0.5342 |
