Merge pull request #126 from ninehills/create-pull-request/patch

Changes by create-pull-request action
ninehills · Mar 3, 2025 · c3c21e1
2 parents 8b02f15 + 4133b86
commit c3c21e1
Showing 1 changed file with 16 additions and 6 deletions.
diff --git a/articles/121.md b/articles/121.md
@@ -6,18 +6,27 @@
 > Link and comments: <https://github.com/ninehills/blog/issues/121>  
 
 
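-With the release of DeepSeek R1, if you want to replicate R1 or apply RFT (Reinforcement Fine-Tuning) in a specific domain, take a look at the list I have compiled; it will be updated continuously.
-I will also add the results of my own experiments.
+DeepSeek R1 related material. I have personally read and hand-picked every item; this is not a plain listing.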
 
-> Last updated: 2025.1.29
+> Last updated: 2025.3.1
 
+- Articles
+	- [Reasoning best practices](https://platform.openai.com/docs/guides/reasoning-best-practices): **[Key]** OpenAI's best practices for reasoning models. A must-read.
+	- Greg's prompt for reasoning models: ![Image](https://github.com/user-attachments/assets/9ea9e1ea-a1b0-4971-9a27-a6a70b19541b)
+	- [Understanding Reasoning LLMs](https://magazine.sebastianraschka.com/p/understanding-reasoning-llms): a somewhat more academic article.
+	- [A Visual Guide to Reasoning LLMs](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms): **[Key]** An excellent introduction with very good visualizations.
+	- [DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge](https://huggingface.co/blog/NormalUhr/grpo): a non-mathematical explanation of the GRPO algorithm, suited to readers who do not work on algorithms.
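 - Papers
-	- [DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf): the DeepSeek R1 paper itself, an engaging read.
+	- [DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf): **[Key]** the DeepSeek R1 paper itself, an engaging read.
 	- [Kimi K1.5](https://arxiv.org/pdf/2501.12599v1): Kimi K1.5's approach to reasoning models is similar to R1's, with more detail on data and reward functions.
 	- [DeepSeek Math](https://arxiv.org/pdf/2402.03300): introduces the GRPO algorithm. Compared with PPO, GRPO drops the value model, which lowers the GPU memory needed for training.
+	- [SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training](https://arxiv.org/abs/2501.17161): a study of the effects and application areas of SFT versus RL. Treat its conclusions as a reference only; much more practical validation is needed.
+	- Many reasoning-model papers have appeared recently, but none has yet stood the test of time; more will be added here as I keep reading.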
 - Open-source GRPO implementations: the main requirement is support for reward functions.
-	- [trl grpo trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer): TRL's GRPOTrainer implementation; not yet released, so you need to install trl from the main branch.
+	- [trl grpo trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer): TRL's GRPOTrainer implementation
 	- [veRL](https://github.com/volcengine/verl): ByteDance's open-source RL implementation; it also supports GRPO reward functions.
+	- [Unsloth](https://docs.unsloth.ai/basics/reasoning-grpo-and-rl): **[Key]** Unsloth's GRPO implementation, which can greatly reduce GPU memory usage.
+	- [verifiers](https://github.com/willccbb/verifiers): a set of ready-made verifiers.
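 - R1 replication projects and datasets
 	- [open-r1](https://github.com/huggingface/open-r1/): **[Key]** includes code for data synthesis, SFT, and GRPO RL.
 	- [TinyZero](https://github.com/Jiayi-Pan/TinyZero): replicates the R1 RL paradigm on a simple problem similar to the 24 game.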
@@ -27,4 +36,5 @@
 	- [open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal): a multimodal replication of R1
 	- [open-thoughts](https://github.com/open-thoughts/open-thoughts): **[Key]** the most mature R1 replication project; it has released the [Bespoke-Stratos-17k dataset](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k) and the [OpenThoughts-114k dataset](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), and SFT alone is enough to approach the R1-distill models
 	- [R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT): an R1 distillation dataset with 1.68M samples
-	- [grpo_demo.py](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb): **[Key]** an RL demo based on a 0.5B model, useful for learning how to train.
+	- [grpo_demo.py](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb): **[Key]** an RL demo based on a 0.5B model, useful for learning how to train.
+	- [Chinese-DeepSeek-R1-Distill-data-110k](https://huggingface.co/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k): a Chinese distillation dataset
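
The list above makes two points about GRPO: it does away with PPO's value model, and its implementations revolve around a reward function. A minimal sketch of both ideas follows, using only the Python standard library. The names `format_reward` and `group_relative_advantages`, the `<think>`/`<answer>` tag layout, and the use of the population standard deviation are assumptions made for illustration; this is not the API of TRL, veRL, or Unsloth, whose trainers handle sampling, batching, and the policy update themselves.

```python
import re
from statistics import mean, pstdev


def format_reward(completion: str) -> float:
    """Hypothetical rule-based reward: 1.0 when the completion follows a
    <think>...</think><answer>...</answer> layout, otherwise 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    matched = re.match(pattern, completion.strip(), flags=re.DOTALL)
    return 1.0 if matched else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the
    same prompt. The group mean serves as the baseline, so no separate
    value model is needed to estimate it.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four made-up completions for a single prompt (one "group").
    group = [
        "<think>2 + 2 is 4</think><answer>4</answer>",
        "The answer is 4.",
        "<think>add the numbers</think>\n<answer>4</answer>",
        "<answer>4</answer>",
    ]
    rewards = [format_reward(c) for c in group]
    advantages = group_relative_advantages(rewards)
    for completion, reward, adv in zip(group, rewards, advantages):
        print(f"reward={reward:.1f} advantage={adv:+.3f} {completion[:30]!r}")
```

Completions that score above the group average receive a positive advantage and those below receive a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.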