Merge pull request #30 from RaviSriTejaKuriseti/main
Update README.md till Feb 23
omarsar authored Feb 26, 2025
2 parents dd930db + 752639b commit d4f0e1f
Showing 1 changed file (README.md) with 15 additions and 0 deletions.
@@ -11,6 +11,7 @@ Here is the weekly series:
Here is the weekly series:

## 2025
- [Top ML Papers of the Week (February 17 - February 23)](./#top-ml-papers-of-the-week-february-17---february-23---2025)
- [Top ML Papers of the Week (February 10 - February 16)](./#top-ml-papers-of-the-week-february-10---february-16---2025)
- [Top ML Papers of the Week (February 3 - February 9)](./#top-ml-papers-of-the-week-february-3---february-9---2025)
- [Top ML Papers of the Week (January 27 - February 2)](./#top-ml-papers-of-the-week-january-27---february-2---2025)
@@ -132,6 +133,20 @@ Here is the weekly series:

[Join our Discord](https://discord.gg/SKgkVT8BGJ)

## Top ML Papers of the Week (February 17 - February 23) - 2025
| **Paper** | **Links** |
| ------------- | ------------- |
| 1) **AI Co-Scientist** – Google introduces AI co-scientist, a multi-agent AI system built with Gemini 2.0 to help accelerate scientific breakthroughs. Key highlights: <br> ● What's the goal of this AI co-scientist? – It can serve as a "virtual scientific collaborator to help scientists generate novel hypotheses and research proposals, and to accelerate the clock speed of scientific and biomedical discoveries." <br> ● How is it built? – It uses a coalition of specialized agents inspired by the scientific method. It can generate, evaluate, and refine hypotheses. It also has self-improving capabilities. <br> ● Collaboration and tools are key! – Scientists can either propose ideas or provide feedback on outputs generated by the agentic system. Tools like web search and specialized AI models improve the quality of responses. <br> ● Hierarchical Multi-Agent System – AI co-scientist is built with a Supervisor agent that assigns tasks to specialized agents. According to the report, this architecture helps with scaling compute and iteratively improving scientific reasoning. <br> ● Test-time Compute – AI co-scientist leverages test-time compute scaling to iteratively reason, evolve, and improve outputs. Self-play, self-critique, and self-improvement are all important to generate and refine hypotheses and proposals. <br> ● Performance? – Self-improvement relies on the Elo auto-evaluation metric. On GPQA diamond questions, they found that "higher Elo ratings positively correlate with a higher probability of correct answers." AI co-scientist outperforms other SoTA agentic and reasoning models on complex problems generated by domain experts. Performance increases with more time spent on reasoning, surpassing unassisted human experts. Experts assessed the AI co-scientist to have a higher potential for novelty and impact. It was even preferred over other models like OpenAI o1. | [Paper](https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf), [Tweet](https://x.com/omarsar0/status/1892223515660579219) |
| 2) **The AI CUDA Engineer** – Sakana AI introduces The AI CUDA Engineer, an end-to-end agentic system that can produce highly optimized CUDA kernels. Key contributions: <br> ● Why is this research important? – Writing efficient CUDA kernels is challenging for humans. The AI CUDA Engineer is an end-to-end agent built with the capabilities to automatically produce and optimize CUDA kernels more effectively. <br> ● What's up with CUDA? – Writing CUDA kernels can help achieve high-performing AI algorithms. However, this requires GPU knowledge, and most AI algorithms today are written in a higher-level abstraction layer such as PyTorch. <br> ● An Agentic Pipeline – The agent translates PyTorch code into CUDA kernels (Stages 1 & 2), then applies evolutionary optimization (Stage 3) like crossover prompting, leading to an Innovation Archive (Stage 4) that reuses “stepping stone” kernels for further gains. <br> ● Kernel Runtime Speedups – The team claims that The AI CUDA Engineer discovers CUDA kernels with speedups as high as 10-100x over native and compiled PyTorch kernels. It can also convert entire ML architectures into optimized CUDA kernels. Online users have challenged the [claimed speedups](https://x.com/main_horse/status/1892446384910987718) (Sakana AI has provided an [update](https://x.com/SakanaAILabs/status/1892385766510338559) on the issue). <br> ● Performance – The AI CUDA Engineer robustly translates PyTorch code to CUDA kernels, achieving a translation success rate above 90%. <br> ● Highlighted AI CUDA Engineer-Discovered Kernels – Another claim is that The AI CUDA Engineer can robustly improve CUDA runtime: it outperforms native PyTorch runtimes on 81% of the 229 considered tasks, and 20% of all discovered CUDA kernels are at least twice as fast as their PyTorch implementations. <br> ● The AI CUDA Engineer Archive – The team has released an archive of more than 17,000 verified CUDA kernels, which can be used for downstream fine-tuning of LLMs. There is also a website to explore the verified kernels. | [Technical Report](https://pub.sakana.ai/static/paper.pdf), [Blog](https://sakana.ai/ai-cuda-engineer/), [Dataset](https://pub.sakana.ai/ai-cuda-engineer), [Tweet](https://x.com/SakanaAILabs/status/1892385766510338559) |
| 3) **Native Sparse Attention** – DeepSeek-AI and collaborators present Native Sparse Attention (NSA), a novel sparse attention mechanism designed to improve computational efficiency while maintaining model performance in long-context language modeling. Key contributions: <br> ● Hierarchical Sparse Attention – NSA combines coarse-grained compression, fine-grained token selection, and sliding window mechanisms to balance global context awareness and local precision. <br> ● Hardware-Aligned Optimization – The authors introduce a blockwise sparse attention mechanism optimized for Tensor Core utilization, reducing memory bandwidth constraints and enhancing efficiency. <br> ● End-to-End Trainability – Unlike prior sparse attention methods that focus mainly on inference, NSA enables fully trainable sparsity, reducing pretraining costs while preserving model capabilities. Results and impact: <br> ● Outperforms Full Attention – Despite being sparse, NSA matches or exceeds Full Attention on general benchmarks, long-context reasoning, and instruction-based tasks. <br> ● Massive Speedups – NSA achieves up to 11.6× speedup over Full Attention on 64k-token sequences across all stages (decoding, forward, and backward passes). <br> ● Strong Long-Context Performance – In 64k Needle-in-a-Haystack retrieval, NSA achieves perfect accuracy, significantly outperforming other sparse methods. <br> ● Enhanced Chain-of-Thought Reasoning – Fine-tuned NSA surpasses Full Attention on AIME mathematical reasoning tasks, suggesting improved handling of long-range logical dependencies. By making sparse attention natively trainable and optimized for modern hardware, NSA provides a scalable solution for next-gen LLMs handling extremely long contexts. A toy Python sketch of the three-branch design appears after this table. | [Paper](https://arxiv.org/abs/2502.11089), [Tweet](https://x.com/deepseek_ai/status/1891745487071609327) |
| 4) **Large Language Diffusion Model** – Proposes LLaDA, a diffusion-based approach that can match or beat leading autoregressive LLMs on many tasks. Key highlights: <br> ● Questioning autoregressive dominance – While almost all large language models (LLMs) use the next-token prediction paradigm, the authors propose that key capabilities (scalability, in-context learning, instruction-following) actually derive from general generative principles rather than strictly from autoregressive modeling. <br> ● Masked diffusion + Transformers – LLaDA is built on a masked diffusion framework that learns by progressively masking tokens and training a Transformer to recover the original text. This yields a non-autoregressive generative model—potentially addressing left-to-right constraints in standard LLMs. <br> ● Strong scalability – Trained on 2.3T tokens (8B parameters), LLaDA performs competitively with top LLaMA-based LLMs across math (GSM8K, MATH), code (HumanEval), and general benchmarks (MMLU). It demonstrates that the diffusion paradigm scales about as well as autoregressive baselines. <br> ● Breaks the “reversal curse” – LLaDA shows balanced forward/backward reasoning, outperforming GPT-4 and other AR models on reversal tasks (e.g., reversing a poem line). Because diffusion does not enforce left-to-right generation, it is robust at backward completions. <br> ● Multi-turn dialogue and instruction-following – After supervised fine-tuning, LLaDA can carry on multi-turn conversations. It exhibits strong instruction adherence and fluency similar to chat-based AR LLMs—further evidence that advanced LLM traits do not necessarily rely on autoregression. A minimal sketch of the masked-diffusion training objective appears after this table. | [Paper](https://arxiv.org/abs/2502.09992), [Tweet](https://x.com/omarsar0/status/1891568386494300252) |
| 5) **SWE-Lancer** – Researchers from OpenAI introduce SWE-Lancer, a benchmark evaluating LLMs on 1,488 real-world freelance software engineering tasks from Upwork, collectively worth $1M in payouts. Key takeaways: <br> ● A new benchmark for software engineering automation – Unlike previous coding benchmarks focused on isolated tasks (e.g., program synthesis, competitive programming), SWE-Lancer tests full-stack engineering and managerial decision-making. It evaluates both Individual Contributor (IC) SWE tasks, where models write and debug code, and SWE Manager tasks, where models select the best technical proposal. <br> ● Real-world economic impact – Each task has a verifiable monetary value, mirroring freelance market rates. Payouts range from $250 bug fixes to $32,000 feature implementations. The benchmark maps model performance to earnings, offering a tangible metric for automation potential. <br> ● Rigorous evaluation with end-to-end tests – Unlike unit-test-based benchmarks, SWE-Lancer employs browser-driven, triple-verified end-to-end (E2E) tests developed by professional engineers. These tests reflect real-world software validation and prevent grading hacks. <br> ● Challenging tasks remain unsolved – Even the best-performing model, Claude 3.5 Sonnet, only solves 26.2% of IC SWE tasks and 44.9% of SWE Manager tasks, earning $208K out of $500.8K in the open-source SWE-Lancer Diamond set. This highlights the gap between current AI capabilities and human software engineers. | [Paper](https://arxiv.org/abs/2502.12115), [Tweet](https://x.com/OpenAI/status/1891911123517018521) |
| 6) **Optimizing Model Selection for Compound AI** – Researchers from Microsoft Research and collaborators introduce LLMSelector, a framework to improve multi-call LLM pipelines by selecting the best model per module instead of using one LLM everywhere. Key insights include: <br> ● Large performance boost with per-module model choices – Rather than relying on a single LLM for each sub-task in compound systems, the authors show that mixing different LLMs can yield 5%–70% higher accuracy. Each model has unique strengths (e.g., better at critique vs. generation), so assigning models to modules selectively substantially improves end-to-end results. <br> ● LLMSelector algorithm – They propose an iterative routine that assigns an optimal model to each module, guided by a novel “LLM diagnoser” to estimate per-module performance. The procedure scales linearly with the number of modules—far more efficient than exhaustive search. <br> ● Monotonicity insights – Empirically, boosting any single module’s performance (while holding others fixed) often improves the overall system. This motivates an approximate factorization approach, where local gains translate into global improvements. LLMSelector works for any static compound system with fixed modules (e.g., generator–critic–refiner). A sketch of the greedy per-module selection loop appears after this table. | [Paper](https://arxiv.org/abs/2502.14815), [Tweet](https://x.com/omarsar0/status/1892945381174210933) |
| 7) **Open-Reasoner-Zero** – Open-Reasoner-Zero (ORZ) is an open-source large-scale minimalist reinforcement learning (RL) framework that enhances reasoning capabilities. ORZ demonstrates significant scalability, requiring only 1/30th of the training steps of DeepSeek-R1-Zero-Qwen-32B to outperform it on GPQA Diamond. Key contributions and findings: <br> ● Minimalist RL Training Works – Unlike traditional RLHF setups, ORZ removes KL regularization and relies on vanilla PPO with GAE (λ=1, γ=1) and a simple rule-based reward function to scale both response length and reasoning accuracy. <br> ● Outperforms Closed-Source Models – ORZ-32B beats DeepSeek-R1-Zero-Qwen-32B on GPQA Diamond while using significantly fewer training steps, showing that training efficiency can be drastically improved with a streamlined RL pipeline. <br> ● Emergent Reasoning Abilities – ORZ exhibits "step moments", where response lengths and accuracy suddenly increase, indicating emergent reasoning capabilities with continued training. <br> ● Massive Scaling Potential – ORZ’s response length scaling mirrors trends seen in DeepSeek-R1-Zero (671B MoE), but with 5.8x fewer training steps. Training shows no signs of saturation, hinting at even further gains with continued scaling. <br> ● Fully Open-Source – The training code, model weights, data, and hyperparameters are all released, ensuring reproducibility and enabling broader adoption in the research community. <br> ● Mathematical & Logical Reasoning – ORZ significantly improves accuracy on benchmarks like MATH500, AIME2024, and AIME2025 with a simple binary reward that only evaluates answer correctness. <br> ● Generalization – Without any instruction tuning, ORZ-32B outperforms Qwen2.5-32B-Instruct on MMLU_PRO, showcasing strong reasoning generalization despite being trained purely with RL. A minimal sketch of the binary reward and GAE setup appears after this table. | [Paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf), [Tweet](https://x.com/CyouSakura/status/1892428094075502960) |
| 8) **MoBA** – MoBA is a new attention mechanism that enhances efficiency in handling long-context sequences for LLMs while maintaining strong performance. Key insights: <br> ● Adaptive Attention for Long Contexts – MoBA applies the Mixture of Experts (MoE) paradigm to the attention mechanism, allowing each query token to attend selectively to the most relevant key-value blocks rather than the full context. This enables models to handle extended sequences efficiently. <br> ● Seamless Transition Between Full and Sparse Attention – Unlike static sparse attention methods like sliding window or sink attention, MoBA can dynamically switch between full and sparse attention modes, ensuring adaptability without sacrificing generalization. <br> ● Improved Computational Efficiency – By partitioning sequences into blocks and using a gating mechanism to route queries, MoBA significantly reduces computational complexity, achieving up to 6.5× speedup over FlashAttention in prefill and scaling efficiently to 10M tokens with a 16× reduction in computation time. <br> ● Comparable Performance to Full Attention – Extensive experiments show that MoBA achieves language modeling loss and benchmark performance nearly identical to full attention, even at high sparsity levels (~95.31%). It matches full attention in long-context benchmarks like Needle in a Haystack and RULER@128K. <br> ● Hybrid MoBA-Full Attention Strategy – MoBA can be integrated flexibly with standard Transformers, allowing for layer-wise hybridization (mixing MoBA and full attention at different layers), which improves supervised fine-tuning (SFT) stability and long-context retention. A compact sketch of the block-routing idea appears after this table. | [Paper](https://github.com/MoonshotAI/MoBA/blob/master/MoBA_Tech_Report.pdf), [Tweet](https://x.com/Kimi_Moonshot/status/1891825059599352259) |
| 9) **The Danger of Overthinking** – This paper investigates overthinking in Large Reasoning Models (LRMs)—a phenomenon where models prioritize extended internal reasoning over interacting with their environment. The study analyzes 4,018 software engineering task trajectories to understand how reasoning models handle decision-making in agentic settings. Key findings: <br> ● Overthinking reduces task performance – Higher overthinking scores (favoring internal reasoning over real-world feedback) correlate with lower issue resolution rates, especially in reasoning-optimized models. Simple interventions, like selecting solutions with the lowest overthinking scores, improve performance by 30% while reducing compute costs by 43%. <br> ● Three failure patterns identified – The study categorizes overthinking into three distinct failure patterns. <br> ● Reasoning models are more prone to overthinking – Compared to non-reasoning models, LRMs exhibit 3× higher overthinking scores on average, despite their superior reasoning capabilities. <br> ● Function calling mitigates overthinking – Models with native function-calling support show significantly lower overthinking scores, suggesting structured execution pathways improve efficiency in agentic environments. <br> ● Scaling and mitigation strategies – The researchers propose reinforcement learning adjustments and function-calling optimizations to curb overthinking while maintaining strong reasoning capabilities. | [Paper](https://www.arxiv.org/abs/2502.08235), [Tweet](https://x.com/Alex_Cuadron/status/1890533660434321873) |
| 10) **Inner Thinking Transformers** – Inner Thinking Transformer (ITT) is a new method that enhances reasoning efficiency in small-scale LLMs via dynamic depth scaling. ITT aims to mitigate parameter bottlenecks in LLMs, providing scalable reasoning efficiency without expanding model size. Key contributions: <br> ● Adaptive Token Processing – ITT dynamically allocates extra computation to complex tokens using Adaptive Token Routing. This allows the model to focus on difficult reasoning steps while efficiently handling simple tokens. <br> ● Residual Thinking Connections (RTC) – A new residual accumulation mechanism iteratively refines token representations, allowing the model to self-correct without increasing parameters. <br> ● Test-Time Scaling without Extra Parameters – ITT achieves 96.5% of a 466M Transformer’s accuracy using only 162M parameters, reducing training data needs by 43.2% while outperforming loop-based alternatives across 11 benchmarks. <br> ● Elastic Deep Thinking – ITT allows flexible scaling of computation at inference time, dynamically trading off accuracy and efficiency. A toy sketch of the routing-and-refinement loop appears after this table. | [Paper](https://arxiv.org/abs/2502.13842v1), [Tweet](https://x.com/dair_ai/status/1893308342073991258) |
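
The following minimal Python sketches are unofficial illustrations of a few of the mechanisms summarized above; function names, shapes, and hyperparameters are assumptions rather than the papers' implementations. First, the three-branch idea behind Native Sparse Attention (entry 3): a compression branch over pooled block summaries, a selection branch over the top-scoring blocks, and a sliding-window branch, mixed by per-branch gates (fixed here, learned in NSA).

```python
# Toy, single-head, single-query illustration of an NSA-like three-branch
# attention. Block size, top_blocks, window, and the fixed gates are
# illustrative only; the real method learns the gates and runs fused kernels.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def nsa_like_attention(q, K, V, block=16, top_blocks=2, window=32, gates=(1/3, 1/3, 1/3)):
    T, d = K.shape
    n_blocks = T // block
    # 1) Compression branch: attend over mean-pooled block summaries.
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    out_cmp = attend(q, Kc, Vc)
    # 2) Selection branch: keep full tokens only in the top-scoring blocks.
    block_scores = Kc @ q
    keep = np.argsort(block_scores)[-top_blocks:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    out_sel = attend(q, K[idx], V[idx])
    # 3) Sliding-window branch: attend over the most recent tokens only.
    out_win = attend(q, K[-window:], V[-window:])
    # Mix the three branches with (here fixed, in NSA learned) gate weights.
    g1, g2, g3 = gates
    return g1 * out_cmp + g2 * out_sel + g3 * out_win

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 256, 64
    q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))
    print(nsa_like_attention(q, K, V).shape)  # (64,)
```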
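
A sketch of the masked-diffusion objective described for LLaDA (entry 4): sample a masking ratio, mask tokens at that rate, and train the model to recover the originals at the masked positions. The `model` function here is a random-logits stand-in, not the paper's Transformer.

```python
# Minimal masked-diffusion training step in the spirit of LLaDA (entry 4).
import numpy as np

VOCAB, MASK_ID = 1000, 0
rng = np.random.default_rng(0)

def model(tokens):
    """Stand-in for a bidirectional Transformer: per-position logits over the vocab."""
    return rng.normal(size=(len(tokens), VOCAB))

def masked_diffusion_loss(tokens):
    t = rng.uniform(1e-3, 1.0)                      # masking ratio for this step
    mask = rng.random(len(tokens)) < t              # mask each token with prob t
    noised = np.where(mask, MASK_ID, tokens)
    logits = model(noised)
    # Cross-entropy on masked positions only, reweighted by 1/t as in masked diffusion.
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(tokens)), tokens]
    return (mask * nll).sum() / (t * len(tokens))

print(masked_diffusion_loss(rng.integers(1, VOCAB, size=128)))
```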
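
A sketch of the per-module selection idea behind LLMSelector (entry 6) as greedy coordinate ascent: for each module, try each candidate model while holding the rest of the assignment fixed and keep whatever improves the validation score. The `evaluate` function and model names are hypothetical placeholders; the paper additionally uses an LLM diagnoser to guide per-module estimates.

```python
# Greedy per-module model selection for a static compound system (entry 6).
from itertools import count

MODELS = ["model-a", "model-b", "model-c"]          # candidate LLMs (placeholders)
MODULES = ["generator", "critic", "refiner"]        # modules of a static compound system

def evaluate(assignment: dict) -> float:
    """Stand-in scorer: in practice, run the pipeline end to end on held-out data."""
    return sum(hash((m, assignment[m])) % 100 for m in MODULES) / 100.0

def llm_selector(modules, models, evaluate):
    assignment = {m: models[0] for m in modules}    # start with one model everywhere
    best = evaluate(assignment)
    for _ in count():
        improved = False
        for module in modules:                      # one pass = O(|modules| * |models|) evals
            for candidate in models:
                trial = {**assignment, module: candidate}
                score = evaluate(trial)
                if score > best:
                    assignment, best, improved = trial, score, True
        if not improved:                            # stop when a full pass changes nothing
            return assignment, best

print(llm_selector(MODULES, MODELS, evaluate))
```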
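
Two ingredients of the Open-Reasoner-Zero recipe (entry 7), sketched under stated assumptions: a rule-based binary reward that only checks the final answer, and GAE with λ = 1, γ = 1, which reduces to return-to-go minus the value baseline. The answer-extraction regex and the toy value estimates are illustrative, not the released implementation.

```python
# Binary rule-based reward and GAE(lambda=1, gamma=1), as described for ORZ (entry 7).
import re
import numpy as np

def binary_reward(response: str, reference_answer: str) -> float:
    """1.0 if the text after the final 'Answer:' matches the reference, else 0.0 (illustrative rule)."""
    match = re.search(r"Answer:\s*(.+)\s*$", response.strip())
    return float(bool(match) and match.group(1).strip() == reference_answer.strip())

def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation; with gamma = lam = 1 this is return-to-go minus V(s_t)."""
    values = np.append(values, 0.0)                 # bootstrap value of the terminal state
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([0.0, 0.0, binary_reward("... Answer: 42", "42")])  # reward only at the end
values = np.array([0.3, 0.5, 0.8])                                     # toy critic estimates
print(gae(rewards, values))                                            # [0.7 0.5 0.2]
```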
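
The MoE-style routing behind MoBA (entry 8), sketched per query: keys are partitioned into blocks, a gate scores each query against mean-pooled block representatives, and attention runs only inside the top-k blocks. Causal masking and the fused kernel of the real implementation are omitted.

```python
# Compact MoBA-like block routing: each query attends only within its top-k blocks (entry 8).
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moba_like_attention(Q, K, V, block=16, k=2):
    T, d = K.shape
    n_blocks = T // block
    block_keys = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    out = np.zeros_like(Q)
    for i, q in enumerate(Q):
        gate = block_keys @ q                       # router: query vs. block summaries
        chosen = np.argsort(gate)[-k:]              # top-k blocks for this query
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in chosen])
        scores = K[idx] @ q / np.sqrt(d)
        out[i] = softmax(scores) @ V[idx]           # attend only inside the chosen blocks
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 32)) for _ in range(3))
print(moba_like_attention(Q, K, V).shape)           # (64, 32)
```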
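
A toy sketch of the two mechanisms described for Inner Thinking Transformer (entry 10): a router selects the "hardest" tokens, and a shared layer is re-applied to them for extra inner-thinking steps, each pass accumulated as a scaled residual (the Residual Thinking Connection idea). The router score, layer, and weights are stand-ins, not the paper's parameterization.

```python
# Adaptive token routing + residual inner-thinking steps, roughly in the spirit of ITT (entry 10).
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(scale=0.1, size=(d, d))              # stand-in for one shared Transformer layer

def layer(h):
    return np.tanh(h @ W)

def router_scores(h):
    """Stand-in difficulty score per token; ITT learns this routing."""
    return np.linalg.norm(h, axis=-1)

def inner_thinking_forward(h, extra_steps=2, top_frac=0.25, step_weight=0.5):
    h = h + layer(h)                                # every token gets one ordinary pass
    k = max(1, int(top_frac * len(h)))
    hard = np.argsort(router_scores(h))[-k:]        # route only the hardest tokens
    for _ in range(extra_steps):                    # extra depth without extra parameters
        h[hard] = h[hard] + step_weight * layer(h[hard])
    return h

tokens = rng.normal(size=(16, d))
print(inner_thinking_forward(tokens).shape)         # (16, 32)
```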

## Top ML Papers of the Week (February 10 - February 16) - 2025
| **Paper** | **Links** |
| ------------- | ------------- |
