Merge pull request #30 from RaviSriTejaKuriseti/main
Update README.md till Feb 23
omarsar authored Feb 26, 2025
2 parents dd930db + 752639b commit d4f0e1f
Showing 1 changed file (README.md) with 15 additions and 0 deletions.
@@ -11,6 +11,7 @@ Here is the weekly series:
Here is the weekly series:

## 2025
- [Top ML Papers of the Week (February 17 - February 23)](./#top-ml-papers-of-the-week-february-17---february-23---2025)
- [Top ML Papers of the Week (February 10 - February 16)](./#top-ml-papers-of-the-week-february-10---february-16---2025)
- [Top ML Papers of the Week (February 3 - February 9)](./#top-ml-papers-of-the-week-february-3---february-9---2025)
- [Top ML Papers of the Week (January 27 - February 2)](./#top-ml-papers-of-the-week-january-27---february-2---2025)
@@ -132,6 +133,20 @@ Here is the weekly series:

[Join our Discord](https://discord.gg/SKgkVT8BGJ)

## Top ML Papers of the Week (February 17 - February 23) - 2025
| **Paper** | **Links** |
| ------------- | ------------- |
| 1) **AI Co-Scientist** – Google introduces AI co-scientist, a multi-agent AI system built with Gemini 2.0 to help accelerate scientific breakthroughs. Key highlights: <br> ● What's the goal of this AI co-scientist? – It can serve as a "virtual scientific collaborator to help scientists generate novel hypotheses and research proposals, and to accelerate the clock speed of scientific and biomedical discoveries." <br> ● How is it built? – It uses a coalition of specialized agents inspired by the scientific method. It can generate, evaluate, and refine hypotheses. It also has self-improving capabilities. <br> ● Collaboration and tools are key! – Scientists can either propose ideas or provide feedback on outputs generated by the agentic system. Tools like web search and specialized AI models improve the quality of responses. <br> ● Hierarchical Multi-Agent System – AI co-scientist is built with a Supervisor agent that assigns tasks to specialized agents. According to the report, this architecture helps with scaling compute and iteratively improving scientific reasoning. <br> ● Test-time Compute – AI co-scientist leverages test-time compute scaling to iteratively reason, evolve, and improve outputs. Self-play, self-critique, and self-improvement are all important to generate and refine hypotheses and proposals. <br> ● Performance? – Self-improvement relies on the Elo auto-evaluation metric. On GPQA diamond questions, they found that "higher Elo ratings positively correlate with a higher probability of correct answers." AI co-scientist outperforms other SoTA agentic and reasoning models on complex problems generated by domain experts. Performance increases with more time spent on reasoning, surpassing unassisted human experts. Experts assessed the AI co-scientist to have a higher potential for novelty and impact. It was even preferred over other models like OpenAI o1. | [Paper](https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf), [Tweet](https://x.com/omarsar0/status/1892223515660579219) |
| 2) **The AI CUDA Engineer** – Sakana AI introduces The AI CUDA Engineer, an end-to-end agentic system that can produce highly optimized CUDA kernels. Key contributions: <br> ● Why is this research important? – Writing efficient CUDA kernels is challenging for humans. The AI CUDA Engineer is an end-to-end agent built with the capabilities to automatically produce and optimize CUDA kernels more effectively. <br> ● What's up with CUDA? – Writing CUDA kernels can help achieve high-performing AI algorithms. However, this requires GPU knowledge, and most AI algorithms today are written in a higher-level abstraction layer such as PyTorch. <br> ● An Agentic Pipeline – The agent translates PyTorch code into CUDA kernels (Stages 1 & 2), then applies evolutionary optimization (Stage 3) like crossover prompting, leading to an Innovation Archive (Stage 4) that reuses “stepping stone” kernels for further gains. <br> ● Kernel Runtime Speedups – The team claims that The AI CUDA Engineer discovers CUDA kernels with speedups as high as 10-100x over native and compiled PyTorch kernels. It can also convert entire ML architectures into optimized CUDA kernels. Online users have challenged the [claimed speedups](https://x.com/main_horse/status/1892446384910987718) (Sakana AI has provided an [update](https://x.com/SakanaAILabs/status/1892385766510338559) on the issue). <br> ● Performance – The AI CUDA Engineer robustly translates PyTorch code to CUDA kernels, achieving a translation success rate above 90%. <br> ● Highlighted AI CUDA Engineer-Discovered Kernels – Another claim is that The AI CUDA Engineer can robustly improve CUDA runtime: it outperforms native PyTorch runtimes on 81% of the 229 considered tasks, and 20% of all discovered CUDA kernels are at least twice as fast as their PyTorch implementations. <br> ● The AI CUDA Engineer Archive – The team has released an archive of more than 17,000 verified CUDA kernels, which can be used for downstream fine-tuning of LLMs. There is also a website to explore the verified kernels. | [Technical Report](https://pub.sakana.ai/static/paper.pdf), [Blog](https://sakana.ai/ai-cuda-engineer/), [Dataset](https://pub.sakana.ai/ai-cuda-engineer), [Tweet](https://x.com/SakanaAILabs/status/1892385766510338559) |
| 3) **Native Sparse Attention** – DeepSeek-AI and collaborators present Native Sparse Attention (NSA), a novel sparse attention mechanism designed to improve computational efficiency while maintaining model performance in long-context language modeling. Key contributions: <br> ● Hierarchical Sparse Attention – NSA combines coarse-grained compression, fine-grained token selection, and sliding window mechanisms to balance global context awareness and local precision. <br> ● Hardware-Aligned Optimization – The authors introduce a blockwise sparse attention mechanism optimized for Tensor Core utilization, reducing memory bandwidth constraints and enhancing efficiency. <br> ● End-to-End Trainability – Unlike prior sparse attention methods that focus mainly on inference, NSA enables fully trainable sparsity, reducing pretraining costs while preserving model capabilities. Results and impact: <br> ● Outperforms Full Attention – Despite being sparse, NSA matches or exceeds Full Attention on general benchmarks, long-context reasoning, and instruction-based tasks. <br> ● Massive Speedups – NSA achieves up to 11.6× speedup over Full Attention on 64k-token sequences across all stages (decoding, forward, and backward passes). <br> ● Strong Long-Context Performance – In 64k Needle-in-a-Haystack retrieval, NSA achieves perfect accuracy, significantly outperforming other sparse methods. <br> ● Enhanced Chain-of-Thought Reasoning – Fine-tuned NSA surpasses Full Attention on AIME mathematical reasoning tasks, suggesting improved handling of long-range logical dependencies. By making sparse attention natively trainable and optimized for modern hardware, NSA provides a scalable solution for next-gen LLMs handling extremely long contexts. A toy Python sketch of the three-branch design appears after this table. | [Paper](https://arxiv.org/abs/2502.11089), [Tweet](https://x.com/deepseek_ai/status/1891745487071609327) |
| 4) **Large Language Diffusion Model** – Proposes LLaDA, a diffusion-based approach that can match or beat leading autoregressive LLMs on many tasks. Key highlights: <br> ● Questioning autoregressive dominance – While almost all large language models (LLMs) use the next-token prediction paradigm, the authors propose that key capabilities (scalability, in-context learning, instruction-following) actually derive from general generative principles rather than strictly from autoregressive modeling. <br> ● Masked diffusion + Transformers – LLaDA is built on a masked diffusion framework that learns by progressively masking tokens and training a Transformer to recover the original text. This yields a non-autoregressive generative model—potentially addressing left-to-right constraints in standard LLMs. <br> ● Strong scalability – Trained on 2.3T tokens (8B parameters), LLaDA performs competitively with top LLaMA-based LLMs across math (GSM8K, MATH), code (HumanEval), and general benchmarks (MMLU). It demonstrates that the diffusion paradigm scales about as well as autoregressive baselines. <br> ● Breaks the “reversal curse” – LLaDA shows balanced forward/backward reasoning, outperforming GPT-4 and other AR models on reversal tasks (e.g., reversing a poem line). Because diffusion does not enforce left-to-right generation, it is robust at backward completions. <br> ● Multi-turn dialogue and instruction-following – After supervised fine-tuning, LLaDA can carry on multi-turn conversations. It exhibits strong instruction adherence and fluency similar to chat-based AR LLMs—further evidence that advanced LLM traits do not necessarily rely on autoregression. A minimal sketch of the masked-diffusion training objective appears after this table. | [Paper](https://arxiv.org/abs/2502.09992), [Tweet](https://x.com/omarsar0/status/1891568386494300252) |
| 5) **SWE-Lancer** – Researchers from OpenAI introduce SWE-Lancer, a benchmark evaluating LLMs on 1,488 real-world freelance software engineering tasks from Upwork, collectively worth $1M in payouts. Key takeaways: <br> ● A new benchmark for software engineering automation – Unlike previous coding benchmarks focused on isolated tasks (e.g., program synthesis, competitive programming), SWE-Lancer tests full-stack engineering and managerial decision-making. It evaluates both Individual Contributor (IC) SWE tasks, where models write and debug code, and SWE Manager tasks, where models select the best technical proposal. <br> ● Real-world economic impact – Each task has a verifiable monetary value, mirroring freelance market rates. Payouts range from $250 bug fixes to $32,000 feature implementations. The benchmark maps model performance to earnings, offering a tangible metric for automation potential. <br> ● Rigorous evaluation with end-to-end tests – Unlike unit-test-based benchmarks, SWE-Lancer employs browser-driven, triple-verified end-to-end (E2E) tests developed by professional engineers. These tests reflect real-world software validation and prevent grading hacks. <br> ● Challenging tasks remain unsolved – Even the best-performing model, Claude 3.5 Sonnet, only solves 26.2% of IC SWE tasks and 44.9% of SWE Manager tasks, earning $208K out of $500.8K in the open-source SWE-Lancer Diamond set. This highlights the gap between current AI capabilities and human software engineers. | [Paper](https://arxiv.org/abs/2502.12115), [Tweet](https://x.com/OpenAI/status/1891911123517018521) |
| 6) **Optimizing Model Selection for Compound AI** – Researchers from Microsoft Research and collaborators introduce LLMSelector, a framework to improve multi-call LLM pipelines by selecting the best model per module instead of using one LLM everywhere. Key insights include: <br> ● Large performance boost with per-module model choices – Rather than relying on a single LLM for each sub-task in compound systems, the authors show that mixing different LLMs can yield 5%–70% higher accuracy. Each model has unique strengths (e.g., better at critique vs. generation), so assigning models to modules selectively substantially improves end-to-end results. <br> ● LLMSelector algorithm – They propose an iterative routine that assigns an optimal model to each module, guided by a novel “LLM diagnoser” to estimate per-module performance. The procedure scales linearly with the number of modules—far more efficient than exhaustive search. <br> ● Monotonicity insights – Empirically, boosting any single module’s performance (while holding others fixed) often improves the overall system. This motivates an approximate factorization approach, where local gains translate into global improvements. LLMSelector works for any static compound system with fixed modules (e.g., generator–critic–refiner). A sketch of the greedy per-module selection loop appears after this table. | [Paper](https://arxiv.org/abs/2502.14815), [Tweet](https://x.com/omarsar0/status/1892945381174210933) |
| 7) **Open-Reasoner-Zero** – Open-Reasoner-Zero (ORZ) is an open-source large-scale minimalist reinforcement learning (RL) framework that enhances reasoning capabilities. ORZ demonstrates significant scalability, requiring only 1/30th of the training steps of DeepSeek-R1-Zero-Qwen-32B to outperform it on GPQA Diamond. Key contributions and findings: <br> ● Minimalist RL Training Works – Unlike traditional RLHF setups, ORZ removes KL regularization and relies on vanilla PPO with GAE (λ=1, γ=1) and a simple rule-based reward function to scale both response length and reasoning accuracy. <br> ● Outperforms Closed-Source Models – ORZ-32B beats DeepSeek-R1-Zero-Qwen-32B on GPQA Diamond while using significantly fewer training steps, showing that training efficiency can be drastically improved with a streamlined RL pipeline. <br> ● Emergent Reasoning Abilities – ORZ exhibits "step moments", where response lengths and accuracy suddenly increase, indicating emergent reasoning capabilities with continued training. <br> ● Massive Scaling Potential – ORZ’s response length scaling mirrors trends seen in DeepSeek-R1-Zero (671B MoE), but with 5.8x fewer training steps. Training shows no signs of saturation, hinting at even further gains with continued scaling. <br> ● Fully Open-Source – The training code, model weights, data, and hyperparameters are all released, ensuring reproducibility and enabling broader adoption in the research community. <br> ● Mathematical & Logical Reasoning – ORZ significantly improves accuracy on benchmarks like MATH500, AIME2024, and AIME2025 with a simple binary reward that only evaluates answer correctness. <br> ● Generalization – Without any instruction tuning, ORZ-32B outperforms Qwen2.5-32B-Instruct on MMLU_PRO, showcasing strong reasoning generalization despite being trained purely with RL. A minimal sketch of the binary reward and GAE setup appears after this table. | [Paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf), [Tweet](https://x.com/CyouSakura/status/1892428094075502960) |
| 8) **MoBA** – MoBA is a new attention mechanism that enhances efficiency in handling long-context sequences for LLMs while maintaining strong performance. Key insights: <br> ● Adaptive Attention for Long Contexts – MoBA applies the Mixture of Experts (MoE) paradigm to the attention mechanism, allowing each query token to attend selectively to the most relevant key-value blocks rather than the full context. This enables models to handle extended sequences efficiently. <br> ● Seamless Transition Between Full and Sparse Attention – Unlike static sparse attention methods like sliding window or sink attention, MoBA can dynamically switch between full and sparse attention modes, ensuring adaptability without sacrificing generalization. <br> ● Improved Computational Efficiency – By partitioning sequences into blocks and using a gating mechanism to route queries, MoBA significantly reduces computational complexity, achieving up to 6.5× speedup over FlashAttention in prefill and scaling efficiently to 10M tokens with a 16× reduction in computation time. <br> ● Comparable Performance to Full Attention – Extensive experiments show that MoBA achieves language modeling loss and benchmark performance nearly identical to full attention, even at high sparsity levels (~95.31%). It matches full attention in long-context benchmarks like Needle in a Haystack and RULER@128K. <br> ● Hybrid MoBA-Full Attention Strategy – MoBA can be integrated flexibly with standard Transformers, allowing for layer-wise hybridization (mixing MoBA and full attention at different layers), which improves supervised fine-tuning (SFT) stability and long-context retention. A compact sketch of the block-routing idea appears after this table. | [Paper](https://github.com/MoonshotAI/MoBA/blob/master/MoBA_Tech_Report.pdf), [Tweet](https://x.com/Kimi_Moonshot/status/1891825059599352259) |
| 9) **The Danger of Overthinking** – This paper investigates overthinking in Large Reasoning Models (LRMs)—a phenomenon where models prioritize extended internal reasoning over interacting with their environment. The study analyzes 4,018 software engineering task trajectories to understand how reasoning models handle decision-making in agentic settings. Key findings: <br> ● Overthinking reduces task performance – Higher overthinking scores (favoring internal reasoning over real-world feedback) correlate with lower issue resolution rates, especially in reasoning-optimized models. Simple interventions, like selecting solutions with the lowest overthinking scores, improve performance by 30% while reducing compute costs by 43%. <br> ● Three failure patterns identified – The study categorizes overthinking into three distinct failure patterns. <br> ● Reasoning models are more prone to overthinking – Compared to non-reasoning models, LRMs exhibit 3× higher overthinking scores on average, despite their superior reasoning capabilities. <br> ● Function calling mitigates overthinking – Models with native function-calling support show significantly lower overthinking scores, suggesting structured execution pathways improve efficiency in agentic environments. <br> ● Scaling and mitigation strategies – The researchers propose reinforcement learning adjustments and function-calling optimizations to curb overthinking while maintaining strong reasoning capabilities. | [Paper](https://www.arxiv.org/abs/2502.08235), [Tweet](https://x.com/Alex_Cuadron/status/1890533660434321873) |
| 10) **Inner Thinking Transformers** – Inner Thinking Transformer (ITT) is a new method that enhances reasoning efficiency in small-scale LLMs via dynamic depth scaling. ITT aims to mitigate parameter bottlenecks in LLMs, providing scalable reasoning efficiency without expanding model size. Key contributions: <br> ● Adaptive Token Processing – ITT dynamically allocates extra computation to complex tokens using Adaptive Token Routing. This allows the model to focus on difficult reasoning steps while efficiently handling simple tokens. <br> ● Residual Thinking Connections (RTC) – A new residual accumulation mechanism iteratively refines token representations, allowing the model to self-correct without increasing parameters. <br> ● Test-Time Scaling without Extra Parameters – ITT achieves 96.5% of a 466M Transformer’s accuracy using only 162M parameters, reducing training data needs by 43.2% while outperforming loop-based alternatives across 11 benchmarks. <br> ● Elastic Deep Thinking – ITT allows flexible scaling of computation at inference time, dynamically trading off accuracy and efficiency. A toy sketch of the routing-and-refinement loop appears after this table. | [Paper](https://arxiv.org/abs/2502.13842v1), [Tweet](https://x.com/dair_ai/status/1893308342073991258) |
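
The following minimal Python sketches are unofficial illustrations of a few of the mechanisms summarized above; function names, shapes, and hyperparameters are assumptions rather than the papers' implementations. First, the three-branch idea behind Native Sparse Attention (entry 3): a compression branch over pooled block summaries, a selection branch over the top-scoring blocks, and a sliding-window branch, mixed by per-branch gates (fixed here, learned in NSA).

```python
# Toy, single-head, single-query illustration of an NSA-like three-branch
# attention. Block size, top_blocks, window, and the fixed gates are
# illustrative only; the real method learns the gates and runs fused kernels.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def nsa_like_attention(q, K, V, block=16, top_blocks=2, window=32, gates=(1/3, 1/3, 1/3)):
    T, d = K.shape
    n_blocks = T // block
    # 1) Compression branch: attend over mean-pooled block summaries.
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    out_cmp = attend(q, Kc, Vc)
    # 2) Selection branch: keep full tokens only in the top-scoring blocks.
    block_scores = Kc @ q
    keep = np.argsort(block_scores)[-top_blocks:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    out_sel = attend(q, K[idx], V[idx])
    # 3) Sliding-window branch: attend over the most recent tokens only.
    out_win = attend(q, K[-window:], V[-window:])
    # Mix the three branches with (here fixed, in NSA learned) gate weights.
    g1, g2, g3 = gates
    return g1 * out_cmp + g2 * out_sel + g3 * out_win

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 256, 64
    q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))
    print(nsa_like_attention(q, K, V).shape)  # (64,)
```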
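
A sketch of the masked-diffusion objective described for LLaDA (entry 4): sample a masking ratio, mask tokens at that rate, and train the model to recover the originals at the masked positions. The `model` function here is a random-logits stand-in, not the paper's Transformer.

```python
# Minimal masked-diffusion training step in the spirit of LLaDA (entry 4).
import numpy as np

VOCAB, MASK_ID = 1000, 0
rng = np.random.default_rng(0)

def model(tokens):
    """Stand-in for a bidirectional Transformer: per-position logits over the vocab."""
    return rng.normal(size=(len(tokens), VOCAB))

def masked_diffusion_loss(tokens):
    t = rng.uniform(1e-3, 1.0)                      # masking ratio for this step
    mask = rng.random(len(tokens)) < t              # mask each token with prob t
    noised = np.where(mask, MASK_ID, tokens)
    logits = model(noised)
    # Cross-entropy on masked positions only, reweighted by 1/t as in masked diffusion.
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(tokens)), tokens]
    return (mask * nll).sum() / (t * len(tokens))

print(masked_diffusion_loss(rng.integers(1, VOCAB, size=128)))
```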
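
A sketch of the per-module selection idea behind LLMSelector (entry 6) as greedy coordinate ascent: for each module, try each candidate model while holding the rest of the assignment fixed and keep whatever improves the validation score. The `evaluate` function and model names are hypothetical placeholders; the paper additionally uses an LLM diagnoser to guide per-module estimates.

```python
# Greedy per-module model selection for a static compound system (entry 6).
from itertools import count

MODELS = ["model-a", "model-b", "model-c"]          # candidate LLMs (placeholders)
MODULES = ["generator", "critic", "refiner"]        # modules of a static compound system

def evaluate(assignment: dict) -> float:
    """Stand-in scorer: in practice, run the pipeline end to end on held-out data."""
    return sum(hash((m, assignment[m])) % 100 for m in MODULES) / 100.0

def llm_selector(modules, models, evaluate):
    assignment = {m: models[0] for m in modules}    # start with one model everywhere
    best = evaluate(assignment)
    for _ in count():
        improved = False
        for module in modules:                      # one pass = O(|modules| * |models|) evals
            for candidate in models:
                trial = {**assignment, module: candidate}
                score = evaluate(trial)
                if score > best:
                    assignment, best, improved = trial, score, True
        if not improved:                            # stop when a full pass changes nothing
            return assignment, best

print(llm_selector(MODULES, MODELS, evaluate))
```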
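
Two ingredients of the Open-Reasoner-Zero recipe (entry 7), sketched under stated assumptions: a rule-based binary reward that only checks the final answer, and GAE with λ = 1, γ = 1, which reduces to return-to-go minus the value baseline. The answer-extraction regex and the toy value estimates are illustrative, not the released implementation.

```python
# Binary rule-based reward and GAE(lambda=1, gamma=1), as described for ORZ (entry 7).
import re
import numpy as np

def binary_reward(response: str, reference_answer: str) -> float:
    """1.0 if the text after the final 'Answer:' matches the reference, else 0.0 (illustrative rule)."""
    match = re.search(r"Answer:\s*(.+)\s*$", response.strip())
    return float(bool(match) and match.group(1).strip() == reference_answer.strip())

def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation; with gamma = lam = 1 this is return-to-go minus V(s_t)."""
    values = np.append(values, 0.0)                 # bootstrap value of the terminal state
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([0.0, 0.0, binary_reward("... Answer: 42", "42")])  # reward only at the end
values = np.array([0.3, 0.5, 0.8])                                     # toy critic estimates
print(gae(rewards, values))                                            # [0.7 0.5 0.2]
```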
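
The MoE-style routing behind MoBA (entry 8), sketched per query: keys are partitioned into blocks, a gate scores each query against mean-pooled block representatives, and attention runs only inside the top-k blocks. Causal masking and the fused kernel of the real implementation are omitted.

```python
# Compact MoBA-like block routing: each query attends only within its top-k blocks (entry 8).
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moba_like_attention(Q, K, V, block=16, k=2):
    T, d = K.shape
    n_blocks = T // block
    block_keys = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    out = np.zeros_like(Q)
    for i, q in enumerate(Q):
        gate = block_keys @ q                       # router: query vs. block summaries
        chosen = np.argsort(gate)[-k:]              # top-k blocks for this query
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in chosen])
        scores = K[idx] @ q / np.sqrt(d)
        out[i] = softmax(scores) @ V[idx]           # attend only inside the chosen blocks
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 32)) for _ in range(3))
print(moba_like_attention(Q, K, V).shape)           # (64, 32)
```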
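
A toy sketch of the two mechanisms described for Inner Thinking Transformer (entry 10): a router selects the "hardest" tokens, and a shared layer is re-applied to them for extra inner-thinking steps, each pass accumulated as a scaled residual (the Residual Thinking Connection idea). The router score, layer, and weights are stand-ins, not the paper's parameterization.

```python
# Adaptive token routing + residual inner-thinking steps, roughly in the spirit of ITT (entry 10).
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(scale=0.1, size=(d, d))              # stand-in for one shared Transformer layer

def layer(h):
    return np.tanh(h @ W)

def router_scores(h):
    """Stand-in difficulty score per token; ITT learns this routing."""
    return np.linalg.norm(h, axis=-1)

def inner_thinking_forward(h, extra_steps=2, top_frac=0.25, step_weight=0.5):
    h = h + layer(h)                                # every token gets one ordinary pass
    k = max(1, int(top_frac * len(h)))
    hard = np.argsort(router_scores(h))[-k:]        # route only the hardest tokens
    for _ in range(extra_steps):                    # extra depth without extra parameters
        h[hard] = h[hard] + step_weight * layer(h[hard])
    return h

tokens = rng.normal(size=(16, d))
print(inner_thinking_forward(tokens).shape)         # (16, 32)
```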

## Top ML Papers of the Week (February 10 - February 16) - 2025
| **Paper** | **Links** |
| ------------- | ------------- |
