🤗 Daily Paper Newsletter

Hope you find some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Expanding the Action Space of LLMs to Reason Beyond Language
Published at 2025-10-08

#ML

The study presents a method to enhance Large Language Models (LLMs) by allowing them to interact with external environments beyond just language, such as symbolic operators or simulators. The new approach, called ExpA Reinforcement Learning (EARL), enables LLMs to perform better on tasks requiring multi-turn interactions and contingent planning, outperforming vocabulary-constrained action baselines....
Read More

AlphaQuanter: An End-to-End Tool-Orchestrated Agentic Reinforcement Learning Framework for Stock Trading
Published at 2025-10-15

#ML

The authors present a new single-agent framework called AlphaQuanter that uses reinforcement learning to automate stock trading. This tool improves upon existing multi-agent frameworks by being more efficient, providing consistent signals, and learning a coherent strategy from market feedback, ultimately achieving top performance on key financial metrics and offering valuable insights for human traders....
Read More

Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations
Published at 2025-10-15

#ML

This study discusses the limitations of current AI simulations that are confined to static environments and proposes a new approach using LLMs in multi-agent systems to model complex, ever-changing societies. The authors introduce a taxonomy for this field and present a research roadmap focused on creating adaptive, socially aware AI ecosystems through open-ended co-evolution....
Read More

PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
Published at 2025-10-17

#ML

The study presents PokeeResearch-7B, a 7B-parameter deep research agent that uses a unified reinforcement learning framework to improve its robustness, alignment, and scalability. This agent outperforms other 7B-scale deep research agents on 10 popular benchmarks by using an annotation-free reinforcement learning framework and a chain-of-thought-driven multi-call reasoning scaffold for self-verification and adaptive recovery from tool failures....
Read More

UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis
Published at 2025-10-17

#ML

The study presents a new framework that combines medical multimodal understanding and generation, which can process various medical inputs and produce diverse outputs. This unified model improves performance in medical image understanding and generation tasks by enabling knowledge sharing between them, filling gaps in data representation and feature integration....
Read More

Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
Published at 2025-10-17

#ML

The study presents SciRecipe, a large dataset of structured biological protocols, and introduces the 'Sketch-and-Fill' paradigm to improve protocol generation. The authors develop Thoth, a model trained to generate accurate and executable protocols, which outperforms existing models in various benchmarks, thus enhancing the efficiency of scientific experiment reproduction....
Read More

PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Published at 2025-10-18

#ML

The authors present PRISMM-Bench, a new benchmark that tests large multimodal models' ability to handle inconsistencies in scientific papers, which are often missed by existing benchmarks. They collected 262 real-world inconsistencies and designed tasks to evaluate models' performance in detecting, correcting, and reasoning about these inconsistencies across different modalities. The results show that current models struggle with this task, highlighting the need for improvement in this area....
Read More

Chem-R: Learning to Reason as a Chemist
Published at 2025-10-19

#ML

The authors present a new model, Chem-R, which is trained in three stages to mimic chemists' problem-solving processes. This model outperforms other large language models and chemical foundation models, demonstrating its potential for AI-driven chemical discovery....
Read More

Video Reasoning without Training
Published at 2025-10-19

#ML

This study finds that high-quality models improve reasoning by exploring and exploiting options, and then converging on a solution. The researchers then developed V-Reason, a method that tunes model behavior during inference without training, leading to better performance and efficiency compared to RL-trained models....
Read More

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Published at 2025-10-20

#ML

The authors present Any-Depth Alignment (ADA), a method that enhances the safety of large language models by reintroducing alignment tokens during generation, which helps the model reassess and prevent harmful responses at any point. ADA is effective across various models, maintaining high performance without altering the base model's parameters and significantly reducing successful adversarial attacks....
Read More

DeepSeek-OCR: Contexts Optical Compression
Published at 2025-10-20

#ML

DeepSeek-OCR is a new system that uses optical 2D mapping to compress long texts into fewer vision tokens, enabling high OCR precision even at a high compression ratio. It outperforms other models in text recognition and can generate large-scale training data for LLMs/VLMs efficiently....
Read More

Efficient Long-context Language Model Training by Core Attention Disaggregation
Published at 2025-10-20

#ML

The authors propose a new method called core attention disaggregation (CAD) that improves the training of long-context large language models by separating a crucial computation step from the rest of the model. This separation allows for more efficient use of computing resources and reduces imbalances and bottlenecks, leading to faster and more effective training....
Read More

EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning
Published at 2025-10-20

#ML

The paper presents a new framework for creating reliable and verifiable synthetic data that can be used for training language models. This framework can generate problems, solutions, and verification artifacts, and it improves data filtering by ensuring consistency between human-annotated and strategy-induced checks. The approach has been tested and shown to improve performance on LiveCodeBench and AgentBench-OS tasks....
Read More

GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
Published at 2025-10-20

#ML

The authors propose a new method called Generalized Adversarial Solver to improve the efficiency of diffusion models in generating high-quality samples. This method simplifies the training process and enhances fine-grained details by combining the original distillation loss with adversarial training, resulting in better performance compared to existing solver training methods....
Read More

Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution
Published at 2025-10-20

#ML

The study reveals that current multilingual watermarking methods for large language models are not effective in medium- and low-resource languages due to a flaw in semantic clustering. To solve this, the authors propose STEAM, a back-translation-based detection method that enhances watermark strength across various languages, tokenizers, and watermarking techniques, improving fairness in watermarking....
Read More

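The back-translation idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `detect` is a simplified "green-list" watermark scorer, and `back_translate` is a stand-in for a real round-trip machine-translation system; both names and the scoring rule are assumptions.

```python
# Toy sketch of back-translation-aided watermark detection: score both
# the text and its back-translation, then keep the stronger signal.

def detect(text: str, green_words: set[str]) -> float:
    """Toy watermark score: fraction of tokens in the 'green' list."""
    tokens = text.lower().split()
    return sum(t in green_words for t in tokens) / len(tokens) if tokens else 0.0

def back_translate(text: str, round_trip: dict[str, str]) -> str:
    """Stand-in for translating to a pivot language and back."""
    return round_trip.get(text, text)

def steam_score(text: str, green_words: set[str],
                round_trip: dict[str, str]) -> float:
    # The watermark signal may only reappear after back-translation,
    # so take the max of the two detection scores.
    return max(detect(text, green_words),
               detect(back_translate(text, round_trip), green_words))
```

The point of the max is that a translated output can hide the green-list bias until the text is mapped back toward the language it was watermarked in.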
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
Published at 2025-10-20

#ML

The study presents MT-Video-Bench, a comprehensive video understanding benchmark designed to evaluate AI's performance in multi-turn dialogues, focusing on six key competencies such as interactive sports analysis and video-based tutoring. This benchmark helps assess various multimodal large language models, highlighting their strengths and weaknesses in real-world multi-turn scenarios....
Read More

MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Published at 2025-10-20

#ML

The authors present a training framework that significantly improves efficiency and performance for large-scale video generation models by optimizing data processing, model architecture, training strategy, and infrastructure. The resulting model, MUG-V 10B, matches or exceeds state-of-the-art video generators and is open-sourced along with the training code and inference pipelines....
Read More

Planned Diffusion
Published at 2025-10-20

#ML

The study presents a new method called 'planned diffusion' that combines the strengths of autoregressive and diffusion models for text generation. This method first creates a short plan autoregressively and then generates the planned spans of text in parallel with diffusion, resulting in fast, high-quality text generation with a Pareto-optimal trade-off between quality and latency....
Read More

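The plan-then-fill control flow can be mimicked with ordinary concurrency. This is a toy sketch of the structure only: `draft_plan` and `fill_span` are hypothetical stand-ins for the autoregressive planner and the parallel diffusion decoder, not the paper's method.

```python
# Toy sketch of the two-stage flow: a short sequential "plan" step,
# then independent spans expanded in parallel.
from concurrent.futures import ThreadPoolExecutor

def draft_plan(prompt: str) -> list[str]:
    """Stand-in AR step: emit short headers for independent spans."""
    return [f"{prompt} (part {i})" for i in range(3)]

def fill_span(header: str) -> str:
    """Stand-in diffusion step: expand one span independently."""
    return header + ": expanded text"

def planned_diffusion(prompt: str) -> str:
    plan = draft_plan(prompt)            # sequential, but short
    with ThreadPoolExecutor() as pool:   # spans decoded in parallel
        return " ".join(pool.map(fill_span, plan))
```

The latency win comes from the same place as in this sketch: only the short plan is sequential, while the bulk of the tokens are produced concurrently.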
World-in-World: World Models in a Closed-Loop World
Published at 2025-10-20

#ML

The authors present World-in-World, a new platform for evaluating generative world models in a realistic, closed-loop setting that focuses on task success rather than just visual quality. They find that controllability and scaling with action-observation data are more important than visual quality for successful decision-making by embodied agents....
Read More

ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning
Published at 2025-10-20

#ML

The authors present ssToken, a method for improving large language models' fine-tuning by using a self-modulated signal from a history model and an attention-based metric to select important tokens, without relying on an additional reference model or solely on loss information....
Read More

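The selection rule described above can be sketched as follows. The exact scoring, the blend weight `alpha`, and the keep ratio are illustrative assumptions, not ssToken's precise formulation: each token's excess loss under the current model relative to a frozen history model (the self-modulated signal) is blended with an attention-based importance score, and only the top fraction of tokens contributes to the fine-tuning loss.

```python
# Toy token-selection sketch: blend excess loss vs. a history model
# with a semantic (attention) score, keep the top-scoring tokens.

def select_tokens(cur_loss, hist_loss, attn, keep_ratio=0.5, alpha=0.5):
    # Excess loss (c - h) measures what the current model still has to
    # learn; attn supplies semantic importance.
    scores = [alpha * (c - h) + (1 - alpha) * a
              for c, h, a in zip(cur_loss, hist_loss, attn)]
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:k])  # indices of tokens kept for the loss
```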
DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Published at 2025-10-21

#ML

DSI-Bench is a new benchmark for evaluating models' understanding of dynamic 3D scenarios, as existing models struggle with simultaneous motion of observers and objects. The benchmark, which includes over 1,000 videos and 1,700 questions, reveals limitations in current models, such as confusing observer and object motion and failing to accurately infer relative relationships in dynamic scenarios....
Read More

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Published at 2025-10-21

#ML

The authors have developed Ring-1T, the first open-source thinking model with trillion-scale parameters. They introduce three innovations to tackle the challenges of training at this scale, resulting in exceptional performance across various benchmarks, including a silver-medal-level result on IMO 2025 that demonstrates its advanced reasoning capabilities....
Read More

Extracting alignment data in open models
Published at 2025-10-21

#ML

This study demonstrates that a substantial amount of alignment training data can be extracted from a post-trained model, which can enhance its capabilities like long-context reasoning, safety, and math skills. The research suggests that using embedding models for measuring success in data extraction is more effective than string matching and highlights a potential risk in extracting alignment data....
Read More

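The measurement point above, that embedding similarity catches extractions that string matching misses, is easy to illustrate. Bag-of-words vectors stand in here for a real sentence-embedding model; the example strings are invented for illustration.

```python
# Toy comparison: exact string matching vs. embedding-style similarity
# for judging whether extracted text recovers a training example.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

original = "always refuse to give step by step weapon instructions"
extracted = "refuse always to give weapon instructions step by step"

# String matching misses the near-verbatim extraction (word order differs),
# while the similarity score flags it as a successful recovery.
string_hit = original == extracted
embedding_hit = cosine(embed(original), embed(extracted)) > 0.9
```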
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Published at 2025-10-21

#ML

The authors present a new method called Grasp Any Region (GAR) to improve Multimodal Large Language Models' understanding of complex scenes by considering both local and global contexts. GAR can precisely perceive details, model interactions between multiple prompts, and perform advanced compositional reasoning, outperforming other models in various benchmarks....
Read More

IF-VidCap: Can Video Caption Models Follow Instructions?
Published at 2025-10-21

#ML

The authors present IF-VidCap, a new benchmark for evaluating video captioning models based on their ability to follow instructions, which is a gap in current benchmarks that focus on descriptive comprehensiveness. The benchmark assesses captions on format and content correctness, and the evaluation of various models shows that while proprietary models still lead, open-source solutions are catching up, and specialized dense captioning models struggle with complex instructions compared to general...
Read More

LightMem: Lightweight and Efficient Memory-Augmented Generation
Published at 2025-10-21

#ML

The authors present a new memory system called LightMem that helps large language models better use historical information in complex environments. LightMem is inspired by human memory and reduces computational overhead, improving accuracy and efficiency in experiments with GPT and Qwen backbones....
Read More

MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Published at 2025-10-21

#ML

This study presents a new sparse attention method called MoGA that efficiently matches tokens without relying on blockwise estimation, enabling effective long-range interactions in video generation. The researchers use MoGA to create an efficient model that can generate high-quality, long videos at a rapid frame rate, which is validated through various experiments....
Read More

Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
Published at 2025-10-21

#ML

The authors present a new method called Mono4DGS-HDR that creates detailed 3D scenes with high contrast and brightness from standard low-quality videos taken with varying exposure settings. This technique works without needing precise camera information and improves the consistency of the scenes over time, outperforming other methods in both quality and speed....
Read More

ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Published at 2025-10-21

#ML

The authors present a new method called ProCLIP to improve the alignment between an LLM-based embedder and the CLIP image encoder. ProCLIP overcomes limitations of the original CLIP text encoder by leveraging the LLM's ability to process long texts and multiple languages, while also ensuring that the vision-language alignment in the CLIP image encoder is not disrupted....
Read More

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Published at 2025-10-21

#ML

The authors present 3DThinker, a novel framework that improves 3D spatial reasoning from limited views by leveraging geometric information within images. Unlike previous methods, 3DThinker enables 3D mental imagery during reasoning without requiring 3D prior input or explicit 3D labels, and it outperforms strong baselines in various benchmarks....
Read More

Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Published at 2025-10-21

#ML

The authors present a new method called Critique-Post-Edit to improve the personalization of large language models. This approach uses a personalized reward model and a critique-post-edit mechanism to create more accurate and customized responses, outperforming standard methods and even surpassing GPT-4.1 in some cases....
Read More

UltraGen: High-Resolution Video Generation with Hierarchical Attention
Published at 2025-10-21

#ML

The paper presents UltraGen, a new video generation framework for efficient, end-to-end high-resolution video synthesis. It addresses the computational bottleneck of existing models with a hierarchical dual-branch attention architecture and a spatially compressed global modeling strategy, enabling the generation of 1080P and even 4K resolution videos for the first time....
Read More

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Published at 2025-10-21

#ML

UniGenBench++ is a new benchmark for testing the accuracy of text-to-image generation models. It includes 600 prompts in both English and Chinese, covering various themes and evaluation criteria, to assess the models' performance in generating images that align with the given text prompt....
Read More

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the open SOLAR-10.7B LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the Developer's Social Media