🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter delivers a curated list of papers from 🤗 Daily Papers. |
|
ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs |
Published at 2025-10-06 |
|
#ML
|
This study analyzes the trade-offs of parallel decoding in diffusion LLMs, revealing that they can have significantly reduced performance in real-world tasks compared to autoregressive LLMs due to ignoring token dependencies. The research introduces ParallelBench, a new benchmark for diffusion LLMs, which highlights the need for improved decoding methods to balance speed and quality.... |
|
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation |
Published at 2025-10-08 |
|
#ML
|
The study presents HaystackCraft, a new benchmark for testing long-context robustness of language models in noisy, real-world scenarios. The researchers evaluate the impact of various retrieval strategies on model performance and demonstrate the challenges of agentic long-context reasoning, providing a valuable tool for future advancements in this area.... |
|
CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving |
Published at 2025-10-09 |
|
#ML
|
The authors present a new method called CVD-STORM that creates realistic, long-term, multi-view videos for self-driving cars. This technique uses a specialized Variational Autoencoder to capture 3D structures and temporal dynamics, which is then combined with a video diffusion process to produce high-quality videos with additional information like depth estimation.... |
|
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model |
Published at 2025-10-11 |
|
#ML
|
The research proposes a new method called Soft Prompt, which uses minimal additional parameters and separate learnable embeddings for each data source to improve generalist Vision-Language-Action models. This approach, implemented in the X-VLA architecture, enables effective exploitation of varying cross-embodiment features, demonstrating state-of-the-art performance across various simulations and real-world robots.... |
|
Evaluating Language Models' Evaluations of Games |
Published at 2025-10-12 |
|
#ML
|
This study proposes a new way to evaluate AI systems by assessing their ability to evaluate games, focusing on payoff and funness. The results indicate that reasoning models are more aligned with human evaluations compared to non-reasoning language models, but their alignment weakens as they approach game-theoretic optimality, especially when evaluating funness.... |
|
FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model |
Published at 2025-10-12 |
|
#ML
|
FG-CLIP 2 is a new bilingual vision-language model that improves fine-grained alignment for both English and Chinese by using detailed supervision and specific training techniques. It outperforms other methods on various tasks and languages, and its code and benchmark are available for further research.... |
|
GraphTracer: Graph-Guided Failure Tracing in LLM Agents for Robust Multi-Turn Deep Search |
Published at 2025-10-12 |
|
#ML
|
The study presents GraphTracer, a framework that identifies and diagnoses errors in multi-agent systems powered by large language models, particularly in multi-turn deep search scenarios. It analyzes information flow between agents and generates synthetic data that targets critical nodes, which improves attribution accuracy and performance.... |
|
HyperAgent: Leveraging Hypergraphs for Topology Optimization in Multi-Agent Communication |
Published at 2025-10-12 |
|
#ML
|
The study presents a new framework called HyperAgent that uses hypergraphs to optimize communication among multiple agents, improving group collaboration and adapting to different task complexities, resulting in better performance and efficiency compared to existing methods.... |
|
MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training |
Published at 2025-10-12 |
|
#ML
|
The authors propose a new framework called MTSQL-R1 for improving Text-to-SQL systems in handling long conversations and complex queries. By treating the task as a game-like decision process, the system interacts with a database and a memory to generate accurate and coherent SQL queries, outperforming existing methods in experiments.... |
|
Revisiting Model Interpolation for Efficient Reasoning |
Published at 2025-10-12 |
|
#ML
|
The study examines a straightforward method of combining two models, called model interpolation, and discovers it follows a three-stage evolution with unique behaviors that help balance performance and cost. Experiments show that this interpolated model outperforms more complex merging methods, and further tests confirm the findings, providing a practical guide for creating models with specific reasoning skills.... |
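As a rough illustration of what model interpolation means, the sketch below blends two sets of weights parameter by parameter with a single coefficient. It assumes plain linear interpolation over models that share the same parameter names and shapes; the function name `interpolate_weights`, the coefficient `lam`, and the toy weights are hypothetical and not taken from the paper.

```python
def interpolate_weights(weights_a, weights_b, lam):
    """Return (1 - lam) * weights_a + lam * weights_b, parameter by parameter.

    Both arguments are dicts mapping a parameter name to a flat list of floats.
    This is an assumed, simplified stand-in for real model checkpoints.
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name, values_a in weights_a.items():
        values_b = weights_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - lam) * a + lam * b for a, b in zip(values_a, values_b)
        ]
    return merged


# Toy weights (hypothetical values, for illustration only).
model_a = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
model_b = {"layer.weight": [4.0, 2.0], "layer.bias": [3.0]}

# lam = 0 reproduces model_a, lam = 1 reproduces model_b.
merged = interpolate_weights(model_a, model_b, lam=0.25)
print(merged)
```

Sweeping `lam` from 0 to 1 is one simple way to explore the space between the two models; which values work well in practice is an empirical question that the paper studies.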
|
Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning |
Published at 2025-10-12 |
|
#ML
|
This study presents Latent-Trajectory signals, which predict the success of reasoning paths in models by analyzing changes in their internal representations. These signals allow for more efficient and effective answer selection during inference, reducing token usage by up to 70% and improving accuracy by 2.6% on average compared to majority voting, by identifying promising candidates early in the reasoning process.... |
|
Direct Multi-Token Decoding |
Published at 2025-10-13 |
|
#ML
|
The study proposes a new method called Direct Multi-Token Decoding (DMTD) that generates multiple tokens using only the late layers of a transformer model, which could speed up the process and reduce the need for repeated traversals of earlier layers. Experiments with a fine-tuned DMTD model showed up to a 2x speedup with minimal performance loss, and it's expected to perform better with larger datasets.... |
|
EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling |
Published at 2025-10-13 |
|
#ML
|
The proposed method, EAGer, reduces unnecessary computation and enhances performance by utilizing model uncertainty through token-wise entropy distribution for adaptive inference-time scaling. EAGer selectively branches to multiple reasoning paths in high-entropy areas, reallocating saved computational resources to more complex prompts, resulting in significant efficiency and performance improvements on complex reasoning benchmarks.... |
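To make the idea of token-wise entropy concrete, the sketch below computes the Shannon entropy of a next-token distribution and flags positions where it exceeds a threshold. It is only an illustration of the general idea, not the EAGer implementation: the function names, the threshold value, and the toy distributions are assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(probs, threshold):
    """Return True when the distribution is uncertain enough to branch.

    The threshold is a hypothetical tuning knob chosen for this sketch.
    """
    return token_entropy(probs) > threshold


# Toy distributions over a four-token vocabulary (illustrative values).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(token_entropy(probs), 3), should_branch(probs, threshold=1.0))
```

A uniform distribution has the maximum possible entropy for its vocabulary size, so it crosses the threshold, while a sharply peaked one does not.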
|
MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model |
Published at 2025-10-13 |
|
#ML
|
The authors present MATH-Beyond, a new math benchmark designed to challenge current RL methods and encourage the discovery of new reasoning skills beyond existing models' capabilities. The benchmark consists of harder math problems that existing RL fine-tuned models struggle to solve, aiming to stimulate the development of more advanced RL techniques.... |
|
Point Prompting: Counterfactual Tracking with Video Diffusion Models |
Published at 2025-10-13 |
|
#ML
|
This study demonstrates a new method to use video diffusion models for point tracking without prior training. By placing a colored marker at the point of interest and regenerating the video, the model can trace the point's trajectory even through occlusions, outperforming other zero-shot methods and competing with specialized self-supervised models.... |
|
Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs |
Published at 2025-10-13 |
|
#ML
|
This study presents AT-GRPO, a new algorithm and training system that applies on-policy reinforcement learning to multi-agent systems for large language models, resulting in significant improvements in task performance across various domains.... |
|
What Generative Search Engines Like and How to Optimize Web Content Cooperatively |
Published at 2025-10-13 |
|
#ML
|
This study presents a framework called AutoGEO that learns and applies generative engine preferences to improve web content visibility and traction, using both prompt-based and rule-based methods. The framework's effectiveness is demonstrated through experiments on various benchmarks, confirming its ability to enhance content optimization while preserving search utility.... |
|
CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving |
Published at 2025-10-14 |
|
#ML
|
This study presents a new method called CoIRL-AD that combines imitation learning and reinforcement learning for autonomous driving. By using a competitive framework, CoIRL-AD allows these two learning methods to interact, leading to better generalization and performance, especially in tricky driving scenarios.... |
|
Learning to Grasp Anything by Playing with Random Toys |
Published at 2025-10-14 |
|
#ML
|
The study explores how robots can learn to grasp various objects by interacting with simple toys made from basic shapes. The findings suggest that robots can achieve strong zero-shot performance on real-world objects by training with these toys, thanks to an object-centric visual representation. The model outperforms state-of-the-art methods on the YCB dataset, demonstrating a promising approach for scalable and generalizable learning in robotic manipulation.... |
|
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization |
Published at 2025-10-15 |
|
#ML
|
This study reveals that attention can be used to understand the reasoning process of large language models (LLMs) by distinguishing between locally and globally focused attention heads. The researchers introduce new methods for optimizing LLM reasoning by targeting critical nodes in the attention process, leading to improved performance on various reasoning tasks.... |
|
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs |
Published at 2025-10-15 |
|
#ML
|
The authors present a new dataset, Honey-Data-15M, which contains 15 million question-answer pairs and is improved through cleaning techniques and a dual-level chain-of-thought enrichment strategy. They also provide a data curation pipeline, HoneyPipe, and its underlying framework, DataStudio, to ensure transparency and adaptability in data curation. Experiments with their 8B model, Bee-8B, trained on this dataset show that it outperforms previous fully open multimodal large language models and ... |
|
Dedelayed: Deleting remote inference delay via on-device correction |
Published at 2025-10-15 |
|
#ML
|
This study presents a method called Dedelayed that reduces the delay in remote inference, enabling real-time, low-latency predictions on local devices. By using a lightweight local model and fusing features from past frames computed by a remote model, Dedelayed improves semantic segmentation accuracy compared to local-only or remote-only baselines, especially under longer delays and higher-motion scenes.... |
|
Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs |
Published at 2025-10-15 |
|
#ML
|
This study presents a method called Deflanderization to improve the balance between character authenticity and task execution in AI-generated game dialogue, achieving top ranks in various challenge tracks by combining lightweight prompting and fine-tuned large models.... |
|
FlashWorld: High-quality 3D Scene Generation within Seconds |
Published at 2025-10-15 |
|
#ML
|
The authors present a new model, FlashWorld, that creates high-quality 3D scenes quickly from images or text, improving upon previous methods by being faster and more realistic. They achieve this by combining the strengths of two approaches, using a video model to pre-train and a distribution matching method to enhance visual quality, resulting in a model that is efficient and generalizable.... |
|
Generative Universal Verifier as Multimodal Meta-Reasoner |
Published at 2025-10-15 |
|
#ML
|
The research presents a new tool for improving multimodal reasoning in vision-language models by introducing a benchmark to evaluate visual outcomes and designing methods to create large-scale data and train a universal verifier. The tool enhances the generation process and broadens its application in various scenarios, leading to notable improvements and advancing towards more trustworthy and controllable reasoning systems.... |
|
Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math |
Published at 2025-10-15 |
|
#ML
|
The study presents Hard2Verify, a benchmark for evaluating the step-level verification of open-ended mathematical reasoning by large language models. The benchmark, created with extensive human labor, assesses the ability of verifiers to identify mistakes in responses generated by advanced LLMs, and the results show that closed-source models generally outperform open-source ones, with further analysis on the factors affecting verification performance.... |
|
Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain |
Published at 2025-10-15 |
|
#ML
|
The research presents a tool called HFTP that uses frequency analysis to understand how large language models and the human brain process syntax. The study finds that while language models and the brain have some similarities in syntax processing, advanced models may rely on different mechanisms than the brain, offering new insights into the interpretability of model improvements.... |
|
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue |
Published at 2025-10-15 |
|
#ML
|
InteractiveOmni is a unified, open-source model for understanding and generating audio-visual content in multi-turn interactions. It uses a multi-stage training strategy, a curated dataset for long-term conversations, and benchmarks for evaluation, outperforming other models and maintaining high performance even with reduced size.... |
|
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy |
Published at 2025-10-15 |
|
#ML
|
The researchers developed a new framework called InternVLA-M1 that improves robot instruction-following by using spatial information to connect instructions with robot actions. This framework outperforms previous methods in various tasks and environments, demonstrating its potential for creating more intelligent and adaptable robots.... |
|
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models |
Published at 2025-10-15 |
|
#ML
|
The study examines the robustness of robotic manipulation models by testing their performance under various controlled conditions, such as changes in camera angles, lighting, and language instructions. The results show that these models are highly sensitive to certain factors and can perform poorly even with minor changes, while being surprisingly unaffected by others, indicating that high benchmark scores may not accurately reflect their true competency.... |
|
NOSA: Native and Offloadable Sparse Attention |
Published at 2025-10-15 |
|
#ML
|
The authors present NOSA, a framework for trainable sparse attention that reduces memory usage and improves decoding speed by offloading key-value pairs to the CPU, without altering the attention computation. NOSA's design, which introduces locality constraints, leads to a 2.3x improvement in decoding throughput compared to traditional methods, while maintaining performance.... |
|
PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning |
Published at 2025-10-15 |
|
#ML
|
The researchers present a method called PhysMaster that improves the physical accuracy of video generation models by incorporating reinforcement learning and human feedback to learn physical representations. This approach enhances the physics-awareness of the model, enabling it to generate more realistic and plausible videos by adhering to physical laws.... |
|
Reasoning in Space via Grounding in the World |
Published at 2025-10-15 |
|
#ML
|
The authors propose a new model, GS-Reasoner, which effectively combines 3D visual grounding and spatial reasoning by introducing a dual-path pooling mechanism. This model eliminates the need for external modules and outperforms existing 3D large language models, and the authors also present the Grounded Chain-of-Thought dataset to further enhance the integration of grounding and spatial reasoning.... |
|
The Art of Scaling Reinforcement Learning Compute for LLMs |
Published at 2025-10-15 |
|
#ML
|
This study is the first large-scale systematic analysis of scaling reinforcement learning in large language models, involving over 400,000 GPU-hours. The researchers established a framework to analyze and predict RL scaling, identified factors affecting compute efficiency, and proposed a best-practice recipe, ScaleRL, for stable and scalable RL training.... |
|
The Role of Computing Resources in Publishing Foundation Model Research |
Published at 2025-10-15 |
|
#ML
|
This study examines the connection between computing resources and advancements in foundation models, finding that more resources are associated with more citations and funding, but not necessarily with particular research environments, domains, or methodologies. The researchers suggest promoting shared and affordable computing opportunities to encourage diverse participation and innovation in AI research.... |
|
Trace Anything: Representing Any Video in 4D via Trajectory Fields |
Published at 2025-10-15 |
|
#ML
|
Researchers have developed a new method to represent videos as Trajectory Fields, which map each pixel's continuous 3D movement over time. They also introduced a neural network, Trace Anything, that predicts these trajectories in one go, offering improved efficiency and new capabilities like motion forecasting, all while setting a new standard for trajectory field estimation.... |
|
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark |
Published at 2025-10-15 |
|
#ML
|
The authors present a new benchmark called Uni-MMMU that tests the integration of visual understanding and generation in models by coupling tasks in eight domains like science and coding, and reveals performance disparities and dependencies between these abilities.... |
|
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning |
Published at 2025-10-15 |
|
#ML
|
The researchers created a new model called UniME-V2 that uses advanced language models to improve how well multimodal embedding models understand subtle differences between data samples. This model helps the embedding models better distinguish between similar samples and makes them more accurate for various tasks.... |
|
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE |
Published at 2025-10-15 |
|
#ML
|
The study presents a new model, UniMoE-Audio, which combines speech and music generation in a single framework, addressing the challenge of separate development for these auditory forms. The model uses a dynamic-capacity Mixture-of-Experts architecture with specialized experts and a three-stage training process to overcome task conflicts and data imbalances, achieving superior performance in both domains.... |
|
Universal Image Restoration Pre-training via Masked Degradation Classification |
Published at 2025-10-15 |
|
#ML
|
The researchers present a new method called Masked Degradation Classification Pre-Training (MaskDCPT) for training models to restore images. This method uses the type of image degradation as a weak form of guidance and helps improve the model's performance and robustness. The pre-trained model can then be used for various image restoration tasks and significantly outperforms existing methods on different restoration tasks.... |
|
Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |