🤗 Daily Paper (2025-10-01)

deep.di...@gmail.com

Oct 1, 2025, 4:08:50 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

Published at 2025-09-26

#ML

This study explores how reinforcement learning improves language model planning by comparing policy-gradient and Q-learning methods. The research reveals that while RL enhances generalization through exploration, policy-gradient methods suffer from diversity collapse, which Q-learning avoids. Additionally, the paper emphasizes the importance of reward design in Q-learning to prevent reward hacking....
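To make the contrast concrete, here is a minimal, hypothetical sketch of the two update rules on a toy tabular planning problem (illustrative only, not the paper's setup or code):

```python
import numpy as np

n_states, n_actions = 5, 3
rng = np.random.default_rng(0)

def step(s, a):
    # Hypothetical toy dynamics: action 0 moves right; the last state pays reward 1.
    s_next = min(s + (1 if a == 0 else 0), n_states - 1)
    return s_next, float(s_next == n_states - 1)

# Policy gradient (REINFORCE): pushes probability mass onto sampled actions with
# positive return -- the mechanism behind the diversity collapse the summary mentions.
logits = np.zeros((n_states, n_actions))
for _ in range(200):
    s, traj, ret = 0, [], 0.0
    for _ in range(n_states):
        p = np.exp(logits[s]); p /= p.sum()
        a = rng.choice(n_actions, p=p)
        traj.append((s, a, p))
        s, r = step(s, a)
        ret += r
    for s_t, a_t, p_t in traj:     # ascend return-weighted log-probability
        grad = -p_t                # d/dlogits of log softmax at the sampled action
        grad[a_t] += 1.0
        logits[s_t] += 0.1 * ret * grad

# Q-learning: keeps value estimates for *all* actions, so unchosen actions are not
# starved of mass -- but its behavior hinges on the reward, hence the hacking caveat.
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    s = 0
    for _ in range(n_states):
        a = int(rng.integers(n_actions)) if rng.random() < 0.2 else int(Q[s].argmax())
        s_next, r = step(s, a)
        Q[s, a] += 0.5 * (r + 0.9 * Q[s_next].max() - Q[s, a])
        s = s_next
```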

Read More

BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Published at 2025-09-26

#ML

The authors present BUILD-BENCH, a realistic benchmark for testing LLM agents in compiling diverse open-source software, and introduce OSS-BUILD-AGENT, a high-performing LLM-based agent for this task. They also analyze different compilation methods and their effects on the task, aiming to guide future improvements in software development and security....

Read More

Convolutional Set Transformer

Published at 2025-09-26

#ML

The Convolutional Set Transformer (CST) is a new neural network that directly handles sets of images, extracting features and modeling the relationships between them at the same time. This yields better performance in tasks like classifying sets of images and detecting unusual images, and, unlike competing approaches, CST is compatible with explainability methods that show how the network makes its decisions....

Read More

LLM Watermark Evasion via Bias Inversion

Published at 2025-09-26

#ML

The study presents a new method called BIRA that removes watermarks from large language model outputs with a success rate above 99%, without needing to know the specifics of the watermarking scheme. This reveals a major weakness in current watermarking techniques, highlighting the importance of creating stronger defenses....

Read More

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Published at 2025-09-26

#ML

This study presents DeeptraceReward, a new benchmark that identifies and localizes the cues humans use to detect AI-generated videos. The researchers trained a language model to recognize these cues and found it better at identifying fake videos than GPT-5, a popular language model. The study also found that, for both humans and models, judging whether a video is fake or real is easier than pinpointing the exact cues of fakeness, especially their timing....

Read More

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Published at 2025-09-27

#ML

The authors present Dolphin, an efficient method for extracting target speech using visual cues, which significantly reduces computational cost compared to existing methods. Dolphin outperforms state-of-the-art models in separation quality while being more than 2.4 times faster and using over 50% fewer parameters, making it practical for real-world applications....

Read More

Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Published at 2025-09-27

#ML

The authors present a new approach called T2PAM that uses user feedback during interactions to improve LLMs' performance in extended conversations. They also introduce ROSA, a lightweight algorithm that implements T2PAM, enabling efficient updates towards an optimal policy with minimal computational overhead, resulting in significantly better task performance....

Read More

d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Published at 2025-09-27

#ML

The authors present a new method called d^2Cache to improve the speed of diffusion-based language models without compromising their performance. This method uses a two-stage strategy to select and update tokens, allowing for more efficient use of memory and enabling quasi left-to-right generation, which results in faster inference and better generation quality....

Read More

CORRECT: COndensed eRror RECognition via knowledge Transfer in multi-agent systems

Published at 2025-09-28

#ML

The authors present CORRECT, a lightweight error recognition framework for multi-agent systems that uses a cache of error patterns to identify and correct errors without requiring training. This approach significantly improves error localization and adapts to dynamic deployments quickly, and the authors also introduce a large-scale dataset of annotated trajectories to support research in this area....

Read More

Estimating Time Series Foundation Model Transferability via In-Context Learning

Published at 2025-09-28

#ML

The authors present TimeTic, a framework that predicts the post-fine-tuning performance of time series foundation models using in-context learning and tabular foundation models. As a transferability score, TimeTic outperforms zero-shot performance, aligning closely with actual fine-tuned performance on a comprehensive benchmark....

Read More

Humanline: Online Alignment as Perceptual Loss

Published at 2025-09-28

#ML

The authors explain why online alignment methods generally work better than offline ones by linking them to human perception. They propose a new design pattern that incorporates human perception into alignment objectives, which can improve training efficiency without sacrificing performance....

Read More

Knowledge Homophily in Large Language Models

Published at 2025-09-28

#ML

The study explores how large language models organize their knowledge, comparing it to human cognitive processes. The authors build a graph representation of a model's knowledge and use a neural network to estimate how well the model understands different facts, which helps improve its efficiency in learning new information and answering complex questions....

Read More

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Published at 2025-09-28

#ML

The authors present MCPMark, a new benchmark for evaluating the practical use of MCP (the standard for LLMs to interact with external systems) in a more realistic and comprehensive manner. MCPMark includes 127 tasks that require diverse and rich interactions with the environment, and empirical results show that even the best-performing model struggles with these tasks, highlighting the challenges in creating general agents....

Read More

Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

Published at 2025-09-28

#ML

This study presents a new method called Q-Tuning to improve the efficiency of training large language models by optimizing both sample and token usage. Q-Tuning first selects valuable training examples and then prunes unnecessary tokens from those samples, resulting in significant performance improvements across various benchmarks....
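As a rough picture of the two-stage recipe, here is a hypothetical sketch; the scoring functions are stand-ins, not Q-Tuning's actual sample- and token-level criteria:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    tokens: list
    quality: float  # stand-in for the paper's sample-level value score

def two_stage_prune(samples, sample_keep=0.5, token_keep=0.8, token_score=len):
    # Stage 1: keep the most valuable fraction of training samples.
    kept = sorted(samples, key=lambda s: s.quality, reverse=True)
    kept = kept[: max(1, int(len(kept) * sample_keep))]
    # Stage 2: within each kept sample, drop the lowest-scoring tokens
    # (scored here by a placeholder; Q-Tuning learns token utility instead).
    pruned = []
    for s in kept:
        order = sorted(range(len(s.tokens)),
                       key=lambda i: token_score(s.tokens[i]), reverse=True)
        keep_idx = sorted(order[: max(1, int(len(s.tokens) * token_keep))])
        pruned.append(Sample([s.tokens[i] for i in keep_idx], s.quality))
    return pruned

data = [Sample("the model answered the question correctly".split(), 0.9),
        Sample("noisy truncated sample".split(), 0.2)]
print(two_stage_prune(data))
```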

Read More

A Cartography of Open Collaboration in Open Source AI: Mapping Practices, Motivations, and Governance in 14 Open Large Language Model Projects

Published at 2025-09-29

#ML

This study examines the collaboration methods, motivations, and organizational models of 14 open large language model (LLM) projects from various regions and sectors. The research reveals that open LLM collaboration extends to multiple aspects, developers have diverse motivations, and five distinct organizational models exist, offering practical recommendations for stakeholders to support the global AI community....

Read More

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Published at 2025-09-29

#ML

The study presents a new method, DC-VideoGen, to speed up video generation by compressing and fine-tuning pre-trained video models, resulting in faster inference and higher resolution video generation with minimal quality loss....

Read More

DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation

Published at 2025-09-29

#ML

The study presents a new method for real-time API retrieval, enhancing context-aware code generation for auto-completion and AI applications. They create a new dataset to tackle API leak issues, achieving high retrieval accuracy and reducing latency with a compact reranker model....

Read More

InfoAgent: Advancing Autonomous Information-Seeking Agents

Published at 2025-09-29

#ML

Researchers created InfoAgent, an advanced AI agent that improves its skills by interacting with external tools. InfoAgent uses a unique data pipeline and self-hosted search infrastructure, requiring fewer tool interactions to answer questions accurately and outperforming other open-source agents in various tasks....

Read More

LayerD: Decomposing Raster Graphic Designs into Layers

Published at 2025-09-29

#ML

The authors present a new method called LayerD that decomposes raster graphic designs into layers, allowing for re-editing. They achieve this by progressively extracting uncovered layers and using a quality metric to evaluate the results, showing that LayerD outperforms existing methods and can be combined with advanced image generators for editing....

Read More

MANI-Pure: Magnitude-Adaptive Noise Injection for Adversarial Purification

Published at 2025-09-29

#ML

This study finds that harmful image changes are mostly in high-frequency areas and proposes MANI-Pure, a method that targets these areas with custom noise to remove the changes without losing important image details, improving both regular and robust image recognition accuracy....
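A hypothetical sketch of the frequency-targeted idea (the cutoff and noise scales are illustrative assumptions, not the paper's MANI-Pure procedure):

```python
import numpy as np

def frequency_targeted_noise(img, cutoff=0.25, hi_sigma=0.2, lo_sigma=0.02, seed=0):
    # Inject stronger Gaussian noise into high-frequency bands of the spectrum,
    # then invert; the premise is that adversarial energy concentrates up high.
    rng = np.random.default_rng(seed)
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    sigma = np.where(radius > cutoff, hi_sigma, lo_sigma)  # magnitude-adaptive
    noise = rng.normal(size=F.shape) + 1j * rng.normal(size=F.shape)
    F_noisy = F + sigma * np.abs(F).mean() * noise
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_noisy)))

purified = frequency_targeted_noise(np.random.rand(64, 64))
```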

Read More

Nudging the Boundaries of LLM Reasoning

Published at 2025-09-29

#ML

The study presents a new method called NuRL to help language models learn from difficult problems that they couldn't solve before. NuRL does this by generating hints to simplify the problems, which results in improved performance across various benchmarks and models, and can even increase the model's overall capabilities....

Read More

Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Published at 2025-09-29

#ML

The authors present a new framework called Vision-Zero that improves vision-language models through strategic visual games generated from any image pair. This method enables models to train themselves without human input, leading to better performance across various tasks and domains, and has the potential to reduce the high cost and effort required for training these models....

Read More

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Published at 2025-09-29

#ML

The authors create VisualOverload, a new benchmark for visual understanding using 2,720 question-answer pairs from high-resolution scans of complex paintings. They find that even the best model struggles with accuracy in densely populated scenes, revealing a gap in current vision models....

Read More

Who invented deep residual learning?

Published at 2025-09-29

#ML

This paper traces the invention of deep residual learning, a significant advance in artificial neural networks, through a timeline of its development; its landmark publication has become the most cited scientific article of the 21st century as of 2025....

Read More

Who's Your Judge? On the Detectability of LLM-Generated Judgments

Published at 2025-09-29

#ML

This study presents a method to detect judgments generated by large language models (LLMs) using only the judgment scores and candidates, which is important for sensitive situations like academic peer reviewing. The proposed method, J-Detector, is a simple and transparent neural detector that uses linguistic and LLM-enhanced features to link biases in LLM judges with candidates' properties, and it has been shown to be effective across various datasets....
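The recipe can be pictured with a toy sketch; the features below are illustrative stand-ins for the paper's linguistic and LLM-enhanced features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def judgment_features(score, candidate_text):
    # Illustrative features only: the raw judgment score plus two crude
    # linguistic statistics of the judged candidate.
    words = candidate_text.split()
    return [score, len(words),
            sum(len(w) for w in words) / max(len(words), 1)]

X = np.array([judgment_features(0.9, "a thorough, well structured answer"),
              judgment_features(0.4, "short reply"),
              judgment_features(0.8, "another long and detailed response here"),
              judgment_features(0.3, "meh")])
y = np.array([1, 0, 1, 0])  # 1 = judgment came from an LLM judge, 0 = human
detector = LogisticRegression().fit(X, y)
print(detector.predict(X), detector.coef_)  # transparent: weights are inspectable
```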

Read More

jina-reranker-v3: Last but Not Late Interaction for Document Reranking

Published at 2025-09-29

#ML

The researchers present a compact, 0.6B parameter multilingual document reranker called jina-reranker-v3. It uses a novel 'last but not late interaction' method, which allows for rich cross-document interactions within the same context window, leading to improved performance on the BEIR benchmark with an nDCG@10 score of 61.94, all while being ten times smaller than other generative listwise rerankers....

Read More

Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

Published at 2025-09-30

#ML

The paper presents a new framework called AttnRL for efficient exploration in reasoning models using Process-Supervised Reinforcement Learning (PSRL). AttnRL improves upon existing PSRL methods by branching from positions with high attention scores and employing an adaptive sampling strategy based on problem difficulty and historical batch size, resulting in better performance and efficiency on various mathematical reasoning benchmarks....

Read More

DA^2: Depth Anything in Any Direction

Published at 2025-09-30

#ML

The authors present DA^2, a panoramic depth estimator that addresses the challenges of limited data and inefficiency in previous methods. They introduce a data curation engine to generate high-quality panoramic depth data and propose SphereViT to mitigate spherical distortions, resulting in improved performance and higher efficiency compared to existing approaches....

Read More

DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively

Published at 2025-09-30

#ML

The authors present DeepScientist, an AI system that conducts autonomous scientific discovery over long periods by focusing on goal-oriented tasks. This system has surpassed human-designed methods in three frontier AI tasks by significant margins, providing the first large-scale evidence of AI making scientific discoveries that push the boundaries of human knowledge....

Read More

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Published at 2025-09-30

#ML

The authors created a compact GUI agent called Ferret-UI Lite that works on various platforms and performs well on small-scale GUI tasks, using methods like data curation, reasoning, and reinforcement learning, which helps it achieve competitive performance in GUI grounding and navigation benchmarks....

Read More

IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

Published at 2025-09-30

#ML

The authors present a new framework called Implicit Multimodal Guidance (IMG) that improves the alignment between diffusion-generated images and their input prompts without needing additional data or editing. IMG uses a large language model to detect misalignments, adjusts diffusion features to reduce them, and formulates a trainable objective to realign the image and prompt. The method is more effective than existing ones and can be easily integrated into previous finetuning-based alignment...

Read More

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

Published at 2025-09-30

#ML

The study explores how Large Language Models (LLMs) acquire visual knowledge from text alone, identifying two types of visual priors: perception and reasoning. The research reveals that reasoning ability is primarily developed through pre-training on reasoning-centric data, while perception ability is more sensitive to the vision encoder and visual instruction tuning data. The findings offer a data-centric recipe for pre-training vision-aware LLMs and introduce the Multi-Level Existence Bench (M...

Read More

Mem-α: Learning Memory Construction via Reinforcement Learning

Published at 2025-09-30

#ML

The authors present Mem-alpha, a reinforcement learning framework that enables agents to manage complex memory systems by learning from interaction and feedback. The framework is trained using a specialized dataset with diverse multi-turn interaction patterns and evaluated based on question-answering accuracy, resulting in significant improvements over existing memory-augmented agent baselines and robust generalization to long sequences....

Read More

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Published at 2025-09-30

#ML

This study explores the dual nature of reasoning in Vision-Language Models, finding that while it improves logical inference, it can also cause the model to ignore visual input. The researchers propose a new method called Vision-Anchored Policy Optimization to address this issue, resulting in improved performance on various benchmarks....

Read More

MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

Published at 2025-09-30

#ML

The study presents a new method called MotionRAG that improves the realism of motion in image-to-video generation. It does this by using relevant reference videos to adapt motion priors, resulting in more realistic motion dynamics with minimal computational overhead and zero-shot generalization to new domains....

Read More

Muon Outperforms Adam in Tail-End Associative Memory Learning

Published at 2025-09-30

#ML

The Muon optimizer is faster than Adam in training Large Language Models, and this study shows that Muon is better at learning rare classes in heavy-tailed data due to its update rule, which leads to more balanced learning across classes compared to Adam....
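For readers unfamiliar with the update rule in question, here is a minimal sketch of a Muon-style step following the public reference implementation (hyperparameters are illustrative; this is not the paper's experimental code):

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G: push all singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    momentum.mul_(beta).add_(grad)               # heavy-ball momentum buffer
    W.add_(newton_schulz(momentum), alpha=-lr)   # orthogonalized update

W = torch.randn(64, 32)
m = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), m)
```

Because the orthogonalized update has roughly uniform magnitude across directions, rare-class directions are not drowned out by frequent ones, which is the intuition the paper formalizes.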

Read More

OceanGym: A Benchmark Environment for Underwater Embodied Agents

Published at 2025-09-30

#ML

OceanGym is a new testing ground for AI in underwater environments, which is difficult due to poor visibility and strong currents. This platform has eight tasks and uses advanced AI models to help agents understand their surroundings, explore, and complete goals, while also showing there's still much to improve for AI to work well in these challenging conditions....

Read More

OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

Published at 2025-09-30

#ML

The study introduces a new concept called operational safety for large language models, which measures their ability to handle specific tasks safely. The researchers also present a benchmark called OffTopicEval to evaluate this safety. Their findings show that most models struggle with operational safety, but they propose prompt-based steering methods to improve performance....

Read More

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Published at 2025-09-30

#ML

The CritPt benchmark tests large language models on unpublished, research-level physics problems designed by active physicists. Results show that while LLMs perform well on individual tasks, they struggle with full-scale research challenges, revealing a gap between current AI capabilities and real-world physics research demands....

Read More

Regression Language Models for Code

Published at 2025-09-30

#ML

The researchers present a unified Regression Language Model that can predict various outcomes like memory footprint, latency, and accuracy of code written in different programming languages, without relying on heavy and domain-specific feature engineering. This model demonstrates high performance across multiple datasets and tasks, including competitive programming submissions, CodeNet languages, and neural architecture search design spaces....

Read More

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

Published at 2025-09-30

#ML

This study presents TAU, a benchmark of everyday sounds in Taiwan, to evaluate the performance of large audio-language models in understanding culturally distinctive sounds. Experiments show that popular models struggle with these localized sounds, highlighting the need for more equitable and culturally sensitive evaluations....

Read More

TTT3R: 3D Reconstruction as Test-Time Training

Published at 2025-09-30

#ML

This research improves 3D reconstruction using Recurrent Neural Networks by introducing a new method called TTT3R, which enhances the model's ability to adapt to new observations while retaining past information, resulting in better performance and faster processing speed....

Read More

The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

Published at 2025-09-30

#ML

The authors present a new Large Language Model architecture, Dragon Hatchling (BDH), inspired by scale-free biological networks, which combines strong theoretical foundations, interpretability, and Transformer-like performance. BDH rivals GPT2 performance on language and translation tasks, and its working memory during inference relies on synaptic plasticity with Hebbian learning using spiking neurons, making it a biologically plausible model for speech....

Read More

Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

Published at 2025-09-30

#ML

The study uses circuit analysis to reveal that post-training large reasoning models develop new, specialized attention heads that support complex reasoning tasks. The research also identifies a trade-off between sophisticated problem-solving and potential errors or inefficiencies in simpler tasks, offering insights for future training policy design....

Read More

Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

Published at 2025-09-30

#ML

The authors present a new method called TFPI that enhances the efficiency and effectiveness of reasoning models by reducing token usage during inference, leading to faster training, improved performance, and lower computational costs without requiring complex training designs or specialized rewards....

Read More

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Published at 2025-09-30

#ML

The authors present TruthRL, a reinforcement learning framework that improves the truthfulness of large language models by optimizing for correct answers, reducing hallucinations, and enabling abstention when uncertain. Experiments show that TruthRL significantly outperforms vanilla RL in reducing hallucinations and improving truthfulness, highlighting the importance of a truthfulness-driven learning objective....

Read More

Video Object Segmentation-Aware Audio Generation

Published at 2025-09-30

#ML

The authors present SAGANet, a new model that allows for controllable audio generation in videos by using object-level segmentation maps, enabling users to have fine-grained control over the audio generated. They also introduce a new dataset, Segmented Music Solos, to support this task and further research in the field....

Read More

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

Published at 2025-09-30

#ML

The authors present VitaBench, a new benchmark for testing LLM agents in realistic scenarios, such as food delivery and travel services. They found that even advanced models struggle with these complex tasks, achieving only a 30% success rate on cross-scenario tasks....

Read More

Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

Published at 2025-09-30

#ML

The study introduces a new benchmark called VERA for evaluating reasoning ability in voice-interactive systems, which reveals significant performance gaps between text and voice models. The benchmark helps analyze the impact of architectural choices on reliability and offers a way to measure progress toward real-time voice assistants that are both fluent and reliable in reasoning....

Read More

dParallel: Learnable Parallel Decoding for dLLMs

Published at 2025-09-30

#ML

The authors present dParallel, a method that improves the parallel decoding process of diffusion large language models (dLLMs) by addressing the bottleneck of sequential certainty convergence for masked tokens. This is achieved through certainty-forcing distillation, which reduces the number of decoding steps significantly without compromising performance, as demonstrated by experiments on various benchmarks....
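The inference-time behavior such methods target can be pictured with a hypothetical sketch; the `model` call and mask handling below are assumptions, not the authors' API, and the certainty-forcing distillation itself is a training procedure not shown here:

```python
import torch

@torch.no_grad()
def parallel_decode(model, ids, mask_id, threshold=0.9, max_steps=64):
    # ids: 1-D token tensor with masked positions set to mask_id.
    for _ in range(max_steps):
        masked = ids == mask_id
        if not masked.any():
            break
        logits = model(ids)                    # hypothetical API: (seq, vocab) logits
        conf, pred = logits.softmax(-1).max(-1)
        commit = masked & (conf >= threshold)  # unmask all confident tokens at once
        if not commit.any():                   # avoid stalling: force the single best
            pos = torch.where(masked)[0][conf[masked].argmax()]
            commit = torch.zeros_like(masked)
            commit[pos] = True
        ids[commit] = pred[commit]
    return ids
```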

Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Fb X In