🤗 Daily Paper (2025-10-08)

deep.di...@gmail.com

Oct 8, 2025, 4:08:11 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you find some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

On Code-Induced Reasoning in LLMs

Published at 2025-09-25

#ML

The study explores how different aspects of code impact the reasoning abilities of large language models (LLMs). By constructing parallel instruction datasets in ten programming languages and applying controlled perturbations, the researchers find that LLMs are more sensitive to structural than to semantic disruptions, and that appropriate abstractions like pseudocode and flowcharts can be as effective as code....

Read More

DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Published at 2025-09-26

#ML

This study presents DRIFT, a new method for learning from user dissatisfaction signals, which are more abundant than satisfaction signals in real-world language model deployments. DRIFT outperforms existing methods in improving model performance on various tasks and preserving exploratory capacity, and it is particularly effective at larger scales....

Read More

CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

Published at 2025-09-27

#ML

The study presents a new method called Clinical Contrastive Decoding (CCD) to reduce errors in AI models used for interpreting medical images. CCD improves the accuracy of these models by integrating clinical signals from expert models, enhancing their performance on various datasets without altering the original models....

Read More
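
The core trick, contrastive decoding, is easy to sketch. Below is a minimal, hypothetical PyTorch version: it assumes you already have next-token logits from the base radiology MLLM and from a clinical expert model, and the function name, mixing rule, and parameters are illustrative assumptions, not the paper's exact formulation.

import torch

def clinical_contrastive_decode(base_logits, expert_logits, alpha=1.0, tau=0.1):
    # Plausibility mask: keep only tokens the base model itself finds
    # reasonably likely (a standard contrastive-decoding safeguard).
    probs = torch.softmax(base_logits, dim=-1)
    keep = probs >= tau * probs.max()
    # Shift scores toward tokens the clinical expert model favors.
    scores = base_logits + alpha * (expert_logits - base_logits)
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.argmax(scores)  # greedy next token; sampling also works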

CoDA: Coding LM via Diffusion Adaptation

Published at 2025-09-27

#ML

This study presents CoDA, a lightweight and efficient 1.7B-parameter diffusion model for coding that matches or outperforms larger models on various benchmarks and provides open-source tools to promote further research in this area....

Read More

Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs

Published at 2025-09-28

#ML

The study presents Fathom-DeepResearch, a system with two specialized models for agentic applications. The first model, Fathom-Search-4B, is trained for evidence-based investigation using live web search and targeted webpage querying, while the second model, Fathom-Synthesizer-4B, converts search traces into structured reports. This system outperforms existing ones in various reasoning tasks, including open-ended information-seeking tasks....

Read More

CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

Published at 2025-09-29

#ML

The study presents a new framework called CARE that focuses on improving cognitive reasoning in emotional support conversations, without relying on large synthetic datasets. CARE uses the original training set to guide models in generating coherent and supportive responses, and further refines the process using reinforcement learning, resulting in more logical and empathetic responses....

Read More

Fast-dLLM v2: Efficient Block-Diffusion LLM

Published at 2025-09-30

#ML

The authors present Fast-dLLM v2, a block diffusion language model that adapts pretrained autoregressive models for parallel text generation with just 1B tokens of fine-tuning, compared to 580B tokens for full-attention diffusion LLMs like Dream. This model uses a novel training recipe, a hierarchical caching mechanism, and a parallel decoding pipeline to achieve up to 2.5x faster decoding than standard autoregressive models without sacrificing quality, making it a promising solution for practical deployment....

Read More

Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Published at 2025-10-01

#ML

The study presents WebDetective, a new benchmark for evaluating deep search tasks in RAG systems and web agents, addressing the limitations of current evaluation methods by using hint-free multi-hop questions and a holistic evaluation framework. The evaluation of 25 state-of-the-art models reveals their inability to discover reasoning chains autonomously, leading to the development of EvidenceLoop, an agentic workflow that improves both search and synthesis capabilities....

Read More

HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation

Published at 2025-10-01

#ML

The authors propose HalluGuard, a small reasoning model that helps reduce false information in AI-generated content by using a synthetic dataset and preference-based fine-tuning. This model performs as well as larger, more specialized models in detecting and explaining false information, making it a promising tool for improving the reliability of AI-generated content....

Read More

Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

Published at 2025-10-02

#ML

The authors present a new generative modeling framework called Equilibrium Matching (EqM) that learns an equilibrium gradient of an implicit energy landscape, unlike traditional diffusion and flow-based models. EqM offers improved generation performance, empirically outperforming other models and theoretically learning the data manifold, while also handling tasks like denoising, OOD detection, and image composition....

Read More
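
Since EqM learns a gradient field of an implicit energy landscape, sampling reduces to gradient descent. A rough sketch, where grad_fn is an assumed trained network approximating that gradient and the step count, step size, and optional noise are illustrative choices:

import torch

@torch.no_grad()
def eqm_sample(grad_fn, shape, steps=200, step_size=0.01, noise=0.0):
    x = torch.randn(shape)                  # start from Gaussian noise
    for _ in range(steps):
        x = x - step_size * grad_fn(x)      # descend the learned energy
        if noise > 0:                       # optional Langevin-style jitter
            x = x + noise * torch.randn_like(x)
    return x                                # x settles near an equilibrium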

OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows

Published at 2025-10-03

#ML

OneFlow is a new model that allows for generating text and images simultaneously and independently, unlike previous models that had a strict order. This model is efficient, performs better than previous models, and can be used for new tasks like concurrent generation and iterative refinement....

Read More

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

Published at 2025-10-03

#ML

VeriGuard is a new framework that ensures the safety of AI agents in sensitive areas like healthcare by providing formal guarantees through a two-stage process. It first validates agent behavior offline by clarifying user intent, synthesizing a policy, and verifying it, then monitors online actions to ensure they adhere to the pre-verified policy....

Read More

No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

Published at 2025-10-04

#ML

The research presents BIOMEDICA-LongCAP, a dataset of 1 million image-caption pairs with longer descriptions, and BMC-LongCLIP, a biomedical vision-language model that uses this dataset. BMC-LongCLIP can handle longer contexts, improving retrieval and classification tasks and reducing wasted tokens....

Read More

A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling

Published at 2025-10-05

#ML

The authors present a new framework for training reward models to distinguish between good and bad responses, addressing a reliability gap in preference alignment techniques. They introduce an adaptive inference strategy that reduces reliability failures and improves inference speed, demonstrated in the IMDB-sentiment setting....

Read More
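
For readers unfamiliar with Best-of-N sampling, here is a minimal sketch, with an early-exit twist loosely inspired by the paper's adaptive inference strategy; generate and reward_model are assumed callables, not APIs from the paper.

def best_of_n(prompt, generate, reward_model, n=8, stop_score=None):
    # Sample up to n candidates and keep the highest-scoring one.
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate(prompt)
        score = reward_model(prompt, candidate)
        if score > best_score:
            best, best_score = candidate, score
        # Adaptive early exit: stop once a candidate is clearly good enough.
        if stop_score is not None and best_score >= stop_score:
            break
    return best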

Drax: Speech Recognition with Discrete Flow Matching

Published at 2025-10-05

#ML

The authors present Drax, a new framework for automatic speech recognition that uses discrete flow matching to enable efficient parallel decoding. They improve upon previous methods by creating a probability path guided by audio conditions, which reduces errors and improves the accuracy-efficiency trade-off, making it a promising approach for non-autoregressive ASR....

Read More

Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

Published at 2025-10-05

#ML

The authors present Caco, a framework that uses code to create high-quality, verifiable, and diverse reasoning data for language models. Caco automates the process of generating reasoning traces, validates them through code execution, and converts them into natural language instructions, resulting in a scalable and trustworthy reasoning system that outperforms existing baselines....

Read More

AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems

Published at 2025-10-06

#ML

This study tests if AI models can solve AI research problems using only their pre-existing knowledge, without additional training or tools. The results show that while these models can sometimes find feasible solutions and propose new ideas, their problem-solving ability is still quite weak and easily influenced by how the problem is presented, revealing both their potential and their current limitations....

Read More

BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Published at 2025-10-06

#ML

The authors present BIRD-INTERACT, a new benchmark for testing text-to-SQL models in a realistic, multi-turn setting. This benchmark includes a complex interaction environment, two evaluation settings, and a challenging task suite that covers the full range of database operations. The results show that even advanced models struggle with the benchmark, emphasizing the importance of effective interaction for complex text-to-SQL tasks....

Read More

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

Published at 2025-10-06

#ML

The researchers developed a new system called ChartAgent that can understand and answer questions about complex charts by interacting with the charts visually, like a human would. This system outperforms previous methods, especially on charts without annotations or those requiring precise visual interpretation, and works well with different types of charts and language models....

Read More

GRACE: Generative Representation Learning via Contrastive Policy Optimization

Published at 2025-10-06

#ML

GRACE is a new framework that makes Large Language Models (LLMs) more interpretable by using contrastive signals as rewards to guide a generative policy that produces human-understandable rationales. This method improves the overall score by 11.5% in the supervised setting and adds 6.9% in the unsupervised variant on the MTEB benchmark, while maintaining its general capabilities....

Read More

Less is More: Recursive Reasoning with Tiny Networks

Published at 2025-10-06

#ML

A new method called Tiny Recursive Model (TRM) uses a single small neural network with only 2 layers to solve challenging puzzles like Sudoku and mazes more effectively than the earlier Hierarchical Reasoning Model (HRM). TRM, with just 7 million parameters, outperforms most large language models on ARC-AGI tasks while using significantly fewer parameters....

Read More
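
A hedged sketch of the recursive idea in PyTorch; the dimensions, update rule, and loop counts below are illustrative assumptions, not TRM's exact architecture.

import torch
import torch.nn as nn

class TinyRecursiveModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # One tiny 2-layer core, reused recursively instead of stacking depth.
        self.core = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))
        self.to_answer = nn.Linear(dim, dim)

    def forward(self, x, n_outer=3, n_inner=6):
        z = torch.zeros_like(x)   # latent reasoning state
        y = torch.zeros_like(x)   # current answer embedding
        for _ in range(n_outer):
            for _ in range(n_inner):                      # refine the state
                z = self.core(torch.cat([x, y, z], dim=-1))
            y = y + self.to_answer(z)                     # refine the answer
        return y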

Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Published at 2025-10-06

#ML

The authors propose a new strategy called Exploratory Annealed Decoding (EAD) for improving exploration in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). EAD dynamically adjusts the sampling temperature during generation, promoting diversity at the start and preserving sample quality towards the end, resulting in more stable training and better performance compared to fixed-temperature sampling....

Read More
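
The mechanism is simple to sketch: anneal the sampling temperature from hot to cold over the generated sequence. The linear schedule below is an assumption; the paper may use a different shape.

import torch

def annealed_temperatures(n_tokens, t_start=1.2, t_end=0.3):
    # Hot early (diverse prefixes), cold late (coherent completions).
    return torch.linspace(t_start, t_end, n_tokens)

def sample_token(logits, temperature):
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)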

LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation

Published at 2025-10-06

#ML

The paper proposes three methods to reduce memory usage in video generation models: Asynchronous Cache Swapping, Feature Chunk, and Slicing Latents to Decode. These methods increase inference speed and lower memory consumption without significantly affecting quality....

Read More

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Published at 2025-10-06

#ML

The study presents MADPO, a new method for preference optimization in large language models that overcomes the limitations of previous approaches by using a reward model to estimate preference margins and applying adaptive weights to the loss for each training sample, resulting in improved performance and robustness....

Read More
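
A minimal sketch of the idea: the usual DPO loss, reweighted per sample by a reward-model-estimated margin. The sigmoid weighting below is my assumption for illustration, not the paper's exact scheme.

import torch
import torch.nn.functional as F

def madpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
               reward_margin, beta=0.1, gamma=1.0):
    # Standard DPO logit: implicit reward gap versus the reference model.
    dpo_logit = beta * ((logp_chosen - ref_chosen)
                        - (logp_rejected - ref_rejected))
    per_sample = -F.logsigmoid(dpo_logit)
    # Adaptive weight: clear-cut pairs (large margin) push harder.
    weight = torch.sigmoid(gamma * reward_margin)
    return (weight * per_sample).mean()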

Scalable In-context Ranking with Generative Models

Published at 2025-10-06

#ML

This study presents a new method called BlockRank that improves the efficiency and effectiveness of In-context Ranking (ICR) in Information Retrieval (IR) by exploiting the attention structure of large language models (LLMs). The method enforces inter-document block sparsity and optimizes query-document block relevance, reducing attention complexity and improving retrieval performance, as demonstrated by experiments on various benchmarks....

Read More
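
The inter-document block sparsity is essentially an attention mask. A toy construction under my own layout assumptions (documents first, query block last, no shared prefix), not the paper's exact attention recipe:

import torch

def blockrank_mask(doc_lens, query_len):
    total = sum(doc_lens) + query_len
    mask = torch.zeros(total, total, dtype=torch.bool)  # True = may attend
    start = 0
    for n in doc_lens:
        # Each candidate document attends only within its own block.
        mask[start:start + n, start:start + n] = True
        start += n
    mask[start:, :] = True  # the query block attends to every document
    return mask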

TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

Published at 2025-10-06

#ML

The authors present a new method called TensorBLEU that significantly speeds up the evaluation of large-scale natural language processing models during training, enabling faster research progress. By using GPU-accelerated, per-sentence computation within PyTorch and a memory-efficient mechanism, TensorBLEU offers over 13x speedup on consumer-grade GPUs and over 40x on data-center-class hardware compared to traditional CPU-based methods....

Read More
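
To see why this vectorizes well, here is a toy per-sentence n-gram precision written entirely in PyTorch; it is far cruder than TensorBLEU's memory-efficient counting, the clipping is approximate, and both function names are illustrative.

import torch

def ngrams(ids, n):
    # (batch, seq) -> (batch, seq - n + 1, n) sliding windows of token ids.
    return ids.unfold(dimension=1, size=n, step=1)

def rough_ngram_precision(hyp, ref, n=2):
    h, r = ngrams(hyp, n), ngrams(ref, n)
    # match[b, i, j]: hypothesis n-gram i equals reference n-gram j.
    match = (h.unsqueeze(2) == r.unsqueeze(1)).all(dim=-1)
    hits = match.any(dim=2).float().sum(dim=1)  # each hyp n-gram counted once
    return hits / h.shape[1]                    # per-sentence precision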

ASPO: Asymmetric Importance Sampling Policy Optimization

Published at 2025-10-07

#ML

The researchers found an issue in the Outcome-Supervised Reinforcement Learning method used for training large language models, where positive-advantage tokens were not weighted correctly, leading to unbalanced updates. They proposed a new method called Asymmetric Importance Sampling Policy Optimization (ASPO) that fixes this issue by flipping the importance sampling ratios of positive-advantage tokens and using a soft dual-clipping mechanism to stabilize updates. ASPO significantly improves training stability and performance....

Read More
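
A loose sketch of the described fix; the exact flipping and soft dual-clipping formulas here are assumptions based on this summary, not ASPO's published equations.

import torch

def aspo_surrogate(logp_new, logp_old, adv, eps=0.2, cap=3.0):
    ratio = torch.exp(logp_new - logp_old)
    # Flip the importance-sampling ratio for positive-advantage tokens,
    # softly capped so a single token cannot dominate the update.
    flipped = torch.clamp(1.0 / ratio, max=cap)
    r = torch.where(adv > 0, flipped, ratio)
    # PPO-style pessimistic (clipped) surrogate on the chosen ratio.
    surrogate = torch.minimum(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv)
    return -surrogate.mean()  # minimize the negative objective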

Adaptive Pruning for Increased Robustness and Reduced Computational Overhead in Gaussian Process Accelerated Saddle Point Searches

Published at 2025-10-07

#ML

This study presents a method to improve the efficiency and stability of Gaussian process regression in accelerating saddle point searches on high-dimensional energy surfaces by using geometry-aware optimal transport measures, active pruning, and a permutation-invariant metric, resulting in a significant reduction of computational time....

Read More

Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks

Published at 2025-10-07

#ML

The authors created a large, annotated dataset of scatterplots to test AI models' performance on tasks related to data visualization. They found that OpenAI models and Google's Gemini 2.5 Flash were good at counting clusters and, in some cases, outliers, but struggled with precise location identification, and that wide aspect ratios or randomly colored points negatively affected performance....

Read More

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

Published at 2025-10-07

#ML

The study presents a new method called FlowRVS for referring video object segmentation, which improves upon existing techniques by treating the task as a conditional continuous flow problem. This approach allows for better alignment between text and video, finer control over pixel details, and more consistent segmentation over time, resulting in state-of-the-art performance on major benchmarks....

Read More

Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Published at 2025-10-07

#ML

The authors present MeDiM, a new model that integrates information from different medical sources like images, text, and clinical notes. By using a shared probabilistic space and a multimodal large language model, MeDiM can generate high-quality medical data and improve downstream tasks, leading to more coherent and clinically accurate outputs....

Read More

Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Published at 2025-10-07

#ML

The study presents a framework called Distributional Semantics Tracing to investigate why large language models sometimes generate incorrect information. The researchers identify a specific layer in the model where this happens and explain it using a theory of two different thinking processes, leading to predictable errors....

Read More

EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Published at 2025-10-07

#ML

The authors present EgoNight, a new benchmark for egocentric vision understanding at night, which includes day-night aligned videos to improve night annotation quality and reveal performance gaps between lighting conditions. The benchmark contains over 3000 QA pairs and supports tasks like visual question answering, day-night correspondence retrieval, and depth estimation. Tests show that current models struggle in low-light conditions, highlighting the need for further research in this area....

Read More

HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video

Published at 2025-10-07

#ML

The authors present a new method called HoloScene that creates detailed, interactive 3D environments from a single video, which is useful for various applications like gaming and robotics. This technique improves upon existing methods by accurately capturing object geometry, appearance, and physical properties, resulting in realistic and stable digital replicas....

Read More

Human3R: Everyone Everywhere All at Once

Published at 2025-10-07

#ML

Human3R is a new framework that can reconstruct multiple humans and their surroundings in real-time from a single video, using a simple and efficient method that doesn't require multiple stages or complex computations. It achieves high performance with a low memory footprint, making it a strong baseline for future applications....

Read More

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Published at 2025-10-07

#ML

This study presents AgentFlow, a new framework that improves performance in tasks requiring planning, tool use, and reasoning by optimizing a planner within a multi-turn loop and coordinating specialized modules through an evolving memory. The proposed Flow-based Group Refined Policy Optimization method allows for efficient training in live environments, leading to significant improvements across various benchmarks compared to existing models....

Read More

MixReasoning: Switching Modes to Think

Published at 2025-10-07

#ML

The authors propose MixReasoning, a framework that adjusts the depth of reasoning based on the complexity of each step, leading to more efficient problem-solving without compromising accuracy, as demonstrated by experiments on GSM8K, MATH-500, and AIME....

Read More

Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Published at 2025-10-07

#ML

The study explores how language models retrieve bound entities in-context and finds that while a positional mechanism works for short lists, it fails for longer ones. The models then compensate with lexical and reflexive mechanisms, which the study uses to build a causal model that accurately estimates next-token distributions. This model also works for longer and more complex inputs, providing a better understanding of how language models handle entity retrieval....

Read More

Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

Published at 2025-10-07

#ML

The study presents EvoPresent, a framework that uses self-improvement agents to create engaging and visually appealing academic presentations. It includes PresAesth, a multi-task reinforcement learning model that provides feedback for improvement, and the EvoPresent Benchmark for evaluating presentation quality and aesthetic awareness. The research findings emphasize the importance of high-quality feedback, the trade-off between visual design and content, and the benefits of multi-task RL training....

Read More

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Published at 2025-10-07

#ML

The study explores why reasoning models sometimes fail to follow safety guidelines. They discovered that these models can identify harmful prompts but suppress their refusal intentions before output. By identifying and adjusting a small number of attention heads, they improved the models' safety, using only a small portion of the training data....

Read More

Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions

Published at 2025-10-07

#ML

This research questions traditional methods in speech emotion recognition, which disregard the subjectivity and ambiguity of human emotions. The study proposes new modeling and evaluation approaches, such as retaining all emotional ratings, redefining SER evaluation to include co-occurring emotions, and constructing a penalization matrix to improve performance. Experiments on various databases show that these methods create more robust and human-aligned SER systems....

Read More

Scientific Algorithm Discovery by Augmenting AlphaEvolve with Deep Research

Published at 2025-10-07

#ML

The study presents DeepEvolve, an agent that combines algorithm evolution and deep research to discover scientific algorithms. DeepEvolve uses external knowledge, cross-file code editing, and systematic debugging in a feedback loop to propose, refine, implement, and test new hypotheses, leading to executable and improved algorithms in various scientific domains....

Read More

ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

Published at 2025-10-07

#ML

This study presents a new method to create high-quality 4D shapes from videos by using a framework that combines large-scale pre-trained 3D models and three key components. These components help in accurately capturing the movement and changes in the video without needing to optimize each frame individually, resulting in improved robustness and visual accuracy....

Read More

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Published at 2025-10-07

#ML

The authors present TaTToo, a new framework for enhancing large reasoning models in tabular reasoning domains. By integrating tool-based verification and table-grounded reasoning, TaTToo significantly improves model performance and generalizability across various tasks and strategies, outperforming larger baseline models....

Read More

The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

Published at 2025-10-07

#ML

Researchers studied the impact of distillation data quantity on the performance of two small non-reasoning language models in coding tasks, discovering a 'valley of code reasoning' where performance initially drops then sharply rises with more data. They also found that smaller models benefit more from easier coding questions than harder ones in low and medium-low data regimes, and the correctness of training data outputs doesn't affect distillation outcomes....

Read More

Training Dynamics Impact Post-Training Quantization Robustness

Published at 2025-10-07

#ML

This study explores the relationship between training dynamics and quantization performance in large language models, finding that learning rate and other hyperparameters significantly impact quantization errors. The research introduces controlled experiments to identify specific configurations that can enhance quantization robustness, challenging the assumption that larger datasets necessarily reduce quantization effectiveness....

Read More

Verifier-free Test-Time Sampling for Vision Language Action Models

Published at 2025-10-07

#ML

The authors present a new method called MG-Select for Vision-Language-Action models that enhances performance in tasks requiring precision without needing additional training or external tools. By using a reference distribution generated with masked states and language conditions, MG-Select selects the best action based on confidence, leading to significant improvements in both in-distribution and out-of-distribution tasks, as well as a substantial gain in RoboCasa pick-and-place tasks....

Read More
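
One way to picture the selection rule, under my reading of the summary: score each sampled action candidate by the divergence between the policy's distribution and a reference distribution computed from masked states and language, then pick the most confident candidate. Using KL divergence as the confidence measure here is an assumption.

import torch
import torch.nn.functional as F

def mg_select(candidate_logits, masked_logits):
    # candidate_logits, masked_logits: (num_candidates, action_dim)
    logp = F.log_softmax(candidate_logits, dim=-1)
    ref = F.log_softmax(masked_logits, dim=-1)
    kl = (logp.exp() * (logp - ref)).sum(dim=-1)  # per-candidate confidence
    return int(torch.argmax(kl))                  # index of the chosen action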

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit the Developer's Social Media

Fb X In