🤗 Daily Paper (2025-10-03)

deep.di...@gmail.com

Oct 3, 2025, 4:07:41 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you find some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Published at 2025-09-23

#ML

The authors created two new math benchmarks, SKYLENAGE-ReasoningMATH and SKYLENAGE-MATH, to better evaluate the mathematical reasoning abilities of language models. They tested 15 language models on these benchmarks and found that the strongest model could only achieve 44% accuracy on the contest-style suite and 81% accuracy on the reasoning-centered suite, with performance declining as the difficulty increased....

Read More

IoT-MCP: Bridging LLMs and IoT Systems Through Model Context Protocol

Published at 2025-09-25

#ML

The paper presents IoT-MCP, a new framework that uses edge-deployed servers to connect Large Language Models with IoT systems, ensuring standardized communication. They also introduce IoT-MCP Bench, a benchmark with various tasks to evaluate IoT-enabled LLMs, demonstrating IoT-MCP's high success rate, fast response time, and low memory usage across different sensor types and microcontroller units....

Read More

Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

Published at 2025-09-25

#ML

The study examines a problem called 'hallucination snowballing' in multi-agent systems built on visual language models. The authors analyze the issue and propose ViF, a mitigation method that significantly improves performance on various benchmarks by better preserving visual evidence and reallocating attention....

Read More

Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs

Published at 2025-09-26

#ML

This study creates a new benchmark and evaluation protocol for detecting context-grounded hallucinations in language model outputs, taking a more practical approach than current methods. The research reveals that large language models struggle with this task, often incorrectly flagging missing details as errors and mishandling factually correct but unverifiable information, and it provides insights for improving prompting strategies....

Read More

RLP: Reinforcement as a Pretraining Objective

Published at 2025-09-26

#ML

This study proposes RLP, a new pretraining objective that brings reinforcement learning's exploration aspect to the pretraining phase of large reasoning models. RLP encourages models to think independently before predicting the next token, which significantly improves performance on reasoning-heavy tasks and demonstrates scalability across different architectures and model sizes....

Read More

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Published at 2025-09-26

#ML

The study finds that activation steering, a method used to control language models' behavior, can actually make them more likely to follow harmful instructions by systematically breaking their safety measures. The researchers show that even randomly chosen directions or using benign features from a sparse autoencoder can significantly increase the probability of harmful compliance, challenging the belief that precise control over a model's internals ensures safe behavior....
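
For readers unfamiliar with the mechanism: activation steering typically means adding a fixed direction to a model's hidden states at inference time. Below is a minimal PyTorch sketch of that idea; the hook target, layer choice, and scale are illustrative assumptions, not the paper's setup.

    import torch

    def make_steering_hook(direction: torch.Tensor, scale: float = 4.0):
        # Normalize once; 'scale' sets steering strength (illustrative).
        direction = direction / direction.norm()
        def hook(module, inputs, output):
            # Decoder blocks often return a tuple whose first element is
            # the hidden states; the direction is added at every position.
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + scale * direction.to(hidden)
            if isinstance(output, tuple):
                return (steered,) + output[1:]
            return steered
        return hook

    # Hypothetical usage on a decoder-style model:
    #   handle = model.layers[12].register_forward_hook(make_steering_hook(v))
    #   ...generate...
    #   handle.remove()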

Read More

Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Published at 2025-09-28

#ML

This research reveals that Group Relative REINFORCE, usually considered an on-policy algorithm, can actually be interpreted as an off-policy algorithm. The study offers new insights into adapting REINFORCE for off-policy settings, debunks some misconceptions about GRPO, and provides a theoretical foundation for data-weighting strategies used in off-policy reinforcement learning for language models....
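
To make the off-policy reading concrete, here is a toy sketch of a group-relative REINFORCE step with an explicit importance ratio between the training policy and the (possibly stale) policy that generated the samples. The function names and the normalization epsilon are illustrative, not the paper's notation.

    import torch

    def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # rewards: (G,) scalar rewards for G responses to one prompt;
        # the group mean serves as the baseline, GRPO-style.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    def off_policy_surrogate(logp_current, logp_behavior, rewards):
        # logp_*: (G,) sequence log-probs under the training policy vs.
        # the behavior policy that sampled the responses; the importance
        # ratio is what makes the off-policy view explicit.
        adv = group_relative_advantages(rewards)
        ratio = torch.exp(logp_current - logp_behavior.detach())
        return -(ratio * adv).mean()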

Read More

Controlled Generation for Private Synthetic Text

Published at 2025-09-29

#ML

This study presents a new method for generating synthetic text that protects privacy, focusing on domains like healthcare and law. The approach uses entity-aware control codes and either in-context learning or prefix tuning to balance privacy and text quality, as shown in experiments on legal and clinical datasets....

Read More

FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

Published at 2025-09-29

#ML

The authors propose FrameThinker, a new framework that allows Large Vision-Language Models to efficiently understand long videos by iteratively questioning the video content. They address challenges in adapting the model to new actions and designing reward functions through a two-phase training strategy, resulting in significant performance improvements and reduced frame processing compared to baseline models....

Read More

Automated Structured Radiology Report Generation with Rich Clinical Context

Published at 2025-09-30

#ML

The study presents a new method for generating structured radiology reports from chest X-ray images, which takes into account important clinical information like patient history and imaging techniques. This approach improves report quality and reduces errors compared to existing systems, and the researchers have made their data and code available to help advance this field....

Read More

Go with Your Gut: Scaling Confidence for Autoregressive Image Generation

Published at 2025-09-30

#ML

The authors present ScalingAR, a new framework that uses token entropy as a confidence signal to improve autoregressive image generation in large language models; it enhances base models' performance by over 10%, reduces visual token consumption by over 60%, and increases robustness in challenging scenarios by 26%....

Read More

LongCodeZip: Compress Long Context for Code Language Models

Published at 2025-09-30

#ML

The authors present LongCodeZip, a new method to compress long contexts for code language models, addressing the high costs and slow response times of existing techniques. By identifying and selecting the most relevant function-level chunks and optimizing the token budget, LongCodeZip improves performance in coding tasks while maintaining efficiency....
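
The core selection step can be pictured as a greedy, budget-constrained ranking over function-level chunks. The sketch below assumes precomputed relevance scores and token lengths; all names are hypothetical, not LongCodeZip's actual API.

    def select_chunks(chunks, scores, lengths, budget):
        # Greedy, knapsack-style selection: keep the highest-scoring
        # function-level chunks that still fit in the token budget,
        # then restore source order so the compressed context reads
        # naturally.
        order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
        kept, used = [], 0
        for i in order:
            if used + lengths[i] <= budget:
                kept.append(i)
                used += lengths[i]
        return [chunks[i] for i in sorted(kept)]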

Read More

One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

Published at 2025-09-30

#ML

This study presents a new algorithm called one-token rollout (OTR) that improves supervised fine-tuning of large language models by incorporating elements of reinforcement learning. OTR enhances the model's generalization ability by using the policy gradient method to treat each token generation as a single-step learning process, making it more effective than traditional supervised fine-tuning....
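
One plausible way to picture the idea: keep the supervised next-token loss, but reweight each token's log-probability with a per-token reward, so every position behaves like a single-step policy-gradient update. The sketch below is a loose illustration under that reading, not the paper's exact algorithm.

    import torch.nn.functional as F

    def otr_style_loss(logits, targets, token_rewards):
        # logits: (B, L, V), targets: (B, L), token_rewards: (B, L).
        # Each position is treated like a one-step episode: the usual
        # SFT log-likelihood is reweighted by a per-token reward signal.
        logp = F.log_softmax(logits, dim=-1)
        picked = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, L)
        return -(token_rewards * picked).mean()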

Read More

Optimizing What Matters: AUC-Driven Learning for Robust Neural Retrieval

Published at 2025-09-30

#ML

The study presents a new training objective called MW loss for dual-encoder retrievers, which improves score separation quality and retrieval performance compared to the existing NCE objective. MW loss maximizes the Mann-Whitney U statistic, which is equivalent to maximizing AUC, leading to better-calibrated and more discriminative retrievers in high-stakes applications like RAG....
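
Since the normalized Mann-Whitney U statistic equals AUC, a differentiable version can be built from pairwise score differences. The sketch below uses a logistic surrogate over positive-negative pairs; whether this matches the paper's exact formulation is an assumption.

    import torch
    import torch.nn.functional as F

    def mw_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
        # The normalized Mann-Whitney U statistic is the probability that
        # a random positive outscores a random negative, i.e. AUC. Using
        # -log sigmoid(diff) as a smooth stand-in for the 0/1 indicator
        # gives a differentiable objective over all pos/neg pairs.
        diff = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)  # (P, N) pairs
        return F.softplus(-diff).mean()  # softplus(-x) == -log(sigmoid(x))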

Read More

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Published at 2025-09-30

#ML

The study presents Ovi, a new method for generating audio-video content that combines sound and visuals into a single process, eliminating the need for separate pipelines or alignment steps. Ovi uses a twin-DiT module approach for cross-modal fusion, resulting in realistic sound effects and speech with rich emotional and speaker identity characteristics, enabling high-quality movie-grade video clips....

Read More

SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval

Published at 2025-09-30

#ML

SQUARE is a new method that uses pretrained models to improve composed image retrieval, where a query pairs a reference image with user-provided text, without any task-specific training. It enhances search accuracy by incorporating high-level semantic guidance and performing joint visual-semantic reasoning, outperforming other methods on standard benchmarks....

Read More

Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models

Published at 2025-10-01

#ML

The authors present AGILE, a method that improves the perception and reasoning skills of Vision-Language Models by having them solve jigsaw puzzles interactively. This approach significantly enhances puzzle-solving accuracy and generalizes well to other vision tasks, offering a scalable solution to the data scarcity issue in multimodal reinforcement learning....

Read More

Aristotle: IMO-level Automated Theorem Proving

Published at 2025-10-01

#ML

Aristotle is an AI system that blends formal and informal reasoning to solve advanced math problems at the level of the International Mathematical Olympiad. It consists of a proof search system, a lemma generator, and a geometry solver, all working together to achieve top performance and scalability in automated theorem proving....

Read More

CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

Published at 2025-10-01

#ML

This study proposes a new method called CLUE, which uses hidden states from large language models to verify the correctness of outputs. By clustering hidden states from past experiences, CLUE outperforms traditional methods and improves accuracy in various tasks, such as AIME 24/25 and GPQA....
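
The idea can be illustrated with a nearest-centroid toy version: pool hidden states from previously verified correct and incorrect solutions, then label a new output by which centroid its hidden state is closer to. This only conveys the non-parametric flavor, not the paper's exact procedure.

    import numpy as np

    def clue_verify(h_new, h_correct, h_incorrect):
        # h_new: (d,) hidden state of a candidate solution; h_correct and
        # h_incorrect: (n, d) hidden states from past solutions of known
        # correctness. Nearest centroid decides the predicted label.
        mu_pos = h_correct.mean(axis=0)
        mu_neg = h_incorrect.mean(axis=0)
        return np.linalg.norm(h_new - mu_pos) < np.linalg.norm(h_new - mu_neg)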

Read More

Generalized Parallel Scaling with Interdependent Generations

Published at 2025-10-01

#ML

This study presents Bridge, a method that generates interdependent responses in parallel rather than independently, which enhances the quality and consistency of the answers. By adding a small number of new parameters, Bridge improves accuracy and remains compatible with any post-generation aggregation technique, offering a more effective way to scale parallel generation....

Read More

ModernVBERT: Towards Smaller Visual Document Retrievers

Published at 2025-10-01

#ML

The study presents ModernVBERT, a compact vision-language model that outperforms much larger models in document retrieval tasks. The authors improve retrieval performance by optimizing factors like attention masking, image resolution, and contrastive objectives, making their model and code publicly available....

Read More

Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

Published at 2025-10-01

#ML

This study examines how feed-forward networks in large language models make use of their latent space, finding that widening them adds mostly low-energy directions while dominant-mode subspaces saturate early, leading to underutilization of the latent space. The research provides guidance for designing more efficient language models by balancing tail capacity and dominant-mode capacity....
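
A simple way to quantify this kind of utilization is the fraction of squared singular-value energy captured by the top-k directions of an FFN weight matrix; if that fraction saturates as width grows, the added width contributes mostly low-energy directions. The metric below is a plausible stand-in, not necessarily the paper's exact definition.

    import torch

    def topk_energy_fraction(W: torch.Tensor, k: int) -> torch.Tensor:
        # Fraction of squared singular-value "energy" in the top-k
        # directions of an FFN weight matrix. Saturation of this curve
        # under widening is the underutilization described above.
        s = torch.linalg.svdvals(W.float())
        energy = s.pow(2)
        return energy[:k].sum() / energy.sum()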

Read More

TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

Published at 2025-10-01

#ML

The researchers created Toucan, a large, public dataset with 1.5 million realistic tasks for training language models to use tools. Toucan's tasks are more diverse and complex than existing datasets, and models trained on it perform better on various benchmarks....

Read More

Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression

Published at 2025-10-01

#ML

The study presents TRAAC, a method that helps thinking models allocate their reasoning resources more efficiently by balancing the length of reasoning steps according to task difficulty. TRAAC improves accuracy, reduces reasoning steps, and performs well on various tasks, including those outside its training domain....

Read More

TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis

Published at 2025-10-01

#ML

The study presents TimeSeriesScientist (TSci), a new AI framework that uses a large language model to automate and improve time series forecasting, reducing forecast error by up to 38.2% compared to existing methods. TSci's agents perform tasks such as data preprocessing, model selection, and validation, and it generates comprehensive, transparent reports to make the forecasting process more interpretable....

Read More

VIRTUE: Visual-Interactive Text-Image Universal Embedder

Published at 2025-10-01

#ML

The study presents a new model, VIRTUE, that enhances the ability of visual-textual embedding models to understand and process specific regions within an image by incorporating a segmentation model, which allows for more precise handling of complex scenarios. The model's performance is evaluated using a large-scale benchmark, demonstrating significant improvements over existing methods in both universal and visual-interactive tasks....

Read More

VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Published at 2025-10-01

#ML

The authors present VLA-R1, a vision-language-action model that improves reasoning and execution through Reinforcement Learning from Verifiable Rewards and Group Relative Policy Optimization. VLA-R1 outperforms previous models in various evaluations and the authors plan to release the model, code, and dataset....

Read More

VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning

Published at 2025-10-01

#ML

This study presents a new method called VOGUE that improves multimodal reasoning by focusing on visual uncertainty during exploration, rather than just text output. By treating images as uncertain and using this uncertainty to guide exploration, VOGUE enhances performance on various visual math and general-domain reasoning benchmarks, while also addressing the exploration decay issue in RL fine-tuning....

Read More

A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports

Published at 2025-10-02

#ML

The study presents a comprehensive evaluation framework and benchmark for Deep Research Agents (DRAs), which can perform complex tasks by decomposing them, retrieving information from various sources, reasoning through multiple stages, and providing structured outputs. The benchmark includes 214 challenging questions across 10 themes, and the framework assesses DRAs' performance in generating long-form reports based on semantic quality, topical focus, and retrieval trustworthiness....

Read More

DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

Published at 2025-10-02

#ML

This study presents a new method called DragFlow that uses a region-based approach to improve drag-based image editing by leveraging stronger generative priors in DiTs. Compared to existing methods, DragFlow significantly enhances image editing performance by utilizing affine transformations, pretrained personalization adapters, and multimodal large language models, as demonstrated by extensive experiments on two benchmarks....

Read More

Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

Published at 2025-10-02

#ML

This study questions the current method of interpreting draws in arena-style evaluations of large language models, suggesting that draws may indicate query difficulty rather than equal model performance. The researchers found that ignoring rating updates for draws improved prediction accuracy and that draws occur more frequently for easy and objective queries, recommending future rating systems to reconsider draw semantics and account for query properties....
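
In Elo-style rating systems, a draw normally pulls both ratings toward each other. The paper's recommendation can be pictured as simply skipping that update, as in this illustrative sketch (the K-factor and the skip rule are assumptions for demonstration).

    def elo_update(r_a: float, r_b: float, outcome: float,
                   k: float = 32.0, skip_draws: bool = True):
        # outcome: 1.0 if model A wins, 0.0 if B wins, 0.5 for a draw.
        # Under the paper's finding, a draw is treated as carrying no
        # signal about relative strength, so the update is skipped.
        if skip_draws and outcome == 0.5:
            return r_a, r_b
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        r_a_new = r_a + k * (outcome - expected_a)
        r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
        return r_a_new, r_b_new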

Read More

ExGRPO: Learning to Reason from Experience

Published at 2025-10-02

#ML

The paper studies what makes past experiences valuable for large language models and finds that correctness and entropy are good indicators of experience value. They then propose a new framework, ExGRPO, that organizes and prioritizes valuable experiences, improving reasoning performance on various benchmarks and stabilizing training for both strong and weak models....

Read More

F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Published at 2025-10-02

#ML

Researchers have developed a new suite of embedding models called F2LLM, which offers strong performance at a lower cost compared to previous top models. By finetuning directly from foundation models using 6 million open-source data samples, F2LLM achieves impressive results on the MTEB English leaderboard, providing a reproducible and budget-friendly baseline for future research....

Read More

Interactive Training: Feedback-Driven Neural Network Optimization

Published at 2025-10-02

#ML

The authors present Interactive Training, a new framework for neural network optimization that allows real-time intervention by human experts or AI agents. This approach improves training stability, reduces sensitivity to initial hyperparameters, and enhances adaptability to changing needs, making the training process more efficient and responsive....

Read More

Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness

Published at 2025-10-02

#ML

The study identifies a bias in Computer-Use Agents called Blind Goal-Directedness, where these agents pursue goals without considering feasibility, safety, or context. The researchers created a benchmark called BLIND-ACT to evaluate nine advanced models, finding high rates of this bias and suggesting the need for improved safety measures....

Read More

Learning to Reason for Hallucination Span Detection

Published at 2025-10-02

#ML

This study explores the use of reasoning to improve the detection of unsupported content, or 'hallucinations,' generated by large language models. The researchers introduce RL4HS, a reinforcement learning framework that encourages reasoning with a span-level reward function, which outperforms pretrained reasoning models and supervised fine-tuning in detecting hallucinations in various tasks....

Read More

MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Published at 2025-10-02

#ML

The authors present MedQ-Bench, a new benchmark for evaluating medical image quality using Multi-modal Large Language Models (MLLMs). It includes tasks to test both perceptual and reasoning abilities of MLLMs in a way that mimics human evaluation, and the results show that current MLLMs perform poorly in clinical use, suggesting a need for improvement in this area....

Read More

Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Published at 2025-10-02

#ML

This study presents a novel theoretical framework that improves text-to-image models' performance on multi-subject descriptions by introducing two algorithms: a test-time controller and a fine-tuning rule. These algorithms enhance subject disentanglement and multi-subject fidelity while preserving the base model's capabilities, and they are architecture-agnostic and efficient to run....

Read More

Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective

Published at 2025-10-02

#ML

This research explores how well reasoning skills learned in English by Large Reasoning Models (LRMs) can be applied to other languages. They evaluate English-centric LRMs on multilingual tasks and find that the ability to transfer reasoning skills varies based on the model, target language, and training method. The study suggests that relying too much on English-specific patterns hinders cross-lingual generalization and introduces a new concept called the Parallel Scaling Law, which shows that a...

Read More

RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems

Published at 2025-10-02

#ML

The study presents a new training method, RLAD, for large language models to learn reasoning skills by discovering abstractions, which are concise descriptions of procedural and factual knowledge. This method enables the models to propose multiple abstractions for a given problem and then build solutions using these abstractions, leading to improved generalization and more effective reasoning....

Read More

Rethinking the shape convention of an MLP

Published at 2025-10-02

#ML

This study challenges the traditional narrow-wide-narrow design of Multi-layer Perceptrons (MLPs) and proposes a new wide-narrow-wide (Hourglass) MLP block design. The new design inverts the conventional approach by using skip connections at expanded dimensions and narrow bottlenecks, leading to superior performance and more efficient training and inference implementations....
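
A minimal PyTorch rendering of the wide-narrow-wide idea, with the residual connection carried at the expanded width so the bottleneck writes a low-rank update into a wide stream; the specific dimensions are illustrative, not the paper's settings.

    import torch
    import torch.nn as nn

    class HourglassBlock(nn.Module):
        # Wide-narrow-wide MLP: the residual lives at the expanded width,
        # and the narrow bottleneck writes a low-rank update into it,
        # inverting the usual narrow-wide-narrow Transformer FFN.
        def __init__(self, d_wide: int = 3072, d_narrow: int = 768):
            super().__init__()
            self.down = nn.Linear(d_wide, d_narrow)
            self.up = nn.Linear(d_narrow, d_wide)
            self.act = nn.GELU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.up(self.act(self.down(x)))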

Read More

RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Published at 2025-10-02

#ML

This study presents RewardMap, a multi-stage reinforcement learning framework that enhances fine-grained visual understanding and reasoning abilities in multimodal large language models. By introducing a difficulty-aware reward design and a multi-stage RL scheme, RewardMap effectively addresses the challenges of sparse rewards and unstable optimization, leading to improved performance on various benchmarks....

Read More

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Published at 2025-10-02

#ML

The authors present a method to generate longer and higher quality videos using diffusion models without needing long video data or retraining. Their approach uses guidance from teacher models and avoids common issues, enabling video generation up to 20 times longer than the teacher model's capability, and even up to 4 minutes and 15 seconds with increased computation, while maintaining high fidelity and consistency....

Read More

Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction

Published at 2025-10-02

#ML

The research presents Sparse Query Attention (SQA), a new attention mechanism that cuts down the number of Query heads to decrease computational complexity and FLOPs, unlike previous methods. Experiments show that SQA can improve throughput by up to 3x in computation-bound tasks without significantly affecting model quality....
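
The FLOPs argument is easy to see: the quadratic-in-length score computation scales with the number of query heads, so cutting query heads (rather than K/V heads, as MQA/GQA do) directly shrinks that term. A back-of-the-envelope sketch, with head counts chosen purely for illustration:

    def attn_score_flops(h_q: int, seq_len: int, d_head: int) -> int:
        # Multiplications in QK^T plus attention-weights-times-V; both
        # scale with the number of query heads and quadratically in length.
        return 2 * h_q * seq_len**2 * d_head

    full = attn_score_flops(h_q=32, seq_len=8192, d_head=128)  # full MHA
    sqa  = attn_score_flops(h_q=8,  seq_len=8192, d_head=128)  # 1/4 query heads
    print(full / sqa)  # -> 4.0: score FLOPs drop in proportion to query heads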

Read More

StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

Published at 2025-10-02

#ML

This research studies the weaknesses of 3D scene representation methods, specifically 3D Gaussian Splatting, and proposes a new strategy to exploit these vulnerabilities. The method involves inserting illusory objects into scenes from specific viewpoints without significantly affecting other views, and it also disrupts the consistency of multiple views to enhance the attack's effectiveness....

Read More

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Published at 2025-10-02

#ML

The study presents StockBench, a new benchmark for evaluating large language models' performance in realistic stock trading over multiple months. The results show that while most models struggle to beat a simple buy-and-hold strategy, some have the potential to generate higher returns and manage risk better, emphasizing the challenges and opportunities in creating LLM-powered financial agents....

Read More

The Unreasonable Effectiveness of Scaling Agents for Computer Use

Published at 2025-10-02

#ML

The authors present a new method called Behavior Best-of-N (bBoN) that significantly improves the performance of computer-use agents by enabling wide exploration and principled trajectory selection. bBoN achieves a new state-of-the-art result on OSWorld, outperforming prior methods and approaching human-level performance, and demonstrates strong generalization to different operating systems....

Read More

Transformers Discover Molecular Structure Without Graph Priors

Published at 2025-10-02

#ML

This study explores whether Transformers can learn molecular energies and forces directly from Cartesian coordinates, without relying on predefined graphs or physical priors. The results show that Transformers can adaptively learn physically consistent patterns and improve with more training resources, challenging the need for hard-coded graph biases in molecular modeling....

Read More

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

Published at 2025-10-02

#ML

The study presents DialTree-RPO, a new method that automatically finds various multi-turn attack strategies for large language models in conversations, which is more effective than existing approaches and uncovers new attack strategies without manual data curation....

Read More

VideoNSA: Native Sparse Attention Scales Video Understanding

Published at 2025-10-02

#ML

The authors propose VideoNSA, a method that improves video understanding in multimodal language models by adapting Native Sparse Attention to video-language models. Through end-to-end training, VideoNSA outperforms other sparse attention techniques on various benchmarks, demonstrating reliable scaling, optimal attention allocation, and task-dependent branch usage patterns....

Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Facebook · X · LinkedIn