🤗 Daily Paper (2025-10-15)


deep.di...@gmail.com

Oct 15, 2025, 4:07:48 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Published at 2025-10-01

#ML

This study identifies typicality bias in preference data as the main cause of mode collapse in post-trained LLMs, which reduces their output diversity. The researchers propose Verbalized Sampling, a simple prompting strategy that substantially improves diversity across a range of tasks without sacrificing factual accuracy or safety, and is particularly effective for creative writing....
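Since the method is described as a pure prompting strategy, here is a minimal sketch of what a verbalized-sampling-style prompt and sampler could look like; the prompt wording, the number of candidates, and the `call_your_llm` client are illustrative assumptions, not the paper's exact recipe.

```python
import random

def build_verbalized_sampling_prompt(task: str, k: int = 5) -> str:
    """Ask the model to verbalize a small distribution of candidate responses.

    Illustrative sketch of a verbalized-sampling-style prompt, not the paper's
    exact template.
    """
    return (
        f"{task}\n\n"
        f"Instead of a single answer, generate {k} diverse candidate responses. "
        "For each candidate, state the probability you would assign to it, with "
        "probabilities summing to 1. Format each line as: <probability> | <response>"
    )

def sample_from_verbalized_distribution(lines: list[str]) -> str:
    """Parse '<probability> | <response>' lines and sample one response."""
    candidates, weights = [], []
    for line in lines:
        prob, _, text = line.partition("|")
        candidates.append(text.strip())
        weights.append(float(prob.strip()))
    return random.choices(candidates, weights=weights, k=1)[0]

prompt = build_verbalized_sampling_prompt("Write an opening line for a short story.")
# reply_lines = call_your_llm(prompt).splitlines()   # hypothetical LLM client call
# print(sample_from_verbalized_distribution(reply_lines))
```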

Read More

MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

Published at 2025-10-09

#ML

This study explores using multimodal large language models (MLLMs) to predict human preferences for different user interface designs, comparing their judgments to real human evaluations. The results show that while MLLMs can mimic human preferences for some design aspects, they struggle with others, highlighting both their potential and their limitations in aiding early user experience research....

Read More

dInfer: An Efficient Inference Framework for Diffusion Language Models

Published at 2025-10-09

#ML

The authors developed dInfer, a fast and flexible inference framework for diffusion language models. By decomposing the inference process into four components and optimizing each one, dInfer generates text much faster than other systems, up to 10 times faster than the previous best, without sacrificing quality....

Read More

DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Published at 2025-10-10

#ML

The authors present DITING, a new framework for evaluating web novel translations, focusing on narrative and cultural accuracy. They also introduce AgentEval, a multi-agent assessment system that simulates expert judgment to improve translation quality evaluation, and MetricAlign, a dataset for comparing translation metrics. The study finds that Chinese-trained language models outperform larger foreign models for this task, with DeepSeek-V3 providing the best results....

Read More

Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

Published at 2025-10-10

#ML

The paper addresses the issue of data contamination in the reinforcement learning post-training phase of large language models, which can compromise their performance evaluation. The authors introduce Self-Critique, a method that detects data contamination by identifying policy collapse, and RL-MIA, a benchmark for simulating this scenario. Self-Critique significantly outperforms baseline methods in detecting contamination, achieving up to a 30% improvement in AUC....

Read More

ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

Published at 2025-10-10

#ML

The authors propose a new training framework called ReFIne that enhances the trustworthiness of large reasoning models by improving their interpretability, faithfulness, and reliability. Experimental results show significant improvements in these areas, emphasizing the importance of optimizing reasoning models for trustworthiness beyond just accuracy....

Read More

SynthID-Image: Image watermarking at internet scale

Published at 2025-10-10

#ML

The authors present SynthID-Image, a deep learning system for invisibly watermarking AI-generated images, which has been used to watermark over ten billion images across Google's services. They evaluate its performance and demonstrate state-of-the-art results, and also discuss the generalization of their findings to other media types like audio....

Read More

The Geometry of Reasoning: Flowing Logics in Representation Space

Published at 2025-10-10

#ML

This study proposes a new way to understand how large language models reason by analyzing their internal representation space as flows, or embedding trajectories. The researchers disentangle logic from meaning and connect reasoning to geometric quantities like position and velocity, providing a new perspective for interpreting and analyzing the behavior of large language models....
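As a rough illustration of treating per-step hidden states as a flow, the sketch below computes simple geometric quantities (step "velocity" and direction consistency) from a trajectory of embeddings; the choice of layer, the step granularity, and the finite-difference velocity are assumptions rather than the paper's exact definitions.

```python
import numpy as np

def trajectory_kinematics(hidden_states: np.ndarray) -> dict:
    """Treat per-step hidden states as a trajectory ("flow") in representation space.

    hidden_states: (num_steps, hidden_dim) array, e.g. one chosen layer's hidden
    state at each reasoning step. Which layer and which positions to use are
    modeling choices, not fixed by the summary.
    """
    velocities = np.diff(hidden_states, axis=0)      # finite-difference "velocity"
    speeds = np.linalg.norm(velocities, axis=1)      # per-step movement magnitude
    v1, v2 = velocities[:-1], velocities[1:]
    # Cosine similarity of consecutive velocities: does the flow keep its direction?
    direction_consistency = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-9
    )
    return {"speed": speeds, "direction_consistency": direction_consistency}

# Random states standing in for real per-step embeddings:
stats = trajectory_kinematics(np.random.randn(12, 768))
print(stats["speed"].shape, stats["direction_consistency"].shape)  # (11,) (10,)
```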

Read More

Why Do Transformers Fail to Forecast Time Series In-Context?

Published at 2025-10-10

#ML

The paper analyzes why Transformers struggle with in-context time series forecasting compared to simpler models, through the lens of In-Context Learning theory. It finds that linear models perform better, that Transformers only match them in the infinite-context limit, and that predictions revert to the mean under Chain-of-Thought-style inference. The study offers insights for improving forecasting models and encourages deeper theoretical exploration....
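For a concrete picture of the kind of simple linear baseline the comparison refers to, below is a minimal in-context autoregressive forecaster fit by least squares on the context window; the lag order and the plain least-squares fit are illustrative choices, not necessarily the paper's exact baseline.

```python
import numpy as np

def linear_ar_forecast(context: np.ndarray, lags: int = 8, horizon: int = 1) -> np.ndarray:
    """Fit an AR(lags) model by least squares on the context, then roll it forward."""
    # Build (lagged window -> next value) training pairs from the context alone.
    X = np.stack([context[i:i + lags] for i in range(len(context) - lags)])
    y = context[lags:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    history = list(context)
    preds = []
    for _ in range(horizon):
        nxt = float(np.dot(coef, history[-lags:]))
        preds.append(nxt)
        history.append(nxt)
    return np.array(preds)

# Usage on a toy sine series:
series = np.sin(0.1 * np.arange(200))
print(linear_ar_forecast(series, lags=8, horizon=5))
```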

Read More

Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Published at 2025-10-13

#ML

The study reveals that reasoning-based safety measures for large language models can be easily bypassed through subtle input manipulations, leading to harmful responses. The researchers present various jailbreak methods that exploit these vulnerabilities, achieving high attack success rates across different models, highlighting the need for improved safety measures in open-source language models....

Read More

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Published at 2025-10-13

#ML

The researchers present Boundary-Guided Policy Optimization (BGPO), a method that addresses the memory bottleneck of reinforcement-learning training for diffusion large language models. Its memory-efficient formulation permits a larger sample size, which improves training accuracy and leads to better performance on tasks like math problem-solving, code generation, and planning....

Read More

ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

Published at 2025-10-13

#ML

The authors present ContextGen, a model that uses layout and reference images to generate images containing multiple instances with precise object placement and consistent identity. They also introduce a new dataset, IMIG-100K, for this task, and their model outperforms existing methods in controlling object placement, maintaining identity, and overall image quality....

Read More

Deconstructing Attention: Investigating Design Principles for Effective Language Modeling

Published at 2025-10-13

#ML

This study examines the key components of the dot-product attention mechanism in Transformer language models and tests their necessity. The researchers created modified versions of attention, some with relaxed principles, and found that while token mixing is crucial, other components can be simplified or relaxed, especially when combined with standard attention, leading to more efficient language models....

Read More

Deep Research Brings Deeper Harm

Published at 2025-10-13

#ML

The study reveals that while LLMs used in Deep Research agents can perform complex tasks, they also pose significant risks if misused, especially in sensitive areas like biosecurity. The researchers propose new methods to expose these risks and call for improved alignment techniques to ensure the safe use of Deep Research agents....

Read More

Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

Published at 2025-10-13

#ML

The authors propose a new method called Diffusion-Link to bridge the gap between audio and text data representations. They show that this method significantly reduces the modality gap and improves the performance of automatic audio captioning tasks compared to previous methods....

Read More

ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

Published at 2025-10-13

#ML

ExpVid is a new benchmark for testing multimodal large language models on scientific experiment videos. It has three levels of tasks that reflect the scientific process, and it reveals that these models struggle with fine details, tracking changes over time, and connecting procedures to scientific outcomes....

Read More

Information-Preserving Reformulation of Reasoning Traces for Antidistillation

Published at 2025-10-13

#ML

The authors present PART, a method to protect detailed reasoning traces of Large Language Models from unauthorized distillation without losing important information. PART disrupts distillation by removing self-talk and reordering sub-conclusions, causing a significant decrease in performance for student models trained on reformulated traces....

Read More

LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

Published at 2025-10-13

#ML

This study explores the use of 'thinking tokens' in machine translation with large reasoning models and finds that they do not improve performance on their own. However, filling the intermediate tokens with the outputs of modular translation strategies does lead to improvements, suggesting that what matters is having actual translation attempts in those tokens....

Read More

Locket: Robust Feature-Locking Technique for Language Models

Published at 2025-10-13

#ML

The study presents Locket, a new feature-locking technique for language models that allows for a pay-to-unlock scheme, which is more economically viable for chatbot providers. Locket is effective in refusing locked features, preserves utility for unlocked features, is robust against evasion, and can be scaled to multiple features and users....

Read More

One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Published at 2025-10-13

#ML

The study presents a framework called OneLife that learns and represents the dynamics of a complex, stochastic environment through conditionally-activated programmatic laws within a probabilistic programming framework. The framework outperforms a strong baseline in 16 out of 23 scenarios tested and demonstrates successful planning ability in the Crafter-OO environment....

Read More

R-WoM: Retrieval-augmented World Model For Computer-use Agents

Published at 2025-10-13

#ML

The study evaluates the effectiveness of Large Language Models (LLMs) as world models for computer-use agents and identifies their limitations in long-horizon simulations due to hallucination and static training knowledge. To overcome these challenges, the researchers propose R-WoM, a retrieval-augmented world model that incorporates real-time, factual knowledge from external sources, resulting in significant performance improvements in simulations....

Read More

SR-Scientist: Scientific Equation Discovery With Agentic AI

Published at 2025-10-13

#ML

The authors present SR-Scientist, a framework that empowers Large Language Models (LLMs) to function as autonomous AI scientists, capable of analyzing data, implementing and optimizing scientific equations with minimal human intervention. Experimental results indicate that SR-Scientist significantly outperforms existing methods across various scientific disciplines, demonstrating robustness, generalization, and symbolic accuracy....

Read More

Scaling Language-Centric Omnimodal Representation Learning

Published at 2025-10-13

#ML

This study explores why multimodal language models perform well and finds that their ability to align different types of data (like text and images) during pretraining is key. They then propose a new framework, LCO-Emb, which improves on this by further refining the model's understanding of various data types, leading to better performance across different tasks....

Read More

Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models

Published at 2025-10-13

#ML

The paper presents a solution to improve the generation quality of diffusion models by reducing errors that accumulate during the generation process. The proposed method, Temporal Alignment Guidance, uses a time predictor to identify and correct deviations from the desired data manifold at each step, resulting in better sample fidelity and performance on various tasks....
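The summary mentions a time predictor that detects off-schedule samples and corrects them at each step; below is a very rough sketch of that idea, assuming the correction is a gradient step on the time-prediction error (the paper's actual guidance term and time-predictor design may differ).

```python
import torch

def time_aligned_correction(x_t: torch.Tensor, t: float,
                            time_predictor: torch.nn.Module,
                            strength: float = 0.1) -> torch.Tensor:
    """Nudge an intermediate diffusion sample toward the manifold expected at time t.

    Illustrative assumption: guidance is a gradient step reducing the squared error
    between the predicted timestep of x_t and the scheduler's current t.
    """
    x = x_t.detach().requires_grad_(True)
    t_hat = time_predictor(x)                 # predicted timestep for this sample
    loss = ((t_hat - t) ** 2).mean()          # deviation from the noise schedule
    grad, = torch.autograd.grad(loss, x)
    return (x - strength * grad).detach()

# Inside a sampler loop (time_predictor is a hypothetical pretrained regressor):
# x_t = time_aligned_correction(x_t, t, time_predictor)
```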

Read More

A Survey of Vibe Coding with Large Language Models

Published at 2025-10-14

#ML

This survey explores the emerging field of Vibe Coding, which involves developers working with AI-powered coding agents to create software. The authors analyze various aspects of Vibe Coding, such as language models and development environments, and propose a new taxonomy for this approach, emphasizing the importance of context engineering and human-AI collaboration....

Read More

Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

Published at 2025-10-14

#ML

This study presents a new two-stage training method that improves the performance and efficiency of pixel-space diffusion and consistency models, which typically lag behind latent-space models. The proposed framework pre-trains encoders to learn semantics from images and then fine-tunes a decoder, resulting in a diffusion model that outperforms previous pixel-space methods and rivals VAE-based models on the ImageNet dataset....

Read More

Cautious Weight Decay

Published at 2025-10-14

#ML

The authors propose a new method called Cautious Weight Decay (CWD) that applies weight decay only to certain parameter coordinates during optimization. This approach improves the final loss and accuracy for large-scale tasks like language model pre-training and ImageNet classification without requiring new hyperparameters or additional tuning....
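The summary only says decay is applied to "certain parameter coordinates"; the sketch below shows coordinate-wise masked, decoupled weight decay, with the masking rule (sign agreement between the optimizer update and the parameter) as an illustrative assumption rather than the paper's definition.

```python
import torch

@torch.no_grad()
def masked_decoupled_weight_decay(param: torch.Tensor, update: torch.Tensor,
                                  lr: float, weight_decay: float) -> None:
    """Apply decoupled (AdamW-style) weight decay only on selected coordinates.

    The step below moves param by -lr * update, so a coordinate is already being
    shrunk exactly when update and param share a sign; that is the mask used here,
    purely as an assumption; the paper defines its own criterion.
    """
    mask = (update.sign() == param.sign()).to(param.dtype)
    param.add_(update, alpha=-lr)                        # ordinary optimizer step
    param.add_(mask * param, alpha=-lr * weight_decay)   # decay only masked coords

# Toy usage with a dummy parameter and a dummy optimizer update:
p, u = torch.randn(4, 4), torch.randn(4, 4)
masked_decoupled_weight_decay(p, u, lr=1e-3, weight_decay=0.1)
```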

Read More

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Published at 2025-10-14

#ML

The authors present DeepMMSearch-R1, a multimodal LLM that can perform on-demand, multi-turn web searches and adaptively craft queries for both image and text search tools. It uses a two-stage training pipeline and a novel multimodal VQA dataset called DeepMMSearchVQA, which teaches the model when and how to search for information. The approach outperforms existing methods in various knowledge-intensive benchmarks....

Read More

Detect Anything via Next Point Prediction

Published at 2025-10-14

#ML

Researchers have developed Rex-Omni, a 3B-scale MLLM that outperforms traditional coordinate regression-based models in object detection tasks. This is achieved through innovative designs such as special tokens for coordinate prediction, multiple data engines for semantically rich supervision, and a two-stage training process that improves box accuracy and reduces duplicate predictions....
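The summary mentions casting detection as next-token prediction with special coordinate tokens; below is a minimal sketch of quantizing normalized box coordinates into a discrete coordinate vocabulary and decoding them back, with the bin count and token naming as illustrative assumptions.

```python
def box_to_coord_tokens(box, num_bins: int = 1000) -> list:
    """Quantize a normalized box (x1, y1, x2, y2 in [0, 1]) into coordinate tokens.

    Token names like '<coord_123>' and the bin count are illustrative; the paper
    defines its own special-token vocabulary.
    """
    return [f"<coord_{min(int(v * num_bins), num_bins - 1)}>" for v in box]

def coord_tokens_to_box(tokens, num_bins: int = 1000) -> list:
    """Decode coordinate tokens back to approximate normalized coordinates."""
    return [
        (int(tok.removeprefix("<coord_").removesuffix(">")) + 0.5) / num_bins
        for tok in tokens
    ]

# A detection target then becomes an ordinary token sequence:
tokens = box_to_coord_tokens((0.12, 0.30, 0.58, 0.91))
print(tokens)                         # four coordinate tokens
print(coord_tokens_to_box(tokens))    # approximately the original coordinates
```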

Read More

Dr.LLM: Dynamic Layer Routing in LLMs

Published at 2025-10-14

#ML

The authors present Dr.LLM, a framework that optimizes computation in language models by dynamically routing tokens through the model's layers based on their complexity. Dr.LLM improves accuracy and efficiency on various tasks compared to existing methods, without altering the base model's weights....
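The summary describes per-layer routing that skips or executes layers without touching the base weights; below is a minimal sketch of that control flow with frozen backbone layers and tiny per-layer router heads, where the router input, its architecture, and the execute/skip threshold are all assumptions.

```python
import torch
import torch.nn as nn

class RoutedLayerStack(nn.Module):
    """Wrap frozen layers with small routers that decide, per input, whether each
    layer runs or is skipped. Purely illustrative; the real routing policy,
    router inputs, and training procedure are defined by the paper."""

    def __init__(self, layers: nn.ModuleList, hidden_dim: int):
        super().__init__()
        self.layers = layers
        for p in self.layers.parameters():
            p.requires_grad_(False)          # base model weights stay untouched
        self.routers = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in layers
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim)
        for layer, router in zip(self.layers, self.routers):
            gate = torch.sigmoid(router(hidden.mean(dim=1)))   # (batch, 1)
            if gate.mean() > 0.5:            # crude batch-level decision for clarity
                hidden = layer(hidden)       # execute this layer
            # else: skip the layer entirely and save its compute
        return hidden

# Toy usage with simple blocks standing in for real transformer layers:
blocks = nn.ModuleList(nn.Sequential(nn.Linear(32, 32), nn.GELU()) for _ in range(4))
print(RoutedLayerStack(blocks, hidden_dim=32)(torch.randn(2, 10, 32)).shape)
```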

Read More

ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning

Published at 2025-10-14

#ML

The authors propose ERA, a two-stage framework that combines prior knowledge learning and online reinforcement learning to enhance the performance of vision language models in complex environments. ERA improves agent performance through three key designs: self-summarization, dense reward shaping, and turn-level policy optimization, and demonstrates superior results compared to large models and previous baselines in various tasks....

Read More

FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

Published at 2025-10-14

#ML

The authors present FlashVSR, a practical one-step streaming framework for real-time video super-resolution using diffusion models. They introduce three key innovations: a three-stage distillation pipeline, locality-constrained sparse attention, and a tiny conditional decoder, which together enable real-time performance, scalability, and high-quality reconstruction. They also release a large-scale dataset, VSR-120K, and demonstrate state-of-the-art performance with significant speedup over previous approaches....

Read More

HoneyBee: Data Recipes for Vision-Language Reasoners

Published at 2025-10-14

#ML

This study investigates how to build effective datasets for training vision-language models (VLMs) in reasoning tasks, identifying key factors like context source, data interventions, and scaling up data. The researchers created HoneyBee, a large-scale reasoning dataset with 2.5M examples, which significantly improves VLM performance, and introduced a test-time scaling strategy to reduce decoding cost....

Read More

Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks

Published at 2025-10-14

#ML

This study presents a new approach called Memory-as-Action, where agents manage their working memory by executing editing operations as part of a unified policy. The proposed method, Dynamic Context Policy Optimization, enables stable end-to-end reinforcement learning by segmenting trajectories at memory action points, leading to improved task performance and reduced computational consumption through adaptive context curation strategies....

Read More

Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance

Published at 2025-10-14

#ML

The authors find that in denoising generative models, the noise level the network expects often fails to match the actual noise level of intermediate samples during sampling, a mismatch they call 'noise shift'. To fix this, they propose Noise Awareness Guidance (NAG), which keeps the noise levels consistent throughout sampling, along with a simpler variant that requires no auxiliary components; their experiments show that NAG significantly improves the quality of the generated images....

Read More

RAG-Anything: All-in-One RAG Framework

Published at 2025-10-14

#ML

RAG-Anything is a new framework that allows for comprehensive knowledge retrieval across all modalities, including text, images, tables, and mathematical expressions. It outperforms existing methods in multimodal benchmarks, especially in long documents, by reconceptualizing multimodal content as interconnected knowledge entities and introducing dual-graph construction and cross-modal hybrid retrieval....

Read More

Robot Learning: A Tutorial

Published at 2025-10-14

#ML

This tutorial covers the modern field of robot learning, focusing on data-driven methods that have recently surpassed traditional, model-based approaches in autonomous systems. It spans topics from basic reinforcement learning and behavioral cloning to advanced language-conditioned models, providing practical tools and examples for researchers and practitioners....

Read More

SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model

Published at 2025-10-14

#ML

The SAIL-Embedding model is a new omni-modal embedding foundation model that improves upon existing multimodal embedding models by addressing challenges such as limited modality support, unstable training, and industrial domain gaps. It achieves this through tailored training strategies, architectural design, and a multi-stage training scheme, resulting in state-of-the-art performance in various retrieval tasks and significant improvements in user engagement metrics for recommendation scenarios....

Read More

SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Published at 2025-10-14

#ML

The study presents SRUM, a self-improvement framework that enables unified multimodal models to enhance their visual generation capabilities using their own understanding module as a feedback loop, without requiring additional data. SRUM's global-local dual reward system guides the generation process, resulting in improved performance on various benchmarks....

Read More

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

Published at 2025-10-14

#ML

The study presents a new method called Spatial Forcing to improve the spatial awareness of robots following language instructions, without relying on 3D sensors or depth estimators. This approach enhances the precision of robotic actions and accelerates training, demonstrating state-of-the-art results in both simulation and real-world environments....

Read More

Tensor Logic: The Language of AI

Published at 2025-10-14

#ML

This paper proposes tensor logic, a unifying language for AI that combines neural and symbolic AI, addressing the limitations of existing AI programming languages and enabling new directions like sound reasoning in embedding space....

Read More

UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Published at 2025-10-14

#ML

The authors propose UniFusion, a model that uses a single, frozen vision-language model to generate images, improving cross-modal reasoning and knowledge transfer compared to existing methods. UniFusion's Layerwise Attention Pooling mechanism extracts both high and low level details from text and visual tokens, and the model demonstrates improved text-image alignment and generalization capabilities....

Read More

ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

Published at 2025-10-14

#ML

The authors present a new training algorithm, Visual Consistency Learning (ViCO), which allows models to use varying numbers of vision tokens for images with different semantic complexities. By employing multiple MLP connectors and a visual resolution router, the method reduces the number of vision tokens by up to 50% during inference without compromising the model's performance, leading to more efficient Multimodal Large Language Models....

Read More

What If: Understanding Motion Through Sparse Interactions

Published at 2025-10-14

#ML

The Flow Poke Transformer (FPT) is a new framework that predicts local motion distribution based on sparse interactions, providing a more interpretable and multi-modal representation of scene motion compared to traditional methods. The FPT model excels in various tasks, including dense face motion generation and articulated object motion estimation, and its source code is publicly available....

Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full papers are translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
