🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a list of papers curated by 🤗 Daily Papers.
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
Published at 2025-10-06
#ML

The researchers created a large, multilingual dataset called PsiloQA to help detect false information generated by large language models. They built this dataset using a three-step process and tested different methods for detecting false information, finding that one type of model worked best across multiple languages and was more cost-effective than datasets made by humans....

Read More
SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis
Published at 2025-10-08
#ML

The study presents SCas4D, a new framework that uses structural patterns in 3D Gaussian Splatting to efficiently model dynamic scenes with accurate deformations. By refining deformations progressively, SCas4D achieves fast convergence and performs well in various tasks like novel view synthesis and dense point tracking, while requiring significantly fewer training iterations compared to existing methods....

Read More
Large Language Models Do NOT Really Know What They Don't Know
Published at 2025-10-10
#ML

The study analyzes how large language models process factual queries and finds that they don't distinguish internal states for correct and hallucinated responses unless the hallucination is unrelated to subject knowledge. This reveals that LLMs don't encode truthfulness but only recall patterns, meaning they don't truly know what they don't know....

Read More
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
Published at 2025-10-11
#ML

The study finds that advanced language models often fail to refuse to answer when the grounding context is faulty, especially in multi-document tasks. The researchers developed RefusalBench, a new method to evaluate this capability by creating test cases through linguistic perturbation, revealing that refusal skill can be improved with training and alignment....

Read More
FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
Published at 2025-10-12
#ML

The authors present FML-bench, a diverse and comprehensive benchmark for evaluating automatic machine learning research agents on 8 fundamental ML problems, emphasizing the importance of broad exploration for effective research outcomes. The benchmark reduces coding burden, focuses on fundamental problems, offers high task diversity, and is extensible to real-world ML repositories, with a unified evaluation framework and five complementary metrics....

Read More
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Published at 2025-10-12
#ML

The authors present a new method called VR-Thinker that improves the performance of video reward models by incorporating visual reasoning operations and a configurable memory window. This approach enables the model to actively gather and update visual information within context limits, enhancing the accuracy and reliability of its reasoning, especially for longer videos....

Read More
AnyUp: Universal Feature Upsampling
Published at 2025-10-14
#ML

The authors propose a new method called AnyUp for upsampling vision features of any resolution, which doesn't require training for specific encoders and works with different feature types. This method improves upsampling quality, maintains feature details, and is efficient for various tasks....

Read More
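
For context on the task AnyUp tackles: encoders emit features on a coarse grid, while dense prediction wants per-pixel features. A minimal PyTorch sketch of the bilinear baseline that learned, encoder-agnostic upsamplers aim to beat (shapes are illustrative; this is not AnyUp itself):

```python
import torch
import torch.nn.functional as F

# A ViT-style encoder emits a coarse feature grid; dense tasks want
# per-pixel features. Bilinear interpolation is the usual baseline that
# learned, encoder-agnostic upsamplers improve on.
feats = torch.randn(1, 384, 16, 16)   # (batch, channels, H, W), illustrative
up = F.interpolate(feats, size=(224, 224), mode="bilinear", align_corners=False)
print(up.shape)                        # torch.Size([1, 384, 224, 224])
```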
VLA-0: Building State-of-the-Art VLAs with Zero Modification
Published at 2025-10-14
#ML

This research explores a straightforward approach to building Vision-Language-Action models (VLAs) by representing actions as text, which surprisingly outperforms more complex models like pi_0.5-KI, OpenVLA-OFT, and SmolVLA on the LIBERO benchmark. Even without extensive robotics-specific training, this simple VLA design exceeds the performance of methods trained on large-scale robotic data and real-world pre-trained models, highlighting the effectiveness of this uncomplicated strategy....

Read More
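
To make "representing actions as text" concrete, here is a minimal sketch of one common recipe: uniformly discretize each action dimension and emit the bin indices as plain digits. The range and bin count are illustrative assumptions, not VLA-0's published settings:

```python
import numpy as np

def action_to_text(action, low=-1.0, high=1.0, bins=256):
    # Uniformly discretize each dimension and print bin indices as digits.
    # low/high/bins are illustrative assumptions, not VLA-0's settings.
    a = np.clip(action, low, high)
    idx = np.round((a - low) / (high - low) * (bins - 1)).astype(int)
    return " ".join(map(str, idx))

def text_to_action(text, low=-1.0, high=1.0, bins=256):
    # Inverse mapping: parse digits back into a continuous action vector.
    idx = np.array([int(t) for t in text.split()])
    return low + idx / (bins - 1) * (high - low)

txt = action_to_text(np.array([0.12, -0.5, 0.99]))
print(txt)                  # "143 64 254"
print(text_to_action(txt))  # ≈ [0.12, -0.5, 0.99]
```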
BitNet Distillation
Published at 2025-10-15
#ML

The authors propose a method called BitNet Distillation that reduces the size of large language models by converting them into 1.58-bit precision, resulting in faster and more memory-efficient models without significantly compromising performance. This is achieved through three techniques: SubLN module, multi-head attention distillation, and continual pre-training, which together enable up to 10x memory savings and 2.65x faster inference on CPUs compared to full-precision models....

Read More
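
"1.58-bit" means ternary weights (log2 3 ≈ 1.58). A minimal sketch of the absmean ternary quantizer from the BitNet b1.58 line of work, which is the weight format the distilled students adopt; the SubLN module and distillation losses are not shown:

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # BitNet b1.58-style absmean scheme: scale by the mean absolute
    # weight, round, and clip to {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale                      # dequantize as w_q * scale

w = torch.randn(4, 4)
w_q, scale = absmean_ternary_quantize(w)
print(w_q)                                 # entries in {-1., 0., 1.}
print((w_q * scale - w).abs().mean())      # mean quantization error
```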
LLM-guided Hierarchical Retrieval
Published at 2025-10-15
#ML

The authors present LATTICE, a hierarchical retrieval framework that allows language models to efficiently reason over large corpora using a semantic tree structure. By organizing documents into a hierarchy and navigating it with a search LLM, LATTICE achieves state-of-the-art performance on the BRIGHT benchmark, outperforming other zero-shot methods and matching fine-tuned SOTA methods on static corpora....

Read More
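
A toy sketch of LLM-guided tree descent in the spirit of LATTICE: documents sit at the leaves of a summary tree, and a scorer (here a crude lexical-overlap stand-in for the search LLM) decides which subtrees to expand. The real traversal and relevance calibration are more involved:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                                 # description of this subtree
    docs: list = field(default_factory=list)     # documents at a leaf
    children: list = field(default_factory=list)

def score(query: str, summary: str) -> float:
    # Crude lexical-overlap stand-in for the search LLM's relevance judgment.
    return float(len(set(query.lower().split()) & set(summary.lower().split())))

def retrieve(root: Node, query: str, beam: int = 1) -> list:
    # Beam descent: at each level keep the `beam` best-scoring children,
    # collecting documents whenever a leaf is reached.
    frontier, results = [root], []
    while frontier:
        nxt = []
        for node in frontier:
            if not node.children:
                results.extend(node.docs)
            else:
                ranked = sorted(node.children,
                                key=lambda c: score(query, c.summary),
                                reverse=True)
                nxt.extend(ranked[:beam])
        frontier = nxt
    return results

root = Node("energy", children=[Node("solar panels efficiency", docs=["doc-42"]),
                                Node("wind turbines", docs=["doc-7"])])
print(retrieve(root, "solar panel efficiency"))  # ['doc-42']
```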
LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning
Published at 2025-10-15
#ML

The authors present a framework called LiteStage to improve the efficiency of multi-stage reasoning in language models without sacrificing too much accuracy. They tackle the challenges of varying layer sensitivity and redundant output generation by optimizing layer budgets offline and using an early exit strategy online, resulting in significant speedup and minimal accuracy loss on various benchmarks....

Read More
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild
Published at 2025-10-15
#ML

LiveResearchBench is a new benchmark for evaluating deep research systems that are user-centric, dynamic, unambiguous, and multi-faceted. It consists of 100 expert-curated tasks that require extensive web search and synthesis, and comes with a comprehensive evaluation suite called DeepEval, which assesses both content and report-level quality. The benchmark and evaluation suite were used to test 17 frontier deep research systems, revealing their strengths and weaknesses....

Read More
Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference
Published at 2025-10-15
#ML

The authors propose a new inference algorithm, Mirror Speculative Decoding, which improves the speed and accuracy of language model inference by using parallel and heterogeneous computation, as well as multi-token speculative streaming, resulting in significant speedups and better performance compared to existing methods....

Read More
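
For readers new to this family of methods, below is a minimal greedy sketch of the plain draft-then-verify loop that speculative decoding starts from, assuming HF-style causal LMs that return `.logits`. Mirror-SD's specific contributions (mirrored pipelining across heterogeneous accelerators and speculative streaming) are not reproduced here:

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    # One round of plain greedy speculative decoding: the small draft model
    # proposes k tokens serially; the large target model checks them all in
    # a single forward pass and keeps the agreeing prefix.
    n = ids.shape[1]
    prop = ids
    for _ in range(k):                                   # cheap serial drafting
        nxt = draft(prop).logits[:, -1].argmax(-1, keepdim=True)
        prop = torch.cat([prop, nxt], dim=-1)
    logits = target(prop).logits                         # single verify pass
    preds = logits.argmax(-1)[:, n - 1:n + k - 1]        # target's pick per drafted slot
    agree = (preds == prop[:, n:]).long().cumprod(-1).sum().item()
    out = prop[:, :n + agree]                            # keep agreeing prefix
    bonus = logits.argmax(-1)[:, out.shape[1] - 1:out.shape[1]]
    return torch.cat([out, bonus], dim=-1)               # +1 token from the verify pass
```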
MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
Published at 2025-10-15
#ML

This study presents a new framework called MoM that improves the way small language models process and understand documents, making them more proactive and efficient. MoM simulates human cognitive processes during reading, handles documents from multiple domains, and incorporates a reverse reasoning strategy to help small language models achieve human-centric intelligent text processing....

Read More
On Pretraining for Project-Level Code Completion
Published at 2025-10-15
#ML

The authors study how different methods of processing code repositories impact the learning of an AI model called OpenCoder, which generates code. They improve the model's performance by increasing its context window and using a new technique for understanding the code's structure, and they show that a simpler approach still works well, making this method accessible even with limited resources....

Read More
RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems
Published at 2025-10-15
#ML

The authors present RAGCap-Bench, a new tool for evaluating the intermediate tasks in agentic Retrieval-Augmented Generation systems, which helps improve the performance of large language models by enhancing their ability to retrieve and reason over complex queries....

Read More
Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
Published at 2025-10-15
#ML

The authors present a method to create complex question-answering datasets for web agents by gradually increasing difficulty until a baseline agent fails. They then use this dataset to train more effective web agents that demonstrate better performance and diversity in tool use compared to existing datasets....

Read More
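
A toy sketch of the escalation loop described above, with `make_task` and `baseline_agent` as hypothetical stand-ins for the paper's task generator and reference agent:

```python
def escalate_until_failure(make_task, baseline_agent, max_level=10):
    # Hypothetical escalation loop: make the task progressively harder and
    # keep the first variant the baseline agent can no longer solve.
    for level in range(1, max_level + 1):
        task = make_task(level)          # e.g. add hops, constraints, tools
        if not baseline_agent.solves(task):
            return task                  # hard enough to be worth training on
    return task                          # baseline solved everything; keep hardest
```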
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
Published at 2025-10-15
#ML

The authors present the German Commons, a vast collection of openly licensed German text, addressing the scarcity of legally compliant data for German language model development. This resource, consisting of 154.56 billion tokens from 41 sources across seven domains, enables the creation of truly open German language models and is fully reproducible and extensible....

Read More
Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning
Published at 2025-10-15
#ML

This study explores how to improve the performance of Transformer networks when faced with unfamiliar data, using a complex math task as a test case. The researchers propose and test four new architectural mechanisms designed to enhance the network's ability to generalize, ultimately creating a more robust and scalable system for handling unknown data....

Read More
VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
Published at 2025-10-15
#ML

The authors present a new method called VIST3A that combines a text-to-video model and a 3D reconstruction system to create a powerful text-to-3D generator. They address challenges in integrating the two components and aligning them to produce high-quality, consistent 3D scenes, resulting in improved performance compared to previous text-to-3D models....

Read More
AI for Service: Proactive Assistance with AI Glasses
Published at 2025-10-16
#ML

The authors propose a new AI paradigm called AI for Service that enables proactive and real-time assistance in daily life using AI glasses. They introduce a unified framework called Alpha-Service, which includes five key components to detect service opportunities, infer user intent, and provide timely assistance without explicit prompts, as demonstrated in case studies like a Blackjack advisor and museum tour guide....

Read More
Agentic Design of Compositional Machines
Published at 2025-10-16
#ML

This study explores whether large language models can learn to design complex machines by composing them from standard parts in a simulated environment. The researchers introduce a new testbed called BesiegeField and find that current models lack the necessary skills, suggesting reinforcement learning as a potential solution....

Read More
Agentic Entropy-Balanced Policy Optimization
Published at 2025-10-16
#ML

The paper presents a new algorithm, AEPO, that addresses the challenges of relying too much on entropy signals in agentic RL. AEPO balances entropy in both the rollout and policy update phases, improving rollout sampling diversity and maintaining stable policy entropy, which helps in training scalable web agents....

Read More
Attention Is All You Need for KV Cache in Diffusion LLMs
Published at 2025-10-16
#ML

This study presents a new method called Elastic-Cache that reduces redundant computation and speeds up decoding in diffusion large language models by adaptively recomputing key-value caches, resulting in significant speedups and higher accuracy compared to existing methods....

Read More
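
A toy sketch of what adaptive cache recomputation can look like: keep a layer's KV cache until the attention pattern has drifted from the one it was cached under. The cosine test and threshold are illustrative assumptions, not the paper's exact rule:

```python
import torch
import torch.nn.functional as F

def cache_is_stale(attn_at_cache_time: torch.Tensor,
                   attn_now: torch.Tensor, tau: float = 0.9) -> bool:
    # In diffusion LLMs, cached KVs go stale as masked tokens get denoised.
    # Illustrative rule: refresh once the attention pattern has drifted.
    cos = F.cosine_similarity(attn_at_cache_time.flatten(),
                              attn_now.flatten(), dim=0)
    return cos.item() < tau

old = torch.rand(8, 64)   # cached attention weights (heads x positions)
new = torch.rand(8, 64)   # current attention weights
print(cache_is_stale(old, new))
```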
Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
Published at 2025-10-16
#ML

The study presents WritingPreferenceBench, a new dataset of 1,800 human-rated writing pairs in English and Chinese, to evaluate preference learning models. The results show that current methods struggle with subjective writing preferences, while generative models with explicit reasoning chains perform best, suggesting a need for better models that understand creativity, style, and emotion in writing....

Read More
Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts
Published at 2025-10-16
#ML

The study presents Beyond One World, a benchmark evaluating the ability of language models to accurately portray different versions of 30 iconic superheroes from Marvel and DC. It examines factual recall and ethical reasoning, revealing that while some models can generate coherent narratives, they struggle with consistency across versions and with aligning reasoning with actions....

Read More
COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes
Published at 2025-10-16
#ML

The authors create a new dataset called COIG-Writer for Chinese creative writing, which includes not only the final text but also the thought process behind it. They find that creative writing involves two components, narrative logic and linguistic expression, and that process supervision improves performance but needs to be balanced with general data for optimal results....

Read More
DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Published at 2025-10-16
#ML

The study creates a large-scale benchmark for six English dialects and tests 17 multimodal generative models, finding that they perform worse with dialectal input. The researchers then develop a new method to help these models better understand dialects without losing their ability to understand standard English, improving dialect performance by up to 34.4%....

Read More
Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
Published at 2025-10-16
#ML

The study examines the connection between recurrent-depth models and diffusion language models, and introduces a new diffusion forcing sampler to speed up the generation process. This sampler, which can be applied to existing 3.5B recurrent-depth transformers without any adjustments, yields a significant 5x speedup, offering an efficient way to parallelize computation in recurrent-depth models....

Read More
Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning
Published at 2025-10-16
#ML

The authors present a new architecture called AdaMoE, which improves the efficiency and performance of vision-language-action models used in robotic manipulation tasks. AdaMoE addresses the challenges of scaling up these models by leveraging pretrained weights, balancing model capacity with computational efficiency, and enabling collaborative expert utilization, resulting in superior performance and significant improvements on key benchmarks and in real-world experiments....

Read More
Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report
Published at 2025-10-16
#ML

The authors present the mxbai-edge-colbert-v0 models, which are smaller versions of retrieval and late-interaction models designed for efficient, scalable use. These models can run on any device, from cloud-based large-scale retrieval to local, on-device retrieval, and have been optimized for both short-text benchmarks and long-context tasks....

Read More
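
The "late interaction" these models implement is ColBERT's MaxSim scoring: each query token embedding is matched to its best document token embedding, and the maxima are summed. A minimal sketch with random embeddings:

```python
import torch
import torch.nn.functional as F

def maxsim(q_vecs: torch.Tensor, d_vecs: torch.Tensor) -> float:
    # ColBERT late interaction: best document match per query token, summed.
    sim = q_vecs @ d_vecs.T                  # (q_tokens, d_tokens) similarities
    return sim.max(dim=1).values.sum().item()

q = F.normalize(torch.randn(8, 128), dim=-1)    # query token embeddings
d = F.normalize(torch.randn(50, 128), dim=-1)   # document token embeddings
print(maxsim(q, d))                              # higher = more relevant
```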
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Published at 2025-10-16
#ML

The study proposes a new family of native Vision-Language Models (VLMs) called NEO, which effectively aligns visual and textual data in a shared space, integrates vision and language modules, and supports cross-modal reasoning. NEO is designed to be scalable, powerful, and cost-effective, with the goal of advancing research in native VLMs and making it more accessible....

Read More
GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
Published at 2025-10-16
#ML

The authors present GroundedPRM, a new framework for improving multi-step reasoning in language models. It uses tree-guided search and external tools to provide accurate, grounded feedback, outperforming other methods with less data....

Read More
ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Published at 2025-10-16
#ML

The authors present ImagerySearch, a strategy for generating more coherent and realistic videos in imaginative scenarios by adjusting the search space and reward function based on the prompt's semantic relationships. They also introduce LDT-Bench, the first benchmark for evaluating long-distance semantic prompts, and demonstrate ImagerySearch's superior performance compared to other methods....

Read More
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
Published at 2025-10-16
#ML

The study presents a new reinforcement learning framework called Information Gain-based Policy Optimization (IGPO) that helps multi-turn LLM agents learn more effectively by providing dense and intrinsic supervision. IGPO improves upon existing methods by offering turn-level rewards based on the model's belief updates, which leads to better performance and sample efficiency in various multi-turn scenarios....

Read More
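
The belief-update reward is simple to state: score each turn by how much it raises the policy's probability of the ground-truth answer. A minimal sketch (the exact IGPO formulation may differ in detail):

```python
def information_gain_rewards(answer_probs):
    # answer_probs[t] = p(ground-truth answer | context after turn t);
    # index 0 is the belief before any turn. Each turn is rewarded by the
    # increase in that belief, giving dense, intrinsic turn-level credit.
    return [after - before
            for before, after in zip(answer_probs, answer_probs[1:])]

print(information_gain_rewards([0.10, 0.25, 0.20, 0.70]))
# [0.15, -0.05, 0.50]: turn 2 lost ground, turn 3 was decisive
```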
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
Published at 2025-10-16
#ML

The authors present UI-Simulator, a cost-effective method to generate large-scale training data for digital agents by simulating diverse user interfaces, and UI-Simulator-Grow, which improves efficiency by focusing on high-impact tasks. Experiments show that agents trained with UI-Simulator perform as well as or better than those trained with real UI data, and UI-Simulator-Grow can match the performance of larger models using a smaller base model....

Read More
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
Published at 2025-10-16
#ML

This study presents LaSeR, an algorithm that enhances Large Language Models by integrating reasoning and self-verification capabilities, improving model performance and efficiency by using a simple self-rewarding score based on the last token generated....

Read More
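
One way to read "a self-rewarding score based on the last token": treat the model's own next-token distribution at the end of a solution as a verification signal. The sketch below is an illustrative interpretation, not the paper's exact formula:

```python
import torch

def last_token_self_reward(logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    # Illustrative self-verification: read the model's belief that its own
    # answer is correct off the final position's logits, renormalized over a
    # hypothetical "yes"/"no" verdict pair. The token ids are assumptions.
    last = logits[-1]
    p_yes, _ = torch.softmax(torch.stack([last[yes_id], last[no_id]]), dim=0)
    return p_yes.item()

logits = torch.randn(10, 32000)          # (seq_len, vocab) dummy logits
print(last_token_self_reward(logits, yes_id=9891, no_id=2201))
```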
Learning an Image Editing Model without Image Editing Pairs
Published at 2025-10-16
#ML

This study proposes a new training method for image editing models that doesn't require paired data. By using vision-language models and a distribution matching loss, the approach achieves results comparable to other models trained with paired data, and even outperforms RL-based techniques in the few-step setting....

Read More
MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
Published at 2025-10-16
#ML

The study presents a new framework called MathCanvas that enhances the ability of multimodal models to use visual aids for mathematical reasoning, specifically in geometry. By pre-training models on a large dataset of diagrams and their corresponding captions, and fine-tuning them on a dataset of interleaved visual-textual reasoning paths, the framework enables models to generate high-quality diagrams and use them strategically for complex problem-solving. The resulting model, BAGEL-Canvas, sign...

Read More
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Published at 2025-10-16
#ML

The study presents PaddleOCR-VL, a state-of-the-art, resource-efficient model for multilingual document parsing, which outperforms existing solutions and delivers fast inference speeds, making it ideal for real-world applications....

Read More
Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation
Published at 2025-10-16
#ML

The researchers created a framework called Ponimator, which uses close-proximity human poses to generate various types of human-human interaction animations, such as image-based and text-to-interaction animations. They trained it using motion-capture interaction datasets and employed two conditional diffusion models to create realistic and dynamic animations, demonstrating its effectiveness and versatility in various applications....

Read More
Qwen3Guard Technical Report
Published at 2025-10-16
#ML

The authors present Qwen3Guard, a multilingual safety guardrail model that offers fine-grained safety judgments and real-time monitoring for LLMs. It comes in two variants, Generative and Stream, with different sizes and supports 119 languages, providing efficient safety moderation for global deployments....

Read More
RealDPO: Real or Not Real, that is the Preference
Published at 2025-10-16
#ML

The authors present RealDPO, a new method that improves the realism of generated motion by comparing it to real-world videos and making adjustments, and introduce RealAction-5K, a high-quality video dataset for training and testing complex motion synthesis. This approach outperforms existing models and techniques in generating realistic and contextually consistent movements....

Read More
SimKO: Simple Pass@K Policy Optimization
Published at 2025-10-16
#ML

This study identifies a bias in reinforcement learning with verifiable rewards (RLVR) methods towards exploitation over exploration, resulting in better pass@1 accuracy but poorer pass@K performance. The researchers propose SimKO, an asymmetrical optimization method that boosts top-K probabilities for correct responses and penalizes top-1 probabilities for incorrect ones, effectively mitigating the over-concentration issue and improving exploration in various math and logical reason...

Read More
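
A toy rendering of the asymmetry described above, not the paper's exact loss: push up the top-K candidates on verified-correct rollouts, and push down only the over-confident top-1 on incorrect ones:

```python
import torch

def simko_style_update(logits: torch.Tensor, correct: bool, k: int = 3,
                       alpha: float = 0.1) -> torch.Tensor:
    # Illustrative asymmetric update on one token's vocabulary logits:
    # reward spreads mass over the top-k candidates, while penalties hit
    # only the greedy top-1 token. k and alpha are assumptions.
    probs = logits.softmax(-1)
    out = logits.clone()
    if correct:
        out[probs.topk(k).indices] += alpha   # keep several candidates alive
    else:
        out[probs.argmax()] -= alpha          # deflate the over-confident pick
    return out

logits = torch.randn(100)                     # dummy per-token logits
print(simko_style_update(logits, correct=True).shape)
```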
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
Published at 2025-10-16
#ML

The study presents TokDrift, a framework that highlights the issue of subword tokenization in large language models for code, which can lead to different tokenization for semantically identical code snippets. This problem, originating in early embeddings, affects even large models and suggests the need for grammar-aware tokenization for better code understanding and generation....

Read More
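
The drift is easy to reproduce: two formattings of the same statement are identical to a parser but tokenize differently under BPE. A quick demonstration with the GPT-2 tokenizer (requires the transformers package; downloads the tokenizer on first use):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Two formattings of the same assignment: identical grammar, different subwords.
for snippet in ["a=b+c", "a = b + c"]:
    print(snippet, "->", tok.tokenize(snippet))
# The parser sees one assignment either way, but the BPE token boundaries
# differ, which is the kind of drift TokDrift measures.
```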
VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation
Published at 2025-10-16
#ML

The authors present a new framework, VLA^2, which enhances vision-language-action models' performance when dealing with unseen objects by using external modules like web retrieval and object detection. This framework outperforms current models, especially in handling out-of-distribution objects, with a 44.2% improvement in success rate on a challenging benchmark compared to the standalone OpenVLA baseline....

Read More
WithAnyone: Towards Controllable and ID Consistent Image Generation
Published at 2025-10-16
#ML

The authors present a new method for generating images that are consistent with a given identity, while also being diverse in terms of pose, expression, and lighting. They achieve this by creating a large dataset of paired images, developing a benchmark to measure the quality of the generated images, and proposing a new training approach that balances identity fidelity with diversity. The resulting model, called WithAnyone, significantly reduces copy-paste artifacts and improves controllability ...

Read More
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Published at 2025-10-16
#ML

The study presents pi-Flow, a new model that improves few-step diffusion by having a student model mimic a teacher model's behavior, resulting in faster and more accurate data generation while avoiding trade-offs between quality and diversity. Tests on ImageNet and other datasets show that pi-Flow outperforms other methods in terms of quality and diversity....

Read More
         
Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media