🤗 Daily Paper(2025-12-18)

deep.di...@gmail.com

Dec 18, 2025, 3:07:24 PM

to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page

🤗 daily paper

Hybrid Attribution Priors for Explainable and Robust Model Training

Published at 2025-12-09

#ML

The study proposes a new framework called Class-Aware Attribution Prior (CAP) to improve the interpretability and robustness of small language models used in classification tasks. CAP helps models focus on fine-grained class distinctions, and the hybrid version combines it with existing attribution techniques to create a more comprehensive supervisory signal, enhancing the learning of diverse, decision-relevant features....

VABench: A Comprehensive Benchmark for Audio-Video Generation

Published at 2025-12-09

#ML

VABench is a new benchmark framework for evaluating audio-video generation models, covering three task types and 15 evaluation dimensions, including audio-video synchronization, lip-speech consistency, and question-answering pairs, across seven content categories....

Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

Published at 2025-12-10

#ML

This study presents TacThru, a sensor that offers simultaneous visual and tactile perception for robots, and TacThru-UMI, a learning framework that uses these perceptions for manipulation tasks. The new sensor and framework enable robots to perform complex tasks with high success rates, especially in scenarios requiring delicate touch or multimodal coordination....

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Published at 2025-12-11

#ML

The authors present MMSI-Video-Bench, a comprehensive benchmark for evaluating video-based spatial intelligence in large language models, which assesses perception, planning, prediction, and cross-video reasoning through human-annotated questions and videos. The benchmark reveals a significant gap between human and AI performance and highlights areas for improvement, such as geometric reasoning and long-term prediction....

VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

Published at 2025-12-12

#ML

This study presents Voyager, a novel approach to generate diverse datasets using large language models (LLMs) without the need for training. Voyager directly optimizes a mathematical quantity that enhances dataset diversity and has been shown to significantly outperform existing methods in comprehensive experiments....

FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Published at 2025-12-15

#ML

The authors present FiNERweb, a tool that creates large-scale, multilingual datasets for named entity recognition using a teacher-student approach. Experiments show that models trained on FiNERweb perform well in zero-shot transfer settings across multiple languages, and the annotations are reliable and informative, with performance slightly better when using English labels instead of translated ones....

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

Published at 2025-12-15

#ML

The authors present HyperVL, a multimodal large language model designed for on-device use, which overcomes the high computational and memory requirements of current models through an image-tiling strategy, a Visual Resolution Compressor, and Dual Consistency Learning. HyperVL outperforms similar-sized models in various benchmarks and reduces latency and power consumption on mobile devices, making it practical for on-device multimodal inference....

LikeBench: Evaluating Subjective Likability in LLMs for Personalization

Published at 2025-12-15

#ML

The researchers present LikeBench, a new framework for evaluating the likability of language models in personalization. They argue that likability, which is subjective and important for user experience, is not well-measured in current benchmarks. LikeBench assesses likability across seven dimensions, including emotional adaptation and humor fit, using realistic simulated users. The study reveals that strong memory performance does not guarantee high likability, as a model with lower memory accur...

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Published at 2025-12-15

#ML

The authors propose SAGE, a system that reasons flexibly over different video lengths, similar to human behavior. They introduce a data generation pipeline, an effective training method, and a benchmark for evaluating video reasoning ability, resulting in notable improvements in performance on long video tasks....

WAY: Estimation of Vessel Destination in Worldwide AIS Trajectory

Published at 2025-12-15

#ML

This study presents a novel deep learning model, WAY, which improves the estimation of ship destinations by using global AIS data and a new method that handles long port-to-port trajectories as a nested sequence structure. The model's architecture includes a trajectory representation layer and Channel-Aggregative Sequential Processing (CASP) blocks, which help process the data and provide long-term destination estimates. Additionally, the researchers introduce a new Gradient Dropout technique to...

Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Published at 2025-12-16

#ML

This research presents Jacobi Forcing, a new method that trains autoregressive (AR) models to become efficient parallel decoders without losing their original causal inference abilities, resulting in significant speedup for coding and math tasks with minimal performance loss. Additionally, the study introduces multi-block decoding with rejection recycling, which further enhances speed and reduces inference latency....

Puzzle Curriculum GRPO for Vision-Centric Reasoning

Published at 2025-12-16

#ML

The authors propose a new method called Puzzle Curriculum GRPO that enhances visual reasoning in Vision Language Models without requiring hand-curated annotations or external verifiers. This method uses three self-supervised puzzle environments to replace labels, introduces a difficulty-aware curriculum to address flat rewards, and monitors Reasoning-Answer Consistency to improve training stability and end-task accuracy....

Understanding and Improving Hyperbolic Deep Reinforcement Learning

Published at 2025-12-16

#ML

This research investigates the challenges of using hyperbolic feature spaces in reinforcement learning and proposes Hyper++, a new agent that improves stability and performance through stable critic training, feature regularization, and an optimization-friendly formulation of hyperbolic network layers, outperforming prior agents in various experiments....

Universal Reasoning Model

Published at 2025-12-16

#ML

The study examines the factors contributing to the success of Universal Transformers in complex reasoning tasks and proposes a new model, Universal Reasoning Model (URM), which incorporates short convolution and truncated backpropagation to significantly improve reasoning performance on ARC-AGI tasks....

Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

Published at 2025-12-17

#ML

The authors present G2RL, a new framework for reinforcement learning in large language models that uses the model's own gradient information to guide exploration, improving performance on various reasoning benchmarks compared to traditional methods....

DEER: Draft with Diffusion, Verify with Autoregressive Models

Published at 2025-12-17

#ML

This research proposes a new framework called DEER that uses a diffusion large language model to draft text quickly and efficiently, then verifies it with an autoregressive model. The result is a significant speedup in text generation compared to existing methods, with the potential for much longer drafts....
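The draft-then-verify idea in the DEER summary can be illustrated with a toy sketch. This is not the paper's algorithm: the summary does not state DEER's acceptance rule, so the code below assumes the common greedy variant of speculative decoding, in which the verifier keeps the longest drafted prefix it agrees with and then supplies one token of its own. All names (`verify_draft`, `draft_and_verify`, `toy_target`, `toy_drafter`) are hypothetical, and the "models" are plain functions over integer tokens. A real system would check the whole draft in one parallel forward pass; the sequential calls here are only for readability.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]              # greedy next-token function (the verifier)
DraftFn = Callable[[List[Token], int], List[Token]]  # proposes a block of k tokens (the drafter)


def verify_draft(prefix: List[Token], draft: List[Token], target_next: NextFn) -> List[Token]:
    """Keep the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    # The whole draft passed, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def draft_and_verify(prompt: List[Token], draft_block: DraftFn, target_next: NextFn,
                     max_new: int, block_size: int = 4) -> List[Token]:
    """Generate max_new tokens by alternating block drafting and verification."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = draft_block(out, block_size)
        out.extend(verify_draft(out, draft, target_next))
    return out[: len(prompt) + max_new]


# Toy stand-ins (assumptions): the target counts upward modulo 10,
# and the drafter imitates it but is wrong at every third position.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_drafter(seq: List[Token], k: int) -> List[Token]:
    block: List[Token] = []
    cur = list(seq)
    for _ in range(k):
        tok = toy_target(cur)
        if len(cur) % 3 == 0:
            tok = (tok + 5) % 10  # deliberate drafting mistake
        block.append(tok)
        cur.append(tok)
    return block


def target_only(prompt: List[Token], target_next: NextFn, max_new: int) -> List[Token]:
    """Baseline: plain one-token-at-a-time decoding with the target."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out
```

Under this greedy rule, every emitted token is one the verifier would have chosen itself, so `draft_and_verify([0], toy_drafter, toy_target, 12)` returns the same sequence as `target_only([0], toy_target, 12)`. Any speedup comes from how many drafted tokens are accepted per verification step.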

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Published at 2025-12-17

#ML

The study presents DiffusionVL, a family of diffusion vision language models that can be created from any powerful autoregressive model. By fine-tuning, these models are adapted to the diffusion paradigm, leading to improved performance and faster inference speeds, even with less training data....

End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Published at 2025-12-17

#ML

This research presents a new method called Resampling Forcing, which allows for end-to-end training of autoregressive video models without needing a teacher model or online discriminator. The approach uses a self-resampling scheme to simulate model errors during training, ensuring better temporal consistency in long video generation....

IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

Published at 2025-12-17

#ML

The authors present a new framework called IC-Effect that uses artificial intelligence to add complex visual effects to videos while keeping the background unchanged and ensuring smooth blending. This method is efficient, customizable, and uses a special training process to create realistic effects with less compute, and the authors also provide a new dataset for this purpose....

In Pursuit of Pixel Supervision for Visual Pre-training

Published at 2025-12-17

#ML

The study presents 'Pixio', an improved autoencoder model that learns from pixels to create strong representations for various visual tasks. Trained on 2B web images, Pixio competes with or outperforms other large-scale models in tasks like depth estimation, 3D reconstruction, and semantic segmentation, showing that pixel-based self-supervised learning can be a promising alternative to other methods....

Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Published at 2025-12-17

#ML

This study evaluates Nano Banana Pro's ability to perform various low-level vision tasks compared to specialist models. The results show that while Nano Banana Pro creates visually appealing images, it falls short in meeting the strict pixel-level accuracy required by traditional metrics, indicating that it's not yet as precise as domain-specific models....

Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Published at 2025-12-17

#ML

The study presents a new method called Qwen-Image-Layered that breaks down images into separate layers, similar to professional design tools, enabling edits to specific parts without affecting the rest. This is achieved through three main components: a unifying model for RGB and RGBA images, a variable layer decomposition architecture, and a multi-stage training strategy. The researchers also created a pipeline to generate high-quality training images and found that their method outperforms exis...

Robust and Calibrated Detection of Authentic Multimedia Content

Published at 2025-12-17

#ML

This study presents a method to reliably detect authentic multimedia content by resynthesizing samples, which is effective against efficient adversaries and maintains low false positive rates. The approach is robust, controllable, and applicable to various modalities using advanced inversion techniques....

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Published at 2025-12-17

#ML

The study presents SCOPE, a method that enhances the performance of large language model agents in dynamic environments by evolving their prompts based on execution traces. This approach significantly improves task success rates and ensures the agents have the right strategies for various tasks, without requiring human intervention....

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Published at 2025-12-17

#ML

The study presents Skyra, a model that uses a large language model to detect and explain AI-generated videos by identifying visible artifacts. The authors created a new dataset and a two-stage training strategy to improve the model's performance, and extensive experiments show that Skyra outperforms existing methods in detecting AI-generated videos....

Step-GUI Technical Report

Published at 2025-12-17

#ML

This research presents a new method for creating high-quality training data for GUI automation at reduced cost and with increased accuracy. The authors also introduce a family of models, Step-GUI, which outperform existing models on GUI tasks while maintaining general capabilities. Additionally, they propose a new protocol, GUI-MCP, for standardized and private GUI automation across different devices, and introduce a new benchmark, AndroidDaily, to assess the practical usage of GUI agents in real-world...

Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Published at 2025-12-17

#ML

The study presents TIMAR, a framework for creating expressive 3D conversational heads that model dialogue as interleaved audio-visual contexts, improving the coordination and variability of head movements compared to existing methods....

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Published at 2025-12-17

#ML

This study introduces a benchmark to evaluate the performance of vision-language models in understanding long contexts with vision-text compression. The results show that most models struggle with capturing long associations or dependencies in compressed information, despite being able to decode textual information....

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.

(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
