🤗 Daily Paper(2025-09-09)

deep.di...@gmail.com

Sep 9, 2025, 4:06:57 PM

to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers you the curated list of papers by 🤗 Daily Papers.

project page

🤗 daily paper

Mechanistic interpretability for steering vision-language-action models

Published at 2025-08-29

#ML

The study presents a new framework for understanding and controlling Vision-Language-Action models, which are crucial for creating adaptable robots. By analyzing the model's internal workings, the authors identify key elements that influence robot actions, allowing real-time control without additional training or trial and error and making robots more transparent and usable in the real world....

Reinforced Visual Perception with Tools

Published at 2025-09-01

#ML

The study presents ReVPT, a new method that uses reinforcement learning to improve the visual reasoning abilities of multi-modal LLMs by training them to use visual tools. Experiments show that ReVPT outperforms existing methods on various visual reasoning benchmarks, providing new insights into RL-based visual tool-usage....

DivMerge: A divergence-based model merging method for multi-tasking

Published at 2025-09-02

#ML

The study presents a new method that combines multiple fine-tuned models into one while preserving good performance on all tasks. The approach uses Jensen-Shannon divergence to merge models without extra labeled data and scales to larger numbers of tasks better than existing methods....
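For readers unfamiliar with the divergence named above, here is a minimal sketch of Jensen-Shannon divergence between two discrete probability distributions. It only illustrates the standard definition; it is not the DivMerge procedure, and the function names `kl_divergence` and `js_divergence` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.

    Terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL divergence of p and q
    from their midpoint distribution m = (p + q) / 2.

    It is symmetric and bounded above by log(2) in nats.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Illustrative distributions only (not data from the paper).
    print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.3, 0.6]))
```

Unlike plain KL divergence, this quantity is symmetric and stays finite even when the two distributions do not share support, which is one common reason to prefer it when comparing model outputs.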

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Published at 2025-09-03

#ML

The authors present T2I-CoReBench, a new benchmark for evaluating text-to-image models, which focuses on both composition and reasoning capabilities. This benchmark includes 1,080 complex prompts and around 13,500 questions to assess the models' performance in understanding and generating images based on detailed descriptions, revealing that current models struggle with complex, high-density scenarios and inferring implicit elements....

Singular Value Few-shot Adaptation of Vision-Language Models

Published at 2025-09-03

#ML

The authors propose CLIP-SVD, a new method for adapting vision-language models like CLIP to new domains using Singular Value Decomposition. This technique fine-tunes only a small portion of the model's parameters, improving adaptation performance and preserving generalization ability, resulting in state-of-the-art classification results on various datasets....

Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping

Published at 2025-09-04

#ML

The study presents a new method called Inpaint4Drag that enhances drag-based image editing by breaking it down into pixel-level adjustments and image completion, resulting in faster, more precise, and model-agnostic edits with real-time previews....

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Published at 2025-09-06

#ML

The researchers created a trilingual language model called Llama-GENBA-10B, which can handle English, German, and Bavarian. This model aims to reduce the bias towards English in language models and promotes the use of Bavarian, a less common language, by using a balanced multilingual dataset and a unified tokenizer....

DCReg: Decoupled Characterization for Efficient Degenerate LiDAR Registration

Published at 2025-09-07

#ML

The study presents DCReg, a framework that tackles the unreliable detection and resolution of ill-conditioned problems in LiDAR point cloud registration, particularly in degenerate or narrow environments. By applying a Schur complement decomposition to the Hessian matrix, DCReg decouples the registration problem into clean rotational and translational subspaces, enabling reliable ill-conditioning detection and targeted mitigation, resulting in improved localiza...
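As background on the Schur complement mentioned in this summary, the generic construction for a symmetric block matrix is sketched below. The block labels (r for rotation, t for translation) and the choice of which block to eliminate are assumptions made for illustration; the paper's exact formulation may differ.

```latex
% Assumed block partition of the 6x6 registration Hessian
% (r = rotation, t = translation); labels are illustrative.
H = \begin{pmatrix} H_{rr} & H_{rt} \\ H_{rt}^{\top} & H_{tt} \end{pmatrix},
\qquad
S_{r} = H_{rr} - H_{rt}\, H_{tt}^{-1}\, H_{rt}^{\top},
\qquad
S_{t} = H_{tt} - H_{rt}^{\top}\, H_{rr}^{-1}\, H_{rt}.
```

Here S_r is the Schur complement of H_tt and S_t is the Schur complement of H_rr. Each describes the curvature of one subspace after the other has been eliminated, so its eigenvalues can be inspected without the rotation-translation coupling terms.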

Reverse-Engineered Reasoning for Open-Ended Generation

Published at 2025-09-07

#ML

The authors present a novel approach called REER that reverses the reasoning process to enable open-ended generation, overcoming the limitations of reinforcement learning and instruction distillation. They introduce a large-scale dataset, DeepWriting-20K, and a model, DeepWriter-8B, that outperforms open-source baselines and competes with leading proprietary models....

UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

Published at 2025-09-07

#ML

The authors have created a unified model for generating coordinated audio and video called UniVerse-1. They used a technique called stitching of experts to train the model efficiently and developed an online annotation pipeline to ensure accurate alignment between audio and video. The model, after being trained on 7,600 hours of data, produces well-coordinated audio-visuals for ambient sounds and strong alignment for speech. The authors also introduced a new benchmark dataset, Verse-Bench, and m...

D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

Published at 2025-09-08

#ML

The researchers created a new dataset of 4,379 Reddit memes annotated for dark humor, target category, and intensity rating, and proposed a reasoning-augmented framework that uses a Large Vision-Language Model to generate explanations for each meme. The framework then fuses text, image, and reasoning features to classify the memes, outperforming strong baselines in dark humor detection, target identification, and intensity prediction....

Does DINOv3 Set a New Medical Vision Standard?

Published at 2025-09-08

#ML

This research explores whether DINOv3, a powerful vision transformer trained on natural images, can be used for medical vision tasks without special training. The results show that while DINOv3 performs well and outperforms medical-specific models in many cases, it struggles with highly specialized tasks and doesn't always improve with larger models or higher resolutions....

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Published at 2025-09-08

#ML

This study examines how Vision-Language Models (VLMs) process complex visual environments and finds that attention patterns within VLMs can be improved to enhance visual reasoning. The researchers propose a new, training-free method called CARVE, which uses contrasting attention to extract task-relevant visual signals, resulting in significant performance improvements on open-source models....

Guided Decoding and Its Critical Role in Retrieval-Augmented Generation

Published at 2025-09-08

#ML

This research compares three guided decoding methods in Retrieval-Augmented Generation systems, analyzing their performance in different prompting setups to ensure accurate and structured responses from Large Language Models, providing insights for selecting the best method for specific applications....

Interleaving Reasoning for Better Text-to-Image Generation

Published at 2025-09-08

#ML

This study presents a new framework called Interleaving Reasoning Generation (IRG) that improves text-to-image generation by alternating between text-based thinking and image synthesis. The framework, trained using the Interleaving Reasoning Generation Learning (IRGL) method, outperforms existing models, resulting in better visual quality and fine-grained fidelity....

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Published at 2025-09-08

#ML

The authors present MAS-Bench, a benchmark for evaluating hybrid mobile GUI agents that combine graphical user interface operations with shortcuts such as APIs and deep links. The benchmark includes various tasks, predefined shortcuts, and evaluation metrics, and experiments show that hybrid agents outperform GUI-only agents in efficiency and success rates....

Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

Published at 2025-09-08

#ML

Paper2Agent is a framework that turns research papers into interactive AI agents, enabling users to carry out complex scientific queries through natural language. It analyzes papers and associated codebases to create a Model Context Protocol server, which can be connected to a chat agent for practical applications in fields like genomics and transcriptomics....

R^2AI: Towards Resistant and Resilient AI in an Evolving World

Published at 2025-09-08

#ML

The authors propose a new approach called safe-by-coevolution for creating safe AI, inspired by biological immunity, which treats safety as a dynamic and ongoing process. They introduce R^2AI as a practical framework that combines resistance to known threats and resilience to unforeseen risks, aiming to maintain safety in dynamic environments as AI advances towards AGI and ASI....

Reinforcement Learning Foundations for Deep Research Systems: A Survey

Published at 2025-09-08

#ML

The survey explores the application of reinforcement learning (RL) in training deep research systems, which are agentic AIs that tackle complex, multi-step tasks. It focuses on three main areas: data synthesis, RL methods for agentic research, and RL training systems, while also discussing agent architecture, evaluation, and benchmarks. The goal is to provide practical guidance for creating robust and transparent deep research agents using RL....

Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Published at 2025-09-08

#ML

The study presents TraDo, a series of advanced diffusion language models that use a new reinforcement learning framework called TraceRL. This framework improves reasoning ability in complex math and coding tasks, even outperforming larger models, and offers better sampling flexibility with its adaptability to larger blocks. Additionally, the research introduces a comprehensive open-source framework for developing, training, and deploying diffusion language models, which includes various fine-tun...

Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem

Published at 2025-09-08

#ML

The authors address the lack of high-quality data for improving mathematical reasoning in Large Language Models by creating a scalable data engine using E-prover's saturation capabilities on the TPTP axiom library. This results in a large, error-free corpus of theorems, which are then transformed into three difficulty-controlled challenges to reveal and address a weakness in current models' deep, structural reasoning abilities....

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

Published at 2025-09-08

#ML

This study presents a system called BFS-Prover-V2 that improves the training and inference processes of Large Language Models used in automated theorem proving. The system uses a new multi-stage reinforcement learning framework to enhance model performance during training and a multi-agent search architecture to improve reasoning capabilities during inference, leading to state-of-the-art results in formal mathematics benchmarks....

Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

Published at 2025-09-08

#ML

The study finds that test-time scaling, which enhances reasoning chains, does not consistently improve accuracy and often increases hallucinations in knowledge-intensive tasks. Despite this, allowing models to think is generally still advantageous....

WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Published at 2025-09-08

#ML

This research presents a new method called WebExplorer to generate challenging data for training advanced web agents, which can perform complex web navigation and multi-step reasoning. The resulting model, WebExplorer-8B, outperforms larger models on various information-seeking tasks and demonstrates strong generalization on the HLE benchmark....

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model derived from SOLAR-10.7B open LLM.

(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
