🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter delivers a curated list of papers from 🤗 Daily Papers. |
|
Mechanistic interpretability for steering vision-language-action models |
Published at 2025-08-29 |
#ML
|
The study presents a new framework for understanding and controlling Vision-Language-Action models, which are crucial for creating adaptable robots. By analyzing the models' internal workings, the authors identify key elements that influence robot actions. This allows real-time control without additional training or trial and error, and makes robots more transparent and usable in the real world.... |
|
Reinforced Visual Perception with Tools |
Published at 2025-09-01 |
#ML
|
The study presents ReVPT, a new method that uses reinforcement learning to improve the visual reasoning abilities of multi-modal LLMs by training them to use visual tools. Experiments show that ReVPT outperforms existing methods on various visual reasoning benchmarks, providing new insights into RL-based visual tool-usage.... |
|
DivMerge: A divergence-based model merging method for multi-tasking |
Published at 2025-09-02 |
#ML
|
The study presents a new method that combines multiple fine-tuned models into one while preserving strong performance on all tasks. The approach uses Jensen-Shannon divergence to guide the merge without extra labeled data, and it scales to larger numbers of tasks better than existing methods.... |
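For readers unfamiliar with the measure: the Jensen-Shannon divergence between two discrete distributions is the average KL divergence from each one to their midpoint. The sketch below computes it for two made-up output distributions. The function names and the toy numbers are illustrative assumptions, not taken from the DivMerge paper, and the code does not implement the merging procedure itself.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists.

    Terms with p_i == 0 contribute nothing. The caller must ensure
    q_i > 0 wherever p_i > 0 (always true for the midpoint used below).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical output distributions over a 3-token vocabulary.
finetuned = [0.7, 0.2, 0.1]
merged = [0.5, 0.3, 0.2]
print(js_divergence(finetuned, merged))
```

The measure is symmetric and bounded above by log 2 (in nats), which makes it a convenient distance-like signal between the outputs of two models.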
|
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? |
Published at 2025-09-03 |
#ML
|
The authors present T2I-CoReBench, a new benchmark for evaluating text-to-image models, which focuses on both composition and reasoning capabilities. This benchmark includes 1,080 complex prompts and around 13,500 questions to assess the models' performance in understanding and generating images based on detailed descriptions, revealing that current models struggle with complex, high-density scenarios and inferring implicit elements.... |
|
Singular Value Few-shot Adaptation of Vision-Language Models |
Published at 2025-09-03 |
#ML
|
The authors propose CLIP-SVD, a new method for adapting vision-language models like CLIP to new domains using Singular Value Decomposition. This technique fine-tunes only a small portion of the model's parameters, improving adaptation performance and preserving generalization ability, resulting in state-of-the-art classification results on various datasets.... |
|
Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping |
Published at 2025-09-04 |
#ML
|
The study presents a new method called Inpaint4Drag that enhances drag-based image editing by breaking it down into pixel-level adjustments and image completion, resulting in faster, more precise, and model-agnostic edits with real-time previews.... |
|
Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian |
Published at 2025-09-06 |
#ML
|
The researchers created a trilingual language model called Llama-GENBA-10B, which can handle English, German, and Bavarian. This model aims to reduce the bias towards English in language models and promotes the use of Bavarian, a less common language, by using a balanced multilingual dataset and a unified tokenizer.... |
|
DCReg: Decoupled Characterization for Efficient Degenerate LiDAR Registration |
Published at 2025-09-07 |
#ML
|
The study presents DCReg, a framework that tackles the inaccurate detection and resolution of ill-conditioned problems in LiDAR point cloud registration, particularly in degenerate or narrow environments. By applying a Schur complement decomposition to the Hessian matrix, DCReg decouples the registration problem into clean rotational and translational subspaces, enabling reliable ill-conditioning detection and targeted mitigation, resulting in improved localiza... |
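As a rough illustration of the Schur complement idea only: for a symmetric block matrix H = [[A, B], [B^T, D]], the Schur complement of A is D - B^T A^{-1} B, which describes the D block once its coupling to the A block has been eliminated. The toy sketch below uses 2x2 blocks for brevity. The block sizes, helper names, and numbers are assumptions for illustration and this is not DCReg's implementation.

```python
def inv2(m):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def schur_complement(a, b, d):
    """Return D - B^T A^{-1} B for the symmetric H = [[A, B], [B^T, D]]."""
    correction = matmul(matmul(transpose(b), inv2(a)), b)
    return [[d[i][j] - correction[i][j] for j in range(2)] for i in range(2)]


# Hypothetical toy blocks; in this sketch A and D stand in for the two
# parameter groups and B for the coupling between them.
A = [[2.0, 0.0], [0.0, 2.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
D = [[3.0, 0.0], [0.0, 3.0]]
print(schur_complement(A, B, D))
```

When the coupling block B is zero, the Schur complement equals D itself, meaning the two parameter groups are already decoupled.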
|
Reverse-Engineered Reasoning for Open-Ended Generation |
Published at 2025-09-07 |
#ML
|
The authors present a novel approach called REER that reverses the reasoning process to enable open-ended generation, overcoming the limitations of reinforcement learning and instruction distillation. They introduce a large-scale dataset, DeepWriting-20K, and a model, DeepWriter-8B, that outperforms open-source baselines and competes with leading proprietary models.... |
|
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts |
Published at 2025-09-07 |
#ML
|
The authors have created a unified model for generating coordinated audio and video called UniVerse-1. They used a technique called stitching of experts to train the model efficiently and developed an online annotation pipeline to ensure accurate alignment between audio and video. The model, after being trained on 7,600 hours of data, produces well-coordinated audio-visuals for ambient sounds and strong alignment for speech. The authors also introduced a new benchmark dataset, Verse-Bench, and m... |
|
D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning |
Published at 2025-09-08 |
#ML
|
The researchers created a new dataset of 4,379 Reddit memes annotated for dark humor, target category, and intensity rating, and proposed a reasoning-augmented framework that uses a Large Vision-Language Model to generate explanations for each meme. The framework then fuses text, image, and reasoning features to classify the memes, outperforming strong baselines in dark humor detection, target identification, and intensity prediction.... |
|
Does DINOv3 Set a New Medical Vision Standard? |
Published at 2025-09-08 |
#ML
|
This research explores whether DINOv3, a powerful vision transformer trained on natural images, can be used for medical vision tasks without special training. The results show that while DINOv3 performs well and outperforms medical-specific models in many cases, it struggles with highly specialized tasks and doesn't always improve with larger models or higher resolutions.... |
|
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning |
Published at 2025-09-08 |
#ML
|
This study examines how Vision-Language Models (VLMs) process complex visual environments and finds that attention patterns within VLMs can be improved to enhance visual reasoning. The researchers propose a new, training-free method called CARVE, which uses contrastive attention to extract task-relevant visual signals, resulting in significant performance improvements on open-source models.... |
|
Guided Decoding and Its Critical Role in Retrieval-Augmented Generation |
Published at 2025-09-08 |
#ML
|
This research compares three guided decoding methods in Retrieval-Augmented Generation systems, analyzing their performance in different prompting setups to ensure accurate and structured responses from Large Language Models, providing insights for selecting the best method for specific applications.... |
|
Interleaving Reasoning for Better Text-to-Image Generation |
Published at 2025-09-08 |
#ML
|
This study presents a new framework called Interleaving Reasoning Generation (IRG) that improves text-to-image generation by alternating between text-based thinking and image synthesis. The framework, trained using the Interleaving Reasoning Generation Learning (IRGL) method, outperforms existing models, resulting in better visual quality and fine-grained fidelity.... |
|
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents |
Published at 2025-09-08 |
#ML
|
The authors present MAS-Bench, a benchmark for evaluating hybrid mobile GUI agents that combine graphical user interface operations with shortcuts such as APIs and deep links. The benchmark includes various tasks, predefined shortcuts, and evaluation metrics, and experiments show that hybrid agents outperform GUI-only agents in efficiency and success rates.... |
|
Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents |
Published at 2025-09-08 |
#ML
|
Paper2Agent is a framework that turns research papers into interactive AI agents, enabling users to carry out complex scientific queries through natural language. It analyzes papers and associated codebases to create a Model Context Protocol server, which can be connected to a chat agent for practical applications in fields like genomics and transcriptomics.... |
|
R^2AI: Towards Resistant and Resilient AI in an Evolving World |
Published at 2025-09-08 |
#ML
|
The authors propose a new approach called safe-by-coevolution for creating safe AI, inspired by biological immunity, which treats safety as a dynamic and ongoing process. They introduce R^2AI as a practical framework that combines resistance to known threats and resilience to unforeseen risks, aiming to maintain safety in dynamic environments as AI advances towards AGI and ASI.... |
|
Reinforcement Learning Foundations for Deep Research Systems: A Survey |
Published at 2025-09-08 |
#ML
|
The survey explores the application of reinforcement learning (RL) in training deep research systems, which are agentic AIs that tackle complex, multi-step tasks. It focuses on three main areas: data synthesis, RL methods for agentic research, and RL training systems, while also discussing agent architecture, evaluation, and benchmarks. The goal is to provide practical guidance for creating robust and transparent deep research agents using RL.... |
|
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models |
Published at 2025-09-08 |
#ML
|
The study presents TraDo, a series of advanced diffusion language models that use a new reinforcement learning framework called TraceRL. This framework improves reasoning ability in complex math and coding tasks, even outperforming larger models, and offers better sampling flexibility with its adaptability to larger blocks. Additionally, the research introduces a comprehensive open-source framework for developing, training, and deploying diffusion language models, which includes various fine-tun... |
|
Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem |
Published at 2025-09-08 |
#ML
|
The authors address the lack of high-quality data for improving mathematical reasoning in Large Language Models by creating a scalable data engine using E-prover's saturation capabilities on the TPTP axiom library. This results in a large, error-free corpus of theorems, which are then transformed into three difficulty-controlled challenges to reveal and address a weakness in current models' deep, structural reasoning abilities.... |
|
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers |
Published at 2025-09-08 |
#ML
|
This study presents a system called BFS-Prover-V2 that improves the training and inference processes of Large Language Models used in automated theorem proving. The system uses a new multi-stage reinforcement learning framework to enhance model performance during training and a multi-agent search architecture to improve reasoning capabilities during inference, leading to state-of-the-art results in formal mathematics benchmarks.... |
|
Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet |
Published at 2025-09-08 |
#ML
|
The study finds that test-time scaling, which extends reasoning chains at inference time, does not consistently improve accuracy and often increases hallucinations in knowledge-intensive tasks. Despite this, allowing models to think is generally still advantageous.... |
|
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents |
Published at 2025-09-08 |
#ML
|
This research presents a new method called WebExplorer to generate challenging data for training advanced web agents, which can perform complex web navigation and multi-step reasoning. The resulting model, WebExplorer-8B, outperforms larger models on various information-seeking tasks and demonstrates strong generalization on the HLE benchmark.... |
|
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |