🤗 Daily Paper Newsletter |
Hope you found some gems! |
This newsletter delivers a curated list of papers from 🤗 Daily Papers. |
|
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries |
Published at 2025-02-27 |
|
#ML
|
The study examines how language models handle one-to-many factual queries and identifies a promote-then-suppress mechanism, where the model recalls all answers first and then suppresses previous ones. The mechanism is validated through various experiments, including using early decoding, causal tracing, and analyzing component interactions with input tokens using Token Lens and a knockout method.... |
Read More |
|
Words or Vision: Do Vision-Language Models Have Blind Faith in Text? |
Published at 2025-03-03 |
|
#ML
|
This research explores how Vision-Language Models (VLMs) handle inconsistent visual and textual data, finding that VLMs tend to trust text over visual input, which raises safety concerns. The authors analyze factors that influence this text bias, such as instruction prompts, language model size, and token order, and suggest fine-tuning with text augmentation as a solution.... |
Read More |
|
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning |
Published at 2025-03-04 |
|
#ML
|
This study addresses the challenge of distinguishing hard negative pairs in existing LMM-based embedding models by proposing a hardness-weighted contrastive learning framework, resulting in improved performance and scalability, and achieving state-of-the-art results on the MMEB benchmark.... |
Read More |
|
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders |
Published at 2025-03-05 |
|
#ML
|
This study improves the interpretability of Artificial Text Detection by using Sparse Autoencoders to extract features from the Gemma-2-2b residual stream. It identifies both interpretable and efficient features and analyzes their semantics and relevance to show how texts from various models differ from human-written content. The findings indicate that modern LLMs have a distinct writing style, especially in information-dense domains.... |
Read More |
|
NeuGrasp: Generalizable Neural Surface Reconstruction with Background Priors for Material-Agnostic Object Grasp Detection |
Published at 2025-03-05 |
|
#ML
|
NeuGrasp is a neural surface reconstruction method that uses background priors for material-agnostic object grasp detection, especially in challenging scenes with transparent and specular objects. It integrates transformers and global prior volumes for multi-view feature aggregation, and employs residual feature enhancement and an occupancy-prior volume for better spatial perception.... |
Read More |
|
State-offset Tuning: State-based Parameter-Efficient Fine-Tuning for State Space Models |
Published at 2025-03-05 |
|
#ML
|
The paper argues that state-based methods are a more effective alternative to prompt-based methods for Parameter-Efficient Fine-Tuning (PEFT) in State Space Models (SSMs). The authors propose a new state-based PEFT method, State-offset Tuning, which directly affects the state at the current timestep, leading to more effective adaptation.... |
Read More |
|
Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning |
Published at 2025-03-06 |
|
#ML
|
This research proposes a method to compress external knowledge for LLMs, called task-aware key-value (KV) cache compression, which improves reasoning efficiency and performance over existing methods such as RAG and task-agnostic compression.... |
Read More |
|
SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing |
Published at 2025-03-06 |
|
#ML
|
SurveyForge is a new tool that uses LLMs to improve the quality of survey papers. It generates the outline by analyzing human-written outlines and domain-related articles, then uses a scholar navigation agent to retrieve high-quality papers for generating and refining the content. SurveyBench is used to evaluate the AI-generated survey papers in three dimensions: reference, outline, and content quality, and experiments show that SurveyForge outperforms previous works.... |
Read More |
|
Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces |
Published at 2025-03-07 |
|
#ML
|
This study investigates the alignment of 3D and text latent spaces, discovering that projecting learned representations onto well-chosen lower-dimensional subspaces significantly improves the quality of alignment, leading to better performance on matching and retrieval tasks. The analysis of these shared subspaces reveals they roughly separate between semantic and geometric data representations.... |
Read More |
|
Novel Object 6D Pose Estimation with a Single Reference View |
Published at 2025-03-07 |
|
#ML
|
The paper presents a novel method named SinRef-6D for estimating the 6D pose of an unfamiliar object using only a single reference view. This method uses iterative camera-space point-wise alignment and state space models to effectively handle large pose discrepancies and capture long-range dependencies and spatial information, achieving performance comparable to CAD-based and dense reference view-based methods.... |
Read More |
|
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning |
Published at 2025-03-07 |
|
#ML
|
The proposed Symbolic-MoE framework allows for adaptive instance-level mixing of pre-trained LLM experts by emphasizing skills. It improves performance by dynamically selecting the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths, and then synthesizing the outputs into a final high-quality response. The system performs better than other strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an average improvement of 8.15% over the best multi-age... |
Read More |
|
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs |
Published at 2025-03-07 |
|
#ML
|
The study is the first comprehensive examination of the robustness of Mixture of LLM Agents (MoA) architectures against deceptive agents that intentionally provide misleading responses. It uncovers critical vulnerabilities, demonstrating that even a single deceptive agent can significantly reduce performance, and proposes unsupervised defense mechanisms inspired by historical voting processes.... |
Read More |
|
WritingBench: A Comprehensive Benchmark for Generative Writing |
Published at 2025-03-07 |
|
#ML
|
The authors have developed WritingBench, a comprehensive benchmark evaluating large language models in 6 core writing domains and 100 subdomains, including creative, persuasive, informative, and technical writing. They additionally propose a query-dependent evaluation framework with a fine-tuned critic model for criteria-aware scoring, demonstrating its validity by allowing 7B-parameter models to approach state-of-the-art performance. The benchmark and evaluation tools are being open-sourced to ... |
Read More |
|
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs |
Published at 2025-03-08 |
|
#ML
|
The paper presents Llama-MTSK, a Matryoshka-based Multimodal LLM for Adaptive Audio-Visual Speech Recognition, which allows for flexible adaptation of audio-visual token allocation based on computational constraints without sacrificing performance. This approach, inspired by Matryoshka Representation Learning, encodes audio-visual representations at multiple granularities within a single model and introduces three LoRA-based Matryoshka strategies for efficient fine-tuning of the LLM, resulting i... |
Read More |
|
BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling |
Published at 2025-03-08 |
|
#ML
|
The research applies RWKV-7 and its meta-learning to time series modeling by implementing it in a Timer model. The resulting model achieves significant performance improvements over traditional Transformers, LSTMs, and GRUs, while also reducing training time and parameter requirements.... |
Read More |
|
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations |
Published at 2025-03-08 |
|
#ML
|
The proposed Zero-AVSR framework utilizes a novel Audio-Visual Speech Romanizer (AV-Romanizer) to learn language-agnostic speech representations, enabling zero-shot AVSR. It then leverages Large Language Models (LLMs) to convert predicted Roman text into language-specific graphemes, forming the Cascaded Zero-AVSR. Alternatively, it integrates audio-visual speech representations into the LLM through a unified approach, fine-tuning the adapter and LLM using a multi-task learning scheme. The Multil... |
Read More |
|
Agent models: Internalizing Chain-of-Action Generation into Reasoning models |
Published at 2025-03-09 |
|
#ML
|
The proposed framework, AutoCoA, enables Large Agent Models to autonomously generate and decide on actions using a combination of supervised fine-tuning and reinforcement learning, outperforming traditional ReAct-based workflows in tasks requiring long-term reasoning and multi-step actions.... |
Read More |
|
DiffCLIP: Differential Attention Meets CLIP |
Published at 2025-03-09 |
|
#ML
|
DiffCLIP is a new model that adds a specialized attention mechanism to CLIP, boosting its performance on various image-text tasks without much increase in computation. It improves upon the original CLIP in zero-shot classification, retrieval, and robustness, making it a more efficient and accurate multi-modal representation model.... |
Read More |
|
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation |
Published at 2025-03-09 |
|
#ML
|
The authors present FEA-Bench, a benchmark for evaluating large language models' ability to generate code for new features within existing repositories. They gather data from GitHub pull requests and use filters to focus on tasks related to feature development, pairing each task with unit tests to verify the solution, and find that LLMs face challenges in this repository-level incremental development.... |
Read More |
|
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks |
Published at 2025-03-09 |
|
#ML
|
The paper presents ProBench, a benchmark for evaluating advanced multimodal intelligence in multimodal foundation models. Experiments with 24 models reveal challenges in visual perception, text understanding, domain knowledge, and advanced reasoning, providing directions for future multimodal AI research.... |
Read More |
|
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement |
Published at 2025-03-09 |
|
#ML
|
The paper presents Seg-Zero, a new framework that uses cognitive reinforcement to interpret user intentions, generate explicit reasoning chains, and produce precise pixel-level masks. Unlike traditional methods, Seg-Zero doesn't rely on supervised fine-tuning and achieves robust zero-shot generalization, outperforming previous models by 18% on the ReasonSeg benchmark.... |
Read More |
|
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models |
Published at 2025-03-09 |
|
#ML
|
The proposed model Vision-R1 improves reasoning capability in multimodal language models by constructing a high-quality multimodal CoT dataset and employing a Progressive Thinking Suppression Training strategy with Group Relative Policy Optimization, resulting in an average improvement of 6% across various multimodal math reasoning benchmarks.... |
Read More |
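For readers unfamiliar with Group Relative Policy Optimization, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step in plain Python. The function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions (implementations differ), and nothing here reproduces Vision-R1's training recipe.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: population std is an assumption, and eps only
    guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt; 1.0 marks a correct answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so the policy update favors the better answers within each group.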
|
What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization |
Published at 2025-03-09 |
|
#ML
|
The study explores how model architectures and pre-training objectives influence feature richness for domain generalization. The proposed method discovers latent domain structures, known as pseudo-domains, to capture domain-specific variations and enhances classifiers with these representations, achieving better generalization to unseen domains, particularly with features from diffusion models, as demonstrated on 5 datasets.... |
Read More |
|
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning |
Published at 2025-03-10 |
|
#ML
|
The work presents SlotMIM, a method to improve object-centric representation learning in robots using pre-trained vision models. It introduces a semantic bottleneck and cross-view consistency regularization to enhance the models' performance, especially on non-(single-)object-centric datasets, and it achieves better data efficiency and scalability.... |
Read More |
|
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning |
Published at 2025-03-10 |
|
#ML
|
The AlphaDrive framework uses a reinforcement learning and reasoning approach to improve performance and efficiency in autonomous driving by integrating vision-language models, and it also discovers emergent multimodal planning capabilities.... |
Read More |
|
Automated Movie Generation via Multi-Agent CoT Planning |
Published at 2025-03-10 |
|
#ML
|
MovieAgent is introduced to automate movie/long-video generation, offering two main advantages: 1) It generates multi-scene, multi-shot long-form videos with a coherent narrative, character consistency, synchronized subtitles, and stable audio. 2) It uses a hierarchical CoT-based reasoning process with multiple LLM agents to structure scenes, camera settings, and cinematography, significantly reducing human effort and achieving new state-of-the-art results in script faithfulness, character consi... |
Read More |
|
Detection Avoidance Techniques for Large Language Models |
Published at 2025-03-10 |
|
#ML
|
This study focuses on the risks associated with large language models, like spreading fake news, and introduces evasion techniques that can bypass detectors such as DetectGPT, including changing the model's temperature, fine-tuning via reinforcement learning, and rephrasing. The research compares these methods to existing work, highlighting their better performance and discussing potential societal implications and future research directions.... |
Read More |
|
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs |
Published at 2025-03-10 |
|
#ML
|
The paper proposes DistiLLM-2, a contrastive approach for language model distillation that improves performance by aligning teacher and student models across different data types. Experiments show that this method builds high-performing student models for tasks like instruction-following and code generation, and supports applications like preference alignment and vision-language extensions.... |
Read More |
|
DreamRelation: Relation-Centric Video Customization |
Published at 2025-03-10 |
|
#ML
|
DreamRelation is a new method for creating personalized videos that show specific relations between two subjects. It uses two main parts: Relational Decoupling Learning to separate relations from subject appearances and Relational Dynamics Enhancement to focus on relational dynamics while ignoring unnecessary details. This approach handles complex relations better than existing methods and generalizes more effectively.... |
Read More |
|
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer |
Published at 2025-03-10 |
|
#ML
|
The paper presents EasyControl, a new framework that enhances control for diffusion transformers, using innovations like a lightweight Condition Injection LoRA Module for flexible condition injection, a Position-Aware Training Paradigm for efficient image generation, and a Causal Attention Mechanism to reduce latency, all of which improve the efficiency and flexibility of the framework.... |
Read More |
|
Effective and Efficient Masked Image Generation Models |
Published at 2025-03-10 |
|
#ML
|
A new model, eMIGM, is proposed for masked image generation that combines the principles of masked models and diffusion models. eMIGM demonstrates superior performance and efficiency compared to existing models, achieving state-of-the-art performance with fewer function evaluations.... |
Read More |
|
Efficient Distillation of Classifier-Free Guidance using Adapters |
Published at 2025-03-10 |
|
#ML
|
The paper presents adapter guidance distillation (AGD), a method that uses lightweight adapters to approximate classifier-free guidance (CFG) for conditional diffusion models, effectively doubling the sampling speed while maintaining or improving sample quality. Unlike other methods, AGD keeps the base model frozen and only trains minimal additional parameters (~2%), making it more resource-efficient and accessible for training large models on a single consumer GPU.... |
Read More |
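The doubling in sampling speed follows from how classifier-free guidance is usually computed: each denoising step runs the model twice, once with and once without the condition, and then extrapolates between the two predictions. The plain-Python sketch below shows that standard combination. `toy_denoiser` is a hypothetical stand-in for a diffusion model's noise prediction, not the paper's model, and AGD's adapters are not shown.

```python
def toy_denoiser(x, cond=None):
    """Hypothetical noise predictor; a real model would be a neural network."""
    base = [0.1 * v for v in x]
    if cond is None:
        return base  # unconditional prediction
    return [b + 0.5 * c for b, c in zip(base, cond)]  # conditional prediction

def cfg_prediction(x, cond, guidance_scale):
    """Standard CFG: two forward passes, then extrapolate toward the conditional one."""
    eps_uncond = toy_denoiser(x)        # forward pass 1
    eps_cond = toy_denoiser(x, cond)    # forward pass 2
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

guided = cfg_prediction([1.0, 1.0], [2.0, 2.0], guidance_scale=3.0)
```

A distilled model that reproduces `guided` in a single forward pass removes one of the two evaluations per step, which is where the speedup described in the summary comes from.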
|
FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates |
Published at 2025-03-10 |
|
#ML
|
FedRand is a new framework for training vision-language models in a decentralized way that improves data privacy by randomly selecting subparameters of LoRA from the server and keeping the rest as private parameters. This method reduces the risk of exposing client-side VLM parameters and enhances data privacy, while still achieving accuracy comparable to methods that communicate full LoRA parameters.... |
Read More |
|
HumanMM: Global Human Motion Recovery from Multi-shot Videos |
Published at 2025-03-10 |
|
#ML
|
This study introduces a new system that recovers realistic long-sequence 3D human motion in world coordinates from multi-shot videos, addressing challenges like abrupt transitions and dynamic backgrounds using improved camera pose estimation, a shot transition detector, and a robust alignment module.... |
Read More |
|
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning |
Published at 2025-03-10 |
|
#ML
|
MM-Eureka is a model for multimodal reasoning that applies large-scale rule-based reinforcement learning to multimodal settings, reproducing text-based RL system characteristics and showing that both instruction-tuned and pre-trained models can develop strong multimodal reasoning abilities without supervised fine-tuning.... |
Read More |
|
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning |
Published at 2025-03-10 |
|
#ML
|
MedAgentsBench is a new benchmark for advanced medical reasoning that focuses on complex questions, diagnosis formulation, and treatment planning. It addresses limitations in existing evaluations, such as straightforward questions and inconsistent sampling, and finds that DeepSeek R1 and OpenAI o3 models perform well, with search-based agent methods showing good performance-to-cost ratios.... |
Read More |
|
PE3R: Perception-Efficient 3D Reconstruction |
Published at 2025-03-10 |
|
#ML
|
The authors present PE3R, a new system for faster and more accurate 3D scene understanding from 2D images. It outperforms current methods in speed, accuracy, and generalization, and its code is available online.... |
Read More |
|
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding |
Published at 2025-03-10 |
|
#ML
|
The authors propose REF-VLM, a unified framework for training various visual decoding tasks, addressing limitations in current MLLMs for dense prediction tasks and multi-task/multi-granularity scenarios. They introduce the Triplet-Based Referring Paradigm (TRP) for structured representation learning and create the Visual-Task Instruction Following Dataset (VTInstruct) with diverse visual prompts and units, demonstrating superior performance across standard benchmarks.... |
Read More |
|
RePO: ReLU-based Preference Optimization |
Published at 2025-03-10 |
|
#ML
|
The proposed RePO algorithm streamlines preference optimization for LLMs by eliminating the need for a hyperparameter through two advances: (1) retaining the reference-free margins of SimPO but removing beta through gradient analysis, and (2) adopting a ReLU-based max-margin loss that filters trivial pairs. Empirically, RePO outperforms DPO and SimPO across multiple base models, needing only one hyperparameter to tune.... |
Read More |
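As a rough illustration of the ReLU-based max-margin idea, the sketch below computes a hinge-style loss on a reference-free margin. It assumes the margin is the difference of length-normalized log-probabilities between the chosen and rejected responses (in the spirit of SimPO) and that `gamma` is the single remaining hyperparameter. The function names and the exact form are assumptions for illustration, and the paper's formulation may differ.

```python
def avg_logprob(token_logprobs):
    """Length-normalized log-probability of one response (assumed reward)."""
    return sum(token_logprobs) / len(token_logprobs)

def relu_margin_loss(chosen_logprobs, rejected_logprobs, gamma):
    """Hinge-style loss: zero once the margin exceeds the target gamma."""
    margin = avg_logprob(chosen_logprobs) - avg_logprob(rejected_logprobs)
    return max(0.0, gamma - margin)  # ReLU(gamma - margin)

# Well-separated pair: margin 1.0 exceeds gamma, so it contributes no loss.
easy = relu_margin_loss([-0.5, -0.5], [-2.0, -1.0], gamma=0.5)
# Poorly separated pair: margin 0.0, so the loss equals gamma.
hard = relu_margin_loss([-1.0, -1.0], [-1.2, -0.8], gamma=0.5)
```

The `easy` pair shows the filtering behavior described in the summary: pairs the model already separates by more than `gamma` produce zero loss and therefore no gradient.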
|
SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models |
Published at 2025-03-10 |
|
#ML
|
The paper proposes a method called Sparse Expert Activation Pruning (SEAP) to reduce the computational cost of large language models during inference. SEAP identifies and retains task-relevant parameters, improving efficiency without compromising performance.... |
Read More |
|
Should VLMs be Pre-trained with Image Data? |
Published at 2025-03-10 |
|
#ML
|
This study examines the impact of integrating image data during pre-training for vision-language models (VLMs), comparing single- and two-step pipelines. Results indicate that pre-training with both image and text data enhances performance on vision-language tasks, while maintaining strong text-only evaluation results. For a 1B model, introducing visual tokens 80% of the way through pre-training yields an average 2% improvement over introducing them to a fully pre-trained model, across 6 diverse tasks.... |
Read More |
|
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models |
Published at 2025-03-10 |
|
#ML
|
The paper introduces TRCE, a two-stage concept erasure strategy for text-to-image diffusion models to erase malicious content while preserving the model's generation capability. TRCE identifies and erases malicious semantics in textual prompts and steers the denoising prediction towards safe directions using contrastive learning.... |
Read More |
|
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning |
Published at 2025-03-10 |
|
#ML
|
The paper introduces MMDiag, a multi-turn multimodal dialogue dataset, to better reflect real-world human conversations. The dataset is used to benchmark and challenge the grounding and reasoning capabilities of multimodal large language models (MLLMs). Additionally, the paper presents DiagNote, an MLLM with multimodal grounding and reasoning capabilities, which outperforms existing models both in grounding and in jointly processing and reasoning over vision and language information.... |
Read More |
|
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment |
Published at 2025-03-10 |
|
#ML
|
The paper introduces Autoregressive Representation Alignment (ARRA), a new training method for large language models (LLMs) that enables them to generate images from text without changing their architecture. ARRA uses a loss function and a special token to align the model's hidden states with visual representations, allowing the model to learn spatial and contextual coherence while retaining its original autoregressive nature. The method works well with popular LLMs and outperforms direct fine-t... |
Read More |
|
VACE: All-in-One Video Creation and Editing |
Published at 2025-03-10 |
|
#ML
|
VACE is a unified framework that enables users to perform various video tasks like reference-to-video generation, video-to-video editing, and masked video-to-video editing. It organizes video task inputs into a unified interface called the Video Condition Unit (VCU) and utilizes a Context Adapter structure to handle arbitrary video synthesis tasks flexibly.... |
Read More |
|
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation |
Published at 2025-03-10 |
|
#ML
|
The paper presents WISE, a new evaluation benchmark for assessing world knowledge and semantic understanding in Text-to-Image models. WISE uses 1000 prompts across 25 sub-domains and WiScore, a novel metric, to test the models' ability to generate images based on complex semantic understanding and world knowledge.... |
Read More |
|
YOLOE: Real-Time Seeing Anything |
Published at 2025-03-10 |
|
#ML
|
YOLOE is a real-time object detection and segmentation model that adapts to various open prompt mechanisms with high efficiency and minimal complexity. It offers improved performance and transferability, outperforming other models like YOLO-Worldv2-S and YOLOv8-L in both closed-set and zero-shot scenarios.... |
Read More |
|
Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |
Visit Developer's Social Media |