🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Adversarial Confusion Attack: Disrupting Multimodal Large Language Models
Published at 2025-11-25

#ML

The study presents a new method for disrupting multimodal large language models by making them generate incorrect or incoherent responses. This is achieved by embedding adversarial images that increase uncertainty in the model's outputs, affecting both open-source and proprietary models.
Read More

AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Published at 2025-11-25

#ML

The study presents AlignBench, a new benchmark for evaluating how well image-text alignment models understand detailed image-caption pairs. Evaluations across a range of models show that many struggle with fine-grained alignment, that detectors overrate early sentences, and that models favor their own outputs, all of which hurts performance.
Read More

Qwen3-VL Technical Report
Published at 2025-11-26

#ML

Qwen3-VL is a powerful vision-language model that excels at understanding text, images, and videos. It comes in several variants to balance speed and quality, and it is particularly good at retaining and connecting information across long texts and videos, making it useful for tasks such as reasoning and decision-making.
Read More

Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem
Published at 2025-11-27

#ML

This study examines the open AI model economy using data from the Hugging Face Model Hub, revealing a shift in power from US-based companies to unaffiliated developers, community organizations, and Chinese industry. The research also highlights changes in model properties, such as increased size and multimodal generation, and the emergence of developer intermediaries focused on quantizing and adapting base models.
Read More

Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
Published at 2025-11-27

#ML

The authors propose a new alignment strategy for normalizing flows that improves generative quality by aligning intermediate features of the reverse pass with a powerful vision model. They also introduce a new test-time optimization algorithm for classification, resulting in faster training, better accuracy, and new state-of-the-art results on ImageNet.
Read More

AutoNeural: Co-Designing Vision-Language Models for NPU Inference
Published at 2025-12-02

#ML

The authors present AutoNeural, a new vision-language model architecture designed for efficient inference on Neural Processing Units (NPUs). By replacing traditional components with more efficient alternatives, such as MobileNetV5-style backbones and State-Space Model principles, AutoNeural significantly reduces quantization error and latency, while also improving decoding speed and context window, making it ideal for real-time applications on NPUs.
Read More

OneThinker: All-in-one Reasoning Model for Image and Video
Published at 2025-12-02

#ML

The study presents OneThinker, a unified model for visual reasoning that combines image and video understanding across various tasks such as question answering, captioning, and tracking. It introduces a new training corpus and a method for multi-task reinforcement learning to improve performance on 31 visual benchmarks, demonstrating effective knowledge transfer and preliminary zero-shot generalization ability.
Read More

PretrainZero: Reinforcement Active Pretraining
Published at 2025-12-02

#ML

PretrainZero is a new framework that enables general reasoning abilities in large thinking models through active learning from a general corpus, without relying on domain-specific rewards or supervised training. It significantly improves the performance of pretrained models on various benchmarks and serves as a foundation for downstream tasks.
Read More

SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
Published at 2025-12-02

#ML

The paper introduces stable rank, an intrinsic, annotation-free quality signal derived from a Large Language Model's representations. This signal measures the effective dimensionality of hidden states and drives a reinforcement learning technique called SR-GRPO, which improves model performance on various tasks without relying on external supervision.
Read More

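Stable rank itself is a standard matrix quantity: the squared Frobenius norm divided by the squared spectral norm. A minimal sketch of computing it for a hidden-state matrix, assuming the paper uses this standard definition (the paper's exact reward construction is not shown here):

```python
import numpy as np

def stable_rank(H: np.ndarray) -> float:
    """Stable rank of a matrix: ||H||_F^2 / ||H||_2^2.

    Always lies between 1 and rank(H); a rough measure of how many
    singular directions carry significant energy.
    """
    s = np.linalg.svd(H, compute_uv=False)      # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2))

# Example: hidden states for 8 tokens in a 16-dim space (random stand-in).
rng = np.random.default_rng(0)
H = rng.standard_normal((8, 16))
sr = stable_rank(H)   # a value between 1.0 and 8.0
```

A higher value suggests the token representations spread their energy over more directions rather than collapsing onto a dominant one.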
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Published at 2025-12-02

#ML

This study presents TACO, a method that helps Vision-Language-Action models perform better in real-world tasks by reducing irrelevant actions and improving inference stability. The approach mirrors the anti-exploration principle in offline reinforcement learning and requires no complex gradient computations.
Read More

UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Published at 2025-12-02

#ML

The authors present a framework called UniQL that combines quantization and low-rank compression to make large language models work better on mobile devices with limited resources. The framework applies to different types of models and improves their performance by reducing memory usage and increasing speed, all while maintaining accuracy.
Read More

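As a generic illustration only (not UniQL's actual algorithm), one common way to combine the two compression schemes is to keep a small full-precision low-rank factor and quantize the residual, so large-magnitude structure does not blow up the quantization scale:

```python
import numpy as np

def quantize_sym(W, bits=4):
    """Symmetric uniform quantize-dequantize to `bits` bits (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale

def quant_plus_lowrank(W, bits=4, rank=4):
    """Approximate W as (quantized residual) + (full-precision low-rank part)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # low-rank component
    return quantize_sym(W - L, bits) + L

# Toy weight matrix: a strong rank-1 component plus small noise.
rng = np.random.default_rng(0)
u, v = rng.standard_normal((64, 1)), rng.standard_normal((1, 64))
W = 10 * u @ v + 0.1 * rng.standard_normal((64, 64))

err_q  = np.linalg.norm(W - quantize_sym(W))        # quantization alone
err_ql = np.linalg.norm(W - quant_plus_lowrank(W))  # with low-rank correction
```

On matrices with this kind of outlier structure, the low-rank correction shrinks the residual's dynamic range, so the same bit budget yields a much finer quantization grid and lower reconstruction error.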
ViDiC: Video Difference Captioning
Published at 2025-12-02

#ML

The ViDiC-1K dataset is presented to assess video understanding and comparative reasoning in multimodal models. The dataset contains 1,000 video pairs with detailed annotations, focusing on differences and similarities in aspects such as motion, style, and location, to evaluate how well models describe visual changes over time.
Read More

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Published at 2025-12-03

#ML

AdaptVision is a new approach for Vision-Language Models that reduces computation by adaptively acquiring only the necessary visual tokens. It uses a coarse-to-fine method, starting with low-resolution images and cropping key regions when needed, and is trained with a reinforcement learning framework that balances accuracy and efficiency, resulting in better performance with fewer visual tokens than existing methods.
Read More

BlurDM: A Blur Diffusion Model for Image Deblurring
Published at 2025-12-03

#ML

The study presents a Blur Diffusion Model (BlurDM) that improves image deblurring by incorporating the blur formation process into diffusion models. BlurDM simultaneously denoises and deblurs images, outperforming existing methods on benchmark datasets by integrating efficiently into deblurring networks in the latent space.
Read More

CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
Published at 2025-12-03

#ML

The authors present a new diffusion-based framework called CookAnything that generates coherent and distinct image sequences from cooking instructions of any length. The framework includes three components that align textual steps with images, maintain temporal coherence and spatial diversity, and ensure ingredient consistency across steps, resulting in better recipe illustrations than existing methods.
Read More

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Published at 2025-12-03

#ML

This research proposes a new method called DIG for selecting frames in long-form video understanding that adapts to different types of queries. DIG uses a simple sampling method for broad queries and a specialized method for specific queries, improving performance and efficiency over existing methods.
Read More

In-Context Representation Hijacking
Published at 2025-12-03

#ML

The study presents Doublespeak, an attack method that replaces harmful keywords with benign tokens in prompts, causing large language models to interpret innocent-looking prompts as harmful instructions. The attack, which works without optimization and is effective across various models, reveals a significant security vulnerability in the latent space of language models.
Read More

Jina-VLM: Small Multilingual Vision Language Model
Published at 2025-12-03

#ML

The researchers have developed a 2.4B-parameter vision-language model called Jina-VLM that excels at answering visual questions across multiple languages, outperforming similar models by efficiently processing images and text with a unique attention-pooling connector.
Read More

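The summary mentions an attention-pooling connector; generically, attention pooling compresses N visual tokens into a single vector via a learned query. A minimal single-head sketch of the general technique (not Jina-VLM's actual connector, whose details are not given here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patches, q, Wk, Wv):
    """Pool N patch embeddings into one vector with a learned query.

    patches: (N, d) visual tokens; q: (d,) learned query vector;
    Wk, Wv: (d, d) key/value projection matrices.
    """
    k, v = patches @ Wk, patches @ Wv
    w = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N,) attention weights
    return w @ v                                 # (d,) pooled embedding

# Example: pool 9 patch tokens of dimension 16 (random stand-ins).
rng = np.random.default_rng(0)
d, N = 16, 9
out = attention_pool(rng.standard_normal((N, d)), rng.standard_normal(d),
                     rng.standard_normal((d, d)), rng.standard_normal((d, d)))
```

The appeal of such a connector is that it shrinks the visual token count handed to the language model while letting the query learn which patches matter.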
PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design
Published at 2025-12-03

#ML

The authors propose a new framework called PosterCopilot that improves the accuracy and control of layouts in professional graphic design. The framework uses a three-stage training process to enhance the geometric understanding and aesthetic reasoning of large multimodal models, allowing for more precise and visually appealing designs, as well as easier editing and refinement.
Read More

RELIC: Interactive Video World Model with Long-Horizon Memory
Published at 2025-12-03

#ML

The RELIC framework is introduced to create an interactive world model that combines real-time long-horizon streaming, consistent spatial memory, and precise user control. The model uses compressed historical latent tokens and a memory-efficient self-forcing paradigm to enable long-duration exploration of arbitrary scenes in real time, outperforming previous work in accuracy, stability, and robustness.
Read More

Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
Published at 2025-12-03

#ML

The authors present a new framework called PRIS that improves the alignment between user intent and generated visuals in text-to-visual generation by adaptively revising the prompt during inference. They introduce a new verifier, element-level factual correction, to provide precise feedback for prompt revision, and demonstrate the effectiveness of the approach through extensive experiments on text-to-image and text-to-video benchmarks.
Read More

SkillFactory: Self-Distillation For Learning Cognitive Behaviors
Published at 2025-12-03

#ML

The study presents SkillFactory, a method that lets models learn cognitive skills through self-distillation during a supervised fine-tuning stage before reinforcement learning. SkillFactory rearranges samples drawn from the model itself into training data, helping models generalize to harder tasks and apply cognitive skills effectively, making them more robust to out-of-domain tasks than reinforcement-learned base models.
Read More

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Published at 2025-12-03

#ML

This study presents a new method called Double Interactive Reinforcement Learning (DIRL) that helps Vision Language Models (VLMs) use multiple tools together for better spatial understanding. DIRL lets VLMs discover optimal tool-use patterns without relying on predefined strategies, and the resulting model, SpaceTools, outperforms existing methods on spatial reasoning benchmarks and in real-world manipulation tasks with a robotic arm.
Read More

Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Published at 2025-12-03

#ML

The study presents CodeVision, a flexible framework that allows models to generate code for any image operation, improving tool-based reasoning for multimodal large language models. This approach addresses the brittleness of current models when dealing with simple image changes or corruptions, enhancing performance through a two-stage training methodology and introducing new datasets and benchmarks for evaluation.
Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media