🤗 Daily Paper (2025-09-03)

deep.di...@gmail.com

Sep 3, 2025, 4:07:07 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models

Published at 2025-08-28

#ML

The authors present a new framework called FastFit that significantly speeds up virtual try-on technology by using a cacheable diffusion model and a Semi-Attention mechanism, resulting in a 3.5x speedup over existing methods. They also introduce a new large-scale dataset called DressCode-MR to support research in this area....

On the Theoretical Limitations of Embedding-Based Retrieval

Published at 2025-08-28

#ML

This study shows that even simple queries can hit theoretical limits in vector embeddings, which are used for tasks like reasoning and coding. The research introduces a new dataset, LIMIT, to test this theory and finds that even the best models fail on it, highlighting the need for new methods to overcome this fundamental limitation....

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Published at 2025-08-29

#ML

This study presents ELV-Halluc, the first benchmark for long-video hallucinations, to investigate a new type of hallucination called Semantic Aggregation Hallucination (SAH). SAH occurs when models generate incorrect outputs with correct frame-level semantics, and it becomes more critical in long videos due to increased complexity. The researchers confirm the existence of SAH, show that it increases with semantic complexity, and suggest potential approaches to mitigate it, such as positional enc...

Stairway to Fairness: Connecting Group and Individual Fairness

Published at 2025-08-29

#ML

This study explores the relationship between group and individual fairness in recommender systems (RS), finding that recommendations can be fair for groups but unfair for individuals. The research provides a comprehensive comparison of evaluation measures for both fairness types and offers useful insights for RS practitioners, with the code available for further use....

Universal Deep Research: Bring Your Own Model and Strategy

Published at 2025-08-29

#ML

The study presents a new system called Universal Deep Research that allows users to customize their own deep research strategies using any language model without additional training. The system is demonstrated with example strategies and a user interface for experimentation....

C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Object Detection

Published at 2025-08-30

#ML

The study presents a new method called Context-Aware Fusion (CAF) that improves fine-grained object detection, like identifying vehicle damage, by combining global scene context with local features. CAF uses cross-attention mechanisms and a dedicated encoder to capture and integrate comprehensive environmental information, enhancing the generative detection paradigm and outperforming state-of-the-art models on the CarDD benchmark....

Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

Published at 2025-08-30

#ML

The authors present a new architecture, Gated Associative Memory (GAM), for sequence modeling that is faster than the common Transformer approach, especially for long contexts, by using two parallel pathways to efficiently capture both local and global information, and a gating mechanism to combine them....
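
As a rough illustration of the gating idea in the summary above, the sketch below blends a local and a global feature stream with a sigmoid gate. The convex-blend form, the function names, and the toy values are assumptions made for illustration; the summary does not say how GAM computes its two pathways or its gate.

```python
import math


def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(local_feats, global_feats, gate_logits):
    """Blend two same-length feature lists element-wise.

    Assumed form (not taken from the paper): out = g * local + (1 - g) * global,
    where g = sigmoid(gate_logit).
    """
    out = []
    for loc, glob, logit in zip(local_feats, global_feats, gate_logits):
        g = sigmoid(logit)
        out.append(g * loc + (1.0 - g) * glob)
    return out


# Toy inputs: a gate logit of 0.0 gives g = 0.5, an even blend of both streams.
local_feats = [1.0, 3.0]
global_feats = [3.0, 1.0]
gate_logits = [0.0, 0.0]
mixed = gated_combine(local_feats, global_feats, gate_logits)
```

A large positive gate logit would push the output toward the local stream, and a large negative one toward the global stream.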

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Published at 2025-08-30

#ML

The study presents LLaVA-Critic-R1, a multimodal model that challenges the conventional separation between critic and policy models in vision-language tasks. By reorganizing preference-labeled critic datasets and performing reinforcement learning on a base generative model, LLaVA-Critic-R1 excels as both a critic and a competitive policy model, outperforming specialized reasoning VLMs across various benchmarks. Furthermore, the model's enhanced critic ability improves inference, leading to a sim...

Metis: Training Large Language Models with Advanced Low-Bit Quantization

Published at 2025-08-30

#ML

The authors present Metis, a framework for training large language models with low-bit quantization. It disentangles dominant from long-tail components, uses adaptive learning rates, and applies a dual-range regularizer. This results in improved model performance and stability, even at very low bit widths, surpassing the accuracy of traditional floating-point training....

MobiAgent: A Systematic Framework for Customizable Mobile Agents

Published at 2025-08-30

#ML

The authors present MobiAgent, a new system for mobile agents that improves accuracy and efficiency through three main parts: advanced agent models, a speed-up framework, and a testing tool. They also created a data collection method using AI to lower the cost of annotations, resulting in better performance than other general and specialized mobile agent models....

SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

Published at 2025-08-30

#ML

The paper presents a new framework called SQL-of-Thought that uses in-context learning and chain-of-thought to convert natural language queries into SQL queries more effectively. This framework improves upon previous systems by adding a guided error correction loop, which helps create more accurate SQL queries, especially in complex scenarios....
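
The error correction loop mentioned above can be pictured with a minimal sketch: run a candidate query, and if the database rejects it, hand the error message to a repair step and try again. This is only an illustration under stated assumptions. `run_with_correction` and `toy_repair` are hypothetical names, the repair step is a hard-coded stand-in for a language model call, and the summary does not describe the framework's actual loop.

```python
import sqlite3


def run_with_correction(conn, query, repair, max_attempts=3):
    """Execute `query`; on a database error, ask `repair` for a revised query and retry.

    Returns the query that finally ran together with its result rows.
    """
    for _ in range(max_attempts):
        try:
            return query, conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            # Feed the error text back to the (hypothetical) repair step.
            query = repair(query, str(exc))
    raise RuntimeError("no valid query within the attempt budget")


def toy_repair(query, error_message):
    """Stand-in for a model-driven fix: corrects one known table-name typo."""
    return query.replace("usrs", "users")


# Tiny in-memory database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")

final_query, rows = run_with_correction(conn, "SELECT name FROM usrs", toy_repair)
```

The first attempt fails because the table `usrs` does not exist. The repaired query then runs against `users`.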

The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

Published at 2025-08-30

#ML

Researchers created Camlang, a novel language, to test if large language models can learn it through explicit metalinguistic reasoning. The results show that while GPT-5 performed well in English, it struggled with Camlang, revealing a gap between current models and human metalinguistic competence....

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Published at 2025-08-31

#ML

The authors present FlashAdventure, a new benchmark consisting of 34 Flash-based adventure games that evaluate GUI agents' ability to complete entire storylines. They also introduce CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agent framework that uses long-term memory to improve performance on sequential tasks. Results show that current GUI agents struggle with full story arcs, but COAST makes significant improvements by addressing the observation-behavior gap, highlighting th...

VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

Published at 2025-08-31

#ML

This study presents VerlTool, a new framework that improves upon existing methods for multi-turn tool interactions in reinforcement learning. VerlTool offers a unified and modular approach, enhancing efficiency and enabling easier integration of various tools, resulting in competitive performance across multiple domains....

Benchmarking Optimizers for Large Language Model Pretraining

Published at 2025-09-01

#ML

This study evaluates various optimization techniques for Large Language Models (LLMs) by conducting standardized pretraining scenarios with varying model size, batch size, and training duration. The results offer guidance for practitioners on choosing the best optimizer for specific scenarios and point out potential directions for future optimization research....

Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Published at 2025-09-01

#ML

The study questions whether the high sensitivity of large language models (LLMs) to different prompts is a real flaw or just a result of the way these models are evaluated. By testing 7 LLMs using various benchmarks and prompt templates, the researchers found that the perceived prompt sensitivity often comes from the evaluation methods, which can overlook correct responses that are phrased differently. When using a more advanced evaluation method, the performance of LLMs was more consistent, sug...

Improving Large Vision and Language Models by Learning from a Panel of Peers

Published at 2025-09-01

#ML

The authors propose a new method for training Large Vision and Language Models by having them learn from each other, similar to how students learn from peers in a classroom. This approach improves model performance on various benchmarks without needing expensive human-curated data, increasing the average score from 48% to 57%....

Kwai Keye-VL 1.5 Technical Report

Published at 2025-09-01

#ML

The authors present Keye-VL-1.5, a multimodal large language model that improves video understanding by using a novel Slow-Fast video encoding strategy, a progressive four-stage pre-training methodology, and a comprehensive post-training pipeline. This results in better performance on video tasks and general multimodal benchmarks compared to existing models....

M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision

Published at 2025-09-01

#ML

Researchers created a unified visual encoder called M3Ret for medical image retrieval, which can handle various modalities like X-rays, ultrasounds, videos, and CT scans. By using a large-scale hybrid-modality dataset and self-supervised learning techniques, M3Ret outperforms existing methods and demonstrates strong cross-modal alignment and generalizability to unseen modalities like MRI....

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

Published at 2025-09-01

#ML

The authors have simplified the architecture and loss design of OpenVision to improve its training efficiency, resulting in OpenVision 2. This new version matches the original model's performance on multimodal benchmarks while reducing training time and memory consumption by up to 1.5x and 1.8x, respectively, and allows for scaling beyond 1 billion parameters....

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Published at 2025-09-01

#ML

The authors propose a two-stage framework for creating high-quality document conversion datasets and models without relying on manual annotation or distillation. In the first stage, they generate synthetic data to train a model, and in the second stage, they improve the model using real-world documents, resulting in POINTS-Reader, a model that outperforms many existing models....

Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

Published at 2025-09-01

#ML

This study shows that reasoning skills learned by a large language model can be extracted and transferred to other models as a compact task vector, improving their performance on various reasoning tasks without requiring expensive training....
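
For readers new to task arithmetic, here is a toy sketch of the general idea: a task vector is the element-wise difference between a tuned checkpoint and its base, and it can be added to another model's weights. The dictionary "checkpoints", the parameter name, and the `scale` argument are assumptions for illustration; the summary does not say which checkpoints the paper subtracts or how it applies the result.

```python
def task_vector(tuned, base):
    """Element-wise difference tuned - base for every parameter in `base`."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in base
    }


def apply_task_vector(target, vector, scale=1.0):
    """Add a (scaled) task vector to a target model's parameters."""
    return {
        name: [w + scale * v for w, v in zip(target[name], vector[name])]
        for name in target
    }


# Hypothetical toy checkpoints: parameter name -> flat list of weights.
base = {"layer.weight": [0.0, 1.0, 2.0]}
reasoning_tuned = {"layer.weight": [0.5, 1.0, 1.5]}
other_model = {"layer.weight": [1.0, 1.0, 1.0]}

vec = task_vector(reasoning_tuned, base)
merged = apply_task_vector(other_model, vec)
```

In practice, this kind of arithmetic only makes sense when the models share the same architecture and parameter shapes.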

Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views

Published at 2025-09-01

#ML

This study presents a new approach called Point-PQAE for self-supervised point cloud learning using two-view cross-reconstruction, which outperforms existing single-modal self-reconstruction methods by 6.5%-7.0% in various tests. The method introduces a novel crop mechanism for point cloud view generation and a unique positional encoding to represent 3D relative positions between two decoupled views, making pre-training more challenging and informative....

ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association

Published at 2025-09-01

#ML

ViSTA-SLAM is a real-time visual SLAM system that works with any camera and quickly estimates camera positions and local maps using just two images. It has a unique design that reduces model complexity and improves accuracy, resulting in better camera tracking and 3D reconstructions compared to other methods....

AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models

Published at 2025-09-02

#ML

This study presents AMBEDKAR, a framework designed to minimize caste and religion biases in language models, particularly for the Indian context. The framework uses a Constitution-Aware Decoding Layer guided by the AI Constitution of India, which operates during inference without altering the base model. It combines a small language model with a larger, constitutionally guided model to reduce bias by up to 26.41% compared to the baseline....

Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

Published at 2025-09-02

#ML

The authors present a new method called Genetic Prompt that improves synthetic data generation by combining genetic algorithms with large language models. This approach treats text attributes as genes and uses the language model to create new combinations, enhancing data quality and diversity, and improving downstream model performance, especially in class-imbalanced scenarios....

Baichuan-M2: Scaling Medical Capability with Large Verifier System

Published at 2025-09-02

#ML

The study presents Baichuan-M2, a medical AI model trained using a new interactive reinforcement learning system that creates realistic clinical environments and dynamic evaluation metrics. Baichuan-M2 outperforms other open-source models and most closed-source counterparts on HealthBench, showing the importance of dynamic verification for practical healthcare applications....

DCPO: Dynamic Clipping Policy Optimization

Published at 2025-09-02

#ML

The study presents a new method called Dynamic Clipping Policy Optimization (DCPO) that improves the learning process of large language models by adjusting clipping bounds and standardizing rewards. This approach enhances token-level exploration and response-level utilization, resulting in significant performance improvements on various benchmarks compared to existing methods....

Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

Published at 2025-09-02

#ML

This research presents a new method called VARIN that allows for precise and controllable image editing using visual autoregressive models, a type of AI model for generating images from text. Built on a discrete noise inversion technique, the method lets users make targeted changes to images based on text descriptions while preserving the original background and details....

DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

Published at 2025-09-02

#ML

The authors present dynamic guardian models that allow users to define their own policies for monitoring chatbot outputs, which can detect violations quickly or provide detailed reasoning. These models are as accurate as static models for common harms and can identify custom policy violations faster than other advanced models....

Fantastic Pretraining Optimizers and Where to Find Them

Published at 2025-09-02

#ML

The study investigates ten deep learning optimizers across various model scales and data-to-model ratios, revealing that fair comparisons require rigorous hyperparameter tuning and evaluations. It finds that the actual speedup of optimizers claimed to train faster is lower than expected, especially for larger models. The fastest optimizers use matrices as preconditioners, but their speedup also decreases with model size....

Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

Published at 2025-09-02

#ML

Researchers developed small automatic speech recognition (ASR) models for underrepresented languages, challenging the idea that multilingual models are always better. These new models, named Flavors of Moonshine, significantly outperform larger multilingual models, achieving error rates 48% lower than a similar-sized model and even matching or surpassing a 28x larger model. This enables accurate on-device ASR for languages that previously had limited support....

GenCompositor: Generative Video Compositing with Diffusion Transformer

Published at 2025-09-02

#ML

The authors propose a new method for automating video compositing using generative models, which allows for interactive customization of foreground elements in videos. They introduce a novel Diffusion Transformer pipeline, a lightweight background preservation branch, and a DiT fusion block to maintain consistency and fuse background and foreground videos with different layouts. The proposed method was trained and tested on a curated dataset of 61K video sets, demonstrating improved fidelity and...

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Published at 2025-09-02

#ML

The authors present a new framework called PACS that improves the training of large language models on reasoning tasks by using a supervised learning approach to address challenges in the RLVR paradigm, resulting in better performance compared to existing methods....

Jointly Reinforcing Diversity and Quality in Language Model Generations

Published at 2025-09-02

#ML

The study presents DARLING, a framework that balances quality and diversity in language model generations. DARLING measures diversity beyond surface-level variations and combines it with quality rewards during online reinforcement learning, resulting in high-quality and distinct outputs for both creative and problem-solving tasks....

MedDINOv3: How to adapt vision foundation models for medical image segmentation?

Published at 2025-09-02

#ML

The study presents MedDINOv3, a framework that improves the performance of vision foundation models for medical image segmentation. It addresses challenges such as ViT underperformance and domain gap between natural and medical images, resulting in state-of-the-art performance across various segmentation benchmarks....

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Published at 2025-09-02

#ML

The study presents SimpleTIR, a new method that improves the training of large language models interacting with tools for multi-turn reasoning. SimpleTIR addresses training instability by filtering out problematic turns, leading to better performance and more diverse reasoning patterns in math reasoning benchmarks....

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Published at 2025-09-02

#ML

This survey explores the shift from traditional reinforcement learning in large language models to agentic reinforcement learning, where models become autonomous decision-makers in complex environments. It presents a two-part taxonomy of agentic capabilities and applications, emphasizing reinforcement learning's role in creating adaptive behavior, and provides a comprehensive resource for future research in this field....

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Published at 2025-09-02

#ML

This technical report presents UI-TARS-2, an improved native GUI-centered agent model that overcomes challenges in data scalability, multi-turn reinforcement learning, and environment stability. UI-TARS-2 outperforms previous models and strong baselines on various benchmarks, demonstrating its potential to advance the field of GUI agents and generalize to real-world interactive scenarios....

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
