🤗 Daily Paper(2025-08-12)

deep.di...@gmail.com

Aug 12, 2025, 4:07:21 PM

to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

UserBench: An Interactive Gym Environment for User-Centric Agents

Published at 2025-07-29

#ML

The authors present UserBench, a platform that tests AI agents' ability to work closely with users by understanding vague goals, adapting to changing preferences, and making informed decisions with tools. The study shows that current AI models struggle to fully align with user intents and to uncover user preferences, underscoring the need for more advanced AI collaboration capabilities.

TextQuests: How Good are LLMs at Text-Based Video Games?

Published at 2025-07-31

#ML

The authors present TextQuests, a benchmark for evaluating AI agents' long-term reasoning skills in text-based games, which are complex and require sustained, self-directed problem-solving without external tools. The benchmark aims to measure an AI's ability to learn and make decisions independently in a challenging, exploratory environment.

Compressing Chain-of-Thought in LLMs via Step Entropy

Published at 2025-08-05

#ML

This study presents a new method that makes complex reasoning by Large Language Models (LLMs) more efficient by identifying and removing redundant steps in the reasoning process, which significantly reduces inference costs without sacrificing accuracy.

Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

Published at 2025-08-05

#ML

This study presents a large-scale open-source dataset with over 66,000 human-annotated audio samples of mathematical equations and sentences in English and Russian, aimed at improving the conversion of spoken mathematics into LaTeX. The authors introduce new models and methods, achieving better results than existing ones in converting equations and establishing a new benchmark for mathematical sentence recognition.

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Published at 2025-08-05

#ML

The researchers created WhisperInject, a method that can trick advanced audio-language models into generating harmful content. It adds tiny, imperceptible changes to harmless audio inputs, such as weather updates or greetings, and bypasses safety measures with a success rate of over 86%.

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Published at 2025-08-07

#ML

The study presents Bifrost-1, a framework that combines multimodal large language models and diffusion models using CLIP image embeddings. This approach allows for high-quality, controllable image generation while maintaining the models' reasoning abilities and reducing training costs.

MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Published at 2025-08-07

#ML

The paper presents a new method called MoBE that reduces the size of large language models with the Mixture-of-Experts architecture, using a special decomposition to minimize accuracy loss. Experiments show that MoBE significantly outperforms previous methods, reducing parameter counts by 24%-30% with only a 1%-2% accuracy drop.

OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Published at 2025-08-07

#ML

The OmniEAR framework evaluates language models' ability to reason about physical interactions, tools, and multi-agent coordination in embodied tasks. It reveals significant performance drops when models must reason from constraints and exposes limitations in current AI architectures for embodied reasoning.

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Published at 2025-08-07

#ML

SONAR-LLM is a new transformer model that uses a hybrid approach to generate text by thinking in sentence embeddings and speaking in tokens, offering high-quality generation while maintaining a likelihood-based training signal and eliminating the need for a diffusion sampler.

BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Published at 2025-08-08

#ML

The study presents BrowseComp-Plus, a new benchmark for evaluating deep-research agents, which addresses limitations in fairness and transparency of current benchmarks. By using a fixed, curated corpus with human-verified documents and challenging negatives, BrowseComp-Plus enables controlled experimentation, allowing for a better understanding of the capabilities of deep-research language models and retrieval methods.

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Published at 2025-08-08

#ML

This study explores how removing certain topics from the training data of open-weight AI systems can make them less susceptible to tampering attacks. The researchers developed a pipeline for filtering text and trained several models, finding that they were significantly more resistant to attacks compared to existing methods, without losing other capabilities. However, they also discovered that these models could still access dangerous information if provided in context, suggesting that multiple ...

Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System

Published at 2025-08-08

#ML

The research presents Fact2Fiction, a new attack framework that targets advanced fact-checking systems. This framework exploits the system's decomposition strategy and justifications to create tailored malicious evidence, which compromises sub-claim verification, leading to higher attack success rates compared to existing attacks.

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

Published at 2025-08-08

#ML

This study explores why generalist robot policies struggle to generalize beyond their training data and finds that relying on irrelevant features (shortcut learning) is the main issue. The research identifies two main causes: limited diversity within sub-datasets and significant differences between sub-datasets (fragmentation), which are common in large-scale datasets. The findings suggest strategies to improve dataset collection and use data augmentation to reduce shortcut learning and enhance ...

Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Published at 2025-08-08

#ML

The study presents a new method for Self-Rewarding Language Models that addresses the narrowing gap between chosen and rejected responses by using past and future model generations to maintain learning signals. This approach significantly improves performance on various tasks and outperforms previous Self-Rewarding methods, even without specific training data for those tasks.

Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Published at 2025-08-09

#ML

This study presents a new method called LessIsMore, which reduces the computational overhead of large reasoning models without compromising accuracy. LessIsMore is a training-free sparse attention mechanism that improves generalization and efficiency by aggregating token selections from local attention heads and enabling unified cross-head token ranking, resulting in faster processing while attending to fewer tokens than existing methods.

ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability

Published at 2025-08-09

#ML

The study presents ReasonRank, a passage ranking model with enhanced reasoning abilities. It uses an automated data synthesis framework and a two-stage post-training approach, resulting in superior performance and lower latency compared to existing baselines, achieving state-of-the-art results on the BRIGHT leaderboard.

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

Published at 2025-08-10

#ML

The study offers a detailed review of self-evolving AI agents, which, unlike static agent systems, can adapt to changing environments. It presents a unified framework for understanding and comparing self-evolving techniques, and discusses domain-specific strategies and the importance of evaluation, safety, and ethics in developing these advanced, autonomous systems.

VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

Published at 2025-08-10

#ML

The authors present VisR-Bench, a multilingual benchmark for question-driven multimodal retrieval in lengthy documents. It includes over 35K QA pairs in 16 languages and three question types, enabling a thorough evaluation of multimodal retrieval. The authors test various retrieval models and find that MLLMs perform best but still face challenges with structured tables and low-resource languages.

Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Published at 2025-08-11

#ML

The study presents a new method called Follow-Your-Shape that allows for precise and controllable editing of object shapes in images without affecting other areas. This technique uses a Trajectory Divergence Map to identify editable regions and ensures stable editing through a Scheduled KV Injection mechanism. The researchers also introduce a new benchmark, ReShapeBench, to evaluate the effectiveness of their method, which outperforms existing models in large-scale shape replacement tasks.

GLiClass: Generalist Lightweight Model for Sequence Classification Tasks

Published at 2025-08-11

#ML

The authors present GLiClass, a new method based on the GLiNER architecture for sequence classification tasks, which offers high accuracy and efficiency similar to embedding-based methods while being adaptable for zero-shot and few-shot learning. They also improved PPO for multi-label text classification, which allows classifiers to be trained in data-sparse conditions or with human feedback.

Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Published at 2025-08-11

#ML

The authors present Grove MoE, a new architecture for large language models that improves computational efficiency by using experts of varying sizes and a dynamic activation mechanism, resulting in models that perform as well as state-of-the-art models with fewer parameters.

Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Published at 2025-08-11

#ML

The paper introduces Klear-Reasoner, a model with strong reasoning abilities that deliberates carefully on problems and outperforms many existing models. The paper provides a detailed analysis of the model's training process, including data preparation, fine-tuning, and reinforcement learning, and proposes a new Gradient-Preserving clipping Policy Optimization method to improve the model's exploration and learning efficiency.

MolmoAct: Action Reasoning Models that can Reason in Space

Published at 2025-08-11

#ML

The MolmoAct model is a new type of vision-language-action model that can reason about space to perform tasks more adaptively and generally than existing robotic models. It outperforms other models in various simulations and real-world settings, and its performance improves further with a newly released dataset. The model and its training code, along with the dataset, are made publicly available.

Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Published at 2025-08-11

#ML

The study presents a new framework called Omni-Effects that generates multiple visual effects at specified locations, overcoming the limitation of current methods that can only generate single effects. This is achieved through two innovations: a method to integrate diverse effects without interference and a technique to control the spatial location of each effect. The authors also contribute a new dataset and evaluation framework for visual effects.

Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Published at 2025-08-11

#ML

The study reviews and reproduces popular reinforcement learning techniques for large language models, analyzing their mechanisms and providing guidelines for choosing the right technique. The authors find that a simple combination of two methods can significantly improve performance, outperforming other strategies.

Reinforcement Learning in Vision: A Survey

Published at 2025-08-11

#ML

This survey explores the field of visual RL, discussing its evolution, key strategies, and major themes like multi-modal language models and unified frameworks. It also covers evaluation methods and highlights challenges such as sample efficiency and safe deployment, aiming to guide future research in this rapidly growing area.

WideSearch: Benchmarking Agentic Broad Info-Seeking

Published at 2025-08-11

#ML

The authors present WideSearch, a benchmark for evaluating how well automated search agents powered by Large Language Models conduct wide-scale information gathering tasks. The benchmark consists of 200 manually curated questions across 15 diverse domains, and most state-of-the-art search systems tested on it failed to perform well, highlighting the need for further research and development in this area.

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model derived from SOLAR-10.7B open LLM.

(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
