🤗 Daily Paper (2025-10-29)


deep.di...@gmail.com

Oct 29, 2025, 4:07:28 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers selected by 🤗 Daily Papers.

project page
🤗 daily paper

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

Published at 2025-10-20

#ML

The study presents FALCON, a new approach that improves vision-language-action models by incorporating detailed 3D spatial information, which strengthens spatial understanding and adaptability. FALCON uses spatial foundation models to provide strong geometric priors and can optionally integrate depth or pose data without requiring any changes to its architecture, resulting in state-of-the-art performance across various simulations and real-world tasks....

Read More

PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding

Published at 2025-10-22

#ML

The authors present a new dataset called PartNeXt, which includes over 23,000 high-quality, textured 3D models annotated with detailed part labels. This dataset is designed to improve the understanding of objects at a granular level, enabling better performance in tasks like part segmentation and question answering, and it outperforms existing datasets in these areas....

Read More

UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset

Published at 2025-10-23

#ML

The authors present UltraHR-100K, a large, high-quality dataset of 100,000 ultra-high-resolution images designed to improve text-to-image generation. They also introduce a new training method that enhances the generation of fine details in these images, resulting in more realistic and higher-quality images....

Read More

ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

Published at 2025-10-24

#ML

The researchers conducted a large-scale study of multilingual scaling laws, introducing the Adaptive Transfer Scaling Law (ATLAS) to improve performance across languages. Their findings clarify multilingual learning dynamics and cross-lingual transfer properties, and help optimize model scaling for better efficiency in AI applications beyond English....
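
(For orientation: scaling-law studies of this kind typically fit a power-law loss surface in model size and data size. The form below is a generic Chinchilla-style parameterization shown only for illustration; ATLAS's actual formulation, which additionally models cross-lingual transfer, is given in the paper.)

    % Generic Chinchilla-style scaling law (illustrative only; not ATLAS's formula)
    % N = parameter count, D = training tokens, E, A, B, \alpha, \beta = fitted constants
    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}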

Read More

Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

Published at 2025-10-24

#ML

This study presents RECAP, a new strategy for preserving general knowledge in large reasoning models during training, addressing the issue of capability regression in reinforcement learning with verifiable rewards. Experiments show that RECAP effectively maintains general capabilities and improves reasoning by allowing better trade-offs among rewards....

Read More

Generalization or Memorization: Dynamic Decoding for Mode Steering

Published at 2025-10-24

#ML

This study presents a new method to control the reasoning modes of large language models, which can either generalize well or memorize data verbatim. The proposed framework, Dynamic Mode Steering, uses a lightweight algorithm to identify reliance on memorization and steers the model towards generalization, improving logical consistency and factual accuracy....
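
(For intuition, one common way to steer a model between behavioral modes is to add a direction vector to its hidden states at inference time, scaled by a probe's estimate of the unwanted mode. The sketch below is a generic activation-steering illustration under that assumption; the function name, the probe, and the coefficient are made up here, and the paper's Dynamic Mode Steering algorithm may differ.)

    import torch

    def steer_hidden_state(hidden, generalization_dir, memorization_score, alpha=4.0):
        # Generic activation-steering sketch (illustrative assumption, not the
        # paper's method): nudge hidden states toward a "generalization"
        # direction in proportion to how strongly a probe detects memorization.
        direction = generalization_dir / generalization_dir.norm()
        return hidden + alpha * memorization_score * direction

    # Hypothetical usage inside a forward hook on one transformer layer:
    # score = torch.sigmoid(hidden @ probe_weights)   # estimated reliance on memorization
    # hidden = steer_hidden_state(hidden, v_gen, score)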

Read More

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Published at 2025-10-24

#ML

The study presents VL-SAE, a method that makes it easier to interpret and improve the alignment between visual and language representations in AI models. It does this by linking each concept to specific neuron activations, allowing for better interpretation and enhancement of the model's performance in tasks like image classification and reducing hallucinations....

Read More

VisCoder2: Building Multi-Language Visualization Coding Agents

Published at 2025-10-24

#ML

The study presents three resources to improve visualization coding agents: a large dataset of validated visualization samples in 12 programming languages, a benchmark for evaluating these agents, and a new family of multi-language visualization models called VisCoder2. VisCoder2 outperforms existing models and approaches the performance of advanced models like GPT-4.1, with improvements in iterative self-debugging....

Read More

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Published at 2025-10-25

#ML

The study presents GRPO-Guard, a method that improves GRPO-based reinforcement learning for flow-matching models. It addresses the issue of implicit over-optimization by restoring a balanced importance ratio and equalizing policy gradients, leading to better model performance without relying on heavy KL regularization....
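
(For context, GRPO-style objectives clip a per-token importance ratio between the new and old policy; the snippet below sketches only that standard clipped surrogate, as a baseline for what GRPO-Guard regulates. The regulated clipping and ratio rebalancing themselves are described in the paper and are not reproduced here.)

    import torch

    def clipped_policy_loss(logp_new, logp_old, advantages, eps=0.2):
        # Standard PPO/GRPO-style clipped surrogate loss (illustrative baseline;
        # GRPO-Guard changes how this ratio is normalized and clipped).
        ratio = torch.exp(logp_new - logp_old)              # importance ratio
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()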

Read More

PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

Published at 2025-10-25

#ML

The authors present PatenTEB, a detailed benchmark for patent text embeddings with 15 tasks and over 2 million examples, addressing patent-specific challenges not covered in general embedding benchmarks. They also propose the patembed model family, trained for various tasks, which demonstrates superior generalization in external validation, with the base version achieving state-of-the-art results in a clustering task....

Read More

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Published at 2025-10-25

#ML

The researchers created VisJudge-Bench, a comprehensive benchmark for evaluating AI models' ability to assess the aesthetics and quality of visualizations, which consists of 3,090 expert-annotated samples. They found that advanced AI models like GPT-5 still struggle to match human experts in this task and introduced VisJudge, a new model that significantly improves AI's performance in evaluating visualizations....

Read More

MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion

Published at 2025-10-26

#ML

This study presents a framework called MMPersuade to analyze how susceptible large vision-language models are to persuasive multimodal content. The researchers found that multimodal content significantly increases persuasion effectiveness, especially in misinformation scenarios, and that different persuasion strategies work better in different contexts....

Read More

Optimize Any Topology: A Foundation Model for Shape- and Resolution-Free Structural Topology Optimization

Published at 2025-10-26

#ML

The authors present a new framework called Optimize Any Topology (OAT) that can quickly and accurately design structures for various engineering applications, even in complex shapes and sizes. OAT uses advanced machine learning techniques and a large dataset of optimized structures to predict the best design for specific requirements, significantly outperforming existing methods and enabling fast, resolution-free design optimization....

Read More

SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Published at 2025-10-26

#ML

The study presents SAO-Instruct, a model that can edit any audio clip using free-form natural language instructions, a first, as existing methods either require detailed descriptions or can only follow predefined instructions. The model is trained on a newly created dataset and performs well on both synthetic and real-world audio clips....

Read More

Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

Published at 2025-10-27

#ML

The study introduces Game-TARS, a versatile game agent that uses a unified action space for training, allowing it to learn from various games and platforms. By pre-training on over 500 billion tokens, Game-TARS outperforms previous models and large language models in different game tasks, showing that using a simple and scalable action representation with extensive pre-training can lead to agents with diverse computer skills....

Read More

Latent Chain-of-Thought for Visual Reasoning

Published at 2025-10-27

#ML

The authors propose a new method to improve the interpretability and reliability of Large Vision-Language Models by reformulating reasoning as posterior inference and introducing a sparse reward function for token-level learning signals. This method encourages diverse, high-likelihood latent reasoning and avoids reward hacking, resulting in enhanced performance on seven reasoning benchmarks....

Read More

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

Published at 2025-10-27

#ML

This study presents RoboOmni, a new framework that allows robots to understand and act on human intentions through spoken language, environmental sounds, and visual cues, rather than explicit instructions. The framework, which is based on end-to-end omni-modal Large Language Models, can recognize intentions, confirm interactions, and execute actions, and it has been trained using a large dataset called OmniAction, which includes various types of real-world interactions....

Read More

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Published at 2025-10-28

#ML

The authors present the Agent Data Protocol (ADP), a simple and expressive language that unifies diverse agent training datasets. By converting 13 existing datasets into ADP format, they achieved an average performance gain of 20% over base models on various benchmarks, demonstrating the potential of ADP to lower barriers in agent training....
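
(To make "unifying datasets" concrete, a record in such a protocol might pair a task description with a normalized sequence of agent actions and observations. The record below is purely hypothetical, sketched only to illustrate the idea; ADP's actual schema is defined in the paper and its repository.)

    # Hypothetical unified agent-trajectory record (illustration only, not ADP's real schema).
    example_record = {
        "task": "Find the release year of the library's 2.0 version.",
        "source_dataset": "some-web-agent-dataset",  # hypothetical source name
        "steps": [
            {"role": "agent", "type": "action", "name": "search",
             "arguments": {"query": "library 2.0 release year"}},
            {"role": "environment", "type": "observation",
             "content": "Version 2.0 was released in ..."},
            {"role": "agent", "type": "message",
             "content": "Version 2.0 was released in ..."},
        ],
    }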

Read More

AgentFold: Long-Horizon Web Agents with Proactive Context Management

Published at 2025-10-28

#ML

The study presents AgentFold, a new approach to managing context in long-horizon web agent tasks. It proactively manages the agent's historical trajectory, allowing it to preserve important details or abstract away completed sub-tasks, leading to improved performance on benchmark tasks compared to larger models....

Read More

AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis

Published at 2025-10-28

#ML

The study presents a new method called the AgentFrontier Engine, which synthesizes training data for large language models (LLMs) tailored to their current abilities, inspired by the Zone of Proximal Development theory. This approach helps LLMs learn complex tasks and reasoning skills more effectively, leading to improved performance on challenging benchmarks....

Read More

Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

Published at 2025-10-28

#ML

The study presents Critique-RL, a new method for training language models to critique and improve other models' outputs using two-stage reinforcement learning, without relying on strong supervision. This approach significantly enhances the performance of language models on various tasks, with gains of up to 9.02% on in-domain tasks for Qwen2.5-7B....

Read More

FunReason-MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling

Published at 2025-10-28

#ML

The authors present FunReason-MT, a new method to generate high-quality, multi-turn training data for language models and autonomous agents to interact with external tools. This framework addresses practical challenges in existing data synthesis methods and demonstrates superior performance on a popular benchmark for function calling tasks....

Read More

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Published at 2025-10-28

#ML

Researchers from 65 countries created Global PIQA, a commonsense reasoning test for over 100 languages, covering five continents, 14 language families, and 23 writing systems. The test reveals that while advanced language models perform well overall, they struggle with lower-resource languages and everyday knowledge in many cultures....

Read More

Group Relative Attention Guidance for Image Editing

Published at 2025-10-28

#ML

The study explores a mechanism in an image editing model and proposes a new method called Group Relative Attention Guidance. This method allows for more precise control over the intensity of image editing, outperforming existing methods, and can be easily integrated into existing frameworks....

Read More

InteractComp: Evaluating Search Agents With Ambiguous Queries

Published at 2025-10-28

#ML

The study presents InteractComp, a benchmark that evaluates search agents' ability to identify and resolve ambiguous queries through interaction, as most existing agents and benchmarks fail to do so. The results show that forced interaction significantly improves performance, highlighting a critical area for improvement in search agents....

Read More

Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

Published at 2025-10-28

#ML

The Latent Sketchpad framework enhances Multimodal Large Language Models by adding an internal visual scratchpad, enabling them to think and reason visually while maintaining textual reasoning ability. This innovation allows for richer human-computer interaction and broader applications, as demonstrated by its comparable or superior reasoning performance across various MLLMs....

Read More

OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

Published at 2025-10-28

#ML

The researchers created OSWorld-MCP, a new and fair way to test computer-use agents by evaluating their ability to use tools and interact with a computer's graphical user interface. They found that agents that can use tools perform better in tasks, but even the best agents still struggle with using tools, which shows that there's room for improvement in this area....

Read More

ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking

Published at 2025-10-28

#ML

ParallelMuse is a new method that improves problem-solving for information-seeking agents by combining parallel thinking with deep exploration. It does this in two stages: improving exploration efficiency and compressing reasoning to create coherent answers, resulting in better performance and less resource consumption....

Read More

ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

Published at 2025-10-28

#ML

The authors present ReplicationBench, a framework for testing AI agents' ability to replicate astrophysics research papers. Their evaluation reveals that current language models struggle with this task, and it identifies various failure modes, providing insight into the reliability of AI agents in scientific research....

Read More

Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Published at 2025-10-28

#ML

This study presents a new method called E-GRPO that improves the training of LLM-based search agents by using entity information, which previous methods ignored. The results show that E-GRPO outperforms the existing method in accuracy and efficiency, making it a better approach for training search agents....
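
(As a rough illustration of what an entity-aware training signal could look like, one might add partial credit for gold entities the agent recovers along the way. The helper below is hypothetical and only gestures at that idea; it is not E-GRPO's actual reward design.)

    def entity_aware_reward(answer_correct, found_entities, gold_entities, bonus_weight=0.5):
        # Hypothetical sketch: combine the usual answer-level reward with a
        # partial-credit term for intermediate entities the agent recovered.
        # E-GRPO's actual formulation may differ.
        answer_reward = 1.0 if answer_correct else 0.0
        if not gold_entities:
            return answer_reward
        entity_recall = len(set(found_entities) & set(gold_entities)) / len(gold_entities)
        return answer_reward + bonus_weight * entity_recall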

Read More

Rethinking Visual Intelligence: Insights from Video Pretraining

Published at 2025-10-28

#ML

This study explores the potential of Video Diffusion Models (VDMs) in improving visual intelligence by pretraining on spatiotemporal data, which helps them understand structure and dynamics. The research shows that VDMs adapt more efficiently to various visual tasks compared to language models, suggesting that video pretraining could lead to more advanced visual foundation models....

Read More

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Published at 2025-10-28

#ML

The study presents ProMoE, a new Mixture-of-Experts framework that improves Diffusion Transformers' performance on vision tasks. ProMoE uses a two-step router with explicit routing guidance to better specialize experts, yielding superior image generation results on the ImageNet benchmark compared to existing methods....

Read More

SPICE: Self-Play In Corpus Environments Improves Reasoning

Published at 2025-10-28

#ML

The study presents SPICE, a self-improving system that uses a single model to create challenging reasoning tasks and solve them, leading to consistent gains in mathematical and general reasoning benchmarks. Document grounding is crucial for SPICE to generate and achieve increasingly difficult goals, enabling sustained improvement....

Read More

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

Published at 2025-10-28

#ML

The authors present STAR-Bench, a new benchmark for testing audio 4D intelligence, which focuses on reasoning about sound dynamics in time and 3D space. They show that STAR-Bench exposes large gaps between current models and humans in fine-grained perceptual reasoning and can help guide future improvements in this area....

Read More

Tongyi DeepResearch Technical Report

Published at 2025-10-28

#ML

The researchers created Tongyi DeepResearch, a large language model designed for complex, long-term research tasks. It uses a unique training method and a scalable data pipeline to achieve top performance on various benchmarks, and the model is now open-source for the community to use....

Read More

Uniform Discrete Diffusion with Metric Path for Video Generation

Published at 2025-10-28

#ML

The researchers developed a new framework called URSA that significantly improves discrete video generation methods, making them more efficient and comparable to continuous approaches. URSA uses two key designs, Linearized Metric Path and Resolution-dependent Timestep Shifting, to enable high-resolution image synthesis and long-duration video generation with fewer inference steps. It also includes an asynchronous temporal fine-tuning strategy for versatile tasks, outperforming existing discrete ...

Read More

WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking

Published at 2025-10-28

#ML

The proposed WebLeaper framework addresses the inefficiency of Large Language Model-based agents in information seeking by constructing high-coverage tasks and generating efficient solution trajectories, resulting in improved effectiveness and efficiency on various benchmarks....

Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, which is derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Facebook · X · LinkedIn