🤗 Daily Paper (2025-11-04)


deep.di...@gmail.com

Nov 4, 2025, 3:07:49 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

Published at 2025-10-24

#ML

Ling 2.0 is a new series of reasoning-focused language models that can scale from 16 billion to 1 trillion parameters using a unified Mixture-of-Experts approach. It improves efficiency and reasoning ability through sparsity, cross-scale consistency, and innovative techniques like reinforcement-based fine-tuning, making it a promising foundation for future AI advancements....

Read More
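The entry above hinges on sparse Mixture-of-Experts routing, where each token activates only a few experts so compute stays modest even as total parameters grow. Below is a minimal, self-contained sketch of top-k MoE routing; the layer sizes, k=2 routing, and expert shapes are illustrative assumptions, not Ling 2.0's actual design.

```python
# Toy top-k Mixture-of-Experts layer: each token is routed to only
# k of n_experts feed-forward blocks, so per-token compute is sparse.
# Purely illustrative; not the Ling 2.0 architecture.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)          # mix the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e            # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out

out = TinyMoE()(torch.randn(16, 64))               # 16 tokens, d_model=64
```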

MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

Published at 2025-10-27

#ML

This study presents a new framework called MR-ALIGN that improves the factual accuracy of large reasoning models by focusing on their reasoning process. MR-ALIGN enhances the model's ability to incorporate correct facts into its final response by analyzing and reshaping its thinking process, leading to more coherent reasoning and fewer errors in factual questions....

Read More

The Underappreciated Power of Vision Models for Graph Structural Understanding

Published at 2025-10-27

#ML

This study explores how vision models understand graph structure in ways that complement Graph Neural Networks (GNNs). The researchers created a new benchmark to test models' understanding of global graph properties, finding that vision models excel at tasks requiring holistic understanding and at scale-invariant reasoning, while GNNs struggle with global pattern abstraction....

Read More

World Simulation with Video Foundation Models for Physical AI

Published at 2025-10-28

#ML

The Cosmos-Predict2.5 model is introduced, which can generate text, images, and videos in a single model, using video clips for training and improving video quality and instruction alignment. It also enables reliable synthetic data generation and policy evaluation for robotics and autonomous systems, and is released as open-source software for Physical AI research and deployment....

Read More

Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Published at 2025-10-29

#ML

The research presents a new method for optimizing the use of computation in large language models during inference, focusing on finding the best combination of models and their arrangements for specific tasks. The proposed solution, Agent-REINFORCE, is an LLM-agent-augmented framework that improves upon traditional methods by being more efficient and effective in identifying optimal model configurations, as demonstrated through experiments....

Read More

Data-Efficient RLVR via Off-Policy Influence Guidance

Published at 2025-10-30

#ML

This study presents a new method for selecting data in Reinforcement Learning with Verifiable Rewards, which uses influence functions to estimate the importance of each data point. The proposed approach, called CROPI, uses off-policy influence estimation and sparse random projection to efficiently select the most influential data for training large language models, resulting in significant acceleration of the training process....

Read More
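For readers curious what "off-policy influence estimation with sparse random projection" might look like mechanically, here is a minimal numpy sketch: per-example gradients are projected once into a low-dimensional space, then scored against a validation gradient by dot product. All names and shapes are invented for illustration; this is the generic influence-function recipe, not the CROPI implementation.

```python
# Influence-guided data selection sketch: score each training example
# by the alignment of its (projected) gradient with a validation
# gradient, then keep the top-k. Shapes are made up.
import numpy as np

rng = np.random.default_rng(0)
n_examples, grad_dim, proj_dim = 1000, 10_000, 256

# Stand-ins for per-example policy gradients (in practice, from backprop)
# and a reference gradient from a held-out validation batch.
grads = rng.normal(size=(n_examples, grad_dim))
val_grad = rng.normal(size=grad_dim)

# Sparse random projection: mostly zeros, remaining entries +/- scaled,
# so dot products in the small space approximate those in the full space.
density = 0.01
mask = rng.random((grad_dim, proj_dim)) < density
signs = rng.choice([-1.0, 1.0], size=(grad_dim, proj_dim))
proj = mask * signs / np.sqrt(density * proj_dim)

g_low = grads @ proj                  # project every example once
v_low = val_grad @ proj
influence = g_low @ v_low             # approximate influence scores

top_k = np.argsort(influence)[::-1][:100]   # most influential examples
```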

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Published at 2025-10-30

#ML

The authors present MeasureBench, a benchmark for testing visual measurement reading in vision-language models using real and synthesized images. They find that even advanced models struggle with this task, often misidentifying important positions, and suggest that this highlights a limitation in spatial understanding for these models....

Read More

NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Published at 2025-10-30

#ML

This study presents NaviTrace, a new benchmark for evaluating the navigation abilities of vision-language models in robotic systems. NaviTrace assesses models' performance in real-world scenarios using a semantic-aware trace score, revealing gaps in spatial grounding and goal localization compared to human performance....

Read More

PHUMA: Physically-Grounded Humanoid Locomotion Dataset

Published at 2025-10-30

#ML

The researchers created a new dataset called PHUMA that contains large-scale, physically reliable humanoid motion data. This dataset improves upon existing ones by addressing physical artifacts and enabling better motion imitation, especially for diverse movements....

Read More

EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities

Published at 2025-10-31

#ML

The authors present EBT-Policy, a new energy-based architecture that effectively tackles key problems in robotics and real-world applications. EBT-Policy outperforms existing diffusion-based policies in both simulated and real-world tasks, requiring less computation and converging faster, while also demonstrating new capabilities like zero-shot recovery from failed actions....

Read More
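Energy-based policies of this general flavor score observation-action pairs with a scalar energy and refine the action by descending that energy at inference time. A toy sketch follows; the network, dimensions, and inner-loop schedule are assumptions for illustration, unrelated to EBT-Policy's actual architecture.

```python
# Toy energy-based action selection: start from a candidate action and
# run gradient descent on a learned scalar energy E(obs, action).
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(10 + 4, 64), nn.Tanh(), nn.Linear(64, 1))

obs = torch.randn(1, 10)                       # observation (assumed dim 10)
action = torch.zeros(1, 4, requires_grad=True) # action to refine (dim 4)
opt = torch.optim.SGD([action], lr=0.1)

for _ in range(50):                            # inner-loop refinement
    opt.zero_grad()
    e = energy(torch.cat([obs, action], dim=-1)).sum()
    e.backward()                               # gradient w.r.t. the action
    opt.step()

# `action` now approximates a local energy minimum for this observation.
```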

LongCat-Flash-Omni Technical Report

Published at 2025-10-31

#ML

The researchers created an advanced, open-source model called LongCat-Flash-Omni, which processes audio and visual information in real time despite its large parameter count, thanks to a tailored training strategy and efficient modules. They also developed a new training system to manage the model's complexity, enabling it to perform well on various tasks and set new standards for open-source models....

Read More

ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Published at 2025-10-31

#ML

The authors present ToolScope, a framework that enhances multimodal large language models' ability to use external tools for reasoning by combining global planning with local perception. ToolScope, which includes a global navigator, an agentic executor, and a response synthesizer, improves performance on various VQA benchmarks by up to 6.69%....

Read More

Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Published at 2025-10-31

#ML

The authors propose a new framework for universal video retrieval by introducing the Universal Video Retrieval Benchmark, a suite of 16 datasets to evaluate and diagnose capability gaps. They then use this benchmark to create a large dataset of high-quality video pairs and train a General Video Embedder model, which outperforms existing methods in zero-shot generalization....

Read More
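Embedding models like the one described are commonly trained with an in-batch contrastive (InfoNCE) objective. The sketch below shows that generic loss, not the paper's specific pyramid-curriculum recipe; the embedding dimension and temperature are arbitrary.

```python
# Generic InfoNCE loss: each query should score highest against its own
# positive, with the rest of the batch serving as negatives.
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, temperature=0.05):
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```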

OpenSIR: Open-Ended Self-Improving Reasoner

Published at 2025-11-01

#ML

The study presents a new framework called OpenSIR that allows language models to improve their reasoning skills by creating and solving their own problems, without relying on external supervision. This method leads to significant improvements in mathematical problem-solving abilities and enables the models to progress from basic to advanced mathematics on their own....

Read More

UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

Published at 2025-11-01

#ML

This study presents UME-R1, a new framework for multimodal embeddings that uses a two-step training process to improve reasoning and generation capabilities. The results show that UME-R1 significantly outperforms traditional methods and has the potential for more interpretable and scalable generative embeddings....

Read More

ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Published at 2025-11-02

#ML

This study presents ROVER, a benchmark for evaluating the ability of unified multimodal models to use one type of data to improve the understanding or generation of another. Experiments show that models which can integrate information from different sources perform better at generating images or text, and that they struggle with tasks requiring abstract thinking....

Read More

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Published at 2025-11-03

#ML

The authors propose a new task called Viewpoint Learning and a large dataset called Viewpoint-100K to test and enhance the 3D reasoning abilities of multimodal large language models. They use a two-step training process and a hybrid initialization method to improve the models' understanding of spatial relationships, leading to better performance in various reasoning tasks....

Read More

How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment

Published at 2025-11-03

#ML

This study introduces SurgVeo, a benchmark for evaluating video generation models in surgery, and the Surgical Plausibility Pyramid (SPP), a framework to assess model outputs. The study finds that while a model can create visually convincing surgical videos, it lacks the understanding of the complex causal relationships required for accurate surgical procedures....

Read More

MotionStream: Real-Time Video Generation with Interactive Motion Controls

Published at 2025-11-03

#ML

The researchers developed MotionStream, a method that allows for real-time video generation with motion controls, overcoming the limitations of existing methods that have long latency and non-causal processing. By using a text-to-video model with motion control and a causal student, MotionStream achieves sub-second latency and up to 29 frames per second on a single GPU, enabling users to interactively control video generation in real-time....

Read More

Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement

Published at 2025-11-03

#ML

This study introduces a new method to examine how Large Language Models use external information and their own knowledge when making decisions. By analyzing multiple steps of the decision-making process, the researchers found that models often rely on their own knowledge for incorrect answers, while correct answers use a balance of both external and internal knowledge. The study offers a new framework for understanding how models use information, which could help improve their accuracy and reliability....

Read More
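As a rough picture of what a rank-2 subspace decomposition means here: a hidden state is split into coordinates along two directions (say, one for external context and one for parametric memory) plus a residual. In the sketch below the two directions are random stand-ins, not the paper's learned ones.

```python
# Decompose a hidden state into a rank-2 subspace spanned by a
# "context" direction and a "memory" direction, plus a residual.
import numpy as np

rng = np.random.default_rng(1)
d = 768                                   # assumed hidden size
context_dir = rng.normal(size=d)          # stand-in directions
memory_dir = rng.normal(size=d)
B = np.stack([context_dir, memory_dir], axis=1)   # (d, 2) basis

h = rng.normal(size=d)                    # a hidden state to analyze

coeffs, *_ = np.linalg.lstsq(B, h, rcond=None)    # least-squares coords
h_in = B @ coeffs                         # component inside the subspace
h_out = h - h_in                          # residual outside it
print(coeffs)                             # relative reliance: context vs. memory
```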

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

Published at 2025-11-03

#ML

The study presents TIR-Bench, a new and extensive testing framework designed to assess advanced image reasoning abilities in artificial intelligence models. This benchmark evaluates models' capacity to use tools and think through complex image manipulation tasks, revealing that strong performance requires sophisticated reasoning skills....

Read More

Towards Robust Mathematical Reasoning

Published at 2025-11-03

#ML

The authors created IMO-Bench, a suite of advanced reasoning benchmarks inspired by the International Mathematical Olympiad, to evaluate the mathematical reasoning capabilities of foundation models. IMO-Bench consists of IMO-AnswerBench for short-answer problems and IMO-ProofBench for proof-writing capabilities, and the authors' model achieved impressive results on these benchmarks, outperforming other models and demonstrating strong correlation with human evaluations....

Read More

Trove: A Flexible Toolkit for Dense Retrieval

Published at 2025-11-03

#ML

Trove is an open-source toolkit for dense retrieval that offers easy-to-use data management features, enabling users to experiment with different dataset configurations without storing multiple copies. It is highly customizable, supports multi-node execution, and reduces memory consumption while maintaining fast inference times....

Read More

UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

Published at 2025-11-03

#ML

The authors present UniLumos, a fast and unified framework for relighting images and videos, which ensures realistic lighting effects by incorporating RGB-space geometry feedback and supervising the model with depth and normal maps. UniLumos also introduces a structured annotation protocol and a benchmark for evaluating lighting controllability, resulting in improved physical consistency and a 20x speedup for relighting tasks....

Read More

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Published at 2025-11-03

#ML

The study presents UniREditBench, a comprehensive benchmark for evaluating reasoning-based image editing models. This benchmark includes 2,700 samples covering real and game-world scenarios and introduces a multimodal dual-reference evaluation method to improve assessment reliability. The researchers also developed a large-scale synthetic dataset, UniREdit-Data-100K, to fine-tune the Bagel model, resulting in a new model called UniREdit-Bagel that shows significant improvements in performance....

Read More

Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Published at 2025-11-03

#ML

This study presents a new approach to unify vision, language, and action models by integrating multiple modalities into a single denoising process, allowing for synergy between understanding, generating, and acting. The proposed method, Unified Diffusion VLA, achieves superior performance on various benchmarks and is faster than previous methods, as demonstrated through extensive analysis and real-world applications....

Read More

Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Published at 2025-11-03

#ML

This study presents a new method, Vote-in-Context, that uses artificial intelligence to improve the ranking and fusion of complex, multi-modal data like videos. The method incorporates visual and textual information into a model's prompt, allowing it to adaptively weigh retriever consensus against visual-linguistic content, and demonstrates improved performance in video retrieval benchmarks compared to previous state-of-the-art baselines....

Read More
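For contrast with the prompt-based fusion described above: the classical baseline for combining multiple retrievers is reciprocal rank fusion, which uses only rank positions, whereas a VLM-based fuser like Vote-in-Context replaces that fixed scoring rule with a model that also sees the candidates' content. A minimal sketch of the baseline, with made-up video IDs:

```python
# Reciprocal rank fusion: combine several ranked lists by summing
# 1 / (k + rank) per candidate; higher fused score is better.
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

retriever_a = ["vid3", "vid1", "vid7"]    # hypothetical ranked outputs
retriever_b = ["vid1", "vid3", "vid9"]
print(rrf([retriever_a, retriever_b]))    # consensus favors vid1 and vid3
```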

$\left|\,\circlearrowright\,\text{BUS}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

Published at 2025-11-03

#ML

The study presents a new benchmark called BUS, which is a large collection of 1,333 English Rebus Puzzles designed to test the understanding of Vision-Language Models. The researchers also introduce a framework called RebusDescProgICE that improves the performance of these models on the BUS benchmark by combining unstructured descriptions, code-based reasoning, and better example selection....

Read More

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Facebook · X · LinkedIn