🤗 Daily Paper(2025-11-03)


deep.di...@gmail.com

Nov 3, 2025, 3:07:34 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

A Survey on Efficient Vision-Language-Action Models

Published at 2025-10-27

#ML

This study offers a complete review of methods to build efficient Vision-Language-Action models, which help machines understand and interact with the physical world. The review covers efficient design, training, and data collection for these models, providing a roadmap for future research in this area....

Read More

Revisiting Multimodal Positional Encoding in Vision-Language Models

Published at 2025-10-27

#ML

This study analyzes and improves multimodal position encoding in vision-language models, specifically focusing on Rotary Positional Embedding. The researchers propose two new methods, Multi-Head RoPE and MRoPE-Interleave, which outperform existing techniques in various multimodal understanding tasks, without requiring any changes to the model architecture....

Read More
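For background on the summary above, here is a minimal sketch of standard 1-D rotary positional embedding (illustrative only, not code from the paper; the paper's Multi-Head RoPE and MRoPE-Interleave extend this idea to multimodal positions):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply standard rotary positional embedding to one head vector (even dim)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D pair
    theta = pos * inv_freq                         # rotation angle per pair
    x1, x2 = x[0::2], x[1::2]                      # split into rotation pairs
    out = np.empty_like(x, dtype=np.float64)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.ones(8)
# Key RoPE property: attention scores depend only on relative position (9-5 == 6-2).
print(np.allclose(rope(q, 5) @ rope(q, 9), rope(q, 2) @ rope(q, 6)))  # True
```

The relative-position property shown in the last line is what multimodal variants must preserve while also encoding 2-D image positions.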

OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows

Published at 2025-10-28

#ML

This study presents MobileRisk-Live, a dynamic testing environment for mobile agent safety research, and OS-Sentinel, a new safety detection framework that uses a combination of formal verification and AI to improve safety in mobile GUI agents, demonstrating better performance than existing methods....

Read More

INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Published at 2025-10-29

#ML

This study compares integer and floating-point quantization formats for AI hardware, finding that while floating-point is better for coarse-grained quantization, integer formats outperform it at fine granularity, with MXINT8 the strongest 8-bit choice. The research also presents a method to minimize gradient bias in fine-grained low-bit integer training, further enhancing the performance of MXINT8....

Read More
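To illustrate what "fine-grained" quantization means here, a generic sketch of block-wise INT8 quantization with one shared scale per small block (in the spirit of, but not identical to, MXINT8, which uses power-of-two block scales):

```python
import numpy as np

def quant_int8_blockwise(x: np.ndarray, block: int = 32):
    """Quantize a flat tensor to int8 with one shared scale per block of values."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0  # per-block scale
    scale = np.where(scale == 0, 1.0, scale)               # avoid divide-by-zero
    q = np.clip(np.round(xb / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)
q, s = quant_int8_blockwise(x)
print(np.abs(dequant(q, s) - x).max() < 0.03)  # small 8-bit round-trip error
```

Smaller blocks mean each scale only has to cover a narrow local range, which is why integer formats become competitive at fine granularity.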

π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Published at 2025-10-29

#ML

The paper presents pi_RL, an open-source framework for training flow-based Vision-Language-Action models using reinforcement learning in parallel simulation. The framework improves performance and generalization over supervised fine-tuning models, achieving significant gains on LIBERO and ManiSkill benchmarks....

Read More

Defeating the Training-Inference Mismatch via FP16

Published at 2025-10-30

#ML

This study shows that using FP16 instead of BF16 during reinforcement learning fine-tuning of large language models resolves numerical instability caused by the training-inference mismatch, leading to faster convergence and improved performance....

Read More
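A small sketch of the precision difference at play (bfloat16 simulated by bit truncation; illustrative, not the paper's code):

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Simulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = x.astype(np.float32).view(np.uint32)
    return ((bits >> 16) << 16).view(np.float32)

x = np.float32(1.0) + np.float32(1e-3)
fp16 = np.float16(x).astype(np.float32)   # 10 mantissa bits
bf16 = to_bf16(np.array([x]))[0]          # 7 mantissa bits
# FP16's extra mantissa bits make its rounding error near 1.0 much smaller,
# which is the precision gap behind the mismatch discussed above.
print(abs(fp16 - x) < abs(bf16 - x))  # True
```

BF16 trades mantissa bits for FP32-sized exponent range; the paper's point is that in this setting the mantissa precision matters more.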

Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning

Published at 2025-10-30

#ML

This study explores a method called RLVR to improve mathematical reasoning in AI models, but finds that it often relies on simple tricks rather than true reasoning. The research highlights the need for better tests to measure genuine progress in AI reasoning skills....

Read More

The Denario project: Deep knowledge AI agents for scientific discovery

Published at 2025-10-30

#ML

Denario is an AI multi-agent system that assists in scientific research by performing various tasks such as literature checks, coding, and paper drafting. It has demonstrated its capabilities in generating papers across multiple scientific disciplines, combining ideas from different fields, and has been evaluated by domain experts....

Read More

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Published at 2025-10-30

#ML

The study presents ThinkMorph, a model that improves multimodal reasoning by generating progressive text-image steps that manipulate visual content and maintain verbal logic. ThinkMorph significantly outperforms base models on vision-centric benchmarks and demonstrates emergent multimodal intelligence, such as unseen visual manipulation skills and adaptive switching between reasoning modes....

Read More

Value Drifts: Tracing Value Alignment During LLM Post-Training

Published at 2025-10-30

#ML

This study examines how large language models (LLMs) learn and align with human values during their post-training phase, specifically focusing on the effects of post-training algorithms and datasets. The researchers found that supervised fine-tuning primarily shapes a model's values, while preference optimization has limited impact on re-aligning values. They also discovered that different preference optimization algorithms can lead to varying value alignment outcomes, even with the same prefere...

Read More

Continuous Autoregressive Language Models

Published at 2025-10-31

#ML

This study proposes a new approach for language models, Continuous Autoregressive Language Models (CALM), which shifts from predicting individual tokens to predicting continuous vector representations of multiple tokens. This method reduces the number of generative steps, improving efficiency and performance while lowering computational costs, as demonstrated by experiments....

Read More

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

Published at 2025-10-31

#ML

The authors present a new framework called DUST for improving Vision-Language-Action models in robotic policy learning by addressing the challenge of predicting next-state observations and action sequences. DUST uses a multimodal diffusion transformer architecture with separate modality streams, independent noise perturbations, and a decoupled flow-matching loss to learn the joint distribution in a bidirectional manner, resulting in improved performance on simulated and real-world tasks compared...

Read More

Higher-order Linear Attention

Published at 2025-10-31

#ML

The authors propose a new method called Higher-order Linear Attention (HLA) that can handle longer contexts in language models more efficiently than previous methods. HLA can compute per-token outputs in linear time without using large matrices, and it can be extended to higher orders, making it a promising building block for future language models....

Read More
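For context, a sketch of plain first-order causal linear attention, the recurrent O(n) formulation that methods like HLA build on (the feature map and shapes are illustrative; this is not the paper's higher-order scheme):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention: O(n) via a running (d x d_v) state,
    instead of materializing the O(n^2) softmax attention matrix."""
    phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive feature map
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((Q.shape[1], V.shape[1]))    # running sum of k v^T
    z = np.zeros(Q.shape[1])                  # running sum of k (normalizer)
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (6, 4)
```

The constant-size running state is what makes long contexts cheap; HLA's contribution is extending this idea to higher-order interactions without forming large matrices.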

HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

Published at 2025-10-31

#ML

The study investigates the overconfidence issue in GUI agents and proposes HyperClick, a framework that improves reliable GUI grounding through uncertainty calibration, resulting in better performance and more accurate confidence measures....

Read More
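As background on uncertainty calibration in general, a sketch of expected calibration error (ECE), a standard metric for the gap between stated confidence and actual accuracy (a generic metric, not the paper's specific method):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    over bins, weighted by bin size."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Perfectly calibrated toy case: 95% confidence, 95% empirical accuracy.
conf = np.array([0.95] * 100)
correct = np.array([1] * 95 + [0] * 5)
print(round(expected_calibration_error(conf, correct), 3))  # 0.0
```

An overconfident agent (high confidence, lower accuracy) would score a large ECE, which is the failure mode calibration-based methods target.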

Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

Published at 2025-10-31

#ML

The study evaluates YOLOv11, a deep learning model for extracting and classifying building heights from satellite images, finding it outperforms previous models in accuracy and speed, making it ideal for real-time, large-scale urban mapping....

Read More

Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

Published at 2025-10-31

#ML

The authors propose Phased DMD, a new framework for distilling score-based generative models into efficient generators, addressing the limitations of previous methods. Phased DMD improves upon traditional distribution matching distillation by using progressive distribution matching and score matching within subintervals, resulting in better output diversity and retaining key generative capabilities....

Read More

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Published at 2025-10-31

#ML

The study presents a new method called Spatial-SSRL, which improves large vision-language models' spatial understanding without relying on expensive or limited supervision. By using RGB or RGB-D images, Spatial-SSRL creates five pretext tasks that help models understand 2D and 3D structures, leading to better spatial reasoning and higher performance on various benchmarks....

Read More

Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

Published at 2025-10-31

#ML

This study presents BEAT, a framework that inserts visual backdoors into embodied agents powered by multimodal large language models. BEAT uses objects in the environment as triggers and employs a two-stage training process for reliable activation, achieving high attack success rates while maintaining task performance....

Read More

Tags are generated by Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
