🤗 Daily Paper(2025-09-29)


deep.di...@gmail.com

Sep 29, 2025, 4:09:38 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Project page
🤗 Daily Papers

Instruction-Following Evaluation in Function Calling for Large Language Models

Published at 2025-09-22

#ML

The study presents IFEval-FC, a benchmark that tests large language models' ability to follow formatting instructions in function calling, which existing benchmarks do not evaluate. The new benchmark reveals that advanced models often fail to meet basic formatting requirements, pointing out a significant challenge for practical AI applications....

Read More

TUN3D: Towards Real-World Scene Understanding from Unposed Images

Published at 2025-09-23

#ML

The study presents TUN3D, a new method for understanding real-world scenes by estimating layout and detecting 3D objects using multiple images, without needing depth sensors or accurate camera positions. TUN3D outperforms existing methods in various scene understanding tasks and is available for use at a provided GitHub link....

Read More

CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Published at 2025-09-24

#ML

The authors have developed CHURRO, a specialized vision-language model for recognizing text in historical documents, which outperforms other models in accuracy and cost-effectiveness, making it easier to study and preserve cultural heritage....

Read More

PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning

Published at 2025-09-24

#ML

The study presents PromptCoT 2.0, a framework that improves the process of creating challenging and diverse problems for large language models, which helps them to reason better. The new framework uses an iterative method to refine problem-solving steps, resulting in more difficult tasks compared to previous methods. This leads to improved performance in self-play and supervised fine-tuning for language models, setting new records for tasks like Olympiad mathematics and competitive programming....

Read More

CAD-Tokenizer: Towards Text-based CAD Prototyping via Modality-Specific Tokenization

Published at 2025-09-25

#ML

The study presents a new method, CAD-Tokenizer, for text-based CAD prototyping. This method uses a special tokenization strategy that understands the geometric structure of CAD designs better than previous methods, resulting in improved design quality and accuracy....

Read More

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Published at 2025-09-25

#ML

This study addresses the issue of reward over-optimization in policy models by focusing on reward misspecification in the high-reward tail. The researchers propose a rubric-based reward modeling approach, which effectively mitigates reward over-optimization and improves large language model post-training, by leveraging off-policy examples while remaining insensitive to their artifacts....

Read More

D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

Published at 2025-09-25

#ML

The authors propose a new framework called D-Artemis for GUI agents that mimics human cognitive processes to improve task automation. D-Artemis uses a tip retrieval mechanism, proactive alignment stage, and post-execution reflection agent to enhance the capabilities of general-purpose language models for GUI tasks, achieving state-of-the-art results on major benchmarks....

Read More

Finding 3D Positions of Distant Objects from Noisy Camera Movement and Semantic Segmentation Sequences

Published at 2025-09-25

#ML

The study presents a flexible and practical solution for locating distant objects using particle filters, which is particularly useful in safety-critical surveillance tasks like drone-based wildfire monitoring, where methods such as dense depth estimation or 3D scene reconstruction are infeasible due to computational limits or the sheer distance of the objects....
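As a rough illustration of the general technique (not the paper's pipeline), a particle filter tracks an object by cycling through predict, update, and resample steps. The `pf_step` helper and its Gaussian measurement model below are illustrative assumptions, a minimal 1D sketch:

```python
import math
import random

def likelihood(err, sigma=1.0):
    """Gaussian measurement likelihood (unnormalized)."""
    return math.exp(-0.5 * (err / sigma) ** 2)

def pf_step(particles, motion, measurement, motion_noise=0.2, rng=random):
    """One predict / update / resample cycle of a toy 1D particle filter."""
    # Predict: propagate each particle by the (noisy) camera motion.
    particles = [p + motion + rng.gauss(0.0, motion_noise) for p in particles]
    # Update: weight each particle by how well it explains the measurement.
    weights = [likelihood(measurement - p) for p in particles]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # Resample: draw a new particle set proportional to the weights.
    particles = rng.choices(particles, weights=weights, k=len(particles))
    # Point estimate: mean of the resampled particles.
    estimate = sum(particles) / len(particles)
    return particles, estimate
```

Because each step only touches a fixed number of particles, the cost stays constant regardless of object distance, which is the appeal over dense depth estimation.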

Read More

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Published at 2025-09-25

#ML

The authors present RLBFF, a new approach that combines human feedback and rule-based verification in reinforcement learning, allowing for more nuanced reward models. RLBFF outperforms existing methods and can be customized for specific principles of interest, with a fully open-source recipe provided for alignment....

Read More

Real-Time Object Detection Meets DINOv3

Published at 2025-09-25

#ML

The study presents DEIMv2, an improved object detection system that uses DINOv3 features and offers eight models for various devices. DEIMv2 achieves better performance and uses fewer parameters compared to previous models, with the largest model having 57.8 AP and only 50.3 million parameters, and the smallest model being the first sub-10 million model to exceed 50 AP on COCO....

Read More

ReviewScore: Misinformed Peer Review Detection with Large Language Models

Published at 2025-09-25

#ML

This study addresses the issue of declining review quality in AI conferences due to the increasing number of submissions. The researchers propose a method to detect misinformed review points using large language models and build a human-annotated dataset to evaluate the model's performance, demonstrating moderate agreement between humans and models....

Read More

The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

Published at 2025-09-25

#ML

This study creates and evaluates synthetic datasets for Indian languages using a bottom-up approach, resulting in a high-quality, large-scale dataset called Updesh. The dataset significantly improves generative task performance and narrows the gap between high and low-resource languages, highlighting the importance of context-aware, culturally grounded data curation in multilingual AI....

Read More

Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval

Published at 2025-09-25

#ML

The Think-on-Graph 3.0 framework uses a Multi-Agent Context Evolution and Retrieval mechanism to build and refine a graph index dynamically during reasoning, improving the performance of Large Language Models with external knowledge and overcoming the limitations of static graph construction....

Read More

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

Published at 2025-09-25

#ML

The study presents UltraHorizon, a new benchmark for evaluating artificial agents in long-term, complex tasks like large-scale software development or scientific discovery. The benchmark focuses on testing agents' reasoning, planning, memory, and tool use skills in hidden rule discovery tasks, revealing that current language models struggle in these settings compared to humans....

Read More

UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

Published at 2025-09-25

#ML

The researchers propose UniVid, a framework that uses a pre-trained video generation model to handle various image and video tasks without needing modifications for each task. They show that UniVid can work with different types of data and switch between understanding and generating tasks, demonstrating the potential of pre-trained video models for unified vision modeling....

Read More

X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning

Published at 2025-09-25

#ML

This research presents a new method called X-CoT for text-to-video retrieval that uses LLM-based reasoning instead of embedding models. X-CoT improves retrieval performance, provides detailed explanations for ranking results, and helps analyze model behavior and data quality....

Read More

X-Streamer: Unified Human World Modeling with Audiovisual Interaction

Published at 2025-09-25

#ML

The authors present a framework called X-Streamer, which creates digital human agents that can interact through text, speech, and video in real-time using a single portrait. This is achieved by a dual-transformer architecture that understands and generates multimodal inputs, allowing for long-term, stable, and coherent conversations....

Read More

COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Published at 2025-09-26

#ML

The study presents a new method called CoSpaDi for compressing large language models without reducing their accuracy. It replaces the traditional low-rank weight approximation with a more flexible structured sparse factorization, enabling better model fidelity and efficient sparse-dense matrix multiplication. The method is evaluated across multiple models and demonstrates superior performance compared to state-of-the-art data-aware low-rank methods....

Read More

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

Published at 2025-09-26

#ML

The study presents a new training method, CapRL, for improving image captioning by using reinforcement learning and a reward system based on the usefulness of captions. This approach leads to more diverse and accurate descriptions compared to traditional human-annotated data methods....

Read More

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Published at 2025-09-26

#ML

The study addresses the challenge of training large language model (LLM) agents in environments with sparse rewards and multiple turns, identifying a failure mode called the 'exploration-exploitation cascade failure'. The proposed solution, Entropy-regularized Policy Optimization (EPO), improves performance by up to 152% in ScienceWorld and 19.8% in ALFWorld through enhanced exploration, entropy smoothing, and adaptive phase-based weighting....

Read More

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Published at 2025-09-26

#ML

The ERGO model presents a two-stage 'coarse-to-fine' reasoning pipeline for vision-language tasks that reduces computational cost by first analyzing a downsampled image to identify relevant regions, then processing only these regions at full resolution. It uses a reinforcement learning framework to determine where to focus, accounting for perceptual uncertainty and delivering higher accuracy and efficiency than existing methods....
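The coarse-to-fine idea can be sketched in miniature. The `coarse_to_fine` helper below is a toy stand-in (scoring coarse patch summaries, then cropping only the winners at full resolution), not ERGO's learned region-selection policy:

```python
def coarse_to_fine(image, score_fn, patch=4, top_k=1):
    """Score coarse patch summaries of a 2D 'image' (list of lists of
    numbers), then return full-resolution crops of the top-k patches."""
    h, w = len(image), len(image[0])
    scored = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            block = [row[j:j + patch] for row in image[i:i + patch]]
            # Coarse pass: summarize the block cheaply (here, its mean).
            coarse = sum(map(sum, block)) / (len(block) * len(block[0]))
            scored.append((score_fn(coarse), i, j))
    scored.sort(reverse=True)
    # Fine pass: crop only the selected regions at full resolution.
    return [[row[j:j + patch] for row in image[i:i + patch]]
            for _, i, j in scored[:top_k]]
```

The savings come from the fine pass touching only `top_k` patches instead of the whole high-resolution input; ERGO learns where to focus with reinforcement learning rather than a fixed `score_fn`.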

Read More

Fine-tuning Done Right in Model Editing

Published at 2025-09-26

#ML

This study challenges the belief that fine-tuning is ineffective for model editing, attributing previous failures to adapting it to a sequential depth-first pipeline. The researchers propose LocFT-BF, a localized editing method that restores fine-tuning to a standard breadth-first pipeline, significantly improving its effectiveness and outperforming state-of-the-art methods across various LLMs and datasets....

Read More

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Published at 2025-09-26

#ML

The study presents FlashEdit, a new framework for high-quality, real-time image editing. It uses three innovations: a fast editing pipeline, a technique to protect the background, and a mechanism for precise, localized edits, resulting in a 150x speedup compared to previous methods....

Read More

HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models

Published at 2025-09-26

#ML

The authors propose a new technique called history-guided sampling (HiGS) to improve the quality and efficiency of image generation using diffusion models. HiGS integrates recent model predictions into each inference step, resulting in more realistic images with better details and structure, without requiring additional computation or training....
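A minimal sketch of the "integrate recent predictions" idea, assuming a simple extrapolation rule that pushes the current prediction away from the running mean of recent ones; the paper's exact guidance formula may differ:

```python
def history_guided(pred, history, weight=0.3, window=4):
    """Steer the current prediction using the mean of recent predictions.

    pred:    current model prediction (list of floats, standing in for a
             denoised image estimate at one sampling step)
    history: predictions from previous steps (most recent last)
    """
    if history:
        mean = [sum(vals) / len(history) for vals in zip(*history)]
        # Extrapolate: push the prediction away from the recent average.
        pred = [p + weight * (p - m) for p, m in zip(pred, mean)]
    # Keep only the last `window` predictions for the next step.
    history = (history + [pred])[-window:]
    return pred, history
```

Since the update reuses predictions the sampler already computed, it adds essentially no extra model evaluations, matching the paper's claim of no additional computation or training.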

Read More

Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Published at 2025-09-26

#ML

This study presents a new approach to training large language models (LLMs) using verbal feedback instead of scalar rewards. The feedback-conditional policy (FCP) method learns directly from response-feedback pairs, allowing LLMs to learn from and adapt to nuanced feedback more effectively, improving their overall performance and expressiveness....

Read More

Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Published at 2025-09-26

#ML

The study presents a new method called SPEAR to improve the training of agentic LLMs in reinforcement learning tasks. SPEAR is a curriculum-based self-imitation learning recipe that helps balance exploration and exploitation by gradually adjusting the policy evolution and managing the entropy trend, leading to more stable and efficient training....

Read More

LongLive: Real-time Interactive Long Video Generation

Published at 2025-09-26

#ML

LongLive is a real-time interactive long video generation framework that improves efficiency and quality by using a causal, frame-level autoregressive design, KV-recache mechanism, streaming long tuning, and short window attention with a frame-level attention sink. It can fine-tune a 1.3B-parameter short-clip model for minute-long generation in just 32 GPU-days, achieving 20.7 FPS on a single NVIDIA H100 and supporting up to 240-second videos....

Read More

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Published at 2025-09-26

#ML

This study presents LucidFlux, a new method for restoring degraded images without using image captions. It uses a large diffusion transformer and a lightweight dual-branch conditioner to protect the global structure while recovering texture, and it avoids the need for text prompts or MLLM captions by using SigLIP features....

Read More

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Published at 2025-09-26

#ML

This study presents MesaTask, a framework that uses spatial reasoning to create realistic tabletop scenes based on task descriptions, addressing the challenge of generating task-relevant scenes for robot training. The framework, enhanced with DPO algorithms, outperforms baselines in generating plausible scenes that align with given tasks, using a large-scale dataset of over 10,000 synthetic tabletop scenes....

Read More

Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation

Published at 2025-09-26

#ML

The authors present a new method to separate visual and semantic features from pre-trained image generation models, which helps identify inconsistencies in generated images more accurately than existing methods. This approach not only measures but also pinpoints the inconsistent areas, making it a useful tool for improving subject-driven image generation....

Read More

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Published at 2025-09-26

#ML

MinerU2.5 is a new document parsing model that is both accurate and efficient. It uses a two-step process to analyze document layout and recognize content, allowing it to handle high-resolution documents without using too many computational resources....

Read More

No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Published at 2025-09-26

#ML

This study proposes a new method called RL-ZVP that makes use of 'zero-variance prompts' in training large language models, which were previously ignored. The new method outperforms existing techniques by rewarding correctness and penalizing errors in these prompts, leading to significant improvements in accuracy and pass rate on various math reasoning benchmarks....
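One way to picture the idea (a sketch, not the paper's exact formulation): when every rollout of a prompt gets the same reward, the usual group-baseline advantage is zero, so a small entropy-weighted signal is substituted, positive when the group is all-correct and negative when all-wrong. The `scale` knob is an assumption:

```python
def zvp_advantages(rewards, entropies, scale=0.1):
    """Toy advantage shaping for one prompt's rollout group.

    rewards:   one scalar reward per rollout (all equal => zero variance)
    entropies: per-rollout mean token entropy
    """
    mean_r = sum(rewards) / len(rewards)
    if any(r != mean_r for r in rewards):
        # Ordinary prompt: standard group-baseline advantage.
        return [r - mean_r for r in rewards]
    # Zero-variance prompt: sign follows correctness (1 = all correct,
    # 0 = all wrong), magnitude follows token entropy.
    sign = 1.0 if mean_r > 0 else -1.0
    return [sign * scale * e for e in entropies]
```

The point of the shaping is that zero-variance prompts, which standard group-relative methods silently discard, still contribute gradient signal.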

Read More

Quantile Advantage Estimation for Entropy-Safe Reasoning

Published at 2025-09-26

#ML

The study addresses the instability in training reinforcement learning models by proposing a new method called Quantile Advantage Estimation (QAE). QAE replaces the mean baseline with a group-wise K-quantile baseline, which helps in stabilizing entropy, improving credit assignment, and achieving consistent performance gains in large language models....
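The baseline swap itself is small; here is a sketch using a simple empirical quantile (QAE's exact choice of K and quantile estimator is an assumption):

```python
def quantile_advantage(rewards, q=0.5):
    """Group advantages against an empirical q-quantile baseline instead
    of the group mean (toy estimator, not QAE's exact one)."""
    s = sorted(rewards)
    idx = min(int(q * len(s)), len(s) - 1)  # index of the q-quantile
    baseline = s[idx]
    return [r - baseline for r in rewards]
```

With sparse 0/1 rewards, a median-style baseline stays at 0 until a majority of rollouts succeed, so a rare success keeps a full-sized positive advantage, whereas a mean baseline would shrink it and push every failed rollout negative.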

Read More

RefAM: Attention Magnets for Zero-Shot Referral Segmentation

Published at 2025-09-26

#ML

This study presents a new method that uses features and attention scores from diffusion transformers for downstream tasks, without requiring architectural modifications or additional training. The method, called RefAM, improves zero-shot referring image and video segmentation by using stop words as attention magnets, identifying global attention sinks, and redistributing attention, resulting in sharper and more accurate grounding maps....

Read More

SPARK: Synergistic Policy And Reward Co-Evolving Framework

Published at 2025-09-26

#ML

The paper presents SPARK, a framework that improves upon RLVR by recycling rollouts and correctness data to train a generative reward model, eliminating the need for separate reward models and costly human preference data, resulting in significant performance gains on various models and benchmarks....

Read More

Scale-Wise VAR is Secretly Discrete Diffusion

Published at 2025-09-26

#ML

The study reveals that Visual Autoregressive Generation (VAR), a powerful method for visual generation, is mathematically similar to a discrete diffusion process when using a specific attention mask. By applying this understanding, the researchers improve VAR's efficiency and performance, resulting in faster convergence, lower inference costs, and better reconstruction, all while maintaining its scalability and computational benefits....

Read More

See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

Published at 2025-09-26

#ML

The authors propose a new method called See, Point, Fly for drone navigation without the need for training. It uses language instructions to guide drones in any environment, outperforming previous methods in both simulation and real-world tests....

Read More

StateX: Enhancing RNN Recall via Post-training State Expansion

Published at 2025-09-26

#ML

The study presents a method to improve the recall ability of recurrent neural networks (RNNs) without significantly increasing training costs or model parameters. It expands the state size of pre-trained RNNs (such as linear attention and state space models) through a post-training process, enhancing their performance on tasks that require accurate recall of long contexts without compromising other capabilities....

Read More

Variational Reasoning for Language Models

Published at 2025-09-26

#ML

The study presents a new approach for language models that treats thinking processes as latent variables and improves them through variational inference. This method unifies variational inference with RL-style techniques, resulting in stable objectives that enhance the reasoning skills of language models....

Read More

VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

Published at 2025-09-26

#ML

The authors present VoiceAssistant-Eval, a thorough benchmark for testing voice-first AI assistants in listening, speaking, and viewing tasks. By evaluating various models, they found that open-source models can outperform proprietary ones, and smaller, well-designed models can be as effective as larger ones, though challenges remain in multimodal input and role-play voice imitation tasks....

Read More

WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning

Published at 2025-09-26

#ML

WebGen-Agent is a new website-generating tool that uses visual feedback to improve its performance. It uses a visual language model to generate text descriptions and scores for website screenshots and GUI testing, which are then used to enhance the agent's performance and improve the accuracy of language models in generating websites....

Read More

Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Published at 2025-09-26

#ML

The authors present EAGLE, a new method to understand how multimodal large language models generate text by attributing tokens to specific visual regions and measuring the influence of language and visual inputs, outperforming existing methods in interpretability while being efficient and practical....

Read More

WoW: Towards a World omniscient World model Through Embodied Interaction

Published at 2025-09-26

#ML

This study explores how AI can develop a better understanding of physics by actively interacting with the world, as opposed to passively observing like current video models. They introduce WoW, a model trained on millions of robot interactions, which demonstrates improved physical causality and object permanence, setting a new standard in AI physical intuition....

Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit the developer's social media

Fb · X · In