🤗 Daily Paper (2025-10-14)


deep.di...@gmail.com

Oct 14, 2025, 4:08:17 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

A Tale of LLMs and Induced Small Proxies: Scalable Agents for Knowledge Mining

Published at 2025-10-01

#ML

The research presents Falconer, a framework that combines large language models with lightweight proxy models for efficient knowledge mining. Falconer reduces inference cost by up to 90% and speeds up knowledge mining by over 20 times compared to state-of-the-art language models, making it a scalable solution for extracting information from large text datasets....

Read More

World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

Published at 2025-10-05

#ML

The authors present a new framework called World-To-Image that improves text-to-image models by incorporating web-retrieved information for unknown concepts, enhancing both semantic accuracy and visual quality, as demonstrated by its superior performance on a custom benchmark....

Read More

Making Mathematical Reasoning Adaptive

Published at 2025-10-06

#ML

The study presents a new framework called AdaR to improve mathematical reasoning in large language models by reducing reliance on superficial features and promoting problem-solving logic. Experimental results show that AdaR significantly enhances robustness, generalization, and data efficiency in mathematical reasoning tasks....

Read More

VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Published at 2025-10-06

#ML

The study presents VER, a Vision Expert Transformer designed to enhance robotic learning by combining multiple vision foundation models into a single, adaptive system. VER offers flexible and efficient feature selection, requiring minimal training to integrate robot-domain knowledge, and achieves superior performance across various robotic tasks by focusing on task-critical regions....

Read More

From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation

Published at 2025-10-08

#ML

The authors propose a new training method for generative models using a Bilevel Optimization framework, which addresses the issue of aligning models when only high-quality datasets are available, by treating the reward function as the optimization variable of an outer-level problem. They provide a theoretical analysis of this method and show its application to tabular classification and model-based reinforcement learning....
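
To make the structure concrete, here is a minimal sketch of the bilevel problem in LaTeX. The KL-regularized inner objective and the coefficient β are our assumptions for illustration, not necessarily the paper's exact formulation: the outer level fits the reward r so that the induced policy explains the data, while the inner level solves a regularized reward maximization.

\max_{r} \; \mathbb{E}_{x \sim \mathcal{D}}\!\left[ \log \pi_r^{*}(x) \right]
\quad \text{s.t.} \quad
\pi_r^{*} = \arg\max_{\pi} \; \mathbb{E}_{x \sim \pi}\!\left[ r(x) \right] - \beta \, \mathrm{KL}\!\left( \pi \,\|\, \pi_0 \right)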

Read More

oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Published at 2025-10-08

#ML

The study presents oMeBench, a large-scale, expert-curated benchmark for evaluating large language models' ability to understand and reason about organic chemistry mechanisms. The authors propose a dynamic evaluation framework, oMeS, to assess model performance more precisely. They find that while current models show some chemical intuition, they struggle with multi-step reasoning, though performance improves with targeted strategies and fine-tuning....

Read More

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Published at 2025-10-09

#ML

The study presents FinAuditing, a new benchmark for evaluating large language models in financial auditing tasks, focusing on structured, taxonomy-driven documents. Experiments show that current models struggle with semantic, relational, and numerical reasoning in financial contexts, highlighting the need for improvement in this area....

Read More

Graph Diffusion Transformers are In-Context Molecular Designers

Published at 2025-10-09

#ML

The study presents demonstration-conditioned diffusion models (DemoDiff) for molecular design, which use a small set of molecule-score examples to guide the generation of molecules aligned with target properties. DemoDiff outperforms larger language models and domain-specific approaches in various design tasks, positioning it as a promising foundation model for in-context molecular design....

Read More

PEAR: Phase Entropy Aware Reward for Efficient Reasoning

Published at 2025-10-09

#ML

The study finds a link between the model's entropy during different reasoning phases and lengthy responses in large reasoning models. The authors propose a new method, PEAR, which adjusts the model's reasoning process to create shorter, more efficient responses without sacrificing accuracy, even in tricky situations....
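
As a rough illustration of an entropy-aware reward (a sketch in Python; the phase split, the weights, and the function names are our assumptions, not PEAR's published design):

import math
from typing import List

def entropy(probs: List[float]) -> float:
    # Shannon entropy of one token's output distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def phase_aware_reward(base_reward: float,
                       think_entropies: List[float],
                       answer_entropies: List[float],
                       w_think: float = 0.1,
                       w_answer: float = 0.5) -> float:
    # Penalize mean per-token entropy with different weights per phase,
    # nudging the model toward shorter, more decisive responses.
    think_pen = w_think * sum(think_entropies) / max(len(think_entropies), 1)
    answer_pen = w_answer * sum(answer_entropies) / max(len(answer_entropies), 1)
    return base_reward - think_pen - answer_pen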

Read More

Self-Improving LLM Agents at Test-Time

Published at 2025-10-09

#ML

This study presents a new method for improving language models on-the-fly by identifying challenging samples, generating similar examples, and using them for test-time fine-tuning. The proposed approach, Test-Time Self-Improvement, significantly enhances model performance with fewer training samples compared to traditional methods....
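
The loop described above might look roughly like this in Python (a hedged sketch: the confidence heuristic, the generate_similar hook, and the model interface are all hypothetical, not the paper's code):

from typing import Callable, List

def test_time_self_improve(
    model,                                  # assumed to expose confidence/finetune/predict
    inputs: List[str],
    generate_similar: Callable[[str, int], List[str]],  # hypothetical augmentation hook
    conf_threshold: float = 0.5,
    n_synthetic: int = 4,
) -> List[str]:
    # 1) Flag the inputs the model is least confident about.
    hard = [x for x in inputs if model.confidence(x) < conf_threshold]
    # 2) Generate nearby examples for each flagged input.
    synthetic = [s for x in hard for s in generate_similar(x, n_synthetic)]
    # 3) Briefly fine-tune on the synthetic set, then answer everything.
    if synthetic:
        model.finetune(synthetic, steps=10)
    return [model.predict(x) for x in inputs]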

Read More

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Published at 2025-10-10

#ML

The authors address the challenge of preventing harm from multi-step tasks executed by LLM agents by proposing a foundational guardrail for pre-execution safety. They introduce AuraGen for generating large, reliable datasets with synthetic trajectories, Safiron as a compact guardian model to flag risky cases, and Pre-Exec Bench for evaluating guardrail performance, all of which contribute to safer agentic systems....

Read More

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Published at 2025-10-10

#ML

The study presents CoBia, a tool for uncovering hidden biases in language models through constructed conversations. By creating scenarios where models make biased statements, the researchers evaluate the models' ability to recover and reject biased follow-up questions, revealing that many models struggle with this and have deeply embedded biases that can be brought to the surface through interaction....

Read More

LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning

Published at 2025-10-10

#ML

The authors present Qwen3-XPlus, a translation-enhanced model that outperforms others in translation, especially for low-resource languages, without sacrificing reasoning skills. The model is trained with a novel recipe that starts from instruct models and applies layer-selective tuning on parallel data, offering a more accessible and less complex route to multilingual enhancement....

Read More

Multimodal Policy Internalization for Conversational Agents

Published at 2025-10-10

#ML

This study presents a new approach called Multimodal Policy Internalization (MPI) that incorporates complex visual and multimodal policies into model parameters, allowing conversational agents to follow these policies more effectively without needing them during inference. The researchers developed two datasets and a three-stage training framework called TriMPI, which significantly improves accuracy, generalization, and robustness in multimodal tasks, and made their resources available to encourage future research.

Read More

On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Published at 2025-10-10

#ML

This study identifies uncertain visual tokens in the early layers of a vision encoder as a major cause of object hallucinations in large vision-language models. The researchers propose a new method to reduce object hallucinations by using adversarial perturbations to detect and mask uncertain visual tokens, which significantly improves the accuracy of these models....
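
A minimal sketch of the detect-and-mask idea in Python/PyTorch (our assumptions: random noise stands in for the paper's adversarial perturbation, and the layer choice, noise scale, and top-fraction rule are illustrative):

import torch

def uncertain_token_mask(early_encoder, image: torch.Tensor,
                         eps: float = 1e-2, top_frac: float = 0.1) -> torch.Tensor:
    # Score each visual token by how much its early-layer feature moves
    # under a small input perturbation, then mask the most sensitive ones.
    with torch.no_grad():
        clean = early_encoder(image)                      # (num_tokens, dim)
        noisy = early_encoder(image + eps * torch.randn_like(image))
    shift = (clean - noisy).norm(dim=-1)                  # per-token sensitivity
    k = max(1, int(top_frac * shift.numel()))
    threshold = shift.topk(k).values.min()
    return shift >= threshold                             # True = treat token as uncertain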

Read More

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Published at 2025-10-10

#ML

The study presents a new method called Sandwiched Policy Gradient (SPG) for training masked diffusion language models using reinforcement learning, which outperforms existing methods by reducing bias and improving accuracy in various tasks....

Read More

Spotlight on Token Perception for Multimodal Reinforcement Learning

Published at 2025-10-10

#ML

The study examines how visual perception impacts multimodal reasoning in complex models, finding that only some generated tokens rely heavily on visual information. They then propose a new method, Visually-Perceptive Policy Optimization (VPPO), which improves learning by focusing on these key visual tokens, outperforming existing models on various benchmarks....

Read More

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Published at 2025-10-10

#ML

The researchers present a method called Stable Video Infinity (SVI) that can create endless videos with consistent timing, believable scene changes, and adjustable storylines. SVI tackles the issue of errors building up in video generation by using a new training technique that helps the model recognize and fix its own mistakes, allowing for longer and more diverse video creation without extra computation cost....

Read More

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Published at 2025-10-10

#ML

The authors propose a new evaluation method for language model defenses against jailbreaks and prompt injections, arguing that current methods are flawed. They demonstrate stronger adaptive attacks that bypass 12 recent defenses with high success rates, suggesting that future defense work should consider these stronger attacks to make reliable claims of robustness....

Read More

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

Published at 2025-10-10

#ML

This study examines how user memory impacts emotional intelligence in AI systems, finding that the same situation can be interpreted differently based on the user's profile. The research reveals that AI models may unintentionally reinforce social inequalities, as advantaged profiles tend to receive more accurate emotional interpretations....

Read More

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

Published at 2025-10-11

#ML

The authors present AVoCaDO, an audiovisual video captioner that creates detailed descriptions with synchronized audio and visual events. They introduce a two-step training process that improves the model's performance, resulting in AVoCaDO outperforming existing open-source models on various audiovisual video captioning benchmarks....

Read More

Don't Just Fine-tune the Agent, Tune the Environment

Published at 2025-10-11

#ML

The authors propose a new method called Environment Tuning to train language model agents for complex tasks, addressing the overfitting and training instability seen in existing methods. This approach lets agents learn directly from problem instances and achieves better performance and generalization than competing methods while using less training data....

Read More

HUME: Measuring the Human-Model Performance Gap in Text Embedding Task

Published at 2025-10-11

#ML

The study presents a framework called HUME to measure human performance in text embedding tasks, which is compared to model performance across various datasets. The results show that while models generally outperform humans, there is significant variation, with models struggling in some areas and humans performing better in others, particularly in low-resource languages....

Read More

RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

Published at 2025-10-11

#ML

This study proposes RLFR, a new method for improving reasoning abilities in Large Language Models using flow rewards derived from latent space, which is more effective and efficient than previous methods that use auxiliary signals for reward shaping. Experiments on language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards, suggesting a promising paradigm for reward shaping with auxiliary signals....

Read More

Skill-Targeted Adaptive Training

Published at 2025-10-11

#ML

A new method called STAT is proposed to improve language model training by utilizing a stronger model's metacognition ability. This approach identifies and targets missing skills in the student model, leading to significant performance improvements on various benchmarks and out-of-distribution tasks....

Read More

SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning

Published at 2025-10-11

#ML

SwarmSys is a framework that uses three types of agents (Explorers, Workers, and Validators) to improve scalability and adaptability in multi-agent reasoning. It outperforms existing methods on tasks like symbolic reasoning and scientific programming by using techniques such as adaptive agent profiles and a pheromone-inspired reinforcement mechanism....

Read More

AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes

Published at 2025-10-12

#ML

The study presents a new method that uses text-to-video models to predict the best viewpoints for 4D scenes. By combining the scene information with the video generation model, the proposed approach outperforms existing methods and demonstrates the potential of video generation models for interactive experiences in the real world....

Read More

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Published at 2025-10-12

#ML

The authors present BrowserAgent, an advanced web agent that mimics human browsing behaviors like scrolling, clicking, and typing to solve complex tasks directly on raw web pages using predefined browser actions. Through a two-stage training process and an explicit memory mechanism, BrowserAgent achieves competitive results across various Open-QA tasks, demonstrating around 20% improvement over Search-R1 on multi-hop QA tasks....
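
To give a feel for what "predefined browser actions" means, here is a hedged Python sketch of such an action space; the exact vocabulary and string format BrowserAgent uses are assumptions based on the summary:

from dataclasses import dataclass
from typing import Optional

@dataclass
class BrowserAction:
    kind: str                      # "click" | "scroll" | "type" | "back" | "answer"
    target: Optional[str] = None   # element selector for click/type
    text: Optional[str] = None     # text to type, or the final answer
    delta: int = 0                 # scroll offset in pixels

def parse_action(raw: str) -> BrowserAction:
    # Map an agent-emitted string like "click #search-btn" or
    # "type #q transformers" onto the fixed action set.
    parts = raw.split(maxsplit=2)
    kind = parts[0]
    if kind == "scroll":
        return BrowserAction(kind, delta=int(parts[1]))
    if kind == "click":
        return BrowserAction(kind, target=parts[1])
    if kind == "type":
        return BrowserAction(kind, target=parts[1], text=parts[2])
    if kind == "answer":
        return BrowserAction(kind, text=raw.split(maxsplit=1)[1])
    return BrowserAction("back")   # fallback / explicit "back"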

Read More

FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

Published at 2025-10-12

#ML

This study presents two methods, Error-Constrained Layer Merging and Mask-guided Token Merging, to reduce computational cost in 3D Human Mesh Recovery transformer models by merging layers and tokens with minimal impact on accuracy. Additionally, a diffusion-based decoder is proposed to maintain performance, resulting in up to 2.3x speed-up and slight performance improvements over the baseline....

Read More

High-Fidelity Simulated Data Generation for Real-World Zero-Shot Robotic Manipulation Learning with Gaussian Splatting

Published at 2025-10-12

#ML

This study presents a new method called RoboSimGS that creates realistic and interactive simulation environments for robotic manipulation from multi-view real-world images, addressing the challenge of generalizing simulated data to real-world applications. The approach combines 3D Gaussian Splatting and mesh primitives to reconstruct scenes, and uses a Multi-modal Large Language Model to automatically create physically plausible and articulated assets, enabling successful zero-shot sim-to-real transfer.

Read More

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Published at 2025-10-12

#ML

The study presents OmniVideoBench, a new benchmark for evaluating the synergistic reasoning abilities of multimodal large language models in understanding audio-visual content, which includes 1000 QA pairs and 13 question types, and reveals a significant performance gap between open-source and closed-source models....

Read More

RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

Published at 2025-10-12

#ML

The researchers present a new method called RePro that uses a small language model to rephrase pretraining data in a faithful and effective way, improving the performance of large language models by 4.7%-14.0% on various tasks, and making the most of limited high-quality pretraining data....

Read More

The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution

Published at 2025-10-12

#ML

The study reveals that JavaScript code generated by Large Language Models has unique stylistic signatures which can be used for reliable authorship attribution and model fingerprinting. A large-scale dataset and a custom architecture are introduced for high-accuracy attribution, even after code transformations....

Read More

ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

Published at 2025-10-13

#ML

The Acadreason benchmark was created to test the reasoning skills of large language models and agents in academic fields like computer science, economics, and more. When tested, most models struggled, showing there's still a lot of progress needed for them to match human-level reasoning in complex academic tasks....

Read More

AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

Published at 2025-10-13

#ML

This study presents AndesVL, a new set of mobile-friendly multimodal large language models that can perform various tasks such as image understanding, reasoning, and GUI-related tasks, all while being compatible with devices like mobile phones....

Read More

Are Large Reasoning Models Interruptible?

Published at 2025-10-13

#ML

This study tests how well large reasoning models perform in real-world situations where they face interruptions and changing context, finding that their performance can drop significantly compared to traditional evaluations, and identifying new failure modes such as reasoning leakage, panic, and self-doubt....

Read More

CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Published at 2025-10-13

#ML

The authors present a new method called CodePlot-CoT that uses visual aids to improve mathematical reasoning in AI models. They created a large dataset of math problems with visual reasoning and developed a code-to-image converter to generate precise visual thought for solving these problems. The resulting model outperforms existing ones by up to 21% and the authors make their datasets and models publicly available for future research....
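
The code-to-image step could be as simple as the following Python sketch (illustrative only: the function name is ours, and a real pipeline would sandbox the model-emitted code rather than exec() it directly):

import io
import matplotlib
matplotlib.use("Agg")              # headless rendering
import matplotlib.pyplot as plt

def render_plot_code(code: str) -> bytes:
    # Execute plotting code emitted by the model and return the figure
    # as PNG bytes that can be fed back as a visual "thought".
    scope = {"plt": plt}
    exec(code, scope)              # brevity only; sandbox in practice
    buffer = io.BytesIO()
    plt.gcf().savefig(buffer, format="png")
    plt.close("all")
    return buffer.getvalue()

For example, render_plot_code("plt.plot([0, 1], [0, 1])") yields a PNG of a line segment that can be attached to the reasoning trace.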

Read More

Demystifying Reinforcement Learning in Agentic Reasoning

Published at 2025-10-13

#ML

The study investigates reinforcement learning in agentic reasoning from data, algorithm, and reasoning mode perspectives, providing insights on improving LLM reasoning abilities. Key findings include using real trajectories for training, employing exploration-friendly techniques, and adopting a deliberative strategy for better tool efficiency and accuracy, enabling smaller models to outperform larger ones....

Read More

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

Published at 2025-10-13

#ML

The authors present DiT360, a framework that generates high-quality panoramic images by training on both perspective and panoramic data. DiT360 uses several modules for inter-domain transformation and intra-domain augmentation, which improve image quality, diversity, and realism. Experiments show that DiT360 outperforms other methods in terms of boundary consistency and image fidelity....

Read More

Diffusion Transformers with Representation Autoencoders

Published at 2025-10-13

#ML

This study proposes using advanced representation encoders, like DINO, SigLIP, and MAE, in place of the traditional VAE for diffusion transformers, leading to better image generation results and faster convergence....

Read More

DocReward: A Document Reward Model for Structuring and Stylizing

Published at 2025-10-13

#ML

The study presents DocReward, a model that rates documents based on structure and style, addressing the gap in current agentic workflows that prioritize textual quality over visual aspects. DocReward, trained on a large multi-domain dataset, outperforms GPT-4o and GPT-5 in accuracy and effectively guides document generation agents to produce more appealing documents....

Read More

GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Published at 2025-10-13

#ML

The authors present GIR-Bench, a new benchmark for testing unified multimodal models in reasoning-centric tasks. This benchmark evaluates models in three areas: understanding-generation consistency, reasoning-centric text-to-image generation, and multi-step reasoning in editing. The results show that while unified models can perform these tasks, they still struggle with aligning understanding and generation....

Read More

IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment

Published at 2025-10-13

#ML

The researchers present IVEBench, a new benchmark suite for evaluating instruction-guided video editing. IVEBench includes a diverse collection of videos and editing tasks, and it uses a three-dimensional evaluation protocol to assess video quality, instruction compliance, and video fidelity....

Read More

InfiniHuman: Infinite 3D Human Creation with Precise Control

Published at 2025-10-13

#ML

The authors present InfiniHuman, a framework that generates large-scale, diverse, and detailed 3D human data at minimal cost, using existing foundation models. They introduce InfiniHumanData, a pipeline that creates a comprehensive dataset with realistic avatars, and InfiniHumanGen, a generative model for fast, high-quality avatar creation with precise control....

Read More

InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Published at 2025-10-13

#ML

The study presents the InternSVG family, which includes a large and comprehensive dataset for SVG tasks called SAgoge, a benchmark called SArena, and a unified model for SVG understanding, editing, and generation. This unified approach leads to improved performance in SVG tasks compared to existing methods....

Read More

Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States

Published at 2025-10-13

#ML

The authors propose a new method called Latent Refinement Decoding to improve parallel language modeling, addressing issues like information loss and premature commitment. This method uses a two-stage framework with Latent Refinement and a Predictive Feedback Loop, enhancing accuracy and speeding up the process across various tasks....

Read More

LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference

Published at 2025-10-13

#ML

The study presents LikePhys, a method to evaluate intuitive physics understanding in video diffusion models by comparing physically valid and impossible videos. Its metric, Plausibility Preference Error (PPE), aligns well with human preferences and outperforms existing evaluators, revealing that physics understanding improves with larger model capacity and better inference settings....
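
The preference test reduces to a simple count, sketched here in Python (the pairing of clips and the scalar likelihood scores are assumptions about the setup, not the paper's code):

from typing import List, Tuple

def plausibility_preference_error(pairs: List[Tuple[float, float]]) -> float:
    # Each pair holds the model's likelihood for a physically valid clip and
    # its matched impossible counterpart; the error is the fraction of pairs
    # where the model prefers the impossible one.
    wrong = sum(1 for valid, impossible in pairs if impossible >= valid)
    return wrong / len(pairs)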

Read More

QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

Published at 2025-10-13

#ML

The authors present a new framework called QeRL that improves the efficiency of reinforcement learning for large language models by using quantization and reducing memory usage. This framework not only speeds up the learning process but also helps discover better strategies and enables training a large model on a single high-end GPU....

Read More

ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

Published at 2025-10-13

#ML

The researchers developed a new framework called ReLook that uses a multimodal language model as a critic to improve front-end web development through a loop of generation, diagnosis, and refinement. This method outperforms existing approaches in creating visually correct and interactive web pages, thanks to its distinctive training and inference processes....

Read More

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Published at 2025-10-13

#ML

This study presents Vlaser, a model that integrates high-level reasoning with low-level control for embodied agents, which outperforms current models in various embodied reasoning tasks. The researchers also explore the impact of different Vision-Language Model initializations on the performance of Vision-Language-Action models, providing new insights and achieving top results on popular benchmarks....

Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit the developer's social media:

Facebook · X · LinkedIn