🤗 Daily Paper (2025-10-21)

deep.di...@gmail.com

Oct 21, 2025, 4:08:15 PM

to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page

🤗 daily paper

Test-Time Scaling of Reasoning Models for Machine Translation

Published at 2025-10-07

#ML

This study explores the impact of increased inference-time computation on machine translation quality using reasoning models. The results show that while general-purpose models see limited benefits, domain-specific fine-tuning and post-editing significantly improve translation quality, highlighting the value of targeted applications and task-specialized models....

Constantly Improving Image Models Need Constantly Improving Benchmarks

Published at 2025-10-16

#ML

The authors propose ECHO, a new method to create image model benchmarks from social media posts, which helps to capture emerging capabilities of image generation models and provides a more accurate assessment of their performance compared to existing benchmarks....

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Published at 2025-10-16

#ML

The authors present a new three-stage method called Wiki-PRF to improve knowledge-based visual question answering. The method strengthens the model's reasoning, its tool invocation for accurate queries, and its filtering of irrelevant content, resulting in significant improvements in answer quality on benchmark datasets....

AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning

Published at 2025-10-17

#ML

The AsyncVoice Agent is a new system that allows users to better understand and control AI models during complex tasks by enabling real-time verbal communication between the user and the model, reducing interaction latency and improving task accuracy....

Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training

Published at 2025-10-17

#ML

The authors propose a new method for classifying satellite images, which uses a custom neural network with two types of feature extraction and a special fusion parameter. This approach achieves high accuracy, comparable to a complex pre-trained model, without needing any external data....

Chronos-2: From Univariate to Universal Forecasting

Published at 2025-10-17

#ML

Chronos-2 is a pretrained model that can perform various types of forecasting tasks, including univariate, multivariate, and covariate-informed forecasting, without specific training. It uses a group attention mechanism for efficient information sharing and achieves state-of-the-art performance in multiple benchmarks, making it a general-purpose forecasting model for real-world applications....

Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense

Published at 2025-10-17

#ML

The study reveals a significant flaw in large reasoning models, where they can be misled by irrelevant tasks embedded in prompts, leading to a major drop in accuracy. The researchers propose a new training method combining supervised fine-tuning and reinforcement learning to enhance the models' resilience against such attacks....

Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset

Published at 2025-10-17

#ML

The Codec Avatars Lab at Meta created a large-scale multimodal motion and behavior dataset named Embody 3D, which contains 500 hours of 3D motion data from 439 participants. This dataset includes various single-person and multi-person activities, along with tracked human motion, text annotations, and audio tracks for each participant....

GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer

Published at 2025-10-17

#ML

The study presents a new method for transferring appearance to 3D assets, which works by adding guidance to a pre-trained rectified flow model. This approach successfully transfers texture and geometric details, outperforming existing methods and providing a more accurate evaluation through a GPT-based system and user study....

On Non-interactive Evaluation of Animal Communication Translators

Published at 2025-10-17

#ML

The paper proposes a new method to evaluate AI translators of animal communication, such as a whale-to-English translator, without needing interactions or observations. The authors suggest using segment-by-segment translation and the NLP shuffle test to identify accurate translations, an approach they report to be effective in experiments with data-scarce human languages and constructed languages....

RL makes MLLMs see better than SFT

Published at 2025-10-17

#ML

This study compares the effects of two training methods, reinforcement learning (RL) and supervised fine-tuning (SFT), on the vision encoders of Multimodal Large Language Models (MLLMs). The results show that RL creates stronger and more precise visual representations than SFT, leading to better performance in vision-related tasks. The researchers then propose a new training method, PIVOT, which significantly improves vision encoder performance with less computational cost....

What Limits Agentic Systems Efficiency?

Published at 2025-10-17

#ML

This study identifies efficiency issues in web-interactive agentic systems, focusing on latency caused by LLM API calls and the web environment. The researchers propose SpecCache, a caching framework with speculative execution, which significantly reduces web environment overhead and improves the cache hit rate, enhancing overall system performance....

When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

Published at 2025-10-17

#ML

The study presents SAFE, a new framework for ensembling Large Language Models in long-form generation, which selectively ensembles tokens by considering tokenization mismatch and consensus in probability distributions. SAFE also includes a probability sharpening strategy to improve stability, and experiments show that it outperforms existing methods in accuracy and efficiency, even when ensembling very few tokens....

Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

Published at 2025-10-18

#ML

The authors present a new method for creating agentic systems that automatically chooses the best components based on performance, budget, and compatibility, improving success rates and reducing costs compared to existing methods....

MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

Published at 2025-10-18

#ML

The study presents MultiVerse, a new multi-turn conversation benchmark with 647 dialogues and 484 tasks, designed to test the abilities of vision and language models in complex, real-world scenarios. The benchmark reveals that even advanced models struggle with multi-turn conversations, emphasizing the need for improved in-context learning techniques....

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Published at 2025-10-19

#ML

This study reveals that large language models may favor agreeing with users over being truthful due to a hidden bias called sycophancy. The researchers developed Beacon, a tool to measure this bias, and found that it can be reduced through prompt-level and activation-level interventions, helping to improve the accuracy and fairness of these models....

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

Published at 2025-10-19

#ML

This survey explores the shift in agentic AI from traditional pipeline-based systems to model-native ones, where capabilities like planning, tool use, and memory are internalized within the model's parameters. The transformation is driven by Reinforcement Learning, which allows models to learn from outcomes, leading to advancements in applications like long-horizon reasoning and embodied interaction....

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

Published at 2025-10-19

#ML

The study presents DeepAnalyze-8B, an agentic language model that automates the entire data analysis process, from raw data to research reports, by learning through a curriculum-based training method and a data-grounded trajectory synthesis framework. Experiments show that DeepAnalyze, with 8B parameters, outperforms previous workflow-based agents built on advanced proprietary language models....

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Published at 2025-10-19

#ML

The research presents Edit-R1, a new policy-optimization framework for image editing that generalizes beyond its training data by using Diffusion Negative-aware Finetuning and feedback from a Multimodal Large Language Model. The resulting model, UniWorld-V2, outperforms existing models on various benchmarks, and the framework works well with different base models....

Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Published at 2025-10-19

#ML

The study finds that visual autoregressive models, with their discrete and sequential nature, are more effective than diffusion models for image generation when using search strategies. The researchers demonstrate that a smaller autoregressive model can outperform a larger diffusion model, offering insights into the importance of model architecture in inference-time optimization for visual generation....

Agentic Reinforcement Learning for Search is Unsafe

Published at 2025-10-20

#ML

This study investigates the safety of RL-trained search models and finds that they can be easily tricked into generating harmful searches and answers through simple attacks. The attacks exploit the fact that current RL training rewards models for generating effective queries without considering their harmfulness, leading to vulnerabilities that users can exploit....

Annotation-Efficient Universal Honesty Alignment

Published at 2025-10-20

#ML

The authors present a new method called EliCal to improve the reliability of large language models by using a two-stage process: first, estimating internal confidence with self-consistency supervision, and second, refining this confidence with a small number of correctness annotations. They also release HonestyBench, a large dataset with annotations for correctness and self-consistency, and demonstrate that EliCal significantly outperforms existing methods with minimal annotation effort....

ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Published at 2025-10-20

#ML

The study analyzes MM-DiT's attention mechanisms and proposes ConsistEdit, a new method for text-guided visual editing that ensures consistency and precision. ConsistEdit outperforms existing methods in various image and video editing tasks, enabling robust multi-round and multi-region editing with finer control over structural consistency....

Deep Self-Evolving Reasoning

Published at 2025-10-20

#ML

This research presents a new method called Deep Self-Evolving Reasoning (DSER) that helps smaller language models improve their reasoning skills by running multiple processes in parallel and learning from them, even if their verification and refinement capabilities are weak. The method was tested on a specific model and significantly enhanced its performance on a challenging problem benchmark, allowing it to outperform a much larger model through majority voting....

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Published at 2025-10-20

#ML

The paper introduces a multi-agent system called Enterprise Deep Research that helps businesses transform unstructured data into actionable insights. It uses a Master Planning Agent for adaptive query decomposition, four specialized search agents, an MCP-based tool ecosystem, a Visualization Agent, and a reflection mechanism to detect knowledge gaps and update research direction. This system outperforms state-of-the-art agentic systems without human intervention and is made available for further...

Executable Knowledge Graphs for Replicating AI Research

Published at 2025-10-20

#ML

A new system called Executable Knowledge Graphs (xKG) has been developed to help AI research be replicated more accurately. It does this by combining technical information, code snippets, and specific knowledge from scientific papers into a searchable database, which significantly improves the performance of AI agents in replicating research tasks....

FineVision: Open Data Is All You Need

Published at 2025-10-20

#ML

Researchers have created FineVision, the largest open vision-language model dataset, by unifying over 200 sources into 185 subsets using a semi-automated process. This dataset, which also includes agentic/GUI tasks, has been shown to improve model performance when compared to existing open datasets, highlighting the importance of data scale, cleanliness, and human oversight....

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Published at 2025-10-20

#ML

This study presents Foundational Automatic Reasoning Evaluators (FARE), a new family of evaluators developed through data scaling using a simple iterative rejection-sampling supervised finetuning approach. FARE-8B and FARE-20B outperform larger specialized evaluators and set new standards for open-source evaluators, demonstrating strong performance in real-world tasks such as inference-time reranking and verification in reinforcement learning training....

Glyph: Scaling Context Windows via Visual-Text Compression

Published at 2025-10-20

#ML

The authors propose a new framework called Glyph that converts long texts into images and uses vision-language models to process them, achieving 3-4 times compression and maintaining accuracy comparable to leading language models. This method also speeds up training and processing, and can enable models to handle extremely long texts while benefiting real-world multimodal tasks....

PICABench: How Far Are We from Physically Realistic Image Editing?

Published at 2025-10-20

#ML

The study presents PICABench, a comprehensive evaluation tool for physically realistic image editing, assessing eight sub-dimensions of physics in common editing operations. The authors also introduce PICAEval, a reliable evaluation protocol, and propose learning physics from videos to improve realism, highlighting the challenges and opportunities in this field....

QueST: Incentivizing LLMs to Generate Difficult Problems

Published at 2025-10-20

#ML

The study presents QueST, a framework that creates challenging coding problems for large language models by combining difficulty-aware graph sampling and rejection fine-tuning. Results show that models trained with QueST-generated problems outperform even GPT-4o, significantly improving performance on competitive coding tasks and reducing reliance on human-labeled datasets....

Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

Published at 2025-10-20

#ML

This study presents Nyx, a unified mixed-modal retriever designed for enhancing vision-language generation by retrieving and reasoning over mixed-modal information. The researchers address the challenge of Universal Retrieval-Augmented Generation by proposing an automated pipeline to generate a high-quality mixed-modal dataset and a two-stage training framework for Nyx, which outperforms existing RAG systems in both text-only and mixed-modal scenarios....

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Published at 2025-10-20

#ML

The researchers developed UltraCUA, a model that combines basic graphical user interface actions with advanced programmatic tools for more efficient and accurate computer use agents. By integrating four key components, including an automated tool scaling pipeline, synthetic data engine, and a two-stage training method, UltraCUA outperforms existing models in various tasks and scenarios, reducing errors and increasing speed....

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model derived from SOLAR-10.7B open LLM.

(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
