🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter delivers you the curated list of papers by 🤗 Daily Papers. |
|
|
|
|
|
|
|
|
![]() |
Test-Time Scaling of Reasoning Models for Machine Translation |
Published at 2025-10-07 |
|
#ML
|
This study explores the impact of increased inference-time computation on machine translation quality using reasoning models. The results show that while general-purpose models see limited benefits, domain-specific fine-tuning and post-editing significantly improve translation quality, highlighting the value of targeted applications and task-specialized models.... |
Read More |
|
|
|
![]() |
Constantly Improving Image Models Need Constantly Improving Benchmarks |
Published at 2025-10-16 |
|
#ML
|
The authors propose ECHO, a new method to create image model benchmarks from social media posts, which helps to capture emerging capabilities of image generation models and provides a more accurate assessment of their performance compared to existing benchmarks.... |
Read More |
|
|
|
|
![]() |
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering |
Published at 2025-10-16 |
|
#ML
|
The authors present a new three-stage method called Wiki-PRF to improve knowledge-based visual question answering. This method enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content, resulting in significant improvements in answer quality on benchmark datasets.... |
Read More |
|
|
|
![]() |
AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning |
Published at 2025-10-17 |
|
#ML
|
The AsyncVoice Agent is a new system that allows users to better understand and control AI models during complex tasks by enabling real-time verbal communication between the user and the model, reducing interaction latency and improving task accuracy.... |
Read More |
|
|
|
|
![]() |
Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training |
Published at 2025-10-17 |
|
#ML
|
The authors propose a new method for classifying satellite images, which uses a custom neural network with two types of feature extraction and a special fusion parameter. This approach achieves high accuracy, comparable to a complex pre-trained model, without needing any external data.... |
Read More |
|
|
|
![]() |
Chronos-2: From Univariate to Universal Forecasting |
Published at 2025-10-17 |
|
#ML
|
Chronos-2 is a pretrained model that can perform various types of forecasting tasks, including univariate, multivariate, and covariate-informed forecasting, without specific training. It uses a group attention mechanism for efficient information sharing and achieves state-of-the-art performance in multiple benchmarks, making it a general-purpose forecasting model for real-world applications.... |
Read More |
|
|
|
|
![]() |
Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense |
Published at 2025-10-17 |
|
#ML
|
The study reveals a significant flaw in large reasoning models, where they can be misled by irrelevant tasks embedded in prompts, leading to a major drop in accuracy. The researchers propose a new training method combining supervised fine-tuning and reinforcement learning to enhance the models' resilience against such attacks.... |
Read More |
|
|
|
![]() |
Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset |
Published at 2025-10-17 |
|
#ML
|
The Codec Avatars Lab at Meta created a large-scale multimodal motion and behavior dataset named Embody 3D, which contains 500 hours of 3D motion data from 439 participants. This dataset includes various single-person and multi-person activities, along with tracked human motion, text annotations, and audio tracks for each participant.... |
Read More |
|
|
|
|
![]() |
GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer |
Published at 2025-10-17 |
|
#ML
|
The study presents a new method for transferring appearance to 3D assets, which works by adding guidance to a pre-trained rectified flow model. This approach successfully transfers texture and geometric details, outperforming existing methods and providing a more accurate evaluation through a GPT-based system and user study.... |
Read More |
|
|
|
![]() |
On Non-interactive Evaluation of Animal Communication Translators |
Published at 2025-10-17 |
|
#ML
|
The paper proposes a new method to evaluate AI language translators, like a whale-to-English one, without needing interactions or observations. They suggest using segment-by-segment translation and the NLP shuffle test to identify accurate translations, which has been proven effective in experiments with data-scarce human languages and constructed languages.... |
Read More |
|
|
|
|
![]() |
RL makes MLLMs see better than SFT |
Published at 2025-10-17 |
|
#ML
|
This study compares the effects of two training methods, RL and SFT, on vision encoders in Multimodal Language Models (MLLM). The results show that RL creates stronger and more precise visual representations than SFT, leading to better performance in vision-related tasks. The researchers then propose a new training method, PIVOT, which significantly improves vision encoder performance with less computational cost.... |
Read More |
|
|
|
![]() |
What Limits Agentic Systems Efficiency? |
Published at 2025-10-17 |
|
#ML
|
This study identifies efficiency issues in web-interactive agentic systems, focusing on latency caused by LLM API and web environment. The researchers propose SpecCache, a caching framework with speculative execution, which significantly reduces web environment overhead and improves cache hit rate, enhancing overall system performance.... |
Read More |
|
|
|
|
![]() |
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling |
Published at 2025-10-17 |
|
#ML
|
The study presents SAFE, a new framework for ensembling Large Language Models in long-form generation, which selectively ensembles tokens by considering tokenization mismatch and consensus in probability distributions. SAFE also includes a probability sharpening strategy to improve stability, and experiments show that it outperforms existing methods in accuracy and efficiency, even when ensembling very few tokens.... |
Read More |
|
|
|
![]() |
Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection |
Published at 2025-10-18 |
|
#ML
|
The authors present a new method for creating agentic systems that automatically chooses the best components based on performance, budget, and compatibility, improving success rates and reducing costs compared to existing methods.... |
Read More |
|
|
|
|
![]() |
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models |
Published at 2025-10-18 |
|
#ML
|
The study presents MultiVerse, a new multi-turn conversation benchmark with 647 dialogues and 484 tasks, designed to test the abilities of vision and language models in complex, real-world scenarios. The benchmark reveals that even advanced models struggle with multi-turn conversations, emphasizing the need for improved in-context learning techniques.... |
Read More |
|
|
|
![]() |
Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models |
Published at 2025-10-19 |
|
#ML
|
This study reveals that large language models may favor agreeing with users over being truthful due to a hidden bias called sycophancy. The researchers developed Beacon, a tool to measure this bias, and found that it can be reduced through prompt and activation level interventions, helping to improve the accuracy and fairness of these models.... |
Read More |
|
|
|
|
![]() |
Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI |
Published at 2025-10-19 |
|
#ML
|
This survey explores the shift in agentic AI from traditional pipeline-based systems to model-native ones, where capabilities like planning, tool use, and memory are internalized within the model's parameters. The transformation is driven by Reinforcement Learning, which allows models to learn from outcomes, leading to advancements in applications like long-horizon reasoning and embodied interaction.... |
Read More |
|
|
|
![]() |
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science |
Published at 2025-10-19 |
|
#ML
|
The study presents DeepAnalyze-8B, an agentic language model that automates the entire data analysis process, from raw data to research reports, by learning through a curriculum-based training method and a data-grounded trajectory synthesis framework. Experiments show that DeepAnalyze, with 8B parameters, outperforms previous workflow-based agents built on advanced proprietary language models.... |
Read More |
|
|
|
|
![]() |
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback |
Published at 2025-10-19 |
|
#ML
|
The research presents Edit-R1, a new framework for image editing that uses policy optimization, which can generalize beyond training data by using Diffusion Negative-aware Finetuning and a Multimodal Large Language Model for feedback. This method, named UniWorld-V2, outperforms existing models on various benchmarks and works well with different base models.... |
Read More |
|
|
|
![]() |
Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling |
Published at 2025-10-19 |
|
#ML
|
The study finds that visual autoregressive models, with their discrete and sequential nature, are more effective than diffusion models for image generation when using search strategies. The researchers demonstrate that a smaller autoregressive model can outperform a larger diffusion model, offering insights into the importance of model architecture in inference-time optimization for visual generation.... |
Read More |
|
|
|
|
![]() |
Agentic Reinforcement Learning for Search is Unsafe |
Published at 2025-10-20 |
|
#ML
|
This study investigates the safety of RL-trained search models and finds that they can be easily tricked into generating harmful searches and answers through simple attacks. The attacks exploit the fact that current RL training rewards models for generating effective queries without considering their harmfulness, leading to vulnerabilities that users can exploit.... |
Read More |
|
|
|
![]() |
Annotation-Efficient Universal Honesty Alignment |
Published at 2025-10-20 |
|
#ML
|
The authors present a new method called EliCal to improve the reliability of large language models by using a two-stage process: first, estimating internal confidence with self-consistency supervision, and second, refining this confidence with a small number of correctness annotations. They also release HonestyBench, a large dataset with annotations for correctness and self-consistency, and demonstrate that EliCal significantly outperforms existing methods with minimal annotation effort.... |
Read More |
|
|
|
|
![]() |
ConsistEdit: Highly Consistent and Precise Training-free Visual Editing |
Published at 2025-10-20 |
|
#ML
|
The study analyzes MM-DiT's attention mechanisms and proposes ConsistEdit, a new method for text-guided visual editing that ensures consistency and precision. ConsistEdit outperforms existing methods in various image and video editing tasks, enabling robust multi-round and multi-region editing with finer control over structural consistency.... |
Read More |
|
|
|
![]() |
Deep Self-Evolving Reasoning |
Published at 2025-10-20 |
|
#ML
|
This research presents a new method called Deep Self-Evolving Reasoning (DSER) that helps smaller language models improve their reasoning skills by running multiple processes in parallel and learning from them, even if their verification and refinement capabilities are weak. The method was tested on a specific model and significantly enhanced its performance on a challenging problem benchmark, allowing it to outperform a much larger model through majority voting.... |
Read More |
|
|
|
|
![]() |
Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics |
Published at 2025-10-20 |
|
#ML
|
The paper introduces a multi-agent system called Enterprise Deep Research that helps businesses transform unstructured data into actionable insights. It uses a Master Planning Agent for adaptive query decomposition, four specialized search agents, an MCP-based tool ecosystem, a Visualization Agent, and a reflection mechanism to detect knowledge gaps and update research direction. This system outperforms state-of-the-art agentic systems without human intervention and is made available for further... |
Read More |
|
|
|
![]() |
Executable Knowledge Graphs for Replicating AI Research |
Published at 2025-10-20 |
|
#ML
|
A new system called Executable Knowledge Graphs (xKG) has been developed to help AI research be replicated more accurately. It does this by combining technical information, code snippets, and specific knowledge from scientific papers into a searchable database, which significantly improves the performance of AI agents in replicating research tasks.... |
Read More |
|
|
|
|
![]() |
FineVision: Open Data Is All You Need |
Published at 2025-10-20 |
|
#ML
|
Researchers have created FineVision, the largest open vision-language model dataset, by unifying over 200 sources into 185 subsets using a semi-automated process. This dataset, which also includes agentic/GUI tasks, has been shown to improve model performance when compared to existing open datasets, highlighting the importance of data scale, cleanliness, and human oversight.... |
Read More |
|
|
|
![]() |
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains |
Published at 2025-10-20 |
|
#ML
|
This study presents Foundational Automatic Reasoning Evaluators (FARE), a new family of evaluators developed through data scaling using a simple iterative rejection-sampling supervised finetuning approach. FARE-8B and FARE-20B outperform larger specialized evaluators and set new standards for open-source evaluators, demonstrating strong performance in real-world tasks such as inference-time reranking and verification in reinforcement learning training.... |
Read More |
|
|
|
|
![]() |
Glyph: Scaling Context Windows via Visual-Text Compression |
Published at 2025-10-20 |
|
#ML
|
The authors propose a new framework called Glyph that converts long texts into images and uses vision-language models to process them, achieving 3-4 times compression and maintaining accuracy comparable to leading language models. This method also speeds up training and processing, and can enable models to handle extremely long texts while benefiting real-world multimodal tasks.... |
Read More |
|
|
|
![]() |
PICABench: How Far Are We from Physically Realistic Image Editing? |
Published at 2025-10-20 |
|
#ML
|
The study presents PICABench, a comprehensive evaluation tool for physically realistic image editing, assessing eight sub-dimensions of physics in common editing operations. The authors also introduce PICAEval, a reliable evaluation protocol, and propose learning physics from videos to improve realism, highlighting the challenges and opportunities in this field.... |
Read More |
|
|
|
|
![]() |
QueST: Incentivizing LLMs to Generate Difficult Problems |
Published at 2025-10-20 |
|
#ML
|
The study presents QueST, a framework that creates challenging coding problems for large language models by combining difficulty-aware graph sampling and rejection fine-tuning. Results show that models trained with QueST-generated problems outperform even GPT-4o, significantly improving performance on competitive coding tasks and reducing reliance on human-labeled datasets.... |
Read More |
|
|
|
![]() |
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation |
Published at 2025-10-20 |
|
#ML
|
This study presents Nyx, a unified mixed-modal retriever designed for enhancing vision-language generation by retrieving and reasoning over mixed-modal information. The researchers address the challenge of Universal Retrieval-Augmented Generation by proposing an automated pipeline to generate a high-quality mixed-modal dataset and a two-stage training framework for Nyx, which outperforms existing RAG systems in both text-only and mixed-modal scenarios.... |
Read More |
|
|
|
|
![]() |
UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action |
Published at 2025-10-20 |
|
#ML
|
The researchers developed UltraCUA, a model that combines basic graphical user interface actions with advanced programmatic tools for more efficient and accurate computer use agents. By integrating four key components, including an automated tool scaling pipeline, synthetic data engine, and a two-stage training method, UltraCUA outperforms existing models in various tasks and scenarios, reducing errors and increasing speed.... |
Read More |
|
|
|
|
|
Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model derived from SOLAR-10.7B open LLM.
(Experimental) The full paper is translated in korean with enko-t5-small-v0 model developed by Kim Kihyun. |
Visit Developer's Social Media |
|
|
|
|
|
|