🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection
Published at 2025-11-16

#ML

The authors present a new network, MSRNet, designed to better detect camouflaged objects in complex scenarios by using multi-scale features and recursive refinement, achieving state-of-the-art results on several benchmark datasets.
Read More

Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
Published at 2025-11-17

#ML

The study proposes M-GRPO, a new method for training a specialized language model for each agent in a multi-agent system, tackling optimization challenges such as agents acting at different frequencies, deployment across separate servers, and disrupted gradient flow. M-GRPO improves stability and efficiency, outperforming other methods on real-world benchmarks and making tool-augmented reasoning tasks more effective.
Read More

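The paper details the full multi-agent recipe; as background, the group-relative advantage normalization at the core of GRPO-style training can be sketched in a few lines (a minimal illustration, not the paper's implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled rollout's reward
    against the mean and standard deviation of its sampling group,
    removing the need for a learned value function."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rollouts scored the same: no learning signal for this group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Example: four rollouts sampled for one prompt.
print(group_relative_advantages([0.0, 1.0, 1.0, 2.0]))
```

Each agent's policy update then weights its token log-probabilities by these advantages; M-GRPO's contribution is making this stable when the agents train at different rates on different servers.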
Computer-Use Agents as Judges for Generative User Interface
Published at 2025-11-19

#ML

This study explores using Computer-Use Agents (CUAs) as judges to help coding-oriented language models (Coders) design more efficient and reliable graphical user interfaces (GUIs). The researchers developed a benchmark, synthesized tasks, and created a verifier to ensure the tasks are executable within their environment. They then proposed a framework in which the Coder generates and revises websites while the CUA evaluates their functionality and provides feedback, focusing on task solvability...
Read More

AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
Published at 2025-11-20

#ML

The study presents MinerU-HTML, a new method that casts web content extraction as a sequence labeling problem solved by a language model, outperforming existing methods at preserving document structure and improving downstream model performance. The researchers use MinerU-HTML to create AICC, a large, high-quality web corpus that significantly outperforms other corpora on various benchmarks.
Read More

Controllable Layer Decomposition for Reversible Multi-Layer Image Generation
Published at 2025-11-20

#ML

The authors propose a new method called Controllable Layer Decomposition (CLD) that allows for precise and controllable separation of raster images into distinct layers, which can then be edited independently. CLD outperforms existing methods in decomposition quality and controllability, and the separated layers can be used in popular design tools like PowerPoint.
Read More

EvoVLA: Self-Evolving Vision-Language-Action Model
Published at 2025-11-20

#ML

The authors present EvoVLA, a self-supervised Vision-Language-Action framework that improves long-horizon robotic manipulation tasks by preventing shortcuts, grounding curiosity, and stabilizing memory. EvoVLA outperforms the strongest baseline in simulation and real-world tasks, reducing stage hallucination and improving success rates and sample efficiency.
Read More

Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
Published at 2025-11-20

#ML

The authors propose Upsample Anything, a simple and fast method for increasing the resolution of features from Vision Foundation Models without retraining. The method is quick and effective across applications such as semantic segmentation and depth estimation.
Read More

Budget-Aware Tool-Use Enables Effective Agent Scaling
Published at 2025-11-21

#ML

The study presents a method that improves tool-using web search agents by giving them continuous awareness of their tool-call budget. The approach, called BATS, dynamically adjusts its strategy based on remaining resources, leading to better cost-performance scaling and a deeper understanding of scaling in tool-augmented agents.
Read More

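As a rough illustration of the idea (the names and interface below are hypothetical, not from the paper), a budget-aware agent loop keeps the remaining tool-call budget in the decision context at every step:

```python
def run_with_budget(task, decide, call_tool, max_calls):
    """Hypothetical budget-aware agent loop: `decide` sees the remaining
    tool-call budget on every step and can adapt its strategy, e.g.
    exploring broadly early and wrapping up as the budget runs out."""
    history = []
    for remaining in range(max_calls, 0, -1):
        action = decide(task, history, remaining)
        if action is None:  # the agent elects to answer with what it has
            break
        history.append((action, call_tool(action)))
    return history

# Toy policy: search twice, then stop regardless of remaining budget.
policy = lambda task, hist, rem: f"search:{task}" if len(hist) < 2 else None
trace = run_with_budget("quantum error correction", policy, lambda a: "results", 5)
print(len(trace))
```

The key contrast with a plain fixed-iteration loop is that `remaining` is passed into the decision itself, so the policy can trade off exploration against its shrinking budget.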
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Published at 2025-11-21

#ML

The study introduces M3-Bench, a new benchmark for evaluating how well artificial intelligence systems use tools across modalities such as vision and text in complex, multi-step tasks. The benchmark focuses on the accuracy of tool usage and the consistency of workflow execution, revealing areas where current AI systems can be improved.
Read More

Pillar-0: A New Frontier for Radiology Foundation Models
Published at 2025-11-21

#ML

The study presents Pillar-0, a radiology foundation model trained on a large dataset of CT and MRI scans from a major academic center. This model, combined with a new evaluation framework, outperforms existing medical models on various radiologic tasks and demonstrates its effectiveness beyond its pretraining, providing a strong foundation for building advanced radiology systems.
Read More

Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets?
Published at 2025-11-21

#ML

The study presents Target-Bench, a new evaluation tool for world models in robot path planning toward semantic targets in real-world environments. The authors find that current world models have limitations on this task, but fine-tuning a model on Target-Bench data significantly improves performance.
Read More

Extracting Interaction-Aware Monosemantic Concepts in Recommender Systems
Published at 2025-11-22

#ML

The study proposes a technique that uses a Sparse Autoencoder to identify clear, understandable concepts in user and item embeddings from recommendation systems. The method preserves user-item interactions and allows post-hoc control operations such as targeted filtering and content promotion, without altering the original model, and can be applied to various recommendation models and datasets.
Read More

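The core mechanism, a sparse autoencoder over embeddings, can be sketched in a few lines (an illustrative NumPy version with made-up dimensions; the paper's interaction-aware training objective adds more than this):

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Minimal sparse-autoencoder pass: a ReLU bottleneck turns an
    embedding into sparse (ideally monosemantic) concept activations,
    and a linear decoder reconstructs the original embedding."""
    acts = np.maximum(0.0, x @ W_enc + b_enc)  # sparse concept activations
    recon = acts @ W_dec + b_dec               # reconstruction of x
    return acts, recon

rng = np.random.default_rng(0)
d, k = 16, 64                  # embedding dim, concept dictionary size
x = rng.standard_normal(d)
acts, recon = sae_forward(x, rng.standard_normal((d, k)), -0.5 * np.ones(k),
                          rng.standard_normal((k, d)), np.zeros(d))
print(acts.shape, recon.shape)
```

Concepts then correspond to decoder rows, and zeroing or boosting an activation before decoding is the kind of post-hoc filtering and promotion the summary describes.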
Fidelity-Aware Recommendation Explanations via Stochastic Path Integration
Published at 2025-11-22

#ML

The study presents a new approach called SPINRec, which improves the accuracy of recommendation explanations by using stochastic sampling to generate realistic user profiles. The method outperforms existing techniques in various evaluations, offering more stable and personalized explanations for recommender systems.
Read More

Plan-X: Instruct Video Generation via Semantic Planning
Published at 2025-11-22

#ML

The paper presents Plan-X, a new framework that improves video generation by adding a step for understanding user instructions and planning the video content before generating it. This helps reduce mistakes and keeps the video aligned with the user's input, especially for complex scenes or actions.
Read More

UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Published at 2025-11-22

#ML

The authors present UltraFlux, a powerful text-to-image generation model that creates high-quality native 4K images across various aspect ratios. They achieve this by addressing limitations of diffusion transformers through a data-model co-design approach, resulting in a model that outperforms existing open-source baselines and competes with a proprietary model.
Read More

General Agentic Memory Via Deep Research
Published at 2025-11-23

#ML

The study presents GAM, a new framework that improves AI memory systems by constructing optimized contexts at runtime with a just-in-time approach. GAM's two-part design, a Memorizer and a Researcher, extends the capabilities of large language models, enabling better performance optimization and improved task completion in memory-grounded scenarios.
Read More

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Published at 2025-11-23

#ML

This research tackles the difficulty Vision-Language Models have with physics-driven reasoning in videos by proposing a method for interpreting physical-world context cues. The authors introduce MASS-Bench, a comprehensive benchmark for physics-related comprehension tasks, and MASS, a model-agnostic technique that enhances VLMs with spatial-temporal signals, outperforming prior state-of-the-art models by 8.7% and 6.0%.
Read More

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Published at 2025-11-24

#ML

The authors present AutoEnv, a framework that generates diverse and controllable environments at low cost, and introduce AutoEnv-36, a dataset of 36 environments for measuring cross-environment agent learning. Evaluating various learning methods on this dataset, they find that environment-adaptive selection of learning methods improves performance but does not scale indefinitely, highlighting the need for further research in this area.
Read More

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Published at 2025-11-24

#ML

The study presents a framework called Chain-of-Visual-Thought that enhances Vision-Language Models with continuous visual tokens, helping the models better handle visual information such as spatial reasoning and geometry. This approach improves the models' performance on various perception benchmarks by 3% to 16% without sacrificing efficiency, while making their reasoning more grounded and interpretable.
Read More

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Published at 2025-11-24

#ML

The authors present a new training method, RLER, that equips deep research models for long-form, multi-step research tasks. They apply this method to create DR Tulu-8B, an open-source model that outperforms existing open models and matches proprietary ones, while being more cost-effective. The team also shares all their data, models, and code to help future research.
Read More

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Published at 2025-11-24

#ML

The research introduces a new method for generating images more efficiently by decoupling the synthesis of high- and low-frequency details, resulting in faster training and better image quality than existing pixel diffusion models.
Read More

Flow Map Distillation Without Data
Published at 2025-11-24

#ML

This study proposes a new method for accelerating flow models that requires no external data, addressing the risk of mismatch between the teacher model and the data. The proposed framework predicts the teacher's sampling path and corrects its own errors, outperforming all data-based methods and setting a new state of the art for image generation on ImageNet.
Read More

HunyuanVideo 1.5 Technical Report
Published at 2025-11-24

#ML

Researchers created HunyuanVideo 1.5, an efficient video generation model that uses less memory and runs on consumer graphics cards, making high-quality video generation more accessible. The model uses a specialized architecture and techniques to produce state-of-the-art visual quality and motion coherence with only 8.3 billion parameters.
Read More

In-Video Instructions: Visual Signals as Generative Control
Published at 2025-11-24

#ML

The study explores using visual signals within video frames as instructions (In-Video Instruction) to control image-to-video generation, making control more precise and localized than text-based methods. Experiments on popular video generators show that they can accurately follow these visual instructions, especially in complex scenes with multiple objects.
Read More

MIST: Mutual Information Via Supervised Training
Published at 2025-11-24

#ML

This study presents a new method for estimating mutual information with neural networks that outperforms traditional methods while being faster and more reliable. The approach is flexible, efficient, and adaptable to various data types, making it useful within larger learning pipelines.
Read More

One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control
Published at 2025-11-24

#ML

One4D is a unified framework for generating and reconstructing dynamic 4D content, such as synchronized RGB frames and pointmaps, from single images, full videos, or sparse frames. It uses a novel approach called Decoupled LoRA Control to generate high-quality RGB frames and accurate pointmaps, contributing to general, high-quality 3D world modeling using video diffusion models.
Read More

PRInTS: Reward Modeling for Long-Horizon Information Seeking
Published at 2025-11-24

#ML

The authors present PRInTS, a new reward model that helps AI agents gather information over long horizons by scoring the quality of each step and summarizing the accumulated context. PRInTS improves the performance of various models on information-seeking tasks, even letting a smaller base agent outperform more advanced models.
Read More

Representational Stability of Truth in Large Language Models
Published at 2025-11-24

#ML

The study examines how well large language models (LLMs) distinguish between true, false, and neither-true-nor-false content, focusing on how familiarity and linguistic form affect this ability. Results indicate that LLMs struggle more with unfamiliar statements, whose truth judgments shift significantly, while familiar fictional statements are categorized more consistently, suggesting that representational stability in LLMs is influenced more by epistemic familiarity than linguistic form.
Read More

SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis
Published at 2025-11-24

#ML

The study presents SyncMV4D, a model that generates synchronized multi-view videos and 4D motions for hand-object interaction, overcoming the limits of current methods that rely on either single-view videos or high-quality 3D data. SyncMV4D co-generates videos and motions with a new model and aligner, and establishes a feedback loop between 2D appearance and 4D dynamics, achieving superior visual realism, motion plausibility, and multi-view consistency compared to existing methods.
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the developer's social media