🤗 Daily Paper Newsletter

This newsletter delivers a curated list of papers from 🤗 Daily Papers. Hope you find some gems!

Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click
Published at 2025-11-19

#ML

Click2Graph is a new interactive tool that allows users to understand video scenes better by combining human input with advanced AI technologies. It can segment, track, and analyze objects in videos, providing a clear and organized representation of the scene based on user guidance, and has shown promising results in user-guided Panoptic Video Scene Graph Generation.
Read More

CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
Published at 2025-11-24

#ML

This study presents a new evaluation method for vision-language models that checks whether they use visual tools faithfully, and introduces CodeV, a model trained to improve faithful visual reasoning by invoking visual tools through Python code, with a reward system that encourages correct tool usage. The result is better performance and more reliable tool use.
Read More
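
To make the "reward system that encourages correct tool usage" concrete, here is a minimal sketch of what a tool-aware reward could look like; the 0.8/0.2 weighting and the call format are illustrative assumptions, not CodeV's actual design.

```python
# Toy sketch of a tool-aware reward: the score combines answer correctness
# with whether the emitted tool calls actually executed. The weights and the
# "ok" field are illustrative guesses, not CodeV's actual reward.
def tool_aware_reward(answer: str, gold: str, tool_calls: list) -> float:
    correct = float(answer == gold)
    # Credit only tool calls that parsed and ran without error.
    valid = sum(1 for c in tool_calls if c["ok"]) / max(1, len(tool_calls))
    return 0.8 * correct + 0.2 * valid

print(tool_aware_reward("B", "B", [{"ok": True}, {"ok": False}]))  # 0.9
```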

Deep Research: A Systematic Survey
Published at 2025-11-24

#ML

The survey explains Deep Research, a method that combines large language models' reasoning with external tools, enabling them to perform complex tasks. It outlines a three-stage roadmap, key components, optimization techniques, and future challenges, aiming to guide further development in this rapidly evolving field.
Read More

Mixture of Horizons in Action Chunking
Published at 2025-11-24

#ML

The study finds that vision-language-action (VLA) models for robotic manipulation face a trade-off between long and short action chunk lengths, leading to suboptimal performance. The proposed solution, Mixture of Horizons (MoH), combines different action chunk lengths within a single model, improving both performance and generalizability and enabling efficient inference-time horizon selection.
Read More
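
As a rough illustration of mixing horizons inside one policy, the sketch below gates between chunk heads of different lengths; the horizon set, gating rule, and fusion over first actions are simplifications of my own, not the paper's exact MoH design.

```python
import torch
import torch.nn as nn

class MixtureOfHorizonsHead(nn.Module):
    """Toy policy head: one action-chunk predictor per horizon, softly fused."""
    def __init__(self, d_model: int, action_dim: int, horizons=(1, 4, 16)):
        super().__init__()
        self.horizons = horizons
        self.action_dim = action_dim
        # One chunk predictor per horizon: maps the state embedding to h actions.
        self.heads = nn.ModuleDict(
            {str(h): nn.Linear(d_model, h * action_dim) for h in horizons})
        # Learned gate weighting each horizon's contribution per state.
        self.gate = nn.Linear(d_model, len(horizons))

    def forward(self, state_emb: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(state_emb), dim=-1)   # (B, n_horizons)
        firsts = []
        for h in self.horizons:
            chunk = self.heads[str(h)](state_emb).view(-1, h, self.action_dim)
            firsts.append(chunk[:, 0])          # first action of each chunk
        stacked = torch.stack(firsts, dim=1)    # (B, n_horizons, action_dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)

head = MixtureOfHorizonsHead(d_model=256, action_dim=7)
print(head(torch.randn(2, 256)).shape)  # torch.Size([2, 7])
```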

PixelDiT: Pixel Diffusion Transformers for Image Generation
Published at 2025-11-25

#ML

The authors present PixelDiT, a new model for image generation that doesn't require a two-stage pipeline with a pretrained autoencoder, which can introduce errors. Instead, PixelDiT is a single-stage, end-to-end transformer-based model that learns directly in the pixel space, achieving better results on ImageNet 256x256 compared to existing pixel generative models and also performing well in text-to-image generation at higher resolutions.
Read More
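
The key structural point, that no pretrained autoencoder sits in front of the transformer, can be shown in a few lines: raw pixel patches are embedded directly as tokens. Patch size and widths below are illustrative, not PixelDiT's actual configuration.

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Embeds raw RGB patches directly -- no pretrained autoencoder in front."""
    def __init__(self, patch: int = 16, d_model: int = 768):
        super().__init__()
        # Strided conv patchify: each 16x16 RGB patch becomes one token.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.proj(img)                     # (B, d_model, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, n_patches, d_model)

tokens = PixelPatchEmbed()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 768])
```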

The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models
Published at 2025-11-25

#ML

This study examines how large language models (LLMs) use analogical reasoning, which is crucial for human cognition. The researchers found that while LLMs can encode relationships between analogous entities, they struggle to apply this information to new situations, and success in analogical reasoning depends on strong structural alignment between situations.
Read More

Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions
Published at 2025-11-26

#ML

The authors present HAGeo, a heuristic-based method for geometry theorem proving that outperforms neural network-based approaches, achieving gold-medal level performance on the IMO-30 benchmark. They also introduce HAGeo-409, a more challenging benchmark with 409 problems, to provide a more precise evaluation for geometry theorem proving.
Read More

Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models
Published at 2025-11-26

#ML

Researchers studied Masked Diffusion Language Models (MDLMs) and found that they rely heavily on local context and are negatively affected by many mask tokens, which they refer to as distractors. They then developed a new loss function to improve the models' ability to ignore these distractors, leading to better context comprehension.
Read More

ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Published at 2025-11-26

#ML

The researchers developed a method called ToolOrchestra to train small orchestrators that manage other models and tools, which can improve efficiency and intelligence in solving complex tasks. They created an 8B model called Orchestrator using ToolOrchestra, which outperforms GPT-5 in terms of accuracy and efficiency while being more cost-effective and aligning with user preferences.
Read More

C^2DLM: Causal Concept-Guided Diffusion Large Language Models
Published at 2025-11-27

#ML

The researchers present a new language model, C^2DLM, that improves upon existing diffusion language models by incorporating causal knowledge into its reasoning. By focusing on causal relationships between concepts, the model achieves better performance and faster training times on various tasks.
Read More

MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory
Published at 2025-11-27

#ML

The authors propose MG-Nav, a framework for visual navigation that combines long-term memory with short-term control. It uses a compact memory graph to store information about the environment, and a lightweight module to align the agent's view with the target, enabling state-of-the-art zero-shot performance and robustness in dynamic and unseen scenes.
Read More

Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Published at 2025-11-27

#ML

The study investigates how different Chain-of-Thought (CoT) designs impact visual reasoning in vision-language models, focusing on a maze-solving benchmark. The experiments reveal that shorter CoT with essential grounding steps outperforms longer CoT and generalizes better across different maze sizes, highlighting a 'short is long' effect.
Read More

BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
Published at 2025-11-28

#ML

The research presents BlockVid, a new framework for generating high-quality, coherent minute-long videos, addressing issues like error accumulation in existing methods. They also introduce LV-Bench, a detailed benchmark for evaluating long videos, and show that BlockVid significantly outperforms current techniques, improving video quality and clarity by 22.2% and 19.4% respectively.
Read More

DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
Published at 2025-11-28

#ML

The researchers created a new system called DualCamCtrl that improves camera-controlled video generation by simultaneously creating matching color and depth videos. They also developed a technique called SIGMA to better combine these two types of data, resulting in more accurate videos that follow the specified camera movements, with a significant reduction in errors compared to previous methods.
Read More

Ovis-Image Technical Report
Published at 2025-11-28

#ML

The Ovis-Image model is a compact, efficient text-to-image model optimized for high-quality text rendering, even on limited computational resources. It delivers performance comparable to much larger models and proprietary systems, using a strong multimodal backbone and a carefully designed, text-focused training recipe.
Read More

SimScale: Learning to Drive via Real-World Simulation at Scale
Published at 2025-11-28

#ML

The authors present a new simulation framework called SimScale that generates diverse and complex driving scenarios to improve autonomous driving systems. By using this framework, they achieve significant improvements in the robustness and generalization of planning methods on real-world benchmarks, without requiring additional real-world data.
Read More

SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds
Published at 2025-11-30

#ML

SimWorld is a new simulator for training and testing AI agents in realistic environments modeled on the real world, addressing the limitations of existing world simulators. It offers realistic world simulation, a rich interface for AI agents, and customizable reasoning scenarios, and its open-source availability aims to advance real-world agent intelligence across various fields.
Read More

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Published at 2025-11-30

#ML

The researchers present SwiftVLA, an efficient architecture that enhances a compact model with 4D understanding. It uses a pretrained 4D visual geometry transformer and a temporal cache to extract 4D features from 2D images, along with Fusion Tokens and a mask-and-reconstruct strategy to improve the model's performance on real and simulated environments, outperforming lightweight baselines and rivaling larger VLAs.
Read More

TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Published at 2025-11-30

#ML

This study presents a new method called TRivia, which allows vision-language models to learn table recognition from unlabeled images, eliminating the need for costly labeled data. The result is TRivia-3B, an open-source, high-performing table recognition model that outperforms existing systems on popular benchmarks.
Read More

WUSH: Near-Optimal Adaptive Transforms for LLM Quantization
Published at 2025-11-30

#ML

This research presents WUSH, a new method for optimizing the quantization of large language models by using tailored linear transforms that consider the data's statistics, leading to improved performance compared to traditional orthogonal transforms like Hadamard matrices.
Read More
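
A toy numpy experiment of the underlying idea: a transform tailored to the data's statistics can beat a fixed orthogonal rotation when quantizing. The whitening construction here is my own stand-in for WUSH's transforms, not the paper's derivation.

```python
import numpy as np

def quantize(x, bits=4):
    # Simple symmetric uniform quantizer over the whole tensor.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
# Correlated 2-D "activations" with one dominant direction, as in LLM layers.
cov = np.array([[4.0, 1.0], [1.0, 0.5]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

# Fixed orthogonal transform: the 2x2 normalized Hadamard matrix.
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

# Data-aware transform: whitening built from the empirical covariance.
evals, evecs = np.linalg.eigh(np.cov(X.T))
W = evecs @ np.diag(evals ** -0.5) @ evecs.T

for name, T in [("hadamard", H), ("data-aware", W)]:
    Y = quantize(X @ T.T)              # quantize in the transformed space
    X_hat = np.linalg.solve(T, Y.T).T  # undo the transform
    print(f"{name}: reconstruction MSE {np.mean((X_hat - X) ** 2):.5f}")
```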

Artemis: Structured Visual Reasoning for Perception Policy Learning
Published at 2025-12-01

#ML

The study presents Artemis, a perception-policy learning framework that uses structured visual reasoning instead of linguistic reasoning for visual perception tasks. Artemis represents each reasoning step with a label and bounding box, allowing for explicit tracking and direct supervision, which leads to improved performance on various perception tasks and general MLLM benchmarks.
Read More
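
One way to picture "each reasoning step is a label plus a bounding box" is as a supervisable record like the one below; the field names and the 0.5 IoU threshold are illustrative assumptions, not Artemis's actual scheme.

```python
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    label: str                  # e.g. "red mug" -- hypothetical field names
    bbox: tuple                 # (x1, y1, x2, y2) in pixels

def iou(a, b):
    # Intersection-over-union of two boxes, used as the grounding check.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

pred = ReasoningStep("red mug", (10, 10, 50, 60))
gold = ReasoningStep("red mug", (12, 8, 48, 58))
# A step counts as correct if both the label and the localization match.
print(pred.label == gold.label and iou(pred.bbox, gold.bbox) > 0.5)  # True
```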

DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models
Published at 2025-12-01

#ML

The study presents a new method called DiG-Flow to improve the performance of Vision-Language-Action models on robotic manipulation tasks. By using geometric regularization, DiG-Flow enhances model robustness and reduces errors in complex, multi-step tasks, making the models more reliable even with limited training data.
Read More

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
Published at 2025-12-01

#ML

The study presents FlashVGGT, a more efficient version of the Visual Geometry Grounded Transformer (VGGT), which addresses the scalability limits of its predecessor by using a descriptor-based attention mechanism. This method significantly reduces computational costs and enables online inference over long sequences, resulting in faster inference and better scalability without compromising reconstruction accuracy.
Read More
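
A caricature of the efficiency argument: compress the token history into a few descriptor tokens and let new frames attend to those instead of the full sequence. The average-pooling compression below is a placeholder of mine, not FlashVGGT's learned descriptor mechanism.

```python
import torch
import torch.nn.functional as F

def compress(tokens, n_desc=16):
    # tokens: (S, D) -> (n_desc, D) via average pooling over the sequence.
    return F.adaptive_avg_pool1d(tokens.T.unsqueeze(0), n_desc).squeeze(0).T

history = torch.randn(4096, 64)   # tokens accumulated from many past frames
desc = compress(history)          # 16 descriptors summarize the history
query = torch.randn(256, 64)      # tokens of the incoming frame
# New-frame tokens attend to 16 descriptors instead of 4096 history tokens.
attn = torch.softmax(query @ desc.T / 64 ** 0.5, dim=-1) @ desc
print(attn.shape)                 # torch.Size([256, 64])
```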

InnoGym: Benchmarking the Innovation Potential of AI Agents
Published at 2025-12-01

#ML

InnoGym is a new benchmark and framework that evaluates AI agents' innovation potential by measuring their performance gain and novelty in solving 18 real-world tasks. The study reveals a gap between AI agents' creativity and effectiveness, suggesting the importance of assessing both aspects in AI development.
Read More

PAI-Bench: A Comprehensive Benchmark For Physical AI
Published at 2025-12-01

#ML

PAI-Bench is a new tool for testing Physical AI, which measures how well models can understand and predict real-world actions. The study shows that current AI models can create realistic videos but struggle with accurately forecasting and explaining physical events, indicating that there's still much improvement needed in this field.
Read More

Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Published at 2025-12-01

#ML

The study presents Skywork-R1V4, a powerful model that combines visual operations with external knowledge retrieval to improve multimodal agentic systems. It achieves top performance in perception and multimodal search benchmarks without relying on costly reinforcement learning, demonstrating the potential for sophisticated artificial intelligence through careful supervised learning.
Read More

Understanding and Harnessing Sparsity in Unified Multimodal Models
Published at 2025-12-01

#ML

This study analyzes unified multimodal models to understand and reduce inefficiencies, finding that understanding components can be compressed without significant loss of performance, while generation components are more sensitive to compression. The researchers propose a Mixture-of-Experts (MoE) Adaptation technique that improves generation quality by partitioning the generation module into multiple experts and enabling sparse activation, resulting in a model that performs on par with the full model.
Read More
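
A minimal sketch of the general mechanism the blurb describes: a dense FFN partitioned into experts with sparse top-k routing. Expert count, k, and sizes are illustrative, not the paper's actual adaptation recipe.

```python
import torch
import torch.nn as nn

class SparseMoEFFN(nn.Module):
    """Toy MoE layer: route each token to its top-k experts only."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=4, k=1):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is a slice of the dense FFN's width.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff // n_experts),
                          nn.GELU(),
                          nn.Linear(d_ff // n_experts, d_model))
            for _ in range(n_experts))

    def forward(self, x):
        # x: (tokens, d_model)
        weights, idx = torch.topk(
            torch.softmax(self.router(x), dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e).any(-1)          # tokens routed to expert e
            if mask.any():
                w = weights[mask][idx[mask] == e].unsqueeze(-1)
                out[mask] += w * expert(x[mask])
        return out

moe = SparseMoEFFN()
print(moe(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```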

UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
Published at 2025-12-01

#ML

The researchers have created a new method for building datasets that improves the quality and scale of data for reasoning-enriched edits. They trained a 7B model to check and improve the data, resulting in a large dataset called UnicEdit-10M. They also developed a new benchmark, UnicBench, to better evaluate and understand the strengths and weaknesses of different models.
Read More

Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
Published at 2025-12-02

#ML

The researchers have created a new benchmark called VideoScience-Bench to test how well video models understand and apply scientific concepts like those taught in undergraduate physics and chemistry courses. They evaluated seven advanced video models and found that benchmark scores correlate strongly with human evaluations; the authors present it as the first benchmark to assess video models' ability to reason scientifically.
Read More

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Published at 2025-12-02

#ML

The CUDA-L2 system uses large language models and reinforcement learning to optimize matrix multiplication performance, surpassing popular and state-of-the-art libraries like cuBLAS and cuBLASLt by significant margins in both offline and server modes.
Read More
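
The propose-measure-reward contract can be shown in miniature. Below, a stub generator mutates tiling parameters and speedup over a cuBLAS baseline is the reward; the real system uses an LLM trained with reinforcement learning rather than this greedy random search, and every number here is fake.

```python
import random

random.seed(0)
CUBLAS_MS = 1.00   # stub latency of the vendor kernel

def propose_variant(best):
    # Stub for the LLM proposer: mutate the current best kernel's tile sizes.
    tm, tn = best
    return (max(16, tm + random.choice((-16, 0, 16))),
            max(16, tn + random.choice((-16, 0, 16))))

def benchmark_ms(variant):
    # Stub for compile-and-run: pretend 64x64 tiles are optimal on this GPU.
    tm, tn = variant
    return 0.85 + 0.0001 * ((tm - 64) ** 2 + (tn - 64) ** 2)

best, best_reward = (32, 32), 0.0
for _ in range(200):
    cand = propose_variant(best)
    reward = CUBLAS_MS / benchmark_ms(cand)    # >1.0 means faster than cuBLAS
    if reward > best_reward:
        best, best_reward = cand, reward
print(best, round(best_reward, 3))             # converges toward (64, 64)
```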

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Published at 2025-12-02

#ML

DeepSeek-V3.2 is a new and improved language model that uses an efficient attention mechanism called DSA to handle long contexts better while using less computational power. It also uses a scalable reinforcement learning framework to compete with top models like GPT-5 and Gemini-3.0-Pro, and has a unique data generation method for better tool-use and problem-solving skills in complex environments.
Read More
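
As a generic picture of sparse attention for long contexts, the sketch below lets each query keep only its top-k keys. This illustrates the idea, not DeepSeek's actual DSA design, and for clarity it still builds the full score matrix, which a real implementation avoids.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=64):
    # q, k, v: (batch, seq, dim). Each query attends to only its topk keys.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    kth = scores.topk(topk, dim=-1).values[..., -1:]       # per-query cutoff
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 512, 64)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([1, 512, 64])
```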

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation
Published at 2025-12-02

#ML

This study explores whether training a model to denoise both audio and video together can enhance video generation quality. By using a new architecture that combines pre-trained text-to-video and text-to-audio models, the researchers found that audio-video joint denoising not only improves synchrony but also video quality, especially for complex motions. The results suggest that incorporating audio as a signal helps the model understand the cause-and-effect relationships between visual events and the sounds they produce.
Read More

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
Published at 2025-12-02

#ML

The authors present a new simulation environment for research on GUI agent navigation, which allows for flexible design and access to environment information. They experiment with supervised fine-tuning, single-turn, and multi-turn reinforcement learning, finding that each step improves the agent's performance in navigating screens, with reinforcement learning showing particular advantages.
Read More

Glance: Accelerating Diffusion Models with 1 Sample
Published at 2025-12-02

#ML

This study presents a new method to speed up diffusion models used for image generation without sacrificing quality. By using two specialized adapters, Slow-LoRA and Fast-LoRA, the model can accelerate the denoising process, achieving up to 5 times faster inference while maintaining visual quality. The adapters are trained efficiently with only one sample on a single V100 GPU within an hour, yet they generalize well to unseen prompts.
Read More
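
To make the Slow-LoRA/Fast-LoRA idea concrete, here is a minimal sketch of two low-rank adapters sharing one frozen base layer and being swapped across denoising steps; the rank, switch point, and adapter placement are illustrative assumptions, not Glance's actual recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

base = nn.Linear(64, 64)                     # shared frozen layer
slow_lora = LoRALinear(base)                 # adapter for early, coarse steps
fast_lora = LoRALinear(base)                 # adapter for late, detail steps

def denoise_step(x, t, total_steps=20):
    # Hypothetical switch: which adapter handles which timesteps is a guess.
    layer = slow_lora if t < total_steps // 2 else fast_lora
    return layer(x)

print(denoise_step(torch.randn(1, 64), t=3).shape)  # torch.Size([1, 64])
```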

Guided Self-Evolving LLMs with Minimal Human Supervision
Published at 2025-12-02

#ML

The study presents a new framework, R-Few, that allows language models to improve themselves with minimal human input. This framework addresses common self-evolution issues and results in consistent performance gains on various tasks, even outperforming models trained with significantly more human data.
Read More

In-Context Sync-LoRA for Portrait Video Editing
Published at 2025-12-02

#ML

The study presents Sync-LoRA, a method for editing portrait videos that allows for high-quality visual modifications while ensuring precise synchronization with the original frames. This is achieved by training a model on paired videos with identical motion but different appearances, enabling it to preserve the subject's original behavior and generalize to various edits and identities.
Read More

MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
Published at 2025-12-02

#ML

The authors present MagicQuill V2, a new system that allows for precise and interactive image editing by using a layered approach, enabling users to control content, position, shape, and color separately, resulting in more intuitive and direct creative control.
Read More

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Published at 2025-12-02

#ML

The authors present a new framework called MultiShotMaster that creates narrative multi-shot videos with flexibility and control, addressing the limitations of current video generation techniques. They achieve this by integrating two novel RoPE variants, enabling text-driven inter-shot consistency, customized subjects with motion control, and background-driven scenes, all with flexible shot count and duration configurations.
Read More

RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Published at 2025-12-02

#ML

The study presents RULER-Bench, a benchmark for evaluating the rule-based reasoning abilities of video generation models, which focuses on cognitive rules and covers 40 tasks across two paradigms. Experiments reveal that current state-of-the-art models have room for improvement in reasoning capabilities, with a score of only 48.87% on the rule coherence metric.
Read More

ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
Published at 2025-12-02

#ML

The authors present a new method called ViSAudio, which can generate realistic and spatially immersive binaural audio directly from silent videos, without the need for a two-stage process. They also introduce a large dataset called BiAudio to support this task, and demonstrate that ViSAudio outperforms existing methods in both automated and human evaluations.
Read More

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
Published at 2025-12-02

#ML

The authors propose a new framework, Video4Spatial, which uses video diffusion models to perform complex spatial tasks like scene navigation and object grounding, all from video-only inputs. This framework demonstrates strong spatial understanding and generalization, bringing video generative models closer to human-like visuospatial reasoning.
Read More

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Published at 2025-12-02

#ML

The authors present a new system called WorldMM that uses different types of memory to understand and reason about long videos. This system outperforms existing methods by a significant margin, demonstrating its ability to handle complex video reasoning tasks more effectively.
Read More

YingVideo-MV: Music-Driven Multi-Stage Video Generation
Published at 2025-12-02

#ML

The study presents YingVideo-MV, a pioneering framework for generating music-performance videos with camera motions. It uses audio semantic analysis, an interpretable shot planning module, and temporal-aware diffusion Transformer architectures to create high-quality music videos from audio signals, with explicit camera motion control and enhanced continuity between clips.
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full papers are translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media