🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation
Published at 2025-12-08

#ML

The researchers present a new framework called ContextAnyone that creates character-consistent videos from text and a single image by focusing on more than just facial identity. This method improves the integration of reference information and enhances the stability of creating videos with diverse motions and scenes, resulting in better visual quality and identity consistency compared to existing methods....
Read More

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
Published at 2025-12-11

#ML

The authors present CoSPlan, a benchmark for evaluating visual sequential planning in large-scale Vision-Language Models (VLMs) across four domains, focusing on error detection and correction. They also introduce SGI, a training-free method that improves VLMs' reasoning abilities in sequential planning tasks, resulting in a 5.2% average performance gain....
Read More

Hierarchical Dataset Selection for High-Quality Data Sharing
Published at 2025-12-11

#ML

This study presents a new method called DaSH that selects high-quality datasets for machine learning by considering their relevance, quality, and utility at both individual and group levels. The method significantly outperforms existing data selection techniques in accuracy and efficiency, making it ideal for practical multi-source learning workflows....
Read More

MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
Published at 2025-12-11

#ML

The authors have created a new dataset called MeViS, which has over 33,000 human-annotated motion expressions in both text and audio, for the purpose of improving video understanding based on motion expressions. They tested various methods on this dataset and found that existing methods struggle with this task, so they developed a new approach called LMPM++, which outperforms previous methods in motion expression-guided video understanding....
Read More

Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views
Published at 2025-12-14

#ML

The study presents Iceberg, a benchmark suite for evaluating Vector Similarity Search methods in real-world scenarios, focusing on three main sources of performance degradation: Embedding Loss, Metric Misuse, and Data Distribution Sensitivity. Iceberg assesses 13 state-of-the-art VSS methods across eight diverse datasets, revealing significant differences in their performance when evaluated based on application-level metrics rather than traditional recall-latency evaluations....
Read More

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Published at 2025-12-14

#ML

The study presents a new method called Scone that improves subject-driven image generation by combining composition and distinction. Scone allows for better identification and generation of the correct subject in complex visual settings, outperforming existing models in various composition and distinction tasks....
Read More

UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction
Published at 2025-12-14

#ML

The authors propose a new method called UAGLNet for extracting buildings from remote sensing images, which effectively captures and fuses both local and global visual semantics using a hybrid of CNN and transformer layers. This approach results in more accurate and less ambiguous building extractions compared to existing methods....
Read More

Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
Published at 2025-12-15

#ML

This study compares four methods for safely removing harmful responses from large language models, testing them across sixteen models. The results show that some methods preserve the model's capabilities better than others, with mathematical reasoning being the most affected, and provide guidance for choosing the best method based on the model architecture....
Read More

Differentiable Evolutionary Reinforcement Learning
Published at 2025-12-15

#ML

The authors present a new method called Differentiable Evolutionary Reinforcement Learning (DERL) that helps in designing better reward systems for training AI agents in complex tasks. DERL improves upon previous methods by being differentiable, which allows it to learn more effectively and adapt to new situations, ultimately leading to better performance in various tasks compared to other methods....
Read More

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Published at 2025-12-15

#ML

The study presents a method to transform pre-trained autoregressive language models into efficient diffusion language models without sacrificing accuracy, achieving faster speed and better performance compared to state-of-the-art models....
Read More

Feedforward 3D Editing via Text-Steerable Image-to-3D
Published at 2025-12-15

#ML

The authors have developed a new method called Steer3D, inspired by ControlNet, which allows for editing AI-generated 3D assets using text in a single forward pass. This approach is significantly faster than existing methods and maintains better consistency with the original 3D asset, making it easier to use in real-world applications like design, AR/VR, and robotics....
Read More

Janus: Disaggregating Attention and Experts for Scalable MoE Inference
Published at 2025-12-15

#ML

The authors present a new system called Janus that improves the efficiency of large Mixture-of-Experts models during inference by separating and managing attention and expert modules independently on different GPU clusters. Janus uses an adaptive communication scheme, a lightweight scheduler, and fine-grained resource management to reduce latency and increase throughput, outperforming existing systems by up to 3.9 times in per-GPU throughput....
Read More

MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
Published at 2025-12-15

#ML

This research presents MobileWorldBench, a benchmark for testing vision-language models as world models for mobile GUI agents, and MobileWorld, a large-scale dataset of 1.4M samples to enhance these models. They also propose a new framework that incorporates these semantic world models into mobile agents, improving their task success rates. (In simple terms: they created new tools and methods to help mobile agents understand and interact with graphical interfaces better, making them more efficient.)...
Read More

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Published at 2025-12-15

#ML

The authors present Nemotron-Cascade, a new method for creating general-purpose reasoning models using reinforcement learning. Instead of mixing different types of problems, Nemotron-Cascade handles them one domain at a time, making the training process more efficient and effective, allowing the model to outperform its teacher on various benchmarks....
Read More

Olmo 3
Published at 2025-12-15

#ML

The study presents Olmo 3, a new series of advanced, fully-open language models with 7B and 32B parameters, designed for tasks like long-context reasoning, coding, and chat. The researchers provide the complete model flow, including all stages, checkpoints, data, and dependencies, with the Olmo 3 Think 32B model being the most powerful fully-open thinking model released so far....
Read More

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Published at 2025-12-15

#ML

The authors present OpenDataArena, an open platform for evaluating the quality of datasets used to train large language models. This platform offers a fair comparison of datasets, a scoring system, and tools for training and evaluation, aiming to promote a better understanding of data's impact on model performance and foster the development of data-centric AI....
Read More

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Published at 2025-12-15

#ML

This study presents RoboTracer, a 3D-aware Vision-Language Model designed to enhance spatial reasoning in robots, enabling them to understand, measure, and refer to spatial concepts more accurately. The model, trained using a large-scale dataset of 30M QA pairs, outperforms existing methods in spatial tracing tasks, making it suitable for complex, long-term robotic tasks in various environments....
Read More

ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
Published at 2025-12-15

#ML

The authors propose a new task called creative table visualization, where a model generates an infographic from a given table. They introduce ShowTable, a pipeline that uses MLLMs and diffusion models to achieve high-fidelity results, and TableVisBench, a benchmark to evaluate performance on this task. Experiments show that their pipeline outperforms baselines, demonstrating effective multi-modal reasoning, generation, and error correction capabilities....
Read More

Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
Published at 2025-12-15

#ML

This study presents a new framework called Sparse-LaViDa that speeds up the sampling process of multimodal discrete diffusion models by eliminating unnecessary masked tokens during inference. The framework uses special tokens to maintain generation quality and a custom attention mask to ensure consistency between training and inference, resulting in up to 2x faster performance for tasks like text-to-image generation and image editing....
Read More

TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning
Published at 2025-12-15

#ML

The study presents a new approach called TraPO that combines a small labeled dataset with unlabeled samples to train reasoning models more efficiently and effectively than previous unsupervised methods. This method outperforms fully supervised models and existing unsupervised ones, even with significantly less labeled data, by ensuring only verified reasoning patterns are incorporated into the training process....
Read More

Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Published at 2025-12-15

#ML

The study presents a new benchmark for testing the realism of AI-generated ASMR videos, which are videos designed to induce relaxation. The benchmark uses real ASMR videos to create a dataset for training AI models, and then tests their ability to deceive both humans and video models. The results show that the best AI models can fool most video models but not human experts, and that adding audio can improve discrimination between real and fake videos....
Read More

A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning
Published at 2025-12-16

#ML

The authors present a new framework called A4-Agent that improves affordance prediction for embodied AI by using specialized foundation models in a three-stage pipeline, which outperforms existing methods and generalizes better to new objects and environments without requiring training on annotated datasets....
Read More

CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
Published at 2025-12-16

#ML

The authors present a new method called CRISP that accurately reconstructs human motion and scene geometry from a single video by fitting simple shapes to the scene and using human posture to fill in missing details. This approach significantly reduces failures in motion tracking and speeds up simulation, demonstrating its usefulness for robotics and AR/VR applications....
Read More

EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Published at 2025-12-16

#ML

The study presents EVOLVE-VLA, a new framework that allows Vision-Language-Action models to adapt in real-time through environment interaction, without relying on many demonstrations. This framework improves performance on long-horizon tasks, one-shot learning, and cross-task generalization, and enables the models to recover from errors and develop new strategies....
Read More

JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Published at 2025-12-16

#ML

The authors create JMMMU-Pro, an image-based Japanese benchmark for testing how well machines can understand images and text together, and introduce Vibe Benchmark Construction, a cost-effective method for building such benchmarks using an image generative model and human verification....
Read More

MMGR: Multi-Modal Generative Reasoning
Published at 2025-12-16

#ML

The study presents MMGR, a comprehensive evaluation framework for assessing the reasoning abilities of video and image generation models. MMGR tests models across five reasoning skills in various domains, revealing significant performance gaps and limitations in current models, particularly in abstract reasoning and long-term spatial planning....
Read More

MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Published at 2025-12-16

#ML

This study presents a method called MemFlow to improve long video narrative generation. It dynamically updates memory by selecting relevant historical frames and activates only the most relevant tokens during generation, ensuring consistency and efficiency with minimal computation burden....
Read More

RePo: Language Models with Context Re-Positioning
Published at 2025-12-16

#ML

The study presents a new method called RePo that improves language models by dynamically adjusting the position of contextual information, reducing unnecessary cognitive load and improving performance on tasks with noisy contexts, structured data, and longer context length....
Read More

RecGPT-V2 Technical Report
Published at 2025-12-16

#ML

The authors present RecGPT-V2, which improves upon its predecessor by introducing a hierarchical multi-agent system for efficient intent reasoning, a meta-prompting framework for diverse explanations, constrained reinforcement learning for better tag prediction and explanation acceptance, and an agent-as-a-judge framework for improved human preference alignment. These innovations result in significant improvements in CTR, IPV, TV, and NER in online A/B tests on Taobao, demonstrating the technical...
Read More

S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation
Published at 2025-12-16

#ML

The authors present a new method for unsupervised video instance segmentation that uses real video data only, unlike previous methods that relied on synthetic data. They establish temporal coherence by identifying high-quality keymasks using deep motion priors, and then use these keymasks to train a segmentation model with a new Sparse-To-Dense Distillation approach and Temporal DropLoss, resulting in a model that outperforms current state-of-the-art methods across various benchmarks....
Read More

SS4D: Native 4D Generative Model via Structured Spacetime Latents
Published at 2025-12-16

#ML

The authors propose a new method called SS4D that creates realistic moving 3D objects from single-view videos, directly training on 4D data for better detail, smooth transitions, and structural integrity. They tackle the lack of 4D training data by using a pre-existing 2D-to-3D model, ensure smooth motion with special layers that consider multiple frames, and compress the data for efficient processing, all while designing a training strategy to handle occlusions....
Read More

Spherical Leech Quantization for Visual Tokenization and Generation
Published at 2025-12-16

#ML

This study explores various non-parametric quantization methods and finds that the Leech lattice-based quantization, called Spherical Leech Quantization, offers a simpler training process and better image reconstruction compared to existing methods, even using fewer bits....
Read More

TAT: Task-Adaptive Transformer for All-in-One Medical Image Restoration
Published at 2025-12-16

#ML

This study presents a new framework called TAT, which improves medical image restoration by addressing issues like conflicting gradient updates and uneven optimization across different tasks. The framework uses two innovative strategies, task-adaptive weight generation and task-adaptive loss balancing, to achieve state-of-the-art performance in various medical image restoration tasks....
Read More

Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Published at 2025-12-16

#ML

The paper proposes a framework to improve animation of Scalable Vector Graphics (SVGs) using vision-language models (VLMs). By aggregating multiple weak part predictions, the framework recovers the semantic structure of SVGs, enabling VLMs to produce more coherent animations. Experiments show significant improvements over existing methods, highlighting the importance of semantic recovery for robust SVG animation....
Read More

VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
Published at 2025-12-16

#ML

The authors present VersatileFFN, a new feed-forward network that allows for flexible parameter reuse in both width and depth, addressing the memory costs of large language models. This method enhances architectural capacity without increasing model size, using two adaptive pathways for efficient processing of 'easy' and 'hard' tokens, and experiments show its effectiveness across various benchmarks and model scales....
Read More

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Published at 2025-12-16

#ML

The authors propose WorldPlay, a new streaming video diffusion model that allows for real-time, interactive world modeling with long-term geometric consistency. This is achieved through three key innovations: a Dual Action Representation for robust action control, a Reconstituted Context Memory to enforce long-term consistency, and Context Forcing, a novel distillation method for memory-aware models, enabling real-time speeds while preventing error drift....
Read More

Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
Published at 2025-12-16

#ML

The researchers developed a new framework called Zoom-Zero to improve the accuracy of answering questions about videos, especially in identifying the correct time and visual details. They achieved this by first finding relevant parts of the video and then zooming in on the most important frames, which helped reduce errors and improve answer accuracy by up to 6.4% in long videos....
Read More


Tags are generated by Google's Gemini Pro API, and the summaries are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media