🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Published at 2025-12-10

#ML

The authors propose LLaDA2.0, a method for building diffusion language models at up to 100 billion parameters by converting pre-trained autoregressive models into the diffusion paradigm. This approach, which includes a three-phase training process, yields two new models, LLaDA2.0-mini and LLaDA2.0-flash, that offer improved performance and efficiency compared to existing models.
Read More

Bidirectional Normalizing Flow: From Data to Noise and Back
Published at 2025-12-11

#ML

This study presents a new framework called Bidirectional Normalizing Flow (BiFlow) that enhances the generation quality and accelerates sampling in normalizing flows, a method for generative modeling, by learning an approximate inverse mapping instead of relying on an exact analytic inverse. Experiments show that BiFlow significantly outperforms its counterpart with causal decoding on ImageNet and achieves state-of-the-art results among normalizing-flow-based methods.
Read More

Insight Miner: A Time Series Analysis Dataset for Cross-Domain Alignment with Natural Language
Published at 2025-12-11

#ML

The authors present Insight Miner, a model that generates detailed descriptions of time-series data using a large-scale multimodal model and a new dataset called TS-Insights. This approach helps in understanding time-series data more efficiently without requiring deep domain expertise.
Read More

Coupled Variational Reinforcement Learning for Language Model General Reasoning
Published at 2025-12-14

#ML

The study presents a new method called Coupled Variational Reinforcement Learning (CoVRL) that improves the reasoning abilities of language models. CoVRL efficiently explores solutions and maintains a strong connection between reasoning steps and final answers by combining variational inference and reinforcement learning, resulting in significant performance improvements over existing methods.
Read More

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Published at 2025-12-15

#ML

The authors present Seedance 1.5 pro, a model for creating synchronized audio-visual content using a special architecture and optimization techniques, resulting in high-quality, professional-grade content with features like lip-syncing and dynamic camera control.
Read More

Sharing State Between Prompts and Programs
Published at 2025-12-16

#ML

The study proposes a new programming concept, shared program state, which allows natural-language code to interact directly with program variables and objects, eliminating the manual work otherwise needed for interoperability between natural language and formal languages. The researchers implemented this concept in the Nightjar programming system, which enables Python programs to contain natural-language code that shares the Python program state, achieving higher task accuracy and decreasing the lines...
Read More

Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
Published at 2025-12-16

#ML

The study presents Vibe Space, a new method that helps create new visual concepts by smoothly transitioning between distinct ideas based on their shared attributes, or 'vibe'. This approach is more effective than current methods in generating coherent and creative hybrids, as evaluated by a combination of human judgments, AI reasoning, and geometric path-based difficulty scores.
Read More

EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration
Published at 2025-12-17

#ML

The study presents EmoCaliber, a confidence-aware multimodal large language model designed to improve the reliability of visual emotion comprehension. It does this by enabling the model to express its confidence in emotion predictions, which helps account for the subjectivity of emotions and provides users with a better understanding of the model's self-assessed competence.
Read More

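Confidence calibration of the kind EmoCaliber targets is usually measured by comparing stated confidence with empirical accuracy. A minimal sketch of a binned expected calibration error (plain Python; an illustrative metric, not EmoCaliber's actual evaluation code):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.
    `confidences` are in [0, 1]; `correct` are 0/1 outcomes."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)   # clamp c == 1.0 into the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that verbalizes 95% confidence but is wrong every time scores a large gap; a well-calibrated one scores near zero.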
Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
Published at 2025-12-17

#ML

The researchers created Nemotron-Math, a large dataset with 7.5M mathematical reasoning traces, combining structured competition tasks and diverse real-world queries to improve the quality and generalization of mathematical reasoning models. They also developed a strategy to speed up long-context training, enabling state-of-the-art performance on various mathematical reasoning tasks.
Read More

TabReX: Tabular Referenceless eXplainable Evaluation
Published at 2025-12-17

#ML

The authors present TabReX, a new method for evaluating tables generated by large language models, which converts text and tables into knowledge graphs and calculates scores based on their alignment. They also introduce TabReX-Bench, a large-scale benchmark for testing the metric's robustness, and show that TabReX outperforms existing methods and provides detailed error analysis.
Read More

AdaTooler-V: Adaptive Tool-Use for Images and Videos
Published at 2025-12-18

#ML

The authors present AdaTooler-V, a multimodal large language model that determines when to use visual tools for problems, reducing unnecessary tool use and improving performance. They introduce AT-GRPO, a reinforcement learning algorithm that adjusts tool use based on a Tool Benefit Score, and create two datasets for training. Experiments show that AdaTooler-V outperforms existing methods, including GPT-4o and Gemini 1.5 Pro, on various visual reasoning tasks.
Read More

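At its core, adaptive tool use reduces to a decision rule: call a visual tool only when its estimated benefit outweighs its cost. A toy sketch of such a gate (plain Python; the scorer, cost, and function names are illustrative assumptions, not the paper's AT-GRPO algorithm):

```python
def should_use_tool(benefit_score, tool_cost, threshold=0.0):
    """Gate a tool call on its net estimated benefit."""
    return benefit_score - tool_cost > threshold

def answer(question, base_model, tool, scorer):
    """Route between direct answering and tool-augmented answering."""
    score = scorer(question)            # estimated accuracy gain from calling the tool
    if should_use_tool(score, tool_cost=0.1):
        return base_model(question, context=tool(question))
    return base_model(question, context=None)
```

In the paper's setting the benefit estimate is learned via reinforcement learning rather than hand-set, but the routing logic follows this shape.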
Adaptation of Agentic AI
Published at 2025-12-18

#ML

The paper presents a unified framework for adapting agentic AI systems and tools, which helps in understanding, comparing, and selecting different adaptation strategies. It also reviews various approaches, analyzes their strengths and limitations, and identifies future opportunities for building better agentic AI systems.
Read More

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Published at 2025-12-18

#ML

The authors present a new framework called Alchemist, which uses meta-gradients to choose the best samples from large text-to-image datasets. This method improves the visual quality and performance of text-to-image models by automatically selecting important data, making training more efficient without needing manual curation or heuristic scoring.
Read More

DeContext as Defense: Safe Image Editing in Diffusion Transformers
Published at 2025-12-18

#ML

The paper presents a new method called DeContext to protect images from unauthorized editing in diffusion transformers. By injecting small perturbations into specific model layers, DeContext disrupts the flow of contextual information, effectively preventing unwanted edits without sacrificing image quality.
Read More

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Published at 2025-12-18

#ML

The authors create a model for estimating depth in panoramic images that works well in various scenes. They build a large dataset using real and synthetic images, improve the model's generalization using a three-stage curation process, and test the model on multiple benchmarks, showing strong performance and zero-shot generalization.
Read More

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Published at 2025-12-18

#ML

The authors present AuditDM, a new framework that improves the evaluation of multimodal large language models (MLLMs) by identifying and fixing their weaknesses. By using reinforcement learning, AuditDM creates challenging questions and images to reveal model failures, which can then be used to improve model performance on various benchmarks.
Read More

EasyV2V: A High-quality Instruction-based Video Editing Framework
Published at 2025-12-18

#ML

The authors propose a new framework called EasyV2V for improving video editing based on instructions. They enhance video editing quality by creating diverse video pairs, simplifying the model design, and providing better control over the editing process, resulting in superior video editing outcomes compared to current and commercial systems.
Read More

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
Published at 2025-12-18

#ML

The study investigates the exploration-exploitation trade-off in RLVR, a framework that helps large language models reason better. The researchers found that spurious rewards and entropy minimization both improve reasoning performance by making the model more confident, even though the two techniques seem to contradict each other. The study also proposes a new model to explain why spurious rewards work well.
Read More

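Entropy minimization in this setting typically means penalizing the average entropy of the model's next-token distributions, pushing it toward more confident outputs. A minimal sketch (plain Python; the softmax helper and coefficient are illustrative, not the paper's implementation):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    p = softmax(logits)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_min_loss(per_token_logits, coeff=0.01):
    """Penalty added to the training loss: mean token entropy times a coefficient.
    Minimizing it sharpens (concentrates) the model's predictions."""
    ents = [entropy(logits) for logits in per_token_logits]
    return coeff * sum(ents) / len(ents)
```

A uniform distribution over n tokens has entropy log(n), the maximum; a peaked distribution has entropy near zero, so minimizing this term directly increases confidence.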
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
Published at 2025-12-18

#ML

FlashPortrait is a new method that creates long, consistent portrait videos six times faster than current techniques. It combines a video diffusion transformer with a dynamic sliding-window scheme that predicts future latents to skip denoising steps, ensuring smooth transitions and identity consistency.
Read More

FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering
Published at 2025-12-18

#ML

The authors present FrameDiffuser, a neural rendering framework that creates realistic and consistent frames for interactive applications by conditioning on G-buffer data and its own previous output. By specializing training to individual environments, the method remains efficient enough for consumer gaming setups, unlike other diffusion-based approaches.
Read More

Generative Refocusing: Flexible Defocus Control from a Single Image
Published at 2025-12-18

#ML

This study presents a new method called Generative Refocusing that allows for flexible defocus control from a single image. The method uses two neural networks, DeblurNet and BokehNet, and a semi-supervised training approach that combines synthetic and real bokeh images to achieve superior performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks.
Read More

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Published at 2025-12-18

#ML

This study tests the performance of SpeechLLMs, which directly translate spoken language, against traditional cascaded systems. The results show that while cascaded systems are more reliable overall, SpeechLLMs can match their performance in certain situations, emphasizing the importance of integrating LLMs for high-quality speech translation.
Read More

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
Published at 2025-12-18

#ML

This study presents a simpler and more efficient method, JustRL, for training large language models compared to current complex methods. The new approach achieves top performance using fewer resources and stable training, suggesting that the field may be unnecessarily complicating model training.
Read More

Kling-Omni Technical Report
Published at 2025-12-18

#ML

The Kling-Omni framework is a unified system for creating high-quality videos from various inputs like text and images, using a comprehensive data system, efficient pre-training, and infrastructure optimizations. It excels in generating videos based on context, performing reasoning-based edits, and following multimodal instructions, making it a significant step towards advanced multimodal world simulators.
Read More

Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation
Published at 2025-12-18

#ML

The authors present a new method called Make-It-Poseable for posing 3D characters, which addresses limitations of existing techniques by transforming character posing into a latent-space problem. This approach manipulates shape tokens based on skeletal motion, providing precise control and high-quality results, and can also be used for 3D editing applications.
Read More

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
Published at 2025-12-18

#ML

The authors present MomaGraph, a comprehensive scene representation for robots in homes that combines spatial, functional, and interactive elements. They also introduce MomaGraph-Scenes, a large-scale dataset for task-driven scene graphs, and MomaGraph-R1, a 7B vision-language model that can predict task-oriented scene graphs and plan tasks, outperforming other open-source models by 11.4%.
Read More

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Published at 2025-12-18

#ML

The researchers present Multimodal RewardBench 2, a new evaluation tool for models that understand and generate both text and images. They test various state-of-the-art models on this benchmark and find that while human performance is over 90%, the best models like Gemini 3 Pro and GPT-5 reach 75-80% accuracy, outperforming GPT-4o.
Read More

N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
Published at 2025-12-18

#ML

The authors present a new framework, N3D-VLM, which enhances vision-language models by adding native 3D object perception and reasoning capabilities. This allows the model to accurately locate objects in 3D space based on textual descriptions and perform interpretable spatial understanding, outperforming existing methods in 3D spatial reasoning tasks.
Read More

Next-Embedding Prediction Makes Strong Vision Learners
Published at 2025-12-18

#ML

This study proposes a new approach for self-supervised visual learning by training models to generate embeddings for predictive tasks, instead of using traditional methods like pixel reconstruction or contrastive loss. The proposed method, called Next-Embedding Predictive Autoregression (NEPA), achieves strong results across various tasks, including image classification and semantic segmentation, with simple Transformer architectures.
Read More

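Next-embedding prediction treats a sequence of patch embeddings the way a language model treats tokens: from a causal prefix, predict the next embedding directly in embedding space. A toy sketch of such an objective (plain Python; the 1 − cosine loss and the trivial predictor are illustrative assumptions, not NEPA's actual architecture):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def next_embedding_loss(embeddings, predict):
    """Average (1 - cosine) between the predicted and the actual next
    embedding, teacher-forced over the sequence; `predict` only ever
    sees the causal prefix."""
    losses = []
    for t in range(len(embeddings) - 1):
        pred = predict(embeddings[: t + 1])
        losses.append(1.0 - cosine(pred, embeddings[t + 1]))
    return sum(losses) / len(losses)

# Trivial baseline predictor for illustration: repeat the last embedding.
identity_predictor = lambda prefix: prefix[-1]
```

In the paper a Transformer plays the role of `predict`; the point of the sketch is only that the target lives in embedding space rather than pixel space.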
REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion
Published at 2025-12-18

#ML

The study presents REGLUE, a framework that enhances image synthesis by combining VAE image latents, local VFM semantics, and a global [CLS] token in a single SiT backbone. By nonlinearly aggregating multi-layer VFM features and entangling them with VAE latents, REGLUE improves image quality and accelerates convergence compared to existing methods.
Read More

RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Published at 2025-12-18

#ML

The research proposes a new method called RePlan for editing images based on complex instructions. It uses a two-step process: a planner that breaks down instructions into smaller tasks and maps them to specific image areas, followed by an editor that makes precise changes to those areas without needing to repeatedly fill in missing details. This approach significantly improves the accuracy and reliability of image editing compared to other methods, even when dealing with detailed and knowledge-...
Read More

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Published at 2025-12-18

#ML

This study presents a new method called StereoPilot that efficiently creates 3D videos by directly synthesizing the target view, avoiding the need for separate depth maps or complex multi-stage processes. The method is trained using a large-scale unified dataset called UniStereo, which covers both stereo formats, and it outperforms existing techniques in terms of visual quality and computational efficiency.
Read More

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Published at 2025-12-18

#ML

WorldCanvas is a new framework that allows users to create detailed and controlled world events by using a mix of text, motion paths, and reference images. This method can generate realistic videos with multiple characters, object appearances, and unexpected events, improving the way we interact with and simulate virtual environments.
Read More

Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
Published at 2025-12-18

#ML

The study presents a new attention mechanism called Log-linear Sparse Attention (LLSA) that significantly improves the efficiency of visual generation models like Diffusion Transformers. LLSA reduces the computational cost from quadratic to log-linear by using a hierarchical structure, allowing for faster attention inference and training while maintaining image generation quality.
Read More

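The log-linear cost of hierarchical attention schemes comes from attending to a small window of blocks at each of roughly log2(n) pooling levels instead of all n keys. A toy index-set sketch (plain Python; the binary pooling and window size are illustrative assumptions, not LLSA's exact scheme):

```python
def hierarchical_keys(query_idx, n, window=2):
    """Key positions one query attends to when the length-n sequence is
    pooled by 2 at each level: a small window of blocks around the
    query's own block per level, giving O(log n) keys instead of n.
    Returns (level, block_index) pairs."""
    keys = []
    level, size = 0, n
    while size >= 1:
        block = query_idx >> level            # the query's block at this level
        lo = max(0, block - window)
        hi = min(size - 1, block + window)
        keys.extend((level, b) for b in range(lo, hi + 1))
        if size == 1:
            break
        level += 1
        size = (size + 1) // 2
    return keys
```

For n = 1024 a query touches a few dozen (level, block) pairs instead of 1024 keys, which is how the per-query cost drops from O(n) to O(log n) and the total from O(n^2) to O(n log n).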
VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
Published at 2025-12-18

#ML

The authors present VenusBench-GD, a large-scale, cross-platform benchmark for GUI grounding with extensive coverage of applications and rich annotated data. This benchmark introduces a hierarchical task taxonomy for evaluating models from complementary perspectives, and experimental results show that general-purpose multimodal models perform well on basic grounding tasks, while advanced tasks still favor GUI-specialized models.
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the Developer's Social Media