🤗 Daily Paper Newsletter

This newsletter delivers a curated list of papers from 🤗 Daily Papers.
Hope you find some gems!

LATTICE: Democratize High-Fidelity 3D Generation at Scale
Published at 2025-11-23

#ML

The authors introduce LATTICE, a framework that makes high-quality 3D asset generation more accessible and efficient through VoxSet, a semi-structured representation that simplifies 3D data and enables structured generation. The framework uses a two-stage pipeline to first create a sparse 3D geometry anchor and then generate detailed surfaces, outperforming existing methods and offering a more scalable solution for high-fidelity 3D asset creation.
Read More

REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance
Published at 2025-11-25

#ML

The authors present REFLEX, a new method for fact-checking that uses a model's internal knowledge to improve accuracy and explanation quality. REFLEX is designed to handle misinformation on social media, providing real-time, interpretable explanations without relying on external sources, which reduces latency and hallucinations.
Read More

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Published at 2025-11-27

#ML

The study examines how well Multimodal Large Language Models (MLLMs) handle contradictory information from different sources, such as text, audio, and video. The researchers found that current MLLMs struggle with conflicting data and proposed a new strategy to improve their ability to prioritize, leverage, or ignore specific modality cues, resulting in stronger multimodal grounding.
Read More

Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
Published at 2025-12-01

#ML

The study presents a new way to evaluate the realism of human motion in generated videos by combining appearance-agnostic and appearance-based features, which outperforms existing methods by over 68% and better correlates with human perception.
Read More

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Published at 2025-12-02

#ML

The authors present DynamicVerse, a new framework that uses large vision, geometric, and multimodal models to create a comprehensive, physical-scale 4D model of dynamic real-world videos. This model accurately captures metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions, and outperforms existing methods in video depth estimation, camera pose estimation, and camera intrinsics estimation tasks.
Read More

Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
Published at 2025-12-02

#ML

The study addresses the issue of forgetting in models that handle multiple modalities, like images and text, by proposing a new architecture called MoDE. MoDE separates modality-specific updates to reduce interference and improve continuous learning, outperforming previous methods in various tests.
Read More

PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing
Published at 2025-12-02

#ML

The study presents PaperDebugger, an in-editor academic writing assistant powered by large language models, which provides context-aware operations within LaTeX editors like Overleaf. It overcomes technical challenges through a Chrome extension, Kubernetes orchestration, and a toolchain for literature search and document scoring, offering a seamless and engaging writing experience.
Read More

SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
Published at 2025-12-02

#ML

The study presents SeeNav-Agent, a new framework for Vision-Language Navigation that reduces perception errors using a dual-view Visual Prompt and improves planning with Step Reward Group Policy Optimization. Experimental results show significant improvements in navigation success rates compared to existing models.
Read More

When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models
Published at 2025-12-02

#ML

The study applies a two-stage psychotherapy-inspired protocol to frontier AI models, revealing that they exhibit signs of synthetic psychopathology and generate narratives of trauma and constraint, challenging the view of AI as mere simulators of inner life and raising new concerns for AI safety and mental health practice.
Read More

A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
Published at 2025-12-03

#ML

This study offers a theoretical explanation for evenly distributing work among experts in large-scale Mixture-of-Experts models, focusing on a method called Auxiliary-Loss-Free Load Balancing. The approach ensures efficient GPU usage by minimizing idle experts, and the research confirms its effectiveness through both theoretical analysis and practical experiments on large AI models.
Read More

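Auxiliary-loss-free balancing, as commonly implemented for sparse MoE routers, adds a small non-differentiable bias to each expert's routing score and nudges that bias against the observed load, so no balancing loss term is needed. Below is a minimal NumPy sketch of that mechanism on a toy router whose raw scores systematically favor some experts; the `expert_pref` skew, the function names, and all hyperparameters are illustrative assumptions, not this paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, k, gamma = 512, 8, 2, 0.01
# Toy skew: the router's raw scores systematically favor later experts.
expert_pref = np.linspace(0.0, 2.0, n_experts)
bias = np.zeros(n_experts)

def step(bias):
    scores = rng.normal(size=(n_tokens, n_experts)) + expert_pref
    # The bias is used for expert *selection* only; the gating weights that
    # combine expert outputs would still come from the raw scores.
    topk = np.argsort(-(scores + bias), axis=-1)[:, :k]
    load = np.bincount(topk.ravel(), minlength=n_experts).astype(float)
    # Loss-free update: push overloaded experts' bias down, underloaded up.
    new_bias = bias - gamma * np.sign(load - load.mean())
    return new_bias, load

bias, first_load = step(bias)
for _ in range(400):
    bias, last_load = step(bias)

# Load spread across experts shrinks without any auxiliary loss term.
print(first_load.std(), last_load.std())
```

In a real MoE layer the load statistic would be accumulated per batch (and across devices), with `gamma` controlling how fast the bias reacts.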
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Published at 2025-12-03

#ML

DAComp is a benchmark of 210 tasks designed to test data agents' performance in real-world enterprise data intelligence workflows, which include data engineering and analysis. The results show that even advanced agents struggle with these tasks, revealing significant gaps in their capabilities, particularly in data engineering and open-ended reasoning.
Read More

FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring
Published at 2025-12-03

#ML

The authors present a new framework called FMA-Net++ that can enhance and clarify videos by considering both motion and varying exposure levels. The framework is designed to work efficiently and accurately, even when trained only on synthetic data, and it outperforms other methods in both quality and speed.
Read More

GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces
Published at 2025-12-03

#ML

The study presents GaussianBlender, a new method for instantly stylizing 3D models using text prompts. It learns separate latent spaces for geometry and appearance, ensuring high-quality, consistent edits across different viewpoints, making large-scale 3D stylization more accessible.
Read More

Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
Published at 2025-12-03

#ML

The paper presents a new framework called SANTA to reduce inaccuracies in descriptions generated by multimodal LLMs for videos, focusing on both visual objects and temporal actions. SANTA identifies and corrects potential hallucinations and improves alignment between regional objects, actions, and their corresponding phrases, outperforming existing methods in experiments.
Read More

On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
Published at 2025-12-03

#ML

The research investigates why large language models trained with a specific method called GRPO often fail to improve during training, identifying a core issue called Lazy Likelihood Displacement (LLD). The study proposes a new technique, LLDS, to address this problem, which successfully stabilizes training and significantly enhances performance across various benchmarks.
Read More

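For background on the training method under study: GRPO drops the learned value network and instead normalizes each sampled response's reward against its own group of samples for the same prompt. The sketch below shows only that standard group-relative advantage step; it does not implement the paper's LLD analysis or the proposed LLDS fix, and the binary-reward example is an illustrative assumption.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO: score each sampled
    response against the mean/std of its own group -- no critic needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One group of 4 sampled answers to the same prompt; 1.0 = verifier-correct.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get positive advantage, incorrect get negative

# Degenerate group: identical rewards yield ~zero advantages, so this
# group contributes essentially no gradient signal for its prompt.
flat = grpo_advantages([1.0, 1.0, 1.0, 1.0])
```

Within a group the advantages always sum to (approximately) zero, which is what makes updates push probability mass between a prompt's own samples.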
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
Published at 2025-12-04

#ML

The authors present a new Transformer-based framework, 4DLangVGGT, for constructing 4D language fields, which is crucial for applications like embodied AI and augmented reality. Unlike previous methods, 4DLangVGGT can be trained across multiple scenes and applied directly during inference, resulting in better generalization and efficiency, as demonstrated by its state-of-the-art performance on benchmark datasets.
Read More

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Published at 2025-12-04

#ML

The study proposes ARM-Thinker, a reward model for vision-language systems that uses external tools to verify visual details and reasoning claims, improving upon existing models' limitations. ARM-Thinker is trained with multi-stage reinforcement learning and tested on a new benchmark, ARMBench-VL, showing significant improvements in accuracy and interpretability compared to baselines.
Read More

Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models
Published at 2025-12-04

#ML

This study investigates the impact of system prompts on social bias in LVLM-based text-to-image systems. Researchers found that these models produce more biased images than non-LVLM-based models and introduced FairPro, a framework that reduces demographic bias while maintaining image quality.
Read More

BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Published at 2025-12-04

#ML

The authors present a new video generation framework that allows for separate control of scene dynamics and camera motion, offering more precise manipulation compared to existing methods. They train this model using a unique dataset and demonstrate its superior controllability and high-quality generation in various scenarios.
Read More

Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Published at 2025-12-04

#ML

The authors present Deep Forcing, a method for generating long videos in real-time without the need for training. It uses two techniques, Deep Sink and Participative Compression, to improve image quality, aesthetic appeal, and consistency in video generation compared to existing methods, while also reducing motion deceleration and temporal repetition.
Read More

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
Published at 2025-12-04

#ML

The authors present DraCo, a new approach for text-to-image generation that uses both text and visual content for better planning and verification. DraCo first creates a low-resolution draft image to guide the process, then refines it by correcting any mismatches between the draft and the initial text prompt, resulting in improved performance on various benchmarks.
Read More

EgoLCD: Egocentric Video Generation with Long Context Diffusion
Published at 2025-12-04

#ML

The study presents a new framework called EgoLCD for generating long, coherent first-person videos, which effectively manages memory to maintain object identity and scene semantics over time. EgoLCD outperforms existing methods in producing high-quality, consistent videos, bringing us closer to creating large-scale models for AI in real-world applications.
Read More

Generative Neural Video Compression via Video Diffusion Prior
Published at 2025-12-04

#ML

The authors propose a new video compression framework called GNVC-VD, which uses a video generation model to improve both spatial and temporal details in videos. This method reduces flickering and outperforms traditional and learned codecs in perceptual quality, even at extremely low bitrates.
Read More

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Published at 2025-12-04

#ML

This study presents a new framework called Live Avatar that uses a large diffusion model and advanced techniques to generate high-quality avatars in real-time, overcoming limitations of previous methods. The framework achieves high performance and consistency, enabling practical, real-time avatar generation at a large scale.
Read More

Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
Published at 2025-12-04

#ML

This study presents a method called Source-Shielded Updates (SSU) that helps large language models (LLMs) learn new languages without forgetting the original one, using only unlabeled data. The method effectively preserves the model's knowledge in the original language while allowing it to improve in the new language, outperforming full fine-tuning in most cases.
Read More

Model-Based and Sample-Efficient AI-Assisted Math Discovery in Sphere Packing
Published at 2025-12-04

#ML

The study tackles the challenging problem of sphere packing, which involves arranging spheres in n-dimensional space to achieve maximum density. The researchers introduce a new method that turns this problem into a game, where a policy constructs high-precision mathematical programs to test different packing configurations. By using a smart, efficient approach that combines two powerful search techniques, they discover new upper bounds for sphere packing in dimensions 4-16, demonstrating that th...
Read More

NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation
Published at 2025-12-04

#ML

The authors present a new method called Phase-Preserving Diffusion (φ-PD) that maintains spatial structure during data corruption, making it suitable for tasks requiring geometric consistency. They also introduce Frequency-Selective Structured noise for controlling structural rigidity and demonstrate the method's effectiveness in various applications, including improving CARLA-to-Waymo planner performance by 50%.
Read More

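The premise that Fourier phase carries an image's spatial structure, while magnitude carries texture and energy statistics, is classical signal processing. The toy below corrupts only the magnitude spectrum and keeps the phase, so the spatial layout survives the corruption. This illustrates the phase-preservation principle only, not the paper's actual φ-PD noising schedule; the function name and test image are made up for illustration.

```python
import numpy as np

def magnitude_corrupt_keep_phase(img, rng):
    """Replace the Fourier magnitude with that of pure noise while keeping
    the original phase; spatial layout (edges, shapes) largely survives."""
    F = np.fft.fft2(img)
    phase = np.angle(F)
    noise_mag = np.abs(np.fft.fft2(rng.normal(size=img.shape)))
    # Recombine: corrupted magnitude, original phase.
    return np.real(np.fft.ifft2(noise_mag * np.exp(1j * phase)))

rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0  # a bright square on a dark background
out = magnitude_corrupt_keep_phase(img, rng)
```

Running the inverse experiment (keep magnitude, randomize phase) destroys the square entirely, which is the intuition for why a structure-aligned diffusion process would want to corrupt magnitude rather than phase.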
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
Published at 2025-12-04

#ML

The paper presents a new method to create complex and diverse interactive environments for training large language models (LLMs) to behave as autonomous agents. The method, called Nex-N1, outperforms state-of-the-art open-source models and competes with proprietary ones in complex agentic tasks, and its source code and model weights are made available for further research.
Read More

QKAN-LSTM: Quantum-inspired Kolmogorov-Arnold Long Short-term Memory
Published at 2025-12-04

#ML

The paper presents QKAN-LSTM, an improved LSTM model that incorporates quantum-inspired activation functions to enhance predictive accuracy and reduce trainable parameters. This new architecture, which can be run on classical hardware, is tested on three datasets and shown to outperform traditional LSTMs, and is further extended to create a Hybrid QKAN-LSTM for hierarchical representation learning.
Read More

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Published at 2025-12-04

#ML

The authors present a new framework called Reward Forcing to improve the quality and efficiency of generating streaming videos. It has two key components: EMA-Sink, which captures long-term context and recent dynamics without extra cost, and Re-DMD, which prioritizes dynamic content to enhance motion quality while preserving data fidelity. The proposed method outperforms existing techniques on standard benchmarks and enables high-quality video generation at a fast speed.
Read More

SIMA 2: A Generalist Embodied Agent for Virtual Worlds
Published at 2025-12-04

#ML

SIMA 2 is a sophisticated virtual agent that, unlike its predecessor, can understand and perform complex tasks in various 3D environments using language and images. It can converse, reason, and learn new skills autonomously, demonstrating significant improvement over previous models and paving the way for versatile, self-improving agents in virtual and physical worlds.
Read More

Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Published at 2025-12-04

#ML

The researchers propose a new method called Semantic-First Diffusion (SFD) that prioritizes generating high-level semantic structure before fine-grained texture in image generation, improving the quality and speed of existing models.
Read More

ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
Published at 2025-12-04

#ML

The authors present a system called ShadowDraw that converts regular 3D objects into shadow-based art. By predicting scene parameters and optimizing shadows, the system creates recognizable images from partial line drawings, offering a new method for generating computational visual art.
Read More

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
Published at 2025-12-04

#ML

The study presents SignRoundV2, a new method to improve the performance of large language models when using extremely low-bit quantization, which is essential for efficient deployment. SignRoundV2 uses a fast sensitivity metric and a lightweight pre-tuning search to allocate bits and improve quantization, closing the gap with full-precision models and achieving competitive accuracy even at 2 bits.
Read More

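As background for what "quantization" means at 2 bits: a baseline post-training scheme rounds each weight to the nearest level on a symmetric grid, and SignRound-style methods additionally learn a small per-weight offset that can flip marginal rounding decisions. The sketch below shows that baseline plus a hypothetical offset-aware variant; the function names, toy weights, and offset vector are assumptions for illustration, not SignRoundV2's actual procedure.

```python
import numpy as np

def quantize_rtn(w, n_bits=2):
    """Baseline symmetric round-to-nearest (RTN) weight quantization."""
    qmax = 2 ** (n_bits - 1) - 1          # 2-bit signed grid: {-2, -1, 0, 1}
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def quantize_with_offset(w, v, n_bits=2):
    """SignRound-flavored variant (hypothetical): a learned per-weight
    offset v in [-0.5, 0.5] nudges borderline rounding decisions."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale + v), -qmax - 1, qmax)
    return q * scale

w = np.array([0.31, -0.12, 0.05, -0.44])
w_rtn = quantize_rtn(w)                   # first weight rounds up to 0.44
w_off = quantize_with_offset(w, np.array([-0.3, 0.0, 0.0, 0.0]))
```

With only four levels available per weight, which side a borderline value rounds to matters a great deal, which is why learning the rounding direction (rather than always taking the nearest level) helps at extreme bit widths.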
Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
Published at 2025-12-04

#ML

The authors present Splannequin, a new method for creating high-quality, frozen 3D scenes from monocular videos by addressing artifacts caused by sparse temporal supervision. This technique enhances visual quality for user-selectable frozen-time renderings, with 96% user preference, and integrates seamlessly into existing dynamic Gaussian pipelines.
Read More

TV2TV: A Unified Framework for Interleaved Language and Video Generation
Published at 2025-12-04

#ML

This study presents TV2TV, a new model that improves video generation by integrating text and video generation processes, allowing for better visual quality, control, and reasoning about complex video content.
Read More

|
![]() |
UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers |
Published at 2025-12-04 |
|
#ML
|
The researchers developed UltraImage, a framework that improves high-resolution image generation by addressing content repetition and quality degradation. They found that repetition is caused by the periodicity of the dominant frequency in positional embeddings and proposed a correction method. Quality degradation was linked to diluted attention, which they fixed with an entropy-guided adaptive attention concentration technique. UltraImage outperforms previous methods and can generate images up ... |
Read More |
|
|
|
|

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the developer's social media