🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Multimodal Evaluation of Russian-language Architectures
Published at 2025-11-19
#ML

The study presents MERA Multi, an open framework for evaluating Russian-language multimodal models, with 18 custom-built tasks covering text, image, audio, and video modalities. The framework offers a universal taxonomy of multimodal abilities, Russian-specific datasets, baseline results for various models, and a methodology to prevent benchmark leakage. The approach could be replicated in other typologically diverse languages, especially within the Slavic family.

MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
Published at 2025-11-21
#ML

The authors present MobileVLA-R1, a framework that helps quadruped robots understand and follow natural language instructions by improving the connection between high-level reasoning and low-level actions. They built a large dataset and a two-stage training method, which yield better performance than existing methods across various tasks and in real-world deployment.

Frequency-Adaptive Sharpness Regularization for Improving 3D Gaussian Splatting Generalization
Published at 2025-11-22
#ML

The authors address the limited generalization of 3D Gaussian Splatting (3DGS) to novel viewpoints by proposing Frequency-Adaptive Sharpness Regularization (FASR). FASR improves 3DGS's reconstruction of fine, high-frequency details and outperforms other methods across various datasets.

RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale
Published at 2025-11-22
#ML

The researchers present RAISECity, a system that creates detailed, large-scale 3D cities using a multimodal agent framework. The method improves the quality, realism, and scale of generated 3D worlds compared to existing techniques, making it useful for applications such as virtual reality and artificial intelligence.

Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
Published at 2025-11-23
#ML

The paper explores the challenges of building fair, robust, and representative AI systems with Reinforcement Learning from Human Feedback (RLHF). It shows that achieving all three goals simultaneously is computationally expensive and proposes a framework for understanding and addressing these trade-offs in AI alignment.

Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Published at 2025-11-24
#ML

Inferix is an inference engine that improves world simulation by using a semi-autoregressive decoding method, which generates more stable and coherent videos than traditional methods. It also offers interactive video streaming, profiling, and efficient benchmarking, which set it apart from other systems and models.

Terminal Velocity Matching
Published at 2025-11-24
#ML

The study presents Terminal Velocity Matching (TVM), a method that improves the quality of one- and few-step generative models by regulating the transition between diffusion timesteps. TVM offers better performance and stability through minimal architectural changes and an efficient fused attention kernel, achieving state-of-the-art results on ImageNet datasets with low FID scores.

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Published at 2025-11-24
#ML

The authors propose UniGame, a self-adversarial training method that improves the consistency and performance of unified multimodal models by making the model challenge its own weaknesses. UniGame enhances the model's understanding, generation, and robustness while adding few extra parameters, and it can be combined with existing post-training methods.

Block Cascading: Training Free Acceleration of Block-Causal Video Models
Published at 2025-11-25
#ML

This study presents Block Cascading, a method that speeds up block-causal video models without additional training. By allowing blocks to be generated in parallel, it significantly accelerates both small and large models and reduces the time needed to switch between different parts of the video during interactive generation, all without sacrificing video quality.

Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs
Published at 2025-11-25
#ML

This study presents the Trajectory-Backward Consistency Model (TBCM), which improves the efficiency of diffusion models by eliminating the need for external training data, making it better suited to resource-constrained scenarios. TBCM achieves high-quality few-step generation while reducing training time and GPU memory usage, and it offers insights for future research on diffusion-generation space discrepancy and sampling strategies.

Latent Collaboration in Multi-Agent Systems
Published at 2025-11-25
#ML

The authors present LatentMAS, a framework that allows language models to collaborate directly in a continuous space, enhancing system-level reasoning quality and efficiency without additional training. The method outperforms existing models on various benchmarks, achieving higher accuracy, reduced token usage, and faster inference.

NVIDIA Nemotron Parse 1.1
Published at 2025-11-25
#ML

Nemotron-Parse-1.1 is a lightweight model for document parsing and OCR that improves upon its predecessor. It excels in general OCR, markdown formatting, table parsing, and extracting text from visuals, while also running faster and scoring higher on public benchmarks.

Reinforcing Action Policies by Prophesying
Published at 2025-11-25
#ML

The authors present ProphRL, a method that improves Vision-Language-Action policies by using a learned world model and a tailored reinforcement learning procedure. ProphRL is data- and compute-efficient and can quickly adapt to new robots, objects, and environments, resulting in significant success gains on both public benchmarks and real robots.

SPHINX: A Synthetic Environment for Visual Perception and Reasoning
Published at 2025-11-25
#ML

SPHINX is a synthetic environment designed to test and improve the visual perception and reasoning abilities of AI models. It generates complex puzzles that require understanding various visual concepts, and even the most advanced AI models struggle to solve them, suggesting a need for further development in multimodal reasoning.

Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Published at 2025-11-26
#ML

The paper presents Harmony, a framework for synchronized audio-visual content generation that addresses issues such as correspondence drift, inefficient attention mechanisms, and intra-modal bias. It introduces a training paradigm, a module for efficient temporal alignment, and a synchronization-enhanced CFG (classifier-free guidance) to improve audio-visual synchronization, outperforming existing methods in experiments.

I-GLIDE: Input Groups for Latent Health Indicators in Degradation Estimation
Published at 2025-11-26
#ML

The study presents I-GLIDE, a method that improves the prediction of an item's remaining useful life by using a technique called Reconstruction along Projected Pathways and adding uncertainty quantification. The approach helps identify the specific causes of failures in complex systems such as those in aerospace and manufacturing, leading to more accurate and reliable predictions than existing methods.

Monet: Reasoning in Latent Visual Space Beyond Images and Language
Published at 2025-11-26
#ML

The authors present Monet, a training framework that allows large language models to think in a visual space by creating visual thoughts. The framework addresses training challenges such as high cost and lack of supervision, and introduces a new method called VLPO for better visual reasoning. The resulting model, Monet-7B, outperforms others on various visual reasoning tasks and provides insights for future advances in this field.

Revisiting Generalization Across Difficulty Levels: It's Not So Easy
Published at 2025-11-26
#ML

This study examines how well large language models perform on tasks of varying difficulty, using a new method to objectively rank task difficulty based on the abilities of many different models. The results show that training on either easy or hard tasks does not lead to consistent improvements across all difficulty levels, highlighting the importance of including a range of difficulty levels in both training and evaluation data for these models.

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.