🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter brings you a curated list of papers from 🤗 Daily Papers. |
|
ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models |
Published at 2025-11-24 |
|
#ML
|
ThreadWeaver is a new framework that improves the efficiency of language models by allowing them to reason in parallel, reducing latency without sacrificing accuracy. It does this through a two-stage parallel trajectory generator, a trie-based training-inference co-design, and a parallelization-aware reinforcement learning framework, making it compatible with existing inference engines and achieving up to 1.53x speedup in token latency.... |
|
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models |
Published at 2025-12-01 |
|
#ML
|
The researchers present AV-SpeakerBench, a benchmark of 3,212 questions that tests how well multimodal language models understand human speech in real-world videos. It focuses on speaker-centric audiovisual reasoning, with questions that require both visual and audio cues to answer, and it is shown to separate models clearly by their performance in this area.... |
|
Predicting Time-Dependent Flow Over Complex Geometries Using Operator Networks |
Published at 2025-12-03 |
|
#ML
|
The study develops a Deep Operator Network that accurately predicts velocity fields for unsteady flows around various shapes, achieving significant speedups over traditional CFD methods. The model's performance is evaluated using various diagnostics, and potential improvements are discussed.... |
|
LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning |
Published at 2025-12-04 |
|
#ML
|
The study presents LYNX, a new method that helps reasoning models stop processing when they have enough information, improving efficiency and accuracy. LYNX uses a model's internal signals and a lightweight probe trained on mathematical tasks to make confident decisions, reducing tokens by 35-70% while maintaining or improving accuracy across various benchmarks.... |
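The general idea of a confidence-controlled exit can be illustrated with a minimal sketch. This is not LYNX's implementation: `generate_with_early_exit`, `toy_probe`, and the threshold value are hypothetical names and numbers chosen for illustration, and the toy probe simply grows with the number of emitted steps rather than reading a model's internal signals.

```python
from typing import Callable, Iterable, List


def generate_with_early_exit(
    steps: Iterable[str],
    probe: Callable[[List[str]], float],
    threshold: float = 0.9,
) -> List[str]:
    """Emit reasoning steps until the probe's confidence reaches the threshold."""
    emitted: List[str] = []
    for step in steps:
        emitted.append(step)
        # Hypothetical exit rule: stop as soon as the probe is confident enough.
        if probe(emitted) >= threshold:
            break
    return emitted


def toy_probe(emitted: List[str]) -> float:
    """Stand-in probe (assumption): confidence rises by 0.25 per emitted step."""
    return min(1.0, 0.25 * len(emitted))


steps = [f"step-{i}" for i in range(10)]
kept = generate_with_early_exit(steps, toy_probe, threshold=0.75)
saved = 1 - len(kept) / len(steps)
print(len(kept), f"{saved:.0%}")  # 3 steps kept, 70% of steps skipped
```

In this toy setup a lower threshold exits earlier and saves more steps, while a higher threshold trades savings for caution.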
|
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems |
Published at 2025-12-04 |
|
#ML
|
The authors propose MemLoRA, a memory system that allows small language models to perform memory operations on-device, and its extension MemLoRA-V, which adds visual understanding capabilities. These models outperform larger baseline models on text-based tasks and demonstrate strong performance in multimodal contexts.... |
|
MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment |
Published at 2025-12-06 |
|
#ML
|
The researchers created a new system called MIND-V that can generate realistic and logical videos of robots performing complex tasks over long periods. MIND-V uses a three-part approach to plan tasks, translate instructions, and render videos, and it's trained to obey physical laws to ensure the generated actions are possible in the real world.... |
|
Novel Deep Learning Architectures for Classification and Segmentation of Brain Tumors from MRI Images |
Published at 2025-12-06 |
|
#ML
|
This study presents two new deep learning models for detecting brain tumors from MRI images: SAETCN, which achieves 99.38% accuracy in classifying different types of tumors, and SAS-Net, which has an overall pixel accuracy of 99.23% in segmenting brain tumors.... |
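For readers unfamiliar with the segmentation metric, pixel accuracy is the fraction of pixels whose predicted label matches the ground-truth label. The sketch below computes it on a tiny made-up mask pair; the function name `pixel_accuracy` and the example masks are illustrative assumptions, not data or code from the paper.

```python
def pixel_accuracy(pred, target):
    """Return the fraction of pixels where the predicted label equals the target label."""
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    total = sum(len(row) for row in target)
    if total == 0:
        raise ValueError("masks must not be empty")
    correct = sum(
        p == t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return correct / total


# Illustrative 4x4 binary masks (1 = tumor, 0 = background); one pixel disagrees.
pred = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
target = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
acc = pixel_accuracy(pred, target)
print(f"{acc:.4f}")  # 15 of 16 pixels match: 0.9375
```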
|
Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training |
Published at 2025-12-07 |
|
#ML
|
The authors present AutoQ-VIS, a new method for unsupervised video instance segmentation that improves upon previous techniques by bridging the gap between synthetic and real videos using quality-guided self-training. This method achieves state-of-the-art performance without human annotations, outperforming the previous best method by 4.4%.... |
|
From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs |
Published at 2025-12-07 |
|
#ML
|
The authors propose a method to improve the efficiency of language models by transitioning from sequential to parallel generation, which allows for faster and more efficient text generation. This approach enables the use of existing language model knowledge and achieves state-of-the-art performance in various tasks without the need for costly training from scratch.... |
|
DeepCode: Open Agentic Coding |
Published at 2025-12-08 |
|
#ML
|
This study presents DeepCode, an autonomous agentic system that converts scientific papers into code by managing information flow efficiently. It is reported to outperform both commercial coding agents and human experts on this task.... |
|
Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation |
Published at 2025-12-08 |
|
#ML
|
The authors present DualVLN, a new model for vision-language navigation that combines high-level reasoning with low-level action execution. This dual-system approach improves real-time control and adaptability in complex, dynamic environments, outperforming previous methods in benchmarks and real-world tests.... |
|
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory |
Published at 2025-12-08 |
|
#ML
|
The authors present a new method called OneStory that improves the generation of multi-shot videos by using a global yet compact memory of semantically relevant frames from previous shots, and an adaptive conditioner that generates compact context for direct conditioning. This approach allows for controllable and immersive long-form video storytelling with improved narrative coherence across diverse and complex scenes.... |
|
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality |
Published at 2025-12-08 |
|
#ML
|
The authors present LivingSwap, a new model for video face swapping that uses keyframes and visual attributes from source videos to create high-quality, realistic face swaps with consistent expressions, lighting, and motion. They also address the lack of data for training this model by creating a new dataset, Face2Face, and demonstrate that their method outperforms existing techniques in achieving seamless face swaps for film and entertainment production.... |
|
SUCCESS-GS: Survey of Compactness and Compression for Efficient Static and Dynamic Gaussian Splatting |
Published at 2025-12-08 |
|
#ML
|
This study offers a comprehensive review of methods to efficiently represent 3D and 4D scenes using Gaussian Splatting, a technique that allows for real-time, high-quality 3D reconstruction and novel view synthesis. The methods can be broadly categorized into two groups: Parameter Compression and Restructuring Compression, which help reduce memory and computational demands while maintaining reconstruction quality.... |
|
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models |
Published at 2025-12-08 |
|
#ML
|
The study presents TreeGRPO, a framework that makes online reinforcement learning post-training of diffusion models substantially more efficient. By structuring the denoising process as a search tree, TreeGRPO increases sample efficiency, allows more precise reward assignment, and reduces computation costs, leading to faster training and better performance than existing methods.... |
|
EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce |
Published at 2025-12-09 |
|
#ML
|
The authors present EcomBench, a new evaluation tool for assessing the capabilities of foundation agents in real e-commerce environments. EcomBench is based on real user demands from global e-commerce platforms and tests agents on tasks like information retrieval, reasoning, and knowledge integration across different difficulty levels.... |
|
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time |
Published at 2025-12-09 |
|
#ML
|
This research presents a new model called D4RT, which is a unified transformer architecture that efficiently reconstructs dynamic scenes from video by inferring depth, motion, and camera parameters all at once. Its unique querying mechanism reduces computation and complexity, enabling fast training and inference, and it outperforms previous methods in 4D reconstruction tasks.... |
|
Modular Neural Image Signal Processing |
Published at 2025-12-09 |
|
#ML
|
The proposed modular framework processes raw images into high-quality output while allowing detailed control over the rendering process. Its modularity improves scalability, debugging, and adaptability to various camera types and user preferences, and the framework is used in an interactive photo-editing tool that supports diverse editing operations and styles.... |
|
SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos |
Published at 2025-12-09 |
|
#ML
|
The study presents a new method, SAM-Body4D, for reconstructing 3D human bodies from videos without the need for additional training. This technique improves upon existing methods by ensuring temporal consistency and robustness against occlusions, making it more effective in real-world scenarios with moving and overlapping people.... |
|
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs |
Published at 2025-12-09 |
|
#ML
|
The study presents two new benchmarks to assess the performance of multimodal large language models in processing the same information across different modalities. The results reveal significant inconsistencies in model performance, with factors like text color, resolution, and number of vision tokens affecting the outcomes, even when text recognition errors are excluded.... |
|
Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation |
Published at 2025-12-09 |
|
#ML
|
The authors present Terrain Diffusion, an improvement over Perlin noise for creating infinite, realistic terrains in real-time. It uses a new algorithm called InfiniteDiffusion, a stack of diffusion models for detail and context, and an open-source framework for handling large data, enabling the generation of entire planets seamlessly and coherently.... |
|
TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels |
Published at 2025-12-09 |
|
#ML
|
The authors present a new method called TrackingWorld that improves monocular 3D tracking by separating camera motion from foreground motion and densely tracking new subjects. They achieve this through a tracking upsampler, reducing redundancy in 2D tracks, and an optimization-based framework for back-projecting dense 2D tracks into world-centric 3D trajectories, resulting in more accurate and dense 3D tracking.... |
|
Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform |
Published at 2025-12-09 |
|
#ML
|
The authors present Visionary, a web-based platform that enables real-time rendering of 3D Gaussian Splatting and meshes using WebGPU, providing a lightweight and efficient experience for dynamic neural processing and generative models.... |
|
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance |
Published at 2025-12-09 |
|
#ML
|
The authors propose Wan-Move, a straightforward and efficient system for controlling motion in video generation models, by integrating motion-aware features into the model without changing its architecture. This approach enables high-quality, precise motion control, and the resulting videos are on par with commercial tools, as demonstrated by user studies and extensive experiments on a new benchmark dataset.... |
|
Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, which is derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |