🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Inferring Compositional 4D Scenes without Ever Seeing One
Published at 2025-12-04

#ML

This research presents a method called COM4D that accurately predicts the structure and movement of multiple objects in a scene using only 2D video input, without relying on any 3D or 4D training data. By learning from both object compositions and single-object dynamics, COM4D can reconstruct complex, interactive scenes directly from video footage....
Read More

Aesthetic Alignment Risks Assimilation: How Image Generation and Reward Models Reinforce Beauty Bias and Ideological "Censorship"
Published at 2025-12-08

#ML

This study shows that image generation models often favor conventionally beautiful images over user-requested 'anti-aesthetic' ones, which limits user freedom and diversity in artistic expression. The researchers found that reward models even penalize images that match user prompts but don't fit the traditional idea of beauty....
Read More

START: Spatial and Textual Learning for Chart Understanding
Published at 2025-12-08

#ML

The authors present a new method, START, to improve multimodal large language models' understanding of charts by focusing on both their visual layout and data details. They introduce two new tasks, chart-element grounding and chart-to-code generation, and create a novel dataset, START-Dataset, to train models on these tasks. The proposed method outperforms existing models and demonstrates consistent improvements across different model sizes and benchmarks....
Read More

KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Published at 2025-12-09

#ML

This study presents a method called KD-OCT to create a more efficient and smaller deep learning model for analyzing images used in eye disease detection. The new model maintains high accuracy while significantly reducing the model size and inference time, making it suitable for real-time deployment and edge devices....
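The paper's exact recipe isn't shown in the summary, but the core idea of knowledge distillation — training a compact student to match a large teacher's temperature-softened output distribution — can be sketched in a few lines. This is a generic textbook illustration (the temperature value and KL-based loss are standard choices, not necessarily KD-OCT's):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature yields a
    # softer distribution that exposes the teacher's "dark knowledge".
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in standard knowledge distillation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (temperature ** 2) * kl

# A student that exactly matches the teacher incurs zero loss.
print(round(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]), 6))  # 0.0
```

In practice this term is mixed with the ordinary cross-entropy on ground-truth labels; the distillation term is what lets the smaller model inherit the teacher's calibrated class similarities.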
Read More

Learning Robot Manipulation from Audio World Models
Published at 2025-12-09

#ML

The authors present a model that predicts future audio observations, helping robots make better decisions in tasks requiring multiple types of information, like filling a bottle with water. They show that their method, which focuses on predicting future audio states with rhythmic patterns, outperforms other methods in two manipulation tasks....
Read More

Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries
Published at 2025-12-09

#ML

This study presents a refined deep learning model for automatically identifying fish in video data collected by Electronic Monitoring systems, which is more efficient than manual review. The proposed model, based on the Swin-T architecture, significantly improves fish re-identification metrics compared to a conventional model, with the main challenge being the distinction of similar-looking fish of the same species....
Read More

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Published at 2025-12-09

#ML

This study presents AEGIS, a Vision-Language-Safe Action architecture that enhances safety in robotic manipulation tasks by adding a plug-and-play layer using control barrier functions, without compromising the models' original performance. The researchers also created a safety-critical benchmark and demonstrated AEGIS's superiority over existing methods, achieving a significant improvement in obstacle avoidance and task success rates....
Read More
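Control barrier functions admit a very small illustration. The sketch below is a generic 1D safety filter for single-integrator dynamics — not AEGIS's actual constraint layer — showing how a nominal action is minimally modified so the barrier h(x) stays non-negative:

```python
def cbf_filter(u_nominal, x, x_obstacle, alpha=1.0):
    # Barrier h(x) = x_obstacle - x (h >= 0 means the robot is on the
    # safe side of the obstacle). For single-integrator dynamics x' = u,
    # the CBF condition h' >= -alpha * h becomes u <= alpha * (x_obstacle - x).
    # The filter minimally modifies the nominal action to satisfy it.
    u_max = alpha * (x_obstacle - x)
    return min(u_nominal, u_max)

# A nominal policy pushing hard toward the obstacle gets clipped (~0.1)...
print(cbf_filter(u_nominal=5.0, x=0.9, x_obstacle=1.0))
# ...while an already-safe action passes through unchanged (0.05).
print(cbf_filter(u_nominal=0.05, x=0.9, x_obstacle=1.0))
```

In the multi-dimensional case the same minimal-deviation idea is posed as a small quadratic program over the action, which is what makes such a layer "plug-and-play" on top of an existing policy.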
Read More

MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment
Published at 2025-12-10

#ML

The authors present MentraSuite, a framework that enhances the reasoning abilities of large language models for mental health applications. They introduce MentraBench, a benchmark for evaluating reasoning quality, and Mindora, a post-trained model that ensures faithful and coherent reasoning, outperforming other LLMs in complex mental health scenarios....
Read More

Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge
Published at 2025-12-10

#ML

The authors present their 2nd-place solution to the 2025 BEHAVIOR Challenge, which simulates long-horizon tasks for physical agents. They improve upon a previous model by studying training techniques and data, finding that scaling the pre-training and post-training phases significantly enhances performance. They share their findings to help the embodied AI community apply powerful models to complex scenarios....
Read More

CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models
Published at 2025-12-11

#ML

The study presents a new method called CAPTAIN to address the issue of diffusion models replicating training examples, which can lead to privacy and copyright problems. CAPTAIN reduces memorization by modifying latent features during denoising, using frequency-based noise initialization, identifying optimal denoising timesteps, localizing memorized regions, and injecting semantically aligned features from non-memorized images, all while preserving the prompt's intended meaning and visual quality...
Read More

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Published at 2025-12-11

#ML

The authors present a new method to create large, detailed motion datasets for videos using object detection, tracking, and Large Language Models. They fine-tune open-source models with this data, improving motion understanding and outperforming other models in various benchmarks....
Read More

What matters for Representation Alignment: Global Information or Spatial Structure?
Published at 2025-12-11

#ML

This study explores what aspect of a target representation is crucial for generation: global semantic information or spatial structure. Through a large-scale analysis and experiments, the researchers found that spatial structure, rather than global semantic information, is what drives generation performance. They then introduced a simple method, iREPA, which improves the convergence speed of representation alignment, demonstrating the importance of spatial information in generative model training....
Read More

Flowception: Temporally Expansive Flow Matching for Video Generation
Published at 2025-12-12

#ML

Flowception is a new method for generating videos without processing each frame one after another, which reduces errors and computational requirements compared to other techniques. It can handle long videos efficiently, integrate different tasks, and generates better results based on various metrics....
Read More

Rethinking Expert Trajectory Utilization in LLM Post-training
Published at 2025-12-12

#ML

The study presents a new framework to improve the use of expert trajectories in training large language models, suggesting a specific order and guidelines for fine-tuning and reinforcement learning to enhance model performance....
Read More

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Published at 2025-12-12

#ML

This study presents V-REX, a new evaluation tool for testing complex visual reasoning tasks that require multiple steps of exploration and thinking, similar to how a detective would work. V-REX breaks down these tasks into a series of questions and measures a model's ability to plan and follow through with these questions to find the final answer, providing a more detailed analysis of a model's performance in these challenging tasks....
Read More

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Published at 2025-12-14

#ML

The authors present DrivePI, a spatial-aware 4D MLLM that integrates vision, language, and action for autonomous driving, outperforming both VLA and VA models in tasks like 3D perception, prediction, and planning....
Read More

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Published at 2025-12-14

#ML

The authors present a new attention mechanism called Error-Free Linear Attention (EFLA) that is numerically stable, fully parallel, and generalized. EFLA allows for linear-time computation and error-free learning, outperforming other methods in noisy environments without adding extra parameters....
Read More
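EFLA's exact solution isn't reproduced here, but the general idea behind linear attention — replacing the n×n score matrix with running sums over a feature map, giving O(n) causal attention — can be sketched as follows. The exp feature map and scalar values are simplifications for illustration, not EFLA's formulation:

```python
import math

def feature_map(x):
    # A simple positive feature map (elementwise exp), standing in for
    # the kernels used in linear attention; any positive map works here.
    return [math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    # Causal O(n) attention: maintain running sums of phi(k) * v and
    # phi(k); each query reads their ratio -- no n x n score matrix.
    d = len(keys[0])
    s = [0.0] * d          # running sum of phi(k) * v (scalar values here)
    z = [0.0] * d          # running sum of phi(k), the normalizer
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(d):
            s[i] += phi_k[i] * v
            z[i] += phi_k[i]
        phi_q = feature_map(q)
        num = sum(pq * si for pq, si in zip(phi_q, s))
        den = sum(pq * zi for pq, zi in zip(phi_q, z))
        outputs.append(num / den)
    return outputs

# Each output is a convex combination of the values seen so far:
# the first output equals the first value exactly.
out = linear_attention(
    queries=[[0.1, 0.2], [0.3, 0.1]],
    keys=[[0.2, 0.1], [0.1, 0.4]],
    values=[1.0, 2.0],
)
```

Because the running sums can be updated incrementally, this form also supports streaming inference with constant memory per step.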
Read More

GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
Published at 2025-12-14

#ML

The authors present a new framework called GenieDrive for generating physics-aware driving videos. It uses 4D occupancy to provide physical information, reduces latent size for efficient compression, and improves forecasting and video quality through various techniques, resulting in better performance and speed....
Read More

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Published at 2025-12-14

#ML

The study introduces NL2Repo Bench, a new evaluation tool for coding agents that tests their ability to create complete software systems over long horizons. Experiments showed that current agents struggle with this task, revealing areas for improvement such as maintaining coherence and managing dependencies over extended periods....
Read More

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Published at 2025-12-14

#ML

The study presents QwenLong-L1.5, a model with improved long-context reasoning and memory management. It uses a data synthesis pipeline for generating challenging tasks, stabilized reinforcement learning for long-context training, and a memory-augmented architecture for ultra-long contexts, outperforming its baseline and competitors on various benchmarks....
Read More

State over Tokens: Characterizing the Role of Reasoning Tokens
Published at 2025-12-14

#ML

The State over Tokens (SoT) framework suggests that reasoning tokens in large language models are not a literal account of their reasoning process, but rather an externalized computational state that guides correct reasoning without being a faithful explanation when read as text. This approach highlights the need for future research to focus on decoding these tokens as state, rather than interpreting them as a narrative....
Read More

WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment
Published at 2025-12-14

#ML

The authors present a new tree-search framework called WebOperator, which helps AI agents explore web environments more effectively by incorporating strategic foresight and safe execution. This method allows agents to backtrack reliably, explore alternative paths, and generate diverse action candidates, resulting in a higher success rate in web tasks compared to existing approaches....
Read More
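The backtracking behavior described above can be illustrated with a generic best-first tree search: a priority-queue frontier lets the agent abandon a dead-end branch and resume from the best unexplored node anywhere in the tree. This is a toy sketch with a hand-written scoring function standing in for a learned value model, not WebOperator's actual algorithm:

```python
import heapq

def tree_search(initial_state, expand, is_goal, max_steps=100):
    # Best-first search over (score, state). When a branch dead-ends
    # (expand returns nothing), popping the frontier implicitly
    # backtracks to the most promising node seen so far.
    counter = 0  # tie-breaker so heapq never compares states directly
    frontier = [(0.0, counter, initial_state)]
    while frontier and max_steps > 0:
        max_steps -= 1
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for score, child in expand(state):
            counter += 1
            heapq.heappush(frontier, (-score, counter, child))
    return None

# Toy example: build the string "abc" one letter at a time, scoring
# candidates by whether they extend the correct prefix.
target = "abc"
result = tree_search(
    "",
    expand=lambda s: [(1.0 if (s + c) == target[:len(s) + 1] else 0.0, s + c)
                      for c in "abc"] if len(s) < len(target) else [],
    is_goal=lambda s: s == target,
)
print(result)  # abc
```

In a web setting, `expand` would propose candidate actions (clicks, form fills) scored for promise and safety, and states would need to be restorable so that backtracking actually replays the environment.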
Read More

DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders
Published at 2025-12-15

#ML

The DiffusionBrowser framework enables interactive preview generation during video diffusion modeling, allowing users to see RGB and scene intrinsics at high speed. It also provides new control capabilities by enabling interactive guidance of generation and probing the model to reveal how details are composed and assembled....
Read More

Directional Textual Inversion for Personalized Text-to-Image Generation
Published at 2025-12-15

#ML

The study presents a new method called Directional Textual Inversion (DTI) that improves personalized text-to-image generation by optimizing only the direction of embeddings at a fixed magnitude, leading to better text fidelity and smooth interpolation between concepts compared to existing methods....
Read More
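The core constraint — updating an embedding's direction while pinning its magnitude — can be sketched generically as projected gradient descent onto a sphere. This is a plain illustration of the constraint, not DTI's actual objective or optimizer:

```python
import math

def renormalize(v, magnitude):
    # Project a vector back onto the sphere of radius `magnitude`,
    # so that optimization can only change its direction.
    norm = math.sqrt(sum(x * x for x in v))
    return [magnitude * x / norm for x in v]

def direction_only_step(embedding, grad, lr, magnitude):
    # Take an ordinary gradient step, then restore the fixed norm.
    updated = [e - lr * g for e, g in zip(embedding, grad)]
    return renormalize(updated, magnitude)

emb = renormalize([1.0, 2.0, 2.0], magnitude=3.0)   # ||[1,2,2]|| = 3
emb = direction_only_step(emb, grad=[0.5, -0.1, 0.0], lr=0.1, magnitude=3.0)
norm = math.sqrt(sum(x * x for x in emb))
print(round(norm, 6))  # 3.0 -- the magnitude is preserved after the update
```

Keeping the learned token's norm in the range of ordinary word embeddings is what plausibly preserves text fidelity: the token cannot drift to an out-of-distribution magnitude that overwhelms the rest of the prompt.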
Read More

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models
Published at 2025-12-15

#ML

The authors present FIN-bench-v2, a comprehensive and consistent benchmark suite for assessing large language models in Finnish. This suite combines popular Finnish benchmarks with an updated FIN-bench, covering various tasks and prompt formats, and ensures robustness through pretraining and evaluation of models, making all resources publicly available....
Read More

Few-Step Distillation for Text-to-Image Generation: A Practical Guide
Published at 2025-12-15

#ML

This study is the first to systematically adapt and compare state-of-the-art diffusion distillation techniques for text-to-image generation, focusing on the challenges that arise when transitioning from class labels to free-form language prompts. The researchers provide practical guidelines for input scaling, network architecture, and hyperparameters, along with an open-source implementation and pretrained models, paving the way for fast, high-quality, and resource-efficient image generation in ...
Read More

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Published at 2025-12-15

#ML

The authors present a benchmark for AI agents to perform finance and accounting tasks in realistic enterprise workflows, using data from Enron and other financial institutions. They create 172 workflows with 384 tasks, and find that top AI systems struggle to complete them correctly, with GPT 5.1 Pro passing only 38.4% of workflows....
Read More

Image Diffusion Preview with Consistency Solver
Published at 2025-12-15

#ML

This study presents a new method called ConsistencySolver to improve the speed and quality of image diffusion models. By using a lightweight, trainable solver optimized with reinforcement learning, ConsistencySolver generates better previews and maintains consistency with the final output, reducing user interaction time by nearly 50% while keeping generation quality....
Read More

KlingAvatar 2.0 Technical Report
Published at 2025-12-15

#ML

The authors present KlingAvatar 2.0, a system that creates high-quality, long-duration videos by first generating low-resolution video keyframes and then improving them into high-resolution, coherent sub-clips. To better align the videos with given instructions, they use a Co-Reasoning Director with three large language model experts to understand the instructions and create a detailed storyline, while also managing multiple characters in the video....
Read More

LitePT: Lighter Yet Stronger Point Transformer
Published at 2025-12-15

#ML

Researchers analyzed the role of convolutional layers and attention blocks in 3D point cloud networks and found that convolutions are better for extracting low-level geometry in early stages, while attention is more efficient for capturing high-level semantics in deeper layers. They then developed a new, efficient 3D point cloud backbone called LitePT that uses convolutions in early stages and switches to attention for deeper layers, resulting in a model that is faster, uses less memory, and has...
Read More

LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Published at 2025-12-15

#ML

The study presents LongVie 2, a video world model that can generate high-quality, long-term videos with control and consistency. It's trained in three stages: multi-modal guidance for control, degradation-aware training for quality, and history-context guidance for consistency. The model is tested on LongVGenBench, a benchmark of diverse high-resolution videos, and outperforms existing models in long-range control, temporal coherence, and visual fidelity....
Read More

Memory in the Age of AI Agents
Published at 2025-12-15

#ML

This study provides a comprehensive overview of agent memory research, distinguishing it from related concepts and examining it through the lenses of forms, functions, and dynamics. The research identifies three main types of agent memory, proposes a taxonomy for its functions, and analyzes its development over time, while also discussing emerging research frontiers in the field....
Read More

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Published at 2025-12-15

#ML

The study presents ReFusion, a new type of language model that improves upon previous models by using a higher level of parallel decoding, which increases both performance and efficiency. ReFusion outperforms prior models by a significant margin and is also faster, making it a strong competitor to other language models....
Read More

RecTok: Reconstruction Distillation along Rectified Flow
Published at 2025-12-15

#ML

The researchers created RecTok to improve high-dimensional visual tokenizers by focusing on the forward flow in flow matching, making it semantically rich and distilling information from vision foundation models into this flow. RecTok outperforms previous methods in image reconstruction, generation quality, and discriminative performance, achieving state-of-the-art results and improving with increased latent dimensionality....
Read More
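For background, the straight-line coupling at the heart of rectified flow — the "forward flow" the summary refers to — is simple to state. The sketch below is the generic flow-matching construction, not RecTok's distillation objective:

```python
def rectified_flow_pair(x0, x1, t):
    # Rectified-flow training pair: the point x_t on the straight line
    # from x0 (noise) to x1 (data), plus the constant target velocity
    # v = x1 - x0 that a model would be regressed onto at time t.
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v

# Halfway along the path from noise [0, 0] to data [2, 4]:
xt, v = rectified_flow_pair([0.0, 0.0], [2.0, 4.0], t=0.5)
print(xt)  # [1.0, 2.0]
print(v)   # [2.0, 4.0]
```

Because every intermediate x_t lies on this line, the forward flow is a natural place to attach extra supervision — which is the lever the summary describes RecTok pulling.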
Read More

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Published at 2025-12-15

#ML

The researchers developed a new method for training vision-language-action models to understand 3D spatial relationships by aligning visual and physical spaces. They used human demonstration videos to create a dual-encoder architecture, which improved the model's ability to perform robotic tasks in 3D environments....
Read More

Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection
Published at 2025-12-15

#ML

This study presents a new task called Visually Grounded Active View Selection, which helps vision language models choose the most informative next viewpoint by only using the current image's visual information. The researchers created a synthetic dataset and a framework that trains these models to select viewpoints, leading to better question-answering performance and improved accuracy in existing scene-exploration-based systems....
Read More

Towards Interactive Intelligence for Digital Humans
Published at 2025-12-15

#ML

The authors propose a new approach for digital humans that can express personality, adapt to interactions, and evolve over time. They introduce Mio, a comprehensive framework with five modules for cognitive reasoning and real-time multimodal embodiment, which outperforms current methods in evaluating interactive intelligence....
Read More

Towards Scalable Pre-training of Visual Tokenizers for Generation
Published at 2025-12-15

#ML

The research proposes a new framework, VTP, to improve the quality of visual tokenizers for image generation by focusing on high-level semantics instead of pixel-level accuracy. Through extensive experimentation, the study finds that better understanding leads to better generation and that the proposed method scales more effectively with compute, parameters, and data, resulting in faster convergence and improved performance in downstream image generation tasks....
Read More

Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Published at 2025-12-15

#ML

This study presents a new benchmark suite, Video Reality Test, which evaluates the realism of AI-generated ASMR videos with tight audio-visual coupling. Experimental results reveal that while the best AI-generated videos can deceive most video models, human experts can still distinguish them with high accuracy, and superficial cues can mislead models, highlighting limitations in video models' perceptual fidelity and audio-visual consistency....
Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the Developer's Social Media