🤗 Daily Paper(2025-12-16)


deep.di...@gmail.com

Dec 16, 2025, 3:08:07 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers selected by 🤗 Daily Papers.

Project page
🤗 Daily Papers

Inferring Compositional 4D Scenes without Ever Seeing One

Published at 2025-12-04

#ML

This research presents a method called COM4D that can accurately predict the structure and movement of multiple objects in a scene using only 2D video input, without relying on any 3D or 4D training data. By learning from both object compositions and single object dynamics, COM4D can reconstruct complex, interactive scenes directly from video footage....

Read More

Aesthetic Alignment Risks Assimilation: How Image Generation and Reward Models Reinforce Beauty Bias and Ideological "Censorship"

Published at 2025-12-08

#ML

This study shows that image generation models often favor conventionally beautiful images over user-requested 'anti-aesthetic' ones, which limits user freedom and diversity in artistic expression. The researchers found that reward models even penalize images that match user prompts but don't fit the traditional idea of beauty....

Read More

START: Spatial and Textual Learning for Chart Understanding

Published at 2025-12-08

#ML

The authors present a new method, START, to improve multimodal large language models' understanding of charts by focusing on both their visual layout and data details. They introduce two new tasks, chart-element grounding and chart-to-code generation, and create a novel dataset, START-Dataset, to train models on these tasks. The proposed method outperforms existing models and demonstrates consistent improvements across different model sizes and benchmarks....

Read More

KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

Published at 2025-12-09

#ML

This study presents a method called KD-OCT to create a more efficient and smaller deep learning model for analyzing images used in eye disease detection. The new model maintains high accuracy while significantly reducing the model size and inference time, making it suitable for real-time deployment and edge devices....

Read More
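As an aside for readers new to the technique: KD-OCT's exact recipe is in the paper, but the response-based knowledge distillation it builds on can be sketched in a few lines. The function names, the temperature value, and the toy logits below are illustrative, not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between teacher and student soft targets.

    A higher temperature softens both distributions so the student also
    learns the teacher's relative rankings of the wrong classes ("dark
    knowledge"). The T^2 factor keeps gradient magnitudes comparable
    across temperatures, as in classic distillation setups.
    """
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl

# A student that matches the teacher exactly incurs zero loss:
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))              # 0.0
print(distillation_loss([0.1, 0.2, 0.3], teacher) > 0)  # True
```

In practice this loss is mixed with the ordinary cross-entropy on hard labels, which is how a small model can approach a large teacher's accuracy at a fraction of the size.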

Learning Robot Manipulation from Audio World Models

Published at 2025-12-09

#ML

The authors present a model that predicts future audio observations, helping robots make better decisions in tasks requiring multiple types of information, like filling a bottle with water. They show that their method, which focuses on predicting future audio states with rhythmic patterns, outperforms other methods in two manipulation tasks....

Read More

Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries

Published at 2025-12-09

#ML

This study presents a refined deep learning model for automatically identifying fish in video data collected by Electronic Monitoring systems, which is more efficient than manual review. The proposed model, based on the Swin-T architecture, significantly improves fish re-identification metrics compared to a conventional model, with the main challenge being the distinction of similar-looking fish of the same species....

Read More

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Published at 2025-12-09

#ML

This study presents AEGIS, a Vision-Language-Safe Action architecture that enhances safety in robotic manipulation tasks by adding a plug-and-play layer using control barrier functions, without compromising the models' original performance. The researchers also created a safety-critical benchmark and demonstrated AEGIS's superiority over existing methods, achieving a significant improvement in obstacle avoidance and task success rates....

Read More
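For context, the control-barrier-function idea behind plug-and-play safety layers like this can be shown with a deliberately tiny 1-D example. This is not AEGIS itself; the function name, the single-obstacle setup, and the `alpha` gain are hypothetical.

```python
def cbf_filter(u_nominal, distance, alpha=1.0):
    """Minimal 1-D control-barrier-function safety filter.

    Barrier h(x) = distance to the obstacle; commanding speed u toward
    it gives h_dot = -u. The CBF condition h_dot >= -alpha * h becomes
    u <= alpha * distance, so the filter projects the nominal command
    onto that half-space: safe commands pass through untouched, unsafe
    ones are clamped to the boundary of the safe set.
    """
    u_max = alpha * distance
    return min(u_nominal, u_max)

print(cbf_filter(0.2, 1.0))  # 0.2  (already safe: passes through)
print(cbf_filter(2.0, 0.5))  # 0.5  (unsafe: clamped)
```

The appeal of this formulation, and presumably of a plug-and-play layer built on it, is that the filter only intervenes near the constraint boundary, leaving the upstream policy's behavior unchanged elsewhere.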

MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

Published at 2025-12-10

#ML

The authors present MentraSuite, a framework that enhances the reasoning abilities of large language models for mental health applications. They introduce MentraBench, a benchmark for evaluating reasoning quality, and Mindora, a post-trained model that ensures faithful and coherent reasoning, outperforming other LLMs in complex mental health scenarios....

Read More

Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge

Published at 2025-12-10

#ML

The authors present their 2nd-place solution to the 2025 BEHAVIOR Challenge, which simulates long-horizon tasks for embodied agents. They improve upon a previous model by studying training techniques and data, finding that scaling both the pre-training and post-training phases significantly enhances performance. They share their findings to help the embodied AI community apply powerful models to complex scenarios....

Read More

CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models

Published at 2025-12-11

#ML

The study presents a new method called CAPTAIN to address the issue of diffusion models replicating training examples, which can lead to privacy and copyright problems. CAPTAIN reduces memorization by modifying latent features during denoising, using frequency-based noise initialization, identifying optimal denoising timesteps, localizing memorized regions, and injecting semantically aligned features from non-memorized images, all while preserving the prompt's intended meaning and visual quality...

Read More

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Published at 2025-12-11

#ML

The authors present a new method to create large, detailed motion datasets for videos using object detection, tracking, and Large Language Models. They fine-tune open-source models with this data, improving motion understanding and outperforming other models in various benchmarks....

Read More

What matters for Representation Alignment: Global Information or Spatial Structure?

Published at 2025-12-11

#ML

This study explores which aspect of a target representation is crucial for generation: global semantic information or spatial structure. Through large-scale analysis and experiments, the researchers found that spatial structure, not global semantic information, is what drives generation performance. They then introduce a simple method, iREPA, which improves the convergence speed of representation alignment, demonstrating the importance of spatial information in generative model training....

Read More

Flowception: Temporally Expansive Flow Matching for Video Generation

Published at 2025-12-12

#ML

Flowception is a new method for generating videos without processing each frame one after another, which reduces errors and computational requirements compared to other techniques. It can handle long videos efficiently, integrate different tasks, and generates better results based on various metrics....

Read More

Rethinking Expert Trajectory Utilization in LLM Post-training

Published at 2025-12-12

#ML

The study presents a new framework to improve the use of expert trajectories in training large language models, suggesting a specific order and guidelines for fine-tuning and reinforcement learning to enhance model performance....

Read More

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Published at 2025-12-12

#ML

This study presents V-REX, a new evaluation tool for testing complex visual reasoning tasks that require multiple steps of exploration and thinking, similar to how a detective would work. V-REX breaks down these tasks into a series of questions and measures a model's ability to plan and follow through with these questions to find the final answer, providing a more detailed analysis of a model's performance in these challenging tasks....

Read More

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

Published at 2025-12-14

#ML

The authors present DrivePI, a spatial-aware 4D MLLM that integrates vision, language, and action for autonomous driving, outperforming both VLA and VA models in tasks like 3D perception, prediction, and planning....

Read More

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Published at 2025-12-14

#ML

The authors present a new attention mechanism called Error-Free Linear Attention (EFLA) that is numerically stable, fully parallel, and generalized. EFLA allows for linear-time computation and error-free learning, outperforming other methods in noisy environments without adding extra parameters....

Read More
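For readers unfamiliar with the family of methods, the standard causal linear-attention recurrence that work in this area builds on can be sketched directly. This is the generic textbook recurrence, not the paper's EFLA; the identity feature map and non-negative toy inputs are simplifying assumptions.

```python
def linear_attention(queries, keys, values):
    """Causal linear attention as an O(n) recurrence.

    Instead of materializing the n x n attention matrix, we keep a
    running d_k x d_v state S = sum_t k_t v_t^T and a normalizer
    z = sum_t k_t, so each step costs constant time in sequence
    length. Feature maps phi(.) are omitted (identity) for clarity;
    inputs are assumed non-negative so the normalizer stays positive.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of k v^T
    z = [0.0] * d_k                        # running sum of k
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):               # state update for step t
            z[i] += k[i]
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        denom = sum(q[i] * z[i] for i in range(d_k)) or 1.0
        outputs.append(
            [sum(q[i] * S[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs

# With a single key seen so far, the output is exactly that key's value:
out = linear_attention([[1.0, 0.0]], [[1.0, 0.0]], [[5.0, 7.0]])
print(out)  # [[5.0, 7.0]]
```

The numerical-stability and error-accumulation questions EFLA addresses arise precisely because this recurrence carries state forward across the whole sequence instead of recomputing attention per step.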

GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Published at 2025-12-14

#ML

The authors present a new framework called GenieDrive for generating physics-aware driving videos. It uses 4D occupancy to provide physical information, reduces latent size for efficient compression, and improves forecasting and video quality through various techniques, resulting in better performance and speed....

Read More

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Published at 2025-12-14

#ML

The study introduces NL2Repo Bench, a new evaluation tool for coding agents that tests their ability to create complete software systems over long periods. Experiments showed that current agents struggle with this task, revealing areas for improvement such as maintaining coherence and managing dependencies over extended periods....

Read More

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Published at 2025-12-14

#ML

The study presents QwenLong-L1.5, a model with improved long-context reasoning and memory management. It uses a data synthesis pipeline for generating challenging tasks, stabilized reinforcement learning for long-context training, and a memory-augmented architecture for ultra-long contexts, outperforming its baseline and competitors on various benchmarks....

Read More

State over Tokens: Characterizing the Role of Reasoning Tokens

Published at 2025-12-14

#ML

The State over Tokens (SoT) framework suggests that reasoning tokens in large language models are not a literal account of their reasoning process, but rather an externalized computational state that guides correct reasoning without being a faithful explanation when read as text. This approach highlights the need for future research to focus on decoding these tokens as state, rather than interpreting them as a narrative....

Read More

WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment

Published at 2025-12-14

#ML

The authors present a new tree-search framework called WebOperator, which helps AI agents explore web environments more effectively by incorporating strategic foresight and safe execution. This method allows agents to backtrack reliably, explore alternative paths, and generate diverse action candidates, resulting in a higher success rate in web tasks compared to existing approaches....

Read More

DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

Published at 2025-12-15

#ML

The DiffusionBrowser framework enables interactive preview generation during video diffusion modeling, allowing users to see RGB and scene intrinsics at high speed. It also provides new control capabilities by enabling interactive guidance of generation and probing the model to reveal how details are composed and assembled....

Read More

Directional Textual Inversion for Personalized Text-to-Image Generation

Published at 2025-12-15

#ML

The study presents a new method called Directional Textual Inversion (DTI) that improves personalized text-to-image generation by optimizing only the direction of embeddings while keeping their magnitude fixed, leading to better text fidelity and smooth interpolation between concepts compared to existing methods....

Read More

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Published at 2025-12-15

#ML

The authors present FIN-bench-v2, a comprehensive and consistent benchmark suite for assessing large language models in Finnish. The suite combines popular Finnish benchmarks with an updated FIN-bench, covering various tasks and prompt formats, and its robustness is validated by pretraining and evaluating models on it; all resources are publicly available....

Read More

Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Published at 2025-12-15

#ML

This study is the first to systematically adapt and compare state-of-the-art diffusion distillation techniques for text-to-image generation, focusing on the challenges that arise when transitioning from class labels to free-form language prompts. The researchers provide practical guidelines for input scaling, network architecture, and hyperparameters, along with an open-source implementation and pretrained models, paving the way for fast, high-quality, and resource-efficient image generation in ...

Read More

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Published at 2025-12-15

#ML

The authors present a benchmark for AI agents to perform finance and accounting tasks in realistic enterprise workflows, using data from Enron and other financial institutions. They create 172 workflows with 384 tasks, and find that top AI systems struggle to complete them correctly, with GPT 5.1 Pro passing only 38.4% of workflows....

Read More

Image Diffusion Preview with Consistency Solver

Published at 2025-12-15

#ML

This study presents a new method called ConsistencySolver to improve the speed and quality of image diffusion models. By using a lightweight, trainable solver optimized with reinforcement learning, ConsistencySolver generates better previews and maintains consistency with the final output, reducing user interaction time by nearly 50% while keeping generation quality....

Read More

KlingAvatar 2.0 Technical Report

Published at 2025-12-15

#ML

The authors present KlingAvatar 2.0, a system that creates high-quality, long-duration videos by first generating low-resolution video keyframes and then refining them into high-resolution, coherent sub-clips. To better align the videos with given instructions, they use a Co-Reasoning Director with three large language model experts to interpret the instructions and create a detailed storyline, while also managing multiple characters in the video....

Read More

LitePT: Lighter Yet Stronger Point Transformer

Published at 2025-12-15

#ML

Researchers analyzed the role of convolutional layers and attention blocks in 3D point cloud networks and found that convolutions are better for extracting low-level geometry in early stages, while attention is more efficient for capturing high-level semantics in deeper layers. They then developed a new, efficient 3D point cloud backbone called LitePT that uses convolutions in early stages and switches to attention for deeper layers, resulting in a model that is faster, uses less memory, and has...

Read More

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Published at 2025-12-15

#ML

The study presents LongVie 2, a video world model that can generate high-quality, long-term videos with control and consistency. It's trained in three stages: multi-modal guidance for control, degradation-aware training for quality, and history-context guidance for consistency. The model is tested on LongVGenBench, a benchmark of diverse high-resolution videos, and outperforms existing models in long-range control, temporal coherence, and visual fidelity....

Read More

Memory in the Age of AI Agents

Published at 2025-12-15

#ML

This study provides a comprehensive overview of agent memory research, distinguishing it from related concepts and examining it through the lenses of forms, functions, and dynamics. The research identifies three main types of agent memory, proposes a taxonomy for its functions, and analyzes its development over time, while also discussing emerging research frontiers in the field....

Read More

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Published at 2025-12-15

#ML

The study presents ReFusion, a new type of language model that improves upon previous models by using a higher level of parallel decoding, which increases both performance and efficiency. ReFusion outperforms prior models by a significant margin and is also faster, making it a strong competitor to other language models....

Read More

RecTok: Reconstruction Distillation along Rectified Flow

Published at 2025-12-15

#ML

The researchers created RecTok to improve high-dimensional visual tokenizers by focusing on the forward flow in flow matching, making it semantically rich and distilling information from vision foundation models into this flow. RecTok outperforms previous methods in image reconstruction, generation quality, and discriminative performance, achieving state-of-the-art results and improving with increased latent dimensionality....

Read More

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Published at 2025-12-15

#ML

The researchers developed a new method for training vision-language-action models to understand 3D spatial relationships by aligning visual and physical spaces. They used human demonstration videos to create a dual-encoder architecture, which improved the model's ability to perform robotic tasks in 3D environments....

Read More

Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection

Published at 2025-12-15

#ML

This study presents a new task called Visually Grounded Active View Selection, which helps vision language models choose the most informative next viewpoint by only using the current image's visual information. The researchers created a synthetic dataset and a framework that trains these models to select viewpoints, leading to better question-answering performance and improved accuracy in existing scene-exploration-based systems....

Read More

Towards Interactive Intelligence for Digital Humans

Published at 2025-12-15

#ML

The authors propose a new approach for digital humans that can express personality, adapt to interactions, and evolve over time. They introduce Mio, a comprehensive framework with five modules for cognitive reasoning and real-time multimodal embodiment, which outperforms current methods in evaluating interactive intelligence....

Read More

Towards Scalable Pre-training of Visual Tokenizers for Generation

Published at 2025-12-15

#ML

The research proposes a new framework, VTP, to improve the quality of visual tokenizers for image generation by focusing on high-level semantics instead of pixel-level accuracy. Through extensive experimentation, the study finds that better understanding leads to better generation and that the proposed method scales more effectively with compute, parameters, and data, resulting in faster convergence and improved performance in downstream image generation tasks....

Read More

Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Published at 2025-12-15

#ML

This study presents a new benchmark suite, Video Reality Test, which evaluates the realism of AI-generated ASMR videos with tight audio-visual coupling. Experimental results reveal that while the best AI-generated videos can deceive most video models, human experts can still distinguish them with high accuracy, and superficial cues can mislead models, highlighting limitations in video models' perceptual fidelity and audio-visual consistency....

Read More


Tags are generated by Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit the developer's social media:

Fb X In