🤗 Daily Paper (2025-12-09)


deep.di...@gmail.com
Dec 9, 2025, 3:10:21 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you find some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue

Published at 2025-12-03

#ML

The study presents a new framework called DZ-TDPO to help long-context dialogue systems better handle changes in user intent over time, without losing important historical context. This method improves performance on a large-scale chat dataset, and the researchers found that larger models can handle this task more efficiently, maintaining their general capabilities....

Read More

ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

Published at 2025-12-03

#ML

The research presents a new framework called ReCamDriving that creates realistic videos using only camera data and 3D renderings, which provides more accurate control and detail compared to existing methods. They also introduce a strategy to improve the quality of the generated videos and a new dataset for training, which results in better performance than current technologies....

Read More

The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

Published at 2025-12-04

#ML

The study highlights the significant differences between SAM2 and SAM3, two Segment Anything Models. SAM2 uses spatial prompts for geometric and temporal segmentation, while SAM3 incorporates a unified vision-language architecture for open-vocabulary reasoning and concept understanding, marking a shift towards concept-driven segmentation....

Read More

EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing

Published at 2025-12-05

#ML

This research introduces EgoEdit, a system for editing videos taken from a first-person perspective that runs in real time on a single graphics card. The system is designed to handle the rapid motion and frequent object interactions typical of such videos; it preserves hands well and follows editing instructions, outperforming existing methods on egocentric editing tasks while remaining comparable on general editing tasks....

Read More

Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning

Published at 2025-12-06

#ML

This research explores using Reinforcement Learning to improve decoding-based regression, which converts regression into a sequence generation task using large language models. The proposed method aligns discrete token-level objectives with continuous numerical values, enhancing precision and generalization, and outperforms existing token-level baselines and traditional regression heads in various experiments....
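
To make that alignment concrete, here is a minimal toy sketch (ours, not the paper's implementation) of why a sequence-level numeric reward differs from token-level supervision when a number is emitted digit by digit:

    # Toy illustration with made-up helpers: decoding-based regression
    # emits a number as a digit-token sequence; an RL-style reward can
    # score numeric distance, unlike per-token cross-entropy.

    def decode(tokens):
        """Map digit tokens like ['3', '.', '1', '5'] to a float."""
        return float("".join(tokens))

    def reward(pred_tokens, target):
        # Continuous reward usable by, e.g., REINFORCE: near misses earn
        # small penalties even when individual tokens are "wrong".
        return -abs(decode(pred_tokens) - target)

    # "3.15" and "8.14" each differ from "3.14" by a single token, so a
    # token-level loss penalizes them similarly; the numeric reward does not.
    print(reward(list("3.15"), 3.14))  # ~ -0.01
    print(reward(list("8.14"), 3.14))  # ~ -5.0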

Read More

Embodied Referring Expression Comprehension in Human-Robot Interaction

Published at 2025-12-06

#ML

This study presents the Refer360 dataset, a large-scale collection of embodied verbal and nonverbal interactions in various indoor and outdoor settings, addressing limitations of existing datasets. Additionally, the researchers introduce MuRes, a multimodal guided residual module that improves embodied referring expression comprehension in human-robot interaction by extracting and reinforcing salient signals from pre-trained representations....

Read More

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Published at 2025-12-06

#ML

The researchers created a comprehensive toolbox called OmniSafeBench-MM to evaluate the security of multimodal language models. This toolbox includes various attack methods, defense strategies, and a diverse dataset to test the models' vulnerability to jailbreak attacks, providing a standardized foundation for future research in this area....

Read More

Rethinking Training Dynamics in Scale-wise Autoregressive Generation

Published at 2025-12-06

#ML

The study investigates the exposure bias issue in scale-wise autoregressive models used for image generation and proposes a new method called Self-Autoregressive Refinement (SAR). SAR improves generation quality by aligning train-test patterns and providing adequate supervision for self-generated contexts, resulting in a consistent improvement in image generation with minimal computational overhead....
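
The blurb does not spell the mechanism out, so the following is a hedged sketch of the general idea as we read it (all names and shapes are illustrative): during training, occasionally condition later scales on the model's own detached predictions rather than ground truth, so that self-generated contexts also receive supervision.

    import torch

    class TinyScaleAR(torch.nn.Module):
        """Toy stand-in for a scale-wise autoregressive generator."""
        def __init__(self, dim=16):
            super().__init__()
            self.proj = torch.nn.Linear(dim, dim)

        def forward(self, context):
            if not context:                      # coarsest scale: no context
                return torch.zeros(1, self.proj.out_features)
            return self.proj(torch.stack(context).mean(0))

    def train_step(model, scales, p_self=0.5):
        """scales: ground-truth tensors, coarsest first."""
        loss, context = 0.0, []
        for gt in scales:
            pred = model(context)
            loss = loss + torch.nn.functional.mse_loss(pred, gt)
            # Always appending gt (pure teacher forcing) never exposes the
            # model to its own mistakes -- the exposure bias named above.
            # Mixing in detached self-predictions aligns train-test patterns.
            own = torch.rand(()) < p_self
            context.append(pred.detach() if own else gt)
        return loss

    scales = [torch.randn(1, 16) for _ in range(4)]
    print(float(train_step(TinyScaleAR(), scales)))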

Read More

VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Published at 2025-12-06

#ML

The authors present a new framework called VG-Refiner that improves tool-integrated visual reasoning, specifically in referring and grounding tasks, by addressing the issue of unreliable tool outputs. They introduce a two-stage mechanism for analyzing and responding to tool feedback, along with a refinement reward to encourage corrections, and propose new metrics for evaluating model performance....

Read More

Vector Quantization using Gaussian Variational Autoencoder

Published at 2025-12-06

#ML

The study presents a new method called Gaussian Quant (GQ) that simplifies the process of creating a vector quantized variational autoencoder (VQ-VAE) without the need for training. This technique generates random Gaussian noise as a codebook, compares it to the posterior mean, and ensures a small quantization error by meeting a specific constraint. Additionally, the study introduces a heuristic called target divergence constraint (TDC) to improve the training of Gaussian VAEs for GQ, resulting ...
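
The construction is simple enough to sketch directly; this is a minimal illustration with made-up sizes, not the paper's code:

    import numpy as np

    # Training-free quantization as described above: draw a random
    # Gaussian codebook and snap each Gaussian-VAE posterior mean to its
    # nearest codeword. A larger codebook shrinks the quantization error.
    rng = np.random.default_rng(0)
    d, K = 8, 4096                            # latent dim, codebook size
    codebook = rng.standard_normal((K, d))    # random Gaussian codebook

    def gaussian_quant(z):
        """Quantize posterior means z of shape (N, d) to nearest codewords."""
        d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        return codebook[idx], idx

    z = rng.standard_normal((5, d))           # stand-in for posterior means
    zq, idx = gaussian_quant(z)
    print("mean squared quantization error:", ((z - zq) ** 2).sum(-1).mean())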

Read More

Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

Published at 2025-12-07

#ML

The paper presents a new framework called DoGe that helps vision-language models learn more effectively in data-scarce environments. DoGe does this by focusing on problem context, decoupling the learning process, and creating a diverse training data pipeline, which allows the models to generalize better and outperform baseline methods in various benchmarks....

Read More

DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Published at 2025-12-07

#ML

The researchers present a new framework called DoVer that improves debugging of large language model-based multi-agent systems by actively testing hypotheses through targeted actions and focusing on outcomes rather than just attributing errors to specific agents or steps. This approach enhances system reliability and paves the way for more effective debugging methods in the future....

Read More

Scaling Zero-Shot Reference-to-Video Generation

Published at 2025-12-07

#ML

This study presents Saber, a new method for generating videos from text prompts and reference images without needing expensive, hard-to-scale data. Saber uses a masked training strategy and a special attention-based model to create consistent and accurate videos, and it performs better than other methods on a standard benchmark....

Read More

Small-Gain Nash: Certified Contraction to Nash Equilibria in Differentiable Games

Published at 2025-12-07

#ML

This study presents Small-Gain Nash (SGN), a new method that uses local curvature and player interaction bounds to ensure convergence to Nash equilibria in differentiable games, even when traditional methods fail due to non-monotonicity....
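
The summary names the ingredients (local curvature plus interaction bounds) without the formula, so here is a standard two-player small-gain condition of the kind such certificates use; the paper's precise statement may differ:

    % Two-player smooth game: player 1 minimizes f_1(x, y) over x and
    % player 2 minimizes f_2(x, y) over y via simultaneous gradient play.
    % Assume local curvature lower bounds and cross-interaction bounds:
    \[
      \nabla^2_{xx} f_1 \succeq \mu_1 I, \quad
      \nabla^2_{yy} f_2 \succeq \mu_2 I, \quad
      \|\nabla^2_{xy} f_1\| \le L_{12}, \quad
      \|\nabla^2_{yx} f_2\| \le L_{21}.
    \]
    % A small-gain condition requires the loop gain of the couplings to
    % be less than one; the game need not be monotone overall:
    \[
      \frac{L_{12}}{\mu_1} \cdot \frac{L_{21}}{\mu_2} < 1
      \quad\Longrightarrow\quad
      \text{gradient play contracts locally to the Nash equilibrium.}
    \]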

Read More

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Published at 2025-12-07

#ML

VideoVLA is a method that uses large video generation models to predict action sequences and visual outcomes based on language instructions and images. This approach allows robots to better generalize to new tasks, objects, and settings by imagining future visual consequences of their actions....

Read More

Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

Published at 2025-12-08

#ML

The researchers propose a new method to improve Long-Context LLMs by using both real and imaginary components of complex-valued representations in Rotary Position Embeddings, which helps preserve more positional information and enhances performance, especially for longer contexts....
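
For context, a hedged reading of "real and imaginary components": RoPE already rotates query/key feature pairs as complex numbers, and the standard attention score keeps only the real part of the resulting complex inner product; the imaginary part carries equally relative-position-dependent information. A small numpy illustration (ours, not the paper's code):

    import numpy as np

    def rope_complex(x, pos, theta=10000.0):
        """Pair dims of x (even length d) into complex numbers and rotate
        each by a position-dependent phase, i.e., RoPE in complex form."""
        z = x[0::2] + 1j * x[1::2]
        freqs = theta ** (-np.arange(z.size) * 2.0 / x.size)
        return z * np.exp(1j * pos * freqs)

    rng = np.random.default_rng(0)
    q, k = rng.standard_normal(8), rng.standard_normal(8)
    for m, n in [(3, 1), (7, 5)]:             # same relative offset m - n
        score = np.sum(rope_complex(q, m) * np.conj(rope_complex(k, n)))
        # Both parts depend only on m - n; vanilla RoPE keeps score.real.
        print(round(score.real, 6), round(score.imag, 6))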

Read More

Distribution Matching Variational AutoEncoder

Published at 2025-12-08

#ML

The study presents a new method called DMVAE that aligns the latent distribution of an encoder with a chosen reference distribution, improving upon traditional VAEs by allowing more flexibility in the choice of distribution. The researchers find that using reference distributions derived from self-supervised learning leads to better image reconstruction and modeling efficiency on the ImageNet dataset....
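
The blurb does not give the objective, so here is one generic way to implement distribution matching in latent space: an MMD penalty between encoder samples and reference samples (as in WAE/InfoVAE-style objectives); the paper's actual loss may differ.

    import torch

    def rbf_mmd2(a, b, sigma=1.0):
        """Biased RBF-kernel MMD^2 between sample sets a, b of shape (N, d)."""
        def k(x, y):
            return torch.exp(-torch.cdist(x, y).pow(2) / (2 * sigma ** 2))
        return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

    z_enc = torch.randn(256, 32) * 1.5 + 0.5   # stand-in for encoder latents
    z_ref = torch.randn(256, 32)               # chosen reference distribution
    match_loss = rbf_mmd2(z_enc, z_ref)        # added to reconstruction loss
    print(float(match_loss))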

Read More

Group Representational Position Encoding

Published at 2025-12-08

#ML

GRAPE is a unified framework for positional encoding that combines two mechanisms: Multiplicative GRAPE, which uses multiplicative rotations in SO(d), and Additive GRAPE, which uses additive logit biases from unipotent actions in GL. This framework provides a structured way to design positional geometry for long-context models, encompassing RoPE and ALiBi as special cases....
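
To ground the two mechanisms named above, here is a toy contrast between a multiplicative, rotation-based logit (the RoPE-style special case) and an additive, distance-penalized logit (the ALiBi-style special case); illustrative only:

    import numpy as np

    def rotate(v, angle):
        """2-D rotation -- the simplest SO(d) block used by RoPE-style PEs."""
        c, s = np.cos(angle), np.sin(angle)
        return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

    q, k = np.array([1.0, 0.2]), np.array([0.7, -0.3])
    m, n, slope = 9, 4, 0.1                # query/key positions, ALiBi slope

    mult_logit = rotate(q, m * 0.5) @ rotate(k, n * 0.5)   # multiplicative
    add_logit = q @ k - slope * abs(m - n)                 # additive bias
    print(mult_logit, add_logit)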

Read More

LongCat-Image Technical Report

Published at 2025-12-08

#ML

The LongCat-Image model is a new open-source, Chinese-English foundation model for image generation that addresses key challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility. It offers superior text rendering, including Chinese characters, and remarkable photorealism, all in a compact, efficient design with an extensive open-source ecosystem for developers and researchers....

Read More

Multi-view Pyramid Transformer: Look Coarser to See Broader

Published at 2025-12-08

#ML

The authors present a new method called Multi-view Pyramid Transformer (MVP) that efficiently reconstructs large 3D scenes from many images. It works by gradually expanding its view from local to global and refining details, allowing for quick and accurate reconstruction of complex scenes, which they demonstrate on various datasets....

Read More

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Published at 2025-12-08

#ML

The authors present a new method called Native Parallel Reasoner (NPR) that allows large language models to develop their own parallel reasoning capabilities without external guidance. NPR uses three innovative techniques to enable this, resulting in significant performance improvements and inference speedups on various reasoning benchmarks....

Read More

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Published at 2025-12-08

#ML

This study presents a controlled experimental framework for investigating the interplay of pre-training, mid-training, and RL in reasoning language models. The results reveal that RL enhances performance when pre-training leaves room for improvement and when RL data targets the model's edge of competence, and that mid-training significantly boosts performance compared to RL alone....

Read More

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Published at 2025-12-08

#ML

The authors present FAE, a straightforward method to adjust pre-trained visual representations into compact latent spaces for image generation, using a single attention layer. This approach effectively combines two decoders for feature reconstruction and image generation, achieving strong performance on various benchmarks, including near state-of-the-art FID scores on ImageNet 256x256....
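
A hedged sketch of the overall shape, with made-up names and sizes (the paper's architecture details may differ): freeze a pretrained visual encoder and map its patch features through a single attention layer into a compact latent; two decoders (omitted here) then reconstruct features and generate pixels from that latent.

    import torch
    from torch import nn

    class OneLayerAdapter(nn.Module):
        """Single-attention-layer adapter over frozen encoder features."""
        def __init__(self, feat_dim=768, latent_dim=32, n_latent=64):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(n_latent, feat_dim))
            self.attn = nn.MultiheadAttention(feat_dim, 8, batch_first=True)
            self.down = nn.Linear(feat_dim, latent_dim)    # compact latent

        def forward(self, feats):                          # (B, N, feat_dim)
            q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
            pooled, _ = self.attn(q, feats, feats)         # cross-attend
            return self.down(pooled)                       # (B, n_latent, latent_dim)

    feats = torch.randn(2, 196, 768)       # stand-in for frozen ViT features
    print(OneLayerAdapter()(feats).shape)  # torch.Size([2, 64, 32])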

Read More

Relational Visual Similarity

Published at 2025-12-08

#ML

This study introduces a new way to measure similarity between images based on their internal relationships or functions, rather than just their visible attributes. The researchers created a dataset of 114k images with captions describing their underlying relational logic and used it to train a model that can measure relational similarity between images, which could have various real-world applications....

Read More

Unified Video Editing with Temporal Reasoner

Published at 2025-12-08

#ML

The study presents a new method called VideoCoF that improves video editing by adding a step for the model to think about the edits it needs to make, resulting in more precise and mask-free editing. The method also allows for better motion alignment and longer video editing with less training data, outperforming existing techniques....

Read More

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Published at 2025-12-08

#ML

The study presents UnityVideo, a new framework that improves video generation by jointly learning from various data types, such as segmentation masks and depth maps, across multiple training tasks. UnityVideo uses techniques like dynamic noising and a modality switcher to process different data types together, resulting in better video quality and alignment with real-world objects....

Read More

Voxify3D: Pixel Art Meets Volumetric Rendering

Published at 2025-12-08

#ML

The authors present a new method called Voxify3D that converts 3D models into pixel art by combining 3D mesh optimization with 2D pixel art supervision. It achieves this through three main components: perspective distortion elimination, semantic preservation, and discrete color space optimization, resulting in high-quality pixel art with better control over aesthetics and abstraction levels....

Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Facebook · X · LinkedIn