🤗 Daily Paper (2025-06-06)

deep.di...@gmail.com

Jun 6, 2025, 4:08:11 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily papers

Geometry-Editable and Appearance-Preserving Object Composition

Published at 2025-05-27

#ML

The study presents a new model, DGAD, that can edit an object's geometry while preserving its fine-grained appearance details. DGAD uses semantic embeddings to understand the object's shape and a cross-attention mechanism to match the appearance features with the edited geometry, allowing for precise editing and faithful appearance preservation....

Read More

Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Published at 2025-05-29

#ML

This study presents a method for teaching Large Language Models (LLMs) to understand and maintain contextual integrity (CI) when sharing information. The approach involves reasoning about CI and a reinforcement learning framework, which significantly reduces inappropriate information disclosure while preserving task performance....

Read More

Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

Published at 2025-05-29

#ML

This study improves 3D occupancy prediction for autonomous driving by using diffusion models, which learn the data distribution and incorporate 3D scene priors, leading to more accurate and consistent predictions compared to existing methods, especially in challenging conditions....

Read More

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Published at 2025-05-29

#ML

The authors present a new method named VideoREPA that improves the physics accuracy in text-to-video models by learning from video understanding foundation models. They introduce a new loss function, TRD, which helps align the token-level relations, enabling more realistic and physics-plausible video generation....

Read More

Autoregressive Images Watermarking through Lexical Biasing: An Approach Resistant to Regeneration Attack

Published at 2025-06-01

#ML

The authors present a new watermarking method called Lexical Bias Watermarking (LBW) that is specifically designed for autoregressive image generation models, which create images by predicting tokens sequentially. LBW embeds watermarks directly into the token maps, making it resistant to regeneration attacks and secure against white-box attacks, by randomly selecting a green list of tokens for each image and analyzing the token distribution for watermark detection....

Read More
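The green-list idea above follows the general recipe of statistical watermarks. As an illustrative sketch only (not the paper's implementation — the function names and the z-score detector are assumptions), detection can check whether a token stream contains more green-list tokens than chance allows:

```python
import math
import random

def green_list(seed: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudo-randomly select a 'green' subset of the token vocabulary."""
    rng = random.Random(seed)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def detect_watermark(tokens, seed: int, vocab_size: int, gamma: float = 0.5) -> float:
    """Z-score: how far the observed green-token count lies above what an
    unwatermarked stream would produce by chance (gamma * len(tokens))."""
    greens = green_list(seed, vocab_size, gamma)
    hits = sum(1 for t in tokens if t in greens)
    n = len(tokens)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A z-score of, say, 4 or more is strong evidence the watermark is present, while unwatermarked streams score near zero.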

SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

Published at 2025-06-01

#ML

The authors propose a unified framework called SkyReels-Audio for creating realistic and long talking portrait videos using pretrained video diffusion transformers. The framework allows for diverse and controllable conditioning through multimodal inputs like text and images, and employs various strategies to ensure high-quality, coherent, and accurate video generation....

Read More

What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Published at 2025-06-01

#ML

The study investigates how well self-supervised Wav2Vec2 models represent Dutch language features and finds that pre-training specifically on Dutch leads to better encoding of Dutch linguistic aspects compared to training on English or multilingual data. This improvement in language-specific representation is linked to enhanced performance in Automatic Speech Recognition tasks....

Read More

BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations

Published at 2025-06-03

#ML

The authors present a new method for LiDAR-camera calibration that uses bird's-eye view features, which can be extracted from raw data and fused into a shared feature space. This approach, called BEVCALIB, is more efficient and accurate than traditional methods, demonstrating state-of-the-art performance on various datasets and outperforming other baselines by a significant margin....

Read More

FlexPainter: Flexible and Multi-View Consistent Texture Generation

Published at 2025-06-03

#ML

The authors present FlexPainter, a new texture generation system that allows for flexible, multi-modal control and ensures consistency across multiple views. It uses a shared space for different input types, decomposes structural and style info, and employs 3D knowledge to generate seamless, high-quality texture maps, outperforming existing methods....

Read More

Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach

Published at 2025-06-03

#ML

The authors propose a new method for interpreting whole-body CT images, focusing on abnormalities. They collaborated with radiologists to create a classification system, gathered a large dataset of CT images with annotations, developed a model that can automatically identify and describe abnormalities, and established benchmarks for evaluating the model's performance....

Read More

RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

Published at 2025-06-03

#ML

The paper presents a new method called RobustSplat to improve 3D rendering by addressing the issue of artifacts caused by transient objects. It does this through two key designs: a strategy to delay Gaussian growth and a mask bootstrapping approach that uses lower-resolution feature similarity for initial transient mask estimation before moving to high-resolution supervision. This results in more accurate and robust 3D renderings, as shown in experiments on challenging datasets....

Read More

SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Published at 2025-06-03

#ML

The proposed framework generates hand-object interaction videos and motion simultaneously by combining visual information and physical laws in a synchronized process, improving consistency and eliminating the need for predefined models or explicit pose guidance. Experimental results show that this method outperforms existing approaches in generating high-quality, dynamic sequences that can generalize to unseen real-world scenarios....

Read More

StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

Published at 2025-06-03

#ML

The authors present a new method called StreamBP that efficiently handles the memory issues of training language models on long sequence data, improving the maximum sequence length for backpropagation and reducing computational costs, making it applicable for complex tasks....

Read More
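The summary doesn't detail StreamBP's mechanism, but the general principle behind sequence-chunked backpropagation can be illustrated: when the training loss decomposes as a sum over positions, its gradient can be accumulated chunk by chunk, so only one chunk's intermediates must be live at a time. A toy stdlib-only sketch (illustrative, not the paper's code):

```python
def grad_full(w, xs, ys):
    """Gradient of L(w) = sum_i (w*x_i - y_i)^2 over the whole sequence;
    computing it this way keeps every term's intermediates alive at once."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys))

def grad_streamed(w, xs, ys, chunk=4):
    """The same gradient, accumulated chunk by chunk: only one chunk's
    intermediate values need to be held in memory at any moment."""
    g = 0.0
    for i in range(0, len(xs), chunk):
        xc, yc = xs[i:i + chunk], ys[i:i + chunk]
        g += sum(2 * x * (w * x - y) for x, y in zip(xc, yc))
    return g
```

Because the per-chunk gradients sum exactly to the full-sequence gradient, the streamed version is exact, not an approximation — which matches the "exact backpropagation" claim in the title.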

Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

Published at 2025-06-03

#ML

The authors describe Surfer-H, a cost-efficient web agent that uses AI to complete tasks given by the user, and Holo1, a new family of open-weight models designed for navigating the web and finding information. By combining Surfer-H with Holo1, the agent outperforms others in accuracy and cost-effectiveness, and the researchers are sharing their data and models to help advance the field of AI systems....

Read More

Images are Worth Variable Length of Representations

Published at 2025-06-04

#ML

The authors present a new method called DOVE that creates different numbers of visual tokens for images depending on their complexity, unlike traditional vision encoders that use a fixed number of tokens. This approach allows DOVE to use fewer tokens while maintaining high image quality and performs better than other methods in various tasks, especially when using fewer tokens....

Read More

Language-Image Alignment with Fixed Text Encoders

Published at 2025-06-04

#ML

This study explores if a pre-trained large language model can serve as a text encoder for learning language-image alignment, without the need for joint training with an image encoder. The proposed method, LIFT, outperforms CLIP in scenarios involving complex concepts and long descriptions, while also being more efficient....

Read More

MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Published at 2025-06-04

#ML

MedAgentGym is a new training platform for improving coding-based medical reasoning in large language models, containing 72,413 tasks from real-world biomedical scenarios. Testing over 30 LLMs showed that Med-Copilot-7B significantly improved its performance with MedAgentGym, becoming a competitive and affordable alternative to gpt-4o....

Read More

Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

Published at 2025-06-04

#ML

The study presents a new method called RACRO that improves multi-modal reasoning by aligning visual information with language-based reasoning. This approach enhances the accuracy of visual grounding and ensures better performance in tasks like math and science problems by optimizing the captioning process....

Read More

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Published at 2025-06-04

#ML

The authors present RoboRefer, a 3D-aware vision language model that improves spatial understanding and reasoning for robotic interaction with complex 3D scenes. They introduce a large-scale dataset and a challenging benchmark, demonstrating that RoboRefer outperforms existing models in spatial referring tasks, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy....

Read More

Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Published at 2025-06-04

#ML

The study derives scaling laws for comparing language-vision learning procedures, CLIP and MaMMUT, across various scales and tasks, finding MaMMUT to be more scalable and sample-efficient. The research provides a systematic method for comparing and improving open foundation models and datasets, releasing pre-trained models and code for reproducibility....

Read More
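Scaling-law studies of this kind typically fit a power law, loss ≈ a · compute^(−b), to measured points; in log-log space this is a straight line. A minimal illustrative fit (not the paper's methodology — the helper name is an assumption):

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b*log(compute),
    i.e. loss ≈ a * compute**(-b). Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope
```

Comparing two model families then reduces to comparing their fitted exponents b: the steeper the slope, the more the loss improves per unit of extra compute.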

Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Published at 2025-06-04

#ML

This study examines how two popular watermarking methods affect the truthfulness, safety, and helpfulness of four language models, discovering that these methods can cause the models to either become too cautious or not cautious enough. The researchers then propose a new method called Alignment Resampling to fix these issues, which successfully recovers the models' original alignment while keeping the watermark detectable....

Read More

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Published at 2025-06-05

#ML

The authors present CG-AV-Counting, a new benchmark for clue-grounded audio-visual counting in MLLMs, which addresses the limitations of existing benchmarks. They also introduce AV-Reasoner, a model trained to improve counting ability using reinforcement learning and curriculum learning, which outperforms current models on multiple benchmarks but struggles with out-of-domain benchmarks....

Read More

Aligning Latent Spaces with Flow Priors

Published at 2025-06-05

#ML

The authors propose a new method for aligning latent spaces to target distributions using flow-based generative models. This technique simplifies the process by eliminating expensive likelihood evaluations and avoiding ODE solving, and it has been validated through both theoretical proofs and empirical image generation experiments....

Read More

ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Published at 2025-06-05

#ML

ComfyUI-Copilot is a new tool that uses AI to help users easily create workflows on the ComfyUI platform for AI-driven art. It offers smart suggestions, automated construction, and reduces complexity, making it easier for both beginners and experienced users to use ComfyUI effectively....

Read More

Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Published at 2025-06-05

#ML

The study presents Diagonal Batching, a scheduling scheme that enhances the performance of Recurrent Memory Transformers (RMTs) for long-context inference by enabling parallelism and eliminating sequential execution. This method leads to faster GPU inference, reduced inference cost, and lower latency, making RMTs more practical for real-world applications without requiring retraining....

Read More
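The scheduling idea can be sketched: in a Recurrent Memory Transformer, cell (layer l, segment s) depends on (l−1, s) for activations and on (l, s−1) for the recurrent memory, so all cells on the same anti-diagonal l + s are independent and can run in parallel. An illustrative enumeration of that schedule (the dependency pattern as stated here is an assumption, not the paper's code):

```python
def diagonal_schedule(n_layers: int, n_segments: int):
    """Group (layer, segment) cells by anti-diagonal. Every cell with the
    same l + s has all its dependencies in earlier diagonals, so each
    returned wave can be executed in parallel."""
    waves = []
    for d in range(n_layers + n_segments - 1):
        waves.append([(l, d - l)
                      for l in range(n_layers)
                      if 0 <= d - l < n_segments])
    return waves
```

With 3 layers and 4 segments, strictly sequential execution takes 12 steps, while the diagonal schedule finishes in 3 + 4 − 1 = 6 waves.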

EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Published at 2025-06-05

#ML

EOC-Bench is a new benchmark for testing how well multimodal large language models understand objects in dynamic, first-person scenarios. It has detailed questions about objects in the past, present, and future, and measures the models' performance in a novel way, helping to improve their ability to understand and interact with objects in real-world situations....

Read More

Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

Published at 2025-06-05

#ML

The study finds that the performance of certain popular reasoning models can vary greatly depending on small changes in how they're tested, making it hard to trust their claimed improvements. To address this, the researchers suggest a more consistent way to test these models and share their findings on the Deepseek-R1-Distill series models....

Read More

FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Published at 2025-06-05

#ML

The authors present FEAT, a novel Transformer model designed to create high-quality dynamic medical videos by addressing limitations in existing methods, such as insufficient channel interactions, high computational complexity, and coarse denoising guidance. FEAT uses a unified attention mechanism, a linear-complexity design, and a residual value guidance module, resulting in a model that performs as well as or better than state-of-the-art models with fewer parameters and better scalability....

Read More

FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Published at 2025-06-05

#ML

The authors present a new method called FlowDirector for editing videos based on text instructions without using complex techniques that often cause issues like inconsistent timing and loss of detail. This method directly edits the video in its original form, preserving its original structure and timing, and uses a technique to ensure that only the desired parts of the video are edited while keeping the rest unchanged. The method also improves the alignment of the edited video with the given tex...

Read More

FreeTimeGS: Free Gaussians at Anytime and Anywhere for Dynamic Scene Reconstruction

Published at 2025-06-05

#ML

The authors present a new method, FreeTimeGS, to improve the reconstruction of dynamic 3D scenes with complex motions. Instead of relying on deformation fields that are hard to optimize, FreeTimeGS uses a 4D representation that allows Gaussian primitives to appear at any time and location, providing more flexibility and reducing temporal redundancy, resulting in better rendering quality compared to recent methods....

Read More

Inference-Time Hyper-Scaling with KV Cache Compression

Published at 2025-06-05

#ML

This study presents a new method called Dynamic Memory Sparsification (DMS) that compresses the KV cache in Transformer LLMs, enabling more tokens to be generated within the same compute budget. The result is improved accuracy for similar inference runtime and memory load, as demonstrated on various LLM families, including a significant boost in performance for the Qwen-R1 32B model....

Read More
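The summary doesn't spell out DMS's compression rule; as a generic illustration of KV-cache sparsification (not the paper's method — the attention-mass scoring rule is an assumption), one can evict the cache slots that have received the least cumulative attention:

```python
def compress_kv_cache(keys, values, attn_scores, keep: int):
    """Keep the `keep` cache slots with the highest cumulative attention
    mass and evict the rest. keys/values are per-token cache entries and
    attn_scores[i] is the total attention token i has received so far."""
    ranked = sorted(range(len(keys)), key=lambda i: attn_scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve positional order of survivors
    return [keys[i] for i in kept], [values[i] for i in kept]
```

Shrinking the cache this way frees memory that can instead be spent on generating more tokens or sampling more candidates — the "hyper-scaling" trade-off the paper describes.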

Kinetics: Rethinking Test-Time Scaling Laws

Published at 2025-06-05

#ML

This study examines the efficiency of smaller models versus larger ones during inference and finds that larger models are more effective than previously thought, because earlier analyses overlooked memory access costs. The researchers propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget, resulting in improved performance compared to dense counterparts....

Read More

MARBLE: Material Recomposition and Blending in CLIP-Space

Published at 2025-06-05

#ML

The researchers have developed a method called MARBLE that allows for editing and blending material properties in images using CLIP-space. This technique enables fine-grained control over material attributes like roughness, metallicity, transparency, and glow, supports multiple edits in a single forward pass, and extends to material painting....

Read More

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Published at 2025-06-05

#ML

The study presents MINT-CoT, a method that improves mathematical reasoning in AI by integrating visual information into text-based reasoning steps. This approach, supported by a new dataset and training strategy, significantly outperforms existing models in solving math problems that involve visual elements....

Read More

Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning

Published at 2025-06-05

#ML

The authors propose a framework called Micro-Act to help large language models better handle contradictory information from external sources and their own knowledge. This framework breaks down complex comparisons into smaller, manageable steps, allowing the models to reason more effectively and improve question answering accuracy on various benchmark datasets....

Read More

PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Published at 2025-06-05

#ML

The study presents a new method called PATS that improves the accuracy of automated sports skill assessment by preserving complete fundamental movements within continuous temporal segments across multiple views. PATS outperforms existing methods in various sports domains, adapting to different activity characteristics for better skill evaluation....

Read More

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Published at 2025-06-05

#ML

The Qwen3 Embedding series is a new set of text embedding and reranking models that improves on its predecessor. These models use a training method built on the Qwen3 LLMs, come in several sizes for different tasks, achieve top results on many benchmarks, and are publicly available....

Read More

Rectified Point Flow: Generic Point Cloud Pose Estimation

Published at 2025-06-05

#ML

The authors present a new method called Rectified Point Flow that combines point cloud registration and shape assembly into a single problem. This approach learns the correct positions of points in a cloud without needing symmetry labels and outperforms existing methods on various benchmarks by leveraging shared geometric priors....

Read More

Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

Published at 2025-06-05

#ML

The study presents a new method called PM-Loss to enhance the quality of feed-forward 3D Gaussian Splatting using depth maps. PM-Loss, based on a pointmap from a pre-trained transformer, improves geometric smoothness and reduces fragmentation in point clouds, leading to better rendering results across different architectures and scenes....

Read More

Search Arena: Analyzing Search-Augmented LLMs

Published at 2025-06-05

#ML

The researchers created a large-scale dataset called Search Arena, which contains over 24,000 human-generated multi-turn conversations with search-augmented language models in various languages and topics. Their analysis shows that users prefer responses with more citations, even if they don't support the claims, and that community-driven platforms are generally more trusted than static encyclopedic sources. They also found that web search can improve performance in non-search settings but may b...

Read More

SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Published at 2025-06-05

#ML

The study presents SeedVR2, a one-step model for improving video restoration quality, which uses adversarial training against real data. The model features an adaptive window attention mechanism that adjusts dynamically to output resolutions, and a series of losses to enhance training stability and efficiency, as demonstrated through extensive experiments....

Read More

SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Published at 2025-06-05

#ML

The study examines how Multimodal Large Language Models process visual inputs and discovers that only a small percentage of attention heads actively contribute to visual understanding. They then introduce SparseMM, a strategy that optimizes computation for these visual heads, resulting in faster and more efficient model performance....

Read More

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Published at 2025-06-05

#ML

This work presents the Common Pile v0.1, an 8TB dataset of openly licensed text, addressing the need for high-quality, large datasets for training large language models (LLMs) without intellectual property concerns. The dataset, collected from 30 diverse sources, is validated by training two competitive LLMs, Comma v0.1-1T and Comma v0.1-2T, which perform as well as LLMs trained on unlicensed text with similar resources....

Read More

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Published at 2025-06-05

#ML

STARE is a new benchmark that tests multimodal large language models on tasks requiring multi-step visual simulations, like geometry and puzzles. The results show that while models are good at simple tasks, they struggle with complex ones, unlike humans who improve significantly with visual simulations....

Read More

Video World Models with Long-term Spatial Memory

Published at 2025-06-05

#ML

This study presents a new method to improve the consistency of video world models by using a long-term spatial memory system, similar to human memory. The proposed framework enhances the quality and context length of generated videos, outperforming existing models, by storing and retrieving information from a 3D memory mechanism....

Read More

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Published at 2025-06-05

#ML

The authors present VideoMathQA, a new benchmark for evaluating models' ability to reason through mathematical problems in video format. This benchmark covers various mathematical domains and requires models to integrate visual, audio, and textual information, reflecting real-world scenarios....

Read More

Tags are generated by Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Fb · X · In