🤗 Daily Paper(2025-12-02)


deep.di...@gmail.com

Dec 2, 2025, 3:08:17 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Published at 2025-11-21

#ML

The study finds that current text-to-image models often produce culturally neutral or English-biased results when given multilingual prompts. To address this, the authors propose a method to localize culture-sensitive signals in the models and introduce two strategies that improve the cultural consistency of generated images without compromising quality or diversity....

Read More

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Published at 2025-11-23

#ML

This study offers a detailed guide on code Large Language Models (LLMs), exploring their life cycle and analyzing their capabilities. It identifies gaps between academic research and real-world code tasks, suggesting practical research directions....

Read More

Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Published at 2025-11-25

#ML

This study presents Flash-DMD, a new framework that significantly speeds up the training of diffusion models while maintaining high image quality, and also stabilizes the fine-tuning process using reinforcement learning, resulting in state-of-the-art generation quality with fewer computation steps....

Read More

Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Published at 2025-11-25

#ML

The authors present a new framework, ∞-RoPE, which overcomes limitations in current video diffusion models. This framework enables longer, more controlled, and cinematic video generation through three components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Experiments show that ∞-RoPE outperforms previous models in overall VBench scores....

Read More

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Published at 2025-11-25

#ML

The authors present LongVT, a framework that allows for more accurate video reasoning by mimicking human-like video analysis - starting with a broad overview and then focusing on specific clips. They also introduce a new dataset, VideoSIAH, to train and evaluate this framework, which outperforms existing methods in long-video understanding and reasoning tasks....

Read More

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

Published at 2025-11-25

#ML

The authors present ImageCritic, a method to fix inconsistencies in generated images by using a reference image and a new dataset. ImageCritic can automatically detect and correct inconsistencies in complex scenarios, improving detail accuracy in various customized generation tasks....

Read More

Asking like Socrates: Socrates helps VLMs understand remote sensing images

Published at 2025-11-27

#ML

The study presents a new approach called RS-EoT to improve the understanding of remote sensing images by vision-language models, which were previously prone to 'pseudo reasoning'. This is achieved through a language-driven, iterative visual evidence-seeking paradigm and a two-stage progressive RL strategy, resulting in state-of-the-art performance and genuine evidence-grounded reasoning....

Read More

Structured Extraction from Business Process Diagrams Using Vision-Language Models

Published at 2025-11-27

#ML

This study develops a method to extract structured information from business process diagrams (BPMN) directly from images using Vision-Language Models (VLMs), even when source files are missing. The authors enhance text recognition with OCR, find that it improves performance across various models, and characterize its impact through statistical analyses and prompt-ablation studies....

Read More

Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories

Published at 2025-11-28

#ML

The proposed Rectified MeanFlow framework enables one-step generation by modeling the mean velocity field along rectified trajectories with just a single reflow step, improving sample quality and training efficiency over previous methods on ImageNet at various resolutions....

Read More

LFM2 Technical Report

Published at 2025-11-28

#ML

The LFM2 family is a collection of compact, efficient large language models designed for on-device use. These models achieve faster processing speeds and stronger task capabilities than similarly sized models by using a hardware-in-the-loop architecture search and a tempered, decoupled Top-K knowledge distillation objective....

Read More

OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

Published at 2025-11-28

#ML

The study presents OmniFusion, a system that combines multimodal foundation models with translation language models to perform simultaneous multilingual translations with improved latency and quality by leveraging both audio and visual inputs....

Read More

Doppler-Enhanced Deep Learning: Improving Thyroid Nodule Segmentation with YOLOv5 Instance Segmentation

Published at 2025-11-29

#ML

This study explores using YOLOv5 algorithms to accurately identify thyroid nodules in ultrasound images, which could help in creating AI-assisted tools for doctors. The authors found that adding Doppler images, which doctors usually don't use for this task, greatly improves the accuracy of nodule detection, making the process faster and more reliable....

Read More

IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages

Published at 2025-11-29

#ML

The authors created IndicParam, a large test for language models on less-studied Indic languages like Nepali, Gujarati, and Sanskrit. They tested 19 language models and found that even the best one only got about half the questions right, showing that more work is needed to improve language models for these languages....

Read More

POLARIS: Projection-Orthogonal Least Squares for Robust and Adaptive Inversion in Diffusion Models

Published at 2025-11-29

#ML

The study investigates the Inversion-Denoising Paradigm used in diffusion models for image editing and restoration, identifying an overlooked factor causing reconstruction degradation: the approximate noise error. The researchers propose POLARIS, a method that reformulates inversion to address this error by treating the guidance scale as a step-wise variable, significantly improving inversion latent quality with minimal performance overhead....

Read More

SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Published at 2025-11-29

#ML

The SCALE framework selectively allocates computational resources to challenging sub-problems in mathematical reasoning, improving performance and efficiency compared to uniform resource distribution methods....
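The idea of selective allocation can be sketched in a few lines. This is a toy illustration of the general principle (spend more of a fixed sample budget on sub-problems judged difficult), not SCALE's actual algorithm; the function name and the proportional-to-difficulty rule are assumptions for illustration only.

```python
# Toy sketch of selective test-time compute allocation: split a fixed
# sample budget across sub-problems in proportion to estimated difficulty.

def allocate_samples(difficulties, total_budget):
    """Assign each sub-problem a share of the budget proportional to its
    difficulty score, guaranteeing at least one sample per sub-problem."""
    weight = sum(difficulties)
    return [max(1, round(total_budget * d / weight)) for d in difficulties]

# Three sub-problems, the second judged hardest.
print(allocate_samples([0.1, 0.7, 0.2], 10))  # → [1, 7, 2]
```

A uniform baseline would spend the same number of samples everywhere; the point of selective allocation is that the hard sub-problem gets most of the budget.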

Read More

SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs

Published at 2025-11-29

#ML

The authors propose SpeContext, a new algorithm and system for long-context reasoning in language models. It reduces parameters by 90% and improves throughput by up to 24.89 times in cloud environments and 10.06 times in edge environments with minimal accuracy loss....

Read More

What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards

Published at 2025-11-29

#ML

The authors present NewtonRewards, a framework for improving physical realism in video generation. By using measurable proxies from generated videos, such as optical flow for velocity and high-level appearance features for mass, NewtonRewards enforces Newtonian laws of motion, resulting in more realistic and smooth videos compared to existing methods....
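The core idea of a verifiable physics reward can be illustrated with a toy example: estimate per-frame velocities from vertical positions and reward trajectories whose acceleration stays close to gravity. This is a minimal sketch of the concept, not the paper's reward; the function name, tolerance, and finite-difference scheme are assumptions.

```python
# Toy "Newtonian" reward: given per-frame vertical positions of a falling
# object, check that the finite-difference acceleration is close to -g.

def gravity_reward(heights, dt=1.0, g=9.8, tol=1.0):
    """Return the fraction of frame transitions whose estimated
    acceleration is within `tol` of free-fall acceleration -g."""
    # First differences give per-step velocities.
    v = [(heights[i + 1] - heights[i]) / dt for i in range(len(heights) - 1)]
    # Second differences give per-step accelerations; compare to -g.
    errs = [abs((v[i + 1] - v[i]) / dt + g) for i in range(len(v) - 1)]
    return sum(e < tol for e in errs) / len(errs)

# A genuine free-fall trajectory h(t) = 100 - 4.9 t^2 scores perfectly.
print(gravity_reward([100 - 4.9 * t * t for t in range(4)]))  # → 1.0
```

In the paper's setting the positions would come from measurable proxies such as optical flow rather than ground-truth coordinates; the sketch only shows how a verifiable reward can be computed from them.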

Read More

Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models

Published at 2025-11-29

#ML

The authors present a new method called Wikontic for creating knowledge graphs from open-domain text, which results in high-quality, compact, and well-connected graphs. This approach improves the quality of the generated knowledge graphs and is more efficient than existing methods, making it easier to use structured knowledge in large language models....

Read More

WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing

Published at 2025-11-29

#ML

The authors present WiseEdit, a new benchmark for evaluating advanced image editing models that consider cognition and creativity. WiseEdit assesses models' abilities in awareness, interpretation, and imagination steps, and includes tasks requiring various knowledge types, revealing the limitations of current state-of-the-art models in knowledge-based reasoning and creative composition....

Read More

Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

Published at 2025-11-30

#ML

The authors present a new method called Streaming Token Compression (STC) to improve the speed and efficiency of streaming video large language models. STC reduces the processing time for similar frames and compresses visual token sequences, leading to significant reductions in latency and memory usage without sacrificing accuracy....
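The "reduce processing for similar frames" idea can be sketched as a simple streaming filter: keep a frame's tokens only when they differ enough from the last kept frame. This is a toy illustration of the general principle, not STC itself; the similarity measure and threshold are assumptions for illustration only.

```python
# Toy streaming compression: drop frames whose token lists are nearly
# identical to the most recently kept frame.

def similarity(a, b):
    """Fraction of positions where two token lists agree (toy metric)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def compress_stream(frames, threshold=0.9):
    """Keep a frame only if it is sufficiently different from the last
    kept frame, shrinking the token sequence the LLM must process."""
    kept = []
    for f in frames:
        if not kept or similarity(kept[-1], f) < threshold:
            kept.append(f)
    return kept

# A repeated frame is dropped; a changed frame is kept.
print(compress_stream([[1, 2, 3], [1, 2, 3], [4, 5, 6]]))
```

Real systems compare visual embeddings rather than raw token IDs, but the control flow (compare against the last kept frame, skip near-duplicates) is the same.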

Read More

Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

Published at 2025-11-30

#ML

A study compared three generalist large language models (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) with two clinical AI systems (OpenEvidence and UpToDate) and found that the generalist models performed better on medical benchmarks. The clinical tools lacked in areas like completeness, communication quality, and safety reasoning, highlighting the need for independent evaluation of clinical AI systems....

Read More

Learning Eigenstructures of Unstructured Data Manifolds

Published at 2025-11-30

#ML

The authors present a new method that learns a spectral basis for analyzing shapes and manifolds directly from unstructured data, without the need for traditional methods. This approach, based on optimal-approximation theory, can approximate the Laplacian operator and its eigendecomposition, and works for any dataset without assuming anything about the data manifold....

Read More

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

Published at 2025-11-30

#ML

The study presents Lotus-2, a new framework for predicting geometric properties in images using a powerful image generative model. It consists of two stages: a core predictor that generates coherent structures and a detail sharpener that refines fine-grained geometry, outperforming existing methods with significantly less training data....

Read More

Seeing the Wind from a Falling Leaf

Published at 2025-11-30

#ML

This research introduces a new method to understand invisible forces, like wind, by observing moving objects in videos. They created a system that can learn about object shapes, properties, and interactions from videos, allowing it to infer the forces at play. This system has potential uses in creating and editing physics-based videos, helping to connect the fields of computer vision and physics....

Read More

VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

Published at 2025-11-30

#ML

The study presents VLASH, a new framework for Vision-Language-Action models that enables smooth, fast, and accurate real-time control without additional overhead or changes to the existing architecture. VLASH improves speed and reduces reaction latency compared to traditional methods and allows VLAs to perform high-precision tasks like playing ping-pong and whack-a-mole....

Read More

Agentic Policy Optimization via Instruction-Policy Co-Evolution

Published at 2025-12-01

#ML

This study presents INSPO, a framework that dynamically optimizes instructions for autonomous agents in reinforcement learning, improving their reasoning capability and performance on multi-turn retrieval and reasoning tasks compared to static instruction-based baselines....

Read More

CauSight: Learning to Supersense for Visual Causal Discovery

Published at 2025-12-01

#ML

The researchers created a new dataset of over 32,000 images with causal relationships and a model called CauSight to help AI understand cause-and-effect in visual scenarios. CauSight, trained with a mix of data curation, reasoning synthesis, and reinforcement learning, performs better than GPT-4.1 in visual causal discovery, providing a significant performance improvement....

Read More

ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling

Published at 2025-12-01

#ML

The authors present a new method called ChronosObserver that creates high-quality and synchronized 3D videos without the need for training or fine-tuning. This is achieved by using a concept called World State Hyperspace to represent the spatial and temporal relationships in a 4D scene, and then using Hyperspace Guided Sampling to align the video generation process across multiple viewpoints....

Read More

DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models

Published at 2025-12-01

#ML

The authors present DreamingComics, a framework that improves story visualization by maintaining artistic consistency and controlling subject positioning. It uses a pretrained video model and a new positional encoding scheme to enhance identity and style consistency, and integrates an LLM-based layout generator for flexible layout conditioning, resulting in significant improvements in character consistency, style similarity, and spatial accuracy....

Read More

Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

Published at 2025-12-01

#ML

The study presents Envision, a benchmark for evaluating models that generate multi-image sequences based on textual descriptions, focusing on causal event progression. The researchers introduce Envision-Score, a metric to assess the consistency, realism, and aesthetics of these sequences, and find that unified multimodal models perform better in understanding causal narratives than specialized text-to-image models, but struggle with maintaining spatiotemporal consistency....

Read More

GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

Published at 2025-12-01

#ML

The authors propose GR-RL, a robotic learning framework that enhances a generalist vision-language-action policy for long-horizon dexterous manipulation tasks. GR-RL uses a multi-stage training pipeline with demonstration filtering, augmentation, and reinforcement learning to improve performance and generalization, allowing a robot to autonomously lace up a shoe with high success rate....

Read More

Generative Video Motion Editing with 3D Point Tracks

Published at 2025-12-01

#ML

The authors describe a new method for editing camera and object movements in videos by using 3D point tracks, which provide depth information to enable precise and context-aware edits. This approach allows for diverse motion edits, such as joint camera and object manipulation, motion transfer, and non-rigid deformation, enhancing creative possibilities in video editing....

Read More

HiconAgent: History Context-aware Policy Optimization for GUI Agents

Published at 2025-12-01

#ML

The researchers present HiconAgent, a GUI agent that efficiently uses historical information for sequential navigation tasks. It does this through History Context-aware Policy Optimization (HCPO), which includes Dynamic Context Sampling and Anchor-guided History Compression, allowing the agent to adapt to relevant context and maintain efficiency, resulting in strong performance on various benchmarks....

Read More

How Far Are We from Genuinely Useful Deep Research Agents?

Published at 2025-12-01

#ML

This study presents FINDER, an enhanced benchmark for deep research agents that focuses on generating comprehensive reports with standardized structure and analytical depth. The researchers also introduce DEFT, the first failure taxonomy for deep research agents, which identifies 14 fine-grained failure modes, revealing that current agents struggle with evidence integration, verification, and reasoning-resilient planning....

Read More

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Published at 2025-12-01

#ML

The study presents a new framework called InternVideo-Next to improve general video foundation models without relying on video-text supervision. They address issues in previous methods by separating semantic abstraction from pixel-level details and incorporating reliable semantic priors, resulting in a more accurate and efficient model that outperforms existing ones....

Read More

MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification

Published at 2025-12-01

#ML

The authors propose a new method for speech detection and phoneme classification using a compact Conformer model applied to raw MEG signals, achieving top-10 performance in the LibriBrain 2025 PNPL competition. They introduce MEG-specific augmentation, class weighting, and instance-level normalization techniques to improve model robustness and accuracy....

Read More

OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic

Published at 2025-12-01

#ML

The authors present a new method called OpenREAD that uses a language model to improve end-to-end autonomous driving by focusing on open-ended reasoning and decision-making, which leads to better performance in understanding scenes and planning routes compared to existing methods....

Read More

PromptBridge: Cross-Model Prompt Transfer for Large Language Models

Published at 2025-12-01

#ML

The research presents a method called PromptBridge to solve the problem of 'Model Drifting', where prompts designed for one language model perform poorly on another. PromptBridge allows for effective prompt transfer between models without needing per-task or per-model re-optimization, reducing migration effort and improving accuracy....

Read More

Rectifying LLM Thought from Lens of Optimization

Published at 2025-12-01

#ML

This study examines the reasoning processes of large language models (LLMs) and introduces a new method called RePro to improve their performance. RePro assesses and optimizes LLM reasoning by defining a surrogate objective function and utilizing a dual scoring mechanism, which is integrated into reinforcement learning pipelines to enhance reasoning performance and reduce suboptimal behaviors across various tasks and models....

Read More

Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

Published at 2025-12-01

#ML

The paper presents a new method called Script for reducing memory usage and inference time in multimodal large language models. Script removes visually redundant tokens and preserves query-relevant visual information, improving performance on image and video understanding tasks without requiring model retraining....

Read More

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Published at 2025-12-01

#ML

The study presents a new approach for reinforcement learning using large language models, explaining why and how a surrogate token-level objective can optimize the true sequence-level reward in policy gradient methods. They demonstrate the importance of minimizing training-inference discrepancy and policy staleness for this surrogate to work effectively, and provide guidelines for stable RL training through extensive experiments with a 30B model....

Read More

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Published at 2025-12-01

#ML

The study presents StreamGaze, a new benchmark for evaluating models' ability to use human gaze signals for understanding streaming videos in real-time. StreamGaze assesses models' performance in tracking shifting attention, inferring user intentions, and making proactive predictions, revealing significant gaps between current models and human performance....

Read More

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Published at 2025-12-01

#ML

The study presents TUNA, a new model for handling multimodal data that combines image and video understanding and generation in a single framework, eliminating the need for separate encoders and improving performance over previous methods. Experiments show that TUNA outperforms other models in tasks like image and video understanding, generation, and editing, highlighting the benefits of its unified representation design....

Read More

The Art of Scaling Test-Time Compute for Large Language Models

Published at 2025-12-01

#ML

This study is the first large-scale analysis of test-time scaling strategies for large language models, covering eight models and four datasets. The main findings are: no single strategy works best for all cases, reasoning models perform differently based on problem difficulty and length, and optimal performance improves with more compute budget for a given model type. The research offers a guide to choosing the best test-time scaling strategy based on problem difficulty, model type, and compute...

Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Facebook · X · LinkedIn