🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Published at 2025-11-21
#ML
The study finds that current text-to-image models often produce culturally neutral or English-biased results when given multilingual prompts. To address this, the authors propose a method to localize culture-sensitive signals in the models and introduce two strategies to improve cultural consistency in generated images without compromising their quality and diversity....
Read More

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Published at 2025-11-23
#ML
This study offers a detailed guide to code Large Language Models (LLMs), exploring their life cycle and analyzing their capabilities. It identifies gaps between academic research and real-world code tasks, suggesting practical research directions....
Read More

Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
Published at 2025-11-25
#ML
This study presents Flash-DMD, a new framework that significantly speeds up distillation training of diffusion models while maintaining high image quality, and also stabilizes the fine-tuning process using reinforcement learning, resulting in state-of-the-art generation quality with fewer sampling steps....
Read More

Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Published at 2025-11-25
#ML
The authors present a new framework, Infinity-RoPE, which overcomes limitations in current video diffusion models. This framework enables longer, more controlled, and cinematic video generation through three components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Experiments show that Infinity-RoPE outperforms previous models in overall VBench scores....
Read More
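
The three components named above build on rotary position embeddings (RoPE). As background only, here is a minimal sketch of standard RoPE applied to a per-head query or key tensor; the paper's Block-Relativistic RoPE, KV Flush, and RoPE Cut are not reproduced, and the pairing convention below is one common choice rather than the authors'.

```python
# Minimal sketch of standard rotary position embeddings (RoPE), background only.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, dim) with dim even; rotates channel pairs by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per channel pair, decaying geometrically as in the RoPE paper.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied pair-wise; attention between rotated queries and keys
    # then depends only on their relative positions.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)      # 16 tokens, one 64-dim attention head
q_rot = apply_rope(q)
```

Because scores between rotated queries and keys depend only on relative offsets, methods in this family can re-index or reset those offsets during long autoregressive rollouts.
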
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Published at 2025-11-25
#ML
The authors present LongVT, a framework that allows for more accurate video reasoning by mimicking human-like video analysis: starting with a broad overview and then focusing on specific clips. They also introduce a new dataset, VideoSIAH, to train and evaluate this framework, which outperforms existing methods in long-video understanding and reasoning tasks....
Read More

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Published at 2025-11-25
#ML
The authors present ImageCritic, a method to fix inconsistencies in generated images by using a reference image and a new dataset. ImageCritic can automatically detect and correct inconsistencies in complex scenarios, improving detail accuracy in various customized generation tasks....
Read More

Asking like Socrates: Socrates helps VLMs understand remote sensing images
Published at 2025-11-27
#ML
The study presents a new approach called RS-EoT to improve the understanding of remote sensing images by vision-language models, which were previously prone to 'pseudo reasoning'. This is achieved through a language-driven, iterative visual evidence-seeking paradigm and a two-stage progressive RL strategy, resulting in state-of-the-art performance and genuine evidence-grounded reasoning....
Read More

Structured Extraction from Business Process Diagrams Using Vision-Language Models
Published at 2025-11-27
#ML
This study develops a method to extract structured information from business process diagrams (BPMN) directly from images using Vision-Language Models (VLMs), even when source files are missing. They enhance text recognition with OCR and find that it improves performance in various models, providing a better understanding of its impact through statistical analyses and prompt ablation studies....
Read More
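
To make the OCR-assisted setup concrete, here is a hypothetical sketch that runs an off-the-shelf OCR pass and passes the recognized labels alongside the diagram image to a vision-language model. `query_vlm` and the prompt wording are placeholders and assumptions, not the paper's pipeline.

```python
# Hypothetical sketch: OCR-augmented prompting of a VLM for BPMN extraction.
# `query_vlm` is a stand-in for whatever vision-language model API is used.
from PIL import Image
import pytesseract

def extract_bpmn(image_path: str, query_vlm) -> str:
    image = Image.open(image_path)
    # Off-the-shelf OCR recovers label text that VLMs often misread.
    ocr_text = pytesseract.image_to_string(image)
    prompt = (
        "Extract the BPMN process as structured JSON with tasks, events, "
        "gateways, and sequence flows. OCR-recognized labels:\n" + ocr_text
    )
    return query_vlm(image=image, prompt=prompt)
```
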
Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories
Published at 2025-11-28
#ML
The proposed Rectified MeanFlow framework enables one-step generation by modeling the mean velocity field along rectified trajectories with just a single reflow step, improving sample quality and training efficiency over previous methods on ImageNet at various resolutions....
Read More
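
For readers unfamiliar with mean-velocity models, the sketch below shows what one-step sampling looks like when a network u_theta predicts the average velocity along a rectified (straightened) trajectory. The time convention, the network, and its training are assumptions, and the single reflow step mentioned in the summary is omitted.

```python
# Sketch of one-step sampling with a mean-velocity ("MeanFlow"-style) model.
# Convention assumed here: z_t = (1 - t) * x + t * noise, so t = 1 is pure noise.
# u_theta(z, r, t) is assumed to predict the average velocity between times r and t.
import torch

@torch.no_grad()
def one_step_sample(u_theta, shape, device="cpu"):
    z1 = torch.randn(shape, device=device)      # start from pure noise (t = 1)
    t = torch.ones(shape[0], device=device)     # current time
    r = torch.zeros(shape[0], device=device)    # target time
    # A single jump along the average velocity covers the whole trajectory:
    # x_hat = z1 - (t - r) * u_theta(z1, r, t), and t - r = 1 here.
    return z1 - u_theta(z1, r, t)
```
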
LFM2 Technical Report
Published at 2025-11-28
#ML
The LFM2 family is a collection of compact, efficient large language models designed for on-device use. These models achieve faster processing speeds and stronger task capabilities than similarly sized models by using a hardware-in-the-loop architecture search and a tempered, decoupled Top-K knowledge distillation objective....
Read More
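
The summary names a "tempered, decoupled Top-K knowledge distillation objective" without details. As a rough, generic illustration of Top-K distillation (not LFM2's exact loss), the sketch below keeps only the teacher's K largest logits, softens both distributions with a temperature, and minimizes a KL divergence.

```python
# Illustrative top-K knowledge distillation loss (not LFM2's exact objective).
import torch
import torch.nn.functional as F

def topk_kd_loss(student_logits, teacher_logits, k=32, temperature=2.0):
    # Restrict the distillation target to the teacher's top-K vocabulary entries.
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    student_topk = student_logits.gather(-1, topk_idx)
    # Temperature-softened distributions over the selected entries.
    teacher_probs = F.softmax(topk_vals / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_topk / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is customary in distillation.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * temperature ** 2
```

Restricting the target to the top-K entries keeps the distillation signal focused on plausible tokens rather than the teacher's long tail.
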
OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
Published at 2025-11-28
#ML
The study presents OmniFusion, a system that combines multimodal foundation models with translation language models to perform simultaneous multilingual translations with improved latency and quality by leveraging both audio and visual inputs....
Read More

Doppler-Enhanced Deep Learning: Improving Thyroid Nodule Segmentation with YOLOv5 Instance Segmentation
Published at 2025-11-29
#ML
This study explores using YOLOv5 algorithms for accurately identifying thyroid nodules in ultrasound images, which could help in creating AI-assisted tools for doctors. The authors found that adding Doppler images, which doctors usually don't use, greatly improves the accuracy of nodule detection, making the process faster and more reliable....
Read More

IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages
Published at 2025-11-29
#ML
The authors created IndicParam, a large benchmark for language models on less-studied Indic languages like Nepali, Gujarati, and Sanskrit. They tested 19 language models and found that even the best one only got about half the questions right, showing that more work is needed to improve language models for these languages....
Read More

POLARIS: Projection-Orthogonal Least Squares for Robust and Adaptive Inversion in Diffusion Models
Published at 2025-11-29
#ML
The study investigates the Inversion-Denoising Paradigm used in diffusion models for image editing and restoration, identifying an overlooked factor causing reconstruction degradation: the approximate noise error. The researchers propose POLARIS, a method that reformulates inversion to address this error by treating the guidance scale as a step-wise variable, significantly improving inversion latent quality with minimal performance overhead....
Read More

SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling
Published at 2025-11-29
#ML
The SCALE framework selectively allocates computational resources to challenging sub-problems in mathematical reasoning, improving performance and efficiency compared to uniform resource distribution methods....
Read More
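
The summary does not say how selective allocation is implemented. One hypothetical illustration of the general idea (not SCALE's algorithm): probe each sub-problem with a few samples and spend the remaining budget where the probe answers disagree.

```python
# Hypothetical difficulty-aware budget allocation across sub-problems.
# Not SCALE's actual method; illustrates "selective allocation" in general.
from collections import Counter

def allocate_budget(subproblems, solve, probe_n=4, total_budget=64):
    """subproblems: list of sub-problem prompts; solve(sp) returns a short answer string."""
    difficulty = []
    for sp in subproblems:
        answers = [solve(sp) for _ in range(probe_n)]
        # Low agreement among the probe answers -> treat the sub-problem as hard.
        agreement = Counter(answers).most_common(1)[0][1] / probe_n
        difficulty.append(1.0 - agreement)
    remaining = max(0, total_budget - probe_n * len(subproblems))
    total_diff = sum(difficulty) or 1.0
    # Harder sub-problems receive proportionally more of the remaining samples.
    return [probe_n + int(remaining * d / total_diff) for d in difficulty]
```
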
SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs
Published at 2025-11-29
#ML
The authors propose SpeContext, a new algorithm and system for long-context reasoning in language models. It reduces parameters by 90% and improves throughput by up to 24.89 times in cloud environments and 10.06 times in edge environments with minimal accuracy loss....
Read More

What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
Published at 2025-11-29
#ML
The authors present NewtonRewards, a framework for improving physical realism in video generation. By using measurable proxies from generated videos, such as optical flow for velocity and high-level appearance features for mass, NewtonRewards enforces Newtonian laws of motion, resulting in more realistic and smooth videos compared to existing methods....
Read More
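
As a hypothetical illustration of a measurable-proxy reward of this kind (not the paper's exact formulation), one can convert optical-flow velocities for a tracked object into physical units and reward agreement with constant gravitational acceleration.

```python
# Hypothetical gravity-consistency reward from optical-flow velocities.
# Scale factors and the exponential shaping are arbitrary assumptions.
import numpy as np

def gravity_reward(vertical_velocity, fps=24.0, g=9.81, pixels_per_meter=100.0):
    """vertical_velocity: per-frame vertical velocity of a tracked object in pixels/frame
    (e.g., averaged optical flow inside its mask), up-positive convention assumed."""
    v = np.asarray(vertical_velocity) * fps / pixels_per_meter   # -> meters per second
    accel = np.diff(v) * fps                                     # finite-difference acceleration
    # Free fall should show roughly constant downward acceleration of magnitude g.
    error = np.abs(accel - (-g)).mean()
    return float(np.exp(-error))   # reward in (0, 1], higher = more Newtonian
```
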
Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models
Published at 2025-11-29
#ML
The authors present a new method called Wikontic for creating knowledge graphs from open-domain text, which results in high-quality, compact, and well-connected graphs. This approach improves the quality of the generated knowledge graphs and is more efficient than existing methods, making it easier to use structured knowledge in large language models....
Read More

WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
Published at 2025-11-29
#ML
The authors present WiseEdit, a new benchmark for evaluating advanced image editing models that consider cognition and creativity. WiseEdit assesses models' abilities in awareness, interpretation, and imagination steps, and includes tasks requiring various knowledge types, revealing the limitations of current state-of-the-art models in knowledge-based reasoning and creative composition....
Read More

Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Published at 2025-11-30
#ML
The authors present a new method called Streaming Token Compression (STC) to improve the speed and efficiency of streaming video large language models. STC reduces the processing time for similar frames and compresses visual token sequences, leading to significant reductions in latency and memory usage without sacrificing accuracy....
Read More
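
To make the two ideas in the summary concrete, here is a generic sketch (not STC's actual implementation) that skips near-duplicate frames and merges consecutive visual tokens; the threshold and merge factor are arbitrary assumptions.

```python
# Illustrative streaming-side token reduction: frame skipping + token merging.
import torch
import torch.nn.functional as F

def keep_frame(prev_feat: torch.Tensor, curr_feat: torch.Tensor, sim_threshold: float = 0.95) -> bool:
    # Keep a frame only if its global feature differs enough from the last kept frame.
    sim = F.cosine_similarity(prev_feat.flatten(), curr_feat.flatten(), dim=0)
    return bool(sim < sim_threshold)

def merge_tokens(tokens: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """tokens: (num_tokens, dim). Average every `factor` consecutive tokens."""
    n, d = tokens.shape
    n_trim = (n // factor) * factor
    return tokens[:n_trim].reshape(n_trim // factor, factor, d).mean(dim=1)
```
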
Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks
Published at 2025-11-30
#ML
A study compared three generalist large language models (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) with two clinical AI systems (OpenEvidence and UpToDate) and found that the generalist models performed better on medical benchmarks. The clinical tools fell short in areas such as completeness, communication quality, and safety reasoning, highlighting the need for independent evaluation of clinical AI systems....
Read More

Learning Eigenstructures of Unstructured Data Manifolds
Published at 2025-11-30
#ML
The authors present a new method that learns a spectral basis for analyzing shapes and manifolds directly from unstructured data, without the need for traditional methods. This approach, based on optimal-approximation theory, can approximate the Laplacian operator and its eigendecomposition, and works for any dataset without assuming anything about the data manifold....
Read More
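
For reference, the eigenstructure in question is that of the Laplace-Beltrami operator on the data manifold; a learned spectral basis targets the standard eigenproblem below (background notation, not the paper's training objective).

```latex
% Laplace-Beltrami eigenproblem defining a spectral basis {phi_i} on a manifold M
-\Delta_{\mathcal{M}}\,\phi_i = \lambda_i\,\phi_i, \qquad
0 \le \lambda_0 \le \lambda_1 \le \lambda_2 \le \cdots, \qquad
\langle \phi_i, \phi_j \rangle_{L^2(\mathcal{M})} = \delta_{ij}.
```
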
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
Published at 2025-11-30
#ML
The study presents Lotus-2, a new framework for predicting geometric properties in images using a powerful image generative model. It consists of two stages: a core predictor that generates coherent structures and a detail sharpener that refines fine-grained geometry, outperforming existing methods with significantly less training data....
Read More

Seeing the Wind from a Falling Leaf
Published at 2025-11-30
#ML
This research introduces a new method to understand invisible forces, like wind, by observing moving objects in videos. They created a system that can learn about object shapes, properties, and interactions from videos, allowing it to infer the forces at play. This system has potential uses in creating and editing physics-based videos, helping to connect the fields of computer vision and physics....
Read More

VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
Published at 2025-11-30
#ML
The study presents VLASH, a new framework for Vision-Language-Action models that enables smooth, fast, and accurate real-time control without additional overhead or changes to the existing architecture. VLASH improves speed and reduces reaction latency compared to traditional methods and allows VLAs to perform high-precision tasks like playing ping-pong and whack-a-mole....
Read More

Agentic Policy Optimization via Instruction-Policy Co-Evolution
Published at 2025-12-01
#ML
This study presents INSPO, a framework that dynamically optimizes instructions for autonomous agents in reinforcement learning, improving their reasoning capability and performance on multi-turn retrieval and reasoning tasks compared to static instruction-based baselines....
Read More

CauSight: Learning to Supersense for Visual Causal Discovery
Published at 2025-12-01
#ML
The researchers created a new dataset of over 32,000 images with causal relationships and a model called CauSight to help AI understand cause-and-effect in visual scenarios. CauSight, trained with a mix of data curation, reasoning synthesis, and reinforcement learning, performs better than GPT-4.1 in visual causal discovery, providing a significant performance improvement....
Read More

![]() |
ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling |
Published at 2025-12-01 |
|
#ML
|
The authors present a new method called ChronosObserver that creates high-quality and synchronized 3D videos without the need for training or fine-tuning. This is achieved by using a concept called World State Hyperspace to represent the spatial and temporal relationships in a 4D scene, and then using Hyperspace Guided Sampling to align the video generation process across multiple viewpoints.... |
Read More |
|
|
|
|
![]() |
DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models |
Published at 2025-12-01 |
|
#ML
|
The authors present DreamingComics, a framework that improves story visualization by maintaining artistic consistency and controlling subject positioning. It uses a pretrained video model and a new positional encoding scheme to enhance identity and style consistency, and integrates an LLM-based layout generator for flexible layout conditioning, resulting in significant improvements in character consistency, style similarity, and spatial accuracy.... |
Read More |
|
|
|
![]() |
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights |
Published at 2025-12-01 |
|
#ML
|
The study presents Envision, a benchmark for evaluating models that generate multi-image sequences based on textual descriptions, focusing on causal event progression. The researchers introduce Envision-Score, a metric to assess the consistency, realism, and aesthetics of these sequences, and find that unified multimodal models perform better in understanding causal narratives than specialized text-to-image models, but struggle with maintaining spatiotemporal consistency.... |
Read More |
|
|
|
|
![]() |
GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation |
Published at 2025-12-01 |
|
#ML
|
The authors propose GR-RL, a robotic learning framework that enhances a generalist vision-language-action policy for long-horizon dexterous manipulation tasks. GR-RL uses a multi-stage training pipeline with demonstration filtering, augmentation, and reinforcement learning to improve performance and generalization, allowing a robot to autonomously lace up a shoe with high success rate.... |
Read More |
|
|
|
![]() |
Generative Video Motion Editing with 3D Point Tracks |
Published at 2025-12-01 |
|
#ML
|
The authors describe a new method for editing camera and object movements in videos by using 3D point tracks, which provide depth information to enable precise and context-aware edits. This approach allows for diverse motion edits, such as joint camera and object manipulation, motion transfer, and non-rigid deformation, enhancing creative possibilities in video editing.... |
Read More |
|
|
|
|
![]() |
HiconAgent: History Context-aware Policy Optimization for GUI Agents |
Published at 2025-12-01 |
|
#ML
|
The researchers present HiconAgent, a GUI agent that efficiently uses historical information for sequential navigation tasks. It does this through History Context-aware Policy Optimization (HCPO), which includes Dynamic Context Sampling and Anchor-guided History Compression, allowing the agent to adapt to relevant context and maintain efficiency, resulting in strong performance on various benchmarks.... |
Read More |
|
|
|
![]() |
How Far Are We from Genuinely Useful Deep Research Agents? |
Published at 2025-12-01 |
|
#ML
|
This study presents FINDER, an enhanced benchmark for deep research agents that focuses on generating comprehensive reports with standardized structure and analytical depth. The researchers also introduce DEFT, the first failure taxonomy for deep research agents, which identifies 14 fine-grained failure modes, revealing that current agents struggle with evidence integration, verification, and reasoning-resilient planning.... |
Read More |
|
|
|
|
![]() |
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision |
Published at 2025-12-01 |
|
#ML
|
The study presents a new framework called InternVideo-Next to improve general video foundation models without relying on video-text supervision. They address issues in previous methods by separating semantic abstraction from pixel-level details and incorporating reliable semantic priors, resulting in a more accurate and efficient model that outperforms existing ones.... |
Read More |
|
|
|
![]() |
MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification |
Published at 2025-12-01 |
|
#ML
|
The authors propose a new method for speech detection and phoneme classification using a compact Conformer model applied to raw MEG signals, achieving top-10 performance in the LibriBrain 2025 PNPL competition. They introduce MEG-specific augmentation, class weighting, and instance-level normalization techniques to improve model robustness and accuracy.... |
Read More |
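
Of the listed techniques, instance-level normalization is the most self-contained; a minimal sketch (the exact form used in the paper is an assumption) standardizes each raw MEG window per channel using its own statistics.

```python
# Illustrative instance-level normalization for raw MEG windows.
# Each example is standardized per channel with its own mean and std.
import numpy as np

def instance_normalize(meg: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """meg: (channels, time) raw MEG window for a single example."""
    mean = meg.mean(axis=1, keepdims=True)
    std = meg.std(axis=1, keepdims=True)
    return (meg - mean) / (std + eps)
```
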
OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic
Published at 2025-12-01
#ML
The authors present a new method called OpenREAD that uses a language model to improve end-to-end autonomous driving by focusing on open-ended reasoning and decision-making, which leads to better performance in understanding scenes and planning routes compared to existing methods....
Read More

PromptBridge: Cross-Model Prompt Transfer for Large Language Models
Published at 2025-12-01
#ML
The research presents a method called PromptBridge to solve the problem of 'Model Drifting', where prompts designed for one language model perform poorly on another. PromptBridge allows for effective prompt transfer between models without needing per-task or per-model re-optimization, reducing migration effort and improving accuracy....
Read More

Rectifying LLM Thought from Lens of Optimization
Published at 2025-12-01
#ML
This study examines the reasoning processes of large language models (LLMs) and introduces a new method called RePro to improve their performance. RePro assesses and optimizes LLM reasoning by defining a surrogate objective function and utilizing a dual scoring mechanism, which is integrated into reinforcement learning pipelines to enhance reasoning performance and reduce suboptimal behaviors across various tasks and models....
Read More

Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
Published at 2025-12-01
#ML
The paper presents a new method called Script for reducing memory usage and inference time in multimodal large language models. Script removes visually redundant tokens and preserves query-relevant visual information, improving performance on image and video understanding tasks without requiring model retraining....
Read More
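
As a generic illustration of query-conditioned token pruning (not Script's graph-structured method), the sketch below scores visual tokens against a text-query embedding and keeps the most relevant fraction in their original order.

```python
# Generic query-conditioned visual token pruning (illustration only).
import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_tokens: torch.Tensor, query_embedding: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """visual_tokens: (N, d); query_embedding: (d,). Keep tokens most similar to the query."""
    scores = F.cosine_similarity(visual_tokens, query_embedding.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    keep_idx = scores.topk(k).indices.sort().values   # preserve original token order
    return visual_tokens[keep_idx]
```
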
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
Published at 2025-12-01
#ML
The study presents a new approach for reinforcement learning using large language models, explaining why and how a surrogate token-level objective can optimize the true sequence-level reward in policy gradient methods. They demonstrate the importance of minimizing training-inference discrepancy and policy staleness for this surrogate to work effectively, and provide guidelines for stable RL training through extensive experiments with a 30B model....
Read More
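
As background for the claim that a token-level surrogate can optimize a sequence-level reward, recall the standard policy-gradient identity, which factorizes the sequence-level objective into token-level log-probability terms; the paper's specific surrogate, importance correction, and staleness handling are not reproduced here.

```latex
% Sequence-level objective and its token-level policy gradient
J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y) \big],
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ R(x, y)
      \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right) \right].
```
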
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
Published at 2025-12-01
#ML
The study presents StreamGaze, a new benchmark for evaluating models' ability to use human gaze signals for understanding streaming videos in real-time. StreamGaze assesses models' performance in tracking shifting attention, inferring user intentions, and making proactive predictions, revealing significant gaps between current models and human performance....
Read More

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Published at 2025-12-01
#ML
The study presents TUNA, a new model for handling multimodal data that combines image and video understanding and generation in a single framework, eliminating the need for separate encoders and improving performance over previous methods. Experiments show that TUNA outperforms other models in tasks like image and video understanding, generation, and editing, highlighting the benefits of its unified representation design....
Read More

The Art of Scaling Test-Time Compute for Large Language Models
Published at 2025-12-01
#ML
This study is the first large-scale analysis of test-time scaling strategies for large language models, covering eight models and four datasets. The main findings are: no single strategy works best for all cases, reasoning models perform differently based on problem difficulty and length, and optimal performance improves with more compute budget for a given model type. The research offers a guide to choosing the best test-time scaling strategy based on problem difficulty, model type, and compute budget....
Read More
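
As a reminder of what one of the compared test-time scaling strategies looks like, here is a minimal self-consistency (majority-vote) sketch; `generate_answer` is a placeholder for any sampled decoding of the model, and nothing here reflects the paper's specific recommendations.

```python
# Minimal self-consistency: sample N answers and return the majority vote.
from collections import Counter

def self_consistency(generate_answer, question, n_samples=16):
    answers = [generate_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```
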
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media