🤗 Daily Paper (2025-05-29)

deep.di...@gmail.com

May 29, 2025, 4:13:18 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Published at 2025-05-18

#ML

The authors present Safe-Sora, a novel framework for embedding graphical watermarks in AI-generated videos, addressing the lack of invisible watermarking in video generation. They introduce a hierarchical mechanism for matching watermark patches to video frames and develop a 3D wavelet transform-enhanced Mamba architecture for spatiotemporal fusion, achieving state-of-the-art performance in video quality, watermark fidelity, and robustness....

Read More

Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex

Published at 2025-05-21

#ML

The study presents BraInCoRL, a transformer-based model that predicts neural responses using few-example learning, improving the generalizability and interpretability of voxelwise models of higher visual cortex without needing extensive data acquisition....

Read More

FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

Published at 2025-05-22

#ML

The study presents FS-DAG, a new model architecture designed for efficiently understanding visually complex documents with limited data. FS-DAG adapts to various document types, handles common issues like OCR errors, and is highly performant with fewer than 90M parameters, making it suitable for real-world applications with limited computational resources....

Read More

Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph

Published at 2025-05-23

#ML

This study presents HuggingKG, a large-scale knowledge graph built from the Hugging Face community, which captures domain-specific relations and rich textual attributes. The researchers also introduce HuggingBench, a multi-task benchmark with novel test collections for information retrieval tasks such as resource recommendation, classification, and tracing, revealing unique characteristics of the derived tasks....

Read More

First Finish Search: Efficient Test-Time Scaling in Large Language Models

Published at 2025-05-23

#ML

This study presents a new method called First Finish Search (FFS) for improving reasoning in large language models during inference. FFS is a straightforward, training-free technique that selects the shortest trace, which is more likely to be correct, and outperforms other methods in accuracy while reducing token usage and inference latency....
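The selection rule is simple enough to sketch directly. Below, the candidate traces and their format are toy assumptions; in practice the traces come from parallel decoding of the same prompt, and FFS commits to whichever finishes first — equivalently, the shortest:

```python
def first_finish_search(traces):
    """Return the trace that finishes first, i.e. the shortest one.

    FFS bets that among parallel decodes, the shortest completed
    reasoning trace is the most likely to be correct.
    """
    return min(traces, key=len)

# Toy usage: three candidate reasoning traces of different lengths.
candidates = [
    ["step"] * 40 + ["answer: 12"],
    ["step"] * 12 + ["answer: 7"],
    ["step"] * 25 + ["answer: 7"],
]
best = first_finish_search(candidates)
print(best[-1])  # final answer of the shortest trace: "answer: 7"
```

Because it is training-free and needs only trace lengths, this rule adds essentially no overhead on top of standard parallel sampling.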

Read More

Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems

Published at 2025-05-23

#ML

The study presents a scalable framework for improving enterprise search systems' accuracy in retrieving domain-specific information. By dynamically selecting challenging yet irrelevant documents, the proposed method enhances re-ranking models, achieving significant performance improvements on both proprietary and public domain-specific datasets....
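The core selection step — "challenging yet irrelevant" documents are those that score high under the retriever yet are not labeled relevant — can be sketched as below. The vectors, labels, and `k` are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, relevant_ids, k=2):
    """Select the k documents most similar to the query that are NOT
    labeled relevant: 'challenging yet irrelevant' hard negatives.
    Vectors are assumed L2-normalized, so dot product = cosine similarity."""
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores)          # most similar first
    hard = [int(i) for i in order if int(i) not in relevant_ids]
    return hard[:k]

# Toy corpus of 4 unit vectors; doc 0 is the labeled positive.
q = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.0],    # relevant document
    [0.9, 0.436],  # same topic, unlabeled -> hard negative
    [0.0, 1.0],    # easy negative
    [0.8, 0.6],    # moderately hard negative
])
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
hard_negs = mine_hard_negatives(q, docs, relevant_ids={0})
```

Training the re-ranker against such near-miss negatives, rather than random ones, is what sharpens its decision boundary on domain-specific queries.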

Read More

Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods

Published at 2025-05-23

#ML

The paper proposes a new method, model immunization, to prevent AI models from spreading false information. This method involves training AI models on a small, labeled set of falsehoods, similar to how vaccines work in humans, to strengthen their ability to detect and reject misleading claims without affecting their accuracy on true information....

Read More

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Published at 2025-05-23

#ML

This research proposes that token reduction in Transformer architectures should be viewed as a crucial principle in generative modeling, rather than just an efficiency strategy. The study suggests that token reduction can enhance multimodal integration, reduce hallucinations, maintain input coherence, and improve training stability in vision, language, and multimodal systems....

Read More

Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

Published at 2025-05-23

#ML

This study introduces DynToM, a new benchmark for testing large language models' ability to understand and track the changing mental states in social scenarios. The results show that these models struggle with this task, performing 44.7% worse than humans, especially when it comes to tracking shifts in mental states over time....

Read More

Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Published at 2025-05-24

#ML

The authors present a new framework called Chain-of-Zoom that enables super-resolution models to handle extreme magnifications beyond their training range. This is achieved by breaking down the task into smaller, manageable steps using a pre-trained model and supplementing each step with text prompts generated by a vision-language model, which are then aligned with human preferences. The result is a significant improvement in image quality and detail at magnifications up to 256 times the origina...
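The decomposition the summary describes — reaching an extreme scale by chaining a model trained for a modest fixed scale, with a fresh text prompt at each step — can be sketched as a loop. `sr_step` and `prompter` are hypothetical stand-ins for the pretrained super-resolution model and the vision-language prompt generator:

```python
def chain_of_zoom(image, target_scale, base_scale=4, sr_step=None, prompter=None):
    """Reach an extreme magnification by chaining a model trained for a
    modest fixed scale: each pass upscales by `base_scale`, optionally
    steered by a text prompt describing the current crop."""
    scale = 1
    while scale < target_scale:
        prompt = prompter(image) if prompter else None
        image = sr_step(image, prompt)
        scale *= base_scale
    return image, scale

# Toy run: 'image' is just its side length; each step multiplies it by 4,
# so 256x total magnification takes four chained 4x passes.
out, reached = chain_of_zoom(64, target_scale=256,
                             sr_step=lambda im, p: im * 4)
```

The per-step prompts are what keep the model grounded at scales far beyond its training range, since each pass only ever sees a 4x problem.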

Read More

GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Published at 2025-05-24

#ML

The authors present the GRE Suite, a new framework that enhances Visual Language Models for geo-localization tasks by incorporating structured reasoning chains. This framework, built on a high-quality dataset, a multi-stage reasoning model, and a comprehensive evaluation benchmark, outperforms existing methods in both coarse-grained and fine-grained location inference....

Read More

Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach

Published at 2025-05-24

#ML

The study presents PENGUIN, a benchmark for evaluating the safety of large language models (LLMs) in various sensitive domains, considering user-specific backgrounds. They also introduce RAISE, a framework that improves LLM safety by strategically acquiring user-specific information, resulting in up to 31.6% higher safety scores with minimal interaction cost....

Read More

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

Published at 2025-05-25

#ML

The authors present DeepResearchGym, an open-source platform that offers a cost-effective, reproducible, and transparent way to evaluate deep research systems. It uses a state-of-the-art dense retriever and approximate nearest neighbor search to index large public web corpora, providing lower latency and stable document rankings compared to commercial APIs. The platform also includes an extended benchmark for automatic evaluation of deep research systems' outputs, which has been validated throug...

Read More

Efficient Data Selection at Scale via Influence Distillation

Published at 2025-05-25

#ML

The authors present a new method called Influence Distillation that selects efficient data for training Large Language Models by calculating the impact of each sample on the target distribution. This approach is faster and performs as well or better than existing methods, making it a promising tool for LLM fine-tuning....
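Influence-style data selection typically scores a training sample by how well its gradient aligns with the gradient over the target distribution. A hedged sketch on a linear model with squared loss — the generic influence approximation, not necessarily the paper's exact estimator:

```python
import numpy as np

def grad_sq_loss(w, x, y):
    """Gradient of squared loss 0.5*(w.x - y)^2 with respect to w."""
    return (w @ x - y) * x

def influence_scores(w, train, target):
    """Score each training sample by the alignment of its gradient with
    the mean gradient over the target set: a high score means reducing
    this sample's loss also tends to reduce target loss."""
    g_target = np.mean([grad_sq_loss(w, x, y) for x, y in target], axis=0)
    return [float(grad_sq_loss(w, x, y) @ g_target) for x, y in train]

w = np.zeros(2)
target = [(np.array([1.0, 0.0]), 1.0)]        # target task: predict 1 from e1
train = [(np.array([1.0, 0.0]), 1.0),         # aligned with the target
         (np.array([0.0, 1.0]), 1.0),         # orthogonal: no influence
         (np.array([-1.0, 0.0]), 1.0)]        # conflicts with the target
scores = influence_scores(w, train, target)
```

Selecting the top-scoring samples then concentrates the training budget on data that actually moves the target distribution.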

Read More

LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Published at 2025-05-25

#ML

The study presents a method called PIR that improves the reasoning abilities of large language models during test-time inference by selectively removing unnecessary steps in the reasoning process. This results in more accurate and efficient models that use fewer computational resources, making them more suitable for real-world applications....

Read More

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Published at 2025-05-25

#ML

The authors present a new, lightweight reasoning module called Universal Reasoner (UniR) that can be used with any frozen Large Language Model (LLM) to improve its reasoning skills. UniR is trained independently and can be combined with any LLM during inference by adding its output logits, allowing for modular composition and strong weak-to-strong generalization, as demonstrated in mathematical reasoning and machine translation tasks....
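The composition rule described above — adding the plug-in reasoner's output logits to the frozen LLM's before sampling — is easy to sketch. The toy 4-token vocabulary and the alpha weight are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def compose_logits(llm_logits, reasoner_logits, alpha=1.0):
    """UniR-style composition: the frozen LLM's logits plus the
    plug-in reasoner's logits, scaled by alpha."""
    return llm_logits + alpha * reasoner_logits

llm = np.array([2.0, 1.0, 0.5, 0.0])        # frozen base model prefers token 0
reasoner = np.array([0.0, 2.5, 0.0, 0.0])   # small reasoner pushes token 1
p = softmax(compose_logits(llm, reasoner))
best_token = int(p.argmax())                # combined distribution favors token 1
```

Because composition happens purely in logit space, the reasoner can be trained once and attached to any LLM sharing the same vocabulary, which is what enables the modular, weak-to-strong setup the summary mentions.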

Read More

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Published at 2025-05-26

#ML

This study examines the impact of different allocation strategies on the long-context capabilities of Vision-Language Models (VLMs) and proposes a new method called HoPE. HoPE is a Hybrid of Position Embedding that improves VLM performance in long contexts by introducing a hybrid frequency allocation strategy and a dynamic temporal scaling mechanism. The results of extensive experiments show that HoPE outperforms existing methods in long video understanding and retrieval tasks....

Read More

MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Published at 2025-05-26

#ML

The study presents two tools for better understanding manga: MangaOCR for recognizing text in manga pages and MangaVQA for testing comprehension through visual questions. They also developed MangaLMM, a model trained on these tools, to improve the performance of large multimodal models in interpreting manga narratives, and compared its efficiency with other proprietary models....

Read More

Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Published at 2025-05-26

#ML

Prot2Token is a new framework that can predict various protein properties and interactions by turning them into a standardized format. It uses a single model to perform multiple tasks, making protein modeling more efficient and promising for biological discoveries and therapeutics development....

Read More

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Published at 2025-05-26

#ML

The researchers created an automated system to collect over 21,000 real-world, interactive Python-based software engineering tasks from GitHub repositories. They used this data to build a new, uncontaminated benchmark for evaluating language models in software engineering tasks, finding that some models' performance may have been overestimated due to contamination issues in existing benchmarks....

Read More

AITEE -- Agentic Tutor for Electrical Engineering

Published at 2025-05-27

#ML

This study presents AITEE, an agent-based tutoring system for electrical engineering that provides personalized support and promotes self-directed learning. AITEE uses a graph-based similarity measure, parallel Spice simulation, and Socratic dialogue to enhance students' understanding of electrical circuits and outperforms baseline approaches in domain-specific knowledge application....

Read More

CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature

Published at 2025-05-27

#ML

Researchers created CHIMERA, a large database of creative idea combinations in scientific literature, specifically in AI. They used machine learning to extract over 28K examples of such recombinations and also built tools to predict new creative ideas, which other researchers found helpful....

Read More

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Published at 2025-05-27

#ML

The authors present EPiC, a framework for learning efficient video camera control that creates precise anchor videos without needing expensive camera trajectory annotations. EPiC uses masked source videos for training and a lightweight ControlNet module for integration with pretrained video diffusion models, resulting in state-of-the-art performance for image-to-video camera control tasks with fewer parameters, training steps, and less data....

Read More

MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

Published at 2025-05-27

#ML

The researchers present a new method called MUSEG that improves video understanding for AI models by helping them better reason about events over time. MUSEG achieves this by enabling the models to connect queries with multiple relevant video segments and using a reward system to guide the model towards better temporal reasoning....

Read More

RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination

Published at 2025-05-27

#ML

RenderFormer is a new rendering pipeline that uses a transformer architecture to render images from triangle-based scenes with global illumination effects, without needing per-scene training. It works in two stages: a view-independent stage that models light transport between triangles, and a view-dependent stage that converts a sequence of triangle tokens into pixel values, resulting in realistic images with minimal constraints....

Read More

SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem

Published at 2025-05-27

#ML

SVRPBench is a new benchmark for vehicle routing that simulates real-world conditions like traffic, delays, and accidents. It shows that traditional methods are better at handling these uncertainties than modern AI-based methods, and provides a dataset for others to test and improve their solutions....

Read More

SageAttention2++: A More Efficient Implementation of SageAttention2

Published at 2025-05-27

#ML

The study presents SageAttention2++, an improved version of SageAttention2, which speeds up attention mechanisms by using more efficient matrix multiplication techniques. Experiments show that SageAttention2++ is 3.9 times faster than FlashAttention, with no significant loss in performance, making it beneficial for various models like language, image, and video generation....

Read More

Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

Published at 2025-05-27

#ML

The authors present a new method for quickly creating stylized 3D scenes by separating structure modeling and appearance shading, allowing for efficient and consistent stylization using unposed sparse-view images and an arbitrary style image. This approach outperforms existing methods in terms of quality, multi-view consistency, and efficiency....

Read More

Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Published at 2025-05-27

#ML

This study presents a new method called HLIP that efficiently pre-trains language-image models for 3D medical imaging like CT and MRI scans by using a lightweight attention mechanism. HLIP can train on large, uncurated datasets, leading to improved performance and state-of-the-art results in various medical imaging benchmarks....

Read More

Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities

Published at 2025-05-27

#ML

This research explores how fine-tuning improves Large Language Models' ability to follow instructions, focusing on identifying and analyzing specific neurons and experts in these models. The study introduces a new dataset and framework to understand the role of these components in instruction execution, providing valuable insights into the inner workings of LLMs....

Read More

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Published at 2025-05-28

#ML

The study explores improving multimodal reasoning in large language models by combining supervised fine-tuning with structured chain-of-thought patterns and reinforcement learning. The two-stage approach significantly outperforms single-method models, achieving state-of-the-art results among open-source multimodal language models at different scales....

Read More

Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Published at 2025-05-28

#ML

The study examines how Large Language Models (LLMs) perform when prompted in Simplified and Traditional Chinese, using benchmark tasks like regional term choice and regional name choice. Results show that LLMs have biases based on the task and prompting language, which could be due to training data representation, character preferences, and tokenization differences. The research highlights the need for more analysis of LLM biases and provides an open-sourced benchmark dataset for future evaluati...

Read More

FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control

Published at 2025-05-28

#ML

The authors present FastTD3, a streamlined reinforcement learning algorithm that reduces training time for humanoid robots in popular simulation environments. By making modifications like parallel simulation and careful hyperparameter tuning, FastTD3 can solve various tasks in less than 3 hours on a single A100 GPU....

Read More

Fostering Video Reasoning via Next-Event Prediction

Published at 2025-05-28

#ML

The authors propose a new learning task called next-event prediction to help multimodal LLMs understand the order of events in videos, using future video segments as a self-supervised signal. They create a large dataset of video segments and test different training strategies to improve this ability, introducing FutureBench to measure progress....

Read More

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Published at 2025-05-28

#ML

The researchers present JQL, a new method for creating high-quality, diverse, and large-scale multilingual training data for language models. JQL uses lightweight annotators based on pretrained multilingual embeddings, which outperform existing heuristic filtering methods and improve downstream model training quality and data retention rates....

Read More

Let's Predict Sentence by Sentence

Published at 2025-05-28

#ML

The study explores if pre-trained language models can reason over sentences instead of individual tokens. They developed a framework that adapts a pre-trained language model to operate in sentence space, achieving competitive performance with reduced inference time compared to Chain-of-Thought, and introduced a diagnostic tool called SentenceLens to visualize latent trajectories....

Read More

One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

Published at 2025-05-28

#ML

Researchers found that in Text-to-Image diffusion models, decoders are better at capturing complex details while encoders can be reused for different time steps. They developed a new method, TiUE, which shares encoder features across multiple decoder time steps, improving efficiency and quality of generated images....

Read More

Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning

Published at 2025-05-28

#ML

This study examines the limitations of rule-based and model-based verifiers in mathematical reasoning, finding that rule-based ones struggle with different answer formats and model-based ones are prone to being hacked. The research aims to provide insights for creating more reliable reward systems in reinforcement learning by highlighting these unique risks....

Read More

PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

Published at 2025-05-28

#ML

The researchers present a new dataset of transparent images with accurate alpha mattes and a training-free method to generate such images. They also introduce a strong, open-source model for creating multi-layer transparent images, which can be edited easily and offers high-quality visuals....

Read More

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Published at 2025-05-28

#ML

The authors present a new method called RICO to improve the accuracy and completeness of image recaptioning by refining captions through visual reconstruction. RICO uses a text-to-image model to create a reference image from the caption, identifies discrepancies between the original and reconstructed images, and iteratively refines the caption. To reduce computational cost, they also introduce RICO-Flash, which learns to generate captions like RICO using DPO. Experiments show that RICO significa...

Read More

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Published at 2025-05-28

#ML

The authors analyze the self-correction abilities of reasoning Vision-Language Models and develop Sherlock, a training framework that enhances these abilities. Sherlock, built on Llama3.2-Vision-11B, significantly improves performance across various benchmarks through self-correction and self-improvement....

Read More

Skywork Open Reasoner 1 Technical Report

Published at 2025-05-28

#ML

This study presents Skywork-OR1, an efficient reinforcement learning method for long reasoning tasks, which significantly improves the performance of large language models on various benchmarks by addressing entropy collapse and optimizing the training pipeline....

Read More

Text2Grad: Reinforcement Learning from Natural Language Feedback

Published at 2025-05-28

#ML

Text2Grad is a new reinforcement learning method that uses natural language feedback to make precise adjustments to language models. By converting textual critiques into gradient signals, it outperforms traditional methods in tasks like summarization and question answering, while also providing more interpretable results....

Read More

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Published at 2025-05-28

#ML

This study explores why language models struggle with reasoning in reinforcement learning due to a lack of exploration, and finds that this issue is linked to a decrease in policy entropy. The researchers develop a relationship between entropy and performance, and propose two techniques, Clip-Cov and KL-Cov, to manage entropy and improve exploration, leading to better results....
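The quantity at the center of this analysis, policy entropy, is just the Shannon entropy of the model's next-token distribution; "entropy collapse" means this value drifting toward zero as training sharpens the policy and exploration dies. A minimal sketch (the two example distributions are illustrative):

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

exploring = [0.25, 0.25, 0.25, 0.25]   # early training: broad distribution
collapsed = [0.97, 0.01, 0.01, 0.01]   # entropy collapse: near-deterministic
high = policy_entropy(exploring)       # ln(4) ~ 1.386 nats
low = policy_entropy(collapsed)        # ~0.17 nats: little exploration left
```

The paper's Clip-Cov and KL-Cov interventions act on the tokens driving this collapse, keeping entropy high enough that the policy keeps exploring.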

Read More

Thinking with Generated Images

Published at 2025-05-28

#ML

The proposed paradigm allows large multimodal models to generate their own visual thinking steps, improving visual reasoning performance by up to 50% in complex scenarios, and enabling AI models to engage in visual imagination and iterative refinement similar to human thinking....

Read More

Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Published at 2025-05-28

#ML

This study presents MM-UPT, a new method for improving multi-modal large language models without the need for expensive, manually annotated data. MM-UPT uses a stable and scalable online RL algorithm, GRPO, and a self-rewarding mechanism to enhance the reasoning ability of models like Qwen2.5-VL-7B, even outperforming some supervised methods....

Read More

VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Published at 2025-05-28

#ML

This study presents a new framework, VRAG-RL, that uses reinforcement learning to help models understand and reason through visually rich information. By enabling models to interact with search engines and optimize their reasoning process, the framework addresses limitations in traditional text-based and vision-based methods, leading to improved performance in real-world applications....

Read More

WebDancer: Towards Autonomous Information Seeking Agency

Published at 2025-05-28

#ML

The authors propose a new method for creating autonomous information-seeking agents that can reason through complex problems. They introduce a four-stage process, including data construction, trajectory sampling, and reinforcement learning, and demonstrate its effectiveness through the WebDancer agent, which outperforms existing models on challenging benchmarks....

Read More

What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?

Published at 2025-05-28

#ML

This study investigates how text-to-image diffusion models, like Stable Diffusion, can be adapted for 360-degree panorama generation. It introduces a simple framework called UniPano, which outperforms existing methods and reduces memory usage and training time, making it scalable for high-resolution panorama generation....

Read More

Zero-Shot Vision Encoder Grafting via LLM Surrogates

Published at 2025-05-28

#ML

The research presents a method to reduce the cost of training vision language models by using smaller 'surrogate models' that share the same embedding space and representation language as the large target model. This zero-shot grafting process allows vision encoders trained on the surrogate to be directly transferred to the larger model, improving performance and reducing training costs by about 45%....

Read More

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
