🤗 Daily Paper(2025-10-23)


deep.di...@gmail.com

Oct 23, 2025, 4:07:49 PM

to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page

🤗 daily paper

DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

Published at 2025-10-16

#ML

The authors present DeLeaker, a new method to prevent unintended feature transfer in Text-to-Image models by controlling attention maps during the generation process. They also introduce SLIM, a dataset for evaluating this issue, and show that DeLeaker outperforms existing methods without sacrificing image quality....

Directional Reasoning Injection for Fine-Tuning MLLMs

Published at 2025-10-16

#ML

The authors propose a new method called DRIFT to improve the reasoning abilities of multimodal large language models without requiring resource-intensive training. This method transfers reasoning knowledge in the gradient space, which allows for efficient reasoning transfer while preserving the simplicity of standard supervised fine-tuning pipelines, and has been shown to consistently improve reasoning performance over existing methods....

Attention Sinks in Diffusion Language Models

Published at 2025-10-17

#ML

This study examines the attention patterns in Diffusion Language Models (DLMs) and discovers a phenomenon called 'attention sinking'. Unlike in traditional models, these attention sinks in DLMs change position dynamically during text generation and masking them has minimal impact on performance, providing new insights into their inner workings....

Language Models are Injective and Hence Invertible

Published at 2025-10-17

#ML

This study proves and empirically confirms that transformer language models are injective: distinct inputs map to distinct representations, so the input text can be exactly recovered from them. It also introduces SipIt, the first algorithm to efficiently utilize this property for exact invertibility....

What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

Published at 2025-10-18

#ML

Researchers created a dataset of 1,893 questions that people might ask household robots, based on videos and text scenarios. The questions cover various topics, from simple task details to hypothetical situations, and can help improve robot design and interaction by understanding what users need to know....

FinSight: Towards Real-World Financial Deep Research

Published at 2025-10-19

#ML

FinSight is a new multi-agent system for creating high-quality financial reports. It uses a Code Agent with Variable Memory architecture for data analysis and a two-stage writing framework for generating coherent reports. It outperforms other systems in accuracy, depth, and presentation quality, making professional financial report generation more accessible....

From Charts to Code: A Hierarchical Benchmark for Multimodal Models

Published at 2025-10-20

#ML

A new benchmark called Chart2Code is presented, which evaluates how well large multimodal models understand charts and generate code from them. The benchmark has different levels of difficulty, from reproducing charts to editing them or turning detailed tables into charts. It has 2,023 tasks and checks both the correctness of the code and the quality of the charts. Twenty-five top models were tested, and even the best one struggled, showing how hard the benchmark is. This test will help improve these models in the fu...

Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection

Published at 2025-10-20

#ML

The study presents a new method called ODiS that improves the selection of data for large language models by focusing on both quality and diversity. ODiS evaluates data from multiple dimensions and uses PCA to ensure the dimensions are uncorrelated, allowing for the selection of high-quality and diverse data, which results in better model performance on downstream benchmarks....
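
The decorrelation step mentioned above can be illustrated with a minimal sketch. It is not the ODiS implementation: it assumes just two hypothetical score dimensions per sample (the function names and the toy scores are made up for illustration) and applies two-dimensional PCA, rotating the centered scores onto their principal axes so that the resulting components have zero covariance.

```python
import math


def covariance(pairs):
    """Population covariance between the two coordinates of a list of pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n


def decorrelate_2d(scores):
    """Two-dimensional PCA: center the scores and rotate them onto principal axes."""
    n = len(scores)
    mean_x = sum(x for x, _ in scores) / n
    mean_y = sum(y for _, y in scores) / n
    centered = [(x - mean_x, y - mean_y) for x, y in scores]
    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]


# Hypothetical per-sample scores on two correlated dimensions (illustrative only).
scores = [(0.9, 0.8), (0.7, 0.75), (0.4, 0.35), (0.2, 0.3), (0.6, 0.5)]
components = decorrelate_2d(scores)

print("covariance before:", covariance(scores))
print("covariance after: ", covariance(components))
```

With more than two score dimensions, the same idea would typically use an eigendecomposition of the full covariance matrix from a numerical library instead of a single rotation angle.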

SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection

Published at 2025-10-20

#ML

SAVANT is a new framework that uses layered scene analysis to detect unusual scenarios in autonomous driving with high accuracy and recall. It improves upon existing methods by enabling a cost-effective, open-source model to outperform other models in anomaly detection, addressing the challenge of limited data in this field....

AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library

Published at 2025-10-21

#ML

The authors present AlphaOPT, a self-improving system that helps an LLM learn from limited examples and solver feedback to automate optimization modeling, which is typically hard to automate. This system operates in two phases: reflecting on failed attempts and extracting insights, and refining the applicability conditions of stored insights, all without costly retraining or curated rationales. Experiments show that AlphaOPT improves with more data and outperforms the strongest baseline on the O...

Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues

Published at 2025-10-21

#ML

The study presents SCRIPTS, a dataset of 1,000 dialogues from movies in English and Korean, to evaluate social reasoning in language models. The models struggle with social reasoning, particularly in Korean, and often fail to recognize relationships or exhibit biases, emphasizing the need for improved socially-aware language models....

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Published at 2025-10-21

#ML

This study addresses the challenges of off-policy reinforcement learning for large language models, focusing on the issues of declining policy entropy and optimization instability. The authors propose BAPO, a new method that dynamically adjusts clipping bounds to balance positive and negative contributions, maintain entropy, and stabilize the learning process, resulting in fast, stable, and data-efficient training....

NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning

Published at 2025-10-21

#ML

The study presents NeuroAda, a new method for parameter-efficient fine-tuning of neural networks, which achieves high performance with minimal memory usage by identifying key parameters and introducing bypass connections for updates. NeuroAda outperforms existing methods on various tasks using as little as 0.02% trainable parameters and reduces CUDA memory usage by up to 60%....

OmniNWM: Omniscient Driving Navigation World Models

Published at 2025-10-21

#ML

The authors present a new model, OmniNWM, which effectively handles state, action, and reward dimensions in autonomous driving by generating panoramic videos with rich details, enabling precise control over video generation, and incorporating 3D occupancy for rule-based dense rewards, outperforming existing models in various aspects....

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

Published at 2025-10-21

#ML

The study presents ProfBench, a new benchmark for evaluating large language models in professional domains like Physics, Chemistry, Finance, and Consulting, using over 7000 response-criterion pairs evaluated by human experts. The benchmark reveals that even top-performing models struggle with these tasks, and it provides insights into the performance of proprietary and open-weight models, as well as the importance of extended thinking in complex professional-domain tasks....

RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling

Published at 2025-10-21

#ML

The authors present RIR-Mega, a large dataset of simulated room impulse responses, which is useful for improving machine learning models and room acoustics modeling. This dataset includes a compact metadata schema, validation tools, and a regression baseline, and is publicly available for reproducible research....

See the Text: From Tokenization to Visual Reading

Published at 2025-10-21

#ML

This study proposes a new method called SeeTok that treats text as images, using visual-text rendering and pretrained multimodal LLMs to interpret them. Compared to traditional subword tokenization, SeeTok requires fewer tokens, reduces computation, and improves cross-lingual generalization and robustness to typographic noise, moving towards more human-like language models....

Steering Autoregressive Music Generation with Recursive Feature Machines

Published at 2025-10-21

#ML

The authors present MusicRFM, a new method that allows for detailed and understandable control over existing music models without re-training them. By analyzing the models' internal workings, MusicRFM can steer the generation process in real-time, improving the accuracy of generating specific musical notes while maintaining high text prompt adherence....

ColorAgent: Building A Robust, Personalized, and Interactive OS Agent

Published at 2025-10-22

#ML

The authors present ColorAgent, an operating system agent that can engage in long-term, robust interactions with the environment and provide personalized, proactive user interaction. ColorAgent outperforms previous agents in benchmark tests, but the authors suggest that current benchmarks are not comprehensive enough and propose areas for future research....

DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents

Published at 2025-10-22

#ML

The authors present DaMo, a novel solution for optimizing training data mixtures in multimodal language models for mobile phone tasks. DaMo predicts optimal data mixtures by forecasting task performance and outperforms other methods by 3.38% on PhoneAgentBench and 2.57% on other established benchmarks, while maintaining robust scalability....

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

Published at 2025-10-22

#ML

This study presents a new method called Decomposed Attention Fusion (DecAF) that improves the accuracy of attention maps in video reasoning segmentation without the need for retraining. By suppressing irrelevant activations and enhancing object-focused cues, DecAF converts attention maps into coarse segmentation masks and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks....

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

Published at 2025-10-22

#ML

The Ring-linear model series, comprising Ring-mini-linear-2.0 and Ring-flash-linear-2.0, is introduced. These hybrid-architecture models integrate linear and softmax attention, cutting inference cost to 1/10 of that of a 32-billion-parameter dense model and by over 50% compared to the original Ring series, while also improving training efficiency by 50% with the linghe operator library....

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Published at 2025-10-22

#ML

The authors present GigaBrain-0, a new Vision-Language-Action model that uses data generated by world models to reduce the need for real-world robot data, improving task generalization and policy robustness. This results in better performance on various tasks, and an optimized lightweight version is also available for devices like NVIDIA Jetson AGX Orin....

KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints

Published at 2025-10-22

#ML

The study presents KORE, a method that enhances the knowledge of large multimodal models by adding new information and preserving old data. KORE converts individual knowledge items into structured data for accurate learning and uses a covariance matrix to minimize interference with existing knowledge, improving the model's ability to adapt to new information without forgetting previous data....

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

Published at 2025-10-22

#ML

The study presents LoongRL, a method that uses reinforcement learning to improve reasoning with large amounts of information. It introduces a technique called KeyChain which transforms simple tasks into challenging ones involving long contexts, encouraging the model to follow a plan, retrieve relevant information, reason, and recheck, all of which enhances its performance in long-context tasks without increasing the cost of training....

MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models

Published at 2025-10-22

#ML

This study presents MINED, a new benchmark for evaluating large multimodal models' understanding of time-sensitive knowledge across six dimensions and 11 tasks. The benchmark reveals that while some models perform well, many still struggle with time-sensitive knowledge, especially in areas like sports. The researchers also explore ways to update models with new information, finding that knowledge editing methods can be effective in certain scenarios....

Machine Text Detectors are Membership Inference Attacks

Published at 2025-10-22

#ML

This study explores the similarities between detecting machine-generated text and identifying training samples for language models. The authors prove that the same metric works best for both tasks and find that a method designed for text detection also performs well for identifying training samples, suggesting that researchers in the two areas should work together more and share methods....

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Published at 2025-10-22

#ML

The authors have created a large and high-quality dataset called Pico-Banana-400K, which contains 400,000 real images with diverse editing pairs, to help improve text-guided image editing models. This dataset includes specialized subsets for studying complex editing scenarios, such as multi-turn editing and preference-based editing, making it a robust foundation for training and benchmarking future models....

TheMCPCompany: Creating General-purpose Agents with Task-specific Tools

Published at 2025-10-22

#ML

TheMCPCompany is a benchmark for evaluating tool-calling agents using over 18,000 tools from real-world services. The study shows that advanced reasoning models can effectively discover tools in simpler environments but struggle with complex enterprise environments, highlighting the need for better reasoning and retrieval models to navigate and combine tools for solving complex problems....

Unified Reinforcement and Imitation Learning for Vision-Language Models

Published at 2025-10-22

#ML

This study presents a new training method called Unified Reinforcement and Imitation Learning that creates powerful, lightweight Vision-Language Models. By combining reinforcement learning with imitation learning, smaller models can mimic and improve upon the text generation of larger models, resulting in significant performance gains and competitive results with leading VLMs....

VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

Published at 2025-10-22

#ML

The authors developed a method called VideoAgentTrek that automatically generates training data for computer-use agents by analyzing publicly available screen-recorded videos, eliminating the need for costly manual annotation. This approach significantly improves task success rates and step accuracy in various benchmarks, showing that internet videos can be used to create high-quality training data for computer-use agents....

When Do Transformers Learn Heuristics for Graph Connectivity?

Published at 2025-10-22

#ML

This study explores why Transformers often rely on unreliable shortcuts instead of learning generalizable algorithms. Using graph connectivity as an example, the research shows that a simplified Transformer model can solve graphs up to a certain complexity level, and the model learns different strategies based on the complexity of the training graphs. The findings suggest that limiting the training data to the model's capacity can help it learn the correct algorithm instead of a less efficient s...

olmOCR 2: Unit Test Rewards for Document OCR

Published at 2025-10-22

#ML

The authors have developed an advanced OCR system, olmOCR 2, which converts digitized print documents into clean text using a specialized model trained with a diverse set of binary unit tests. This new system outperforms previous versions, especially in handling math formulas, tables, and multi-column layouts, and its model, data, and code are openly available for use....

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model derived from SOLAR-10.7B open LLM.

(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
