🤗 Daily Paper (2025-10-09)


deep.di...@gmail.com

Oct 9, 2025, 4:08:12 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents

Published at 2025-09-26

#ML

This study presents DeepTravel, a framework for building autonomous travel planning agents that can plan, execute, and learn from their actions without relying on fixed prompts or external API constraints. The framework combines a sandbox environment, hierarchical reward modeling, and a reply-augmented reinforcement learning method to enable small language models to outperform larger ones on travel planning tasks....

Read More

A Single Character can Make or Break Your LLM Evals

Published at 2025-10-02

#ML

The way demonstration examples are formatted in LLM evaluations significantly impacts measured performance: the choice of delimiter between examples alone can alter results by up to 23%. The research examines how different delimiters affect attention scores and provides practical recommendations for improving LLMs' robustness to the choice of delimiter....
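
A minimal sketch of how such a sensitivity check might look in a few-shot harness (the prompt format, delimiter set, and the query_model callable are illustrative assumptions, not the paper's setup):

```python
# Hypothetical sweep: score the same question under different example delimiters.
FEW_SHOT = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

DELIMITERS = {"double_newline": "\n\n", "dashes": "\n---\n", "hashes": "\n###\n"}

def build_prompt(question: str, delimiter: str) -> str:
    shots = delimiter.join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT)
    return f"{shots}{delimiter}Q: {question}\nA:"

def delimiter_sweep(question: str, query_model):
    # query_model is any callable that sends a prompt to an LLM and returns text;
    # differences between entries are then attributable to formatting alone.
    return {name: query_model(build_prompt(question, d))
            for name, d in DELIMITERS.items()}
```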

Read More

G^2RPO: Granular GRPO for Precise Reward in Flow Models

Published at 2025-10-02

#ML

The paper presents G^2RPO, a framework that improves the alignment of flow-based generative models with human preferences. By combining a Singular Stochastic Sampling strategy with a Multi-Granularity Advantage Integration module, it produces more accurate and comprehensive reward assessments than existing methods....

Read More

Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Published at 2025-10-02

#ML

The authors propose a new method called Patch-as-Decodable Token (PaDT) that enables multimodal language models to directly produce both textual and visual outputs, improving performance on tasks like detection and segmentation by using visual reference tokens and a lightweight decoder. PaDT achieves state-of-the-art results across various tasks, even compared to larger models, and the code is available for public use....

Read More

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Published at 2025-10-03

#ML

The researchers propose a new method called Cache-to-Cache (C2C) that allows large language models to communicate directly with each other using their internal representations, improving response quality and reducing latency compared to text-based communication. Experiments show that C2C outperforms individual models by 8.5-10.5% and text communication by 3.0-5.0%, while also being 2.0 times faster....
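
As a rough illustration of the idea (not the paper's actual architecture), one could imagine a small learned projector that maps a sharer model's KV cache into the receiver model's dimensionality, so the receiver can fuse it with its own cache instead of re-reading the sharer's output as text:

```python
import torch
import torch.nn as nn

class CacheProjector(nn.Module):
    # Illustrative only: project one model's per-layer key/value cache into
    # another model's hidden size. The real C2C fusion design may differ.
    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj_k = nn.Linear(src_dim, dst_dim)
        self.proj_v = nn.Linear(src_dim, dst_dim)

    def forward(self, k_src: torch.Tensor, v_src: torch.Tensor):
        # k_src, v_src: [batch, heads, seq, src_dim] from the "sharer" model.
        return self.proj_k(k_src), self.proj_v(v_src)

# The receiver could, for example, concatenate the projected tensors with its
# own KV cache along the sequence axis before attending over them.
```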

Read More

AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning

Published at 2025-10-05

#ML

AlphaApollo is a system that combines multiple models with professional tools to improve agentic reasoning in foundation models. It enables accurate calculations, decision-making, and iterative refinement, resulting in significant performance gains on AIME evaluations and demonstrating the successful execution of tool calls....

Read More

CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

Published at 2025-10-05

#ML

The study presents a new method called CALM that enhances large reasoning models' performance in automating optimization tasks by progressively refining them with expert-provided hints, leading to a new state-of-the-art model, STORM, that matches the performance of a much larger model with fewer parameters....

Read More

Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

Published at 2025-10-05

#ML

The study presents a new reasoning method called Language-Mixed CoT that alternates between English and a target language, reducing translation errors while improving reasoning. They created a large Korean dataset and trained several models, with the best one, KO-REAson-35B, setting new performance records and outperforming other models on most benchmarks....

Read More

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Published at 2025-10-05

#ML

This study explains why training transformer models with flash attention in low-precision formats often fails due to similar low-rank representations and biased rounding errors, leading to loss explosions. The researchers propose a minimal modification to flash attention that reduces rounding errors, successfully stabilizing the training process....
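
The paper's specific fix isn't reproduced here, but the underlying failure mode is easy to demonstrate: accumulating many softmax-weighted contributions in a low-precision format rounds after every step and drifts from the full-precision reference. A toy illustration (fp16 stands in for the low-precision format; the paper's analysis targets flash attention specifically):

```python
import numpy as np

np.random.seed(0)
probs = np.random.dirichlet(np.ones(4096)).astype(np.float32)   # softmax-like weights
values = np.random.randn(4096).astype(np.float32)

ref = np.dot(probs.astype(np.float64), values.astype(np.float64))  # high-precision reference

acc_lo = np.float16(0.0)
for p, v in zip(probs.astype(np.float16), values.astype(np.float16)):
    acc_lo = np.float16(acc_lo + p * v)      # rounded after every accumulation step

acc_hi = np.float32(0.0)
for p, v in zip(probs, values):
    acc_hi = np.float32(acc_hi + p * v)      # fp32 accumulator

print("low-precision accumulation error :", abs(float(acc_lo) - ref))
print("fp32 accumulation error          :", abs(float(acc_hi) - ref))
```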

Read More

Bridging Text and Video Generation: A Survey

Published at 2025-10-06

#ML

This survey explores the development of text-to-video generation models, from early generative adversarial networks (GANs) and variational autoencoders (VAEs) to hybrid diffusion-transformer architectures. It discusses the challenges faced in creating coherent, high-quality videos from text prompts, and suggests future directions for research in this evolving field....

Read More

Glocal Information Bottleneck for Time Series Imputation

Published at 2025-10-06

#ML

The study presents a new training method, Glocal Information Bottleneck, for improving time series imputation, which is the process of recovering missing values in temporal data. This method addresses the issue of models focusing too much on local details and ignoring the global structure of the data, resulting in better performance and more accurate imputations, especially under high missingness....

Read More

Multi-Agent Tool-Integrated Policy Optimization

Published at 2025-10-06

#ML

The authors present a new method called MATPO that allows a single language model to have different roles, improving its performance on complex tasks compared to traditional single-agent models. MATPO effectively manages context and tool responses, outperforming single-agent baselines by an average of 18.38% in various experiments....

Read More

NorMuon: Making Muon more efficient and scalable

Published at 2025-10-06

#ML

This study presents NorMuon, an optimizer that combines orthogonalization and neuron-level adaptive learning rates to improve training efficiency and scalability in large language models. NorMuon balances parameter utilization and maintains Muon's conditioning benefits, outperforming both Adam and Muon in various model scales while keeping a similar memory footprint to Muon....
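
A rough sketch of how those two ingredients could fit together, assuming a Muon-style Newton-Schulz orthogonalization and a per-row (per-neuron) second-moment rescaling; the actual NorMuon update rule is the one defined in the paper:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize a 2D update matrix (coefficients follow the
    # publicly available Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x

def normuon_like_step(w, grad_momentum, row_v, lr=0.02, beta2=0.95, eps=1e-8):
    # Hypothetical step: orthogonalize the update, then rescale each output
    # neuron (row) by a running estimate of its squared update magnitude.
    u = newton_schulz_orthogonalize(grad_momentum)
    row_v.mul_(beta2).add_((1 - beta2) * u.pow(2).mean(dim=1))
    u = u / (row_v.sqrt().unsqueeze(1) + eps)
    w.sub_(lr * u)
    return w, row_v
```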

Read More

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Published at 2025-10-06

#ML

The authors present StaMo, a method that learns a compact and expressive state representation for robotic motion using a lightweight encoder and a pre-trained Diffusion Transformer decoder. This approach improves task performance and real-world success rates while reducing inference overhead, and also discovers latent actions that can be translated into executable robot actions without explicit supervision....

Read More

D^3QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

Published at 2025-10-07

#ML

The authors of this study present a new method, D^3QE, to detect images generated by autoregressive models, which are different from previous methods as they generate images through discrete token prediction. The proposed method leverages the unique patterns and frequency distribution bias of the codebook in real and fake images, and it has shown superior detection accuracy and robustness to real-world perturbations across different autoregressive models....

Read More

FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

Published at 2025-10-07

#ML

The FinLFQA benchmark evaluates Large Language Models' ability to generate long, complex financial answers with detailed attributions, focusing on evidence, reasoning steps, and financial knowledge. The study finds that fine-grained metrics are crucial for distinguishing model capabilities and that end-to-end generation performs as well as post-hoc approaches, with iterative refinement being helpful only with external feedback....

Read More

Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Published at 2025-10-07

#ML

The researchers present Lumina-DiMOO, a new open-source model that excels at creating and understanding various types of data, such as text and images, by using a unique method called fully discrete diffusion modeling. This approach enables Lumina-DiMOO to perform tasks like generating images from text, editing images, and more, all while being more efficient than previous methods and outperforming other open-source models in these areas....

Read More

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Published at 2025-10-07

#ML

The study presents MingTok, a new visual tokenizer that uses a continuous latent space to unify visual understanding and generation, addressing the challenge of quantization errors in existing methods. Ming-UniVision, built on MingTok, eliminates the need for task-specific visual representations and supports multi-round, in-context tasks, achieving state-of-the-art performance in both understanding and generation tasks....

Read More

PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

Published at 2025-10-07

#ML

The study presents PuzzlePlex, a diverse puzzle benchmark to evaluate foundation models' reasoning and planning skills in complex environments. The results indicate that reasoning models excel in instruction-based tasks, while code-based tasks, though challenging, provide a scalable and efficient solution....

Read More

Revisiting Long-context Modeling from Context Denoising Perspective

Published at 2025-10-07

#ML

This study analyzes the impact of irrelevant information, or 'context noise,' on long-context models and proposes a new metric and training strategy to improve their performance. The proposed methods significantly enhance the model's ability to focus on important information, leading to better predictions and even outperforming a high-profile model in certain tasks....

Read More

The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

Published at 2025-10-07

#ML

The African Languages Lab is a research initiative that creates a large, validated dataset and builds models to improve natural language processing technologies for African languages, which are currently underrepresented. The initiative has also mentored early-career researchers, and its models perform as well as or better than Google Translate for several languages....

Read More

The Markovian Thinker

Published at 2025-10-07

#ML

The authors propose a new paradigm called Markovian Thinking, which allows reasoning models to think for long periods without requiring a lot of computational resources. They introduce Delethink, an environment that structures reasoning into fixed-size chunks, enabling linear compute with constant memory. This approach significantly reduces computational costs compared to existing methods, making long reasoning tasks more efficient and scalable....
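
A schematic of what fixed-chunk, Markovian reasoning could look like at inference time; the chunk size, carryover rule, and stop convention below are placeholders, and the real Delethink environment is specified in the paper:

```python
def markovian_reason(question, generate, chunk_tokens=2048, carry_chars=512, max_chunks=8):
    # generate(prompt, max_new_tokens) is a placeholder LLM call.
    state = question                         # Markovian state: question + short carryover
    for _ in range(max_chunks):
        chunk = generate(state, max_new_tokens=chunk_tokens)
        if "FINAL ANSWER:" in chunk:         # assumed stop convention, not from the paper
            return chunk.split("FINAL ANSWER:")[-1].strip()
        # Keep only a bounded suffix of the chunk (characters here for simplicity),
        # so memory stays constant and per-chunk compute does not grow with
        # total thinking length.
        carryover = chunk[-carry_chars:]
        state = f"{question}\n[previous thinking, truncated]\n{carryover}"
    return None
```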

Read More

Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Published at 2025-10-08

#ML

The study examines various visual token compression methods for Multimodal Large Language Models and finds that simple image downsampling often outperforms advanced techniques on existing benchmarks. The researchers then propose VTC-Bench, a new evaluation framework that filters data to reduce noise and provide a fairer assessment of visual token compression methods....
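
For reference, the "simple image downsampling" baseline amounts to shrinking the input so a patch-based vision encoder emits fewer visual tokens, roughly along these lines (the factor is an illustrative choice):

```python
from PIL import Image

def downsample(path: str, factor: int = 2) -> Image.Image:
    # Halving each spatial dimension cuts the number of patches, and hence
    # visual tokens, by roughly 4x for a fixed patch size.
    img = Image.open(path)
    w, h = img.size
    return img.resize((max(1, w // factor), max(1, h // factor)))
```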

Read More

Artificial Hippocampus Networks for Efficient Long-Context Modeling

Published at 2025-10-08

#ML

The authors present a new method for long-sequence modeling that combines the benefits of both RNN-like models and attention-based Transformers. Their approach, inspired by cognitive science, uses a sliding window of the Transformer's KV cache as short-term memory and a learnable module called Artificial Hippocampus Network to compress out-of-window information into a fixed-size long-term memory. This results in faster computations and reduced memory usage without sacrificing performance....
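
A schematic of the memory split (the recurrent compressor below is a stand-in; the actual Artificial Hippocampus Network module is defined in the paper):

```python
import torch
import torch.nn as nn

class ToyHippocampusMemory(nn.Module):
    # Stand-in compressor: fold tokens evicted from the sliding attention
    # window into a fixed-size recurrent state.
    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, memory: torch.Tensor, evicted: torch.Tensor) -> torch.Tensor:
        # memory:  [batch, dim]    fixed-size long-term state
        # evicted: [batch, n, dim] tokens that just left the short-term window
        for t in range(evicted.size(1)):
            memory = self.cell(evicted[:, t], memory)
        return memory

# A layer would then attend over the in-window KV cache (short-term memory)
# while conditioning on `memory`, keeping compute and memory bounded
# regardless of sequence length.
```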

Read More

Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

Published at 2025-10-08

#ML

This survey examines the challenges and advancements in code-switching natural language processing (NLP) using large language models (LLMs), focusing on the difficulties in handling mixed-language inputs and biased evaluations. The authors provide a comprehensive analysis of CSW-aware LLM research, classifying recent advances by architecture, training strategy, and evaluation methodology, and propose a roadmap to achieve truly multilingual intelligence, emphasizing the need for inclusive dataset...

Read More

Heptapod: Language Modeling on Visual Signals

Published at 2025-10-08

#ML

The study presents a new image modeling method called Heptapod that predicts the distribution of entire 2D images using a causal Transformer, outperforming previous methods on the ImageNet benchmark. This innovative approach combines sequential and holistic image learning, offering a new perspective on language modeling for visual signals....

Read More

MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Published at 2025-10-08

#ML

The researchers created a new video dataset called MATRIX-11K with interaction-aware captions and multi-instance mask tracks to study how video generation models represent interactions. They then developed a method called MATRIX, which improves the models' ability to accurately represent interactions by aligning attention in specific layers with the mask tracks from their new dataset....

Read More

MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Published at 2025-10-08

#ML

The authors present MLE-Smith, a system that automates the creation of machine learning engineering tasks from raw datasets, ensuring quality, real-world applicability, and diversity. MLE-Smith transforms datasets into competition-style challenges using a multi-agent pipeline, which designs tasks, enforces rules, and validates their real-world relevance, demonstrating its effectiveness across various datasets and language models....

Read More

Native Hybrid Attention for Efficient Sequence Modeling

Published at 2025-10-08

#ML

The authors present a new model called Native Hybrid Attention (NHA) that combines the efficiency of linear attention with the context recall of full attention in a unified design. NHA outperforms Transformers and other hybrid models in recall-intensive tasks and can also be applied to pretrained language models for improved efficiency without sacrificing accuracy....

Read More

OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot

Published at 2025-10-08

#ML

The authors present OBS-Diff, a new method that reduces the computational cost of large-scale text-to-image models by pruning them without the need for retraining. This method is effective, efficient, and maintains high visual quality, outperforming existing one-shot pruning techniques for diffusion models....

Read More

Online Generic Event Boundary Detection

Published at 2025-10-08

#ML

The study presents a new method called Estimator for detecting events in streaming videos in real-time, similar to human perception. The method outperforms other online video understanding models and performs comparably to existing offline event detection methods on various datasets....

Read More

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Published at 2025-10-08

#ML

The authors present RLinf-VLA, a unified and efficient framework for training vision-language-action models using reinforcement learning, addressing the lack of a unified platform for fair comparison. RLinf-VLA supports various architectures, algorithms, and simulators, achieving high performance and generalization in simulations and real-world deployment....

Read More

Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces

Published at 2025-10-08

#ML

This study examines the Uniform Information Density (UID) hypothesis in large language models' reasoning, proposing a new metric and two measures of uniformity. Experiments show that reasoning traces with more uniform information density improve accuracy and provide a robust diagnostic criterion for building reliable reasoning systems....
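
The paper defines its own metric and uniformity measures; a simple proxy for the same intuition is the variance of per-token surprisal across a reasoning trace (lower variance means information is spread more evenly):

```python
def uid_variance(logprobs):
    # logprobs: log p(token_t | prefix) for each token of a reasoning trace,
    # e.g. as returned by an API with token logprobs enabled.
    surprisal = [-lp for lp in logprobs]
    mean = sum(surprisal) / len(surprisal)
    return sum((s - mean) ** 2 for s in surprisal) / len(surprisal)

# Example: a trace whose tokens each carry 1 nat scores 0 (perfectly uniform),
# while a trace alternating 0 and 2 nats scores 1.
```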

Read More

SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Published at 2025-10-08

#ML

This study presents a new framework called SHANKS that allows spoken language models to think and reason while listening to user input in real-time, improving interaction and reducing latency in speech-to-speech communication....

Read More

TTRV: Test-Time Reinforcement Learning for Vision Language Models

Published at 2025-10-08

#ML

This study presents a new method, TTRV, for improving vision language models by adapting them in real-time during use, without requiring labeled data. The approach enhances the GRPO framework, rewards the model based on output frequency, and reduces output uncertainty, leading to significant improvements in object recognition and visual question answering tasks, even outperforming GPT-4o in some cases....
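
As a sketch of the label-free reward signal described above (the exact reward shaping and uncertainty term are the paper's; the frequency count below is just the basic idea):

```python
from collections import Counter

def frequency_rewards(samples):
    # samples: multiple answers sampled from the model for the same query.
    # Each sample is rewarded by how often its answer appears in the batch,
    # requiring no ground-truth labels.
    counts = Counter(samples)
    n = len(samples)
    return [counts[s] / n for s in samples]

# frequency_rewards(["cat", "cat", "dog", "cat"]) -> [0.75, 0.75, 0.25, 0.75]
```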

Read More

U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

Published at 2025-10-08

#ML

This study presents U-Bench, a large-scale, comprehensive benchmark evaluating 100 U-Net variants for medical image segmentation, focusing on statistical robustness, generalization, and efficiency. U-Bench introduces a new metric, U-Score, to measure performance-efficiency trade-offs and offers guidance for model selection based on dataset characteristics and architectural paradigms....

Read More

Vibe Checker: Aligning Code Evaluation with Human Preference

Published at 2025-10-08

#ML

This study proposes a new method called VeriCode to measure how well language models can follow code instructions, which is important for making code that feels right and meets human preferences beyond just functionality. The researchers found that the best models still struggle with following multiple instructions without losing functionality, and that combining functional correctness and instruction following provides the best alignment with human preference in coding tasks....

Read More

When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Published at 2025-10-08

#ML

This study examines how outdated benchmarks can affect the evaluation of large language models' factuality, finding that many popular benchmarks contain outdated information, leading to unreliable assessments. The researchers propose a new method to measure benchmark aging and its impact, aiming to improve the reliability of future evaluations....

Read More

WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Published at 2025-10-08

#ML

The study presents WristWorld, a new 4D model that creates wrist-view videos from anchor views using visual geometry models and a two-stage process of reconstruction and generation. This model outperforms existing ones in video generation and improves virtual-to-real adaptation for robotic manipulation tasks....

Read More


Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, which is derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
