🤗 Daily Paper Newsletter

Hope you find some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Expanding the Action Space of LLMs to Reason Beyond Language
Published at 2025-10-08

#ML

The study presents a method to enhance Large Language Models (LLMs) by allowing them to interact with external environments beyond just language, such as symbolic operators or simulators. The new approach, called ExpA Reinforcement Learning (EARL), enables LLMs to perform better on tasks requiring multi-turn interactions and contingent planning, outperforming vocabulary-constrained action baselines....
Read More

AlphaQuanter: An End-to-End Tool-Orchestrated Agentic Reinforcement Learning Framework for Stock Trading
Published at 2025-10-15

#ML

The authors present a new single-agent framework called AlphaQuanter that uses reinforcement learning to automate stock trading. This tool improves upon existing multi-agent frameworks by being more efficient, providing consistent signals, and learning a coherent strategy from market feedback, ultimately achieving top performance on key financial metrics and offering valuable insights for human traders....
Read More

Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations
Published at 2025-10-15

#ML

This study discusses the limitations of current AI simulations that are confined to static environments and proposes a new approach using LLMs in multi-agent systems to model complex, ever-changing societies. The authors introduce a taxonomy for this field and present a research roadmap focused on creating adaptive, socially aware AI ecosystems through open-ended co-evolution....
Read More

PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
Published at 2025-10-17

#ML

The study presents PokeeResearch-7B, a 7B-parameter deep research agent that uses a unified reinforcement learning framework to improve its robustness, alignment, and scalability. This agent outperforms other 7B-scale deep research agents on 10 popular benchmarks by using an annotation-free reinforcement learning framework and a chain-of-thought-driven multi-call reasoning scaffold for self-verification and adaptive recovery from tool failures....
Read More

UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis
Published at 2025-10-17

#ML

The study presents a new framework that combines medical multimodal understanding and generation, which can process various medical inputs and produce diverse outputs. This unified model improves performance in medical image understanding and generation tasks by enabling knowledge sharing between them, filling gaps in data representation and feature integration....
Read More

Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
Published at 2025-10-17

#ML

The study presents SciRecipe, a large dataset of structured biological protocols, and introduces the 'Sketch-and-Fill' paradigm to improve protocol generation. The authors develop Thoth, a model trained to generate accurate and executable protocols, which outperforms existing models in various benchmarks, thus enhancing the efficiency of scientific experiment reproduction....
Read More

PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Published at 2025-10-18

#ML

The authors present PRISMM-Bench, a new benchmark that tests large multimodal models' ability to handle inconsistencies in scientific papers, which are often missed by existing benchmarks. They collected 262 real-world inconsistencies and designed tasks to evaluate models' performance in detecting, correcting, and reasoning about these inconsistencies across different modalities. The results show that current models struggle with this task, highlighting the need for improvement in this area....
Read More

Chem-R: Learning to Reason as a Chemist
Published at 2025-10-19

#ML

The authors present a new model, Chem-R, which is trained in three stages to mimic chemists' problem-solving processes. This model outperforms other large language models and chemical foundation models, demonstrating its potential for AI-driven chemical discovery....
Read More

Video Reasoning without Training
Published at 2025-10-19

#ML

This study finds that high-quality models improve reasoning by exploring and exploiting options, and then converging on a solution. The researchers then developed V-Reason, a method that tunes model behavior during inference without training, leading to better performance and efficiency compared to RL-trained models....
Read More

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Published at 2025-10-20

#ML

The authors present Any-Depth Alignment (ADA), a method that enhances the safety of large language models by reintroducing alignment tokens during generation, which helps the model reassess and prevent harmful responses at any point. ADA is effective across various models, maintaining high performance without altering the base model's parameters and significantly reducing successful adversarial attacks....
Read More

DeepSeek-OCR: Contexts Optical Compression
Published at 2025-10-20

#ML

DeepSeek-OCR is a new system that uses optical 2D mapping to compress long texts into fewer vision tokens, enabling high OCR precision even at a high compression ratio. It outperforms other models in text recognition and can generate large-scale training data for LLMs/VLMs efficiently....
Read More

Efficient Long-context Language Model Training by Core Attention Disaggregation
Published at 2025-10-20

#ML

The authors propose a new method called core attention disaggregation (CAD) that improves the training of long-context large language models by separating a crucial computation step from the rest of the model. This separation allows for more efficient use of computing resources and reduces imbalances and bottlenecks, leading to faster and more effective training....
Read More

EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning
Published at 2025-10-20

#ML

The paper presents a new framework for creating reliable and verifiable synthetic data that can be used for training language models. This framework can generate problems, solutions, and verification artifacts, and it improves data filtering by ensuring consistency between human-annotated and strategy-induced checks. The approach has been tested and shown to improve performance on LiveCodeBench and AgentBench-OS tasks....
Read More

GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
Published at 2025-10-20

#ML

The authors propose a new method called Generalized Adversarial Solver to improve the efficiency of diffusion models in generating high-quality samples. This method simplifies the training process and enhances fine-grained details by combining the original distillation loss with adversarial training, resulting in better performance compared to existing solver training methods....
Read More

Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution
Published at 2025-10-20

#ML

The study reveals that current multilingual watermarking methods for large language models are not effective in medium- and low-resource languages due to a flaw in semantic clustering. To solve this, the authors propose STEAM, a back-translation-based detection method that enhances watermark strength across various languages, tokenizers, and watermarking techniques, improving fairness in watermarking....
Read More

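The back-translation idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `detect` is a simplified "green-list" watermark scorer, and `back_translate` is a stand-in for a real round-trip machine-translation system; both names and the scoring rule are assumptions.

```python
# Toy sketch of back-translation-aided watermark detection: score both
# the text and its back-translation, then keep the stronger signal.

def detect(text: str, green_words: set[str]) -> float:
    """Toy watermark score: fraction of tokens in the 'green' list."""
    tokens = text.lower().split()
    return sum(t in green_words for t in tokens) / len(tokens) if tokens else 0.0

def back_translate(text: str, round_trip: dict[str, str]) -> str:
    """Stand-in for translating to a pivot language and back."""
    return round_trip.get(text, text)

def steam_score(text: str, green_words: set[str],
                round_trip: dict[str, str]) -> float:
    # The watermark signal may only reappear after back-translation,
    # so take the max of the two detection scores.
    return max(detect(text, green_words),
               detect(back_translate(text, round_trip), green_words))
```

The point of the max is that a translated output can hide the green-list bias until the text is mapped back toward the language it was watermarked in.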
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
Published at 2025-10-20

#ML

The study presents MT-Video-Bench, a comprehensive video understanding benchmark designed to evaluate AI's performance in multi-turn dialogues, focusing on six key competencies such as interactive sports analysis and video-based tutoring. This benchmark helps assess various multimodal large language models, highlighting their strengths and weaknesses in real-world multi-turn scenarios....
Read More

MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Published at 2025-10-20

#ML

The authors present a training framework that significantly improves efficiency and performance for large-scale video generation models by optimizing data processing, model architecture, training strategy, and infrastructure. The resulting model, MUG-V 10B, matches or exceeds state-of-the-art video generators and is open-sourced along with the training code and inference pipelines....
Read More

Planned Diffusion
Published at 2025-10-20

#ML

The study presents a new method called 'planned diffusion' that combines the strengths of autoregressive and diffusion models for text generation. This method first creates a short plan autoregressively and then generates the planned spans of text in parallel with diffusion, resulting in fast, high-quality text generation with a Pareto-optimal trade-off between quality and latency....
Read More

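The plan-then-fill control flow can be mimicked with ordinary concurrency. This is a toy sketch of the structure only: `draft_plan` and `fill_span` are hypothetical stand-ins for the autoregressive planner and the parallel diffusion decoder, not the paper's method.

```python
# Toy sketch of the two-stage flow: a short sequential "plan" step,
# then independent spans expanded in parallel.
from concurrent.futures import ThreadPoolExecutor

def draft_plan(prompt: str) -> list[str]:
    """Stand-in AR step: emit short headers for independent spans."""
    return [f"{prompt} (part {i})" for i in range(3)]

def fill_span(header: str) -> str:
    """Stand-in diffusion step: expand one span independently."""
    return header + ": expanded text"

def planned_diffusion(prompt: str) -> str:
    plan = draft_plan(prompt)            # sequential, but short
    with ThreadPoolExecutor() as pool:   # spans decoded in parallel
        return " ".join(pool.map(fill_span, plan))
```

The latency win comes from the same place as in this sketch: only the short plan is sequential, while the bulk of the tokens are produced concurrently.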
World-in-World: World Models in a Closed-Loop World
Published at 2025-10-20

#ML

The authors present World-in-World, a new platform for evaluating generative world models in a realistic, closed-loop setting that focuses on task success rather than just visual quality. They find that controllability and scaling with action-observation data are more important than visual quality for successful decision-making by embodied agents....
Read More

ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning
Published at 2025-10-20

#ML

The authors present ssToken, a method for improving large language models' fine-tuning by using a self-modulated signal from a history model and an attention-based metric to select important tokens, without relying on an additional reference model or solely on loss information....
Read More

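The selection rule described above can be sketched as follows. The exact scoring, the blend weight `alpha`, and the keep ratio are illustrative assumptions, not ssToken's precise formulation: each token's excess loss under the current model relative to a frozen history model (the self-modulated signal) is blended with an attention-based importance score, and only the top fraction of tokens contributes to the fine-tuning loss.

```python
# Toy token-selection sketch: blend excess loss vs. a history model
# with a semantic (attention) score, keep the top-scoring tokens.

def select_tokens(cur_loss, hist_loss, attn, keep_ratio=0.5, alpha=0.5):
    # Excess loss (c - h) measures what the current model still has to
    # learn; attn supplies semantic importance.
    scores = [alpha * (c - h) + (1 - alpha) * a
              for c, h, a in zip(cur_loss, hist_loss, attn)]
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:k])  # indices of tokens kept for the loss
```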
DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Published at 2025-10-21

#ML

DSI-Bench is a new benchmark for evaluating models' understanding of dynamic 3D scenarios, as existing models struggle with simultaneous motion of observers and objects. The benchmark, which includes over 1,000 videos and 1,700 questions, reveals limitations in current models, such as confusing observer and object motion and failing to accurately infer relative relationships in dynamic scenarios....
Read More

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Published at 2025-10-21

#ML

The authors have developed Ring-1T, the first open-source thinking model with trillion-scale parameters. They introduce three innovations to tackle the challenges of training at this scale, resulting in exceptional performance across various benchmarks, including a silver-medal-level result on IMO 2025 that demonstrates its advanced reasoning capabilities....
Read More

Extracting alignment data in open models
Published at 2025-10-21

#ML

This study demonstrates that a substantial amount of alignment training data can be extracted from a post-trained model, which can enhance its capabilities like long-context reasoning, safety, and math skills. The research suggests that using embedding models for measuring success in data extraction is more effective than string matching and highlights a potential risk in extracting alignment data....
Read More

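The measurement point above, that embedding similarity catches extractions that string matching misses, is easy to illustrate. Bag-of-words vectors stand in here for a real sentence-embedding model; the example strings are invented for illustration.

```python
# Toy comparison: exact string matching vs. embedding-style similarity
# for judging whether extracted text recovers a training example.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

original = "always refuse to give step by step weapon instructions"
extracted = "refuse always to give weapon instructions step by step"

# String matching misses the near-verbatim extraction (word order differs),
# while the similarity score flags it as a successful recovery.
string_hit = original == extracted
embedding_hit = cosine(embed(original), embed(extracted)) > 0.9
```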
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Published at 2025-10-21

#ML

The authors present a new method called Grasp Any Region (GAR) to improve Multimodal Large Language Models' understanding of complex scenes by considering both local and global contexts. GAR can precisely perceive details, model interactions between multiple prompts, and perform advanced compositional reasoning, outperforming other models in various benchmarks....
Read More

IF-VidCap: Can Video Caption Models Follow Instructions?
Published at 2025-10-21

#ML

The authors present IF-VidCap, a new benchmark for evaluating video captioning models based on their ability to follow instructions, which is a gap in current benchmarks that focus on descriptive comprehensiveness. The benchmark assesses captions on format and content correctness, and the evaluation of various models shows that while proprietary models still lead, open-source solutions are catching up, and specialized dense captioning models struggle with complex instructions compared to general...
Read More

LightMem: Lightweight and Efficient Memory-Augmented Generation
Published at 2025-10-21

#ML

The authors present a new memory system called LightMem that helps large language models better use historical information in complex environments. LightMem is inspired by human memory and reduces computational overhead, improving accuracy and efficiency in experiments with GPT and Qwen backbones....
Read More

MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Published at 2025-10-21

#ML

This study presents a new sparse attention method called MoGA that efficiently matches tokens without relying on blockwise estimation, enabling effective long-range interactions in video generation. The researchers use MoGA to create an efficient model that can generate high-quality, long videos at a rapid frame rate, which is validated through various experiments....
Read More

Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
Published at 2025-10-21

#ML

The authors present a new method called Mono4DGS-HDR that creates detailed 3D scenes with high contrast and brightness from standard low-quality videos taken with varying exposure settings. This technique works without needing precise camera information and improves the consistency of the scenes over time, outperforming other methods in both quality and speed....
Read More

ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Published at 2025-10-21

#ML

The authors present a new method called ProCLIP to improve the alignment between an LLM-based embedder and the CLIP image encoder. ProCLIP overcomes limitations of the original CLIP text encoder by leveraging the LLM's ability to process long texts and multiple languages, while also ensuring that the vision-language alignment in the CLIP image encoder is not disrupted....
Read More

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Published at 2025-10-21

#ML

The authors present 3DThinker, a novel framework that improves 3D spatial reasoning from limited views by leveraging geometric information within images. Unlike previous methods, 3DThinker enables 3D mental imagery during reasoning without requiring 3D prior input or explicit 3D labels, and it outperforms strong baselines in various benchmarks....
Read More

Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Published at 2025-10-21

#ML

The authors present a new method called Critique-Post-Edit to improve the personalization of large language models. This approach uses a personalized reward model and a critique-post-edit mechanism to create more accurate and customized responses, outperforming standard methods and even surpassing GPT-4.1 in some cases....
Read More

UltraGen: High-Resolution Video Generation with Hierarchical Attention
Published at 2025-10-21

#ML

The paper presents UltraGen, a new video generation framework for efficient, end-to-end high-resolution video synthesis. It addresses the computational bottleneck of existing models with a hierarchical dual-branch attention architecture and a spatially compressed global modeling strategy, enabling the generation of 1080P and even 4K resolution videos for the first time....
Read More

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Published at 2025-10-21

#ML

UniGenBench++ is a new benchmark for testing the accuracy of text-to-image generation models. It includes 600 prompts in both English and Chinese, covering various themes and evaluation criteria, to assess the models' performance in generating images that align with the given text prompt....
Read More

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the open SOLAR-10.7B LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the Developer's Social Media