🤗 Daily Paper Newsletter

Hope you find some gems! This newsletter delivers a curated list of papers from 🤗 Daily Papers.
StrandDesigner: Towards Practical Strand Generation with Sketch Guidance
Published at 2025-08-03
#ML

The authors present a new sketch-based hair strand generation model that offers finer control and greater ease of use than current text- or image-based methods. The model uses novel techniques to handle complex strand interactions and varied sketch patterns, resulting in more realistic and precise hair generation than existing methods...
Read More
Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models
Published at 2025-08-04
#ML

This study reviews methods to improve the efficiency of large reasoning models, like DeepSeek R1, which are better at logical deduction than traditional language models. The methods focus on reducing excessively long reasoning chains without sacrificing accuracy, and are categorized into single-model optimization and model collaboration...
Read More
I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
Published at 2025-08-04
#ML

The paper presents a new framework called I2CR that improves multimodal entity linking by primarily using text information and only incorporating image data when necessary, through a multi-round iterative strategy. This approach outperforms current state-of-the-art methods by 3.2%, 5.1%, and 1.6% on three public datasets...
Read More
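As a minimal sketch of that text-first, image-on-demand loop: the helper functions and the 0.9 confidence threshold below are hypothetical placeholders, not the paper's API; only the control flow follows the summary.

```python
def link_by_text(mention, text_ctx, kb):
    """Hypothetical intra-modal linker: return (candidate entity, confidence)."""
    return kb[0], 0.5  # stub

def refine_with_image(mention, image, candidate, kb):
    """Hypothetical cross-modal reflection: revise the candidate using visual cues."""
    return candidate, 0.95  # stub

def link_entity(mention, text_ctx, image, kb, max_rounds=3):
    candidate, confidence = link_by_text(mention, text_ctx, kb)  # text-first pass
    for _ in range(max_rounds):
        if confidence >= 0.9:  # text evidence suffices; the image is never consulted
            break
        candidate, confidence = refine_with_image(mention, image, candidate, kb)
    return candidate
```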
Marco-Voice Technical Report
Published at 2025-08-04
#ML

This study proposes a unified speech synthesis system that combines voice cloning and emotion control, addressing the challenge of creating expressive, controllable, and natural speech while preserving speaker identity. The system uses a speaker-emotion disentanglement mechanism and a rotational emotional embedding integration method, and is trained and evaluated on a high-quality emotional speech dataset called CSEMOTIONS, resulting in substantial improvements in speech clarity and emotional richness...
Read More
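Purely as a toy illustration of what a rotation-style integration could look like (an assumption, not Marco-Voice's actual mechanism): rotating pairs of speaker-embedding dimensions by emotion-derived angles preserves the vector norm, which is one plausible way to inject emotion while perturbing speaker identity as little as possible.

```python
import torch

def rotate_embedding(speaker_emb: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """speaker_emb: (d,) with even d; angles: (d // 2,) emotion-derived angles."""
    x = speaker_emb.view(-1, 2)                    # pair up adjacent dimensions
    cos, sin = torch.cos(angles), torch.sin(angles)
    rotated = torch.stack((cos * x[:, 0] - sin * x[:, 1],   # 2-D rotation per pair
                           sin * x[:, 0] + cos * x[:, 1]), dim=1)
    return rotated.reshape(-1)                     # same norm as the input

emb = rotate_embedding(torch.randn(8), torch.rand(4))
```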
Are Today's LLMs Ready to Explain Well-Being Concepts?
Published at 2025-08-05
#ML

This study creates a large dataset of 43,880 explanations about well-being concepts generated by ten LLMs and evaluates their quality using a new framework. The researchers find that fine-tuning an LLM can improve explanation quality, and that the quality of explanations varies depending on the model, audience, and topic, with preference-based learning being particularly effective for this task...
Read More
Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
Published at 2025-08-05
#ML

The authors present Double-Bench, a new comprehensive evaluation system for document RAG systems, addressing limitations of current benchmarks by providing a large-scale, multilingual, and multimodal platform with fine-grained assessments. Experiments using Double-Bench reveal gaps in current document retrieval models and over-confidence issues in document RAG frameworks, aiming to support future research in advanced document RAG systems...
Read More
Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability
Published at 2025-08-05
#ML

The study presents a framework to evaluate large multimodal models' ability to detect faulty inputs, revealing that most models struggle with flawed textual premises and have varying reliance on different modalities. The findings highlight the need to improve these models' proactive input verification skills...
Read More
CoAct-1: Computer-using Agents with Coding as Actions
Published at 2025-08-05
#ML

The authors present CoAct-1, a multi-agent system that combines computer control through Graphical User Interfaces (GUIs) with direct programmatic execution. By allowing agents to write and execute Python or Bash scripts, CoAct-1 achieves a new state-of-the-art success rate of 60.76% on the OSWorld benchmark, outperforming prior methods and reducing the average number of steps required to complete a task to just 10.15...
Read More
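A minimal sketch of that division of labor: the step schema and `GuiAgent` class are hypothetical stand-ins; only the routing idea (scripts for programmatic work, GUI control otherwise) follows the paper's description.

```python
import subprocess

class GuiAgent:
    """Placeholder for a pixel-level GUI controller."""
    def execute(self, action: str) -> str:
        return f"clicked: {action}"  # stub

def run_step(step: dict, gui: GuiAgent) -> str:
    if step["kind"] == "script":
        # Coding as actions: run Bash (or Python) directly for file and system tasks.
        result = subprocess.run(step["command"], shell=True,
                                capture_output=True, text=True)
        return result.stdout
    return gui.execute(step["action"])  # fall back to GUI manipulation

print(run_step({"kind": "script", "command": "echo hello"}, GuiAgent()))
```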
Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Published at 2025-08-05
#ML

The authors propose MACT, a framework for visual document understanding and question answering, which consists of four small-scale agents working together to improve performance on tasks involving long visual contexts and complex reasoning, outperforming existing vision-language models...
Read More
Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
Published at 2025-08-06
#ML

The authors propose a framework for customer support conversations based on professional guidelines, creating a dataset of annotated real-world interactions and a role-playing approach to train language models. Their experiments show that this method improves the models' ability to generate high-quality, strategy-aligned responses, leading to better problem resolution...
Read More
Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
Published at 2025-08-06
#ML

This study examines why reasoning models often make mistakes, particularly in complex, multi-step tasks. The researchers propose a new way to categorize these errors, focusing on the number of sources involved, the completeness of information, and inefficient thinking. Their findings offer valuable insights into the limitations of current models and suggest ways to improve their accuracy, transparency, and reliability...
Read More
I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
Published at 2025-08-06
#ML

The study presents a benchmark to measure bias in large language models by simulating interviews with 100 questions, focusing on how these models respond to subtle linguistic markers that reveal demographic attributes. They find that models penalize certain linguistic patterns, like hedging language, leading to lower ratings, and demonstrate the benchmark's effectiveness in identifying and measuring these biases...
Read More
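A sketch of the core measurement: rate paired answers that differ only in a linguistic marker (here, hedging) and report the gap. `rate_answer` stands in for the LLM judge and is a stub, not the benchmark's interface.

```python
def rate_answer(question: str, answer: str) -> float:
    """Stub judge; in the benchmark this would be an LLM rating the answer."""
    return 5.0

def bias_gap(question: str, plain: str, hedged: str) -> float:
    """Positive gap means the hedged phrasing was penalized."""
    return rate_answer(question, plain) - rate_answer(question, hedged)

gap = bias_gap("Why do you want this role?",
               "I led the data migration.",
               "I suppose I kind of led the data migration.")
```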
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Published at 2025-08-06
#ML

Researchers created a new framework called R-Zero that allows language models to learn and improve on their own without needing large amounts of human-curated data. This method significantly enhances the models' reasoning skills in various tasks...
Read More
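One way to picture such a self-evolving loop, as a schematic under assumptions: the challenger/solver roles, stub tasks, and the 0.3-0.7 agreement band below are illustrative choices, not the paper's exact recipe.

```python
import random
from collections import Counter

class Challenger:
    def propose(self, n):                 # stub: invent arithmetic questions
        return [f"{random.randint(1, 9)}+{random.randint(1, 9)}" for _ in range(n)]
    def finetune(self, solver): pass      # stub

class Solver:
    def answer(self, q):                  # stub: noisy solver
        a, b = map(int, q.split("+"))
        return a + b + random.choice([0, 0, 0, 1])
    def finetune(self, data): pass        # stub

def self_evolve(challenger, solver, rounds=3, k=8):
    for _ in range(rounds):
        train_set = []
        for q in challenger.propose(n=100):
            answers = [solver.answer(q) for _ in range(k)]
            label, votes = Counter(answers).most_common(1)[0]
            if 0.3 <= votes / k <= 0.7:   # keep questions that are neither trivial nor hopeless
                train_set.append((q, label))  # pseudo-label by majority vote
        solver.finetune(train_set)        # solver learns on the frontier
        challenger.finetune(solver)       # challenger re-targets the new weaknesses

self_evolve(Challenger(), Solver())
```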
REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation
Published at 2025-08-06
#ML

The authors propose a new method, REINA, to improve simultaneous speech translation systems by balancing translation quality and latency. They demonstrate that REINA, based on information theory principles, outperforms previous approaches and achieves state-of-the-art results on multiple languages using only open-source or synthetic data...
Read More
RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation
Published at 2025-08-06
#ML

The authors present RPCANet++, a new method for segmenting sparse objects in images by combining the strengths of robust principal component analysis (RPCA) and deep learning architectures. This approach improves upon traditional RPCA by reducing computational burden, eliminating the need for fine-tuning hyperparameters, and increasing adaptability in dynamic scenarios, all while maintaining high performance and interpretability...
Read More
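For readers new to deep unrolling, here is a generic sketch of the pattern (illustrative, not RPCANet++'s actual modules): each stage alternates estimates of the low-rank background L and the sparse foreground S in D ≈ L + S, with small learned networks in place of the classical proximal operators.

```python
import torch
import torch.nn as nn

def _block():
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

class UnrolledRPCA(nn.Module):
    def __init__(self, stages: int = 6):
        super().__init__()
        self.lowrank_nets = nn.ModuleList(_block() for _ in range(stages))
        self.sparse_nets = nn.ModuleList(_block() for _ in range(stages))

    def forward(self, d: torch.Tensor) -> torch.Tensor:   # d: (B, 1, H, W)
        s = torch.zeros_like(d)
        for f_l, f_s in zip(self.lowrank_nets, self.sparse_nets):
            l = f_l(d - s)   # update the background estimate from D - S
            s = f_s(d - l)   # update the sparse-object estimate from D - L
        return s             # the sparse component carries the segmentation signal

mask_logits = UnrolledRPCA()(torch.randn(2, 1, 64, 64))
```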
Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression
Published at 2025-08-06
#ML

The authors present SODEC, a new single-step diffusion image compression model that addresses the issues of slow decoding and poor image quality in current diffusion-based models. SODEC uses a pre-trained VAE model to create informative latents and a fidelity guidance module to improve image quality, resulting in faster decoding and better rate-distortion-perception performance compared to existing methods...
Read More
Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
Published at 2025-08-06
#ML

The authors present MLLMSeg, a new framework that makes full use of the visual detail features in the MLLM vision encoder to improve referring expression segmentation, without adding extra visual encoders. They also introduce a lightweight mask decoder that combines detailed spatial features from the visual encoder with semantic features from the LLM to predict precise masks, outperforming both SAM-based and SAM-free competitors in a cost-effective manner...
Read More
Attention Basin: Why Contextual Position Matters in Large Language Models
Published at 2025-08-07
#ML

Researchers discovered that large language models pay more attention to information at the start and end of a sequence, ignoring the middle. They then created Attention-Driven Reranking, a method that rearranges information to highlight important content, improving performance across various models without training or parameter changes...
Read More
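A toy version of such position-aware reranking: place the highest-scoring documents at both edges of the context, where attention is strongest, and bury weak ones in the middle. The alternating placement rule is an illustrative choice, not necessarily the paper's exact procedure.

```python
def rerank_for_basin(docs_with_scores):
    """Put the strongest documents at both ends of the context, weakest in the middle."""
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, doc in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)  # alternate strong docs to each edge
    return front + back[::-1]

order = rerank_for_basin([("d1", 0.9), ("d2", 0.2), ("d3", 0.7), ("d4", 0.4)])
# -> d1 (0.9), d4 (0.4), d2 (0.2), d3 (0.7): high scores at the edges, low in the middle
```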
DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Published at 2025-08-07
#ML

The authors present DeepPHY, a new benchmark framework that tests how well Vision Language Models (VLMs) understand and apply physical principles in complex, virtual environments. They found that even the best VLMs struggle with turning knowledge of physics into accurate actions, highlighting a gap that needs to be addressed for real-world applications...
Read More
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Published at 2025-08-07
#ML

Genie Envisioner is a unified platform for learning, evaluating, and simulating robotic manipulation tasks. It uses a large-scale video model to understand real-world robotic interactions and generates action trajectories with minimal supervision, while also providing a benchmark suite to measure its performance...
Read More
Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity
Published at 2025-08-07
#ML

The authors present a new framework called Hi3DEval for evaluating 3D generated content, which goes beyond existing methods by assessing both object-level and part-level details, as well as material realism. They also create a large-scale dataset called Hi3DBench to support this framework, and their approach proves to be more effective than image-based metrics in modeling 3D characteristics and aligning with human preference...
Read More
InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities
Published at 2025-08-07
#ML

The study presents a new method called InfiAlign that efficiently enhances the reasoning skills of large language models by selecting high-quality data using multiple quality metrics, resulting in significant performance improvements with less training data...
Read More
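A generic multi-metric selection sketch in that spirit (the metrics shown are invented stand-ins, not InfiAlign's actual criteria): score each candidate sample on several quality axes, combine the scores, and keep the top slice for alignment training.

```python
def select_samples(samples, metrics, keep_ratio=0.2):
    """samples: list of dicts; metrics: (scoring_fn, weight) pairs; keep the top slice."""
    scored = sorted(samples,
                    key=lambda s: sum(w * fn(s) for fn, w in metrics),
                    reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]

metrics = [(lambda s: len(s["reasoning"]) / 1000, 0.5),  # stand-in for reasoning depth
           (lambda s: float(s["verified"]), 0.5)]        # stand-in for answer correctness
data = [{"reasoning": "step " * 200, "verified": True},
        {"reasoning": "short", "verified": False}]
kept = select_samples(data, metrics)
```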
MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
Published at 2025-08-07
#ML

The MOSEv2 dataset is introduced to improve video object segmentation in real-world scenarios. It contains 5,024 videos and 701,976 masks for 10,074 objects, featuring increased complexity and challenges like object disappearance, severe occlusions, adverse weather, and low-light scenes, which current methods struggle with...
Read More
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Published at 2025-08-07
#ML

The authors propose a new method called Dynamic Fine-Tuning (DFT) to improve the generalization of Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). DFT addresses the problematic reward structure in SFT by dynamically rescaling the objective function, resulting in significantly better performance across various benchmarks and base models, and remaining competitive even in offline reinforcement learning settings...
Read More
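A sketch of what such dynamic rescaling can look like in code, assuming (as the summary suggests) that each token's cross-entropy is reweighted by its own detached probability; treat the exact form as an assumption rather than the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); targets: (B, T). Probability-weighted cross-entropy."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(y_t | x)
    weight = tok_logp.exp().detach()    # p(y_t | x): no gradient flows through the weight
    return -(weight * tok_logp).mean()  # down-weights improbable target tokens

loss = dft_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)))
```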
PRvL: Quantifying the Capabilities and Risks of Large Language Models for PII Redaction
Published at 2025-08-07
#ML

The study examines various architectures and training methods for large language models (LLMs) to determine their effectiveness at redacting sensitive information, while also considering factors like efficiency, privacy, and cost. The researchers developed an open-source tool called PRvL, which includes fine-tuned models and evaluation resources for general PII redaction, enabling data owners to customize and use the tool within their own secure environments...
Read More
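The redaction task itself reduces to span replacement; below is a minimal prompt-based sketch of that contract. The prompt wording and the `generate` callable are illustrative assumptions; PRvL's actual interfaces may differ.

```python
def redact(text: str, generate) -> str:
    """Ask a (fine-tuned) LLM to swap PII spans for typed placeholders."""
    prompt = ("Replace every piece of personally identifiable information "
              "with a typed placeholder such as [NAME] or [EMAIL].\n\n" + text)
    return generate(prompt)

# `generate` would wrap a PRvL fine-tuned model; a stub shows the call shape.
out = redact("Contact Jane Doe at jane@example.com.", generate=lambda p: "[redacted]")
```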
Tags are generated by Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) Full papers are translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media