🤗 Daily Paper Newsletter

Hope you find some gems! This newsletter delivers a curated list of papers from 🤗 Daily Papers.
StrandDesigner: Towards Practical Strand Generation with Sketch Guidance
Published at 2025-08-03
#ML

The authors present a new sketch-based hair strand generation model that offers finer control and greater ease of use than current text- or image-based methods. The model uses novel techniques to handle complex strand interactions and varied sketch patterns, resulting in more realistic and precise hair generation than existing methods...
Read More
Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models
Published at 2025-08-04
#ML

This study reviews methods to improve the efficiency of large reasoning models, like DeepSeek R1, which are better at logical deduction than traditional language models. The methods focus on reducing excessively long reasoning chains without sacrificing accuracy, and are categorized into single-model optimization and model collaboration...
Read More
I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
Published at 2025-08-04
#ML

The paper presents a new framework called I2CR that improves multimodal entity linking by primarily using text information and only incorporating image data when necessary, through a multi-round iterative strategy. This approach outperforms current state-of-the-art methods by 3.2%, 5.1%, and 1.6% on three public datasets...
Read More
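As a minimal sketch of that text-first, image-on-demand loop: the helper functions and the 0.9 confidence threshold below are hypothetical placeholders, not the paper's API; only the control flow follows the summary.

```python
def link_by_text(mention, text_ctx, kb):
    """Hypothetical intra-modal linker: return (candidate entity, confidence)."""
    return kb[0], 0.5  # stub

def refine_with_image(mention, image, candidate, kb):
    """Hypothetical cross-modal reflection: revise the candidate using visual cues."""
    return candidate, 0.95  # stub

def link_entity(mention, text_ctx, image, kb, max_rounds=3):
    candidate, confidence = link_by_text(mention, text_ctx, kb)  # text-first pass
    for _ in range(max_rounds):
        if confidence >= 0.9:  # text evidence suffices; the image is never consulted
            break
        candidate, confidence = refine_with_image(mention, image, candidate, kb)
    return candidate
```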
Marco-Voice Technical Report
Published at 2025-08-04
#ML

This study proposes a unified speech synthesis system that combines voice cloning and emotion control, addressing the challenge of creating expressive, controllable, and natural speech while preserving speaker identity. The system uses a speaker-emotion disentanglement mechanism and a rotational emotional embedding integration method, and is trained and evaluated on a high-quality emotional speech dataset called CSEMOTIONS, resulting in substantial improvements in speech clarity and emotional richness...
Read More
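Purely as a toy illustration of what a rotation-style integration could look like (an assumption, not Marco-Voice's actual mechanism): rotating pairs of speaker-embedding dimensions by emotion-derived angles preserves the vector norm, which is one plausible way to inject emotion while perturbing speaker identity as little as possible.

```python
import torch

def rotate_embedding(speaker_emb: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """speaker_emb: (d,) with even d; angles: (d // 2,) emotion-derived angles."""
    x = speaker_emb.view(-1, 2)                    # pair up adjacent dimensions
    cos, sin = torch.cos(angles), torch.sin(angles)
    rotated = torch.stack((cos * x[:, 0] - sin * x[:, 1],   # 2-D rotation per pair
                           sin * x[:, 0] + cos * x[:, 1]), dim=1)
    return rotated.reshape(-1)                     # same norm as the input

emb = rotate_embedding(torch.randn(8), torch.rand(4))
```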
Are Today's LLMs Ready to Explain Well-Being Concepts?
Published at 2025-08-05
#ML

This study creates a large dataset of 43,880 explanations about well-being concepts generated by ten LLMs and evaluates their quality using a new framework. The researchers find that fine-tuning an LLM can improve explanation quality, and that the quality of explanations varies depending on the model, audience, and topic, with preference-based learning being particularly effective for this task...
Read More
Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
Published at 2025-08-05
#ML

The authors present Double-Bench, a new comprehensive evaluation system for document RAG systems, addressing limitations of current benchmarks by providing a large-scale, multilingual, and multimodal platform with fine-grained assessments. Experiments using Double-Bench reveal gaps in current document retrieval models and over-confidence issues in document RAG frameworks, aiming to support future research in advanced document RAG systems...
Read More
Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability
Published at 2025-08-05
#ML

The study presents a framework to evaluate large multimodal models' ability to detect faulty inputs, revealing that most models struggle with flawed textual premises and have varying reliance on different modalities. The findings highlight the need to improve these models' proactive input verification skills...
Read More
CoAct-1: Computer-using Agents with Coding as Actions
Published at 2025-08-05
#ML

The authors present CoAct-1, a multi-agent system that combines computer control through Graphical User Interfaces (GUIs) with direct programmatic execution. By allowing agents to write and execute Python or Bash scripts, CoAct-1 achieves a new state-of-the-art success rate of 60.76% on the OSWorld benchmark, outperforming prior methods and reducing the average number of steps required to complete a task to just 10.15...
Read More
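A minimal sketch of that division of labor: the step schema and `GuiAgent` class are hypothetical stand-ins; only the routing idea (scripts for programmatic work, GUI control otherwise) follows the paper's description.

```python
import subprocess

class GuiAgent:
    """Placeholder for a pixel-level GUI controller."""
    def execute(self, action: str) -> str:
        return f"clicked: {action}"  # stub

def run_step(step: dict, gui: GuiAgent) -> str:
    if step["kind"] == "script":
        # Coding as actions: run Bash (or Python) directly for file and system tasks.
        result = subprocess.run(step["command"], shell=True,
                                capture_output=True, text=True)
        return result.stdout
    return gui.execute(step["action"])  # fall back to GUI manipulation

print(run_step({"kind": "script", "command": "echo hello"}, GuiAgent()))
```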
Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Published at 2025-08-05
#ML

The authors propose MACT, a framework for visual document understanding and question answering, which consists of four small-scale agents working together to improve performance on tasks involving long visual contexts and complex reasoning, outperforming existing vision-language models...
Read More
Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
Published at 2025-08-06
#ML

The authors propose a framework for customer support conversations based on professional guidelines, creating a dataset of annotated real-world interactions and a role-playing approach to train language models. Their experiments show that this method improves the models' ability to generate high-quality, strategy-aligned responses, leading to better problem resolution...
Read More
Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
Published at 2025-08-06
#ML

This study examines why reasoning models often make mistakes, particularly in complex, multi-step tasks. The researchers propose a new way to categorize these errors, focusing on the number of sources involved, the completeness of information, and inefficient thinking. Their findings offer valuable insights into the limitations of current models and suggest ways to improve their accuracy, transparency, and reliability...
Read More
I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
Published at 2025-08-06
#ML

The study presents a benchmark to measure bias in large language models by simulating interviews with 100 questions, focusing on how these models respond to subtle linguistic markers that reveal demographic attributes. They find that models penalize certain linguistic patterns, like hedging language, leading to lower ratings, and demonstrate the benchmark's effectiveness in identifying and measuring these biases...
Read More
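A sketch of the core measurement: rate paired answers that differ only in a linguistic marker (here, hedging) and report the gap. `rate_answer` stands in for the LLM judge and is a stub, not the benchmark's interface.

```python
def rate_answer(question: str, answer: str) -> float:
    """Stub judge; in the benchmark this would be an LLM rating the answer."""
    return 5.0

def bias_gap(question: str, plain: str, hedged: str) -> float:
    """Positive gap means the hedged phrasing was penalized."""
    return rate_answer(question, plain) - rate_answer(question, hedged)

gap = bias_gap("Why do you want this role?",
               "I led the data migration.",
               "I suppose I kind of led the data migration.")
```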
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Published at 2025-08-06
#ML

Researchers created a new framework called R-Zero that allows language models to learn and improve on their own without needing large amounts of human-curated data. This method significantly enhances the models' reasoning skills in various tasks...
Read More
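One way to picture such a self-evolving loop, as a schematic under assumptions: the challenger/solver roles, stub tasks, and the 0.3-0.7 agreement band below are illustrative choices, not the paper's exact recipe.

```python
import random
from collections import Counter

class Challenger:
    def propose(self, n):                 # stub: invent arithmetic questions
        return [f"{random.randint(1, 9)}+{random.randint(1, 9)}" for _ in range(n)]
    def finetune(self, solver): pass      # stub

class Solver:
    def answer(self, q):                  # stub: noisy solver
        a, b = map(int, q.split("+"))
        return a + b + random.choice([0, 0, 0, 1])
    def finetune(self, data): pass        # stub

def self_evolve(challenger, solver, rounds=3, k=8):
    for _ in range(rounds):
        train_set = []
        for q in challenger.propose(n=100):
            answers = [solver.answer(q) for _ in range(k)]
            label, votes = Counter(answers).most_common(1)[0]
            if 0.3 <= votes / k <= 0.7:   # keep questions that are neither trivial nor hopeless
                train_set.append((q, label))  # pseudo-label by majority vote
        solver.finetune(train_set)        # solver learns on the frontier
        challenger.finetune(solver)       # challenger re-targets the new weaknesses

self_evolve(Challenger(), Solver())
```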
REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation
Published at 2025-08-06
#ML

The authors propose a new method, REINA, to improve simultaneous speech translation systems by balancing translation quality and latency. They demonstrate that REINA, based on information theory principles, outperforms previous approaches and achieves state-of-the-art results on multiple languages using only open-source or synthetic data...
Read More
RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation
Published at 2025-08-06
#ML

The authors present RPCANet++, a new method for segmenting sparse objects in images by combining the strengths of robust principal component analysis (RPCA) and deep learning architectures. This approach improves upon traditional RPCA by reducing computational burden, eliminating the need for fine-tuning hyperparameters, and increasing adaptability in dynamic scenarios, all while maintaining high performance and interpretability...
Read More
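For readers new to deep unrolling, here is a generic sketch of the pattern (illustrative, not RPCANet++'s actual modules): each stage alternates estimates of the low-rank background L and the sparse foreground S in D ≈ L + S, with small learned networks in place of the classical proximal operators.

```python
import torch
import torch.nn as nn

def _block():
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

class UnrolledRPCA(nn.Module):
    def __init__(self, stages: int = 6):
        super().__init__()
        self.lowrank_nets = nn.ModuleList(_block() for _ in range(stages))
        self.sparse_nets = nn.ModuleList(_block() for _ in range(stages))

    def forward(self, d: torch.Tensor) -> torch.Tensor:   # d: (B, 1, H, W)
        s = torch.zeros_like(d)
        for f_l, f_s in zip(self.lowrank_nets, self.sparse_nets):
            l = f_l(d - s)   # update the background estimate from D - S
            s = f_s(d - l)   # update the sparse-object estimate from D - L
        return s             # the sparse component carries the segmentation signal

mask_logits = UnrolledRPCA()(torch.randn(2, 1, 64, 64))
```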
Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression
Published at 2025-08-06
#ML

The authors present SODEC, a new single-step diffusion image compression model that addresses the issues of slow decoding and poor image quality in current diffusion-based models. SODEC uses a pre-trained VAE model to create informative latents and a fidelity guidance module to improve image quality, resulting in faster decoding and better rate-distortion-perception performance compared to existing methods...
Read More
Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
Published at 2025-08-06
#ML

The authors present MLLMSeg, a new framework that makes full use of the visual detail features in the MLLM vision encoder to improve referring expression segmentation, without adding extra visual encoders. They also introduce a lightweight mask decoder that combines detailed spatial features from the visual encoder with semantic features from the LLM to predict precise masks, outperforming both SAM-based and SAM-free competitors in a cost-effective manner...
Read More
Attention Basin: Why Contextual Position Matters in Large Language Models
Published at 2025-08-07
#ML

Researchers discovered that large language models pay more attention to information at the start and end of a sequence, ignoring the middle. They then created Attention-Driven Reranking, a method that rearranges information to highlight important content, improving performance across various models without training or parameter changes...
Read More
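A toy version of such position-aware reranking: place the highest-scoring documents at both edges of the context, where attention is strongest, and bury weak ones in the middle. The alternating placement rule is an illustrative choice, not necessarily the paper's exact procedure.

```python
def rerank_for_basin(docs_with_scores):
    """Put the strongest documents at both ends of the context, weakest in the middle."""
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, doc in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)  # alternate strong docs to each edge
    return front + back[::-1]

order = rerank_for_basin([("d1", 0.9), ("d2", 0.2), ("d3", 0.7), ("d4", 0.4)])
# -> d1 (0.9), d4 (0.4), d2 (0.2), d3 (0.7): high scores at the edges, low in the middle
```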
DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Published at 2025-08-07
#ML

The authors present DeepPHY, a new benchmark framework that tests how well Vision Language Models (VLMs) understand and apply physical principles in complex, virtual environments. They found that even the best VLMs struggle with turning knowledge of physics into accurate actions, highlighting a gap that needs to be addressed for real-world applications...
Read More
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Published at 2025-08-07
#ML

Genie Envisioner is a unified platform for learning, evaluating, and simulating robotic manipulation tasks. It uses a large-scale video model to understand real-world robotic interactions and generates action trajectories with minimal supervision, while also providing a benchmark suite to measure its performance...
Read More
Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity
Published at 2025-08-07
#ML

The authors present a new framework called Hi3DEval for evaluating 3D generated content, which goes beyond existing methods by assessing both object-level and part-level details, as well as material realism. They also create a large-scale dataset called Hi3DBench to support this framework, and their approach proves to be more effective than image-based metrics in modeling 3D characteristics and aligning with human preference...
Read More
InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities
Published at 2025-08-07
#ML

The study presents a new method called InfiAlign that efficiently enhances the reasoning skills of large language models by selecting high-quality data using multiple quality metrics, resulting in significant performance improvements with less training data...
Read More
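A generic multi-metric selection sketch in that spirit (the metrics shown are invented stand-ins, not InfiAlign's actual criteria): score each candidate sample on several quality axes, combine the scores, and keep the top slice for alignment training.

```python
def select_samples(samples, metrics, keep_ratio=0.2):
    """samples: list of dicts; metrics: (scoring_fn, weight) pairs; keep the top slice."""
    scored = sorted(samples,
                    key=lambda s: sum(w * fn(s) for fn, w in metrics),
                    reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]

metrics = [(lambda s: len(s["reasoning"]) / 1000, 0.5),  # stand-in for reasoning depth
           (lambda s: float(s["verified"]), 0.5)]        # stand-in for answer correctness
data = [{"reasoning": "step " * 200, "verified": True},
        {"reasoning": "short", "verified": False}]
kept = select_samples(data, metrics)
```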
MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
Published at 2025-08-07
#ML

The MOSEv2 dataset is introduced to improve video object segmentation in real-world scenarios. It contains 5,024 videos and 701,976 masks for 10,074 objects, featuring increased complexity and challenges like object disappearance, severe occlusions, adverse weather, and low-light scenes, which current methods struggle with...
Read More
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Published at 2025-08-07
#ML

The authors propose a new method called Dynamic Fine-Tuning (DFT) to improve the generalization of Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). DFT addresses the problematic reward structure in SFT by dynamically rescaling the objective function, resulting in significantly better performance across various benchmarks and base models, and remaining competitive even in offline reinforcement learning settings...
Read More
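A sketch of what such dynamic rescaling can look like in code, assuming (as the summary suggests) that each token's cross-entropy is reweighted by its own detached probability; treat the exact form as an assumption rather than the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); targets: (B, T). Probability-weighted cross-entropy."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(y_t | x)
    weight = tok_logp.exp().detach()    # p(y_t | x): no gradient flows through the weight
    return -(weight * tok_logp).mean()  # down-weights improbable target tokens

loss = dft_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)))
```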
PRvL: Quantifying the Capabilities and Risks of Large Language Models for PII Redaction
Published at 2025-08-07
#ML

The study examines various architectures and training methods for large language models (LLMs) to determine their effectiveness at redacting sensitive information, while also considering factors like efficiency, privacy, and cost. The researchers developed an open-source tool called PRvL, which includes fine-tuned models and evaluation resources for general PII redaction, enabling data owners to customize and use the tool within their own secure environments...
Read More
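The redaction task itself reduces to span replacement; below is a minimal prompt-based sketch of that contract. The prompt wording and the `generate` callable are illustrative assumptions; PRvL's actual interfaces may differ.

```python
def redact(text: str, generate) -> str:
    """Ask a (fine-tuned) LLM to swap PII spans for typed placeholders."""
    prompt = ("Replace every piece of personally identifiable information "
              "with a typed placeholder such as [NAME] or [EMAIL].\n\n" + text)
    return generate(prompt)

# `generate` would wrap a PRvL fine-tuned model; a stub shows the call shape.
out = redact("Contact Jane Doe at jane@example.com.", generate=lambda p: "[redacted]")
```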
Tags are generated by Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) Full papers are translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media