🤗 Daily Paper Newsletter

Hope you find some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Published at 2025-09-12
#ML

The authors present WhisTLE, a method to adapt pretrained speech recognition models to new domains using only text data. WhisTLE uses a variational autoencoder to model encoder outputs from text and fine-tunes the decoder, resulting in improved performance across various out-of-domain datasets and ASR models.
Read More

Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
Published at 2025-09-16
#ML

The authors present a new approach for role-playing agents that uses video modality to create dynamic role profiles, improving their ability to generate responses compared to static role profiles. They introduce a large-scale video dataset and a comprehensive framework for video-guided role-playing agents, which integrates both dynamic and static role profiles, and propose a robust evaluation method. Experimental results show the effectiveness of their approach.
Read More

Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
Published at 2025-09-17
#ML

This study examines how well instruction-guided text-to-speech systems follow user instructions for speech style and finds that while some models perform well, most struggle with fine-grained control, particularly in producing child or elderly voices as requested.
Read More

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue
Published at 2025-09-18
#ML

The study presents a framework called Ask-to-Clarify that allows embodied agents to engage in multi-turn dialogues to resolve ambiguous instructions before taking action. The framework consists of two components: a VLM for collaboration and a diffusion model for generating actions. It is trained with a two-stage strategy and outperforms existing state-of-the-art VLAs in 8 real-world tasks, demonstrating its potential for creating more collaborative and adaptive embodied agents.
Read More

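The "clarify before acting" pattern behind this framework can be illustrated with a toy control loop. This is a deliberately simplified sketch with hypothetical names (the paper's system uses a VLM and a diffusion action head, not string matching):

```python
# Toy sketch of an ask-to-clarify loop: before acting, the agent checks
# whether the instruction uniquely identifies a target; if not, it asks a
# clarifying question and narrows the candidates with the answer.
# All names here are hypothetical illustrations.

def find_candidates(instruction, scene_objects):
    """Return the scene objects the instruction could refer to."""
    words = instruction.lower().split()
    matches = [obj for obj in scene_objects if obj.split("_")[0] in words]
    # If nothing matches by name, every object remains a candidate.
    return matches or list(scene_objects)

def resolve_instruction(instruction, scene_objects, answer_question):
    """Ask clarifying questions until exactly one candidate remains."""
    dialogue = []
    candidates = find_candidates(instruction, scene_objects)
    while len(candidates) > 1:
        question = f"Which one do you mean: {', '.join(sorted(candidates))}?"
        answer = answer_question(question)   # human in the loop
        dialogue.append((question, answer))
        candidates = [c for c in candidates if c == answer]
    return candidates[0], dialogue

# Usage: two mugs in the scene make "pick up the mug" ambiguous,
# so the agent asks one clarifying question before committing.
scene = ["mug_red", "mug_blue", "plate"]
target, turns = resolve_instruction(
    "pick up the mug", scene, answer_question=lambda q: "mug_red")
```

The key design point mirrored here is that dialogue is a gate in front of action generation: the action model only ever receives a disambiguated target.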
Lynx: Towards High-Fidelity Personalized Video Generation
Published at 2025-09-18
#ML

The authors propose Lynx, a model for creating personalized videos from a single image using a high-quality video synthesis technique. It uses two adapters to ensure the video accurately represents the input image, resulting in better face resemblance, prompt following, and overall video quality compared to existing methods.
Read More

RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes
Published at 2025-09-18
#ML

The authors present a new method for optimizing camera parameters in dynamic scenes using only a single RGB video, making it more accurate and efficient than the standard COLMAP-based approach, which requires additional information such as ground-truth motion masks. The method combines three components, Patch-wise Tracking Filters, Outlier-aware Joint Optimization, and a Two-stage Optimization Strategy, to improve the stability, speed, and accuracy of camera parameter optimization in dynamic scenes.
Read More

SPATIALGEN: Layout-guided 3D Indoor Scene Generation
Published at 2025-09-18
#ML

The authors present SpatialGen, a model that creates realistic and consistent 3D indoor scenes using a 3D layout and a reference image. They also share a large, high-quality dataset of 12,328 indoor scenes to improve and advance the field of indoor scene generation.
Read More

A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
Published at 2025-09-19
#ML

The authors present a model called VLAC that improves robotic real-world reinforcement learning by using vision-language data and robot/human trajectory data to create a general process reward model. This model allows for efficient exploration, eliminates the need for task-specific reward engineering, and significantly increases success rates in various manipulation tasks with the help of a human-in-the-loop protocol.
Read More

BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Published at 2025-09-19
#ML

The research presents a new framework called BTL that mimics human cognitive processes when interacting with graphical interfaces. The framework is composed of three phases: Blink (detecting relevant screen areas), Think (higher-level reasoning), and Link (generating executable commands). Two innovations are introduced for the BTL framework: Blink Data Generation and BTL Reward. A GUI agent model, BTL-UI, is developed based on this framework and demonstrates top performance on various benchmarks.
Read More

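The three-phase decomposition described above can be sketched as a tiny pipeline. This is a hedged toy with hypothetical heuristics and a made-up screen representation, not the BTL-UI model itself:

```python
# Toy sketch of the Blink-Think-Link decomposition for one GUI step:
# Blink cheaply narrows attention to candidate regions, Think deliberates
# over them, and Link emits an executable command. All heuristics and the
# screen format below are hypothetical illustrations.

GOAL = "press the Search button"

def blink(screen, goal):
    """Rapidly filter screen elements to those plausibly relevant."""
    words = goal.lower().split()
    return [el for el in screen if any(w in el["label"].lower() for w in words)]

def think(regions):
    """Deliberate: pick the single best element (toy rule: prefer buttons)."""
    return sorted(regions, key=lambda el: el["type"] != "button")[0]

def link(element):
    """Emit an executable GUI command for the chosen element."""
    x, y = element["center"]
    return f"click({x}, {y})"

screen = [
    {"label": "Search", "type": "button", "center": (880, 42)},
    {"label": "Search history", "type": "text", "center": (120, 300)},
    {"label": "Settings", "type": "button", "center": (940, 42)},
]
command = link(think(blink(screen, GOAL)))
```

The staging matters: the cheap Blink filter keeps the expensive Think step from reasoning over the whole screen, and Link isolates command synthesis from reasoning.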
BaseReward: A Strong Baseline for Multimodal Reward Model
Published at 2025-09-19
#ML

The paper presents BaseReward, a powerful and efficient baseline for multimodal reward modeling, built on a Qwen2.5-VL backbone with an optimized two-layer reward head. BaseReward outperforms previous models on major benchmarks and enhances MLLMs' performance in real-world tasks, offering a clear, empirically backed guide for developing robust reward models.
Read More

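For readers unfamiliar with reward heads: the summary's "two-layer reward head" is a small network that maps backbone features to a scalar score. A minimal sketch, assuming toy feature sizes and random weights (BaseReward's actual head sits on a Qwen2.5-VL backbone with its own dimensions and training):

```python
# Minimal sketch of a two-layer reward head on top of (frozen) backbone
# features: linear -> ReLU -> linear, producing one scalar reward per
# input. Sizes and weights are hypothetical toy values.
import numpy as np

rng = np.random.default_rng(0)
feat, hidden = 32, 16                      # toy feature / hidden sizes
W1 = rng.normal(size=(hidden, feat)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.normal(size=(1, hidden)) * 0.1
b2 = np.zeros(1)

def reward_head(features):
    """Score one feature vector with the two-layer head."""
    h = np.maximum(W1 @ features + b1, 0.0)   # ReLU hidden layer
    return float((W2 @ h + b2)[0])            # scalar reward

score = reward_head(rng.normal(size=feat))    # e.g. features of one response
```

In preference training, two such scores (chosen vs. rejected response) would typically be compared with a ranking loss; only the head's input and output shapes are sketched here.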
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
Published at 2025-09-19
#ML

The authors propose a new model called Latent Zoning Network (LZN) that integrates generative modeling, representation learning, and classification by using a shared Gaussian latent space. LZN has shown promising results in improving image generation, achieving state-of-the-art performance in representation learning, and successfully performing joint generation and classification tasks.
Read More

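The unifying idea of one shared latent space serving several tasks can be illustrated with a toy: give each class a "zone" in the latent space, classify by nearest zone, and generate by sampling inside a zone. This is a hypothetical simplification far cruder than LZN's actual design:

```python
# Toy illustration of a shared Gaussian latent space serving both
# classification and generation: each class owns a zone (a Gaussian
# around a mean); classifying a latent point means finding the nearest
# zone mean, and generating for a class means sampling inside its zone.
# Zones, dimensions, and sigma are hypothetical.
import math
import random

ZONES = {"cat": (-2.0, 0.0), "dog": (2.0, 0.0)}  # class -> zone mean (2-D)
SIGMA = 0.5

def classify(latent):
    """Assign a latent point to the class whose zone mean is nearest."""
    return min(ZONES, key=lambda c: math.dist(latent, ZONES[c]))

def generate(label, rng):
    """Sample a latent point from the label's zone."""
    mx, my = ZONES[label]
    return (rng.gauss(mx, SIGMA), rng.gauss(my, SIGMA))

rng = random.Random(0)
sample = generate("dog", rng)      # generation: sample inside the zone
predicted = classify(sample)       # classification: nearest zone wins
```

Because the zones are well separated relative to their spread, a point sampled for a class is classified back to that class, which is the sense in which one latent space supports both directions.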
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Published at 2025-09-19
#ML

The authors present Manzano, a new framework that improves the performance of models that can understand and generate visual content. By using a hybrid image tokenizer and a special training recipe, Manzano achieves state-of-the-art results, outperforming other unified models and competing with specialist models, especially in text-rich evaluations.
Read More

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
Published at 2025-09-19
#ML

The study presents RPG, a new method for planning and generating large software repositories, which uses a graph to represent the repository's structure, functions, and data flows. The researchers also developed ZeroRepo, a framework based on RPG, which generates repositories from scratch and outperforms other methods in terms of size, functionality, and accuracy.
Read More

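The core idea of planning a repository as a graph before generating it can be sketched in a few lines. The node and edge semantics below are hypothetical, not the paper's actual RPG schema:

```python
# Toy sketch of a repository planning graph: nodes are planned modules,
# edges are data-flow dependencies (a module depends on the modules whose
# outputs it consumes), and a topological sort yields an order in which
# the modules can be generated one by one. The plan below is invented.
from graphlib import TopologicalSorter

repo_plan = {
    "config":   set(),               # no dependencies
    "io":       {"config"},
    "model":    {"config"},
    "train":    {"io", "model"},
    "evaluate": {"io", "model"},
}

# Every module appears after all of its dependencies, so each one can be
# generated with its upstream interfaces already fixed.
order = list(TopologicalSorter(repo_plan).static_order())
```

Planning on an explicit graph like this is what lets generation scale: the graph fixes interfaces and data flow globally, while each module is generated locally in dependency order.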
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media