🤗 Daily Paper (2025-09-22)


deep.di...@gmail.com

Sep 22, 2025, 4:07:21 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Published at 2025-09-12

#ML

The authors present WhisTLE, a method to adapt pretrained speech recognition models to new domains using only text data. WhisTLE uses a variational autoencoder to model encoder outputs from text and fine-tunes the decoder, resulting in improved performance across various out-of-domain datasets and ASR models....


Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents

Published at 2025-09-16

#ML

The authors present a new approach for role-playing agents that uses video modality to create dynamic role profiles, improving their ability to generate responses compared to static role profiles. They introduce a large-scale video dataset and a comprehensive framework for video-guided role-playing agents, which integrates both dynamic and static role profiles, and propose a robust evaluation method. Experimental results show the effectiveness of their approach....


Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

Published at 2025-09-17

#ML

This study examines how well instruction-guided text-to-speech systems follow user instructions for speech style and finds that while some models perform well, most struggle with fine-grained control, particularly in producing child or elderly voices as requested....


Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Published at 2025-09-18

#ML

The study presents a framework called Ask-to-Clarify that allows embodied agents to engage in multi-turn dialogues to resolve ambiguous instructions before taking action. The framework consists of two components: a VLM for collaboration and a diffusion model for generating actions. Trained with a two-stage strategy, it outperforms existing state-of-the-art vision-language-action models (VLAs) on 8 real-world tasks, demonstrating its potential for creating more collaborative and adaptive embodied agents....
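The control flow described above (ask until the instruction is unambiguous, then act) can be sketched as a simple loop. This is an illustrative stand-in, not the paper's implementation: the ambiguity check here is a toy keyword test replacing the VLM, and `answer_fn` is a hypothetical callback that supplies the user's reply.

```python
# Hedged sketch of an Ask-to-Clarify-style loop. The ambiguity test and
# the action format are illustrative assumptions, standing in for the
# paper's VLM collaboration component and diffusion action head.

def ask_to_clarify(instruction, answer_fn, max_turns=3):
    """Ask clarifying questions until the instruction is unambiguous,
    then return the dialogue and the action to execute."""
    dialogue = [instruction]
    for _ in range(max_turns):
        latest = dialogue[-1]
        # Toy ambiguity check: vague words trigger a clarification turn.
        if "something" not in latest and "it" not in latest.split():
            break  # instruction judged unambiguous; stop asking
        # Ask the user and append their answer as the refined instruction.
        dialogue.append(answer_fn("Which object do you mean?"))
    return {"dialogue": dialogue, "action": "execute:" + dialogue[-1]}
```

A vague instruction like "pick up something" triggers one clarification turn; a specific one like "open the door" is executed immediately.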


Lynx: Towards High-Fidelity Personalized Video Generation

Published at 2025-09-18

#ML

The authors propose Lynx, a model for creating personalized videos from a single image using a high-quality video synthesis technique. It uses two adapters to ensure the video accurately represents the input image, resulting in better face resemblance, prompt following, and overall video quality compared to existing methods....


RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes

Published at 2025-09-18

#ML

The authors present a new method for optimizing camera parameters in dynamic scenes using only a single RGB video, which is more accurate and efficient than the widely used COLMAP pipeline, which requires additional information such as ground-truth motion masks. The method combines three components, Patch-wise Tracking Filters, Outlier-aware Joint Optimization, and a Two-Stage Optimization Strategy, to improve the stability, speed, and accuracy of camera parameter optimization in dynamic scenes....


SPATIALGEN: Layout-guided 3D Indoor Scene Generation

Published at 2025-09-18

#ML

The authors present SpatialGen, a model that creates realistic and consistent 3D indoor scenes using a 3D layout and a reference image. They also share a large, high-quality dataset of 12,328 indoor scenes to improve and advance the field of indoor scene generation....


A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

Published at 2025-09-19

#ML

The authors present a model called VLAC that improves robotic real-world reinforcement learning by using vision-language data and robot/human trajectory data to create a general process reward model. This model allows for efficient exploration, eliminates the need for task-specific reward engineering, and significantly increases success rates in various manipulation tasks with the help of a human-in-the-loop protocol....


BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Published at 2025-09-19

#ML

The research presents a new framework called BTL that mimics human cognitive processes when interacting with graphical interfaces. This framework is composed of three phases: Blink (detecting relevant screen areas), Think (higher-level reasoning), and Link (generating executable commands). Two innovations are introduced for the BTL framework: Blink Data Generation and BTL Reward. A GUI agent model, BTL-UI, is developed based on this framework and demonstrates top performance in various benchmark...
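The three-phase Blink-Think-Link pipeline can be sketched as a chain of functions. Everything below is an assumption-laden toy: the region format, the keyword match standing in for the Think phase's reasoning, and the command dictionary are all illustrative, not the paper's actual interfaces.

```python
# Hedged sketch of a Blink-Think-Link-style GUI loop. All data formats
# and the matching logic are illustrative assumptions.

def blink(screen):
    """Blink: filter the screen down to task-relevant regions."""
    return [r for r in screen if r.get("relevant")]

def think(regions, instruction):
    """Think: pick a target region; a keyword match stands in for the
    model's higher-level reasoning."""
    for r in regions:
        if instruction in r["label"]:
            return r
    return None

def link(region):
    """Link: emit an executable command for the chosen region."""
    if region is None:
        return None
    return {"action": "click", "target": region["label"]}
```

For example, `link(think(blink(screen), "submit"))` yields a click command on the matching region, mirroring the Blink → Think → Link ordering described above.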


BaseReward: A Strong Baseline for Multimodal Reward Model

Published at 2025-09-19

#ML

The paper presents BaseReward, a powerful and efficient baseline for multimodal reward modeling, built on a Qwen2.5-VL backbone with an optimized two-layer reward head. BaseReward outperforms previous models on major benchmarks and enhances MLLMs' performance in real-world tasks, offering a clear, empirically-backed guide for developing robust reward models....
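The "optimized two-layer reward head" mentioned above maps a pooled feature vector from the backbone to a scalar score. As a minimal sketch, assuming a linear-ReLU-linear shape (the actual hidden size, activation, and pooling are not specified in this summary), the head can be written in plain Python:

```python
# Minimal sketch of a two-layer reward head. The Qwen2.5-VL backbone is
# replaced by an arbitrary feature vector; layer shapes and the ReLU
# activation are assumptions, not BaseReward's published configuration.

def reward_head(features, w1, b1, w2, b2):
    """Map a pooled multimodal feature vector to a scalar reward.

    w1: list of weight rows (hidden_dim x feat_dim), b1: hidden biases,
    w2: output weights (hidden_dim), b2: scalar output bias.
    """
    # Layer 1: linear projection followed by ReLU.
    hidden = [max(0.0, sum(f * w for f, w in zip(features, row)) + b)
              for row, b in zip(w1, b1)]
    # Layer 2: linear projection down to a single scalar score.
    return sum(h * w for h, w in zip(hidden, w2)) + b2
```

In training, such a head would be fit with a preference loss (e.g. Bradley-Terry) over chosen/rejected response pairs; that part is omitted here.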


Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Published at 2025-09-19

#ML

The authors propose a new model called Latent Zoning Network (LZN) that integrates generative modeling, representation learning, and classification by using a shared Gaussian latent space. LZN has shown promising results in improving image generation, achieving state-of-the-art performance in representation learning, and successfully performing joint generation and classification tasks....


MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Published at 2025-09-19

#ML

The authors present Manzano, a new framework that improves the performance of models that can understand and generate visual content. By using a hybrid image tokenizer and a special training recipe, Manzano achieves state-of-the-art results, outperforming other unified models and competing with specialist models, especially in text-rich evaluations....


RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Published at 2025-09-19

#ML

The study presents RPG, a new method for planning and generating large software repositories, which uses a graph to represent the repository's structure, functions, and data flows. The researchers also developed ZeroRepo, a framework based on RPG, which generates repositories from scratch and outperforms other methods in terms of size, functionality, and accuracy....
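A graph over planned files, functions, and data flows naturally yields a generation order: producers must be generated before their consumers. The sketch below is an assumption about how such a planning graph might be structured; the node fields and method names are invented for illustration and are not ZeroRepo's API.

```python
# Illustrative sketch of a repository planning graph: nodes hold planned
# units (files/functions), edges record data flow between them. All
# names and fields here are assumptions based on the summary above.

class RepoPlanningGraph:
    def __init__(self):
        self.nodes = {}   # name -> {"kind": ..., "spec": ...}
        self.edges = []   # (producer, consumer) data-flow pairs

    def add_node(self, name, kind, spec=""):
        self.nodes[name] = {"kind": kind, "spec": spec}

    def add_flow(self, producer, consumer):
        self.edges.append((producer, consumer))

    def generation_order(self):
        """Topological order: generate producers before consumers."""
        indegree = {n: 0 for n in self.nodes}
        for _, consumer in self.edges:
            indegree[consumer] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        order = []
        while ready:
            node = ready.pop(0)
            order.append(node)
            for producer, consumer in self.edges:
                if producer == node:
                    indegree[consumer] -= 1
                    if indegree[consumer] == 0:
                        ready.append(consumer)
        return order
```

Planning the graph first and then walking it in topological order is what lets generation scale to whole repositories instead of isolated files.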


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
