🤗 Daily Paper Newsletter

Hope you find some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Beyond Transcription: Mechanistic Interpretability in ASR
Published at 2025-08-21
#ML

The study explores how well-known interpretability methods, such as the logit lens, linear probing, and activation patching, can be used to understand automatic speech recognition (ASR) systems. Applying these techniques, the researchers uncovered new insights into the internal workings of ASR models, including the specific interactions that cause repetition errors and the semantic biases present in acoustic representations, which could help make ASR systems more transparent and reliable.
Read More

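The logit lens mentioned above has a simple core idea: project an intermediate hidden state through the model's output head to see which token the model is already leaning toward at that layer. A toy sketch, with all shapes and values hypothetical (a real ASR transformer would apply this to each layer's residual stream):

```python
import numpy as np

# Tiny toy "model": 4 hidden dims, 3 vocab tokens (all values hypothetical)
W_out = np.array([[ 1.0, -1.0,  0.0],
                  [ 0.0,  1.0, -1.0],
                  [-1.0,  0.0,  1.0],
                  [ 1.0,  1.0,  1.0]])

def logit_lens(h, W_out):
    """Project an intermediate hidden state through the output head and
    softmax the result, revealing the model's current token preference."""
    logits = h @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# A mid-layer hidden state that already points along token 2's direction
h = np.array([0.1, -0.9, 1.2, 0.8])
probs = logit_lens(h, W_out)
print(int(np.argmax(probs)))  # 2
```

Running the lens at successive layers shows where in the network a prediction first emerges, which is the kind of evidence the paper uses to localize repetition errors.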
Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation
Published at 2025-08-25
#ML

The study presents a large-scale, diverse video dataset for monitoring health metrics such as heart rate and stress levels using remote photoplethysmography (rPPG). The dataset, which includes synchronized multi-view videos of 600 subjects recorded under varied conditions, can help improve AI medical assistants by providing a more comprehensive and realistic basis for training and testing rPPG models.
Read More

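rPPG recovers a pulse signal from subtle color changes in skin pixels. The dataset's actual pipeline is not described in this summary, but the final step usually looks like the following sketch: find the dominant spectral peak of a (here synthetic) mean green-channel trace within the physiologically plausible band:

```python
import numpy as np

fps = 30.0                      # hypothetical camera frame rate
t = np.arange(0, 20, 1 / fps)   # 20 s of frames

# Synthetic green-channel trace: a 1.2 Hz pulse (72 bpm) plus noise
rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.normal(size=t.size)

def estimate_bpm(signal, fps, lo=0.7, hi=4.0):
    """Estimate heart rate from the dominant spectral peak in the
    plausible heart-rate band (0.7-4 Hz, i.e. 42-240 bpm)."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1 / fps)
    band = (freqs >= lo) & (freqs <= hi)
    peak = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak

print(round(estimate_bpm(signal, fps)))  # 72
```

Real footage adds the hard parts the dataset targets: motion, lighting changes, and viewpoint, which is why multi-view, varied-condition recordings matter for training.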
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Published at 2025-08-25
#ML

The SEAM benchmark tests whether vision-language models understand information equally well across text and images, using standardized notations in four domains. The results show that models often struggle more with vision than with text even when the underlying information is the same, and that errors usually stem from misreading the text or hallucinating content that is not there.
Read More

MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation
Published at 2025-08-26
#ML

The authors present a new framework for creating interactive digital humans that respond to various input signals in real time with low latency and high efficiency. They train their system with a large language model on a large-scale dialogue dataset; it accepts multimodal condition encodings such as audio, pose, and text, and generates coherent representations that guide the denoising process of a diffusion head.
Read More

Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents
Published at 2025-08-26
#ML

This study evaluates the privacy awareness of smartphone agents powered by multimodal large language models, finding that most agents have low privacy awareness and that their privacy-detection capability correlates with the sensitivity level of the scenario. The research aims to encourage the community to reconsider the balance between utility and privacy for smartphone agents.
Read More

MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
Published at 2025-08-26
#ML

The authors present a new method called TAPO, which improves alignment between text descriptions and motion semantics, and a high-speed generation framework named MotionFLUX, which enables real-time synthesis by reducing the need for multi-step sampling. Together, the two techniques form a unified system that outperforms current methods in semantic consistency, motion quality, and generation speed.
Read More

Predicting the Order of Upcoming Tokens Improves Language Modeling
Published at 2025-08-26
#ML

The study introduces a new method called Token Order Prediction (TOP), an auxiliary objective that improves language modeling. TOP trains models to predict the order of upcoming tokens according to their proximity, requires fewer resources than previous auxiliary objectives, and consistently outperforms other methods on standard NLP benchmarks.
Read More

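The summary says TOP ranks upcoming tokens by proximity. The paper's exact target construction and loss may differ, but a hypothetical sketch of such training targets could look like this: for each position, score every vocabulary token by how soon it next appears within a lookahead window, with nearer occurrences scoring higher and absent tokens scoring zero:

```python
def top_targets(tokens, vocab_size, window=4):
    """Build proximity-ranked targets for a hypothetical TOP-style
    auxiliary head: at each position, score each vocab token by how
    soon it next appears within `window` steps (sooner -> higher)."""
    targets = []
    for i in range(len(tokens)):
        scores = [0] * vocab_size
        upcoming = tokens[i + 1 : i + 1 + window]
        for dist, tok in enumerate(upcoming):
            if scores[tok] == 0:             # keep the nearest occurrence only
                scores[tok] = window - dist  # nearer tokens get larger scores
        targets.append(scores)
    return targets

seq = [2, 0, 1, 0, 3]
tgt = top_targets(seq, vocab_size=4)
print(tgt[0])  # [4, 3, 0, 1]: token 0 is next, then 1, then 3; 2 never recurs
```

An auxiliary head would then be trained to regress or rank these scores alongside the standard next-token loss, which is cheaper than adding full extra prediction heads per future position.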
StepWiser: Stepwise Generative Judges for Wiser Reasoning
Published at 2025-08-26
#ML

The paper proposes StepWiser, a model that evaluates the reasoning process of other models by acting as a 'judge' that reasons in a similar stepwise manner, providing better and more explanatory feedback than current methods. This improves the performance of the policy model during training and enhances search capabilities at inference time.
Read More

AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Published at 2025-08-27
#ML

The authors present AudioStory, a system that combines large language models with text-to-audio generation to create long, coherent audio stories. AudioStory can break complex narratives into smaller parts, maintain consistent emotion, and train all of its components together to produce high-quality audio, outperforming previous methods.
Read More

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
Published at 2025-08-27
#ML

The authors present CODA, a new and adaptable framework for autonomous agents on graphical user interfaces, designed for scientific computing. CODA combines a generalist planner with a specialist executor, allowing for both long-term planning and precise execution, and outperforms existing models across scientific applications.
Read More

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
Published at 2025-08-27
#ML

DeepScholar-bench is a live benchmark and evaluation framework for generative research synthesis, focused on retrieving, synthesizing, and citing prior research for the related-work sections of papers. The framework assesses knowledge synthesis, retrieval quality, and verifiability; DeepScholar-base sets a strong baseline, but no system yet achieves high scores, underscoring the difficulty and importance of the task.
Read More

Diffusion Language Models Know the Answer Before Decoding
Published at 2025-08-27
#ML

This study introduces Prophet, a method that accelerates decoding in diffusion language models (DLMs) by up to 3.4x without sacrificing quality. Prophet works by deciding when it is safe to stop sampling, exploiting the fact that DLMs often identify the correct answer early in the decoding process.
Read More

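Prophet's actual stopping criterion is not spelled out in this summary; one plausible sketch of "knowing the answer early" is an early-commit rule that fixes a position's token as soon as its top-1/top-2 logit gap clears a threshold, instead of running every refinement step (threshold and logit values below are hypothetical):

```python
import numpy as np

def decode_with_early_commit(logits_per_step, gap_threshold=4.0):
    """Given per-refinement-step logits for one position, commit to the
    argmax as soon as the top-1/top-2 logit gap clears the threshold,
    rather than running all remaining refinement steps."""
    for step, logits in enumerate(logits_per_step):
        top2 = np.sort(logits)[-2:]
        if top2[1] - top2[0] >= gap_threshold:
            return int(np.argmax(logits)), step   # early commit
    return int(np.argmax(logits_per_step[-1])), len(logits_per_step) - 1

# Toy trajectory: the answer token (index 2) separates decisively by step 1
steps = [np.array([1.0, 0.9, 1.2, 0.8]),
         np.array([0.5, 0.2, 5.1, 0.3]),
         np.array([0.1, 0.0, 9.0, 0.2])]
token, used = decode_with_early_commit(steps)
print(token, used)  # 2 1
```

Here two of three refinement steps suffice; if most positions commit early, the remaining steps (and their forward passes) can be skipped, which is where a multi-x speedup would come from.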
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Published at 2025-08-27
#ML

The study presents Discrete Diffusion VLA, a unified transformer policy that uses discrete diffusion to model action chunks in vision-language-action (VLA) models, improving consistency and enabling robust error correction. The method outperforms autoregressive and continuous-diffusion baselines while supporting precise action modeling and consistent training.
Read More

Self-Rewarding Vision-Language Model via Reasoning Decomposition
Published at 2025-08-27
#ML

The study presents Vision-SR1, a self-rewarding method that enhances visual reasoning in vision-language models (VLMs) without relying on external visual supervision. It decomposes VLM reasoning into a visual-perception stage and a language-reasoning stage, using self-generated perceptions to create a balanced training signal that reduces visual hallucinations and language shortcuts.
Read More

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Published at 2025-08-27
#ML

This study presents HeteroScale, a system for efficiently managing resources when serving Large Language Models on GPUs. HeteroScale addresses the challenges of modern Prefill-Decode architectures by coordinating autoscaling across heterogeneous hardware and network resources, increasing GPU utilization and reducing GPU-hours in a large-scale production environment.
Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full papers are translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the Developer's Social Media