🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter delivers you the curated list of papers by 🤗 Daily Papers. |
|
Reinventing Clinical Dialogue: Agentic Paradigms for LLM Enabled Healthcare Communication |
Published at 2025-12-01 |
|
#ML
|
The paper proposes a new way of using Large Language Models (LLMs) in healthcare communication, moving from reactive and stateless processing to agentic autonomy. This new approach focuses on the model's ability to reason, plan, and remember, which helps balance creativity and reliability in clinical dialogue. The authors introduce a novel taxonomy to categorize methods into four archetypes, each with distinct architectural choices to ensure both autonomy and safety.... |
Read More |
|
Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules |
Published at 2025-12-02 |
|
#ML
|
The authors propose SchED, a method to speed up diffusion language models without training, which works by stopping the decoding process once a certain confidence level is reached. SchED was tested on various tasks and models, resulting in significant speedups (up to 4.0 times) while maintaining high performance.... |
Read More |
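The early-stopping idea in this summary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the names `linear_threshold` and `decode_with_early_exit`, the linear shape of the schedule, and the use of mean per-token confidence as the stopping signal are all hypothetical.

```python
from typing import Callable, List


def linear_threshold(progress: float, start: float = 0.95, end: float = 0.70) -> float:
    """Hypothetical schedule: required confidence relaxes linearly as progress goes 0 -> 1."""
    return start + (end - start) * progress


def decode_with_early_exit(
    step_fn: Callable[[int], List[float]],
    max_steps: int,
    threshold_fn: Callable[[float], float] = linear_threshold,
) -> int:
    """Run decoding steps until mean confidence meets the scheduled threshold.

    step_fn(step) stands in for one denoising step of a model and returns
    per-token confidences (assumed here to be max token probabilities).
    Returns the number of steps actually executed.
    """
    for step in range(1, max_steps + 1):
        confidences = step_fn(step)
        mean_conf = sum(confidences) / len(confidences)
        # Stop as soon as the aggregate confidence clears the bar for this progress level.
        if mean_conf >= threshold_fn(step / max_steps):
            return step
    return max_steps


def toy_step(step: int) -> List[float]:
    """Toy stand-in for a model: confidence grows by 0.1 per step, capped at 1.0."""
    return [min(1.0, 0.5 + 0.1 * step)] * 4


steps_used = decode_with_early_exit(toy_step, max_steps=10)
```

With the toy model above, decoding stops well before the step budget is exhausted; a real system would replace `toy_step` with an actual denoising step.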
|
EtCon: Edit-then-Consolidate for Reliable Knowledge Editing |
Published at 2025-12-04 |
|
#ML
|
The study presents a new method called Edit-then-Consolidate to improve real-world knowledge editing in large language models. It addresses overfitting and insufficient knowledge integration issues, enhancing editing reliability and generalization while preserving pre-trained capabilities.... |
Read More |
|
Smart Timing for Mining: A Deep Learning Framework for Bitcoin Hardware ROI Prediction |
Published at 2025-12-04 |
|
#ML
|
The study presents MineROI-Net, a Transformer-based model for predicting the profitability of Bitcoin mining hardware within a year of purchase. Tested on data from various ASIC miners, MineROI-Net outperforms other models, offering a practical tool for reducing financial risk in mining operations by helping to determine the best time to acquire hardware.... |
Read More |
|
VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory |
Published at 2025-12-04 |
|
#ML
|
The authors present a new method for generating long videos by combining a technique called autoregressive diffusion with a hybrid memory system. This system, called VideoSSM, maintains a global memory of scene dynamics and a local memory for motion cues and fine details, ensuring consistent and interactive video generation without repetitive patterns, even for minute-scale horizons.... |
Read More |
|
TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression |
Published at 2025-12-05 |
|
#ML
|
The study presents TED-4DGS, a new method that efficiently compresses dynamic 3D scenes by combining temporal control and deformation schemes. This approach outperforms existing techniques on real-world datasets by using learnable parameters, a shared deformation bank, and an implicit neural representation-based hyperprior for rate-distortion compression.... |
Read More |
|
Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS |
Published at 2025-12-08 |
|
#ML
|
This study presents a new framework for text-to-speech systems that balances the need for high-quality pronunciation with the requirement for real-time performance. By using a service-oriented architecture and lightweight strategies, the proposed system separates context-aware components from the core TTS engine, reducing latency and enabling real-time use of advanced phonemization models without sacrificing accuracy.... |
Read More |
|
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models |
Published at 2025-12-08 |
|
#ML
|
The authors observe that function words can make vision-language models vulnerable to attacks and propose a method called Function-word De-Attention (FDA) to reduce this impact. FDA calculates the original and function-word cross-attention within attention heads and subtracts the latter from the former, resulting in more aligned and robust models. Experiments show that FDA significantly reduces attack success rates while maintaining or slightly improving performance on various tasks, datasets, a... |
Read More |
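The subtraction described in this summary can be sketched for a single attention head. This is only one possible reading, under stated assumptions: the function names, the `alpha` scaling factor, and the choice to subtract the function-word share of the softmax weights without renormalizing are hypothetical, and the paper's exact formulation may differ.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def de_attend(scores: List[float], is_function_word: List[bool], alpha: float = 1.0) -> List[float]:
    """Subtract the function-word portion of cross-attention from the original weights.

    scores: raw cross-attention scores from one query to each text token.
    is_function_word: True where the text token is a function word ("the", "of", ...).
    alpha: hypothetical strength of the subtraction (1.0 removes function words entirely).
    """
    original = softmax(scores)
    # Attention mass that falls on function-word tokens; zero elsewhere.
    function_part = [w if flag else 0.0 for w, flag in zip(original, is_function_word)]
    return [w - alpha * fp for w, fp in zip(original, function_part)]


# Tokens: "the", "dog", "runs", "of" with equal raw scores.
weights = de_attend([1.0, 1.0, 1.0, 1.0], [True, False, False, True])
```

In this sketch the content words keep their original weights while the function words are suppressed; the result is deliberately left unnormalized to keep the subtraction visible.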
|
BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain |
Published at 2025-12-09 |
|
#ML
|
The authors have developed a large-scale, automated framework called BrainExplore to discover and explain visual representations in the human cortex. This method uses unsupervised decomposition to find interpretable patterns in fMRI activity and generates natural-language descriptions for each pattern, revealing thousands of visual concepts, some of which have not been reported before.... |
Read More |
|
GimbalDiffusion: Gravity-Aware Camera Control for Video Generation |
Published at 2025-12-09 |
|
#ML
|
The authors present a new framework called GimbalDiffusion that allows for precise and interpretable control over camera motion and orientation in text-to-video generation by using gravity as a global reference and defining camera trajectories in an absolute coordinate system. They also introduce null-pitch conditioning to improve camera guidance and establish a benchmark for camera-aware video generation, thereby enhancing the controllability and robustness of text-to-video models.... |
Read More |
|
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models |
Published at 2025-12-09 |
|
#ML
|
The authors present InfiniteVL, a new vision-language model that combines sliding window attention and Gated DeltaNet to improve performance on information-intensive tasks and handle longer sequences compared to existing models, all while being resource-efficient and maintaining a small memory footprint.... |
Read More |
|
Learning Unmasking Policies for Diffusion Language Models |
Published at 2025-12-09 |
|
#ML
|
This study presents a method to improve the efficiency and quality of diffusion language models by learning their unmasking (sampling) policies with reinforcement learning. The learned policies outperform existing heuristic strategies and generalize to new models and longer sequences, but may struggle with out-of-domain data.... |
Read More |
|
OmniPSD: Layered PSD Generation with Diffusion Transformer |
Published at 2025-12-09 |
|
#ML
|
The authors present a new method called OmniPSD that uses a diffusion model to create or separate layered PSD files with transparent backgrounds. This technique can generate PSD files from text descriptions and decompose single images into editable layers with the help of an additional tool called RGBA-VAE.... |
Read More |
|
Towards a Science of Scaling Agent Systems |
Published at 2025-12-09 |
|
#ML
|
This study explores the performance of AI agent systems, focusing on scaling principles and their impact on various tasks. The research identifies three key effects: a trade-off between tools and coordination, capability saturation, and error amplification based on coordination strategy. The findings provide guidance on selecting the best coordination strategy for specific tasks, improving performance in some cases by up to 80.9%.... |
Read More |
|
WonderZoom: Multi-Scale 3D World Generation |
Published at 2025-12-09 |
|
#ML
|
The authors propose WonderZoom, a method for creating 3D scenes with content at multiple scales from a single image. It uses specialized 3D shapes and a detailed content creator so that users can zoom in and see more detail, and it outperforms existing methods in tests.... |
Read More |
|
Composing Concepts from Images and Videos via Concept-prompt Binding |
Published at 2025-12-10 |
|
#ML
|
The study presents a new method called Bind & Compose that improves visual concept composition by combining images and videos. It uses a hierarchical structure to accurately break down complex visual concepts and a mechanism to enhance compatibility between image and video concepts, resulting in better consistency, fidelity, and motion quality than existing approaches.... |
Read More |
|
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models |
Published at 2025-12-10 |
|
#ML
|
The researchers present a new framework called HiF-VLA that uses motion to improve long-term decision making in robotic manipulation tasks. By incorporating past and future motion information, HiF-VLA outperforms existing methods in both benchmark tests and real-world applications with minimal additional processing time.... |
Read More |
|
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting |
Published at 2025-12-10 |
|
#ML
|
The study presents IF-Bench, a new benchmark for evaluating the understanding of infrared images by multimodal large language models. The researchers assess over 40 models using this benchmark and introduce a method called GenViP to improve infrared image comprehension, which significantly boosts performance across various models.... |
Read More |
|
Rethinking Chain-of-Thought Reasoning for Videos |
Published at 2025-12-10 |
|
#ML
|
This study explores whether shorter reasoning processes and fewer visual tokens can be as effective as longer, human-like reasoning chains and large numbers of visual tokens for video understanding in large language models. The researchers develop and test a new method that allows models to use compressed visual tokens and generate brief reasoning paths, resulting in faster and more efficient models that perform competitively on various benchmarks without relying on manual or supervised training... |
Read More |
|
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation |
Published at 2025-12-10 |
|
#ML
|
The authors present StereoWorld, a framework that generates high-quality stereo videos from monocular videos, making stereo content more affordable to produce and reducing artifacts. They also built a large dataset of high-definition stereo videos to train and test their method, which outperforms existing techniques in producing realistic and geometrically accurate stereo videos.... |
Read More |
|
UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving |
Published at 2025-12-10 |
|
#ML
|
The authors address the challenges of autonomous driving in complex scenarios by creating specialized datasets with reasoning and planning annotations. They then present UniUGP, a framework that combines scene reasoning, future video generation, and trajectory planning on top of pre-trained models, resulting in superior generalization to difficult driving situations.... |
Read More |
|
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |