🤗 Daily Paper Newsletter |
 |
Hope you find some gems! |
This newsletter delivers a curated list of papers from 🤗 Daily Papers. |
|
|
|
|
|
|
|
|
Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in {±1, ±i} |
Published at 2025-12-02 |
|
#ML
|
The authors present Fairy2i, a framework that converts pre-trained real-valued language models into complex-valued ones whose parameters all lie in {±1, ±i}, reducing memory usage without sacrificing performance. The conversion enables more efficient inference on regular hardware while preserving the model's accuracy.... |
Read More |
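To make the {±1, ±i} constraint concrete, here is a small illustrative sketch (not Fairy2i's actual conversion procedure) that projects arbitrary complex weights onto the nearest element of that four-value alphabet:

```python
# Hypothetical illustration only, not Fairy2i's conversion procedure:
# project arbitrary complex weights onto the nearest element of {+1, -1, +i, -i}.
import numpy as np

CODEBOOK = np.array([1, -1, 1j, -1j], dtype=np.complex64)

def project_to_codebook(w: np.ndarray) -> np.ndarray:
    """Replace each complex weight with the nearest codeword in {+1, -1, +i, -i}."""
    dists = np.abs(w[..., None] - CODEBOOK)        # (..., 4) distances to codewords
    return CODEBOOK[np.argmin(dists, axis=-1)]

rng = np.random.default_rng(0)
w = (rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))).astype(np.complex64)
print(project_to_codebook(w))
```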
|
|
|
MeshSplatting: Differentiable Rendering with Opaque Meshes |
Published at 2025-12-07 |
|
#ML
|
The authors present MeshSplatting, a method that combines mesh-based reconstruction and differentiable rendering to create high-quality, real-time renderable meshes, improving upon existing techniques in terms of efficiency and visual quality.... |
Read More |
|
|
|
|
Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge |
Published at 2025-12-07 |
|
#ML
|
The authors describe their winning vision-language-action policy for the 2025 BEHAVIOR Challenge, which involves completing diverse household tasks in a simulated environment. They improve the Pi0.5 architecture with innovations like correlated noise for flow matching, learnable mixed-layer attention, and System 2 stage tracking, resulting in a 26% q-score across all tasks.... |
Read More |
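One of the innovations mentioned above is correlated noise for flow matching. As a hedged sketch of what temporally correlated noise over an action chunk can look like (the RBF kernel and its length scale are assumptions, not the authors' recipe):

```python
# Hypothetical sketch of temporally correlated noise for an action chunk,
# e.g. as the source distribution in flow matching. The RBF kernel and its
# length scale are assumptions, not the authors' recipe.
import numpy as np

def correlated_noise(horizon: int, action_dim: int, length_scale: float = 3.0, seed: int = 0):
    """Gaussian noise of shape (horizon, action_dim), correlated along the time axis."""
    rng = np.random.default_rng(seed)
    t = np.arange(horizon)
    cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / length_scale) ** 2)
    cov += 1e-6 * np.eye(horizon)                  # jitter for numerical stability
    chol = np.linalg.cholesky(cov)
    white = rng.standard_normal((horizon, action_dim))
    return chol @ white                            # nearby timesteps get similar noise

print(correlated_noise(8, 4).shape)                # (8, 4)
```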
|
|
|
EgoX: Egocentric Video Generation from a Single Exocentric Video |
Published at 2025-12-09 |
|
#ML
|
The authors present EgoX, a new method to convert third-person videos into first-person perspective ones. EgoX uses a pre-trained video model, combines it with new techniques to maintain geometry and detail, and generates realistic first-person videos from a single third-person input.... |
Read More |
|
|
|
|
Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems |
Published at 2025-12-11 |
|
#ML
|
The authors present Causal Judge Evaluation (CJE), a framework that addresses uncalibrated scores, inverted preferences, and collapsed importance-weighted estimators in the evaluation of LLM systems. Through three components, CJE achieves high ranking accuracy and calibration at lower cost, while also improving confidence-interval coverage compared to traditional methods.... |
Read More |
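The framework revolves around calibrating judge scores against observed outcomes. A generic, hypothetical sketch of score calibration using isotonic regression on a small labelled slice (which may differ from CJE's actual components):

```python
# Generic calibration sketch, not CJE itself: fit a monotone map from raw judge
# scores to observed outcomes on a small labelled slice, then apply it to the rest.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw_scores = rng.uniform(0, 1, size=200)                     # uncalibrated judge scores
outcomes = (rng.uniform(0, 1, size=200) < raw_scores ** 2).astype(float)  # toy labels

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores[:100], outcomes[:100])             # small labelled slice
calibrated = calibrator.predict(raw_scores[100:])            # calibrated surrogate scores
print(calibrated[:5])
```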
|
|
|
CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images |
Published at 2025-12-11 |
|
#ML
|
The authors estimate uncertainty for anatomical landmark-based segmentation in chest X-rays using a hybrid neural network architecture. They introduce two uncertainty measures and release a large-scale dataset, CheXmask-U, to help researchers improve the robustness and safe deployment of landmark-based segmentation methods in chest X-ray imaging.... |
Read More |
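As a rough illustration of how a per-landmark uncertainty measure can be computed (a generic sketch, not necessarily the measures proposed in the paper), one can look at the spread of predicted landmark positions across stochastic forward passes:

```python
# Generic sketch of a landmark uncertainty estimate, not necessarily the paper's
# measures: the spread of predicted (x, y) positions across stochastic forward
# passes (e.g. MC-dropout or an ensemble).
import numpy as np

def landmark_uncertainty(samples: np.ndarray) -> np.ndarray:
    """samples: (n_passes, n_landmarks, 2) predicted coordinates.
    Returns one value per landmark: mean distance to that landmark's centroid."""
    centroid = samples.mean(axis=0, keepdims=True)            # (1, n_landmarks, 2)
    return np.linalg.norm(samples - centroid, axis=-1).mean(axis=0)

rng = np.random.default_rng(0)
samples = rng.normal(loc=[[100, 50], [120, 80], [90, 130], [60, 70]],
                     scale=2.0, size=(8, 4, 2))               # 8 passes, 4 landmarks
print(landmark_uncertainty(samples))                          # higher = less certain
```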
|
|
|
|
Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching |
Published at 2025-12-11 |
|
#ML
|
The authors present Fast-FoundationStereo, a new architecture that provides strong generalization for real-time stereo matching, overcoming the trade-off between speed and robustness in existing models. They use techniques like knowledge distillation, blockwise neural architecture search, and structured pruning to achieve this, resulting in a model that is 10 times faster than the previous state-of-the-art while maintaining similar accuracy.... |
Read More |
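Of the three techniques named above, knowledge distillation is the easiest to sketch. A generic output-level distillation loss for disparity maps might look like this (an illustration under assumed inputs, not the paper's objective):

```python
# Illustrative output-level distillation loss for stereo, not the paper's objective:
# the student's disparity map is pulled toward the teacher's over valid pixels.
import torch

def distillation_loss(student_disp: torch.Tensor,
                      teacher_disp: torch.Tensor,
                      valid: torch.Tensor) -> torch.Tensor:
    """Masked L1 distance between student and teacher disparity maps."""
    diff = torch.abs(student_disp - teacher_disp)
    return (diff * valid).sum() / valid.sum().clamp(min=1)

student = torch.rand(1, 1, 64, 64)      # stand-in for the fast student's prediction
teacher = torch.rand(1, 1, 64, 64)      # stand-in for the foundation teacher's prediction
valid = torch.ones_like(teacher)        # pixels with usable teacher disparity
print(distillation_loss(student, teacher, valid))
```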
|
|
|
LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator |
Published at 2025-12-11 |
|
#ML
|
The LEO-RobotAgent is a versatile and efficient framework that allows language models to control various robots for complex tasks, enhancing human-robot interaction and task planning with ease of adaptation to different robot platforms.... |
Read More |
|
|
|
|
PersonaLive! Expressive Portrait Image Animation for Live Streaming |
Published at 2025-12-11 |
|
#ML
|
This study presents PersonaLive, a framework for real-time portrait animation in live streaming that improves on existing models by cutting generation latency. It combines hybrid signals for expressive motion control, a strategy for more efficient inference, and a new generation paradigm for low-latency, stable long-term video generation, yielding a significant speedup over prior models.... |
Read More |
|
|
|
Scaling Behavior of Discrete Diffusion Language Models |
Published at 2025-12-11 |
|
#ML
|
This study examines how discrete diffusion language models perform with different noise types, focusing on hyperparameters like batch size and learning rate. The results show that uniform diffusion requires fewer parameters and less data than masked diffusion for efficient training, making it a strong contender in data-limited scenarios.... |
Read More |
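For readers unfamiliar with the two noise types being compared, here is a toy sketch of the corruption step (the mask id and corruption probability are illustrative assumptions): masked diffusion overwrites tokens with a [MASK] id, while uniform diffusion resamples them from the vocabulary:

```python
# Toy sketch of the two corruption processes, not the paper's training code.
import numpy as np

def corrupt(tokens, t, vocab_size, mode="masked", mask_id=0, seed=0):
    """Corrupt each token with probability t, either by masking or by resampling."""
    rng = np.random.default_rng(seed)
    noisy = tokens.copy()
    hit = rng.random(tokens.shape) < t
    if mode == "masked":
        noisy[hit] = mask_id                                   # replace with [MASK]
    else:
        noisy[hit] = rng.integers(1, vocab_size, size=int(hit.sum()))  # uniform resample
    return noisy

toks = np.arange(1, 11)
print(corrupt(toks, t=0.5, vocab_size=100, mode="masked"))
print(corrupt(toks, t=0.5, vocab_size=100, mode="uniform"))
```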
|
|
|
|
Sharp Monocular View Synthesis in Less Than a Second |
Published at 2025-12-11 |
|
#ML
|
The authors propose a method called SHARP that creates realistic images of a scene from a single photograph in under a second using a neural network. This technique allows for real-time rendering of high-resolution, photorealistic images for nearby views and outperforms previous models in terms of image similarity and synthesis time.... |
Read More |
|
|
|
Sliding Window Attention Adaptation |
Published at 2025-12-11 |
|
#ML
|
This study presents Sliding Window Attention Adaptation (SWAA), a collection of methods for adapting Transformer-based Large Language Models (LLMs) to sliding window attention (SWA) for efficient long-context inference. The proposed techniques, such as applying SWA only during prefilling and interleaving FA/SWA layers, help recover the original long-context performance while balancing performance-efficiency trade-offs across scenarios.... |
Read More |
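Sliding window attention itself is easy to picture: each query attends only to a recent band of keys instead of the full causal prefix. A minimal sketch of the corresponding mask (illustrative only; the window size is an arbitrary choice):

```python
# Minimal illustration of a sliding-window (banded causal) attention mask.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """True where query i may attend to key j: causal and within `window` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Full causal attention keeps the whole lower triangle; SWA keeps only a band.
print(sliding_window_mask(6, 3).astype(int))
```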
|
|
|
|
CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare |
Published at 2025-12-12 |
|
#ML
|
The study presents CLINIC, a multilingual benchmark to evaluate the trustworthiness of language models in healthcare, focusing on truthfulness, fairness, safety, robustness, and privacy. The results show that language models struggle with factual correctness, exhibit bias, and are vulnerable to privacy breaches and attacks in various languages, emphasizing the need for improvements in global healthcare applications.... |
Read More |
|
|
|
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry |
Published at 2025-12-12 |
|
#ML
|
The study presents DentalGPT, a new dental multimodal large language model that improves upon existing models by incorporating high-quality domain knowledge and reinforcement learning. This model, trained on the largest annotated multimodal dental dataset to date, outperforms many state-of-the-art models in disease classification and dental VQA tasks, even with fewer parameters.... |
Read More |
|
|
|
|
Exploring MLLM-Diffusion Information Transfer with MetaCanvas |
Published at 2025-12-12 |
|
#ML
|
The study presents a new framework called MetaCanvas that enables multimodal large language models to directly reason and plan in spatial and spatiotemporal latent spaces for image and video generation. By implementing MetaCanvas on various diffusion backbones and evaluating it across multiple tasks, the researchers demonstrate its superior performance compared to global-conditioning baselines, suggesting a promising approach for improving the precision and control of multimodal generation.... |
Read More |
|
|
|
Particulate: Feed-Forward 3D Object Articulation |
Published at 2025-12-12 |
|
#ML
|
The authors describe a new method, Particulate, that uses a transformer network to quickly and accurately determine the structure and movement of 3D objects. This approach is faster than previous methods and can also work with synthetic 3D assets, making it useful for extracting detailed 3D models from images.... |
Read More |
|
|
|
|
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder |
Published at 2025-12-12 |
|
#ML
|
The study presents SVG-T2I, a method that enables high-quality text-to-image synthesis directly in the Visual Foundation Model (VFM) feature domain, without using a Variational Autoencoder. By utilizing a standard text-to-image diffusion pipeline, SVG-T2I demonstrates competitive performance and makes the project fully open-source to promote further research in representation-driven visual generation.... |
Read More |
|
|
|
Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation |
Published at 2025-12-12 |
|
#ML
|
The authors present a new method to improve the realism of video generation for objects like humans and animals by incorporating structure-preserving motion from an autoregressive video model into a diffusion model. Their approach, SAM2VideoX, outperforms previous models in benchmarks and human evaluations by using innovative techniques to capture and align global and local motion patterns.... |
Read More |
|
|
|
|
The N-Body Problem: Parallel Execution from Single-Person Egocentric Video |
Published at 2025-12-12 |
|
#ML
|
This study explores how a model can learn to parallelize complex activities by observing a single person in an egocentric video. The researchers propose a method to maximize efficiency while avoiding physically impossible scenarios, such as two people using the same object or occupying the same space, by using a Vision-Language Model to reason about the 3D environment, object usage, and temporal dependencies.... |
Read More |
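A tiny, hypothetical example of the kind of feasibility check described above (not the paper's VLM-based reasoning): two parallel steps conflict when their time intervals overlap and they need the same object:

```python
# Hypothetical sketch: reject a parallel assignment of steps to two agents when
# their time intervals overlap while using the same object.
from dataclasses import dataclass

@dataclass
class Step:
    obj: str      # object the step uses
    start: float  # start time
    end: float    # end time

def conflicts(a: Step, b: Step) -> bool:
    """Two parallel steps conflict if they need the same object at overlapping times."""
    overlap = a.start < b.end and b.start < a.end
    return overlap and a.obj == b.obj

agent1 = Step("pan", 0.0, 3.0)
agent2 = Step("pan", 2.0, 5.0)
print(conflicts(agent1, agent2))  # True: both need the pan during [2, 3)
```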
|
|
|
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties |
Published at 2025-12-12 |
|
#ML
|
The authors present V-RGBX, a new framework for editing videos with precise control over their intrinsic properties like color, texture, and lighting. This tool allows users to make changes to specific frames in a video, and those changes are consistently applied throughout the entire video in a realistic way, outperforming existing methods in tasks like object appearance editing and scene-level relighting.... |
Read More |
|
|
|
|
|
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |
Visit Developer's Social Media |
|
|
|
|
|
|