🤗 Daily Paper Newsletter

This newsletter delivers a curated list of papers from 🤗 Daily Papers. Hope you find some gems!

Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
Published at 2025-11-17

#ML

The authors propose a method that groups similar video frames into clusters to build more balanced and reliable training, validation, and test sets, reducing the risk of information leakage....
Read More

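The core idea above, splitting at cluster granularity rather than frame granularity so near-duplicate frames never land on both sides of a split, can be sketched in a few lines. This is an illustrative sketch, not the paper's actual pipeline; `cluster_of` is a hypothetical mapping from frame id to a precomputed cluster label (e.g., obtained by clustering frame embeddings):

```python
import random

def cluster_based_split(frame_ids, cluster_of, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters of near-duplicate frames to a single split,
    so similar frames never leak across train/val/test."""
    clusters = {}
    for f in frame_ids:
        clusters.setdefault(cluster_of[f], []).append(f)
    ids = sorted(clusters)
    random.Random(seed).shuffle(ids)
    cut1 = int(ratios[0] * len(ids))
    cut2 = cut1 + int(ratios[1] * len(ids))
    splits = {"train": [], "val": [], "test": []}
    for i, cid in enumerate(ids):
        name = "train" if i < cut1 else ("val" if i < cut2 else "test")
        splits[name].extend(clusters[cid])
    return splits
```

The key property is that the random assignment happens over cluster ids, not frame ids, so every frame in a cluster shares the same split.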
Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models
Published at 2025-11-17

#ML

The study presents a dual-backbone framework that detects rare and diverse anomalies in surveillance videos using only video-level supervision. By combining convolutional and transformer representations through top-k pooling, it reaches 90.7% AUC on the UCF-Crime dataset....
Read More

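The top-k pooling step mentioned above, reducing per-segment anomaly scores to a single video-level score so that weak video-level labels can supervise training, can be sketched as follows. This is a minimal illustration, not the paper's implementation:

```python
def topk_pool(segment_scores, k=3):
    """Aggregate per-segment anomaly scores into one video-level score by
    averaging the k largest, so a few highly anomalous segments are enough
    to flag a video under video-level-only (weak) supervision."""
    k = min(k, len(segment_scores))
    top = sorted(segment_scores, reverse=True)[:k]
    return sum(top) / k
```

During training, this pooled score would be compared against the video-level label, letting the gradient focus on the most suspicious segments.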
YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection
Published at 2025-11-17

#ML

The proposed method combines object detection with a Mixture-of-Experts framework, using adaptive routing among multiple YOLOv9-T experts to improve feature specialization and achieve higher mAP and AR than a single model....
Read More

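A minimal sketch of the adaptive-routing idea: a small gate produces per-expert logits for each input, and expert outputs are blended by the resulting softmax weights. Names and shapes here are illustrative assumptions, not the paper's architecture:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_experts(gate_logits, expert_features):
    """Soft adaptive routing: blend each expert's feature vector using
    input-dependent softmax weights from a gating network."""
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    return [sum(w * feats[j] for w, feats in zip(weights, expert_features))
            for j in range(dim)]
```

With equal gate logits this reduces to a plain average of the experts; as the gate learns, it can specialize experts to different input regimes.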
MRI Super-Resolution with Deep Learning: A Comprehensive Survey
Published at 2025-11-20

#ML

This study provides a detailed review of deep learning methods for enhancing the resolution of MRI scans, making them clearer for clinical use without needing expensive equipment. The authors categorize these methods, discuss their unique challenges and potential future developments, and provide open-access resources to facilitate further research....
Read More

SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Published at 2025-11-23

#ML

The study presents SO-Bench, a benchmark for evaluating the ability of multimodal large language models to process visual inputs and generate schema-grounded information. The benchmark covers four domains and reveals gaps in current models' structured output capabilities, suggesting a need for improved multimodal reasoning....
Read More

DiP: Taming Diffusion Models in Pixel Space
Published at 2025-11-24

#ML

The authors present DiP, a new method for efficient pixel space diffusion that resolves the trade-off between generation quality and computational efficiency. DiP separates the generation process into two stages: a main stage that uses a transformer to quickly create the overall structure, and a secondary stage that uses a lightweight model to add detailed features, resulting in faster inference and high-quality generation without relying on a VAE....
Read More

CaptionQA: Is Your Caption as Useful as the Image Itself?
Published at 2025-11-25

#ML

This study presents a new benchmark called CaptionQA to evaluate whether image captions can effectively replace images in various tasks. The benchmark covers four domains and includes 33,027 questions that require visual information to answer, helping to measure the utility of captions in real-world applications....
Read More

Layer-Aware Video Composition via Split-then-Merge
Published at 2025-11-25

#ML

The authors propose a new method called Split-then-Merge (StM) for creating videos with better control and less data. StM separates videos into moving and background parts, recombines them, and learns from unlabeled videos to generate realistic results, outperforming current methods....
Read More

OmniRefiner: Reinforcement-Guided Local Diffusion Refinement
Published at 2025-11-25

#ML

The authors present a new framework called OmniRefiner that improves the preservation of fine details in reference-guided image generation. A two-stage process first adapts a single-image diffusion editor and then applies reinforcement learning to enhance localized editing capability, resulting in more accurate and visually consistent edits than existing models....
Read More

OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Published at 2025-11-26

#ML

The authors present OralGPT-Omni, a dental-specialized model that can reliably handle a variety of dental images and analysis tasks. They also introduce MMOral-Uni, the first unified benchmark for dental image analysis, and demonstrate OralGPT-Omni's superior performance compared to GPT-5....
Read More

Adversarial Flow Models
Published at 2025-11-27

#ML

The study merges adversarial models and flow models to create a new type of generative model that generates data in one or a few steps. This model learns a direct mapping from noise to data, making training more stable and efficient, and it outperforms other models in image generation tasks....
Read More

Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Published at 2025-11-27

#ML

This study investigates the issue of conflicting targets in unified multimodal models for image generation and understanding, and proposes a new approach called Attention Interaction Alignment (AIA) loss. The AIA loss mitigates task conflicts without model decoupling by learning task-specific multimodal interaction patterns, and is shown to improve both generation and understanding performance across various models....
Read More

Captain Safari: A World Engine
Published at 2025-11-27

#ML

This study presents a new method called Captain Safari that creates long, 3D-consistent videos with stable structures and accurate camera movements by using a persistent world memory. The researchers also introduce a new dataset called OpenSafari to evaluate video generation models and show that Captain Safari outperforms other methods in video quality, 3D consistency, and trajectory following....
Read More

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
Published at 2025-11-27

#ML

This study reveals that in complex tasks like text-to-image generation, the key factor for few-step distillation is not matching the student's output with the teacher's, but a previously unnoticed component called CFG Augmentation. The researchers also found that another component, Distribution Matching, acts as a stabilizing force during training. By understanding these two aspects separately, the study proposes new methods to improve the distillation process and has already been applied in a t...
Read More

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Published at 2025-11-27

#ML

The study introduces DeepSeekMath-V2, a model that aims to improve AI's mathematical reasoning by focusing on self-verification and rigorous step-by-step derivation, which is particularly important for theorem proving and scaling test-time compute. The model is trained using an accurate and faithful verifier together with a proof generator; the two work jointly to identify and resolve issues in the proofs, resulting in strong theorem-proving capabilities and high scores on various mathematical c...
Read More

DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Published at 2025-11-27

#ML

The authors present DualVLA, a method that improves the action performance of a generalizable Vision-Language-Action model without losing its reasoning capabilities. They achieve this through a dual-layer data pruning method and a dual-teacher adaptive distillation strategy, resulting in a model that excels in both precise action execution and multimodal understanding....
Read More

Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
Published at 2025-11-27

#ML

The researchers present a method called Fast3Dcache that speeds up the generation of 3D shapes using diffusion models without sacrificing quality. This is achieved by introducing two new techniques, Predictive Caching Scheduler Constraint and Spatiotemporal Stability Criterion, which help maintain the accuracy of the generated shapes while significantly reducing computation time and resources....
Read More

FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
Published at 2025-11-27

#ML

The authors present a new framework called FedRE for federated learning with model-heterogeneous clients. FedRE uses entangled representations and normalized random weights to improve model performance, protect privacy, and reduce communication overhead....
Read More

Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information
Published at 2025-11-27

#ML

The study presents a new method called Focused Chain-of-Thought (F-CoT) that improves the efficiency of large language models in reasoning tasks. F-CoT separates information extraction from reasoning, reducing unnecessary token usage and inference time, all without requiring additional training....
Read More

From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Published at 2025-11-27

#ML

The study presents CogIP-Bench, a benchmark to evaluate multimodal large language models on their understanding of subjective image properties like memorability and emotional impact. The authors find a gap in current models' alignment with human perception and demonstrate that a post-training phase can improve this alignment, which can then be applied to create more appealing images in creative tasks....
Read More

REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Published at 2025-11-27

#ML

This study improves image editing models by utilizing a large language model's reasoning abilities, allowing it to better understand instructions and correct mistakes. The proposed framework uses a thinking-editing-reflection loop to enhance editing accuracy, resulting in significant performance gains compared to previous methods....
Read More

RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Published at 2025-11-27

#ML

RefineBench is a new test for language models that checks whether they can improve their answers on their own or with guidance. The test has 1,000 hard questions in 11 areas and found that state-of-the-art models struggle to self-refine but can do well with help, suggesting more progress is needed for self-refinement....
Read More

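A checklist-based evaluation like the one described can be sketched as a pass rate over predicates applied to an answer. The predicates below are hypothetical stand-ins, not RefineBench's actual criteria:

```python
def checklist_score(answer, checks):
    """Fraction of checklist criteria an answer satisfies; each criterion
    is a predicate (answer -> bool). Re-scoring an answer after a
    refinement round measures how much the model improved it."""
    return sum(1 for check in checks if check(answer)) / len(checks)
```

Comparing the score of an initial answer with the score after a (self- or guided-) refinement round gives a simple per-question refinement signal.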
Test-time scaling of diffusions with flow maps
Published at 2025-11-27

#ML

The study presents a new method called Flow Map Trajectory Tilting (FMTT) that improves the performance of diffusion models at test time. FMTT works directly with a flow map to produce better samples than other methods, and it can be used for exact sampling or for searching for the best sample under a user-specified reward. This approach enables more complex image editing techniques, such as interfacing with vision language models....
Read More

The Collapse of Patches
Published at 2025-11-27

#ML

The study explores how observing certain image patches reduces uncertainty in other patches, similar to wave function collapse in quantum mechanics. The authors developed an autoencoder to identify crucial patches for image reconstruction, leading to improved image generation and classification methods, and introduced patch collapse as a new perspective for efficient image modeling....
Read More

World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Published at 2025-11-27

#ML

This study explores how vision-language models handle cultural elements from different origins appearing together in images, introducing a new benchmark called CultureMix. The researchers find that current models struggle with this 'culture mixing' challenge, relying too much on backgrounds and providing inconsistent answers. They then propose and test strategies to improve model performance, finding that fine-tuning with a diverse dataset significantly enhances consistency and reduces backgroun...
Read More

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Published at 2025-11-27

#ML

The authors present Z-Image, an efficient image generation model that challenges the 'scale-at-all-costs' paradigm. By optimizing the model's lifecycle, Z-Image offers competitive performance with reduced computational overhead, enabling accessibility and affordability for users....
Read More

AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Published at 2025-11-28

#ML

The authors present a new framework called AnyTalker that efficiently generates multi-person talking videos with natural interactivity. They achieve this with an identity-aware attention mechanism and a training pipeline that requires minimal multi-person data, while also introducing a new metric and dataset to evaluate the quality of the generated videos....
Read More

Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Published at 2025-11-28

#ML

The study presents a new attention mechanism called Hierarchical Sparse Attention (HSA) that efficiently handles ultra-long contexts in large language models, generalizing to 16M-token contexts with high accuracy. The researchers integrate HSA into Transformers, creating an 8B-parameter model called HSA-UltraLong, which performs well on various tasks and out-of-domain contexts, paving the way for future research in this area....
Read More

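The hierarchical idea, scoring coarse block summaries first and running full attention only inside the best blocks, can be sketched on scalar toy features. This illustrates generic block-sparse attention selection under simplifying assumptions, not the paper's actual HSA implementation:

```python
import math

def hsa_attend(q, keys, values, block_size=4, top_k=2):
    """Two-level sparse attention over scalar features: rank mean-pooled
    block summaries first, then run softmax attention only inside the
    top-k highest-scoring blocks instead of over every token."""
    blocks = [(keys[i:i + block_size], values[i:i + block_size])
              for i in range(0, len(keys), block_size)]
    summaries = [sum(kb) / len(kb) for kb, _ in blocks]  # coarse block key
    chosen = sorted(range(len(blocks)), key=lambda i: -q * summaries[i])[:top_k]
    ks = [k for i in chosen for k in blocks[i][0]]
    vs = [v for i in chosen for v in blocks[i][1]]
    logits = [q * k for k in ks]
    m = max(logits)
    w = [math.exp(x - m) for x in logits]
    return sum(wi * vi for wi, vi in zip(w, vs)) / sum(w)
```

The cost of the fine-grained softmax then scales with `top_k * block_size` rather than with the full sequence length, which is what makes ultra-long contexts tractable.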
Vision Bridge Transformer at Scale
Published at 2025-11-28

#ML

The study presents a large-scale model called Vision Bridge Transformer (ViBT) that efficiently translates data by modeling the trajectory between inputs and outputs, unlike traditional models. By scaling this Transformer-based model to 1.3B and 20B parameters, the authors demonstrate its effectiveness for image and video translation tasks, including instruction-based image editing and complex video translation....
Read More

Tags are generated by Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media