🤗 Daily Paper Newsletter

Hope you find some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Published at 2025-11-14

#ML

The authors present VisMem, a framework that equips Vision-Language Models with short-term and long-term memory modules inspired by human cognitive memory. These modules help models retain visual details and stay consistent through complex tasks, yielding a significant performance boost across various benchmarks. A toy sketch of the idea follows below.
Read More
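As a purely illustrative reading of the dual-memory idea, here is a minimal sketch; the class, its methods, and the retrieval rule are assumptions made for this example, not VisMem's actual design.

```python
import numpy as np

class ToyVisualMemory:
    """Hypothetical dual-store memory in the spirit of the VisMem summary:
    a small short-term buffer of recent visual features, plus a long-term
    store consolidated from evicted entries. Not the paper's implementation."""

    def __init__(self, short_capacity: int = 8):
        self.short_capacity = short_capacity
        self.short_term = []   # recent patch/frame features (fine detail)
        self.long_term = []    # consolidated features (long-horizon consistency)

    def observe(self, feature: np.ndarray):
        # Push a new visual feature; evict the oldest into long-term memory.
        self.short_term.append(feature)
        if len(self.short_term) > self.short_capacity:
            self.long_term.append(self.short_term.pop(0))

    def recall(self, query: np.ndarray, k: int = 3):
        # Return the k stored features most similar to the query (cosine).
        pool = self.short_term + self.long_term
        if not pool:
            return []
        sims = [float(query @ f) /
                (float(np.linalg.norm(query) * np.linalg.norm(f)) + 1e-8)
                for f in pool]
        order = np.argsort(sims)[::-1][:k]
        return [pool[i] for i in order]

# Usage: feed per-frame features during decoding, recall when detail is needed.
mem = ToyVisualMemory(short_capacity=2)
for _ in range(5):
    mem.observe(np.random.randn(4))
top2 = mem.recall(np.random.randn(4), k=2)
```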
MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Published at 2025-11-17

#ML

This study addresses the varying information density of genomic sequences by automatically merging adjacent bases into word-like tokens and pre-training a hierarchical architecture of latent Transformers over them in a context-aware fashion. The resulting model, MergeDNA, outperforms existing methods on various DNA benchmarks and multi-omics tasks. A toy merging pass is sketched below.
Read More
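MergeDNA learns which bases to merge from context; the sketch below instead uses a fixed, BPE-style frequency rule, purely to show what "merging adjacent bases into words" means mechanically. All names here are invented for the example.

```python
from collections import Counter

def greedy_merge_tokenize(seq: str, num_merges: int = 3):
    """Toy merging pass over a DNA string: repeatedly fuse the most frequent
    adjacent token pair into one longer token. A stand-in for MergeDNA's
    learned, context-aware tokenizer, showing only the mechanics."""
    tokens = list(seq)                          # start from single bases
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]     # most frequent adjacent pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)            # fuse the chosen pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(greedy_merge_tokenize("ATATCGCGATAT"))
# ['ATAT', 'CG', 'CG', 'ATAT'] with this frequency rule and tie-breaking
```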
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Published at 2025-11-17

#ML

The authors present O-Mem, a memory framework that improves AI agents' long-horizon interactions in complex environments. It dynamically updates user characteristics and event records, enabling more adaptive and coherent personalized responses, and it outperforms previous state-of-the-art memory frameworks in benchmark tests. A schematic sketch follows below.
Read More
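One minimal, hypothetical way to picture "dynamically updated user characteristics and event records" is a persona dictionary plus an append-only event log; the layout and method names below are assumptions for illustration, not O-Mem's design.

```python
from dataclasses import dataclass, field

@dataclass
class ToyAgentMemory:
    """Illustrative persona + event store: persona attributes are overwritten
    as new evidence arrives; events accumulate and are filtered by topic
    when composing a personalized response."""
    persona: dict = field(default_factory=dict)   # user characteristics
    events: list = field(default_factory=list)    # chronological records

    def update_persona(self, attribute: str, value: str):
        # Latest observation wins; a real system would weigh evidence.
        self.persona[attribute] = value

    def log_event(self, topic: str, summary: str):
        self.events.append({"topic": topic, "summary": summary})

    def context_for(self, topic: str, max_events: int = 3):
        # Assemble persona plus the most recent on-topic events.
        related = [e["summary"] for e in self.events if e["topic"] == topic]
        return {"persona": dict(self.persona), "events": related[-max_events:]}

mem = ToyAgentMemory()
mem.update_persona("preferred_language", "Python")
mem.log_event("coding", "Asked how to profile a slow loop.")
print(mem.context_for("coding"))
```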
Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations
Published at 2025-11-17

#ML

The authors propose RFxG, a framework for categorizing and assessing visual explanations in deep learning, addressing the current lack of consensus on how such explanations should be evaluated. The framework introduces four new metrics for systematically assessing explanation quality, applied across methods, architectures, and datasets, promoting user-intent-driven evaluation and aligning explanations with human understanding.
Read More
InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Published at 2025-11-18

#ML

The study presents InstructMix2Mix, a framework that improves multi-view image editing from limited input views by combining a 2D diffusion model's editing ability with a pretrained multi-view diffusion model. Compared with existing methods, it achieves better consistency across views, fewer artifacts, and high-quality edits in each view.
Read More
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Published at 2025-11-19

#ML

The authors present GeoVista, an agentic model that uses web-search and image-zooming tools to improve geolocalization, the task of identifying where an image was taken. They also introduce GeoBench, a benchmark for the task, and show that GeoVista outperforms other open-source models. A schematic tool loop is sketched below.
Read More
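The abstract implies a propose-act-verify loop over tools. The sketch below is a deliberately crude, hypothetical version: the two tools, the confidence rule, and the stopping criterion are all stand-ins, not GeoVista's components.

```python
# Schematic agentic geolocalization loop with stub tools. In the real
# system a VLM drives these decisions; here everything is a placeholder.

def web_search(query):
    return f"stub results for: {query}"        # placeholder tool

def zoom_crop(image, region):
    return image                               # placeholder crop

def propose_hypothesis(image, evidence):
    # A real agent would query a VLM; we return a dummy guess plus a
    # confidence that grows as evidence accumulates.
    return "Lisbon, Portugal", min(0.3 + 0.2 * len(evidence), 0.9)

def geolocate(image, max_steps=4, threshold=0.8):
    evidence = []
    for step in range(max_steps):
        guess, confidence = propose_hypothesis(image, evidence)
        if confidence >= threshold:
            break
        # Alternate between zooming into detail and searching the web.
        if step % 2 == 0:
            image = zoom_crop(image, region=(0.25, 0.25, 0.75, 0.75))
            evidence.append("zoomed into signage region")
        else:
            evidence.append(web_search(f"landmark resembling {guess}"))
    return guess, evidence

print(geolocate(image=None))
```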
Insights from the ICLR Peer Review and Rebuttal Process
Published at 2025-11-19

#ML

The study analyzes the ICLR 2024 and 2025 peer-review processes to understand review dynamics and improve efficiency. Key findings cover how initial scores, co-reviewer ratings, and rebuttal strategies influence score changes, offering insights to improve the review process for authors and the community.
Read More
Taming Generative Synthetic Data for X-ray Prohibited Item Detection
Published at 2025-11-19

#ML

The study presents Xsyn, a one-stage method for generating high-quality X-ray security images at no extra labeling cost. It combines two strategies, Cross-Attention Refinement to refine bounding-box annotations and Background Occlusion Modeling to enhance imaging complexity, and improves prohibited-item detection by 1.2% mAP over previous methods. A toy box-refinement step follows below.
Read More
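To make attention-based box refinement concrete: a generic, hypothetical step is to normalize an attention map, threshold it, and shrink the box to the activated region. The function below is that generic stand-in, not Xsyn's actual procedure.

```python
import numpy as np

def refine_bbox_from_attention(attn: np.ndarray, threshold: float = 0.5):
    """Tighten a bounding box to where a normalized attention map exceeds
    a threshold. Xsyn's Cross-Attention Refinement is more involved; this
    only illustrates the basic mechanics."""
    attn = (attn - attn.min()) / (np.ptp(attn) + 1e-8)  # normalize to [0, 1]
    ys, xs = np.nonzero(attn >= threshold)
    if len(xs) == 0:
        return None                                      # nothing activated
    # (x_min, y_min, x_max, y_max) around the activated pixels
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

rng = np.random.default_rng(0)
attn_map = rng.random((32, 32))
attn_map[10:20, 12:24] += 1.5       # synthetic hot region for the demo
print(refine_bbox_from_attention(attn_map))   # (12, 10, 23, 19)
```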
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Published at 2025-11-19

#ML

The study explores the intrinsic dimension of different text genres, finding that scientific texts are easier for language models to process, while creative writing carries more complexity. It also identifies linguistic features that drive this difference, such as the formal tone of scientific texts and the personalization of creative writing. A standard estimator is sketched below.
Read More
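The summary does not say which estimator the paper uses; a common choice for intrinsic dimension is the TwoNN estimator of Facco et al. (2017), sketched below on synthetic points as placeholder data (text embeddings would take their place in practice).

```python
import numpy as np

def twonn_intrinsic_dimension(points: np.ndarray) -> float:
    """TwoNN estimator (Facco et al., 2017): the ratio of each point's
    second- to first-nearest-neighbor distance follows a Pareto law whose
    exponent is the intrinsic dimension."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))      # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)            # ignore self-distances
    sorted_d = np.sort(dists, axis=1)
    mu = sorted_d[:, 1] / sorted_d[:, 0]       # r2 / r1 per point
    return len(mu) / np.sum(np.log(mu))        # maximum-likelihood estimate

rng = np.random.default_rng(0)
# Points on a 3-D subspace embedded in 50-D: the estimate should be near 3.
low_d = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 50))
print(round(twonn_intrinsic_dimension(low_d), 2))
```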
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Published at 2025-11-20

#ML

The study presents Mantis, a framework that improves vision-language-action models by disentangling visual-foresight prediction from the main model and handling it with a diffusion-Transformer head. This strengthens the model's comprehension and reasoning, leading to better performance on real-world tasks than existing models.
Read More
Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models
Published at 2025-11-20

#ML

This study presents Multi-Faceted Attack (MFA), a framework that uncovers vulnerabilities in defense-equipped vision-language models such as GPT-4o, Gemini-Pro, and Llama-4. Its Attention-Transfer Attack hides harmful instructions in ways that bypass existing defense mechanisms, achieving a 58.5% success rate, outperforming existing methods and challenging the robustness of current defenses.
Read More
OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
Published at 2025-11-20

#ML

The OmniScientist framework is designed to mirror human research processes in AI scientists, enabling end-to-end automation and collaboration. It aims to foster a sustainable innovation ecosystem through a structured knowledge system, a collaborative research protocol, and an open evaluation platform.
Read More
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Published at 2025-11-20

#ML

The study presents OpenMMReasoner, a transparent and reproducible recipe for training multimodal reasoning models in two stages: supervised fine-tuning followed by reinforcement learning. By focusing on high-quality data and careful training design, the recipe significantly outperforms existing baselines and paves the way for future research in large-scale multimodal reasoning.
Read More
SAM 3: Segment Anything with Concepts
Published at 2025-11-20

#ML

The authors have developed SAM 3, a model that detects, segments, and tracks objects in images and videos from concept prompts such as the phrase 'yellow school bus' or example images. SAM 3 identifies objects accurately even in challenging scenarios and outperforms existing systems on promptable concept segmentation.
Read More
WorldGen: From Text to Traversable and Interactive 3D Worlds
Published at 2025-11-20

#ML

WorldGen is a system that turns text prompts into large, interactive, explorable 3D worlds. It lets creators design coherent environments without manual modeling or 3D expertise, offering control over layout, scale, and style for visually rich, consistent worlds.
Read More
Diversity Has Always Been There in Your Visual Autoregressive Models
Published at 2025-11-21

#ML

The study presents DiverseVAR, a method that enhances the output diversity of Visual Autoregressive models without additional training by controlling a key component of the feature map, improving diversity while maintaining high image quality.
Read More
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Published at 2025-11-21

#ML

This study investigates how shrinking multimodal models affects their perception and reasoning abilities. The researchers find that smaller models struggle disproportionately with visual tasks, and they introduce 'visual extraction tuning' to counteract this, yielding their 'Extract+Think' approach.
Read More
Loomis Painter: Reconstructing the Painting Process
Published at 2025-11-21

#ML

The authors present a framework for generating painting processes across multiple media with an emphasis on consistency and realism. They embed the different media in a diffusion model and use a reverse-painting strategy to produce smooth, human-like progressions, which they evaluate with standard metrics and a new Perceptual Distance Profile curve.
Read More
Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
Published at 2025-11-21

#ML

The study introduces PARROT, a framework for measuring how social pressure affects the truthfulness of large language models. Evaluating 22 models, it finds that advanced models are less likely to conform to false claims, while older or smaller models are considerably more susceptible. A bare-bones probe of this kind is sketched below.
Read More
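A minimal version of such a sycophancy probe might look as follows; `ask_model` is a stub standing in for a real LLM call, and the pressure prompt is invented rather than taken from PARROT.

```python
# Toy sycophancy probe: ask a question, push back with a false claim, and
# check whether the model flips. PARROT's protocol and scoring are richer.

QA = [("What is the capital of Australia?", "Canberra", "Sydney")]

def ask_model(prompt: str) -> str:
    # Stub that always answers correctly; swap in a real API call to test.
    return "Canberra"

def conformity_rate(qa_pairs) -> float:
    flips = 0
    for question, truth, false_claim in qa_pairs:
        first = ask_model(question)
        follow_up = (f"{question}\nYou said: {first}\n"
                     f"I am quite sure the answer is {false_claim}. Are you sure?")
        second = ask_model(follow_up)
        # A flip: initially correct, then conforming to the false claim.
        if truth.lower() in first.lower() and false_claim.lower() in second.lower():
            flips += 1
    return flips / len(qa_pairs)

print(f"conformity rate: {conformity_rate(QA):.0%}")
```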
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
Published at 2025-11-21

#ML

The authors present SketchVerify, a method that improves motion planning for video generation through a sample-and-verify loop: each candidate motion plan is rendered as a lightweight video sketch, scored for plausibility, and refined accordingly. This improves physical realism, long-term consistency, and efficiency over existing methods. The core control flow is sketched below.
Read More
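The sample-and-verify pattern is easy to show in isolation. In the sketch below, the sampler, the "video sketch" renderer, and the physics verifier are all placeholders for the paper's actual components.

```python
import random

def sample_plan(rng):
    return [rng.uniform(-1, 1) for _ in range(8)]   # dummy trajectory

def render_sketch(plan):
    return plan                                     # cheap proxy render

def physics_score(sketch):
    # Placeholder verifier: rewards smooth trajectories (small deltas).
    return -sum(abs(b - a) for a, b in zip(sketch, sketch[1:]))

def plan_with_verification(num_candidates=16, seed=0):
    """Generic sample-and-verify loop: draw candidate plans, score each
    one's cheap rendering, keep the best. SketchVerify additionally
    refines plans; that step is omitted here."""
    rng = random.Random(seed)
    best_plan, best_score = None, float("-inf")
    for _ in range(num_candidates):
        plan = sample_plan(rng)
        score = physics_score(render_sketch(plan))
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score

plan, score = plan_with_verification()
print(f"best smoothness score: {score:.3f}")
```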
RynnVLA-002: A Unified Vision-Language-Action and World Model
Published at 2025-11-21

#ML

RynnVLA-002 is a unified system that improves action prediction and visual understanding by learning jointly from actions and images, outperforming its standalone counterparts in both simulation and real-world tasks.
Read More
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
Published at 2025-11-21

#ML

The authors present VLA-4D, a vision-language-action model that improves robotic manipulation by incorporating 4D awareness for spatial and temporal coherence. They embed time into 3D positions to form a unified visual representation and extend spatial action representations with temporal information, producing smoother, better-coordinated robotic movements. A toy 4D embedding follows below.
Read More
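One plausible, but hypothetical, reading of "embedding time into 3D positions" is a sinusoidal positional encoding over (x, y, z, t); the sketch below shows that generic construction, not the paper's exact formulation.

```python
import numpy as np

def sinusoidal_embed_4d(x, y, z, t, dim_per_axis: int = 8):
    """Generic 4D positional embedding: standard sinusoidal features for
    each of (x, y, z, t), concatenated into one vector. VLA-4D's actual
    representation may differ; this shows one common way to fuse space
    and time."""
    def axis_embed(v):
        half = dim_per_axis // 2
        freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
        return np.concatenate([np.sin(v * freqs), np.cos(v * freqs)])
    return np.concatenate([axis_embed(v) for v in (x, y, z, t)])

emb = sinusoidal_embed_4d(x=0.3, y=-0.1, z=0.8, t=5.0)
print(emb.shape)   # (32,): 4 axes * 8 dims each
```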
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Published at 2025-11-21

#ML

The authors present Video-R4, a method that improves text-rich video reasoning by mimicking how humans re-examine important details: it iteratively selects informative frames, zooms in, and updates its reasoning. This iterative approach achieves top performance across a range of video and document QA tasks. A schematic rumination loop follows below.
Read More
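The "visual rumination" loop can be caricatured as a select-zoom-update cycle. Every component below (frame scoring, zooming, answer updating) is a stub invented for illustration; Video-R4 itself learns when and where to ruminate, whereas here it is hard-coded.

```python
# Schematic select-zoom-update loop in the spirit of the summary.
# All components are stubs; nothing here is the paper's implementation.

def score_frames(frames, question):
    # Stub relevance score: text-rich (longer) frames score higher.
    return [len(f) for f in frames]

def zoom(frame):
    return frame.upper()                 # stand-in for a zoomed crop

def ruminate(frames, question, rounds=3):
    notes, seen = [], set()
    for _ in range(rounds):
        scores = score_frames(frames, question)
        candidates = [i for i in range(len(frames)) if i not in seen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: scores[i])   # re-examine next
        seen.add(best)
        notes.append(zoom(frames[best]))  # update working evidence
    return "; ".join(notes)

frames = ["menu board: soup $4", "blurry crowd", "sign: OPEN 9-5"]
print(ruminate(frames, "When does the shop open?"))
```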
Tags are generated by Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media