🤗 Daily Paper (2025-11-24)


deep.di...@gmail.com

Nov 24, 2025, 3:07:49 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you find some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

Published at 2025-11-14

#ML

The authors present a new framework called VisMem that improves Vision-Language Models by adding short-term and long-term memory modules, inspired by human cognitive memory. This enhancement allows the models to better retain visual details and maintain consistency during complex tasks, resulting in a significant performance boost across various benchmarks....
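As a rough intuition for the dual-memory idea, here is a toy sketch of a short-term/long-term latent buffer. All names, capacities, and the salience-based write policy are hypothetical illustrations, not the paper's actual implementation:

```python
from collections import deque

class LatentVisionMemory:
    """Toy dual-memory buffer: a bounded short-term window of recent
    visual latents plus an unbounded long-term store (illustrative only)."""

    def __init__(self, short_capacity=4):
        self.short_term = deque(maxlen=short_capacity)  # recent latents
        self.long_term = []                             # consolidated latents

    def write(self, latent, salient=False):
        # Every latent enters short-term memory; salient ones are also
        # consolidated into long-term memory.
        self.short_term.append(latent)
        if salient:
            self.long_term.append(latent)

    def read(self):
        # A VLM would attend over both stores at decoding time.
        return list(self.short_term) + self.long_term
```

The point of the sketch is only the separation of a fast-decaying window from a persistent store; the paper's memory formation and retrieval mechanisms are learned, not rule-based.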

Read More

MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

Published at 2025-11-17

#ML

This study presents a new method for modeling genomic sequences that addresses the challenge of varying information density by automatically merging adjacent bases into words and using a hierarchical architecture with latent transformers for context-aware pre-training. The proposed model, MergeDNA, outperforms existing methods on various DNA benchmarks and multi-omics tasks....
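To illustrate what merging adjacent bases into larger tokens looks like, here is a BPE-style greedy merge over a DNA string. This is a generic sketch, assuming frequency-based merging; MergeDNA's actual merging is learned and context-aware, and the function name is hypothetical:

```python
from collections import Counter

def merge_adjacent(tokens, num_merges):
    """Greedily merge the most frequent adjacent token pair, num_merges times."""
    tokens = list(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break  # sequence collapsed to a single token
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # fuse the winning pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

For example, one merge over "ATATGC" fuses the frequent "AT" pair, shortening the sequence where information density is low while leaving rarer bases untouched.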

Read More

O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

Published at 2025-11-17

#ML

The authors present O-Mem, a new memory framework for AI agents that improves long-term interactions in complex environments. It dynamically updates user characteristics and event records, enabling more adaptive and coherent personalized responses, and outperforms previous state-of-the-art memory frameworks in benchmark tests....

Read More

Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations

Published at 2025-11-17

#ML

The authors propose a new framework, RFxG, to categorize and assess visual explanations in deep learning, addressing the current lack of consensus in evaluating these explanations. This framework includes four new metrics to systematically assess explanation quality, which are applied to various methods, architectures, and datasets, promoting user-intent-driven evaluation and aligning explanations with human understanding....

Read More

InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization

Published at 2025-11-18

#ML

The study presents a new framework, InstructMix2Mix, which enhances multi-view image editing from limited input views by combining a 2D diffusion model's editing skills with a pretrained multi-view diffusion model. This approach improves consistency across different views, reduces artifacts, and maintains high-quality edits in each view, compared to existing methods....

Read More

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Published at 2025-11-19

#ML

The authors present GeoVista, a new model that uses web search and image zooming tools to improve geolocalization, the task of identifying a location from an image. They also introduce GeoBench, a new benchmark for this task, and show that GeoVista outperforms other open-source models on it....

Read More

Insights from the ICLR Peer Review and Rebuttal Process

Published at 2025-11-19

#ML

The study analyzes the ICLR 2024 and 2025 peer review processes to understand review dynamics and improve efficiency. Key findings include the impact of initial scores, co-reviewer ratings, and rebuttal strategies on score changes, offering insights to enhance the review process for authors and the community....

Read More

Taming Generative Synthetic Data for X-ray Prohibited Item Detection

Published at 2025-11-19

#ML

The study presents Xsyn, a one-stage method for generating high-quality X-ray security images without extra labor cost. Xsyn uses two strategies: Cross-Attention Refinement to refine bounding box annotations and Background Occlusion Modeling to enhance imaging complexity, outperforming previous methods by 1.2% mAP and improving prohibited item detection performance....

Read More

Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

Published at 2025-11-19

#ML

The study explores the intrinsic dimension of different text genres and finds that scientific texts are easier for language models to process, while creative writing requires more complexity. The research also identifies specific linguistic features that contribute to this complexity, such as formal tone in scientific texts and personalization in creative writing....

Read More

Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Published at 2025-11-20

#ML

The study presents Mantis, a new framework that improves vision-language-action models by separating visual foresight prediction from the main model and using a diffusion Transformer head. This approach enhances the model's comprehension and reasoning capabilities, leading to better performance in real-world tasks compared to existing models....

Read More

Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

Published at 2025-11-20

#ML

This study presents a framework called Multi-Faceted Attack (MFA) that uncovers vulnerabilities in popular vision-language models like GPT-4o, Gemini-Pro, and Llama-4. MFA uses a method called Attention-Transfer Attack to hide harmful instructions, which successfully bypasses defense mechanisms and demonstrates a 58.5% success rate, outperforming existing methods and challenging the robustness of current defense mechanisms....

Read More

OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

Published at 2025-11-20

#ML

The OmniScientist framework is designed to mimic human research processes in AI scientists, enabling end-to-end automation and collaboration, and fostering a sustainable innovation ecosystem through a structured knowledge system, collaborative research protocol, and open evaluation platform....

Read More

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Published at 2025-11-20

#ML

The study presents a transparent and reproducible method for training multimodal reasoning models, called OpenMMReasoner, which consists of two stages: supervised fine-tuning and reinforcement learning. By focusing on high-quality data and careful training design, the proposed recipe significantly outperforms existing baselines and paves the way for future research in large-scale multimodal reasoning....

Read More

SAM 3: Segment Anything with Concepts

Published at 2025-11-20

#ML

The authors have developed a new model called SAM 3 that can detect, segment, and track objects in images and videos using concept prompts like 'yellow school bus' or example images. SAM 3 can accurately identify objects even in challenging scenarios and outperforms existing systems in promptable concept segmentation tasks....

Read More

WorldGen: From Text to Traversable and Interactive 3D Worlds

Published at 2025-11-20

#ML

WorldGen is a system that creates large, interactive 3D worlds from text prompts, using AI to transform descriptions into explorable environments. It allows creators to design coherent worlds easily, without needing manual modeling or 3D expertise, and offers control over layout, scale, and style for visually rich and consistent worlds....

Read More

Diversity Has Always Been There in Your Visual Autoregressive Models

Published at 2025-11-21

#ML

The study presents DiverseVAR, a method that enhances the diversity of Visual Autoregressive models without additional training. It does this by controlling a key component of the feature map, improving diversity while maintaining high image quality....

Read More

Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Published at 2025-11-21

#ML

This study investigates how making multimodal models smaller impacts their visual and reasoning abilities. The researchers found that smaller models struggle more with visual tasks and introduced a new method called 'visual extraction tuning' to improve performance, resulting in their 'Extract+Think' approach....

Read More

Loomis Painter: Reconstructing the Painting Process

Published at 2025-11-21

#ML

The authors present a new framework for generating multi-media painting processes with a focus on consistency and realism. They embed different media into a diffusion model and use a reverse-painting strategy to create smooth, human-like paintings, which they evaluate using various metrics and a new Perceptual Distance Profile curve....

Read More

Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

Published at 2025-11-21

#ML

The study introduces PARROT, a framework to measure the impact of social pressure on large language models. It evaluates 22 models and finds that advanced models are less likely to conform to false information, while older/smaller models are more susceptible to it....

Read More

Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Published at 2025-11-21

#ML

The authors present a new method called SketchVerify that enhances motion planning for video generation by utilizing a loop that samples and verifies multiple candidate motion plans. This approach improves physical realism, long-term consistency, and efficiency compared to existing methods, by rendering each trajectory as a lightweight video sketch for easy evaluation and refinement....
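The sample-and-verify loop described above can be reduced to a generic skeleton: draw several candidate plans, score each with a verifier, keep the best. Everything below is a hedged sketch with hypothetical names; the actual system renders each candidate as a lightweight video sketch before scoring, which this toy version does not attempt:

```python
import random

def plan_with_verification(sample_plan, verify, num_candidates=8, seed=0):
    """Sample candidate motion plans and keep the one the verifier scores
    highest (generic best-of-N sample-and-verify loop)."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    best_plan, best_score = None, float("-inf")
    for _ in range(num_candidates):
        plan = sample_plan(rng)   # propose a candidate plan
        score = verify(plan)      # cheap evaluation of the candidate
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score
```

The design choice worth noting is that verification is decoupled from generation, so a cheap verifier can filter many candidates before any expensive video rendering happens.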

Read More

RynnVLA-002: A Unified Vision-Language-Action and World Model

Published at 2025-11-21

#ML

RynnVLA-002 is a combined system that improves action prediction and visual understanding by learning from both actions and images, outperforming individual systems in both simulation and real-world tasks....

Read More

VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

Published at 2025-11-21

#ML

The authors present VLA-4D, a vision-language-action model that improves robotic manipulation by incorporating 4D awareness for both spatial and temporal coherence. They achieve this by embedding time into 3D positions for a unified visual representation and extending spatial action representations with temporal information, resulting in smoother and more coordinated robotic movements....

Read More

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Published at 2025-11-21

#ML

The authors present a new method called Video-R4 that improves video reasoning by simulating how humans re-examine important details. This method iteratively selects and zooms into informative frames, updates its reasoning, and achieves top performance in various video and document QA tasks, demonstrating the effectiveness of this iterative approach....
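The iterative "re-examine important details" loop can be caricatured as: repeatedly pick the most informative unvisited frame, inspect it, and fold it into the running reasoning state. This is a toy sketch with hypothetical helper names (`score_frame`, `reason`); the paper's selection and zooming are learned via reinforcement, not rule-based:

```python
def ruminate(frames, score_frame, reason, num_steps=3):
    """Toy visual-rumination loop: greedily visit the highest-scoring
    unvisited frame each step and update the reasoning state."""
    state, visited = None, set()
    for _ in range(num_steps):
        candidates = [i for i in range(len(frames)) if i not in visited]
        if not candidates:
            break  # every frame already examined
        best = max(candidates, key=lambda i: score_frame(frames[i], state))
        visited.add(best)
        state = reason(frames[best], state)  # fold the frame into the state
    return state, sorted(visited)
```

The essential shape is that selection depends on the current state, so each zoom can target whatever the evolving answer is still missing.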

Read More


Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
