🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter brings you a curated list of papers from 🤗 Daily Papers.

GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
Published at 2025-11-17
#ML

The researchers developed GigaEvo, an open-source framework that supports studying and experimenting with LLM-evolution approaches, specifically designed to address the lack of implementation details in previous work. The framework includes modular components for various evolutionary strategies and is validated through experiments on challenging problems, emphasizing modularity, concurrency, and ease of use....

Unified all-atom molecule generation with neural fields
Published at 2025-11-19
#ML

The authors present a new framework called FuncBind that generates all-atom molecules for various targets using computer vision techniques and neural fields. This modality-agnostic approach allows FuncBind to create competitive small molecules, macrocyclic peptides, and antibody loops, and even designed new antibody binders in vitro, all while handling diverse atomic systems and variable counts....

Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Published at 2025-11-20
#ML

The paper analyzes the reasoning abilities of large language models (LLMs) and compares them to human cognition by creating a taxonomy of 28 cognitive elements. The study reveals that LLMs rely on surface-level processing and sequential organization, while humans use more abstract reasoning and conceptual processing. The research also suggests that LLMs have the potential to reason effectively but fail to deploy these abilities spontaneously, and proposes a method to improve their performance by...

SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
Published at 2025-11-22
#ML

The authors present SciEducator, a new multi-agent system that uses a self-evolving reasoning mechanism inspired by the Deming Cycle to improve understanding and education of scientific videos. SciEducator can create various types of educational content and has been shown to outperform other models in a new benchmark for scientific video understanding....

Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking
Published at 2025-11-23
#ML

The study examines how well large language models can predict future events in various fields, and finds that their accuracy depends on the specific question asked and how it's framed, as well as the type of event being predicted....

Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
Published at 2025-11-23
#ML

Yo'City is a new system that creates customizable and endless 3D city scenes by using pre-trained large models to plan and design cities in a hierarchical structure, and then refine them with detailed 3D images. This system also allows for interactive city growth and has been tested and found to be better than current methods in creating realistic and detailed city scenes....

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
Published at 2025-11-24
#ML

The authors present Agent0-VL, a self-evolving vision-language agent that improves its performance over time without human supervision. It uses tools for reasoning, self-evaluation, and self-repair, allowing the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. Experiments show a 12.5% improvement over the base model....

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution
Published at 2025-11-24
#ML

This study presents ORS3D, a new challenge for embodied AI that requires agents to understand language, grasp 3D spatial concepts, and optimize task efficiency by performing parallel tasks. The researchers also introduce ORS3D-60K, a large-scale dataset, and GRANT, a model that can generate efficient task schedules and actions, demonstrating improved performance in these areas compared to existing methods....

DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Published at 2025-11-24
#ML

DiffSeg30k is a new dataset of 30,000 images with local edits made by state-of-the-art diffusion models, designed to help detect and locate AI-generated content at a fine level. This dataset can improve the detection of AI-generated images and advance research in this area by providing a more realistic and diverse set of images with pixel-level annotations....

Fara-7B: An Efficient Agentic Model for Computer Use
Published at 2025-11-24
#ML

The authors present Fara-7B, a new computer use agent model that can understand and interact with computers using only screenshots and predicted coordinates. They trained Fara-7B using FaraGen, a synthetic data generation system, and it outperforms other similar-sized models and competes with larger ones, demonstrating the power of large-scale data generation for efficient agentic models. The model and a new benchmark are made publicly available....

GigaWorld-0: World Models as Data Engine to Empower Embodied AI
Published at 2025-11-24
#ML

The authors present GigaWorld-0, a new framework that uses artificial data to train embodied AI, which interacts with the environment using vision, language, and action. This framework, powered by efficient training methods, generates high-quality, diverse, and controllable data, enabling AI models to perform well in real-world tasks without any real-world training....

HunyuanOCR Technical Report
Published at 2025-11-24
#ML

HunyuanOCR is a lightweight, open-source Vision-Language Model designed for OCR tasks, outperforming larger models and commercial APIs. It offers a unified and efficient approach, streamlined architecture, and utilizes data-driven and RL strategies, making it a top-tier choice for both research and industrial applications....

MagicWorld: Interactive Geometry-driven Video World Exploration
Published at 2025-11-24
#ML

The proposed MagicWorld model improves upon existing interactive video world models by integrating 3D geometric information and historical data retrieval. It uses user actions to create a point cloud for consistent viewpoint transitions and stores relevant historical frames to maintain scene information and reduce errors in scene evolution....

MedSAM3: Delving into Segment Anything with Medical Concepts
Published at 2025-11-24
#ML

The study presents MedSAM-3, a text-guided medical image and video segmentation model that improves upon existing methods by using semantic conceptual labels and open-vocabulary text descriptions, resulting in superior performance across various medical imaging modalities....

ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
Published at 2025-11-24
#ML

The authors propose a new method called ReDirector to create video retakes of any length using camera control for dynamically captured videos. They also present Rotary Camera Encoding (RoCE), which captures and integrates multi-view relationships between the input and target videos, improving object localization and background preservation while ensuring camera controllability and geometric consistency....

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
Published at 2025-11-24
#ML

This study presents VISTA-Gym, a new training platform that enhances the reasoning abilities of vision-language models by incorporating visual tools, allowing them to better interact with and understand real-world visual tasks. By training a model called VISTA-R1 using VISTA-Gym, the authors demonstrate significant improvements in performance on various visual reasoning benchmarks compared to existing models....

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
Published at 2025-11-24
#ML

SteadyDancer is a new framework for human image animation that preserves the first-frame identity and controls motion precisely. It does this by reconciling conflicting conditions, generating adaptive pose representations, and using a specialized training pipeline, achieving better performance while using fewer resources than existing methods....

Concept-Aware Batch Sampling Improves Language-Image Pretraining
Published at 2025-11-25
#ML

The authors present DataConcept, a large dataset of web-crawled image-text pairs with detailed concept annotations. They then introduce Concept-Aware Batch Sampling (CABS), a flexible framework for creating training batches based on specific concept distributions, which significantly improves the performance of vision-language models on various benchmarks....

Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
Published at 2025-11-25
#ML

The authors present a new method called DPP-GRPO to improve diversity in video generation from text prompts. This method uses Determinantal Point Processes and Group Relative Policy Optimization to enforce diversity in video generation, which works across different visual elements and camera motions without reducing quality....

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Published at 2025-11-25
#ML

The study presents UniSandbox, a framework to examine the relationship between understanding and generation in Unified Multimodal Models. Experiments show a gap between understanding and generation, which can be reduced using Chain-of-Thought in reasoning tasks and self-training, providing insights for future model designs....

MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Published at 2025-11-25
#ML

The authors present MajutsuCity, a framework for generating realistic 3D cities using natural language, which offers controllable layouts, assets, and materials. This method improves geometric fidelity, stylistic adaptability, and semantic controllability in 3D city generation, outperforming existing methods in various evaluation metrics....

OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation
Published at 2025-11-25
#ML

The researchers present a new system called OmniAlpha, which is capable of generating and editing RGBA images through a unified, multi-task framework. This system outperforms other specialized models in various tasks, such as mask-free matting and layer-conditioned completion, by learning a shared representation for RGBA images....

PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding
Published at 2025-11-25
#ML

The study presents a new framework called PhysChoreo that creates realistic videos with controlled physical properties by estimating initial physical properties of objects in an image and using a physically editable simulation to generate dynamic behaviors. Experiments show that PhysChoreo outperforms existing methods in generating physically realistic videos with diverse controllability....

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Published at 2025-11-25
#ML

This study presents SSA, a new training framework for sparse attention in large language models that addresses the paradox of low sparsity in native sparse-attention methods. By enforcing bidirectional alignment between sparse and full attention at every layer, SSA preserves gradient flow and promotes stronger sparsity, resulting in state-of-the-art performance and flexible compute-performance trade-offs....

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow
Published at 2025-11-25
#ML

The authors present STARFlow-V, a new video generation model based on normalizing flows, which offers benefits like end-to-end learning and native likelihood estimation. This model improves upon its predecessor by using a global-local architecture to manage spatiotemporal complexity and a lightweight causal denoiser to enhance video generation consistency, resulting in strong visual quality and temporal consistency compared to diffusion-based baselines....

Soft Adaptive Policy Optimization
Published at 2025-11-25
#ML

The authors present a new method, Soft Adaptive Policy Optimization (SAPO), for policy optimization in reinforcement learning with large language models. SAPO improves upon existing methods by using a temperature-controlled gate to adaptively attenuate off-policy updates, maintaining sequence-level coherence and token-adaptive learning, leading to better stability and performance in mathematical reasoning benchmarks....

UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
Published at 2025-11-25
#ML

The authors tackle the issue of video diffusion transformers not working well beyond their training length by focusing on attention maps. They find that a problem called attention dispersion causes the models to fail, and they propose a new method, UltraViCo, to fix this. UltraViCo improves the models' performance in creating longer videos and enhances image quality, outperforming other methods....

Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation
Published at 2025-11-25
#ML

This study presents a new two-step method to accurately track the 3D motion of a ping pong ball from regular videos, which is difficult due to real-world challenges. The method separates the problem into two tasks and uses a newly created dataset for training, resulting in a practical and robust application for analyzing ping pong ball trajectories and spin....

VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Published at 2025-11-25
#ML

The authors present VQ-VA World, a framework for creating high-quality image-text samples for training open-source models to answer visual questions with images, improving performance on the IntelligentBench benchmark and narrowing the gap with proprietary systems....

iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Published at 2025-11-25
#ML

The authors propose a method to combine powerful video models and image data to create a unified framework for generating image sets with natural transitions and a wide range of dynamics. Their approach allows the model to perform various image generation and editing tasks without losing its original motion capabilities, resulting in scenes with extraordinary dynamics....

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.