🤗 Daily Paper(2025-11-26)

deep.di...@gmail.com

Nov 26, 2025, 3:07:29 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms

Published at 2025-11-17

#ML

The researchers developed GigaEvo, an open-source framework for studying and experimenting with LLM-evolution approaches, designed to address the lack of implementation details in previous work. The framework includes modular components for various evolutionary strategies and is validated through experiments on challenging problems, emphasizing modularity, concurrency, and ease of use....

Unified all-atom molecule generation with neural fields

Published at 2025-11-19

#ML

The authors present FuncBind, a framework that generates all-atom molecules for various targets using computer vision techniques and neural fields. This modality-agnostic approach produces competitive small molecules, macrocyclic peptides, and antibody loops, and was even used to design new antibody binders in vitro, all while handling diverse atomic systems and variable counts....

Cognitive Foundations for Reasoning and Their Manifestation in LLMs

Published at 2025-11-20

#ML

The paper analyzes the reasoning abilities of large language models (LLMs) and compares them to human cognition by creating a taxonomy of 28 cognitive elements. The study reveals that LLMs rely on surface-level processing and sequential organization, while humans use more abstract reasoning and conceptual processing. The research also suggests that LLMs have the potential to reason effectively but fail to deploy these abilities spontaneously, and proposes a method to improve their performance by...

SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

Published at 2025-11-22

#ML

The authors present SciEducator, a new multi-agent system that uses a self-evolving reasoning mechanism inspired by the Deming Cycle to improve understanding and education of scientific videos. SciEducator can create various types of educational content and outperforms other models on a new benchmark for scientific video understanding....

Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking

Published at 2025-11-23

#ML

The study examines how well large language models can predict future events in various fields, and finds that their accuracy depends on the specific question asked and how it's framed, as well as the type of event being predicted....

Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Published at 2025-11-23

#ML

Yo'City is a new system that creates customizable and endless 3D city scenes by using pre-trained large models to plan and design cities in a hierarchical structure and then refine them with detailed 3D images. The system also supports interactive city growth and was found in testing to outperform current methods at creating realistic and detailed city scenes....

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Published at 2025-11-24

#ML

The authors present Agent0-VL, a self-evolving vision-language agent that improves its performance over time without human supervision. It uses tools for reasoning, self-evaluation, and self-repair, allowing the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. Experiments show a 12.5% improvement over the base model....

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Published at 2025-11-24

#ML

This study presents ORS3D, a new challenge for embodied AI that requires agents to understand language, grasp 3D spatial concepts, and optimize task efficiency by performing parallel tasks. The researchers also introduce ORS3D-60K, a large-scale dataset, and GRANT, a model that can generate efficient task schedules and actions, demonstrating improved performance in these areas compared to existing methods....

DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

Published at 2025-11-24

#ML

DiffSeg30k is a new dataset of 30,000 images with local edits made by state-of-the-art diffusion models, designed to help detect and locate AI-generated content at a fine level. This dataset can improve the detection of AI-generated images and advance research in this area by providing a more realistic and diverse set of images with pixel-level annotations....

Fara-7B: An Efficient Agentic Model for Computer Use

Published at 2025-11-24

#ML

The authors present Fara-7B, a new computer-use agent model that can understand and interact with computers using only screenshots and predicted coordinates. They trained Fara-7B using FaraGen, a synthetic data generation system, and it outperforms other similar-sized models and competes with larger ones, demonstrating the power of large-scale data generation for efficient agentic models. The model and a new benchmark are made publicly available....

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

Published at 2025-11-24

#ML

The authors present GigaWorld-0, a new framework that uses synthetic data to train embodied AI, which interacts with the environment using vision, language, and action. This framework, powered by efficient training methods, generates high-quality, diverse, and controllable data, enabling AI models to perform well in real-world tasks without any real-world training....

HunyuanOCR Technical Report

Published at 2025-11-24

#ML

HunyuanOCR is a lightweight, open-source Vision-Language Model designed for OCR tasks that outperforms larger models and commercial APIs. It offers a unified, efficient approach and a streamlined architecture, and it uses data-driven and RL strategies, making it a top-tier choice for both research and industrial applications....

MagicWorld: Interactive Geometry-driven Video World Exploration

Published at 2025-11-24

#ML

The proposed MagicWorld model improves upon existing interactive video world models by integrating 3D geometric information and historical data retrieval. It uses user actions to create a point cloud for consistent viewpoint transitions and stores relevant historical frames to maintain scene information and reduce errors in scene evolution....

MedSAM3: Delving into Segment Anything with Medical Concepts

Published at 2025-11-24

#ML

The study presents MedSAM-3, a text-guided medical image and video segmentation model that improves upon existing methods by using semantic conceptual labels and open-vocabulary text descriptions, resulting in superior performance across various medical imaging modalities....

ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Published at 2025-11-24

#ML

The authors propose a new method called ReDirector to create video retakes of any length using camera control for dynamically captured videos. They also present Rotary Camera Encoding (RoCE), which captures and integrates multi-view relationships between the input and target videos, improving object localization and background preservation while ensuring camera controllability and geometric consistency....

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Published at 2025-11-24

#ML

This study presents VISTA-Gym, a new training platform that enhances the reasoning abilities of vision-language models by incorporating visual tools, allowing them to better interact with and understand real-world visual tasks. By training a model called VISTA-R1 using VISTA-Gym, the authors demonstrate significant improvements in performance on various visual reasoning benchmarks compared to existing models....

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

Published at 2025-11-24

#ML

SteadyDancer is a new framework for human image animation that preserves the first-frame identity and controls motion precisely. It does this by reconciling conflicting conditions, generating adaptive pose representations, and using a specialized training pipeline, resulting in better performance while using fewer resources than existing methods....

Concept-Aware Batch Sampling Improves Language-Image Pretraining

Published at 2025-11-25

#ML

The authors present DataConcept, a large dataset of web-crawled image-text pairs with detailed concept annotations. They then introduce Concept-Aware Batch Sampling (CABS), a flexible framework for creating training batches based on specific concept distributions, which significantly improves the performance of vision-language models on various benchmarks....

Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization

Published at 2025-11-25

#ML

The authors present a new method called DPP-GRPO to improve diversity in video generation from text prompts. This method uses Determinantal Point Processes and Group Relative Policy Optimization to enforce diversity in video generation, which works across different visual elements and camera motions without reducing quality....
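
For readers unfamiliar with the two ingredients named in this summary, here is a minimal Python sketch of each in isolation. It is not the authors' implementation, and the summary does not say how the paper combines them. The sketch assumes each generated video is summarized by a feature vector; the function names (`gram`, `log_det`, `group_relative_advantages`) and the `jitter` and `eps` constants are hypothetical choices made for illustration. A determinantal point process favors sets whose similarity kernel has a large determinant, and GRPO-style training normalizes each reward against the other samples in its group.

```python
import math

def gram(feats):
    """Similarity kernel K[i][j] = dot(feats[i], feats[j])."""
    return [[sum(a * b for a, b in zip(u, v)) for v in feats] for u in feats]

def log_det(mat, jitter=1e-6):
    """log|det(mat + jitter * I)| via Gaussian elimination with partial pivoting.

    The jitter (an illustrative assumption) keeps the value finite when two
    feature vectors are identical and the kernel is singular.
    """
    n = len(mat)
    a = [[mat[i][j] + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    total = 0.0
    for k in range(n):
        # Pick the largest remaining pivot in column k for numerical stability.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if abs(a[p][k]) < 1e-12:
            return float("-inf")
        a[k], a[p] = a[p], a[k]
        total += math.log(abs(a[k][k]))
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
    return total

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Two orthogonal feature vectors versus two identical ones.
diverse = [[1.0, 0.0], [0.0, 1.0]]
similar = [[1.0, 0.0], [1.0, 0.0]]
diverse_score = log_det(gram(diverse))
similar_score = log_det(gram(similar))

# Rewards for three samples generated from the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

In this toy setup the orthogonal pair receives the higher log-determinant score, and the advantages are centered on zero within the group.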

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Published at 2025-11-25

#ML

The study presents UniSandbox, a framework to examine the relationship between understanding and generation in Unified Multimodal Models. Experiments show a gap between understanding and generation, which can be reduced using Chain-of-Thought in reasoning tasks and self-training, providing insights for future model designs....

MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

Published at 2025-11-25

#ML

The authors present MajutsuCity, a framework for generating realistic 3D cities using natural language, which offers controllable layouts, assets, and materials. This method improves geometric fidelity, stylistic adaptability, and semantic controllability in 3D city generation, outperforming existing methods in various evaluation metrics....

OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

Published at 2025-11-25

#ML

The researchers present a new system called OmniAlpha, which is capable of generating and editing RGBA images through a unified, multi-task framework. This system outperforms other specialized models in various tasks, such as mask-free matting and layer-conditioned completion, by learning a shared representation for RGBA images....

PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

Published at 2025-11-25

#ML

The study presents a new framework called PhysChoreo that creates realistic videos with controlled physical properties by estimating initial physical properties of objects in an image and using a physically editable simulation to generate dynamic behaviors. Experiments show that PhysChoreo outperforms existing methods in generating physically realistic videos with diverse controllability....

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Published at 2025-11-25

#ML

This study presents SSA, a new training framework for sparse attention in large language models that addresses the paradox of low sparsity in native sparse-attention methods. By enforcing bidirectional alignment between sparse and full attention at every layer, SSA preserves gradient flow and promotes stronger sparsity, resulting in state-of-the-art performance and flexible compute-performance trade-offs....

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Published at 2025-11-25

#ML

The authors present STARFlow-V, a new video generation model based on normalizing flows, which offers benefits like end-to-end learning and native likelihood estimation. This model improves upon its predecessor by using a global-local architecture to manage spatiotemporal complexity and a lightweight causal denoiser to enhance video generation consistency, resulting in strong visual quality and temporal consistency compared to diffusion-based baselines....

Soft Adaptive Policy Optimization

Published at 2025-11-25

#ML

The authors present a new method, Soft Adaptive Policy Optimization (SAPO), for policy optimization in reinforcement learning with large language models. SAPO improves upon existing methods by using a temperature-controlled gate to adaptively attenuate off-policy updates, maintaining sequence-level coherence and token-adaptive learning, leading to better stability and performance in mathematical reasoning benchmarks....

UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

Published at 2025-11-25

#ML

The authors tackle the problem of video diffusion transformers performing poorly beyond their training length by analyzing attention maps. They find that a problem called attention dispersion causes the models to fail and propose a new method, UltraViCo, to fix it. UltraViCo improves the models' performance in creating longer videos and enhances image quality, outperforming other methods....

Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation

Published at 2025-11-25

#ML

This study presents a new two-step method to accurately track the 3D motion of a ping pong ball from regular videos, which is difficult due to real-world challenges. The method separates the problem into two tasks and uses a newly created dataset for training, resulting in a practical and robust application for analyzing ping pong ball trajectories and spin....

VQ-VA World: Towards High-Quality Visual Question-Visual Answering

Published at 2025-11-25

#ML

The authors present VQ-VA World, a framework for creating high-quality image-text samples for training open-source models to answer visual questions with images, improving performance on the IntelligentBench benchmark and narrowing the gap with proprietary systems....

iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Published at 2025-11-25

#ML

The authors propose a method to combine powerful video models and image data to create a unified framework for generating image sets with natural transitions and a wide range of dynamics. Their approach allows the model to perform various image generation and editing tasks without losing its original motion capabilities, resulting in scenes with extraordinary dynamics....

Tags are generated by Google's Gemini Pro API, and the summary and translation are generated by Upstage's SOLAR mini chat model derived from SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
