🤗 Daily Paper(2025-12-12)

deep.di...@gmail.com

Dec 12, 2025, 3:07:28 PM

to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Published at 2025-11-28

#ML

The authors present VQRAE, a new method that combines representation autoencoders with vector quantization to unify multimodal understanding, generation, and reconstruction. VQRAE uses a symmetric ViT decoder and a two-stage training strategy to create continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer, achieving competitive performance on various benchmarks....

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Published at 2025-12-04

#ML

The paper presents a method to create large-scale, diverse training data for intelligent humanoid robots by converting human videos into humanoid videos using a generative video editing approach. The approach, called X-Humanoid, adapts a powerful model to handle complex full-body motions and scene occlusions, and is trained using a scalable data creation pipeline. The resulting dataset outperforms existing baselines in terms of motion consistency and embodiment correctness....

BEAVER: An Efficient Deterministic LLM Verifier

Published at 2025-12-05

#ML

The authors present BEAVER, a framework that provides reliable and sound probability bounds for verifying that large language models adhere to required constraints, which is significantly more effective than existing methods....

Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents

Published at 2025-12-09

#ML

The authors propose a new framework called Fed-SE that enables language models to learn and adapt in multiple environments while maintaining privacy constraints. Fed-SE improves the success rate of tasks in various environments by 18% compared to existing methods, by allowing local learning and global sharing of knowledge in a way that reduces conflicts and negative transfer of information....

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

Published at 2025-12-09

#ML

The researchers present a new method called MoRel that efficiently models long-range motion in dynamic videos, addressing issues like memory explosion and flickering. MoRel uses anchor spaces and bidirectional blending to maintain temporal consistency and quality, and it introduces a new dataset for evaluating long-range 4D motion....

Thinking with Images via Self-Calling Agent

Published at 2025-12-09

#ML

This study presents a new visual reasoning method called sCoT that breaks down complex tasks into smaller ones and uses virtual agents to solve them, requiring less training time and data compared to existing methods. Experiments show that sCoT improves reasoning performance by up to 1.9% with 75% fewer GPU hours than baseline approaches....

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Published at 2025-12-10

#ML

The authors present a method to create realistic robot manipulation videos by translating human interaction videos, without needing paired human-robot videos for training. They use a transferable representation to bridge the gap between humans and robots, and fine-tune a state-of-the-art video diffusion model to generate high-quality robot videos that mimic human actions, achieving more realistic and grounded robot motions compared to baselines....

MOA: Multi-Objective Alignment for Role-Playing Agents

Published at 2025-12-10

#ML

The study presents MOA, a reinforcement-learning framework that optimizes multiple dimensions for role-playing agents, improving their performance in various tasks and scenarios compared to existing methods....

ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Published at 2025-12-10

#ML

The study presents a new task called Reason-Informed Video Editing (RVE) and a benchmark called RVE-Bench to improve reason-informed video editing in unified models. They propose ReViSE, a Self-Reflective Reasoning framework that enhances editing accuracy and visual fidelity by integrating reasoning with visual transformation, resulting in a 32% improvement over state-of-the-art methods....

Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Published at 2025-12-11

#ML

This study presents InternGeometry, an LLM agent that can solve over 80% of IMO geometry problems with minimal training data by proposing and verifying propositions, using a dynamic memory mechanism, and employing Complexity-Boosting Reinforcement Learning. InternGeometry also introduces novel auxiliary constructions for IMO problems and outperforms existing expert models on geometry tasks....

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Published at 2025-12-11

#ML

This study explores reinforcement learning for text-to-3D generation, investigating reward designs, RL algorithms, and benchmarks. The researchers introduce a new benchmark, MME-3DR, and propose Hi-GRPO, an RL paradigm for hierarchical 3D generation, resulting in the first RL-enhanced text-to-3D model, AR3D-R1....

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Published at 2025-12-11

#ML

The Confucius Code Agent is an open-source AI software engineer that can handle large-scale real-world tasks, built on the Confucius SDK which offers a unified orchestrator, persistent note-taking, and a modular extension module for robust tool use. This agent outperforms previous coding agents, achieving a state-of-the-art Resolve@1 performance of 54.3% on SWE-Bench-Pro....

Evaluating Gemini Robotics Policies in a Veo World Simulator

Published at 2025-12-11

#ML

This study presents a system that uses advanced video models to simulate and evaluate robotics policies in various scenarios, including those that are different from training conditions, to assess performance and safety. The system enables editing of scenes to include new objects and backgrounds, allowing for accurate prediction of policy performance in both normal and challenging situations, and helps identify unsafe behaviors....

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Published at 2025-12-11

#ML

The authors propose a benchmarking framework, MiSI-Bench, to evaluate the ability of Vision-Language Models in understanding microscopic spatial relationships in molecular structures. The results show that while current models struggle with this task, a fine-tuned model shows promise, highlighting the need for incorporating domain knowledge for advancements in scientific AI....

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Published at 2025-12-11

#ML

The authors present a new method called Outcome-based Process Verifier (OPV) that improves the accuracy and efficiency of verifying long reasoning processes in complex tasks by using an iterative active learning framework and expert annotations. OPV outperforms other models, achieving new state-of-the-art results and effectively detecting errors in synthetic datasets....

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Published at 2025-12-11

#ML

The authors present a new method called MoCapAnything that can capture and recreate 3D motions for any type of digital character using just a single video. This is achieved through a framework that predicts joint movements and recovers specific rotations for each character, allowing for high-quality animations and cross-species retargeting. (ELI5: Imagine being able to take a video of someone moving and use it to make any digital character, like animals or robots, move in the same way. This pape...

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

Published at 2025-12-11

#ML

The authors present a new verifier called OPV that accurately and efficiently checks the reasoning process of long chains of thought by using summarized outcomes. They improve the verifier's performance through an iterative active learning framework with expert annotations, which reduces the need for expensive human annotations. Experiments show OPV's superior performance and ability to detect errors, outperforming larger models and improving the accuracy of policy models....

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Published at 2025-12-11

#ML

The study presents a new open-vocabulary image attribute encoder called Omni-Attribute, which can separately learn and transfer specific image attributes like identity and style. This encoder is trained using a unique approach that involves semantically linked image pairs and a dual-objective training paradigm, allowing for high-fidelity, attribute-specific representations and achieving top performance in various visual concept personalization tasks....

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Published at 2025-12-11

#ML

The researchers present a new method called StereoSpace that generates stereo images from a single image without using depth maps or warping. This method is evaluated using new metrics that focus on how comfortable the stereo images are to view and how consistent the geometry is, and it outperforms other methods in generating sharp and robust stereo images for various scenes....

Stronger Normalization-Free Transformers

Published at 2025-12-11

#ML

This study explores alternatives to normalization layers in deep learning, focusing on point-wise functions. The researchers introduce a new function called Derf, which outperforms existing normalization methods across various domains, such as image recognition, speech representation, and DNA sequence modeling, due to its improved generalization capabilities....

T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

Published at 2025-12-11

#ML

The authors have developed an efficient Russian language model called T-pro 2.0, which can answer questions and show its thought process. They have made the model, a large training dataset, and other resources available to the public to promote research and development in Russian language reasoning....

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Published at 2025-12-11

#ML

The FACTS Leaderboard Suite is an online platform that evaluates the factual accuracy of large language models in various scenarios. It consists of four sub-leaderboards: Multimodal, Parametric, Search, and Grounding, each measuring different aspects of a model's factuality, and provides a comprehensive score to assess a model's overall performance....

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Published at 2025-12-11

#ML

The authors present a Video Toolkit and a Spatiotemporal Reasoning Framework (STAR) that improve the performance of Multimodal Large Language Models (MLLMs) in understanding and reasoning about dynamic real-world scenarios in videos. By strategically scheduling the use of spatial and temporal tools, the STAR framework enhances GPT-4o, resulting in better accuracy on VideoMME and LongVideoBench benchmarks....

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, which is derived from the SOLAR-10.7B open LLM.

(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
