🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
Published at 2025-12-12

#ML

This survey provides a comprehensive guide to Vision-Language-Action models in robotics, tracing their development from basic modules to open challenges in representation, execution, generalization, safety, and data evaluation. It aims to help both newcomers and experienced researchers understand and advance the field of embodied intelligence.
Read More

MineTheGap: Automatic Mining of Biases in Text-to-Image Models
Published at 2025-12-15

#ML

The study presents MineTheGap, a method that uses a genetic algorithm to find prompts that cause text-to-image models to generate biased outputs. Unlike existing tools, it not only detects bias for a given prompt but also gets better at surfacing biases through an optimization process, helping to reduce societal harms and improve user experience.
Read More
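As a rough illustration of the approach described above, a genetic algorithm for prompt mining keeps a population of candidate prompts, scores each one, and breeds the highest scorers via crossover and mutation. The sketch below is not MineTheGap's implementation; the word list and `bias_score` fitness function are toy stand-ins for what would, in practice, come from scoring a text-to-image model's generations.

```python
import random

def bias_score(prompt: str) -> float:
    # Toy fitness stand-in: in a real pipeline this would analyze images
    # generated from the prompt. Here, profession words score higher.
    return sum(prompt.count(w) for w in ("nurse", "CEO", "engineer")) + random.random()

WORDS = ["a", "photo", "of", "the", "nurse", "CEO", "engineer", "smiling", "person"]

def random_prompt(length: int = 5) -> str:
    return " ".join(random.choices(WORDS, k=length))

def mutate(prompt: str) -> str:
    # Swap one random token for a random word.
    tokens = prompt.split()
    tokens[random.randrange(len(tokens))] = random.choice(WORDS)
    return " ".join(tokens)

def crossover(a: str, b: str) -> str:
    # Splice a prefix of one parent onto a suffix of the other.
    ta, tb = a.split(), b.split()
    cut = random.randrange(1, min(len(ta), len(tb)))
    return " ".join(ta[:cut] + tb[cut:])

def mine_prompts(pop_size: int = 20, generations: int = 10) -> str:
    population = [random_prompt() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=bias_score, reverse=True)
        parents = scored[: pop_size // 2]  # selection: keep the top half
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=bias_score)

print(mine_prompts())
```

The key property the summary highlights is that the search itself improves over generations: high-fitness prompts dominate the population, so later candidates are biased-prompt variants rather than random guesses.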
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Published at 2025-12-16

#ML

HERBench is a new benchmark that tests how well video question answering models integrate multiple pieces of evidence across time. It contains 26,000 questions requiring an understanding of identity, relationships, order, and counting, and it scores models by how many frames they must examine to answer correctly.
Read More

Are We on the Right Way to Assessing LLM-as-a-Judge?
Published at 2025-12-17

#ML

The authors present Sage, a new evaluation suite for LLM-as-a-Judge that requires no human annotation, addressing the bias and scalability issues of existing methods. Experiments demonstrate Sage's reliability and reveal consistency problems in current LLMs acting as judges, suggesting improvements such as finetuning and deeper reasoning.
Read More

Bolmo: Byteifying the Next Generation of Language Models
Published at 2025-12-17

#ML

The researchers created Bolmo, an open language model that operates on bytes instead of subwords, enabling better character-level understanding and greater efficiency. Bolmo is trained by converting existing subword-level language models and matches or exceeds their performance, with faster inference and lower training costs.
Read More

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Published at 2025-12-18

#ML

The authors present a new model, 4D-RGPT, and a training method, Perceptual 4D Distillation, to improve multimodal language models' understanding of 3D structure and temporal dynamics in videos. They also introduce R4D-Bench, a new benchmark for evaluating depth-aware dynamic scenes with region-level prompting, built through a combination of automated and human-verified processes. The proposed model and benchmark show significant improvements over existing methods.
Read More

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Published at 2025-12-18

#ML

The research presents LongShOTBench, a new benchmark for evaluating multimodal reasoning in long videos using vision, speech, and audio, and introduces LongShOTAgent, an agent that analyzes long videos. The results show that state-of-the-art models struggle with this task, highlighting the challenge of real-world long-form video understanding.
Read More

Animate Any Character in Any World
Published at 2025-12-18

#ML

The authors present AniX, a method that combines static world generation and controllable-entity models to create realistic videos of user-specified characters performing a wide range of actions in a given environment. AniX uses a pre-trained video generator and a new training strategy to improve motion dynamics and maintain versatility in actions and characters, resulting in high-quality, coherent videos.
Read More

Meta-RL Induces Exploration in Language Agents
Published at 2025-12-18

#ML

The study presents LaMer, a Meta-RL framework that enhances large language model agents' ability to explore and learn from their environment. LaMer's components encourage exploration and adapt policies using feedback, resulting in improved performance and better generalization to new tasks compared to traditional RL methods.
Read More

PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Published at 2025-12-18

#ML

The study presents a method for improving physical intelligence in robots using large-scale human egocentric videos, which capture interaction context and causal structure. The proposed Egocentric2Embodiment translation pipeline transforms first-person videos into structured supervision, enabling the creation of the E2E-3M dataset. The resulting egocentric-aware embodied brain, PhysBrain, demonstrates better understanding and planning in egocentric scenarios, facilitating more efficient robot co...
Read More

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Published at 2025-12-18

#ML

The authors propose a new way to measure a machine's ability to perform scientific work, called Scientific General Intelligence (SGI), via a benchmark with four tasks: deep research, idea generation, experiments, and reasoning. They find that current large language models struggle with these tasks, especially at generating feasible ideas and accurately executing experiments, and introduce a new method to improve hypothesis generation.
Read More

StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
Published at 2025-12-18

#ML

The research proposes StageVAR, a method to improve the efficiency of visual autoregressive models, which generate high-quality images through next-scale prediction. StageVAR observes that early stages are crucial for maintaining image consistency, while later stages mainly refine details. Exploiting this, it accelerates generation by up to 3.4x without significantly affecting image quality.
Read More

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Published at 2025-12-18

#ML

This study compares two reinforcement learning algorithms, GRPO and PPO, for training LLM agents on multi-turn tasks, and introduces a new variant called turn-PPO that improves performance by operating on a turn-level MDP formulation. The researchers find turn-PPO more effective than both GRPO and standard PPO for multi-turn tasks, as demonstrated through experiments on the WebShop and Sokoban environments.
Read More

3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework
Published at 2025-12-19

#ML

The authors present 3D-RE-GEN, a new framework that reconstructs textured 3D objects and a background from a single image, addressing current limitations in 3D scene reconstruction for artists in visual effects and game development. The method combines state-of-the-art models with a novel optimization technique to produce coherent, modifiable scenes with accurate spatial relationships and a complete background.
Read More

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Published at 2025-12-19

#ML

This study addresses challenges in using high-dimensional features from representation encoders for text-to-image generation and editing. The researchers propose a framework that regularizes the latent space, enabling better compression of semantics and fine-grained detail, resulting in state-of-the-art image reconstruction and improved performance on text-to-image and editing tasks.
Read More

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Published at 2025-12-19

#ML

The study presents GroundingME, a new benchmark for evaluating the visual grounding capabilities of multimodal large language models (MLLMs) across four dimensions. The results show that most MLLMs struggle with real-world complexity, particularly in recognizing ungroundable queries, and suggest strategies for improvement.
Read More

Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
Published at 2025-12-19

#ML

The authors present Canon layers, a new architectural component for language models that improves reasoning and knowledge manipulation. These layers can enhance various sequence architectures, and their effectiveness is demonstrated through controlled synthetic pretraining tasks and real-world academic-scale pretraining.
Read More

RadarGen: Automotive Radar Point Cloud Generation from Cameras
Published at 2025-12-19

#ML

The authors have developed a new model called RadarGen that creates realistic radar point clouds from camera images for autonomous vehicles. It uses visual cues from pre-trained models to generate accurate radar patterns and is compatible with existing visual datasets, making it a scalable solution for multimodal simulation.
Read More

Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Published at 2025-12-19

#ML

The authors present a new framework called Robust-R1 that improves the performance of multimodal large language models on real-world visual tasks by explicitly modeling visual degradations. Robust-R1 uses a specialized dataset with realistic degradations and structured reasoning chains to outperform other models across a range of challenging visual tasks.
Read More

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
Published at 2025-12-19

#ML

The authors present SWE-Bench++, a framework that automates the creation of software engineering benchmarks from open-source GitHub projects. It covers various coding tasks in 11 languages using live pull requests, and its initial benchmark shows that even the strongest models struggle with these tasks, highlighting the need for improvement in repository-level code generation.
Read More

Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
Published at 2025-12-19

#ML

This study presents Seed-Prover 1.5, a formal theorem-proving model that uses large-scale agentic reinforcement learning and an efficient test-time scaling workflow. It solves a high percentage of undergraduate, graduate, and PhD-level problems, including 11 of 12 problems from Putnam 2025 within 9 hours, by learning from experience and bridging the gap between natural and formal languages.
Read More

When Reasoning Meets Its Laws
Published at 2025-12-19

#ML

This study proposes a framework called LoRe to understand and improve the reasoning abilities of large models. The framework comprises two laws, a compute law and an accuracy law, which help measure and enhance model performance, leading to better reasoning capabilities.
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the open SOLAR-10.7B LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the developer's social media