🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter delivers a curated list of papers from 🤗 Daily Papers. |
|
RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards |
Published at 2025-11-29 |
|
#ML
|
The study presents RealGen, a framework for generating photorealistic images from text, which addresses the issue of 'fake' images produced by existing models. RealGen uses an LLM component for prompt optimization and a diffusion model for image generation, and introduces a 'Detector Reward' mechanism to enhance image realism and detail. The framework significantly outperforms other models in realism, detail, and aesthetics.... |
|
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks |
Published at 2025-12-02 |
|
#ML
|
The study presents CAPO, a new method for training large language models that improves their reasoning abilities by using advantage signals. CAPO first learns from positive signals to build a strong foundation, then incorporates negative signals to enhance the model's ability to generalize across various complex scenarios, leading to stable and significant improvements in mathematical reasoning tasks and multimodal GUI reasoning scenarios.... |
|
PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling |
Published at 2025-12-02 |
|
#ML
|
The researchers present a new framework, PaCo-RL, for generating consistent images using reinforcement learning. This framework includes a pairwise consistency evaluator, PaCo-Reward, and an efficient RL algorithm, PaCo-GRPO, which together improve alignment with human perceptions and achieve state-of-the-art performance in consistent image generation.... |
|
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning |
Published at 2025-12-02 |
|
#ML
|
The authors propose a new method called ReVSeg for video object segmentation that uses pretrained vision language models to break down the task into three explicit operations: semantics interpretation, temporal evidence selection, and spatial grounding. They then use reinforcement learning to optimize the multi-step reasoning chain, resulting in state-of-the-art performance and interpretable reasoning trajectories.... |
|
Self-Improving VLM Judges Without Human Annotations |
Published at 2025-12-02 |
|
#ML
|
The authors propose a new framework to train a Vision-Language Model (VLM) judge without human annotations, using only self-generated data. This method improves a VLM judge's accuracy on various tasks compared to larger models, showing the potential for a self-improving judge that evolves with VLM capabilities.... |
|
From FLOPs to Footprints: The Resource Cost of Artificial Intelligence |
Published at 2025-12-03 |
|
#ML
|
This study measures the environmental impact of AI training by analyzing the material composition of Nvidia A100 GPUs and linking it to computational workloads. Results show that AI hardware consists mainly of heavy metals and that improving computational efficiency and extending hardware lifespan can significantly reduce material demands and environmental footprint.... |
|
M3DR: Towards Universal Multilingual Multimodal Document Retrieval |
Published at 2025-12-03 |
|
#ML
|
The authors present M3DR, a universal multilingual multimodal document retrieval framework that effectively aligns text and images across 22 diverse languages, significantly improving cross-lingual retrieval performance.... |
|
TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows |
Published at 2025-12-03 |
|
#ML
|
The study presents TwinFlow, a framework for creating one-step generative models that surpasses existing methods in efficiency and scalability. TwinFlow achieves a high GenEval score with only one function evaluation, outperforming other baselines, and demonstrates its scalability by reducing computational cost by 100 times on a large-scale model with minimal quality loss.... |
|
AI & Human Co-Improvement for Safer Co-Superintelligence |
Published at 2025-12-04 |
|
#ML
|
The paper suggests that instead of focusing on AI self-improvement, which can be dangerous and time-consuming, we should aim for co-improvement. This means humans and AI systems working together to improve each other, specifically in conducting AI research, to create safer superintelligence through their collaboration.... |
|
COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence |
Published at 2025-12-04 |
|
#ML
|
The study presents COOPER, a unified model for spatial intelligence that improves 3D-aware reasoning in multimodal large language models. By leveraging depth and segmentation as auxiliary modalities and employing adaptive interleaved reasoning, COOPER enhances spatial perception and achieves better spatial reasoning performance compared to existing methods.... |
|
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture |
Published at 2025-12-04 |
|
#ML
|
The authors present EMMA, a single system for understanding, generating, and editing multimodal content. Its key features include an efficient autoencoder, channel-wise concatenation, a shared-and-decoupled network, and a mixture-of-experts mechanism, which together improve performance and efficiency, outperforming other unified multimodal approaches.... |
|
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model |
Published at 2025-12-04 |
|
#ML
|
The study presents a new benchmark, TAD, designed to evaluate how well vision-language models understand temporal dynamics in autonomous driving. The benchmark consists of over 6,000 questions and answers, focusing on the unique challenges of ego-centric autonomous driving footage. The researchers found that current state-of-the-art models struggle with this task due to imperfect motion understanding and propose two novel training-free solutions, Scene-CoT and TCogMap, to improve perform... |
|
Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image |
Published at 2025-12-04 |
|
#ML
|
The authors present a new method called MoRe4D that combines geometry reconstruction and motion generation for creating realistic 4D scenes from a single image. They introduce a large-scale dataset and a diffusion-based generator to produce consistent and plausible 4D point trajectories, resulting in high-quality dynamic scenes with multi-view consistency.... |
|
SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs |
Published at 2025-12-04 |
|
#ML
|
The authors propose a new data format called SQ-format that balances accuracy and efficiency for large language models by combining quantization and sparsification techniques. SQ-format is compatible with both new hardware and existing GPUs, enabling high-precision acceleration for sparse matrices and low-precision matrix multiplication, resulting in improved performance and throughput.... |
|
SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling |
Published at 2025-12-04 |
|
#ML
|
The study presents a new method, SpaceControl, which allows users to have precise control over 3D object geometry using various geometric inputs, such as basic shapes or detailed meshes, without needing any additional training. This approach outperforms other methods in maintaining geometric accuracy while ensuring high visual quality, and it also includes an interactive interface for easy editing and deployment in creative workflows.... |
|
Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models |
Published at 2025-12-04 |
|
#ML
|
The study presents Roblox Guard 1.0, a sophisticated language model that improves safety in large language models by moderating both inputs and outputs. This model, built on Llama-3.1-8B-Instruct, uses a combination of synthetic and open-source safety datasets, along with advanced techniques, to generalize across unseen safety categories and perform well on various safety benchmarks.... |
|
TimesNet-Gen: Deep Learning-based Site Specific Strong Motion Generation |
Published at 2025-12-04 |
|
#ML
|
The study presents TimesNet-Gen, a new method for generating strong earthquake motions that takes into account local site conditions. This approach, which uses a station-specific latent bottleneck, effectively captures the characteristics of ground motion and outperforms a baseline model in generating site-specific strong motions.... |
|
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding |
Published at 2025-12-05 |
|
#ML
|
The authors propose a new method called Active Video Perception (AVP) that allows agents to interact with long videos to gather relevant information more efficiently. AVP improves performance on video understanding tasks by actively deciding what to observe, reducing computation time and improving accuracy compared to existing methods.... |
|
EditThinker: Unlocking Iterative Reasoning for Any Image Editor |
Published at 2025-12-05 |
|
#ML
|
The study presents a new method for improving image editing by using a single model, EditThinker, to simulate human-like thinking during the editing process. This approach significantly enhances the ability of image editing models to follow instructions, as demonstrated by extensive experiments on various benchmarks.... |
|
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning |
Published at 2025-12-05 |
|
#ML
|
The study presents a new method called Entropy Ratio Clipping (ERC) to improve the stability of reinforcement learning algorithms. ERC measures the change in policy exploration and imposes constraints to regulate probability shifts, resulting in improved performance across various benchmarks.... |
|
ProPhy: Progressive Physical Alignment for Dynamic World Simulation |
Published at 2025-12-05 |
|
#ML
|
The paper presents a new method called ProPhy to improve physical consistency in video generation, especially for large-scale or complex dynamics. ProPhy uses a two-stage system to learn detailed physics-based video representations and incorporates physical reasoning capabilities from vision-language models, resulting in more realistic and physically coherent videos.... |
|
SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations |
Published at 2025-12-05 |
|
#ML
|
The authors present SCAIL, a new framework for creating high-quality character animations that can handle complex motions and different characters seamlessly. They achieve this by introducing a new 3D pose representation and a full-context pose injection mechanism, which allow for more robust and flexible animations, and by developing a curated data pipeline and benchmark for evaluating the performance of their method.... |
|
World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty |
Published at 2025-12-05 |
|
#ML
|
The authors present a new method to improve controllable video generation by enabling the model to assess its uncertainty, thus reducing hallucinations. The proposed method, C3, introduces three innovations: a novel training framework, latent-space uncertainty estimation, and pixel-level uncertainty mapping, resulting in more accurate and interpretable uncertainty estimates.... |
|
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |