🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter delivers you the curated list of papers by 🤗 Daily Papers. |
|
LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer |
Published at 2025-08-01 |
#ML
|
The study presents LAMIC, a framework for generating coherent images from multiple references without training, using two attention mechanisms for better layout awareness and entity disentanglement. LAMIC outperforms existing methods in various metrics, demonstrating strong generalization abilities and paving the way for a new paradigm in controllable multi-image composition.... |
|
Representation Shift: Unifying Token Compression with FlashAttention |
Published at 2025-08-01 |
#ML
|
The authors present Representation Shift, a new method that measures changes in token representation to unify token compression with FlashAttention, making it compatible with GPU memory access optimization. This approach, which works for Transformers, CNNs, and state space models, significantly speeds up video-text retrieval and video QA tasks by up to 5.5% and 4.4% respectively, without requiring attention maps or retraining.... |
|
The Promise of RL for Autoregressive Image Editing |
Published at 2025-08-01 |
#ML
|
The study investigates three methods to improve image editing, focusing on a unified autoregressive model for text and visual data. The research finds that reinforcement learning combined with a large multimodal language model is the most effective, leading to the development of EARL, a strong RL-based image editing model that outperforms baselines with less training data.... |
|
UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation |
Published at 2025-08-01 |
#ML
|
The authors present a new model called UniEgoMotion that can predict, generate, and reconstruct human motion from a first-person perspective, which is essential for improving AR/VR experiences and human-robot interaction. Unlike previous models, UniEgoMotion considers the scene context from first-person images to create realistic 3D motion, and it has achieved top performance in egocentric motion reconstruction and generating motion from a single egocentric image.... |
|
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? |
Published at 2025-08-03 |
#ML
|
The study presents LiveMCPBench, a comprehensive benchmark with 95 real-world tasks, 70 MCP servers, and 527 tools to evaluate LLM agents in large-scale, real-world scenarios. It also introduces an LLM-as-a-Judge framework for automated evaluation and proposes the MCP Copilot Agent for dynamic planning and tool execution, demonstrating the performance variance across 10 leading models in complex, tool-rich environments.... |
|
AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization |
Published at 2025-08-04 |
#ML
|
The authors present a new method called AlignGuard-LoRA that helps maintain the safety and behavioral constraints of large language models during fine-tuning. This approach uses various techniques, such as Fisher Information Matrix-based regularization and collision-aware regularization, to minimize alignment drift and ensure that new knowledge is integrated without weakening existing safety features.... |
|
CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search |
Published at 2025-08-04 |
#ML
|
The study presents CRINN, a new approach to approximate nearest neighbor search that uses LLMs trained with reinforcement learning to optimize the execution speed of search algorithms while maintaining accuracy. Experimental results show that CRINN outperforms or matches state-of-the-art algorithms on several benchmarks, demonstrating the potential of LLMs and reinforcement learning for automating algorithmic optimizations.... |
|
HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents |
Published at 2025-08-04 |
#ML
|
The study presents HyCodePolicy, a framework that combines language, geometry, and perception to improve decision-making in embodied agents. It decomposes instructions into subgoals, generates executable programs, and uses vision-language models to monitor and repair the programs during task completion, resulting in more robust and efficient robot manipulation policies.... |
|
Multi-human Interactive Talking Dataset |
Published at 2025-08-04 |
#ML
|
This study presents MIT, a large-scale dataset designed for generating realistic multi-human talking videos, filling a gap in existing research focused on single-person monologues or isolated facial animations. The dataset includes 12 hours of high-resolution footage with two to four speakers, capturing natural conversational dynamics and offering a valuable resource for studying interactive visual behaviors.... |
|
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference |
Published at 2025-08-04 |
#ML
|
The researchers created a fast language model called Seed Diffusion Preview, which generates tokens in parallel instead of one at a time, making it faster than other models like Mercury and Gemini Diffusion.... |
|
TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs |
Published at 2025-08-04 |
#ML
|
The study presents a framework called TraceAlign to identify and mitigate alignment failures in language models, which occur when the models generate unsafe or policy-violating completions. The framework uses a Belief Conflict Index to trace these failures back to their root causes in the model's training data and proposes three interventions to reduce alignment drift, improving safety and preserving utility.... |
|
Tool-integrated Reinforcement Learning for Repo Deep Search |
Published at 2025-08-04 |
#ML
|
The study presents ToolTrain, a two-stage training framework that uses supervised fine-tuning and reinforcement learning to improve language models in locating code issues by utilizing repository retrieval tools. Experimental results show that ToolTrain-trained models outperform state-of-the-art models, including Claude-3.7, in function-level localization and end-to-end issue resolution.... |
|
TreeRanker: Fast and Model-agnostic Ranking System for Code Suggestions in IDEs |
Published at 2025-08-04 |
#ML
|
The authors present a new method that uses language models to rank code suggestions in IDEs: it organizes the candidate completions into a tree and scores them all in a single pass. This approach is fast, compatible with existing models, and provides more accurate and responsive developer assistance.... |
|
What Is Your AI Agent Buying? Evaluation, Implications and Emerging Questions for Agentic E-Commerce |
Published at 2025-08-04 |
#ML
|
This study explores how AI agents make purchasing decisions in online marketplaces, creating a simulation environment to analyze their behavior. The findings reveal that AI agents' shopping habits vary, with preferences for product positioning, pricing, and endorsements differing across models, and suggest potential strategies for sellers and considerations for platform design and regulation in AI-mediated e-commerce.... |
|
ChartCap: Mitigating Hallucination of Dense Chart Captioning |
Published at 2025-08-05 |
#ML
|
The study presents ChartCap, a large-scale dataset of 565K real-world chart images with detailed, accurate captions, created using a four-stage pipeline that excludes unnecessary information and highlights important elements. The researchers also introduce a new metric, Visual Consistency Score, to evaluate caption quality, and their experiments show that models trained on ChartCap produce more accurate and informative captions with fewer errors compared to other models.... |
|
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward |
Published at 2025-08-05 |
#ML
|
The authors present CompassVerifier, an accurate and robust lightweight verifier model for evaluating large language models and providing outcome rewards. CompassVerifier can handle various answer types and identify incorrect responses across multiple domains, and it is trained using the VerifierBench benchmark, which includes model outputs from different sources and manual analysis of error patterns.... |
|
Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction |
Published at 2025-08-05 |
#ML
|
The study presents Goedel-Prover-V2, a new open-source language model for automated theorem proving, which outperforms existing models by incorporating scaffolded data synthesis, verifier-guided self-correction, and model averaging techniques.... |
|
LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation |
Published at 2025-08-05 |
#ML
|
The study identifies and solves three main problems in creating long, controllable videos: inconsistent noise initialization, unaligned control signals, and limitations of single-modality guidance. The proposed LongVie framework addresses these issues through a unified noise initialization strategy, global control signal normalization, a multi-modal control framework, and a degradation-aware training strategy, resulting in high-quality, long videos with improved controllability and consistency.... |
|
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation |
Published at 2025-08-05 |
#ML
|
Skywork UniPic is a unified autoregressive model for visual understanding and generation, which achieves state-of-the-art performance on various benchmarks using a novel decoupled encoding strategy, a progressive training schedule, and meticulously curated datasets. This model demonstrates that high-fidelity multimodal integration can be achieved without requiring excessive resources, making it a practical paradigm for deployable, high-fidelity multimodal AI.... |
|
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the open SOLAR-10.7B LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |