🤗 Daily Paper(2025-09-12)

deep.di...@gmail.com

Sep 12, 2025, 4:06:52 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

2D Gaussian Splatting with Semantic Alignment for Image Inpainting

Published at 2025-09-02

#ML

This study presents a new method for image inpainting using 2D Gaussian Splatting, which creates a smooth and continuous representation of images. The method uses a differentiable process to fill in missing parts of an image while maintaining overall context and detail, and it incorporates features from a pretrained model to ensure the filled-in content matches the surrounding scene....

Read More
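The core idea above — representing an image as a sum of smooth 2D Gaussians so that missing regions can be filled by gradient-based optimization — can be illustrated with a toy sketch. This is not the paper's implementation; the function names and the per-axis isotropic parameterization are illustrative assumptions.

```python
import math

def gaussian2d(px, py, cx, cy, sx, sy):
    # Axis-aligned 2D Gaussian evaluated at pixel (px, py);
    # (cx, cy) is the center, (sx, sy) the per-axis spread.
    return math.exp(-((px - cx) ** 2 / (2 * sx ** 2)
                      + (py - cy) ** 2 / (2 * sy ** 2)))

def render_pixel(px, py, gaussians):
    # Image value at a pixel as a weighted sum of 2D Gaussians.
    # Each gaussian is (cx, cy, sx, sy, weight). Every term is smooth
    # in its parameters, so the render is differentiable and the
    # Gaussians covering a masked region can be fit by gradient descent.
    return sum(w * gaussian2d(px, py, cx, cy, sx, sy)
               for (cx, cy, sx, sy, w) in gaussians)
```

At a Gaussian's center the exponent vanishes, so the pixel value reduces to the weight; far from all centers the contribution decays smoothly to zero, which is what gives the representation its continuity.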

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Published at 2025-09-06

#ML

The study explores new data poisoning attacks on advanced Large Language Models (LLMs) that utilize step-by-step reasoning. These attacks, called 'decomposed reasoning poison,' are more complex and harder to activate than previous ones, suggesting that LLMs may have an inherent ability to resist such attacks due to their reasoning capabilities and architecture....

Read More

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Published at 2025-09-07

#ML

This study presents Ego3D-Bench, a new benchmark for evaluating spatial reasoning abilities of Vision-Language Models in real-world, multi-view environments, and introduces Ego3D-VLM, a post-training framework that significantly enhances 3D spatial reasoning of these models....

Read More

All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching

Published at 2025-09-08

#ML

The authors describe an open-source system that uses artificial intelligence to find and fix security vulnerabilities in software, demonstrating its capability by discovering 28 vulnerabilities and patching 14 in a cybersecurity competition. They also introduce a public leaderboard to compare the performance of different AI models in vulnerability detection and patching tasks....

Read More

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Published at 2025-09-08

#ML

The study presents mmBERT, a new multilingual language model trained on vast multilingual text data, which outperforms previous models in classification and retrieval tasks, even in low-resource languages, by incorporating novel techniques like inverse mask ratio schedule and inverse temperature sampling ratio....

Read More
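Temperature-based language sampling, which mmBERT's annealed language learning builds on, is easy to sketch: sampling probabilities are proportional to corpus counts raised to a temperature exponent, and lowering that exponent over training boosts low-resource languages. A minimal illustration (the function name and exact annealing schedule are assumptions, not the paper's code):

```python
def temperature_sample_probs(counts, tau):
    # Language sampling probabilities p_i proportional to counts[i] ** tau.
    # tau = 1 reproduces the raw corpus distribution; tau -> 0 flattens it,
    # so annealing tau downward shifts sampling weight toward
    # low-resource languages as training progresses.
    weighted = [c ** tau for c in counts]
    total = sum(weighted)
    return [w / total for w in weighted]
```

For a corpus of 1000 high-resource and 10 low-resource documents, tau = 1 samples the low-resource language about 1% of the time, while tau = 0.3 raises that to roughly 20%.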

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Published at 2025-09-09

#ML

The authors present AU-Harness, a new and efficient toolkit for evaluating large audio language models (LALMs). This toolkit addresses current evaluation challenges like slow processing, inconsistent prompting, and limited task coverage by providing faster processing, standardized prompting, and new evaluation categories, revealing key gaps in current LALMs....

Read More

The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

Published at 2025-09-09

#ML

The researchers address a common issue in fine-tuning large language models using reinforcement learning, where the model's performance on multiple attempts worsens over time despite improvements in single-attempt accuracy. They propose a new framework, DPH-RL, which uses a special type of mathematical measure called f-divergence to help the model retain its diverse knowledge base and improve both single and multiple attempt performance, while also being more efficient to train....

Read More
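The role of the divergence choice can be sketched with a toy example: regularizing the reward with a forward KL (one member of the f-divergence family) penalizes the policy for dropping answers the reference model still covers, which counteracts diversity collapse. This is an illustrative sketch under assumed names and a fixed penalty weight, not DPH-RL's actual objective.

```python
import math

def forward_kl(p_ref, q_cur, eps=1e-12):
    # Forward KL D(p_ref || q_cur) over a discrete set of answers.
    # It is mass-covering: the current policy is heavily penalized for
    # assigning near-zero probability to answers the reference covers,
    # which is what discourages collapsing onto a single solution.
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_ref, q_cur))

def regularized_objective(reward, p_ref, q_cur, beta=0.1):
    # Verifiable reward minus a beta-weighted divergence penalty.
    # Swapping the divergence (forward vs reverse KL, other f-divergences)
    # changes whether training mode-seeks or mode-covers.
    return reward - beta * forward_kl(p_ref, q_cur)
```

With this penalty, a policy that collapses onto one answer scores lower than one that keeps the reference's diversity, even at equal single-attempt reward — the intuition behind improving multi-attempt (pass@k) performance.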

Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

Published at 2025-09-10

#ML

This study enhances the Contrastive Language-Image Pre-training (CLIP) model for person representation learning by creating a large, high-quality person-centric dataset called WebPerson and introducing a new framework called GA-DMS. GA-DMS improves cross-modal alignment by masking noisy textual tokens and incorporating masked token prediction objectives, leading to better fine-grained semantic representation learning and state-of-the-art performance on various benchmarks....

Read More

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Published at 2025-09-10

#ML

The authors present HuMo, a framework for generating human-centric videos using collaborative multimodal conditioning. They tackle challenges in coordinating heterogeneous modalities by constructing a high-quality dataset and proposing a two-stage training paradigm with task-specific strategies for subject preservation and audio-visual sync, resulting in a unified framework that outperforms specialized state-of-the-art methods....

Read More

Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation

Published at 2025-09-10

#ML

The authors present MambaRec, a new framework for multimodal recommendation systems that uses the Dilated Refinement Attention Module to align fine-grained semantic patterns and improve cross-modal semantic modeling. They also apply MMD and contrastive loss functions to enhance global modality alignment and robustness, while a dimensionality reduction strategy improves scalability. MambaRec outperforms existing methods in fusion quality, generalization, and efficiency on real-world e-commerce datasets....

Read More
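The MMD loss used for global modality alignment has a compact definition worth seeing: the squared maximum mean discrepancy between two sets of embeddings (e.g., image-side and text-side) under a kernel, driven toward zero when the two distributions match. A self-contained sketch with a Gaussian kernel — illustrative only, not MambaRec's code:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # RBF kernel between two embedding vectors.
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimator of squared MMD between sample sets X and Y:
    # mean within-X kernel + mean within-Y kernel - 2 * mean cross kernel.
    # Zero when the two embedding distributions coincide; minimizing it
    # pulls the modalities into a shared representation space.
    kxx = sum(gaussian_kernel(a, b, sigma) for a in X for b in X) / len(X) ** 2
    kyy = sum(gaussian_kernel(a, b, sigma) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(gaussian_kernel(a, b, sigma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy
```

Used as a training loss, this term complements the contrastive objective: the contrastive loss aligns individual pairs, while MMD aligns the modality distributions as a whole.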

Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Published at 2025-09-11

#ML

The study presents a new framework, UAE, that combines image understanding and generation using auto-encoder technology. By training both processes to work together, the framework improves the accuracy and detail of both tasks, leading to better image descriptions and more faithful image reconstructions....

Read More

Cross-Domain Evaluation of Transformer-Based Vulnerability Detection on Open & Industry Data

Published at 2025-09-11

#ML

This study tests CodeBERT's ability to find bugs in both open-source and industrial software, then creates AI-DO, a tool that uses CodeBERT to detect and pinpoint vulnerabilities during code review in a CI/CD pipeline without interrupting workflows. The researchers found that models trained on industrial data perform well in the same domain but struggle with open-source code, while models fine-tuned on open data with undersampling techniques improve vulnerability detection....

Read More

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Published at 2025-09-11

#ML

The authors propose EchoX, a new method to improve speech-to-speech language models by bridging the gap between sound and meaning, resulting in better performance on knowledge-based tasks....

Read More

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Published at 2025-09-11

#ML

Researchers have created FLUX-Reason-6M, a large dataset of 6 million images and 20 million descriptions, to improve reasoning in text-to-image models. They've also developed PRISM-Bench, a testing standard with seven tracks, to better evaluate these models and identify areas for improvement. The dataset, benchmark, and evaluation code are available for the community to use and advance the field of reasoning-oriented text-to-image generation....

Read More

Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

Published at 2025-09-11

#ML

The study presents a new framework, Entropy-Modulated Policy Gradients (EMPG), that improves learning efficiency and stability in long-horizon tasks for Large Language Models (LLMs) by adjusting updates based on step-wise uncertainty and task outcomes. EMPG outperforms existing methods in experiments on complex agent tasks like WebShop, ALFWorld, and Deep Search....

Read More
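The idea of modulating policy-gradient updates by step-wise uncertainty can be sketched simply: weight each step's gradient contribution by a monotone function of the policy's entropy at that step, so confident steps dominate credit assignment over long horizons. The exponential weighting below is one simple choice for illustration, not EMPG's exact formulation.

```python
import math

def step_entropy(probs, eps=1e-12):
    # Shannon entropy of the policy's action distribution at one step.
    return -sum(p * math.log(p + eps) for p in probs if p > 0)

def empg_weight(probs, max_entropy, alpha=1.0):
    # Modulation factor in (0, 1]: low-entropy (confident) steps keep a
    # large weight, high-entropy (uncertain) steps are attenuated.
    # exp(-alpha * H / H_max) is an illustrative monotone choice.
    h = step_entropy(probs)
    return math.exp(-alpha * h / max_entropy)

def modulated_gradient_scale(probs, advantage, max_entropy):
    # Scale a step's policy-gradient contribution by its confidence
    # weight; the sign of the advantage (task outcome) is preserved,
    # so confident errors are still penalized, just not diluted by
    # uniform credit spread across uncertain steps.
    return empg_weight(probs, max_entropy) * advantage
```

Over a long trajectory, this concentrates both reward and penalty on decisive steps, which is the stability mechanism the summary describes.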

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Published at 2025-09-11

#ML

The authors present a new method called Kling-Avatar that creates realistic and expressive long-duration avatar videos by combining multimodal instructions with photorealistic portrait generation. This approach allows for more coherent and engaging avatar animations, and it can be used in real-world applications like digital human livestreaming and vlogging....

Read More

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

Published at 2025-09-11

#ML

LoCoBench is a new benchmark for evaluating long-context language models in complex software development scenarios, focusing on understanding entire codebases and reasoning across multiple files, which existing benchmarks do not cover. It includes 8,000 scenarios in 10 programming languages, with context lengths ranging from 10K to 1M tokens, and 8 task categories to assess long-context capabilities, revealing significant performance gaps in state-of-the-art models....

Read More

ObjectReact: Learning Object-Relative Control for Visual Navigation

Published at 2025-09-11

#ML

This research proposes a new method for visual navigation that uses objects in the environment to guide movement, instead of relying solely on images. This approach allows for more flexible and adaptable navigation, as it can traverse new routes without imitating prior experience and is less affected by changes in sensors or mapping conditions. The method is effective in both simulation and real-world indoor environments....

Read More

OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

Published at 2025-09-11

#ML

The study presents OmniEVA, a versatile planner that enhances embodied reasoning and task planning through two innovations: a mechanism that selectively regulates 3D fusion based on contextual requirements and a framework that incorporates task goals and embodiment constraints into the reasoning loop. This results in improved adaptability and practical feasibility for embodied tasks, outperforming state-of-the-art models and demonstrating robust planning capabilities in various scenarios....

Read More

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Published at 2025-09-11

#ML

The authors present SimpleVLA-RL, an efficient reinforcement learning framework designed for Vision-Language-Action models. This framework improves long-horizon planning, outperforms supervised fine-tuning, and reduces the need for large-scale data, while also identifying a new phenomenon called 'pushcut' during training....

Read More

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Published at 2025-09-11

#ML

The researchers have created a new dataset called SpatialVID, which includes over 21,000 hours of raw video, processed into 2.7 million clips with detailed spatial and semantic information like camera poses, depth maps, and motion instructions, to help improve the scalability and real-world performance of spatial intelligence models....

Read More

Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Published at 2025-09-11

#ML

This study presents MMOral, a new multimodal instruction dataset and benchmark designed specifically for analyzing panoramic X-rays in dentistry. The dataset contains 20,563 annotated images with various tasks, and the authors fine-tuned a model called OralGPT, which significantly improved performance on this task....

Read More

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Published at 2025-09-11

#ML

The researchers propose VLA-Adapter, a new method that enhances the performance of vision-language-action models without requiring extensive pre-training or large-scale models. This approach uses a lightweight model with a special attention mechanism to connect visual and language information to actions, resulting in state-of-the-art performance and fast inference speeds, all while being trained quickly on a single GPU....

Read More

Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

Published at 2025-09-11

#ML

The study proposes a new method called Visual Programmability, which enables Vision-Language Models to choose between using code or direct visual analysis when interpreting charts. This adaptive approach improves the models' reasoning abilities and prevents them from relying on a single strategy, leading to better performance on various chart-understanding tasks....

Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Fb X In