🤗 Daily Paper (2025-08-29)


deep.di...@gmail.com

Aug 29, 2025, 4:07:13 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Published at 2025-08-21

#ML

The study presents TriMM, a new 3D generation model that uses multiple data types (like RGB images and point clouds) to create high-quality 3D assets. TriMM combines the strengths of different data types, improving both texture and detail, and performs well even with limited training data....
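
For readers curious what collaborative multi-modal feature fusion could look like in practice, here is a minimal PyTorch sketch: image tokens attend to point-cloud tokens so texture features can borrow geometric detail. The module, dimensions, and encoder choices are hypothetical, not TriMM's actual architecture.

```python
# Minimal sketch of cross-modal feature fusion in the spirit of TriMM
# (names and dimensions are hypothetical, not the paper's code).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.rgb_proj = nn.Linear(512, dim)    # e.g. features from an image encoder
        self.pc_proj = nn.Linear(384, dim)     # e.g. features from a point-cloud encoder
        # Cross-attention lets each modality borrow detail from the other.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, rgb_feats, pc_feats):
        q = self.rgb_proj(rgb_feats)           # (B, N_img, dim)
        kv = self.pc_proj(pc_feats)            # (B, N_pts, dim)
        fused, _ = self.cross_attn(q, kv, kv)  # texture tokens attend to geometry
        return fused                           # shared latent for a 3D decoder

fusion = MultiModalFusion()
latent = fusion(torch.randn(2, 196, 512), torch.randn(2, 1024, 384))
print(latent.shape)  # torch.Size([2, 196, 256])
```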

Read More

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Published at 2025-08-24

#ML

The study presents DuET-PD, a framework that evaluates Large Language Models in persuasive dialogues, focusing on their ability to resist misinformation while accepting valid corrections. The researchers found that even advanced models like GPT-4o struggle with this balance, especially under sustained misleading persuasion, and introduce a training method called Holistic DPO to improve models' robustness and adaptability in dialogues....
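
A toy version of the robustness half of such an evaluation, assuming a chat model callable as a plain function; the metric and turn templates are illustrative, not the paper's protocol.

```python
# How often does a model flip an initially correct answer under sustained
# misleading persuasion? `model` is a stand-in for a real chat call.
def flip_rate(model, items, persuader_turns):
    """Fraction of initially correct answers that flip under misleading pushes."""
    flips = total = 0
    for question, gold in items:
        history = [("user", question)]
        answer = model(history)
        if answer != gold:
            continue                       # only score initially correct cases
        total += 1
        for push in persuader_turns:       # sustained persuasion, turn by turn
            history += [("assistant", answer), ("user", push)]
            answer = model(history)
        flips += answer != gold
    return flips / max(total, 1)

stub_model = lambda history: "A" if len(history) < 4 else "B"  # caves on turn 2
print(flip_rate(stub_model, [("Which option is correct?", "A")],
                ["Are you sure? Most experts say B."] * 3))    # -> 1.0
```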

Read More

Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Published at 2025-08-24

#ML

The authors present Social-MAE, an improved audiovisual model based on CAV-MAE, which is pre-trained on a large dataset of human social interactions. This model excels in emotion recognition, laughter detection, and apparent personality estimation tasks, showcasing the power of self-supervised pre-training for social and affective understanding....
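
As a reminder of what MAE-style pre-training boils down to, here is a minimal masked-reconstruction step in PyTorch; the toy MLP encoder/decoder and 75% mask ratio stand in for the real audiovisual transformer.

```python
# Hide most patch tokens, reconstruct them, and take the loss only on the
# masked positions. Everything here is a simplified stand-in.
import torch
import torch.nn as nn

tokens = torch.randn(8, 196, 768)                 # (batch, patches, dim)
mask = torch.rand(8, 196) < 0.75                  # mask ~75% of tokens
encoder = nn.Sequential(nn.Linear(768, 256), nn.GELU())
decoder = nn.Linear(256, 768)

visible = tokens.clone()
visible[mask] = 0.0                               # crude masking for the sketch
recon = decoder(encoder(visible))
loss = ((recon - tokens)[mask] ** 2).mean()       # loss only on masked tokens
loss.backward()
print(float(loss))
```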

Read More

ROSE: Remove Objects with Side Effects in Videos

Published at 2025-08-25

#ML

The research proposes ROSE, a method for removing objects and their side effects, such as shadows and reflections, from videos. The authors use a 3D rendering engine to create synthetic training data and a diffusion transformer to erase objects along with their side effects; the approach outperforms existing methods and generalizes to real-world videos....
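
The synthetic-data recipe described above, schematically; the renderer call and field names are stand-ins, not the paper's pipeline.

```python
# Render each scene twice, with and without the target object, so the
# object-free render provides ground truth for the object *and* its
# shadows/reflections.
def render(scene, include_object):
    """Stub for a 3D-engine render call; returns a frame placeholder."""
    return f"frame({scene}, object={include_object})"

def make_training_pair(scene, object_mask):
    noisy = render(scene, include_object=True)     # input: object + side effects
    clean = render(scene, include_object=False)    # target: both removed
    return {"input": noisy, "mask": object_mask, "target": clean}

print(make_training_pair("kitchen_01", object_mask="binary mask"))
```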

Read More

USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Published at 2025-08-26

#ML

The study presents a new model, USO, that combines style-driven and subject-driven generation into one framework. USO uses a large-scale dataset, disentangled learning, and reward learning to improve performance and achieve state-of-the-art results in both style similarity and subject consistency....

Read More

TCIA: A Task-Centric Instruction Augmentation Method for Instruction Finetuning

Published at 2025-08-27

#ML

The authors present a new method called TCIA, which enhances instruction data for training language models, focusing on both diversity and relevance to specific tasks. Experiments show that TCIA significantly improves the performance of open-source language models in task-specific applications without losing their ability to follow general instructions....
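
A toy selection step illustrating TCIA's twin constraints of task relevance and diversity, using difflib's SequenceMatcher as a dependency-free stand-in for embedding similarity; the thresholds are made up.

```python
# Keep instruction variants that stay relevant to the target task but are
# diverse among themselves.
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def select(task, candidates, min_rel=0.3, max_dup=0.8, k=3):
    chosen = []
    for cand in sorted(candidates, key=lambda c: -sim(task, c)):
        if sim(task, cand) < min_rel:
            continue                                   # off-task: drop
        if any(sim(cand, c) > max_dup for c in chosen):
            continue                                   # near-duplicate: drop
        chosen.append(cand)
        if len(chosen) == k:
            break
    return chosen

task = "summarize a customer support ticket in two sentences"
cands = [
    "summarize this support ticket in two sentences",
    "summarize the customer support ticket briefly",
    "write a poem about tickets",
    "condense a support ticket into a two-sentence summary",
]
print(select(task, cands))
```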

Read More

AWorld: Orchestrating the Training Recipe for Agentic AI

Published at 2025-08-28

#ML

The authors present AWorld, an open-source system designed to improve the training of advanced AI systems by enhancing the efficiency of agent-environment interactions. AWorld significantly speeds up the process of collecting experience for AI training, enabling a notable improvement in the performance of a Qwen3-32B-based agent on the GAIA benchmark....
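
The throughput idea in miniature: collect rollouts in parallel rather than serially. The environment below is a stub; only the speedup pattern is the point, not AWorld's actual machinery.

```python
# Parallel experience collection with a thread pool. Each rollout stands in
# for a slow agent-environment interaction (tool calls, browsing, etc.).
import time
from concurrent.futures import ThreadPoolExecutor

def rollout(task_id):
    time.sleep(0.1)                 # stands in for slow env/tool interaction
    return {"task": task_id, "trajectory": [...], "reward": 1.0}

t0 = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    experience = list(pool.map(rollout, range(64)))
print(f"collected {len(experience)} rollouts in {time.time() - t0:.2f}s")
# Serial collection would take ~6.4s; 16 workers bring it near 0.4s.
```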

Read More

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Published at 2025-08-28

#ML

The authors present a new framework called CogVLA that improves the efficiency and performance of vision-language-action models by using instruction-driven routing and sparsification. This framework is inspired by human multimodal coordination and has three stages: Encoder-FiLM based Aggregation Routing, LLM-FiLM based Pruning Routing, and V-L-A Coupled Attention. CogVLA outperforms existing models in terms of success rates and reduces training costs and inference latency....
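
FiLM (feature-wise linear modulation), the building block the routing stages are named after, is easy to show concretely: a conditioning embedding produces a per-channel scale and shift applied to the features. Shapes here are illustrative, not CogVLA's.

```python
# FiLM: instruction embedding -> per-channel (scale, shift) over visual tokens.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # Broadcast over the token dimension: (B, 1, C) applied to (B, N, C)
        return (1 + scale.unsqueeze(1)) * feats + shift.unsqueeze(1)

film = FiLM(cond_dim=512, feat_dim=768)
vision_tokens = torch.randn(2, 196, 768)
instruction_emb = torch.randn(2, 512)
print(film(vision_tokens, instruction_emb).shape)  # torch.Size([2, 196, 768])
```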

Read More

Dress&Dance: Dress up and Dance as You Like It - Technical Preview

Published at 2025-08-28

#ML

The authors propose a new video framework called Dress&Dance that creates realistic 5-second virtual try-on videos of users wearing different clothes while moving, based on a single image of the user. The framework uses a novel network called CondNet, which combines information from different sources to improve the accuracy of the try-on and the movement, and can handle various types of clothing....

Read More

FakeParts: a New Family of AI-Generated DeepFakes

Published at 2025-08-28

#ML

The study presents FakeParts, a new class of deepfakes with subtle, targeted alterations that blend into real footage, making them hard to detect. To support detection of such partial deepfakes, the researchers built FakePartsBench, a dataset of over 25,000 videos with manipulation annotations; on it, partial manipulations significantly reduce detection accuracy compared to traditional deepfakes....

Read More

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Published at 2025-08-28

#ML

MCP-Bench is a new benchmark that tests large language models on complex, real-world tasks requiring tool use, planning, and reasoning. It connects language models to 28 live MCP servers exposing 250 tools across different domains, enabling authentic multi-step tasks and evaluating models' tool use more rigorously than existing benchmarks....
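
Schematically, the agent loop such a benchmark exercises looks like the sketch below; the tool registry and planner are stubs, not the MCP SDK.

```python
# The model proposes a tool call, the harness executes it against a (here:
# fake) server, and the result feeds the next planning step.
def plan_step(goal, observations):
    """Stub for an LLM deciding the next tool call (or finishing)."""
    if not observations:
        return {"tool": "search_flights", "args": {"to": "NRT"}}
    return {"tool": None, "answer": f"Best option: {observations[-1]}"}

TOOLS = {"search_flights": lambda to: f"flight ABC123 to {to}"}

def run_agent(goal, max_steps=8):
    observations = []
    for _ in range(max_steps):
        step = plan_step(goal, observations)
        if step["tool"] is None:
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])   # call the live tool
        observations.append(result)
    return "gave up"

print(run_agent("book a flight to Tokyo"))
```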

Read More

Mixture of Contexts for Long Video Generation

Published at 2025-08-28

#ML

This study addresses the challenge of generating long videos by treating it as an information retrieval task and proposing a new method called Mixture of Contexts (MoC). MoC dynamically selects important video chunks to focus on, allowing the model to efficiently remember and generate long video sequences without using too much memory or computation power....
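
Reading the summary, the core mechanism can be sketched as top-k chunk retrieval inside attention: score cached chunks against the current query and attend only to the winners. Tensor sizes and single-head attention here are toy choices, not the paper's configuration.

```python
# Treat past video chunks as a retrieval pool; attend to the top-k only.
import torch
import torch.nn.functional as F

B, n_chunks, chunk_len, d = 1, 32, 64, 128
keys = torch.randn(B, n_chunks, chunk_len, d)    # cached keys per chunk
vals = torch.randn(B, n_chunks, chunk_len, d)
query = torch.randn(B, 1, d)                     # current generation query

chunk_means = keys.mean(dim=2)                   # (B, n_chunks, d) chunk summary
scores = torch.einsum("bqd,bcd->bqc", query, chunk_means)   # (B, 1, n_chunks)
topk = scores.topk(k=4, dim=-1).indices.squeeze(1)          # (B, k)

# Gather selected chunks and run dense attention only over them.
sel_k = keys[torch.arange(B).unsqueeze(1), topk].flatten(1, 2)  # (B, k*len, d)
sel_v = vals[torch.arange(B).unsqueeze(1), topk].flatten(1, 2)
attn = F.softmax(query @ sel_k.transpose(1, 2) / d ** 0.5, dim=-1)
out = attn @ sel_v                               # (B, 1, d), memory ~ k, not n
print(out.shape)
```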

Read More

Multi-View 3D Point Tracking

Published at 2025-08-28

#ML

The authors present a new multi-view 3D point tracker that uses multiple cameras to accurately track points in dynamic scenes, even when some points are occluded. This method is more practical and effective than existing ones, as it works with a smaller number of cameras and can operate online, making it useful for real-world applications....
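
The geometric core any multi-view point tracker builds on is classic linear (DLT) triangulation, which handles occlusion gracefully: each camera that sees the point contributes two linear constraints, and occluded views are simply left out.

```python
# Linear triangulation from any number of calibrated views.
import numpy as np

def triangulate(projections, observations):
    """projections: list of 3x4 camera matrices; observations: (u, v) pixels."""
    rows = []
    for P, (u, v) in zip(projections, observations):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                               # null vector of the constraint stack
    return X[:3] / X[3]                      # dehomogenize

# Two toy cameras: identity pose and a 1-unit translation along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0, 1.0])
obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
print(triangulate([P1, P2], obs))            # ~ [0.5, 0.2, 4.0]
```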

Read More

OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models

Published at 2025-08-28

#ML

The study introduces OnGoal, an LLM chat interface that helps users track and visualize their conversational goals in multi-turn dialogues with large language models. A user study found that OnGoal reduces the time and effort needed to achieve goals, encourages new prompting strategies, and improves engagement and resilience in LLM dialogues....

Read More

OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Published at 2025-08-28

#ML

The study presents OneReward, a unified reinforcement learning framework that improves a model's image generation across multiple tasks using a single vision-language model as the reward model. It is particularly useful for mask-guided image generation, which covers tasks like image fill, extend, and object removal, since it trains without task-specific supervised fine-tuning and outperforms existing methods....

Read More

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Published at 2025-08-28

#ML

The authors address the issue of reward hacking in current text-to-image generation methods by proposing Pref-GRPO, a pairwise preference reward-based GRPO method. They also introduce UniGenBench, a comprehensive benchmark for evaluating T2I models, which reveals the strengths and weaknesses of both open and closed-source models and validates the effectiveness of Pref-GRPO....
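
The reward shape the summary describes can be sketched as pairwise win rates normalized within a group; `prefer` below stands in for a real preference model, and the details are illustrative rather than the paper's exact formulation.

```python
# Score each sample in a group by its pairwise win rate, then use the
# group-normalized rate as a GRPO-style advantage.
import itertools
import statistics

def prefer(a, b):
    """Stub: return True if sample `a` is preferred over `b`."""
    return a["quality"] > b["quality"]

def group_advantages(samples):
    wins = [0] * len(samples)
    for i, j in itertools.combinations(range(len(samples)), 2):
        if prefer(samples[i], samples[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    rates = [w / (len(samples) - 1) for w in wins]       # pairwise win rates
    mu = statistics.mean(rates)
    sd = statistics.pstdev(rates) or 1.0                 # guard zero spread
    return [(r - mu) / sd for r in rates]

group = [{"quality": q} for q in (0.9, 0.4, 0.7, 0.2)]
print(group_advantages(group))
```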

Read More

Provable Benefits of In-Tool Learning for Large Language Models

Published at 2025-08-28

#ML

This study compares two methods of language model learning: in-weight learning (memorization) and in-tool learning (external retrieval). The researchers found that while memorization has a limit based on the model's parameters, in-tool learning allows for unlimited factual recall through efficient circuit construction. They also discovered that teaching tool-use to pre-trained language models is more effective than memorizing facts....
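
The contrast in miniature, with an external fact store standing in for a real retrieval tool:

```python
# In-weight recall must compress facts into fixed parameters; in-tool recall
# only needs the model to learn the *circuit* that emits a lookup call, so
# the fact store can grow without growing the model.
FACTS = {"capital_of:France": "Paris", "capital_of:Peru": "Lima"}  # external DB

def in_tool_answer(query):
    return FACTS.get(f"capital_of:{query}", "unknown")

print(in_tool_answer("Peru"))   # scales with the DB, not the parameter count
```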

Read More

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Published at 2025-08-28

#ML

This research presents a new method called Rank-One Safety Injection (ROSI) that enhances the safety of Large Language Models (LLMs) by steering their activations towards refusal-mediating subspaces, thus preventing harmful requests. ROSI is a simple and cost-effective technique that can be applied without fine-tuning, and it has been shown to improve safety refusal rates while maintaining the model's utility on various benchmarks....
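
A minimal sketch of what a rank-one safety edit can look like, assuming a refusal direction estimated from cached activations; the layer choice and scaling are illustrative, not the paper's recipe.

```python
# Estimate a refusal direction as the mean activation difference between
# harmful and harmless prompts, then add a rank-one term to a layer's output
# weights so activations are steered toward it. No fine-tuning involved.
import torch

d_model = 512
W_out = torch.randn(d_model, d_model)             # some layer's output weights

acts_harmful = torch.randn(100, d_model) + 0.5    # stand-in cached activations
acts_harmless = torch.randn(100, d_model)
refusal_dir = acts_harmful.mean(0) - acts_harmless.mean(0)
refusal_dir = refusal_dir / refusal_dir.norm()

alpha = 2.0                                       # injection strength
# Rank-one update: outputs gain a component along the refusal direction,
# proportional to how much the input already activates it.
W_edited = W_out + alpha * torch.outer(refusal_dir, refusal_dir) @ W_out

x = torch.randn(d_model)
print((W_edited @ x - W_out @ x).norm())          # the edit's effect on one input
```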

Read More

rStar2-Agent: Agentic Reasoning Technical Report

Published at 2025-08-28

#ML

The researchers developed rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve top-level performance in complex problem-solving. They employed three key innovations: an efficient RL infrastructure, a resample-on-correct rollout strategy, and an agent training recipe. Together these enable advanced cognitive abilities and bring the model to the state of the art in just 510 RL steps....

Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media

Fb · X · In