🤗 Daily Paper(2025-08-04)


deep.di...@gmail.com

Aug 4, 2025, 4:07:02 PM
to hf-daily-pap...@googlegroups.com

🤗 Daily Paper Newsletter

Hope you found some gems!

This newsletter delivers a curated list of papers from 🤗 Daily Papers.

project page
🤗 daily paper

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Published at 2025-07-25

#ML

The authors present MCIF, a new benchmark for evaluating multilingual and multimodal capabilities of language models. MCIF uses scientific talks in four languages and three modalities to test model performance across different languages and complex tasks, addressing gaps in existing benchmarks....

Read More

Investigating Hallucination in Conversations for Low Resource Languages

Published at 2025-07-30

#ML

This study examines the issue of hallucination in Large Language Models (LLMs) for three low-resource languages: Hindi, Farsi, and Mandarin. The researchers analyzed conversational data in these languages for six LLMs and found that while Mandarin produced fewer factually incorrect statements, Hindi and Farsi exhibited significantly more hallucinations....

Read More

3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

Published at 2025-07-31

#ML

The authors present a new model, 3D-R1, that improves reasoning and generalization in 3D scene understanding by creating a high-quality synthetic dataset, using advanced training techniques, and implementing a dynamic view selection strategy, resulting in an average improvement of 10% across various benchmarks....

Read More

Multimodal Referring Segmentation: A Survey

Published at 2025-07-31

#ML

This survey explores the field of multimodal referring segmentation, which involves identifying target objects in visual scenes like images, videos, and 3D environments based on user instructions in text or audio. The paper covers the task's background, a unified architecture, representative methods for different visual scenes, and ways to handle real-world complexities, along with related applications and performance comparisons....

Read More

PixNerd: Pixel Neural Field Diffusion

Published at 2025-07-31

#ML

This study presents PixNerd, a new method for image generation that avoids the errors and artifacts introduced by the traditional two-stage training process. PixNerd is efficient, end-to-end, and achieved impressive results on ImageNet without complex pipelines or a VAE, and it was also successfully applied to text-to-image generation....

Read More

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Published at 2025-07-31

#ML

The paper presents SWE-Debate, a new framework for resolving software issues using a competitive multi-agent debate. This framework encourages diverse reasoning paths and consolidates issue localization by having specialized agents debate in three rounds, resulting in a collaborative fix plan that outperforms existing methods in open-source agent frameworks....

Read More

SWE-Exp: Experience-Driven Software Issue Resolution

Published at 2025-07-31

#ML

The paper presents an improved method, SWE-Exp, for software issue resolution that learns from past experiences, unlike current memoryless agents. SWE-Exp creates a bank of successful and failed repair attempts, extracting reusable knowledge, and achieves a high resolution rate in experiments, shifting the paradigm from trial-and-error to strategic, experience-driven resolution....

Read More

Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

Published at 2025-08-01

#ML

The study presents DAEDAL, a new method that allows Diffusion Large Language Models to adapt their length dynamically, overcoming their limitation of fixed length generation. This results in better performance and efficiency compared to previous models, making them more competitive with Autoregressive Large Language Models....

Read More

Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Published at 2025-08-01

#ML

The study presents Cognitive Kernel-Pro, an open-source and free agent framework for creating advanced AI agents, focusing on high-quality training data across four domains and novel strategies for agent reflection and voting to improve performance. The framework outperforms other open-source and free agents in the GAIA evaluation, setting a new standard for accessible, high-capability AI agents....

Read More

IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation

Published at 2025-08-01

#ML

The authors present IGL-Nav, a system that efficiently and accurately locates a goal image in 3D space for image-goal navigation by incrementally updating a scene representation and using geometric information for localization, outperforming existing methods and being deployable on real-world robotic platforms....

Read More

Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

Published at 2025-08-01

#ML

The paper presents a new method for evaluating multi-turn conversations with large language models. It mitigates the biases of any single judge by aggregating multiple models, and reduces computation cost by distilling their combined feedback into a single evaluator, which proves more efficient and robust than existing methods across various scenarios....

Read More

Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

Published at 2025-08-01

#ML

The authors developed AVR-Eval, a metric that uses audio-visual recordings to evaluate multimedia content quality, and AVR-Agent, a multi-agent system that generates JavaScript code from multimedia assets. Their experiments show that content generated by AVR-Agent performs better than one-shot generated content, but the models do not yet utilize custom assets and audio-visual feedback effectively....

Read More

SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Published at 2025-08-01

#ML

The study presents SpA2V, a framework that uses spatial auditory cues from audio recordings to generate videos with high semantic and spatial accuracy, addressing the limitation of existing methods that only focus on semantic information....

Read More


Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.


(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.

Visit Developer's Social Media
