🤗 Daily Paper Newsletter

This newsletter delivers a curated list of papers from 🤗 Daily Papers. Hope you find some gems!

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Published at 2025-07-25
#ML

The authors present MCIF, a new benchmark for evaluating the multilingual and multimodal capabilities of language models. MCIF uses scientific talks in four languages and three modalities to test model performance across different languages and complex tasks, addressing gaps in existing benchmarks....
Read More

Investigating Hallucination in Conversations for Low Resource Languages
Published at 2025-07-30
#ML

This study examines the issue of 'hallucination' in Large Language Models (LLMs) for three low-resource languages: Hindi, Farsi, and Mandarin. The researchers analyzed conversational data in these languages from six LLMs and found that while Mandarin showed fewer factually incorrect statements, Hindi and Farsi exhibited significantly more hallucinations....
Read More

3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Published at 2025-07-31
#ML

The authors present a new model, 3D-R1, that improves reasoning and generalization in 3D scene understanding by creating a high-quality synthetic dataset, using advanced training techniques, and implementing a dynamic view selection strategy, resulting in an average improvement of 10% across various benchmarks....
Read More

Multimodal Referring Segmentation: A Survey
Published at 2025-07-31
#ML

This survey explores the field of multimodal referring segmentation, which involves identifying target objects in visual scenes such as images, videos, and 3D environments based on user instructions in text or audio. The paper covers the task's background, a unified architecture, representative methods for different visual scenes, and ways to handle real-world complexities, along with related applications and performance comparisons....
Read More

PixNerd: Pixel Neural Field Diffusion
Published at 2025-07-31
#ML

This study presents PixNerd, a new method for image generation that avoids the errors and artifacts introduced by the traditional two-stage training process. PixNerd is efficient and end-to-end, achieved impressive results on ImageNet without complex pipelines or a VAE, and was also successfully applied to text-to-image generation....
Read More

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution
Published at 2025-07-31
#ML

The paper presents SWE-Debate, a new framework for resolving software issues through competitive multi-agent debate. The framework encourages diverse reasoning paths and consolidates issue localization by having specialized agents debate over three rounds, producing a collaborative fix plan that outperforms existing open-source agent frameworks....
Read More

SWE-Exp: Experience-Driven Software Issue Resolution
Published at 2025-07-31
#ML

The paper presents SWE-Exp, an improved method for software issue resolution that learns from past experiences, unlike current memoryless agents. SWE-Exp builds a bank of successful and failed repair attempts, extracts reusable knowledge from them, and achieves a high resolution rate in experiments, shifting the paradigm from trial-and-error to strategic, experience-driven resolution....
Read More

Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
Published at 2025-08-01
#ML

The study presents DAEDAL, a new method that allows Diffusion Large Language Models to adapt their generation length dynamically, overcoming their limitation of fixed-length generation. This results in better performance and efficiency compared to previous models, making them more competitive with Autoregressive Large Language Models....
Read More

Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
Published at 2025-08-01
#ML

The study presents Cognitive Kernel-Pro, a free, open-source agent framework for creating advanced AI agents, focusing on high-quality training data across four domains and novel strategies for agent reflection and voting to improve performance. The framework outperforms other free and open-source agents in the GAIA evaluation, setting a new standard for accessible, high-capability AI agents....
Read More

IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
Published at 2025-08-01
#ML

The authors present IGL-Nav, a system that efficiently and accurately locates a goal image in 3D space for image-goal navigation. It incrementally updates a scene representation and uses geometric information for localization, outperforming existing methods and remaining deployable on real-world robotic platforms....
Read More

Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges
Published at 2025-08-01
#ML

The paper presents a new method for evaluating multi-turn conversations with large language models. It overcomes individual-judge biases by drawing on multiple models and reduces computation cost by combining their feedback into a single model, which proves more efficient and robust than existing methods across various scenarios....
Read More

Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings
Published at 2025-08-01
#ML

The authors developed AVR-Eval, a metric that uses audio-visual recordings to evaluate multimedia content quality, and AVR-Agent, a multi-agent system that generates JavaScript code from multimedia assets. Their experiments show that content generated by AVR-Agent outperforms one-shot generated content, but the models do not yet utilize custom assets and audio-visual feedback effectively....
Read More

SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation
Published at 2025-08-01
#ML

The study presents SpA2V, a framework that uses spatial auditory cues from audio recordings to generate videos with high semantic and spatial accuracy, addressing the limitation of existing methods that focus only on semantic information....
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media