🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
Published at 2025-08-22
#ML

This study investigates how different architectures and training methods affect a model's ability to perform multi-step reasoning, using a cellular automata framework. The researchers found that most neural architectures can abstract underlying rules when trained to avoid memorization, but performance drops for multi-step reasoning. They discovered that increasing model depth, along with recurrence, memory, and test-time compute scaling, significantly improves reasoning capabilities....
Read More

If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition
Published at 2025-08-22
#ML

This study highlights that large language models still suffer from inconsistent performance due to presuppositions embedded in questions and sensitivity to prompts. The researchers present a framework that decomposes claims into presupposition-free questions for verification, reducing the impact of these issues and improving accuracy by 2-5%....
Read More

MV-RAG: Retrieval Augmented Multiview Diffusion
Published at 2025-08-22
#ML

The study presents MV-RAG, a method for text-to-3D generation that first retrieves related 2D images from a large database and then uses them to condition the generation of consistent, accurate multiview outputs. The approach is particularly useful for rare or unseen concepts, and the researchers provide a new set of challenging tests to evaluate its performance....
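The retrieve-then-condition idea can be sketched with a toy nearest-neighbor retrieval step. The database, embeddings, and `retrieve` helper below are invented for illustration; MV-RAG's actual retriever and diffusion conditioning are far more involved.

```python
# Toy sketch of the retrieval stage in a retrieve-then-generate pipeline.
# The image names and 3-d embeddings are made up for illustration.

DATABASE = {
    "okapi_front.png": [0.9, 0.1, 0.0],
    "okapi_side.png":  [0.8, 0.2, 0.1],
    "red_car.png":     [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def retrieve(query_emb, k=2):
    """Return the k database images most similar to the query embedding."""
    ranked = sorted(DATABASE, key=lambda name: -cosine(DATABASE[name], query_emb))
    return ranked[:k]

# Stand-in embedding for a prompt describing a rare concept ("an okapi");
# the retrieved 2D views would then condition the multiview diffusion model.
print(retrieve([1.0, 0.0, 0.0]))
```

Under this sketch, the two okapi views outrank the unrelated image, which is the behavior the retrieval stage is meant to provide for rare concepts.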
Read More

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
Published at 2025-08-22
#ML

The Text-aware Diffusion Transformer Speech Codec (TaDiCodec) is a new speech tokenizer that improves upon existing designs by using a diffusion autoencoder for quantization and reconstruction, and incorporating text guidance to enhance quality and compression. TaDiCodec has a low frame rate and bitrate, requires only one stage of training, and performs well on speech generation metrics, making it efficient and effective for speech language modeling....
Read More

Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Published at 2025-08-23
#ML

The authors present Rubric-Scaffolded Reinforcement Learning (RuscaRL), a method that helps large language models (LLMs) improve their reasoning by using checklist-style rubrics both as exploration scaffolds and as reward signals, yielding better performance than previous models across a range of reasoning tasks....
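As a rough illustration of the rubric-as-reward idea, a checklist can be turned into a dense scalar reward. The rubric items and string predicates below are invented, not the authors' implementation.

```python
# Hypothetical checklist-style rubric: each item is a predicate over the
# model's response, and the reward is the fraction of items satisfied,
# giving a dense signal instead of a single end-of-episode score.
rubric = [
    ("states a final answer",    lambda r: "answer:" in r.lower()),
    ("shows intermediate steps", lambda r: "step" in r.lower()),
    ("refers to the premise",    lambda r: "premise" in r.lower()),
]

def rubric_reward(response: str) -> float:
    """Scalar reward in [0, 1]: fraction of rubric items satisfied."""
    hits = sum(check(response) for _, check in rubric)
    return hits / len(rubric)

response = "Step 1: restate the premise. Step 2: derive. Answer: 42."
print(rubric_reward(response))  # all three items satisfied -> 1.0
```

The same rubric can double as an exploration scaffold: showing the checklist in the prompt nudges the model toward responses that score well.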
Read More

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs
Published at 2025-08-23
#ML

The study presents PosterGen, a multi-agent framework that automates paper-to-poster generation, improving upon existing methods by incorporating design principles and aesthetic elements. This tool produces semantically accurate and visually appealing posters, reducing the need for manual refinement....
Read More

REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework
Published at 2025-08-23
#ML

This study presents a new method for improving the realism of video games using generative adversarial networks, specifically a dual-stage framework called REGEN. This framework significantly increases the speed of inference while maintaining high visual quality, as demonstrated on Grand Theft Auto V....
Read More

Explain Before You Answer: A Survey on Compositional Visual Reasoning
Published at 2025-08-24
#ML

The survey covers the development of compositional visual reasoning in multimodal AI from 2023 to 2025, focusing on the advantages of compositional approaches and the evolution of architectural designs. It also discusses various benchmarks, open challenges, and future research directions in this field....
Read More

MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment
Published at 2025-08-24
#ML

The authors present MEENA, a new dataset for evaluating Persian vision-language models across various educational tasks, addressing the lack of non-English resources. MEENA contains 7,500 Persian and 3,000 English questions, rich metadata, and original Persian data, enabling cross-linguistic performance assessment and diverse experiments to enhance VLM capabilities beyond English....
Read More

T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Published at 2025-08-24
#ML

The authors present T2I-ReasonBench, a benchmark for testing the reasoning skills of text-to-image models across four dimensions: interpreting idioms, designing images from textual descriptions, reasoning logically about objects, and applying scientific reasoning. They also introduce a two-step evaluation method to check the accuracy and quality of the generated images, and analyze various text-to-image models using the benchmark....
Read More

UQ: Assessing Language Models on Unsolved Questions
Published at 2025-08-24
#ML

The authors propose UQ, a new benchmark for language models that uses unsolved questions from Stack Exchange, which are both challenging and realistic. They present a three-part contribution: a dataset collection pipeline, validation strategies, and an open platform for expert verification, demonstrating that even the top models can only pass validation on 15% of the questions....
Read More

German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
Published at 2025-08-25
#ML

The study presents German4All, a large-scale dataset of German paraphrases tailored for different reading levels, which is used to create an advanced model for generating reader-specific text simplifications. The dataset and model are made publicly available to promote further research in this area....
Read More

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Published at 2025-08-25
#ML

The authors present InternVL 3.5, an open-source multimodal model that improves versatility, reasoning, and efficiency using a new training strategy called Cascade RL and features like Visual Resolution Router and Decoupled Vision-Language Deployment. These improvements lead to better performance and faster inference, and the model also supports new capabilities like GUI interaction and embodied agency, achieving state-of-the-art results among open-source models....
Read More

Limitations of Normalization in Attention Mechanism
Published at 2025-08-25
#ML

The study explores the constraints of normalization in attention mechanisms, focusing on token selection and separation. Experiments with the GPT-2 model reveal that as more tokens are selected, the model struggles to distinguish the informative ones, and that softmax normalization can be problematic during training, particularly at low temperatures....
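The low-temperature failure mode can be illustrated with a toy temperature-scaled softmax. The logits below are invented and this is not the paper's experimental setup; it only shows how temperature affects the separation of attention weights.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of attention logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# One query attending over five tokens: the first two are nearly tied
# "informative" tokens, the rest are distractors (invented numbers).
logits = [2.00, 1.95, 0.5, 0.4, 0.3]

low_t = softmax(logits, temperature=0.1)   # sharp, near winner-take-all
high_t = softmax(logits, temperature=2.0)  # flat, weights barely separated

# At low temperature most of the mass collapses onto the single top token
# even though the second token is almost as informative; at high
# temperature all five weights flatten toward uniform.
print(low_t)
print(high_t)
```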
Read More

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Published at 2025-08-25
#ML

The study presents MMTok, a method that improves the efficiency of Vision-Language Models by selecting important vision tokens using both vision and text tokens, rather than just one modality. The approach is tested on various benchmark datasets and demonstrates significant speedup while preserving performance compared to existing methods....
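A greedy coverage-maximization selector gives a feel for the general idea; this is a guess at the principle, not MMTok's actual objective or implementation, and the 2-d toy embeddings are invented.

```python
# Illustrative greedy coverage maximization: pick vision tokens whose
# combined similarity best "covers" both the text tokens and the full
# set of vision tokens.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def select_tokens(vision, text, k):
    """Greedily pick k vision-token indices maximizing the marginal gain
    in max-similarity coverage over all vision and text tokens."""
    targets = vision + text
    selected = []
    covered = [0.0] * len(targets)  # best similarity achieved per target
    for _ in range(k):
        best_i, best_gain = None, -1.0
        for i, v in enumerate(vision):
            if i in selected:
                continue
            gain = sum(max(0.0, cosine(v, t) - c)
                       for t, c in zip(targets, covered))
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        covered = [max(c, cosine(vision[best_i], t))
                   for t, c in zip(targets, covered)]
    return selected

# Three vision tokens and one text token: the selector keeps one token
# per distinct visual direction instead of two near-duplicates.
vision = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text = [[0.0, 1.0]]
print(select_tokens(vision, text, 2))
```

Because the text token is included among the coverage targets, a token relevant to the prompt survives pruning even when it is visually redundant with nothing else, which is the point of using both modalities.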
Read More

MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting
Published at 2025-08-25
#ML

The authors present a new method called MeshSplat that improves the accuracy of reconstructing 3D surfaces from a small number of input images. They achieve this by using a technique called Gaussian Splatting and a feed-forward network to predict pixel-aligned 2D Gaussian Splats, which helps synthesize novel view images and eliminates the need for direct 3D supervision....
Read More

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Published at 2025-08-25
#ML

The paper questions the reliability and validity of using large language models as judges in natural language generation evaluation. It examines assumptions about their accuracy, scalability, and cost-effectiveness, and recommends more responsible evaluation practices....
Read More

ST-Raptor: LLM-Powered Semi-Structured Table Question Answering
Published at 2025-08-25
#ML

The authors present a new framework, ST-Raptor, which uses large language models to automatically answer questions about complex, real-world tables. This system can understand and interpret table layouts, accurately answer questions, and verify its answers, outperforming other methods by up to 20% in accuracy....
Read More

SpotEdit: Evaluating Visually-Guided Image Editing Methods
Published at 2025-08-25
#ML

The study presents SpotEdit, a comprehensive benchmark for evaluating visually-guided image editing methods. It reveals significant performance differences across various models and highlights a common issue of hallucination, where models mistakenly assume a visual cue exists, even when it doesn't....
Read More

Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
Published at 2025-08-25
#ML

The study presents Visual-CoG, a method that improves text-to-image generation by breaking it into three stages and providing rewards at each stage rather than only at the end. This approach leads to better performance, especially on complex and ambiguous prompts, as demonstrated by improvements on various benchmarks....
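A minimal sketch of stage-wise reward shaping shows why per-stage rewards help; the weights and the three-way split below are placeholders, not Visual-CoG's actual decomposition or reward design.

```python
# Hypothetical stage-wise reward: credit is assigned after each
# generation stage instead of only once at the end.

def staged_reward(stage_scores, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of per-stage rewards, each assumed in [0, 1]."""
    assert len(stage_scores) == len(weights)
    return sum(w * s for w, s in zip(weights, stage_scores))

# A trajectory that reasons well early but refines poorly still earns
# partial credit, unlike a final-only reward that would score it low.
print(round(staged_reward([0.9, 0.8, 0.2]), 2))  # -> 0.59
```

The intermediate signal tells the policy which stage went wrong, which a single end-of-pipeline reward cannot do.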
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the Developer's Social Media