🤗 Daily Paper Newsletter

This newsletter delivers a curated list of papers from 🤗 Daily Papers. Hope you find some gems!

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Published at 2025-09-16
#ML
The authors present MiniCPM-V 4.5, an efficient 8B-parameter multimodal large language model (MLLM) that tackles training and inference challenges through improvements in model architecture, data strategy, and training method. The model outperforms larger proprietary and open-source models, such as GPT-4o-latest and Qwen2.5-VL 72B, while using significantly less GPU memory and inference time....
Read More

Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
Published at 2025-09-17
#ML
This study presents Baseer, a specialized vision-language model for converting Arabic documents to Markdown text using OCR, which outperforms existing solutions by utilizing a large-scale dataset and a unique training strategy. The model's performance is validated using Misraj-DocOCR, a new benchmark for Arabic OCR systems, and Baseer achieves a new state of the art in Arabic document OCR with a WER of 0.25....
Read More

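The WER figure quoted above is the standard word error rate: word-level edit distance between the model output and the reference, divided by the reference length. A minimal, self-contained sketch of the computation (my own illustration, not code from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# A WER of 0.25 means roughly one word-level error per four reference words.
print(word_error_rate("النص المرجعي هنا", "النص هنا"))  # 1 deletion / 3 words ≈ 0.33
```
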
Large Language Models Discriminate Against Speakers of German Dialects
Published at 2025-09-17
#ML
This study investigates whether large language models hold biases against German dialect speakers, who are often stereotyped negatively in society. The research analyzes two tasks, an association task and a decision task, and finds that all evaluated models exhibit significant bias against dialect speakers, reflected in negative associations and decisions....
Read More

CommonForms: A Large, Diverse Dataset for Form Field Detection
Published at 2025-09-19
#ML
Researchers created CommonForms, a large, diverse dataset for detecting form fields in document pages, comprising over 450,000 pages across many languages and domains. They also developed two cost-effective models, FFDNet-Small and FFDNet-Large, that accurately detect checkbox, text, and signature fields, outperforming popular commercial PDF readers....
Read More

HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
Published at 2025-09-21
#ML
The authors present HyRF, a new method for 3D scene representation that combines explicit Gaussians and neural fields to improve memory efficiency and rendering quality compared to previous methods like 3DGS. HyRF uses a compact set of Gaussians for high-frequency parameters and neural fields for other properties, resulting in a smaller model size and faster performance without sacrificing detail in the rendered scenes....
Read More

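To make the hybrid idea above concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) in which each Gaussian keeps only compact explicit parameters while a small shared neural field predicts the remaining per-Gaussian properties from position:

```python
import torch
import torch.nn as nn

class HybridGaussians(nn.Module):
    """Toy hybrid scene representation: explicit geometry + neural field for appearance.

    Hypothetical split for illustration: positions/scales/rotations stay explicit
    (high-frequency geometry), while opacity and color come from an MLP queried
    at each Gaussian's position, which is what keeps the stored model small.
    """

    def __init__(self, num_gaussians: int = 100_000, hidden: int = 64):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_gaussians, 3) * 0.5)
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.rotations = nn.Parameter(torch.zeros(num_gaussians, 4))  # quaternions
        self.field = nn.Sequential(          # neural field shared by all Gaussians
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # RGB + opacity logits
        )

    def gaussian_properties(self):
        out = self.field(self.means)
        color = torch.sigmoid(out[:, :3])
        opacity = torch.sigmoid(out[:, 3:])
        return self.means, self.log_scales.exp(), self.rotations, color, opacity
```
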
OpenGVL - Benchmarking Visual Temporal Progress for Data Curation
Published at 2025-09-21
#ML
The study presents OpenGVL, a benchmark for estimating task progress across diverse manipulation tasks involving both robot and human embodiments. The authors find that open-source models perform significantly worse than closed-source ones at predicting task progress, and demonstrate how OpenGVL can be used for automated data curation and filtering....
Read More

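The benchmark scores how well a model's per-frame task-progress predictions track the true temporal order of a successful episode. A rank-correlation check in that spirit (an illustrative sketch, not the benchmark's official scoring code) could look like this:

```python
import numpy as np

def progress_rank_correlation(predicted_progress: np.ndarray) -> float:
    """Spearman correlation between predicted progress and true frame order.

    `predicted_progress[i]` is the model's 0-100% completion estimate for frame i
    of a successful episode, so an ideal predictor is monotonically increasing.
    """
    n = len(predicted_progress)
    true_order = np.arange(n)
    pred_ranks = np.argsort(np.argsort(predicted_progress))
    true_ranks = np.argsort(np.argsort(true_order))
    pred_c = pred_ranks - pred_ranks.mean()
    true_c = true_ranks - true_ranks.mean()
    return float((pred_c @ true_c) / (np.linalg.norm(pred_c) * np.linalg.norm(true_c)))

print(progress_rank_correlation(np.array([0, 10, 15, 40, 80, 95])))   # ≈ 1.0
print(progress_rank_correlation(np.array([50, 10, 90, 20, 5, 60])))   # noisy predictor, lower
```
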
Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation
Published at 2025-09-22
#ML
This study analyzes and improves latency metrics for evaluating simultaneous speech-to-text translation systems, introducing the YAAL and LongYAAL metrics and the SoftSegmenter tool to provide more accurate and reliable assessments....
Read More

GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction
Published at 2025-09-22
#ML
The paper presents GeoSVR, a new framework that uses sparse voxels to improve surface reconstruction accuracy. It introduces a depth constraint and surface regularization to ensure accurate and detailed reconstructions, outperforming existing methods in various scenarios....
Read More

PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies
Published at 2025-09-22
#ML
The authors propose PEEK, a method that uses vision-language models to help robots with manipulation tasks by predicting the motions to execute and the image regions to focus on, improving zero-shot generalization and performance in real-world settings....
Read More

CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching
Published at 2025-09-23
#ML
The study presents CAR-Flow, a condition-aware reparameterization that shifts the source and target distributions to align them and ease flow matching in conditional generative modeling. By shifting these distributions toward each other per condition, CAR-Flow reduces the learning burden on the model, leading to faster training and better performance, as demonstrated by a significant reduction in FID on ImageNet-256....
Read More

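As a rough illustration of the idea above (my own sketch with a simple shift-only reparameterization; the paper's exact parameterization may differ), condition-dependent shifts move the source and target before a standard flow-matching loss is applied:

```python
import torch
import torch.nn as nn

class ConditionAwareShifts(nn.Module):
    """Lightweight condition-dependent shifts for source and target (illustrative)."""

    def __init__(self, cond_dim: int, data_dim: int):
        super().__init__()
        self.mu_source = nn.Linear(cond_dim, data_dim)
        self.mu_target = nn.Linear(cond_dim, data_dim)

def car_flow_matching_loss(velocity_net, shifts, x1, cond):
    """Flow-matching loss between shifted source and shifted target.

    velocity_net(x_t, t, cond) predicts the velocity; x1 is a batch of data;
    cond is a batch of condition embeddings. Shifting both endpoints toward
    each other per condition shortens the transport the network must learn.
    """
    x0 = torch.randn_like(x1)                      # standard Gaussian source
    x0 = x0 + shifts.mu_source(cond)               # condition-aware source shift
    x1 = x1 - shifts.mu_target(cond)               # condition-aware target shift
    t = torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_velocity = x1 - x0
    return ((velocity_net(x_t, t, cond) - target_velocity) ** 2).mean()
```
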
DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture
Published at 2025-09-23
#ML
DRISHTIKON is a new benchmark focused on Indian culture, spanning 15 languages and over 64,000 text-image pairs, to evaluate AI systems' understanding of cultural nuances. The benchmark reveals limitations in current AI models when handling culturally specific, multimodal inputs, particularly for less-resourced languages and traditions....
Read More

Do You Need Proprioceptive States in Visuomotor Policies?
Published at 2025-09-23
#ML
This study shows that removing proprioceptive state input from robot manipulation policies and relying only on visual observations improves spatial generalization, data efficiency, and cross-embodiment adaptation. The new State-free Policy, using dual wide-angle wrist cameras, significantly enhances success rates in various real-world tasks, such as pick-and-place and shirt-folding, compared to traditional state-based policies....
Read More

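A toy sketch of the input-interface difference (hypothetical module names, not the paper's code): the state-free variant below conditions the action head only on features from the two wrist cameras, with no proprioceptive vector concatenated in.

```python
import torch
import torch.nn as nn

class StateFreePolicy(nn.Module):
    """Visuomotor policy conditioned on wrist-camera images only (no robot state)."""

    def __init__(self, action_dim: int = 7, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(            # shared image encoder (toy CNN)
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(2 * feat_dim, action_dim)

    def forward(self, left_wrist_img, right_wrist_img):
        # No proprioceptive state is concatenated here; the paper's claim is that
        # this forces the policy to rely on visual cues, which generalizes better
        # across spatial layouts and embodiments.
        feats = torch.cat([self.encoder(left_wrist_img),
                           self.encoder(right_wrist_img)], dim=-1)
        return self.head(feats)
```
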
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
Published at 2025-09-23
#ML
The authors present Hyper-Bagel, a unified framework that significantly speeds up both multimodal understanding and generation across modalities such as text and images. It uses a divide-and-conquer strategy that accelerates next-token prediction and the multi-step generation process separately, resulting in faster performance and near real-time interactive editing and generation....
Read More

Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
Published at 2025-09-23
#ML
This study presents a method to create 3D scenes using a self-distillation framework that transfers knowledge from video diffusion models to an explicit 3D representation, allowing for 3D scene generation without needing real-world multi-view data. The proposed technique can generate both static and dynamic 3D scenes from text or single-image prompts, demonstrating superior performance compared to existing methods....
Read More

MAPO: Mixed Advantage Policy Optimization
Published at 2025-09-23
#ML
This study presents MAPO, a new strategy for reinforcement learning in foundation models that addresses issues with advantage function allocation. MAPO improves the advantage function by considering the certainty of trajectories and dynamically adjusting the function for different samples, which enhances the performance of foundation models on reasoning tasks compared to existing methods....
Read More

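To give a flavor of what mixing advantages by trajectory certainty could mean in a GRPO-style setup, here is an illustrative sketch (my own simplification; the paper's exact certainty definition and mixing function may differ):

```python
import numpy as np

def mixed_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Blend two advantage estimates for a group of rollouts of one prompt.

    rewards: per-rollout scalar rewards (e.g., 1.0 correct / 0.0 incorrect).
    Certainty is high when rollouts mostly agree (almost all right or wrong);
    the weight then shifts between a mean-centered and a std-normalized estimate.
    """
    mean, std = rewards.mean(), rewards.std()
    centered = rewards - mean                   # scale-preserving advantage
    normalized = centered / (std + eps)         # GRPO-style normalized advantage
    certainty = abs(mean - 0.5) * 2.0           # in [0, 1] for binary rewards
    w = certainty                               # illustrative mixing weight
    return w * centered + (1.0 - w) * normalized

rollout_rewards = np.array([1.0, 1.0, 1.0, 0.0])   # mostly-correct prompt
print(mixed_advantages(rollout_rewards))
```
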
Reinforcement Learning on Pre-Training Data
Published at 2025-09-23
#ML
This study presents a new method called RLPT that uses reinforcement learning to improve large language models by learning from pre-existing data, without relying on human annotations. Extensive experiments show that RLPT significantly enhances model performance and demonstrates promising scaling behavior....
Read More

Soft Tokens, Hard Truths
Published at 2025-09-23
#ML
The study presents a new method for training continuous reasoning models using reinforcement learning, which can generate more diverse reasoning paths and better adapt to new tasks compared to discrete models. This approach allows for the deployment of continuous models in a standard way and improves their performance on out-of-domain tasks....
Read More

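One way to picture a "soft" reasoning token (an illustrative sketch under my own assumptions, not the paper's exact recipe): instead of sampling a discrete token, the next input embedding is a probability-weighted mixture of token embeddings, with a bit of noise to give reinforcement learning something to explore.

```python
import torch

def next_soft_token_embedding(logits: torch.Tensor,
                              embedding_matrix: torch.Tensor,
                              temperature: float = 1.0,
                              noise_scale: float = 0.05) -> torch.Tensor:
    """Mixture-of-embeddings 'soft token' for continuous chain-of-thought.

    logits: (batch, vocab) next-token logits at a reasoning step.
    embedding_matrix: (vocab, dim) input embedding table.
    Returns a (batch, dim) continuous embedding fed back as the next input.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    soft_embedding = probs @ embedding_matrix          # expected embedding
    # Gaussian noise on the continuous embedding provides the exploration
    # that RL needs, in place of sampling a discrete token.
    return soft_embedding + noise_scale * torch.randn_like(soft_embedding)
```
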
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Published at 2025-09-23
#ML
The authors present VIR-Bench, a new benchmark for testing the geospatial and temporal understanding of video in multimodal large language models. The benchmark consists of 200 travel videos and evaluates models on reconstructing itineraries, which is crucial for real-world tasks like AI planning and navigation. Experiments show that current MLLMs struggle with this task, and the authors develop a travel-planning agent that improves itinerary recommendations based on VIR-Bench insights...
Read More

VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
Published at 2025-09-23
#ML
The authors propose a new method called VolSplat to improve 3D Gaussian splatting for novel view synthesis. VolSplat addresses limitations of existing pixel-aligned methods by using voxel-aligned Gaussians, resulting in more accurate and consistent 3D reconstructions, and achieving state-of-the-art performance on popular benchmarks....
Read More

What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Published at 2025-09-23
#ML
This study finds that for large reasoning models, effective reasoning is characterized by fewer failed steps and a structured approach, rather than just long chains of thought. The researchers introduce a new metric, Failed-Step Fraction, to measure the effectiveness of reasoning and show that reducing failed branches improves accuracy....
Read More

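The Failed-Step Fraction is simple to state: the share of reasoning steps that belong to branches the model later abandons. A toy computation (assuming the steps have already been segmented and labeled; the paper's segmentation procedure is not reproduced here):

```python
def failed_step_fraction(step_labels: list[str]) -> float:
    """Fraction of chain-of-thought steps that lie on abandoned branches.

    step_labels: one label per reasoning step, e.g. "kept" for steps on the
    path leading to the final answer and "failed" for steps in branches the
    model backtracks from. Lower values correlate with higher accuracy
    according to the paper.
    """
    if not step_labels:
        return 0.0
    failed = sum(1 for label in step_labels if label == "failed")
    return failed / len(step_labels)

# 2 of 8 steps were part of a dead-end branch -> FSF = 0.25
print(failed_step_fraction(["kept"] * 4 + ["failed"] * 2 + ["kept"] * 2))
```
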
Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
Published at 2025-09-23
#ML
The authors present a method to use generalist multimodal models for analyzing multi-spectral images, which are commonly used in remote sensing applications, without the need for specific training. They demonstrate strong performance improvements with the Gemini 2.5 model on popular benchmarks, making it easier for geospatial professionals to leverage powerful multimodal models for their work....
Read More

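A common way to hand multi-spectral imagery to an RGB-only generalist model, and one plausible reading of the zero-shot setup described above (my own sketch, not the paper's pipeline), is to render selected bands as false-color composites and explain the band assignment in the prompt:

```python
import numpy as np

def false_color_composite(bands: dict[str, np.ndarray],
                          order=("nir", "red", "green")) -> np.ndarray:
    """Map three spectral bands to an 8-bit RGB image a generalist VLM can read.

    bands: band name -> 2D float array of reflectances (hypothetical keys).
    Each band is percentile-stretched independently so the composite has
    usable contrast; vegetation appears bright red in this NIR/Red/Green order.
    """
    channels = []
    for name in order:
        band = bands[name].astype(np.float32)
        lo, hi = np.percentile(band, [2, 98])
        channels.append(np.clip((band - lo) / (hi - lo + 1e-6), 0.0, 1.0))
    return (np.stack(channels, axis=-1) * 255).astype(np.uint8)

# The composite, plus a text note such as "red channel = near-infrared band",
# is then sent to the multimodal model alongside the task instruction.
```
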
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media