🤗 Daily Paper Newsletter

This newsletter delivers a curated list of papers from 🤗 Daily Papers. Hope you find some gems!

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Published at 2025-09-16
#ML
The authors present MiniCPM-V 4.5, an efficient 8B-parameter multimodal large language model (MLLM) that tackles training and inference challenges through improvements in model architecture, data strategy, and training method. The model outperforms larger proprietary and open-source models, such as GPT-4o-latest and Qwen2.5-VL 72B, while using significantly less GPU memory and inference time....
Read More

Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
Published at 2025-09-17
#ML
This study presents Baseer, a specialized vision-language model for converting Arabic documents to Markdown text using OCR, which outperforms existing solutions by utilizing a large-scale dataset and a unique training strategy. The model's performance is validated using Misraj-DocOCR, a new benchmark for Arabic OCR systems, and Baseer achieves a new state of the art in Arabic document OCR with a WER of 0.25....
Read More

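The WER figure quoted above is the standard word error rate: word-level edit distance between the model output and the reference, divided by the reference length. A minimal, self-contained sketch of the computation (my own illustration, not code from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# A WER of 0.25 means roughly one word-level error per four reference words.
print(word_error_rate("النص المرجعي هنا", "النص هنا"))  # 1 deletion / 3 words ≈ 0.33
```
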
Large Language Models Discriminate Against Speakers of German Dialects
Published at 2025-09-17
#ML
This study investigates whether large language models hold biases against German dialect speakers, who are often stereotyped negatively in society. The research analyzes two tasks, an association task and a decision task, and finds that all evaluated models exhibit significant bias against dialect speakers, reflected in negative associations and decisions....
Read More

CommonForms: A Large, Diverse Dataset for Form Field Detection
Published at 2025-09-19
#ML
Researchers created CommonForms, a large, diverse dataset for detecting form fields in document pages, comprising over 450,000 pages across many languages and domains. They also developed two cost-effective models, FFDNet-Small and FFDNet-Large, that accurately detect checkbox, text, and signature fields, outperforming popular commercial PDF readers....
Read More

HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
Published at 2025-09-21
#ML
The authors present HyRF, a new method for 3D scene representation that combines explicit Gaussians and neural fields to improve memory efficiency and rendering quality compared to previous methods like 3DGS. HyRF uses a compact set of Gaussians for high-frequency parameters and neural fields for other properties, resulting in a smaller model size and faster performance without sacrificing detail in the rendered scenes....
Read More

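To make the hybrid idea above concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) in which each Gaussian keeps only compact explicit parameters while a small shared neural field predicts the remaining per-Gaussian properties from position:

```python
import torch
import torch.nn as nn

class HybridGaussians(nn.Module):
    """Toy hybrid scene representation: explicit geometry + neural field for appearance.

    Hypothetical split for illustration: positions/scales/rotations stay explicit
    (high-frequency geometry), while opacity and color come from an MLP queried
    at each Gaussian's position, which is what keeps the stored model small.
    """

    def __init__(self, num_gaussians: int = 100_000, hidden: int = 64):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_gaussians, 3) * 0.5)
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.rotations = nn.Parameter(torch.zeros(num_gaussians, 4))  # quaternions
        self.field = nn.Sequential(          # neural field shared by all Gaussians
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # RGB + opacity logits
        )

    def gaussian_properties(self):
        out = self.field(self.means)
        color = torch.sigmoid(out[:, :3])
        opacity = torch.sigmoid(out[:, 3:])
        return self.means, self.log_scales.exp(), self.rotations, color, opacity
```
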
OpenGVL - Benchmarking Visual Temporal Progress for Data Curation
Published at 2025-09-21
#ML
The study presents OpenGVL, a benchmark for estimating task progress across diverse manipulation tasks involving both robot and human embodiments. The authors find that open-source models perform significantly worse than closed-source ones at predicting task progress, and demonstrate how OpenGVL can be used for automated data curation and filtering....
Read More

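The benchmark scores how well a model's per-frame task-progress predictions track the true temporal order of a successful episode. A rank-correlation check in that spirit (an illustrative sketch, not the benchmark's official scoring code) could look like this:

```python
import numpy as np

def progress_rank_correlation(predicted_progress: np.ndarray) -> float:
    """Spearman correlation between predicted progress and true frame order.

    `predicted_progress[i]` is the model's 0-100% completion estimate for frame i
    of a successful episode, so an ideal predictor is monotonically increasing.
    """
    n = len(predicted_progress)
    true_order = np.arange(n)
    pred_ranks = np.argsort(np.argsort(predicted_progress))
    true_ranks = np.argsort(np.argsort(true_order))
    pred_c = pred_ranks - pred_ranks.mean()
    true_c = true_ranks - true_ranks.mean()
    return float((pred_c @ true_c) / (np.linalg.norm(pred_c) * np.linalg.norm(true_c)))

print(progress_rank_correlation(np.array([0, 10, 15, 40, 80, 95])))   # ≈ 1.0
print(progress_rank_correlation(np.array([50, 10, 90, 20, 5, 60])))   # noisy predictor, lower
```
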
Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation
Published at 2025-09-22
#ML
This study analyzes and improves latency metrics for evaluating simultaneous speech-to-text translation systems, introducing the YAAL and LongYAAL metrics and the SoftSegmenter tool to provide more accurate and reliable assessments....
Read More

GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction
Published at 2025-09-22
#ML
The paper presents GeoSVR, a new framework that uses sparse voxels to improve surface reconstruction accuracy. It introduces a depth constraint and surface regularization to ensure accurate and detailed reconstructions, outperforming existing methods in various scenarios....
Read More

PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies
Published at 2025-09-22
#ML
The authors propose PEEK, a method that uses vision-language models to help robots with manipulation tasks by predicting the motions to execute and the image regions to focus on, improving zero-shot generalization and performance in real-world settings....
Read More

CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching
Published at 2025-09-23
#ML
The study presents CAR-Flow, a condition-aware reparameterization that shifts the source and target distributions to align them and ease flow matching in conditional generative modeling. By shifting these distributions toward each other per condition, CAR-Flow reduces the learning burden on the model, leading to faster training and better performance, as demonstrated by a significant reduction in FID on ImageNet-256....
Read More

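As a rough illustration of the idea above (my own sketch with a simple shift-only reparameterization; the paper's exact parameterization may differ), condition-dependent shifts move the source and target before a standard flow-matching loss is applied:

```python
import torch
import torch.nn as nn

class ConditionAwareShifts(nn.Module):
    """Lightweight condition-dependent shifts for source and target (illustrative)."""

    def __init__(self, cond_dim: int, data_dim: int):
        super().__init__()
        self.mu_source = nn.Linear(cond_dim, data_dim)
        self.mu_target = nn.Linear(cond_dim, data_dim)

def car_flow_matching_loss(velocity_net, shifts, x1, cond):
    """Flow-matching loss between shifted source and shifted target.

    velocity_net(x_t, t, cond) predicts the velocity; x1 is a batch of data;
    cond is a batch of condition embeddings. Shifting both endpoints toward
    each other per condition shortens the transport the network must learn.
    """
    x0 = torch.randn_like(x1)                      # standard Gaussian source
    x0 = x0 + shifts.mu_source(cond)               # condition-aware source shift
    x1 = x1 - shifts.mu_target(cond)               # condition-aware target shift
    t = torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_velocity = x1 - x0
    return ((velocity_net(x_t, t, cond) - target_velocity) ** 2).mean()
```
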
DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture
Published at 2025-09-23
#ML
DRISHTIKON is a new benchmark focused on Indian culture, spanning 15 languages and over 64,000 text-image pairs, to evaluate AI systems' understanding of cultural nuances. The benchmark reveals limitations in current AI models when handling culturally specific, multimodal inputs, particularly for less-resourced languages and traditions....
Read More

Do You Need Proprioceptive States in Visuomotor Policies?
Published at 2025-09-23
#ML
This study shows that removing proprioceptive state input from robot manipulation policies and relying only on visual observations improves spatial generalization, data efficiency, and cross-embodiment adaptation. The new State-free Policy, using dual wide-angle wrist cameras, significantly enhances success rates in various real-world tasks, such as pick-and-place and shirt-folding, compared to traditional state-based policies....
Read More

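A toy sketch of the input-interface difference (hypothetical module names, not the paper's code): the state-free variant below conditions the action head only on features from the two wrist cameras, with no proprioceptive vector concatenated in.

```python
import torch
import torch.nn as nn

class StateFreePolicy(nn.Module):
    """Visuomotor policy conditioned on wrist-camera images only (no robot state)."""

    def __init__(self, action_dim: int = 7, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(            # shared image encoder (toy CNN)
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(2 * feat_dim, action_dim)

    def forward(self, left_wrist_img, right_wrist_img):
        # No proprioceptive state is concatenated here; the paper's claim is that
        # this forces the policy to rely on visual cues, which generalizes better
        # across spatial layouts and embodiments.
        feats = torch.cat([self.encoder(left_wrist_img),
                           self.encoder(right_wrist_img)], dim=-1)
        return self.head(feats)
```
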
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
Published at 2025-09-23
#ML
The authors present Hyper-Bagel, a unified framework that significantly speeds up both multimodal understanding and generation across modalities such as text and images. It uses a divide-and-conquer strategy that accelerates next-token prediction and the multi-step generation process separately, resulting in faster performance and near real-time interactive editing and generation....
Read More

Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
Published at 2025-09-23
#ML
This study presents a method to create 3D scenes using a self-distillation framework that transfers knowledge from video diffusion models to an explicit 3D representation, allowing for 3D scene generation without needing real-world multi-view data. The proposed technique can generate both static and dynamic 3D scenes from text or single-image prompts, demonstrating superior performance compared to existing methods....
Read More

MAPO: Mixed Advantage Policy Optimization
Published at 2025-09-23
#ML
This study presents MAPO, a new strategy for reinforcement learning in foundation models that addresses issues with advantage function allocation. MAPO improves the advantage function by considering the certainty of trajectories and dynamically adjusting the function for different samples, which enhances the performance of foundation models on reasoning tasks compared to existing methods....
Read More

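To give a flavor of what mixing advantages by trajectory certainty could mean in a GRPO-style setup, here is an illustrative sketch (my own simplification; the paper's exact certainty definition and mixing function may differ):

```python
import numpy as np

def mixed_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Blend two advantage estimates for a group of rollouts of one prompt.

    rewards: per-rollout scalar rewards (e.g., 1.0 correct / 0.0 incorrect).
    Certainty is high when rollouts mostly agree (almost all right or wrong);
    the weight then shifts between a mean-centered and a std-normalized estimate.
    """
    mean, std = rewards.mean(), rewards.std()
    centered = rewards - mean                   # scale-preserving advantage
    normalized = centered / (std + eps)         # GRPO-style normalized advantage
    certainty = abs(mean - 0.5) * 2.0           # in [0, 1] for binary rewards
    w = certainty                               # illustrative mixing weight
    return w * centered + (1.0 - w) * normalized

rollout_rewards = np.array([1.0, 1.0, 1.0, 0.0])   # mostly-correct prompt
print(mixed_advantages(rollout_rewards))
```
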
Reinforcement Learning on Pre-Training Data
Published at 2025-09-23
#ML
This study presents a new method called RLPT that uses reinforcement learning to improve large language models by learning from pre-existing data, without relying on human annotations. Extensive experiments show that RLPT significantly enhances model performance and demonstrates promising scaling behavior....
Read More

Soft Tokens, Hard Truths
Published at 2025-09-23
#ML
The study presents a new method for training continuous reasoning models using reinforcement learning, which can generate more diverse reasoning paths and better adapt to new tasks compared to discrete models. This approach allows for the deployment of continuous models in a standard way and improves their performance on out-of-domain tasks....
Read More

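One way to picture a "soft" reasoning token (an illustrative sketch under my own assumptions, not the paper's exact recipe): instead of sampling a discrete token, the next input embedding is a probability-weighted mixture of token embeddings, with a bit of noise to give reinforcement learning something to explore.

```python
import torch

def next_soft_token_embedding(logits: torch.Tensor,
                              embedding_matrix: torch.Tensor,
                              temperature: float = 1.0,
                              noise_scale: float = 0.05) -> torch.Tensor:
    """Mixture-of-embeddings 'soft token' for continuous chain-of-thought.

    logits: (batch, vocab) next-token logits at a reasoning step.
    embedding_matrix: (vocab, dim) input embedding table.
    Returns a (batch, dim) continuous embedding fed back as the next input.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    soft_embedding = probs @ embedding_matrix          # expected embedding
    # Gaussian noise on the continuous embedding provides the exploration
    # that RL needs, in place of sampling a discrete token.
    return soft_embedding + noise_scale * torch.randn_like(soft_embedding)
```
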
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Published at 2025-09-23
#ML
The authors present VIR-Bench, a new benchmark for testing the geospatial and temporal understanding of video in multimodal large language models. The benchmark consists of 200 travel videos and evaluates models on reconstructing itineraries, which is crucial for real-world tasks like AI planning and navigation. Experiments show that current MLLMs struggle with this task, and the authors develop a travel-planning agent that improves itinerary recommendations based on VIR-Bench insights...
Read More

VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
Published at 2025-09-23
#ML
The authors propose a new method called VolSplat to improve 3D Gaussian splatting for novel view synthesis. VolSplat addresses limitations of existing pixel-aligned methods by using voxel-aligned Gaussians, resulting in more accurate and consistent 3D reconstructions, and achieving state-of-the-art performance on popular benchmarks....
Read More

What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Published at 2025-09-23
#ML
This study finds that for large reasoning models, effective reasoning is characterized by fewer failed steps and a structured approach, rather than just long chains of thought. The researchers introduce a new metric, Failed-Step Fraction, to measure the effectiveness of reasoning and show that reducing failed branches improves accuracy....
Read More

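The Failed-Step Fraction is simple to state: the share of reasoning steps that belong to branches the model later abandons. A toy computation (assuming the steps have already been segmented and labeled; the paper's segmentation procedure is not reproduced here):

```python
def failed_step_fraction(step_labels: list[str]) -> float:
    """Fraction of chain-of-thought steps that lie on abandoned branches.

    step_labels: one label per reasoning step, e.g. "kept" for steps on the
    path leading to the final answer and "failed" for steps in branches the
    model backtracks from. Lower values correlate with higher accuracy
    according to the paper.
    """
    if not step_labels:
        return 0.0
    failed = sum(1 for label in step_labels if label == "failed")
    return failed / len(step_labels)

# 2 of 8 steps were part of a dead-end branch -> FSF = 0.25
print(failed_step_fraction(["kept"] * 4 + ["failed"] * 2 + ["kept"] * 2))
```
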
Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
Published at 2025-09-23
#ML
The authors present a method to use generalist multimodal models for analyzing multi-spectral images, which are commonly used in remote sensing applications, without the need for specific training. They demonstrate strong performance improvements with the Gemini 2.5 model on popular benchmarks, making it easier for geospatial professionals to leverage powerful multimodal models for their work....
Read More

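A common way to hand multi-spectral imagery to an RGB-only generalist model, and one plausible reading of the zero-shot setup described above (my own sketch, not the paper's pipeline), is to render selected bands as false-color composites and explain the band assignment in the prompt:

```python
import numpy as np

def false_color_composite(bands: dict[str, np.ndarray],
                          order=("nir", "red", "green")) -> np.ndarray:
    """Map three spectral bands to an 8-bit RGB image a generalist VLM can read.

    bands: band name -> 2D float array of reflectances (hypothetical keys).
    Each band is percentile-stretched independently so the composite has
    usable contrast; vegetation appears bright red in this NIR/Red/Green order.
    """
    channels = []
    for name in order:
        band = bands[name].astype(np.float32)
        lo, hi = np.percentile(band, [2, 98])
        channels.append(np.clip((band - lo) / (hi - lo + 1e-6), 0.0, 1.0))
    return (np.stack(channels, axis=-1) * 255).astype(np.uint8)

# The composite, plus a text note such as "red channel = near-infrared band",
# is then sent to the multimodal model alongside the task instruction.
```
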
Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit Developer's Social Media