Large-scale digitized historical collections still contain substantial OCR errors. Re-processing millions of pages with improved engines is rarely feasible, making post-correction the most viable strategy for addressing the OCR debt accumulated in digital heritage collections. Recent progress in large language models opens promising new directions, but their effectiveness varies across languages and error types, and they may introduce hallucinations.
To what extent can modern large language models address the OCR debt accumulated in large-scale digitized historical collections?
HIPE-OCRepair 2026 addresses this question through HIPE-OCRepair-Bench, a unified multilingual benchmark comprising curated datasets, a standardised evaluation protocol, baseline systems, and an open leaderboard.
Participants correct noisy OCR transcripts of historical documents without access to the original images. For each text chunk (typically a paragraph or article), the dataset provides the noisy OCR transcript, and systems must produce a corrected version of the text. Both generative (LLM-based) and discriminative or hybrid approaches are welcome.
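To make the evaluation concrete, here is a minimal sketch of a character error rate (CER) computation, a metric commonly used to score OCR post-correction against ground truth. This is an illustration only; the official HIPE-OCRepair-Bench protocol may use different or additional metrics, and the sample strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Hypothetical example: a noisy OCR line, a system correction, and the gold text.
ocr_text  = "Tbe quick brovvn fox"
corrected = "The quick brown fox"
gold      = "The quick brown fox"
print(f"CER before correction: {cer(ocr_text, gold):.3f}")   # 3 edits / 19 chars
print(f"CER after correction:  {cer(corrected, gold):.3f}")  # 0.000
```

A lower CER after correction indicates improvement; a system that hallucinates text can end up with a CER worse than the uncorrected OCR, which is why the metric is computed against the ground truth transcript rather than judged on fluency.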
Data
The benchmark consists of parallel OCR and ground truth data drawn from multiple curated historical collections, covering English, French, and German materials from the 17th to the 20th century, including newspapers and printed works. It combines existing resources with newly curated materials.
We look forward to your participation!