Large-scale digitized historical collections still contain substantial OCR errors. Re-processing millions of pages with improved engines is rarely feasible, making post-correction the most viable strategy for addressing the OCR debt accumulated in digital heritage collections. Recent progress in large language models opens promising new directions, but their effectiveness varies across languages and error types, and they may introduce hallucinations.
To what extent can modern large language models address the OCR debt accumulated in large-scale digitized historical collections?
HIPE-OCRepair 2026 addresses this question through HIPE-OCRepair-Bench, a unified multilingual benchmark comprising curated datasets, a standardised evaluation protocol, baseline systems, and an open leaderboard.
Participants correct noisy OCR transcripts of historical documents without access to the original images. For each text chunk (typically a paragraph or article), the dataset provides the noisy OCR transcript, and systems must produce a corrected version of the text. Both generative (LLM-based) and discriminative or hybrid approaches are welcome.
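To make the evaluation concrete, here is a minimal sketch of a character error rate (CER) computation, a metric commonly used to score OCR post-correction against ground truth. This is an illustration only; the official HIPE-OCRepair-Bench protocol may use different or additional metrics, and the sample strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Hypothetical example: a noisy OCR line, a system correction, and the gold text.
ocr_text  = "Tbe quick brovvn fox"
corrected = "The quick brown fox"
gold      = "The quick brown fox"
print(f"CER before correction: {cer(ocr_text, gold):.3f}")   # 3 edits / 19 chars
print(f"CER after correction:  {cer(corrected, gold):.3f}")  # 0.000
```

A lower CER after correction indicates improvement; a system that hallucinates text can end up with a CER worse than the uncorrected OCR, which is why the metric is computed against the ground truth transcript rather than judged on fluency.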
Data
The benchmark consists of parallel OCR and ground truth data drawn from multiple curated historical collections, covering English, French, and German materials from the 17th to the 20th century, including newspapers and printed works. It combines existing resources with newly curated materials.
We look forward to your participation!