Tesseract trained to better detect historical texts

Subhashish

Jun 11, 2026, 9:54:46 AM

to tesseract-ocr

Hi all,

My name is Subhashish and I am a native speaker of the Odia (Oriya) language. I've started training Tesseract on Chapakala 19, a typeface revival of the widely used 19th-century letterpress font.

Here is the PR for the 100K run: https://github.com/tesseract-ocr/tessdata_contrib/pull/13

More details about the process and results: https://github.com/ofdn/tessdata_contrib/tree/main/ori_hist

Chapakala 19: https://github.com/ofdn/Chapakala/tree/main/19

Subhashish

Jun 18, 2026, 4:09:51 AM

to tesser...@googlegroups.com

Hi all,

I've just submitted a PR to include Santali trained data:

https://github.com/tesseract-ocr/tessdata/pull/203

This will be the first Tesseract LSTM model for Santali written in Ol Chiki script.

A personal anecdote: though not a native speaker, I grew up listening to Santali in the place where I was born and spent the first few years of my life. I also initiated and coordinated Project Guru Gomke (https://www.mediawiki.org/wiki/Project_Ol_chiki) back in 2014. This trained data includes some synthetic text typed in Guru Gomke, a typeface created as part of Project Guru Gomke. Ramjit Tudu, an old friend and fellow Wikimedian, shared links to books on Wikisource, which were immensely helpful for training. I am still training and will keep working to improve the model.