Tesseract trained to better detect historical texts

Subhashish

Jun 11, 2026, 9:54:46 AM
to tesseract-ocr
Hi all,

My name is Subhashish and I am a native speaker of the Odia (Oriya) language. I've started training Tesseract on Chapakala 19, a typeface revival of the widely used 19th-century letterpress font.


More details about the process and results: https://github.com/ofdn/tessdata_contrib/tree/main/ori_hist


Subhashish

Subhashish

Jun 18, 2026, 4:09:51 AM
to tesser...@googlegroups.com
Hi all,

I've just submitted a PR to include Santali trained data:


This will be the first Tesseract LSTM model for Santali written in Ol Chiki script.

A personal anecdote: though not a native speaker, I grew up listening to Santali in the place where I was born and spent the first few years of my life. I also initiated and coordinated Project Guru Gomke (https://www.mediawiki.org/wiki/Project_Ol_chiki) back in 2014. This trained data includes some synthetic text typed in Guru Gomke, a typeface created as part of Project Guru Gomke. Ramjit Tudu, an old friend and fellow Wikimedian, shared links to books on Wikisource, which were immensely helpful for training. I am still training and will keep working on improving the model.

Subhashish