Tesseract trained to better detect historical texts
Subhashish
Jun 11, 2026, 9:54:46 AM
to tesseract-ocr
Hi all,
My name is Subhashish and I am a native speaker of the Odia (Oriya) language. I've started training Tesseract on Chapakala 19, a typeface revival of the widely used 19th-century letterpress font.
This will be the first Tesseract LSTM model for Santali written in Ol Chiki script.
A personal anecdote: though not a native speaker, I grew up listening to Santali where I was born and spent the first few years of my life. I also initiated and coordinated Project Guru Gomke (https://www.mediawiki.org/wiki/Project_Ol_chiki) back in 2014. The training data includes some synthetic text typed in Guru Gomke, a typeface created as part of Project Guru Gomke. Ramjit Tudu, an old friend and fellow Wikimedian, shared links to books on Wikisource, which were immensely helpful for training. I am still training the model and will keep working on improving it.