Multiple language OCR (Santali+Odia+English) combination is not working with gImageReader

Prasanta Hembram

Jun 1, 2020, 2:41:10 AM

to tesseract-ocr

I am trying to scan a Santali book that mixes several scripts (Ol Chiki, English, and Odia) with gImageReader 3.3.1 (17fa17), which uses Tesseract 4.1.0, but I am unable to get satisfactory results.

I have tried English + Odia, and that combination works fine and gives very good results. But when I use Santali + Odia, English + Santali, or Santali + Odia + English, the output text comes out as Odia, English, or Odia and English respectively, instead of showing Ol Chiki text where it belongs. I have a file available for testing.
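To compare these combinations outside gImageReader, the Tesseract command line accepts several languages joined with `+` in its `-l` option. The following is only a minimal sketch: the language codes `sat`, `ori`, and `eng` are assumed from the tessdata file names, `page.png` is a hypothetical image, and the helper names are made up for illustration. It only builds the commands for every language ordering (the order may affect the result), it does not run them.

```python
from itertools import permutations

# Assumed language codes, taken from the tessdata folder names (sat, ori)
# plus the standard English code.
LANGS = ("sat", "ori", "eng")


def tesseract_command(image, out_base, langs, tessdata_dir=None):
    """Build a tesseract CLI call that loads several languages at once."""
    cmd = ["tesseract", image, out_base, "-l", "+".join(langs)]
    if tessdata_dir:
        # Point Tesseract at a custom folder holding the .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def all_orderings(image, langs=LANGS):
    """One command per language ordering, each writing to its own output file."""
    return [
        tesseract_command(image, "out_" + "_".join(order), order)
        for order in permutations(langs)
    ]


if __name__ == "__main__":
    # "page.png" is a hypothetical test image.
    for cmd in all_orderings("page.png"):
        print(" ".join(cmd))
```

Running each printed command on the same page and comparing the output files would show whether the ordering changes which script wins.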

Also, when I use only the Santali tessdata, it transliterates English and Odia words into Ol Chiki script.

When I use "sat.tessdata" to scan a normal Santali image, it works well.

Note: Ol Chiki is the main writing script of the Santali people and is approved by the Government of India. I think Ol Chiki is a new script that many programs do not support well, so the recognized text output always shows boxes. I solved this problem by copying the text into Notepad and saving it. Exporting to PDF is fine; I created editable text from it without any problem. I have created many editable OCR PDFs with gImageReader.

My question is how to get combined multi-language output in Santali, Odia, and English. I also want to know why processing the image gives output for English and Odia but not for Santali, or vice versa.

I have tried to train the language, but it takes a lot of time and I have little knowledge of coding. If there is any problem with sat.tessdata, then I can take it up by learning Tesseract training.

I have used the following tessdata:

Santali - https://github.com/indic-ocr/tessdata/tree/master/sat
Odia - https://github.com/indic-ocr/tessdata/tree/master/ori
English - default of gImageReader
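As a quick sanity check before testing combinations, it may help to confirm that all three model files sit in the same tessdata folder, since Tesseract loads each language from a file named `<code>.traineddata`. This is a rough sketch only: the codes `sat`, `ori`, and `eng` and the helper name `missing_traineddata` are assumptions, and the folder path has to be replaced with wherever gImageReader keeps its tessdata.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs=("sat", "ori", "eng")):
    """Return the language codes whose <code>.traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [
        code for code in langs
        if not (folder / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Hypothetical path; substitute the real tessdata folder.
    print(missing_traineddata("tessdata"))
```

An empty list means all three files were found; any code that is listed would need to be downloaded into that folder first.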
