Adding Modi Script to Tesseract

Nilambari Joshi

Jan 26, 2020, 10:52:43 AM
to tesseract-ocr
Hi, I want to create Modi script (Marathi language) traineddata for Tesseract OCR. Could somebody explain which steps I should follow?
I referred to https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but got stuck at the stage of creating box files.

Shree Devi Kumar

Jan 26, 2020, 11:16:50 AM
to tesseract-ocr
Is there a Unicode font for Modi script?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f2fe2399-e6d4-4a7e-886e-7337176e2304%40googlegroups.com.

Patrick CHEW

Jan 26, 2020, 11:58:11 AM
to tesser...@googlegroups.com

On Jan 26, 2020, at 08:16, Shree Devi Kumar <shree...@gmail.com> wrote:

Is there a Unicode font for modi script?

Shree Devi Kumar

Jan 26, 2020, 12:26:51 PM
to tesseract-ocr
Thanks for the link to Modi Unicode font.

I would convert the Marathi training text to Modi script (using Aksharamukha) and then train using the Unicode font.

Nilambari Joshi

Jan 26, 2020, 3:49:33 PM
to tesseract-ocr
Thanks for your response. I will proceed as suggested. Could you also clarify whether I need to create a separate language directory for Modi, similar to the Marathi one, with all the files such as numbers, punc and wordlist, plus a separate unicharset file?
Thanks in advance.

Nilambari Joshi

Jan 26, 2020, 3:50:02 PM
to tesseract-ocr
Thanks for your response

Shree Devi Kumar

Jan 27, 2020, 3:22:17 AM
to tesseract-ocr
For LSTM training, the punc, numbers and wordlist files are NOT required; you can add them if you like. The unicharset is generated from the training text.
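
To get a feel for which characters the training text will contribute, a rough inventory can be printed before training. This is only an inspection aid under assumptions: Tesseract's own tools build the real unicharset and may group characters differently, and `char_inventory` and the sample string are hypothetical.

```python
from collections import Counter
import unicodedata


def char_inventory(text):
    """Count each non-whitespace character in a training text."""
    return Counter(ch for ch in text if not ch.isspace())


if __name__ == "__main__":
    # Placeholder sample: two Modi letters, one of them repeated.
    sample = "\U00011600\U00011601 \U00011600"
    for ch, count in char_inventory(sample).most_common():
        name = unicodedata.name(ch, "UNKNOWN")
        print(f"U+{ord(ch):05X} {name} x{count}")
```

Running this on the converted training text would show whether any Devanagari or Latin characters survived the conversion to Modi.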

Are you planning to train from text or images?

Nilambari Joshi

Jan 28, 2020, 11:25:59 AM
to tesseract-ocr
I was trying to do it with an image. I found an image online with all the Modi script characters and tried to create a box file for it.
In the box file I can see that each character is treated as an English character.
My question is how to make it recognise them as Modi characters.

Then I tried to use tesstrain.sh as below
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --fontlist MarathiCursiveT --lang mar --linedata_only --noextract_font_properties --langdata_dir ../tesstutorial/langdata --tessdata_dir ../tesstutorial/tesseract/tessdata --training_text ../tesstutorial/langdata/mar/mar.modi.training_text --output_dir ../tesstutorial/moditrain

I got (by running make) the MarathiCursiveT TrueType Unicode Modi font from https://github.com/MihailJP/MarathiCursive, the link mentioned in response to my query.
I placed the font file in /usr/share/fonts/truetype/MarathiCursiveT.

I created mar.modi.training_text by pasting the contents of the Marathi training text file into the Aksharamukha app and saving the Modi output.

With tesstrain.sh I get the error: Could not find font named 'MarathiCursiveT'. Pango suggested font 'MarathiCursiveT Medium'.

Please advise on both queries. Thanks in advance.

Shree Devi Kumar

Jan 28, 2020, 12:28:19 PM
to tesseract-ocr

MarathiCursiveT Medium
Use the above as the font name with tesstrain.sh.

How are you creating the box file for the image?


Nilambari Joshi

Jan 28, 2020, 1:40:37 PM
to tesser...@googlegroups.com
I tried using MarathiCursiveT Medium as the font in --fontlist and it worked. Thanks for that.
It created traineddata and unicharset files in the destination folder.
I hope I can now continue with the further instructions at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

The box file was created using the command: tesseract A.png A lstmbox
where A.png is the image with Modi characters.


Shree Devi Kumar

Jan 29, 2020, 1:20:31 AM
to tesseract-ocr
The default language that Tesseract uses when none is specified is eng. Hence you get a box file with English characters.

There is currently no `Modi` traineddata, so you can't use that. You could use `-l mar` to run with Marathi, but obviously the recognition will not be correct.

I suggest that you use `wordstrbox` instead of `lstmbox` - it will make it easier to correct the box files.
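
The two suggestions above can be combined into a single invocation. A minimal sketch that only builds the command line, assuming the `wordstrbox` config ships with the installed Tesseract and that `mar.traineddata` is available; `box_command` is a hypothetical helper, not part of Tesseract.

```python
from pathlib import Path
import shlex


def box_command(image, lang="mar", config="wordstrbox"):
    """Build the tesseract argv; the output base is the image path without its extension."""
    image = Path(image)
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang, config]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be checked first.
    print(shlex.join(box_command("A.png")))
```

For A.png this prints `tesseract A.png A -l mar wordstrbox`; the resulting box file would still need its transcriptions corrected by hand, since Marathi recognition of Modi text will be wrong.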

Have you looked at the tesstrain repo for training from images?

shree

Jan 31, 2020, 12:39:31 AM
to tesseract-ocr
Please see https://github.com/Shreeshrii/tesstrain-modi for fine-tuning the Marathi model for Modi using synthetic training data in two Unicode fonts. However, since Modi documents are mostly handwritten in a cursive style, the training should preferably be done using images.

Nilambari Joshi

Jan 31, 2020, 1:30:25 PM
to tesseract-ocr
Thank you very much for the fine-tuned traineddata for Modi. It gives good results (with some deviation) for images generated through Aksharamukha. But, as guessed, for a scanned copy of handwritten text the result is still poor. I will try using images for training.
As I understand it, Tesseract doesn't support image-based training by default and needs some extra steps to be followed.

If possible, please share links with details about training Tesseract with images. Thanks once again.

Nilambari Joshi

Jan 31, 2020, 8:39:21 PM
to tesseract-ocr
I need further guidance on the following points:
1. What steps need to be followed after the box files are created?
2. Some characters are not correctly recognized in the box file; am I expected to correct them manually?
3. How many image files are needed to train it properly?
Thanks in advance
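
For point 2, one rough way to find box-file lines that need manual correction is to flag transcriptions containing characters outside the Unicode Modi block (U+11600 to U+1165F). This is a sketch under assumptions: it expects WordStr-style lines whose transcription follows the first `#`, and the helper names are hypothetical.

```python
MODI_START, MODI_END = 0x11600, 0x1165F  # Unicode Modi block


def non_modi_chars(text):
    """Return the non-whitespace characters of text that lie outside the Modi block."""
    return [
        ch for ch in text
        if not ch.isspace() and not (MODI_START <= ord(ch) <= MODI_END)
    ]


def suspicious_lines(box_text):
    """Yield (line_number, offending_chars) for WordStr lines with non-Modi text."""
    for number, line in enumerate(box_text.splitlines(), start=1):
        if not line.startswith("WordStr "):
            continue  # assumption: other rows carry no transcription
        _, _, transcription = line.partition("#")
        bad = non_modi_chars(transcription)
        if bad:
            yield number, bad
```

Legitimate ASCII punctuation or digits would also be flagged, so the output is a list of lines to look at rather than a list of errors.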

Shree Devi Kumar

Jan 31, 2020, 10:58:31 PM
to tesseract-ocr
The tesseract-ocr/tesstrain repo has a Makefile for training with images.

See https://github.com/tesseract-ocr/tesstrain/wiki/German-Konzilsprotokolle for details about training of handwriting with images.

I suggest you first choose a set of images of one kind of typeface, similar to the ones in the fonts, and test the process.

LSTM training doesn't need precise character-level boxes. You just need to provide a correct line-level transcription.
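
In practice this means pairing each line image with a one-line text file. A small sketch for checking such a folder before training, assuming the tesstrain convention of `NAME.png` or `NAME.tif` paired with `NAME.gt.txt`; the helper name is hypothetical and other image suffixes are deliberately ignored.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".tif"}  # assumption: plain .png/.tif line images only


def check_ground_truth(folder):
    """Return (missing, not_single_line): images lacking a .gt.txt file, and
    .gt.txt files that do not hold exactly one non-empty line."""
    missing, not_single_line = [], []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        gt = image.with_suffix(".gt.txt")
        if not gt.exists():
            missing.append(image.name)
            continue
        lines = gt.read_text(encoding="utf-8").splitlines()
        if len([line for line in lines if line.strip()]) != 1:
            not_single_line.append(gt.name)
    return missing, not_single_line
```

Both returned lists should be empty before starting a training run on the folder.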

Shree Devi Kumar

Jan 31, 2020, 11:36:05 PM
to tesseract-ocr
If you send a couple of scanned images with their ground truth transcription and box files, I can test with that and suggest next steps.