How to efficiently extend the training_text file?

peter bence

Oct 10, 2019, 3:15:51 AM

to tesseract-ocr

I'm working with the Arabic `langdata_lstm`, whose `training_text` file has only 84 lines of training text. I believe that is too small for building/training a reliable model. Reading the `training_text` file, I see randomly generated text with no meaning. At first I thought this was an Arabic-specific problem, but later I found that it is the same for all other languages.

My questions are:

1. What specifications are followed when generating these `training_text` files? (For example, I can see that no line is longer than 60 characters. Is this one of the specifications?)

2. Could I simply extend the `training_text` file, generate my training data with custom fonts, and start training directly? Or are there other files that must be changed after changing this file? If so, which files are they, and how do I regenerate them?

Best Regards

Shree Devi Kumar

Oct 10, 2019, 7:18:02 AM

to tesseract-ocr

See https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

This was for Devanagari and Indic languages.

Also see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-text-requirements
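
For the mechanical part of extending the file, here is a minimal Python sketch, not an official tool. It wraps extra text into short lines, skips lines that are already present, and rewrites the `training_text` file. The 60-character width is only the observation from the question, not a confirmed specification, and the helper names `wrap_corpus` and `extend_training_text` are made up for illustration. It does not address whether other langdata files need regenerating; check the links above for that.

```python
import textwrap
from pathlib import Path

# Assumption: 60 comes from the line length observed in the question,
# not from a documented Tesseract requirement.
MAX_WIDTH = 60


def wrap_corpus(text: str, width: int = MAX_WIDTH) -> list[str]:
    """Split free-form text into non-empty lines of at most `width` characters.

    A single word longer than `width` is kept whole, so such a line can
    exceed the limit.
    """
    lines: list[str] = []
    for paragraph in text.splitlines():
        paragraph = " ".join(paragraph.split())  # collapse runs of whitespace
        if not paragraph:
            continue
        lines.extend(
            textwrap.wrap(
                paragraph,
                width=width,
                break_long_words=False,
                break_on_hyphens=False,
            )
        )
    return lines


def extend_training_text(training_text: Path, new_text: str,
                         width: int = MAX_WIDTH) -> int:
    """Append wrapped, de-duplicated lines to the file; return how many were added."""
    existing: list[str] = []
    if training_text.exists():
        existing = training_text.read_text(encoding="utf-8").splitlines()

    seen = set(existing)
    added: list[str] = []
    for line in wrap_corpus(new_text, width):
        if line not in seen:
            seen.add(line)
            added.append(line)

    # Rewrite the whole file so every line ends with exactly one newline.
    training_text.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return len(added)
```

Usage would look something like `extend_training_text(Path("langdata_lstm/ara/ara.training_text"), Path("my_corpus.txt").read_text(encoding="utf-8"))`, where both paths are placeholders for your own layout. Keep a backup of the original file, since the sketch overwrites it in place.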
