Explanation for training_text and wordlist files

Dd U

Jul 4, 2018, 4:02:13 AM

to tesseract-ocr

Hello guys.

I want to add a new language script to Tesseract OCR and am researching how to prepare the training data.

I would like to know the following:

Is there an automatic tool that generates the langdata training_text and wordlist files from a large body of text?
Is there any documentation on preparing the text data that explains the text data files? I looked at the langdata/jpn/ directory and saw several files, but I have no idea what these files are or how to create files like them. What rules should I follow when creating langdata files?

Shree Devi Kumar

Jul 6, 2018, 1:22:59 PM

to tesser...@googlegroups.com

See the following comment by Ray on building training data:

https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

On Fri 6 Jul, 2018, 10:38 PM James Q, <james.qu...@taina.tech> wrote:

No tool I can think of. What I would do is open the file in an editor that handles large text files (such as EmEditor) and remove duplicate words. You can do this by replacing all spaces with newlines, then sorting and removing duplicates. After that, randomize the unique list of words, add an appropriate distribution of punctuation characters, and re-edit to create a block of text wrapped at, say, 100 characters. There are online tools that do the randomizing and wrapping.
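The manual steps above could also be scripted. The following is only a rough sketch in Python, not an official Tesseract tool: the function name `build_training_text`, the punctuation set, and the `punct_rate` value are all assumptions made for illustration. It also assumes words are separated by whitespace, which does not hold for scripts such as Japanese.

```python
import random
import textwrap


def build_training_text(corpus, punctuation=".,;:!?", punct_rate=0.1,
                        width=100, seed=0):
    """Return (wordlist, training_text) built from a raw corpus string.

    Hypothetical helper: the defaults are illustrative, not values
    taken from the Tesseract langdata files.
    """
    rng = random.Random(seed)

    # Split on whitespace, then sort and remove duplicates.
    wordlist = sorted(set(corpus.split()))

    # Randomize the order of the unique words.
    shuffled = wordlist[:]
    rng.shuffle(shuffled)

    # Attach a punctuation mark to a fraction of the words.
    tokens = [
        w + rng.choice(punctuation) if rng.random() < punct_rate else w
        for w in shuffled
    ]

    # Wrap the result into lines of at most `width` characters
    # (a single word longer than `width` is left unbroken).
    training_text = textwrap.fill(
        " ".join(tokens),
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )
    return wordlist, training_text
```

Calling `build_training_text(open("corpus.txt", encoding="utf-8").read())` (with `corpus.txt` standing in for your own text) returns the sorted unique word list and the wrapped block of text, which you could then write out one word per line and as a text file respectively.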

Having said this, I don't know how valuable it is to have training text containing specific words. I have been struggling to train on specific word lists myself, without much success. I think what matters in the training text is a representative distribution of characters. Please let me know if you have any insights on the wordlists in langdata, as I'm a bit hazy there.

Thanks
James

Shree Devi Kumar

Jul 6, 2018, 1:27:49 PM

to tesser...@googlegroups.com

Also see a community-contributed Perl script for generating langdata at https://github.com/tesseract-ocr/tesseract/tree/master/contrib
