Explanation for training_text and wordlist files


Dd U

Jul 4, 2018, 4:02:13 AM
to tesseract-ocr

Hello guys.


I want to add a new language script to Tesseract OCR and am researching how to prepare the training data.


I would like to know the following:

  1. Is there an automatic tool that generates the langdata training_text and wordlist files from a large body of text?
  2. Is there any documentation on preparing the text data, with an explanation of the text data files? I looked at the langdata/jpn/ directory and saw several files, but I have no idea what these files are or how to create files like them. What rules should I follow to create langdata files?

Shree Devi Kumar

Jul 6, 2018, 1:22:59 PM
to tesser...@googlegroups.com
See the following link to a comment by Ray regarding the building of training data.


On Fri 6 Jul, 2018, 10:38 PM James Q, <james.qu...@taina.tech> wrote:
No tool that I can think of. What I would do is open the file in an editor that handles large text files (such as EmEditor) and remove duplicate words. You could do this by replacing all spaces with newlines, then sorting and removing duplicates. After that you can randomize the unique list of words, add an appropriate distribution of punctuation characters, and re-edit to create a block of text wrapped at, say, 100 characters. There are online tools to do the randomizing and wrapping.

Having said this, I don't know how valuable it is to have training text containing specific words. I have been struggling myself to train on specific word lists without much success. I think training text is mostly about a representative distribution of characters. Please let me know if you have any insights on the wordlists in langdata, as I'm a bit hazy there.

Thanks
James
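
The manual steps James describes (split into words, sort and de-duplicate, shuffle, sprinkle in punctuation, wrap at about 100 characters) could be scripted roughly as follows. This is only a minimal sketch and not a Tesseract tool: the function names, the punctuation set and the 10% punctuation rate are all assumptions made for illustration, and it assumes words are separated by whitespace, which does not hold for scripts such as Japanese.

```python
import random
import textwrap


def unique_words(corpus):
    """Return the sorted list of distinct whitespace-separated words.

    Splitting on whitespace has the same effect as replacing spaces with
    newlines, sorting, and removing duplicate lines in an editor.
    """
    return sorted(set(corpus.split()))


def make_training_text(words, punctuation=".,?!", punct_rate=0.1,
                       width=100, seed=0):
    """Shuffle the words, append punctuation to some of them, and wrap.

    punctuation and punct_rate are illustrative guesses; pick values that
    reflect how punctuation is actually distributed in the target language.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    shuffled = list(words)
    rng.shuffle(shuffled)

    tokens = []
    for word in shuffled:
        if rng.random() < punct_rate:
            word += rng.choice(punctuation)
        tokens.append(word)

    # Wrap at `width` characters without splitting words across lines.
    return textwrap.fill(" ".join(tokens), width=width,
                         break_long_words=False, break_on_hyphens=False)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log"
    words = unique_words(sample)
    print("\n".join(words))            # one word per line
    print(make_training_text(words))   # wrapped block of shuffled words
```

The unique word list and the wrapped block are kept as separate outputs so that each can be written to its own file. Whether training text built this way actually improves recognition is the open question raised above.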


Shree Devi Kumar

Jul 6, 2018, 1:27:49 PM
to tesser...@googlegroups.com
Also see a community-contributed Perl script for generating langdata at https://github.com/tesseract-ocr/tesseract/tree/master/contrib