What is the proper format of the word list file for training tesseract?


Sim Tov

Jun 20, 2021, 1:33:49 AM
to tesseract-ocr

Hello,

The "Creating Starter Traineddata" section of the documentation says that optional word list files can be supplied for training.

1. What is the proper format for this file?
2. Is there an example of such a file online?
3. Can a standard MySpell/HunSpell/etc. dictionary be used for this purpose? If so, which formats are supported?

Thank you in advance!
ST

Zdenko Podobny

Jun 20, 2021, 7:04:57 AM
to tesser...@googlegroups.com

On Sun, Jun 20, 2021 at 7:33, Sim Tov <smn...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ffc64b9c-9020-4398-9d17-c15f832d6b38n%40googlegroups.com.

Sim Tov

Jun 23, 2021, 5:16:33 AM
to tesser...@googlegroups.com
Zdenko, thank you very much!

1. As far as I understand, eng.wordlist is just a plain text file with a single word per line. Is that the correct format?

2. Is this file used *only* to generate synthetic text for teaching Tesseract a new language, or is this vocabulary *also* used by Tesseract to guess (when in doubt) during word recognition? Or are spell checker dictionaries used for that purpose rather than eng.wordlist?
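
To make point 1 concrete, here is a minimal sketch of producing such a file, assuming the format really is plain UTF-8 text with one word per line. That format is the understanding stated above, not something confirmed in this thread. The function name `write_wordlist` and the sample words are hypothetical.

```python
from pathlib import Path


def write_wordlist(words, path):
    """Write unique, non-empty single words to `path`, one per line, as UTF-8.

    Assumption: the word list is plain text with one word per line.
    """
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        # Skip blanks, duplicates, and entries with inner whitespace
        # (those would not be a single word on a line).
        if not word or word in seen or any(ch.isspace() for ch in word):
            continue
        seen.add(word)
        lines.append(word)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    # Hypothetical sample input; the output file name mirrors eng.wordlist.
    write_wordlist(["the", " quick ", "", "the", "brown fox"], "eng.wordlist")
```

Entries are kept in input order, so any frequency ordering in the source list is preserved.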

Thank you!
