Creating wordlist from high confidence words


Edvard Fagerholm

Feb 22, 2018, 5:21:26 AM
to tesseract-ocr
I have a fairly large dataset that contains scans of various quality. However, all the documents are of the same type and the vocabulary is therefore very uniform. I would like to do basically the following:

1. Run tesseract on all the data and dump every word whose confidence is over some threshold, say 95, to a file.

2. Rerun tesseract on the data with the computed wordlist provided as a prior for likely words.

Is this currently supported by the tool?

Best,
Edvard
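
For step 1, a minimal sketch of the filtering, assuming the first pass was run with tesseract's TSV output (for example `tesseract page.png page tsv`, available in reasonably recent versions) and that the file has the usual header row with `conf` and `text` columns. The function name `high_confidence_words` and the sample data are hypothetical, made up for illustration only.

```python
import csv
import io


def high_confidence_words(tsv_text, threshold=95.0):
    """Return the sorted unique words whose confidence is >= threshold.

    Assumes tsv_text is tesseract TSV output with a header row that
    includes 'conf' and 'text' columns (an assumption about the format).
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = set()
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            # Non-word rows (blocks, lines) carry conf = -1 and empty text.
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip malformed rows instead of failing
        if text and conf >= threshold:
            words.add(text)
    return sorted(words)


# Hypothetical sample standing in for one page of TSV output.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t20\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t60\t20\t96\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t80\t10\t60\t20\t71\tnurnber\n"
    "5\t1\t1\t1\t1\t3\t150\t10\t60\t20\t97.5\tTotal\n"
)

if __name__ == "__main__":
    print(high_confidence_words(SAMPLE_TSV))
```

Collecting the words into a set across all pages gives the wordlist for the second pass. It may be worth also requiring that a word is seen more than once, since a single high-confidence misread would otherwise end up in the list.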

ShreeDevi Kumar

Feb 22, 2018, 5:56:40 AM
to tesser...@googlegroups.com
Take a look at the --user-words option and the commands combine_tessdata, dawg2wordlist and wordlist2dawg.

You can change the wordlist, and that may improve the chances of a word being recognised, but I don't think recognition is limited to the list.

It also depends on the version of tesseract that you are using.
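
For the second pass, a rough sketch of how the computed wordlist could be handed to tesseract through `--user-words`, assuming the option takes a plain text file with one word per line. The helper names `wordlist_text` and `second_pass_command` and the file names are hypothetical, and how much the list actually influences recognition depends on the tesseract version, as noted above.

```python
def wordlist_text(words):
    """Format words as a one-word-per-line text, deduplicated and sorted."""
    unique = sorted({w.strip() for w in words if w.strip()})
    return "".join(word + "\n" for word in unique)


def second_pass_command(image, output_base, wordlist_path, lang="eng"):
    """Build the argument list for a second tesseract run.

    Only constructs the command; pass it to subprocess.run() to execute.
    The paths and the default language are illustrative assumptions.
    """
    return [
        "tesseract",
        str(image),
        str(output_base),
        "-l",
        lang,
        "--user-words",
        str(wordlist_path),
    ]


if __name__ == "__main__":
    # Hypothetical words from the first pass and hypothetical file names.
    print(wordlist_text(["Total", "Invoice", "Total"]), end="")
    print(second_pass_command("page.png", "page_pass2", "words.txt"))
```

Writing the result of `wordlist_text` to `words.txt` and running the command once per image would complete the loop. Replacing the word dawg inside the traineddata with combine_tessdata and wordlist2dawg is the heavier alternative and is not shown here.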
