Trying to understand custom dictionaries


Traun Leyden

Jul 20, 2014, 3:27:46 AM
to tesser...@googlegroups.com

I followed the instructions in the FAQ entry "How do I provide my own dictionary" (Tesseract 3) to create a custom dictionary.

In my custom dictionary, I only have the following words:

local
variables
variable
name
names

When I ran tesseract against this test image, the output was:

You can ereate local variables for the pipelines within the template by
prefixing the variable name with a “$" Sign. Variable names have to be
eomposed of alphanumeric characters and the underseore. In the example
below I have used a few variations that work for variable names.

I was expecting the output to contain _only_ words from the custom dictionary (e.g., "local", "variable", etc.).

Am I misunderstanding how custom dictionaries are supposed to work?  Are the words in a custom dictionary merely a "hint" rather than a constraint on what words can be emitted in the ocr output?

Here are the steps I used to regenerate a new eng.traineddata file:

$ combine_tessdata -u tessdata/eng.traineddata /tmp/eng.
$ wordlist2dawg eng.wordlist eng.word-dawg eng.unicharset   # eng.wordlist contains the word list above ("local", "variables", etc.)
$ combine_tessdata /tmp/eng.
$ mv eng.traineddata ~/tmp/tessdata/eng.traineddata

And here is how I called tesseract:

$ tesseract --tessdata-dir /tmp ocrimage ocrimage 

I'm using the latest subversion trunk version, built via this dockerfile.

Victoria A.

Jul 24, 2014, 8:53:56 AM
to tesser...@googlegroups.com
From my experience, seeing that Tesseract's English training data can recognize words that are NOT contained in the dictionary, I suppose Tesseract only uses the custom dictionary for "hints" instead of only knowing the words in the dictionary.

I'm sure the word "ereate" you got from running Tesseract against that image is not contained in Tesseract's original English dictionary, yet there it was. 

Christopher Smeenk

Aug 10, 2014, 1:30:02 PM
to tesser...@googlegroups.com
Hello Traun,

I am also interested in using tesseract to recognize words from a selected list. Sorry, but I don't have an answer to your question.

I am thinking about using tesseract to recognize data on scanned forms.
Is it necessary to completely retrain tesseract using the custom dictionary a user provides? Or is it possible to override the default behaviour using eng.user-words? 

Chris

Nick White

Aug 12, 2014, 12:34:32 PM
to tesser...@googlegroups.com
On Thu, Jul 24, 2014 at 05:53:56AM -0700, Victoria A. wrote:
> From my experience, seeing that Tesseract's English training data can recognize
> words that are NOT contained in the dictionary, I suppose Tesseract only uses
> the custom dictionary for "hints" instead of only knowing the words in the
> dictionary.

Yes, that is exactly correct.

Traun & Christopher, if you want to only have certain recognised
words printed, the only way to do it is to recognise everything,
then run a regex or some other script over the output afterwards.
Tesseract doesn't do that itself.

Nick
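
[Editor's note] The post-processing approach suggested above can be sketched roughly as follows. This is only a minimal illustration, not something from the thread: the function name keep_allowed_words and the ALLOWED_WORDS set are hypothetical, the word list is the one from the original question, and the sample text is the first two lines of the OCR output quoted there. It assumes that splitting on runs of ASCII letters and matching case-insensitively is acceptable for your data.

```python
import re

# Hypothetical word list, taken from the custom dictionary in the question.
ALLOWED_WORDS = {"local", "variables", "variable", "name", "names"}


def keep_allowed_words(ocr_text, allowed=ALLOWED_WORDS):
    """Return the words of ocr_text that are in `allowed`, in original order.

    Matching is case-insensitive; the original casing is preserved.
    """
    # Tokenize on runs of ASCII letters (an assumption; adjust for other scripts).
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    return [t for t in tokens if t.lower() in allowed]


if __name__ == "__main__":
    # First two lines of the OCR output quoted in the question.
    sample = (
        'You can ereate local variables for the pipelines within the template by\n'
        'prefixing the variable name with a "$" Sign. Variable names have to be\n'
    )
    print(keep_allowed_words(sample))
    # ['local', 'variables', 'variable', 'name', 'Variable', 'names']
```

Note that a filter like this only discards words; it does not correct misrecognitions such as "ereate", which would need fuzzy matching against the word list instead.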

Monte Shaffer

Aug 16, 2014, 7:15:40 AM
to tesser...@googlegroups.com
Tesseract has a white-list for glyphs ... 

not words but glyphs.

Much of tesseract is "hints" and possibilities.  2014 is not 1997 ... it would be nice if we understood how to best train it.  I have built a tool to train tesseract, but it doesn't seem to improve my default results much.
