Can't get the user dictionary to work

patrickq

Jul 30, 2010, 9:04:28 AM
to tesseract-ocr
This is what I did:

1. Created a text file called eng.user-words, containing:
Chest
Chestnut
Floor
Vice

2. Placed the file in the tessdata folder (next to eng.traineddata)

3. Ran recognition on an image. It returned "Chesf" instead of "Chest"
and "Fioor" instead of "Floor". The real "t" and "l" look perfectly
clear in the image, so I can only assume the confidence for the
mistaken "f" and "i" would be low (but I didn't check).

No effect whatsoever - zip. I can only assume that a variable must be
set or a function needs to be called to turn this on (even though
there is no mention of needing to set anything in the documentation)
or (most likely) I just don't understand how this works and the
dictionary kicks in only on the day of the summer solstice and when
there is a full moon or something.

Patrick

Sven Pedersen

Jul 30, 2010, 1:55:45 PM
to tesser...@googlegroups.com
Patrick,
This is a known issue which has been discussed in the last three days.
Please look in the archives or check the emails you've received from
the list for the last few days.
--Sven

--
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

patrickq

Jul 30, 2010, 3:12:43 PM
to tesseract-ocr
Hi Sven,

Not only did I read these posts, but I was the one to whom Jimmy
kindly responded. Here is one quote:

"At any point, if you ask Tesseract what the 'word' it sees is, it
will
simply give you a string composed of the highest-confidence
characters: the word structure also keeps an array of possible
characters along with the confidence from the recogniser. The weight
from a dictionary can add extra weight to a set of characters, but
only if the set of characters that word is composed from is among the
set of choices (some other steps can add or remove characters...
etc)."

Although I did not debug to inspect the alternative choices for the
mistaken 'f' and 'i', it's a reasonable expectation that 't' and 'l'
would be next in line in these two cases respectively, because these
ARE the letters clearly appearing in this image and these are known
frequent mistakes. I'd say 'i' instead of 'l' is the most common
mistake. So I think it's reasonable that I would be disappointed.

If I missed something else that would indicate how I can make it work,
please clarify!

Thanks,
Patrick

Sven Pedersen

Jul 30, 2010, 5:48:28 PM
to tesser...@googlegroups.com
In a conversation between Philip Pemberton and Jimmy on the 27th, it
seems that the user wordlist may not work for Tesseract 3. You may
need to call the file 'eng.' or $LANG. and put it in the traindata
folder. It sounds like Jimmy is eventually planning to improve the
situation. In the mean time you may have to train tesseract yourself
with your corpus (and font) to improve results, or do image
manipulations (resize/adjust) to improve the input at runtime.
--Sven

Jimmy O'Regan

Jul 30, 2010, 5:59:00 PM
to tesser...@googlegroups.com
On 30 July 2010 20:12, patrickq <patrick.q...@gmail.com> wrote:
> Hi Sven,
>
> Not only did I read these posts, but I was the one to which Jimmy
> kindly responded. Here is one quote:
>
> "At any point, if you ask Tesseract what the 'word' it sees is, it
> will
> simply give you a string composed of the highest-confidence
> characters: the word structure also keeps an array of possible
> characters along with the confidence from the recogniser. The weight
> from a dictionary can add extra weight to a set of characters, but
> only if the set of characters that word is composed from is among the
> set of choices (some other steps can add or remove characters...
> etc)."
>

I think I managed to miss mentioning it completely, but nothing
*forces* a word to be recognised as a dictionary word; the dictionary
is just used to establish character confidences. Really, where you
see the difference is across a longer piece of text, when the adaptive
classifier has seen enough examples to know "hey, this thing I thought
was an 'f' might actually be a 't'". In short texts, there's not much
to adapt to. Making a bunch of training images, drawing boxfiles,
etc., only goes so far, so tess uses the dictionary as an
approximation; a low-confidence equivalent of training pages.
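
A toy sketch of the behaviour described above, in Python. This is not Tesseract's code: the function name `best_word`, the scoring rule (mean confidence plus a flat bonus) and the confidence numbers are all made up for illustration. It only shows why a word list can tip the balance towards "Chest" when 't' is among the candidate characters, and cannot help when it is not.

```python
from itertools import product


def best_word(choices, dictionary, bonus=0.2):
    """Pick the best-scoring string from per-position character choices.

    choices: one dict per character position, mapping a candidate
    character to a (hypothetical) classifier confidence.
    A candidate string scores the mean of its character confidences,
    plus `bonus` if the string is in `dictionary`. Words that cannot be
    spelled from the candidate lists are never considered at all.
    """
    best, best_score = None, float("-inf")
    for chars in product(*(c.keys() for c in choices)):
        word = "".join(chars)
        score = sum(c[ch] for c, ch in zip(choices, chars)) / len(chars)
        if word.lower() in dictionary:
            score += bonus  # the dictionary only adds weight
        if score > best_score:
            best, best_score = word, score
    return best


user_words = {"chest", "chestnut", "floor", "vice"}
prefix = [{"C": 0.9}, {"h": 0.9}, {"e": 0.9}, {"s": 0.9}]

# 't' is the runner-up for the last character: the dictionary can help.
with_t = prefix + [{"f": 0.6, "t": 0.5}]
# 't' was never proposed by the classifier: the dictionary cannot help.
without_t = prefix + [{"f": 0.6, "l": 0.3}]

print(best_word(with_t, user_words))              # Chest
print(best_word(without_t, user_words))           # Chesf
print(best_word(with_t, user_words, bonus=0.01))  # Chesf (bonus too weak)
```

The last call shows the other half of the point: even when the right character is a candidate, a dictionary match is only extra weight, so a confident enough misreading still wins.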

On the plus side, it turns out that there are functions buried in the
code to serialise/deserialise the classifier state, so it might be
useful to run a whole corpus of short images through tess in one
batch, save the state, and load that at startup.
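
The "save the state, load it at startup" idea can be sketched generically. Everything below is a hypothetical stand-in: `ToyAdaptiveState`, its methods and the JSON file format are invented for illustration and have nothing to do with the actual serialisation functions in the Tesseract source.

```python
import json
import os
import tempfile
from collections import Counter


class ToyAdaptiveState:
    """Hypothetical adaptive state: for each glyph shape, a count of
    which character it was finally resolved to."""

    def __init__(self):
        self.seen = {}  # shape id -> Counter of characters

    def observe(self, shape, char):
        self.seen.setdefault(shape, Counter())[char] += 1

    def prefer(self, shape, default):
        """Return the most frequently seen character for this shape,
        or `default` if the shape has never been observed."""
        counts = self.seen.get(shape)
        return counts.most_common(1)[0][0] if counts else default

    def save(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.seen, fh)  # Counter is a dict subclass

    @classmethod
    def load(cls, path):
        state = cls()
        with open(path, encoding="utf-8") as fh:
            state.seen = {k: Counter(v) for k, v in json.load(fh).items()}
        return state


# "Batch run": adapt on many short samples, then persist the result.
state = ToyAdaptiveState()
for shape, char in [("tall-stem", "t"), ("tall-stem", "t"), ("tall-stem", "f")]:
    state.observe(shape, char)
path = os.path.join(tempfile.mkdtemp(), "adapt.json")
state.save(path)

# "Startup": a fresh process starts from the saved state, not from zero.
warm = ToyAdaptiveState.load(path)
print(warm.prefer("tall-stem", default="f"))                # t
print(ToyAdaptiveState().prefer("tall-stem", default="f"))  # f
```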

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Jimmy O'Regan

Jul 30, 2010, 6:04:59 PM
to tesser...@googlegroups.com
On 30 July 2010 22:48, Sven Pedersen <sven.p...@gmail.com> wrote:
> In a conversation between Philip Pemberton and Jimmy on the 27th, it
> seems that the user wordlist may not work for Tesseract 3. You may
> need to call the file 'eng.' or $LANG. and put it in the traindata
> folder. It sounds like Jimmy is eventually planning to improve the
> situation. In the mean time you may have to train tesseract yourself
> with your corpus (and font) to improve results, or do image
> manipulations (resize/adjust) to improve the input at runtime.

That's true, but the results would have been more or less the same anyway.

Anyway; going by some of the stuff Google have published, there will
be a post-editing facility in Tesseract in the future, where the
dictionaries and something very much like DangAmbigs will be used in
more or less the way people expected that they were used.

It might actually be in the codebase now (hey, it's quite large, and I
don't have a huge amount of spare time), but I've only found the
training code (and that's not quite set up to be used yet).

Dmitry Silaev

Jul 30, 2010, 6:10:23 PM
to tesser...@googlegroups.com
> On the plus side, it turns out that there are functions buried in the
> code to serialise/deserialise the classifier state, so it might be
> useful to run a whole corpus of short images through tess in one
> batch, save the state, and load that at startup.

Could you please be more specific about your findings: which functions are they, and what do they do? I think it would be of interest to many subscribers...

Thanks,
Dmitry

Jimmy O'Regan

Jul 30, 2010, 6:50:13 PM
to tesser...@googlegroups.com

This commit has the conversion to doxygen of the documentation of some
of those functions:
http://code.google.com/p/tesseract-ocr/source/detail?r=447#

Zdenko Podobný

Aug 1, 2010, 12:14:13 PM
to tesser...@googlegroups.com
I played with strace and grep and found that the user dictionary is not used (not even opened) in a standard installation (svn revision 447).

When I set the variable "global_user_words_suffix" to "user-words" (or anything else you like ;-) ), tesseract opened the user dictionary file.

global_user_words_suffix can be found in 2 files:
dict/dict.h: extern STRING_VAR_H(global_user_words_suffix, "user-words",
                    "A list of user-provided words.");
dict/permute.cpp:STRING_VAR(global_user_words_suffix, "", "A list of user-provided words.");

I believe the problem is in dict/permute.cpp, which defines this variable with an empty string as its default.
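
A simplified model of why the empty default matters, in Python. It assumes that the word list file name is built as language code plus "." plus suffix, and that an empty suffix means no file is looked for at all. That matches the strace observation, but the exact check inside Tesseract is an assumption, and `user_words_path` is a made-up helper, not a Tesseract function.

```python
import os


def user_words_path(tessdata_dir, lang, suffix):
    """Return the path of the user word list, or None if the suffix is
    empty. Simplified model only, not Tesseract's actual logic."""
    if not suffix:
        return None  # nothing to open, so the file is silently ignored
    return os.path.join(tessdata_dir, f"{lang}.{suffix}")


DECLARED_DEFAULT = "user-words"  # default shown in dict/dict.h
DEFINED_DEFAULT = ""             # default shown in dict/permute.cpp (r447)

# With the default from permute.cpp there is no file to open.
print(user_words_path("tessdata", "eng", DEFINED_DEFAULT))  # None

# With the suffix set, eng.user-words inside tessdata is looked up.
print(user_words_path("tessdata", "eng", DECLARED_DEFAULT))
```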

Zd.

Jimmy O'Regan

Aug 1, 2010, 12:24:54 PM
to tesser...@googlegroups.com
2010/8/1 Zdenko Podobný <zde...@gmail.com>:

Seems right; the *_VAR and *_VAR_H declarations are usually
'balanced'. I put the default back in r448.
