user-words / bazaar

Stef

Sep 21, 2015, 9:29:39 AM
to tesseract-ocr
I'm trying to use user word lists with the bazaar config, but they seem to have no effect on the OCR result in my case. I therefore printed the current parameters to verify whether the user-words list is used. This confirmed that the variables load_system_dawg, load_freq_dawg, user_words_suffix and user_patterns_suffix were set correctly (to 0, 0, user-words, and user-patterns, respectively), but the output doesn't show the user-provided filenames:
user_words_file        A filename of user-provided words.
user_patterns_file        A filename of user-provided patterns.
Is this just a bug/omission in the --print-parameters output, or did tesseract fail to load these files (tesseract 3.05.00dev)?



Meh Hem

Sep 24, 2015, 8:32:32 AM
to tesseract-ocr
Hi Stef,

As far as I have found, they indeed have no effect. The idea is great, but unfortunately it just does not seem to work.

I have found no working demonstration of it, despite looking for quite some time.

Instead, we have found that a strong ambiguous character set combined with post-processing of the output helps to get results within an expected pattern.
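
As a rough illustration of what such output post-processing could look like (a sketch only: the word list, the cutoff value, and the helper name snap_to_wordlist are assumptions made for this example, not something taken from this thread or from Tesseract), each OCR token can be snapped to its closest entry in a known word list using Python's standard difflib module:

```python
import difflib


def snap_to_wordlist(text, wordlist, cutoff=0.8):
    """Replace each whitespace-separated token with its closest wordlist entry.

    Tokens with no entry scoring at least `cutoff` are left unchanged.
    Matching is case-sensitive, and whitespace is normalized to single spaces.
    """
    corrected = []
    for token in text.split():
        # get_close_matches returns at most n candidates, best match first.
        matches = difflib.get_close_matches(token, wordlist, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)
```

A higher cutoff makes the correction more conservative; whether 0.8 is suitable depends on the word list and the kind of OCR errors in the material.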

Tom Morris

Sep 24, 2015, 3:58:35 PM
to tesseract-ocr
Works for me.  What command line are you using?

$ tesseract --user-words foo --user-patterns bar --print-parameters | grep user_
Error: failed to load foo
Error opening pattern file bar
Error: failed to load bar
user_words_file foo A filename of user-provided words.
user_words_suffix A suffix of user-provided words located in tessdata.
user_patterns_file bar A filename of user-provided patterns.
user_patterns_suffix A suffix of user-provided patterns located in tessdata.

Stef

Sep 28, 2015, 4:02:15 AM
to tesseract-ocr
Tom,

I wasn't aware of the new possibility of specifying user words on the command line. Instead, I used the config file method, with the following command lines and outputs:

tesseract.exe --version
tesseract 3.05.00dev
 leptonica-1.72
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

tesseract.exe test.jpg stdout -l deu --print-parameters bazaar   | grep load_system\|load_freq\|user_
load_system_dawg    0    Load system word dawg.
load_freq_dawg    0    Load frequent word dawg.

user_words_file        A filename of user-provided words.
user_words_suffix    user-words    A suffix of user-provided words located in tessdata.
user_patterns_file        A filename of user-provided patterns.
user_patterns_suffix    user-patterns    A suffix of user-provided patterns located in tessdata.


tesseract.exe test.jpg stdout -l deu --print-parameters   | grep load_system\|load_freq\|user_
load_system_dawg    1    Load system word dawg.
load_freq_dawg    1    Load frequent word dawg.

user_words_file        A filename of user-provided words.
user_words_suffix        A suffix of user-provided words located in tessdata.
user_patterns_file        A filename of user-provided patterns.
user_patterns_suffix        A suffix of user-provided patterns located in tessdata.

My bazaar config file:

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

For the time being, I solved my problem by increasing the scan resolution from 300 dpi to 600 dpi, which ensures that everything is recognized correctly with the default (system) settings.

Tom Morris

Sep 30, 2015, 2:30:51 PM
to tesseract-ocr
Perhaps this is just a misunderstanding or bad documentation.  The --print-parameters dump shows the input parameters, and the user_words_file / user_patterns_file parameters will always be empty unless they're set on the command line.

The actual file name that gets loaded is computed on the fly here:
    https://github.com/tesseract-ocr/tesseract/blob/master/dict/dict.cpp#L274
but the result isn't saved back into the user_words_file parameter.
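
To make that concrete, here is a minimal sketch of how the effective file name appears to be derived. It is written in Python rather than taken from the C++ source, and it assumes the rule is: an explicit user_words_file wins, otherwise the suffix is appended to the language data prefix (the tessdata directory plus "<lang>."). The function name, its arguments, and the example directory are illustrative assumptions.

```python
import os


def effective_user_file(explicit_file, suffix, tessdata_dir, lang):
    """Return the path that would be loaded, or None if nothing is configured.

    Assumed rule: an explicit file name takes precedence; otherwise the
    path is <tessdata_dir>/<lang>.<suffix>.
    """
    if explicit_file:
        return explicit_file
    if suffix:
        return os.path.join(tessdata_dir, f"{lang}.{suffix}")
    return None
```

Under that assumption, the bazaar config with -l deu would resolve to a file named deu.user-words inside the tessdata directory, even though user_words_file itself prints as empty.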

Tom

Stef

Sep 30, 2015, 3:29:04 PM
to tesseract-ocr
Thanks for this clarification.

Stef