User-words with Tesseract 5


Natalia Zgirovskaya

Mar 23, 2020, 6:38:46 AM
to tesseract-ocr
Hi all,

I have an issue with providing a list of user words to Tesseract. I use Windows 10.
Installed tesseract version:

>tesseract.exe -v
tesseract v5.0.0-alpha.20191030
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

My test image:

test.jpg

I have an "eng.user-words" file in the directory with the traineddata files. It contains:
B1adeb1ab1a


The config file "bazaar" is as follows:
load_system_dawg     F
load_freq_dawg       F
user_words_file      path/to/eng.user-words
user_words_suffix    user-words
language_model_penalty_non_freq_dict_word 1
language_model_penalty_non_dict_word      1


Running this command
"C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng bazaar
gives "Bladeblabla" instead of "B1adeb1ab1a"

Likewise, this command
"C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng --user-words path/to/eng.user-words
gives "Bladeblabla" instead of "B1adeb1ab1a"



What am I doing wrong?

Gabriel de Oliveira

Mar 31, 2020, 3:27:32 PM
to tesseract-ocr
I'm not sure whether user-words and/or character whitelists are supported by the LSTM engine (versions >= 4.00). The last I heard, they were only supported by the legacy engine (v3.x), which is selected with the --oem 0 option. Maybe someone can correct me if I'm wrong?
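
To try the legacy-engine route, here is a minimal Python sketch that only assembles the command line, reusing the file names from the original post. The helper name build_legacy_command is made up for illustration. It assumes the installed eng.traineddata contains legacy-engine data (LSTM-only models will not run with --oem 0), and I have not verified that this actually makes the user words take effect.

```python
def build_legacy_command(image, user_words, lang="eng", exe="tesseract"):
    """Build a Tesseract argument list that selects the legacy engine.

    Hypothetical helper: it only assembles arguments and does not run OCR.
    """
    return [
        exe,
        image,
        "stdout",                     # print recognized text to the console
        "-l", lang,
        "--oem", "0",                 # 0 = legacy engine only
        "--user-words", user_words,   # same option as in the original post
    ]


if __name__ == "__main__":
    cmd = build_legacy_command("test.jpg", "path/to/eng.user-words")
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) if you want to run it from Python; on Windows, pass the full path to tesseract.exe as exe.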