User-words with Tesseract 5

Natalia Zgirovskaya

Mar 23, 2020, 6:38:46 AM

to tesseract-ocr

Hi all,

I have an issue with providing a list of user words to Tesseract. I use Windows 10.
Installed tesseract version:

>tesseract.exe -v
tesseract v5.0.0-alpha.20191030
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

My test image:

I have an "eng.user-words" file in the directory with the traineddata files. It contains:

B1adeb1ab1a

The config file "bazaar" is as follows:

load_system_dawg     F 
load_freq_dawg       F 
user_words_file  path/to/eng.user-words 
user_words_suffix user-words 
language_model_penalty_non_freq_dict_word 1 
language_model_penalty_non_dict_word 1

Running this command

"C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng bazaar

gives "Bladeblabla" instead of "B1adeb1ab1a"

Likewise, this command

"C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng --user-words path/to/eng.user-words

gives "Bladeblabla" instead of "B1adeb1ab1a"

What am I doing wrong?

Gabriel de Oliveira

Mar 31, 2020, 3:27:32 PM

to tesseract-ocr

I'm not sure whether user-words and/or character whitelists are supported by the LSTM engine (versions >= 4.00). Last I heard, they were only supported by the legacy engine (v3.x), which is selected with the --oem 0 option. Maybe someone can correct me if I'm wrong?
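
For anyone who wants to test the legacy-engine idea, here is a minimal Python sketch that builds the same command line as in the original post, with --oem 0 added. The helper names (build_legacy_command, run_legacy) are made up for illustration. It assumes tesseract is on the PATH and that the installed eng.traineddata actually contains the legacy model; with LSTM-only traineddata, --oem 0 will fail. Whether the user words then change the result is untested here.

```python
import subprocess


def build_legacy_command(image, user_words, exe="tesseract", lang="eng"):
    """Return the argv list for a run on the legacy engine (--oem 0).

    The argument order mirrors the commands in the original post:
    image, output base ("stdout"), then options.
    """
    return [
        exe, image, "stdout",
        "-l", lang,
        "--oem", "0",                 # legacy engine only
        "--user-words", user_words,   # path to the user-words file
    ]


def run_legacy(image, user_words, **kwargs):
    """Run tesseract and return the recognized text.

    Raises subprocess.CalledProcessError if tesseract exits with an
    error, e.g. when the traineddata has no legacy model.
    """
    cmd = build_legacy_command(image, user_words, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Calling run_legacy("test.jpg", "path/to/eng.user-words") and comparing the text with the default-engine output would show whether the engine choice makes any difference for this image. On Windows, the full path to tesseract.exe can be passed as exe.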
