Question about mixed-language recognition (user-words files are present and referenced via --user_words_suffix=user-words, but seem to have no effect)

Alexey Kostylev

Feb 10, 2018, 3:32:27 AM
to tesseract-ocr
First of all, sorry for any mistakes; English is not my native language.

I am using pytesseract with this Tesseract version:

tesseract 4.0.0-alpha.20170804
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

The parameters seem to be OK:

config:
lang = eng+rus
tess_config = --tessdata-dir "D:\Program Files (x86)\Tesseract-OCR\tessdata" --psm 4 --user_words_suffix=user-words
tesseract_cmd = D:\Program Files (x86)\Tesseract-OCR\tesseract.exe

command:
text = pytesseract.image_to_string(Image.open("temp/BW.PNG"),lang=self._lang, config=self._tess_config )
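
For comparison, here is a minimal sketch of passing the same setting through Tesseract's `-c VAR=VALUE` syntax, which is how config variables such as `user_words_suffix` are normally set on the command line. It only shows how the string could be assembled; it does not establish whether the 4.0.0-alpha LSTM engine actually consults the word list. `build_config` and `ocr_image` are hypothetical helper names, not part of pytesseract.

```python
def build_config(tessdata_dir, psm=4, user_words_suffix=None):
    """Assemble the extra-arguments string passed to pytesseract's `config`."""
    parts = [f'--tessdata-dir "{tessdata_dir}"', f"--psm {psm}"]
    if user_words_suffix:
        # Config variables are set with -c NAME=VALUE rather than --NAME=VALUE.
        parts.append(f"-c user_words_suffix={user_words_suffix}")
    return " ".join(parts)


def ocr_image(path, lang="eng+rus", config=""):
    """Run OCR on one image; assumes pytesseract and Pillow are installed."""
    # Imported lazily so build_config works without the OCR dependencies.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang=lang, config=config)
```

With the paths from this post, the call would look like `ocr_image("temp/BW.PNG", config=build_config(r"D:\Program Files (x86)\Tesseract-OCR\tessdata", user_words_suffix="user-words"))`.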

The user-words files are definitely present:

D:\Program Files (x86)\Tesseract-OCR\tessdata>dir *.user-words
...
Directory of D:\Program Files (x86)\Tesseract-OCR\tessdata

10.02.2018  09:40               169 eng.user-words
09.02.2018  22:21                 9 rus.user-words
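
The same check can be scripted. The sketch below assumes the `<lang>.<suffix>` naming seen in the listing above and reports any file that is missing or empty for the languages in a `lang` string such as `eng+rus`. `missing_user_words` is a hypothetical helper name, and the check says nothing about whether Tesseract actually reads the files.

```python
from pathlib import Path


def missing_user_words(tessdata_dir, langs, suffix="user-words"):
    """Return the <lang>.<suffix> paths that are missing or empty."""
    problems = []
    for lang in langs.split("+"):
        candidate = Path(tessdata_dir) / f"{lang}.{suffix}"
        # An empty file is flagged too, since it cannot contribute any words.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            problems.append(candidate)
    return problems
```

An empty list for `missing_user_words(r"D:\Program Files (x86)\Tesseract-OCR\tessdata", "eng+rus")` would confirm that both files exist and are non-empty.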

However, it looks as if the files are either not loaded or, at least, have little effect. Here are the images (the original and the black-and-white version that I cleaned up):





Part of eng.user-words (I also tried the same words starting with an uppercase letter):

societas
eruditoium
civitas
dei
icu
ino
rodentia
earth
expeditionary
pyrrha
santos
dumont

The OCR result is below; I marked the lines containing names from the user-words file in bold.

Rock Research Ring
Текущее влияние: 1%
В ближайшем будущем ожидается изоляция.
В ближайшем будущем у них ожидаются гражданские
беспорядки.
Планируется экспансия из ВодепНа Petram.
В состоянии войны с Еагіб Нее в системе
Phoenix.
Их ждет экономический бум.
Colonia Council
Текущее влияние: 4%
В состоянии войны с босіетаз Егидиогит де Суйаз Реі в
cucteme Dubbuennel.

Any advice? It seems to me that I need to force the user dictionary to take priority over the built-in one, or add entries for mixed languages, but how?
P.S. I am a complete newbie with Tesseract, but I am very impressed. Nearly all regular text from PC screenshots is recognized correctly.