How to apply user patterns

Christopher Smeenk

Jun 3, 2014, 6:54:24 AM
to tesser...@googlegroups.com
I am attempting to use tesseract to read data from a scanned high school transcript. The form contains a number of fields (student name, gender, address) and corresponding values (characters, words or numbers).

I wish to confirm that I can control the behaviour of tesseract using the eng.user-patterns and eng.user-words files as described in the man page and the file trie.h. I created a test image for this purpose (attached).

First some info about my system
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 -v
tesseract 3.03
 leptonica-1.70
  libjpeg 8b : libpng 1.2.46 : zlib 1.2.3.4



Here is the result of running tesseract on the test image with no config file
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432



Next I create the config file and the user-patterns and user-words files
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-words
Name
Age
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare


cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-patterns
Nam\c*


cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
load_system_dawg 0
load_freq_dawg 0
user_words_suffix test-words
user_patterns_suffix test-patterns



Now here is the output when the config files are used
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432


This is exactly the same as before! It appears the files eng.test-patterns and eng.test-words have no effect on tesseract. 
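
To make the comparison above concrete, here is a minimal Python sketch that lists which alphabetic tokens in the OCR output fall outside the user-words list. The word list and OCR text are copied from this post; the helper name `words_outside_list` is made up for illustration and is not part of tesseract. It only inspects the text output, so it cannot tell whether tesseract actually loaded the files.

```python
import re

def words_outside_list(ocr_text, user_words):
    """Return alphabetic tokens from ocr_text that are not in user_words."""
    allowed = set(user_words)
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    # Keep first-seen order and drop duplicates.
    outside = []
    for tok in tokens:
        if tok not in allowed and tok not in outside:
            outside.append(tok)
    return outside

# Contents of eng.test-words as shown in the post.
user_words = ["Name", "Age", "Roosevelt", "Franklin", "Harper",
              "Stephen", "Hawk", "Tony", "Shakespeare"]

# OCR output from thetext.txt as shown in the post (blank lines dropped).
ocr_text = """Na me: Roosevelt, Fra nklin
Age: 102
Name: Harper, Stephen
Age: 58
Name: Hawk, Tony
Age: 34
Nane: Shakespeare, Bill
Age: 432
"""

print(words_outside_list(ocr_text, user_words))
```

If the word list were being honoured, one might expect tokens such as "Nane" to be pulled towards "Name"; an unchanged list of outside tokens before and after adding the config is consistent with the files having no effect.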



However, I can modify the config file to force tesseract to use only lower case letters
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyz


The modified config file does affect the output
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
we mei koosevelt lira nklin


 gei loz


wamei rlarpen stephen
 gei sa


wamei rlawk mny
 gei em


wanei shakespeara sill
 gei wz



So in this case the config file works.

What other steps can I take to confirm tesseract is using the user-pattern files? Is it necessary to train tesseract before applying user-patterns?
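
One basic check that can be done outside tesseract is to confirm that the files named by the config exist where expected and are not empty. The sketch below is in Python and assumes the naming seen in this post, where a suffix of `test-words` goes with a file called `eng.test-words` in the tessdata directory. The function names are hypothetical, and passing this check does not prove that tesseract reads the files.

```python
import os
import tempfile

def expected_user_files(tessdata_dir, lang, words_suffix, patterns_suffix):
    """Build the paths <tessdata_dir>/<lang>.<suffix> for both user files."""
    return {
        "user_words": os.path.join(tessdata_dir, f"{lang}.{words_suffix}"),
        "user_patterns": os.path.join(tessdata_dir, f"{lang}.{patterns_suffix}"),
    }

def report_missing_or_empty(paths):
    """Return the keys whose file is missing or has no non-blank lines."""
    problems = []
    for key, path in paths.items():
        if not os.path.isfile(path):
            problems.append(key)
            continue
        with open(path, encoding="utf-8") as fh:
            if not any(line.strip() for line in fh):
                problems.append(key)
    return problems

# Demo in a throwaway directory standing in for the real tessdata folder.
with tempfile.TemporaryDirectory() as tessdata:
    paths = expected_user_files(tessdata, "eng", "test-words", "test-patterns")
    # Only the words file is created, so the patterns file is reported.
    with open(paths["user_words"], "w", encoding="utf-8") as fh:
        fh.write("Name\nAge\n")
    demo_problems = report_missing_or_empty(paths)
    print(demo_problems)
```

To use it on a real install, pass the actual tessdata path (for example the `builds/tesseract-3.03/share/tessdata` directory shown above) instead of the temporary directory.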

Thanks for reading,
Chris
test-002.png

Jing JC

Jul 11, 2014, 4:08:46 PM
to tesser...@googlegroups.com
I have the same question. 
Any answers?
I tried to make tesseract match the words in my own customized user-words file,
but it returned the same result.
I cannot see any effect from the user-words and user-patterns files.