How to apply user patterns

Christopher Smeenk

Jun 3, 2014, 6:54:24 AM

to tesser...@googlegroups.com

I am attempting to use tesseract to read data from a scanned high school transcript. The form contains a number of fields (student name, gender, address) and corresponding values (characters, words or numbers).

I wish to confirm that I can control the behaviour of tesseract using the eng.user-patterns and eng.user-words files as described in the man page and the file trie.h. I created a test image for this purpose (attached).

First, some info about my system:

cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 -v
tesseract 3.03
 leptonica-1.70
  libjpeg 8b : libpng 1.2.46 : zlib 1.2.3.4

Here is the result of running tesseract on the test image with no config file:

cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432

Next I created the config file and the user-patterns and user-words files:

cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-words
Name
Age
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare


cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-patterns 
Nam\c*
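
For readers comparing this pattern with the syntax comment in trie.h: as far as I can tell from that comment, \c stands for one alphabetic character and repetition is written with an escaped star (\*), not a bare one. The sketch below translates a pattern line into a rough Python regular expression so the two spellings can be compared. It is only an illustration of my reading of the header, not Tesseract's own matcher; the function name pattern_to_regex and the ASCII character classes are assumptions made for this example.

```python
import re

# Escape codes as I read them in the trie.h comment; check the header that
# ships with your Tesseract version. The ASCII classes are approximations
# of the UNICHARSET properties Tesseract actually consults.
_CLASSES = {
    "c": "[A-Za-z]",     # alphabetic character
    "d": "[0-9]",        # digit
    "n": "[A-Za-z0-9]",  # letter or digit
    "p": r"[^\w\s]",     # punctuation (rough approximation)
    "a": "[a-z]",        # lower-case letter
    "A": "[A-Z]",        # upper-case letter
}


def pattern_to_regex(pattern):
    """Translate one user-patterns line into an approximate Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            code = pattern[i + 1]
            if code == "*":
                # Escaped star: zero or more repetitions of the previous item.
                if out and out[-1] != "*":
                    out.append("*")
            elif code in _CLASSES:
                out.append(_CLASSES[code])
            else:
                # Unknown escape: treat the escaped character literally.
                out.append(re.escape(code))
            i += 2
        else:
            # Anything else, including a bare '*', is a literal character.
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out))


if __name__ == "__main__":
    for line in (r"Nam\c*", r"Nam\c\*"):
        rx = pattern_to_regex(line)
        print(line, "->", rx.pattern, "| matches 'Name':", bool(rx.fullmatch("Name")))
```

Under this reading, Nam\c* would mean "Nam", one letter, then a literal asterisk, whereas Nam\c\* would mean "Nam" followed by any number of letters. Whether tesseract 3.03 behaves this way is something I have not verified.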


cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test 
load_system_dawg 0
load_freq_dawg 0
user_words_suffix test-words
user_patterns_suffix test-patterns
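
One thing that can be checked independently of tesseract is whether the suffixes in the config resolve to files that actually exist. The sketch below assumes the lookup path is the tessdata directory plus the language code, a dot, and the suffix, which matches the layout shown above (eng.test-words for the suffix test-words). The helper names expected_user_files and missing_files are made up for this example.

```python
import os

SUFFIX_PARAMS = ("user_words_suffix", "user_patterns_suffix")


def expected_user_files(tessdata_dir, lang, config_text):
    """Map each suffix parameter in a config to the file it should point at.

    Assumes the path is <tessdata_dir>/<lang>.<suffix>.
    """
    paths = {}
    for line in config_text.splitlines():
        parts = line.split(None, 1)  # parameter name, then its value
        if len(parts) == 2 and parts[0] in SUFFIX_PARAMS:
            paths[parts[0]] = os.path.join(tessdata_dir, lang + "." + parts[1].strip())
    return paths


def missing_files(paths):
    """Return the expected paths that do not exist on disk."""
    return [p for p in paths.values() if not os.path.isfile(p)]


if __name__ == "__main__":
    config = (
        "load_system_dawg 0\n"
        "load_freq_dawg 0\n"
        "user_words_suffix test-words\n"
        "user_patterns_suffix test-patterns\n"
    )
    # Tessdata location taken from the listing above; adjust for your install.
    found = expected_user_files("builds/tesseract-3.03/share/tessdata", "eng", config)
    for name, path in found.items():
        print(name, "->", path)
    print("missing:", missing_files(found))
```

An empty "missing" list only shows that the files are where the config says they are; it does not show that tesseract loads or applies them.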

Here is the output when the config file is used:

cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432

This is exactly the same as before! It appears the files eng.test-patterns and eng.test-words have no effect on tesseract.

However, I can modify the config file to force tesseract to use only lowercase letters:

cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test 
tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyz

The modified config file does affect the output:

cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
we mei koosevelt lira nklin


 gei loz


wamei rlarpen stephen
 gei sa


wamei rlawk mny
 gei em


wanei shakespeara sill
 gei wz

So in this case the config file works.

What other steps can I take to confirm that tesseract is using the user-patterns and user-words files? Is it necessary to train tesseract before applying user patterns?

Thanks for reading,

Chris

test-002.png

Jing JC

Jul 11, 2014, 4:08:46 PM

to tesser...@googlegroups.com

I have the same question.

Any answers?

I tried to make tesseract match the words in my own customized user-words file, but it returned the same result.

I cannot see any effect from the user-words and user-patterns files.
