I am attempting to use tesseract to read data from a scanned high school transcript. The form contains a number of fields (student name, gender, address) and corresponding values (characters, words or numbers).
I wish to confirm that I can control the behaviour of tesseract using the eng.user-patterns and eng.user-words files as described in the man page and in the file trie.h. I created a test image for this purpose (attached).
First, some information about my system:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 -v
tesseract 3.03
leptonica-1.70
libjpeg 8b : libpng 1.2.46 : zlib 1.2.3.4
Here is the result of running tesseract on the test image with no config file:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
Na me: Roosevelt, Fra nklin
Age: 102
Name: Harper, Stephen
Age: 58
Name: Hawk, Tony
Age: 34
Nane: Shakespeare, Bill
Age: 432
Next I created the config file and the user-patterns and user-words files:
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-words
Name
Age
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-patterns
Nam\c*
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
load_system_dawg 0
load_freq_dawg 0
user_words_suffix test-words
user_patterns_suffix test-patterns
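As a sanity check on the setup above, the following is a minimal Python sketch that parses the config text and prints the file each suffix parameter is presumably meant to resolve to. It assumes the files are looked up as <tessdata>/<lang>.<suffix>, which matches how eng.test-words and eng.test-patterns are named above but is not confirmed here. The helper names parse_config and expected_user_files are hypothetical.

```python
import os

# Config text copied from the bazaar_test file shown above.
CONFIG = """load_system_dawg 0
load_freq_dawg 0
user_words_suffix test-words
user_patterns_suffix test-patterns
"""


def parse_config(text):
    """Parse 'key value' lines of a config file into a dict."""
    params = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


def expected_user_files(params, tessdata_dir, lang="eng"):
    """Return the paths the suffix parameters are ASSUMED to point to.

    Assumption: the file is <tessdata_dir>/<lang>.<suffix>.
    """
    paths = {}
    for key in ("user_words_suffix", "user_patterns_suffix"):
        if key in params:
            paths[key] = os.path.join(tessdata_dir, lang + "." + params[key])
    return paths


if __name__ == "__main__":
    tessdata = "builds/tesseract-3.03/share/tessdata"
    files = expected_user_files(parse_config(CONFIG), tessdata)
    for key, path in files.items():
        status = "exists" if os.path.isfile(path) else "MISSING"
        print(key, path, status)
```

This only rules out a misnamed or misplaced file. It says nothing about whether tesseract actually loads the file.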
Here is the output when the config file is used:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
Na me: Roosevelt, Fra nklin
Age: 102
Name: Harper, Stephen
Age: 58
Name: Hawk, Tony
Age: 34
Nane: Shakespeare, Bill
Age: 432
This is exactly the same as before! It appears the files eng.test-patterns and eng.test-words have no effect on tesseract.
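Rather than comparing the two listings by eye, a line-by-line diff makes the "no change" claim checkable. The following is a minimal sketch using Python's standard difflib. The function name diff_outputs and the labels no_config/with_config are hypothetical. In practice the two strings would be read from copies of thetext.txt saved after each run.

```python
import difflib

# Output copied from the run without a config file (shown above).
BASELINE = """Na me: Roosevelt, Fra nklin
Age: 102
Name: Harper, Stephen
Age: 58
Name: Hawk, Tony
Age: 34
Nane: Shakespeare, Bill
Age: 432
"""


def diff_outputs(before, after):
    """Return unified-diff lines; an empty list means the texts are identical."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="no_config",
            tofile="with_config",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # The run with bazaar_test produced the same text, so it is reused here.
    with_config = BASELINE
    changes = diff_outputs(BASELINE, with_config)
    if changes:
        print("\n".join(changes))
    else:
        print("outputs are identical")
```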
However, I can modify the config file to force tesseract to use only lower case letters:
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyz
The modified config file does affect the output:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
we mei koosevelt lira nklin
gei loz
wamei rlarpen stephen
gei sa
wamei rlawk mny
gei em
wanei shakespeara sill
gei wz
So in this case the config file works.
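The whitelist string above has "t" and "s" in swapped order, so it is worth checking that it still covers the whole lower case alphabet and that the output stays inside it. The following is a minimal Python sketch. The helper names whitelist_report and outside_whitelist are hypothetical, and the sample text is the output shown above.

```python
import string

# Whitelist exactly as written in the config file above.
WHITELIST = "abcdefghijklmnopqrtsuvwxyz"

# Output copied from the whitelist run shown above.
OUTPUT = """we mei koosevelt lira nklin
gei loz
wamei rlarpen stephen
gei sa
wamei rlawk mny
gei em
wanei shakespeara sill
gei wz
"""


def whitelist_report(whitelist, alphabet=string.ascii_lowercase):
    """Return (missing, extra) characters relative to the given alphabet."""
    missing = sorted(set(alphabet) - set(whitelist))
    extra = sorted(set(whitelist) - set(alphabet))
    return missing, extra


def outside_whitelist(text, whitelist):
    """Return the non-whitespace characters in text that are not whitelisted."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in whitelist})


if __name__ == "__main__":
    missing, extra = whitelist_report(WHITELIST)
    print("missing from whitelist:", missing)
    print("unexpected in whitelist:", extra)
    print("output characters outside whitelist:", outside_whitelist(OUTPUT, WHITELIST))
```

If all three lists come back empty, the whitelist is complete and the output respects it, which supports the conclusion that the config file itself is being read.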
What other steps can I take to confirm that tesseract is using the user-patterns and user-words files? Is it necessary to train tesseract before applying user-patterns?
Thanks for reading,
Chris