I would like to use tesseract to read data from a scanned high school transcript. The form contains a bunch of fields (student name, gender, address) and corresponding values (characters, words or numbers).
As I understand it, the way to do this is to use a config file that points to user-supplied word and pattern files (see the
man page; the pattern syntax is explained in more detail in the file /path/to/tesseract-ocr/dict/trie.h).
However, when I try to use my own eng.user-words or eng.user-patterns file, tesseract crashes with a segmentation fault.
First, here is a test image I am using to check the pattern matching: (attached file test-002.png)
Here is some info about my install:
cs@pleco:/data/OCR/tesseract/tests$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise
cs@pleco:/data/OCR/tesseract/tests$ tesseract -v
tesseract 3.02.02
leptonica-1.69
libjpeg 6b : libpng 1.2.46 : libtiff 3.9.5 : zlib 1.2.3.4
Here is a good run without the config file, showing the output:
cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
Na me: Roosevelt, Fra nklin
Age: 102
Name: Harper, Stephen
Age: 58
Name: Hawk, Tony
Age: 34
Nane: Shakespeare, Bill
Age: 432
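For context, this is roughly how I intend to consume the output afterwards. It is only a sketch: the function name parse_records and the strict "Name:" / "Age:" regular expressions are my own assumptions for this test image, not anything tesseract provides. It shows why the misreads matter, since lines such as "Na me:" and "Nane:" do not match the field label and end up as unparsed.

```python
import re

# Hypothetical field patterns for the test image; not part of tesseract.
NAME_RE = re.compile(r"^Name:\s*(?P<last>[^,]+),\s*(?P<first>.+)$")
AGE_RE = re.compile(r"^Age:\s*(?P<age>\d+)$")


def parse_records(text):
    """Return (records, unparsed_lines) from OCR output text."""
    records, unparsed = [], []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name = NAME_RE.match(line)
        age = AGE_RE.match(line)
        if name:
            # A well-formed "Name:" line starts a new record.
            current = {"last": name.group("last").strip(),
                       "first": name.group("first").strip()}
            records.append(current)
        elif age and current is not None and "age" not in current:
            # Attach the age only to a record that does not have one yet.
            current["age"] = int(age.group("age"))
        else:
            # Misread labels and orphaned ages are kept for inspection.
            unparsed.append(line)
    return records, unparsed


if __name__ == "__main__":
    sample = """Na me: Roosevelt, Fra nklin
Age: 102
Name: Harper, Stephen
Age: 58
Name: Hawk, Tony
Age: 34
Nane: Shakespeare, Bill
Age: 432
"""
    records, unparsed = parse_records(sample)
    print(records)
    print(unparsed)
```

With the output above, only the Harper and Hawk records parse; the Roosevelt and Shakespeare lines, together with their ages, are reported as unparsed. That is what I hope the user patterns will fix.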
Here are the config file and the user patterns and user words files it refers to:
cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat configs/bazaar_test
load_system_dawg F
load_freq_dawg F
user_words_suffix test-words
user_patterns_suffix test-patterns
cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-patterns
Name: \A\c*, \A\c*
Age: \d*
cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-words
Name:
Age:
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare
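As a sanity check that the suffixes in the config resolve to the files listed above, here is a minimal sketch. It assumes that tesseract builds each filename as <tessdata>/<lang>.<suffix> (my reading of the man page, so treat it as an assumption), and the helper names read_config and expected_user_files are made up for illustration.

```python
import os

SUFFIX_KEYS = ("user_words_suffix", "user_patterns_suffix")


def read_config(path):
    """Parse a config file made of whitespace-separated 'key value' lines."""
    settings = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split(None, 1)
            if len(parts) == 2:
                settings[parts[0]] = parts[1].strip()
    return settings


def expected_user_files(settings, tessdata_dir, lang="eng"):
    """Map each suffix key to (path, exists); the naming scheme is assumed."""
    result = {}
    for key in SUFFIX_KEYS:
        suffix = settings.get(key)
        if suffix:
            path = os.path.join(tessdata_dir, lang + "." + suffix)
            result[key] = (path, os.path.isfile(path))
    return result


if __name__ == "__main__":
    tessdata = "/usr/share/tesseract-ocr/tessdata"
    config = read_config(os.path.join(tessdata, "configs", "bazaar_test"))
    for key, (path, exists) in expected_user_files(config, tessdata).items():
        print(key, path, "found" if exists else "MISSING")
```

This only confirms that the files are where I think tesseract looks for them; it says nothing about whether their contents are valid.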
And here is the result when running tesseract with the config file:
cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Segmentation fault
What am I doing wrong? Thanks for reading!
Chris