user-patterns Segmentation fault

Christopher Smeenk

May 30, 2014, 12:22:50 PM

to tesser...@googlegroups.com

I would like to use tesseract to read data from a scanned high school transcript. The form contains a bunch of fields (student name, gender, address) and corresponding values (characters, words or numbers).

As I understand it, the way to do this is with a config file that points tesseract at user-supplied word and pattern files (see the man page; the pattern syntax is explained in more detail in /path/to/tesseract-ocr/dict/trie.h).

However, when I try to use my own eng.user-words or eng.user-patterns file, tesseract crashes with a segmentation fault.

First, here is a test image I am using to check the pattern matching: (attached file test-002.png)

Here is some info about my install:

cs@pleco:/data/OCR/tesseract/tests$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise

cs@pleco:/data/OCR/tesseract/tests$ tesseract -v
tesseract 3.02.02
 leptonica-1.69
  libjpeg 6b : libpng 1.2.46 : libtiff 3.9.5 : zlib 1.2.3.4

Here is a good run without the config file, showing the output:

cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432

Here are the config file and user pattern files:

cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat configs/bazaar_test 
load_system_dawg F
load_freq_dawg F
user_words_suffix test-words
user_patterns_suffix test-patterns


cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-patterns 
Name: \A\c*, \A\c*
Age: \d*


cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-words 
Name:
Age:
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare

And here is the result when running tesseract with the config file:

cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Segmentation fault
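
For anyone who wants to reproduce this, the setup above can be scripted. This is only a minimal sketch: the helper names write_test_setup and build_command are made up for illustration, the tessdata directory is passed in as a parameter (on my machine it is /usr/share/tesseract-ocr/tessdata, which needs root to write to), and the script only builds the command line without running it.

```python
from pathlib import Path

# Contents of configs/bazaar_test, copied from the listing above.
CONFIG_LINES = [
    "load_system_dawg F",
    "load_freq_dawg F",
    "user_words_suffix test-words",
    "user_patterns_suffix test-patterns",
]

# Contents of eng.test-patterns (raw strings keep the backslashes literal).
PATTERN_LINES = [
    r"Name: \A\c*, \A\c*",
    r"Age: \d*",
]

# Contents of eng.test-words.
WORD_LINES = [
    "Name:", "Age:", "Roosevelt", "Franklin", "Harper",
    "Stephen", "Hawk", "Tony", "Shakespeare",
]


def write_test_setup(tessdata_dir, lang="eng"):
    """Write the config, pattern and word files under tessdata_dir.

    Hypothetical helper; returns the path of the config file.
    """
    tessdata = Path(tessdata_dir)
    (tessdata / "configs").mkdir(parents=True, exist_ok=True)

    config = tessdata / "configs" / "bazaar_test"
    config.write_text("\n".join(CONFIG_LINES) + "\n", encoding="utf-8")

    patterns = tessdata / (lang + ".test-patterns")
    patterns.write_text("\n".join(PATTERN_LINES) + "\n", encoding="utf-8")

    words = tessdata / (lang + ".test-words")
    words.write_text("\n".join(WORD_LINES) + "\n", encoding="utf-8")
    return config


def build_command(image, output_base, config_name=None):
    """Build the same command line used in the two runs above."""
    cmd = ["tesseract", str(image), str(output_base), "-psm", "3"]
    if config_name:
        # The config name goes last, as in the failing run.
        cmd.append(config_name)
    return cmd
```

Calling build_command("testImages/test-002.png", "thetext") gives the good run, and adding "bazaar_test" as the third argument gives the run that crashes.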

What am I doing wrong? Thanks for reading!

Chris

test-002.png

zdenko podobny

May 31, 2014, 4:19:23 PM

to tesser...@googlegroups.com

Hi,

I tried it with version 3.03 (on openSUSE 13.1) and there was no segfault. Version 3.02 segfaults for me as well.

Zdenko


Christopher Smeenk

Jun 3, 2014, 5:36:30 AM

to tesser...@googlegroups.com

Thank you Zdenko. I can confirm v3.03 works with no segfault on my system too.
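
In case it helps anyone comparing versions, a crash can be told apart from an ordinary error by looking at the exit status. The following is a rough sketch, assuming a POSIX system where Python's subprocess module reports a child killed by signal N as return code -N; the function names describe_returncode and run_and_report are just for illustration.

```python
import signal
import subprocess


def describe_returncode(returncode):
    """Turn a subprocess return code into a short description."""
    if returncode < 0:
        # On POSIX, a negative value means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_report(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return describe_returncode(result.returncode)
```

Passing the tesseract command line as a list to run_and_report should then report SIGSEGV for a run that segfaults and a normal exit status otherwise.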

I am still having trouble using the user-patterns and user-words files to control the output of tesseract v3.03. I will start another thread about this.

Chris
