dict of 20 words


Carl Karsten

Feb 10, 2014, 7:39:59 PM
to tesser...@googlegroups.com
I have images of pre-printed forms that have been filled out by hand.  I am not trying to recognize the handwriting, just the "date" and "room name" that are printed on the form.  With that data I can link the image file to the database row, and a human can view the image and do whatever they need to do.

I am hoping to tell tesseract to ignore anything that does not appear in a 20-line text file.

Here is what I have so far:

carl@twist:~/Documents/scans/tests$ tesseract s1-0.png test
Tesseract Open Source OCR Engine v3.02.01 with Leptonica

carl@twist:~/Documents/scans/tests$ grep -E "(Feb 02|H.1301)" test.txt
[ ] Equipment problems [ ] Notes on back Feb 02 5""
H.1301 (cornil)

The 2nd line should really be:
H.1301 (Cornil)


carl@twist:~/Documents/scans/tests$ head tessdata/foo.user-words
Feb 01
Feb 02
Janson
H.1301 (Cornil)
K.1.105 (La Fontaine)
H.2215 (Ferrer)

> Put the wordlist in <lang>.user-words or recreate <lang>.word-dawg using wordlist2dawg.

I haven't been able to figure out how to do either of those, but I get the feeling that is the wrong direction.



Nick White

Feb 11, 2014, 7:11:07 AM
to tesser...@googlegroups.com
Hi Carl,

> I haven't been able to figure out how to do either of those, but I get the
> feeling that is the wrong direction.

No, it sounds right, and you're nearly there. The relevant
documentation for you is the "CONFIG FILES AND AUGMENTING WITH USER
DATA" section of the manual[0].

So, call your word list eng.user-words, put it in the tessdata
directory, then create a config file called 'customwords' in the
tessdata/configs directory, with the following contents:

load_system_dawg F
load_freq_dawg F
user_words_suffix user-words

Note that when I say "the tessdata directory", I mean a directory
that by default will probably be /usr/share/tesseract-ocr/tessdata.
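
As a rough sketch of the steps above: the script below builds the same
layout in a scratch directory, so nothing under /usr/share is touched.
The TESSDATA variable and the scratch path are assumptions made for
illustration only; the word list entries are the ones from the first
post, and the final command is printed rather than run.

```shell
# Assumed stand-in for the real tessdata directory (hypothetical scratch path).
TESSDATA="$(mktemp -d)/tessdata"
mkdir -p "$TESSDATA/configs"

# The word list, one entry per line, named <lang>.user-words.
cat > "$TESSDATA/eng.user-words" <<'EOF'
Feb 01
Feb 02
Janson
H.1301 (Cornil)
K.1.105 (La Fontaine)
H.2215 (Ferrer)
EOF

# The config file: skip the built-in dictionaries, load the user word list.
cat > "$TESSDATA/configs/customwords" <<'EOF'
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
EOF

# Once both files sit in the real tessdata directory, the config name is
# passed as the trailing argument. Printed here, not executed.
echo "tesseract s1-0.png test customwords"
```

Copying the two files into the real tessdata directory will probably
need root. Whether entries containing spaces, such as "Feb 02", are
treated as single words is not something this sketch establishes.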

Hope that helps.

Nick

0. http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data

Inês Martins

Jun 12, 2015, 9:05:52 AM
to tesser...@googlegroups.com
"Now, if you pass the word bazaar as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words and eng.user-patterns files you provided. The former is a simple word list, one per line. The format of the latter is documented in dict/trie.h on read_pattern_list()."

I did not quite understand this. I have done what you suggested.