OCR char restriction

sam vara

Aug 29, 2013, 4:33:28 AM
to tesser...@googlegroups.com
This is my first OCR project. I am trying to feed Tesseract an image of an email field containing x...@gmail.com. I have defined a charset restriction that is alphanumeric (A through Z, a through z, plus @ and _). When Tesseract processes this image, it outputs 'G' for the @ symbol and '_' for '.', so I get back xyzG gmail_com. How can I solve this? Should I define a more restrictive character set?

Thanks
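
One thing that can be checked before anything else: a whitelist can only produce characters it contains, and the set described above has no '.'. Below is a minimal Python sketch of that check. The helper name `missing_from_whitelist` and the address `user@example.com` are hypothetical and used only for illustration; they are not part of Tesseract or of the original post.

```python
import string

def missing_from_whitelist(expected: str, whitelist: str) -> set:
    """Return the characters of `expected` that are absent from `whitelist`."""
    return set(expected) - set(whitelist)

# The whitelist as literally listed in the post: A-Z, a-z, '@' and '_'.
whitelist = string.ascii_uppercase + string.ascii_lowercase + "@_"

# Hypothetical address standing in for the elided one in the post.
print(missing_from_whitelist("user@example.com", whitelist))  # {'.'}
```

If the restriction is set through the `tessedit_char_whitelist` config variable, adding '.' to it would at least make the dot expressible. Whether '@' is then recognized correctly is a separate question that this check cannot answer.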

Quan Nguyen

Aug 29, 2013, 11:40:52 PM
to tesser...@googlegroups.com
Try bazaar pattern matching and see if you will have better results.

http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

Shree Devi Kumar

Aug 29, 2013, 11:54:29 PM
to tesser...@googlegroups.com
For details on the bazaar pattern, see the section on config files in

http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

Now, if you pass the word bazaar as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words and eng.user-patterns files you provided. The former is a simple word list, one per line. The format of the latter is documented in dict/trie.h on read_pattern_list().
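
The setup described above could be sketched in Python as follows. This is a sketch under assumptions, not a tested recipe: it assumes the two files belong in the tessdata directory next to the language data, and the helper names (`write_bazaar_inputs`, `bazaar_command`), the file name `field.png`, and the sample word are all hypothetical. The command is only built, not executed.

```python
import tempfile
from pathlib import Path

def write_bazaar_inputs(tessdata_dir, words, patterns, lang="eng"):
    """Write <lang>.user-words and <lang>.user-patterns, one entry per line."""
    tessdata = Path(tessdata_dir)
    words_path = tessdata / f"{lang}.user-words"
    patterns_path = tessdata / f"{lang}.user-patterns"
    words_path.write_text("\n".join(words) + "\n", encoding="utf-8")
    patterns_path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return words_path, patterns_path

def bazaar_command(image, output_base, lang="eng"):
    """Build the argv list; 'bazaar' is passed as the trailing config name."""
    return ["tesseract", image, output_base, "-l", lang, "bazaar"]

# Demo against a temporary directory standing in for the real tessdata folder.
with tempfile.TemporaryDirectory() as tmp:
    words_file, patterns_file = write_bazaar_inputs(
        tmp, words=["gmail"], patterns=[r"http://www.\n\*.com"])
    demo_patterns = patterns_file.read_text(encoding="utf-8")

demo_command = bazaar_command("field.png", "out")
print(demo_command)
```

The resulting list can be handed to `subprocess.run` once the files are in the actual tessdata directory of the installation.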

See the link below for details of the pattern syntax:

http://code.google.com/p/tesseract-ocr/source/browse/trunk/dict/trie.h?r=714

  // The pattern list file should contain one pattern per line in UTF-8 format.
  //
  // Each pattern can contain any non-whitespace characters, however only the
  // patterns that contain characters from the unicharset of the corresponding
  // language will be useful.
  // The only meta character is '\'. To be used in a pattern as an ordinary
  // string it should be escaped with '\' (e.g. string "C:\Documents" should
  // be written in the patterns file as "C:\\Documents").
  // This function supports a very limited regular expression syntax. One can
  // express a character, a certain character class and a number of times the
  // entity should be repeated in the pattern.
  //
  // To denote a character class use one of:
  // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
  // \d - unichar for which UNICHARSET::get_isdigit() is true
  // \n - unichar for which UNICHARSET::get_isdigit() and
  //      UNICHARSET::isalpha() are true
  // \p - unichar for which UNICHARSET::get_ispunct() is true
  // \a - unichar for which UNICHARSET::get_islower() is true
  // \A - unichar for which UNICHARSET::get_isupper() is true
  //
  // \* could be specified after each character or pattern to indicate that
  // the character/pattern can be repeated any number of times before the next
  // character/pattern occurs.
  //
  // Examples:
  // 1-8\d\d-GOOG-411 will be expanded to strings:
  // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
  //
  // http://www.\n\*.com will be expanded to strings like:
  // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
  //
  // Note: In choosing which patterns to include please be aware of the fact
  // providing very generic patterns will make tesseract run slower.
  // For example \n\* at the beginning of the pattern will make Tesseract
  // consider all the combinations of proposed character choices for each
  // of the segmentations, which will be unacceptably slow.
  // Because of potential problems with speed that could be difficult to
  // identify, each user pattern has to have at least kSaneNumConcreteChars
  // concrete characters from the unicharset at the beginning.
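
To get a feel for what a pattern line would accept, the syntax in the comment above can be approximated with a Python regular expression. This is only an illustration of the documented syntax, not Tesseract's implementation. Two assumptions are baked in: the character classes are approximated with ASCII ranges (the real ones depend on the language's unicharset), and `\*` is read as "one or more", which is consistent with the examples in the comment but is not confirmed by it. The function names are hypothetical.

```python
import re
import string

# ASCII approximations of the UNICHARSET classes named in the comment.
# Assumption: the real classes depend on the language's unicharset.
CLASS_TO_REGEX = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",
    "a": "[a-z]",
    "A": "[A-Z]",
}

def pattern_to_regex(pattern: str) -> str:
    """Translate a user-pattern line into an approximate Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash")
            nxt = pattern[i + 1]
            if nxt == "*":
                if not parts:
                    raise ValueError("repetition with nothing to repeat")
                # Assumption: "any number of times" is read as one or more.
                parts[-1] += "+"
            elif nxt == "\\":
                parts.append(re.escape("\\"))  # escaped literal backslash
            elif nxt in CLASS_TO_REGEX:
                parts.append(CLASS_TO_REGEX[nxt])
            else:
                raise ValueError(f"unknown character class: {nxt!r}")
            i += 2
        else:
            parts.append(re.escape(ch))  # ordinary (concrete) character
            i += 1
    return "".join(parts)

def matches(pattern: str, text: str) -> bool:
    """Check whether the whole of `text` fits the translated pattern."""
    return re.fullmatch(pattern_to_regex(pattern), text) is not None

print(matches(r"1-8\d\d-GOOG-411", "1-800-GOOG-411"))          # True
print(matches(r"http://www.\n\*.com", "http://www.a123.com"))  # True
print(matches(r"C:\\Documents", r"C:\Documents"))              # True
```

A translation like this can be used to validate OCR output after the fact, which is a different thing from guiding recognition the way user patterns do.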


sam vara

Sep 3, 2013, 6:59:56 AM
to tesser...@googlegroups.com
Thanks for the reply. A couple of clarifications:

1. "Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words": does this mean I have to define all possible words that my application might encounter?

2. I want to use many regular expression patterns for various fields in my app. Should I define each pattern on its own line? If so, which pattern will it pick for which field?

Andrew McGrath

Jan 20, 2014, 1:34:57 PM
to tesser...@googlegroups.com
Hey Sam,

Did you ever get this working sufficiently?

I'm using a user-pattern file containing the following:
(\d\d\d) \d\d\d-\d\d\d\d
www.\n\*.ca\n\*
www.\n\*.com\n\*
CHANGE DUE $\d\*.\d\d

My hope is to detect phone numbers in the format "(123) 123-1234", website addresses ending in .ca and .com with www in front, and a change-due field.

I'm unsure if I'm approaching this right, so I'd love to hear about your experiences.
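
The four patterns above can be checked against the two constraints stated in the trie.h comment quoted earlier in the thread: patterns are described as containing non-whitespace characters, and each pattern needs a minimum number of concrete characters at the beginning. The sketch below is a rough lint based only on a literal reading of that comment. The function names are hypothetical, and the threshold of 4 is an assumption taken from a later message in this thread; the actual value of kSaneNumConcreteChars should be checked in the source.

```python
def leading_concrete_chars(pattern: str) -> int:
    """Count literal characters before the first character class or repetition."""
    count = 0
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            # A doubled backslash is an escaped literal; anything else ends the run.
            if pattern[i + 1:i + 2] == "\\":
                count += 1
                i += 2
                continue
            break
        count += 1
        i += 1
    return count

def lint_pattern(pattern: str, min_concrete: int = 4) -> list:
    """Return a list of suspected problems (empty if none were found)."""
    problems = []
    if any(ch.isspace() for ch in pattern):
        problems.append("contains whitespace")
    if leading_concrete_chars(pattern) < min_concrete:
        problems.append("too few leading concrete characters")
    return problems

patterns = [
    r"(\d\d\d) \d\d\d-\d\d\d\d",
    r"www.\n\*.ca\n\*",
    r"www.\n\*.com\n\*",
    r"CHANGE DUE $\d\*.\d\d",
]
for p in patterns:
    print(p, "->", lint_pattern(p) or "ok")
```

Under these assumptions the phone-number pattern is flagged twice (it contains a space and starts with only one concrete character), the two website patterns pass, and the change-due pattern is flagged for its spaces. Whether Tesseract actually rejects such lines would need to be confirmed by running it.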

Juergen Harms

Dec 8, 2014, 6:41:55 AM
to tesser...@googlegroups.com
I have a similar problem: I am trying to apply user patterns, such as " \d*>d*+ \d*> ", to minimise errors when I convert the OCR field of payment slips read on a flatbed scanner. I have a nice gtk script that uses scanimage, imagemagick and tesseract, but tesseract is making too many stupid errors (such as converting a plain 6 into an o with accent aigu). The script can post-process and deal with such simple errors, but there are too many cases that it cannot deal with; user patterns would be ideal.

This payment-slip application cannot provide 4 leading concrete characters. But I only read 1 or 2 lines, so speed (the reason given for having the 4-character rule) is not an argument.

I saw a remark that dropping that rule would require setting kSaneNumConcreteChars to 0. Is this parameter configurable? Is it compiled into tesseract? Can I "patch" this into my tesseract?
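
The kind of post-processing mentioned above might look like the following Python sketch. It is not the poster's script: the confusion table, the set of allowed characters, and the sample line are all made up for illustration. Only the "6 read as o with acute accent" case comes from the message itself.

```python
import re
import unicodedata

# Hypothetical confusion table: OCR output mapped back to the intended digit.
# Only the first entry is taken from the message; the second is an assumption.
CONFUSIONS = {"ó": "6", "O": "0"}

# Assumption: a slip line consists only of digits, '>', '+' and spaces.
VALID_LINE = re.compile(r"[0-9>+ ]+")

def repair_numeric_line(text: str) -> str:
    """Replace known confusions, then verify only expected characters remain."""
    # Normalise so that 'o' + combining accent becomes the single character 'ó'.
    text = unicodedata.normalize("NFC", text)
    repaired = "".join(CONFUSIONS.get(ch, ch) for ch in text)
    if not VALID_LINE.fullmatch(repaired):
        raise ValueError(f"unrepairable OCR line: {repaired!r}")
    return repaired

# Made-up sample line, not real payment-slip data.
print(repair_numeric_line("12ó45>789+ 01>"))  # 12645>789+ 01>
```

A table like this only covers one-to-one character confusions, which matches the limitation described in the message: anything beyond that still needs help at recognition time.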