Train Tesseract to Only Find a Single 17 Character Word

ste...@fortyau.com

Nov 11, 2014, 11:50:57 AM

to tesser...@googlegroups.com

I am working on getting Tesseract to recognize VINs for an application I am developing. I have a clean VIN image (preprocessed so that it is black text on a white background). As a first attempt, I built traineddata from the fonts Courier, HelveticaNeue, LatoBold, LatoLight, OpenSans, and RobotoSlab. I've also limited the unicharset to 0-9 and A-Z, excluding I and O.

The result is not very good: Tesseract returns far more characters than the 17 that are actually present. Is there a way to limit Tesseract to detecting a single 17-character word on one line? I'd also like Tesseract to prefer, but not require, that the last 5 characters be digits. There are a few other preferences that may help too, but I want to start with these. I'm not sure how to go about setting up those preferences.

Any suggestions beyond these for cleaning up the OCR so that it reads more accurately would also be helpful. I can't post the full data and image here (they're VINs, so I'd need permission), but I can say that in one instance WM comes back as 6W6M, and in another the digits 67258 come back as 572S5.

Any guidance would be appreciated. I'll provide whatever information I can.

Thanks!
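
One way to approach the 17-character limit and the "prefer digits at the end" wish outside of Tesseract itself is to filter the OCR output afterwards. The following is only a minimal Python sketch under stated assumptions: the function names are hypothetical, the allowed character set is the one described in the post (A-Z without I and O, plus 0-9), and the list of candidates is assumed to come from several OCR passes over the same image.

```python
import re

# Character set from the post: A-Z without I and O, plus 0-9.
ALLOWED = set("ABCDEFGHJKLMNPQRSTUVWXYZ0123456789")
VIN_LENGTH = 17


def normalize_candidate(raw):
    """Upper-case the OCR text and drop all whitespace."""
    return re.sub(r"\s+", "", raw).upper()


def is_plausible_vin(text):
    """Require exactly 17 characters, all taken from the allowed set."""
    return len(text) == VIN_LENGTH and all(ch in ALLOWED for ch in text)


def digit_tail_score(text, tail=5):
    """Count digits among the last `tail` characters (a preference, not a rule)."""
    return sum(ch.isdigit() for ch in text[-tail:])


def pick_best(candidates):
    """Return the plausible candidate with the most digits in its tail, or None."""
    normalized = [normalize_candidate(c) for c in candidates]
    plausible = [c for c in normalized if is_plausible_vin(c)]
    if not plausible:
        return None
    return max(plausible, key=digit_tail_score)
```

A filter like this does not make the recognition itself better; it only rejects results of the wrong length and ranks the remaining ones by how digit-heavy the last five characters are.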

ShreeDevi Kumar

Nov 11, 2014, 12:32:55 PM

to tesser...@googlegroups.com

Have you tested with the English traineddata from the git tessdata repo?

Please see https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

Try the user-patterns example given there:

/path/to/eng.user-patterns:

1-\d\d\d-GOOG-411
www.\n\\\*.com

I haven't tried this personally, though.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1766c3a2-f13d-407b-a474-ad1fa8c7868c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Nov 11, 2014, 1:23:28 PM

to tesser...@googlegroups.com

Also see https://groups.google.com/forum/#!topic/tesseract-ocr/et7bS5QRf2o

ShreeDevi

ste...@fortyau.com

Nov 11, 2014, 3:43:24 PM

to tesser...@googlegroups.com

The eng traineddata gets closer. Using -psm 8 to indicate a single word, together with my specific character whitelist, does improve the odds with eng. However, because it consults the eng dictionaries, it comes up with words like WAR that are not present in the string.

Also, even with -psm 8, Tesseract still inserts spaces as if there were multiple words, despite my specifying that the image be treated as a single word.

The user-patterns looks helpful, but I can't find any documentation on formatting or how it works. Is there documentation on this somewhere?
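
The single-word run described in this post (-psm 8 plus a character whitelist) can be sketched as a small Python wrapper around the command line. This is a sketch, not a tested recipe: the helper names and the output base name are hypothetical, the whitelist is the A-Z (without I and O) plus 0-9 set from the first post, and the -psm spelling is the one used in this thread (newer Tesseract releases spell it --psm). Stripping whitespace afterwards is a workaround for the stray spaces, not a fix for them.

```python
import subprocess

# Whitelist assumed from the thread: A-Z without I and O, plus 0-9.
VIN_WHITELIST = "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789"


def build_command(image_path, output_base, whitelist=VIN_WHITELIST):
    """Build the argument list for a single-word run with a character whitelist."""
    return [
        "tesseract", image_path, output_base,
        "-psm", "8",  # treat the image as a single word
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


def strip_spaces(text):
    """Remove every whitespace character, joining the pieces into one word."""
    return "".join(text.split())


def read_single_word(image_path, output_base="vin_out"):
    """Run Tesseract and return the text of <output_base>.txt without whitespace."""
    subprocess.run(build_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return strip_spaces(handle.read())
```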

ShreeDevi Kumar

Nov 11, 2014, 9:51:57 PM

to tesser...@googlegroups.com

On Wed, Nov 12, 2014 at 2:13 AM, <ste...@fortyau.com> wrote:

The user-patterns looks helpful, but I can't find any documentation on formatting or how it works. Is there documentation on this somewhere?

Did you see the man page? I also sent a link to a related earlier discussion. Search the archives for other tips.

https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

says

"if you pass the word bazaar as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words and eng.user-patterns files you provided. The former is a simple word list, one per line. The format of the latter is documented in dict/trie.h on read_pattern_list()."

https://code.google.com/p/tesseract-ocr/source/browse/dict/trie.h

See lines 199-232.


Steven Norris

Nov 12, 2014, 12:03:15 AM

to tesser...@googlegroups.com

I did see that. Unfortunately, I cannot use bazaar, because the final version of my application will use an iOS CocoaPod that does not support Tesseract's bazaar functionality.



ShreeDevi Kumar

Nov 12, 2014, 1:31:18 AM

to tesser...@googlegroups.com

Are you able to pass a configuration variable with the iOS CocoaPod?

-c configvar=value

Set value for control parameter. Multiple -c arguments are allowed.

configfile

The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value.

ShreeDevi

Steven Norris

Nov 12, 2014, 10:39:22 AM

to tesser...@googlegroups.com

In a way. I can set values for keys that would appear in a config file, like this:

[tesseract setVariableValue:@"0123456789" forKey:@"tessedit_char_whitelist"];


ShreeDevi Kumar

Nov 12, 2014, 11:13:22 AM

to tesser...@googlegroups.com

bazaar is nothing but a config file that sets values for a set of config variables. Please see:

https://code.google.com/p/tesseract-ocr/source/browse/tessdata/configs/bazaar

So, if patterns are helpful, you can set the same variables through your own config.
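
To illustrate the point that bazaar is just a list of variables, here is a hedged Python sketch that renders such a config in the format the man page describes (one variable per line, a space between name and value). The four variable names and values are what I recall the bazaar config containing and should be checked against the linked file; the helper names and the output path are hypothetical.

```python
# Assumed contents of the bazaar config; verify against the linked file.
BAZAAR_LIKE = {
    "load_system_dawg": "F",          # skip the system dictionary
    "load_freq_dawg": "F",            # skip the frequent-words dictionary
    "user_words_suffix": "user-words",
    "user_patterns_suffix": "user-patterns",
}


def render_config(variables):
    """Render {name: value} as config text: one 'name value' pair per line."""
    return "".join("{} {}\n".format(name, value) for name, value in variables.items())


def write_config(path, variables):
    """Write the rendered config to `path` (for example a file under tessdata/configs)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_config(variables))
```

The same name/value pairs could presumably be passed one at a time through a call like the setVariableValue:forKey: example earlier in the thread. Whether the dictionary-loading variables still take effect once the engine is already initialized is something to verify rather than assume.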

ShreeDevi

Steven Norris

Nov 12, 2014, 11:27:31 AM

to tesser...@googlegroups.com

That may work then. Is there any documentation on patterns that you know of? Syntax, format, anything? I'm not sure how to go about formatting my patterns.


ShreeDevi Kumar

Nov 12, 2014, 12:02:59 PM

to tesser...@googlegroups.com

https://code.google.com/p/tesseract-ocr/source/browse/dict/trie.h

  // Inserts the list of patterns from the given file into the Trie.
  // The pattern list file should contain one pattern per line in UTF-8 format.
  //
  // Each pattern can contain any non-whitespace characters, however only the
  // patterns that contain characters from the unicharset of the corresponding
  // language will be useful.
  // The only meta character is '\'. To be used in a pattern as an ordinary
  // string it should be escaped with '\' (e.g. string "C:\Documents" should
  // be written in the patterns file as "C:\\Documents").
  // This function supports a very limited regular expression syntax. One can
  // express a character, a certain character class and a number of times the
  // entity should be repeated in the pattern.
  //
  // To denote a character class use one of:
  // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
  // \d - unichar for which UNICHARSET::get_isdigit() is true
  // \n - unichar for which UNICHARSET::get_isdigit() and
  //      UNICHARSET::isalpha() are true
  // \p - unichar for which UNICHARSET::get_ispunct() is true
  // \a - unichar for which UNICHARSET::get_islower() is true
  // \A - unichar for which UNICHARSET::get_isupper() is true
  //
  // \* could be specified after each character or pattern to indicate that
  // the character/pattern can be repeated any number of times before the next
  // character/pattern occurs.
  //
  // Examples:
  // 1-8\d\d-GOOG-411 will be expanded to strings:
  // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
  //
  // http://www.\n\*.com will be expanded to strings like:
  // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
  //
  // Note: In choosing which patterns to include please be aware of the fact
  // providing very generic patterns will make tesseract run slower.
  // For example \n\* at the beginning of the pattern will make Tesseract
  // consider all the combinations of proposed character choices for each
  // of the segmentations, which will be unacceptably slow.
  // Because of potential problems with speed that could be difficult to
  // identify, each user pattern has to have at least kSaneNumConcreteChars
  // concrete characters from the unicharset at the beginning.
  bool read_pattern_list(const char *filename, const UNICHARSET &unicharset);

  // Initializes the values of *_pattern_ unichar ids.
  // This function should be called before calling read_pattern_list().
  void initialize_patterns(UNICHARSET *unicharset);

  // Fills in the given unichar id vector with the unichar ids that represent
  // the patterns of the character classes of the given unichar_id.
  void unichar_id_to_patterns(UNICHAR_ID unichar_id,
                              const UNICHARSET &unicharset,
                              GenericVector<UNICHAR_ID> *vec) const;
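
To get a feel for the pattern syntax in the comment above, it may help to translate a pattern into an ordinary Python regular expression and try it on sample strings. The sketch below is an illustration only, not how Tesseract evaluates patterns. It makes several assumptions: the character classes are approximated with ASCII ranges (and \n is read as "letter or digit"), \* is read as "one or more repetitions", and the two VIN patterns at the end are hypothetical. Because both VIN patterns start with a character class rather than concrete characters, the note about kSaneNumConcreteChars and speed may well apply to them.

```python
import re
import string

# ASCII approximations of the character classes listed in the trie.h comment.
CLASS_TO_REGEX = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",  # assumption: read as "letter or digit"
    "p": "[" + re.escape(string.punctuation) + "]",
    "a": "[a-z]",
    "A": "[A-Z]",
}


def pattern_to_regex(pattern):
    """Translate a user-patterns line into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            parts.append(re.escape(ch))  # ordinary character
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash")
        nxt = pattern[i + 1]
        if nxt == "*":
            if not parts:
                raise ValueError("\\* has nothing to repeat")
            # Assumption: "any number of times" is read as one or more.
            parts[-1] = "(?:" + parts[-1] + ")+"
        elif nxt == "\\":
            parts.append(re.escape("\\"))  # escaped literal backslash
        elif nxt in CLASS_TO_REGEX:
            parts.append(CLASS_TO_REGEX[nxt])
        else:
            raise ValueError("unknown escape: \\" + nxt)
        i += 2
    return re.compile("".join(parts))


def matches(pattern, text):
    """True if the whole of `text` fits `pattern`."""
    return pattern_to_regex(pattern).fullmatch(text) is not None


# Hypothetical VIN patterns: 17 alphanumerics, or 12 alphanumerics plus 5 digits.
VIN_ANY = r"\n" * 17
VIN_DIGIT_TAIL = r"\n" * 12 + r"\d" * 5
```

Listing both VIN patterns in a user-patterns file would be one way to try expressing "17 characters, preferably ending in 5 digits", but whether Tesseract accepts such generic patterns, and how much they slow recognition down, would need to be tested.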

ShreeDevi