I am working on getting Tesseract to recognize VINs for an application I am developing. I have a clean VIN image (preprocessed so that it is black text on a white background). As a first attempt, I have traineddata built from the fonts Courier, HelveticaNeue, LatoBold, LatoLight, OpenSans, and RobotoSlab. I've also limited the unicharset to 0-9 and A-Z, excluding I and O.
The result is not very good. It returns far more characters than the 17 that are actually present. Is there a way to limit Tesseract to detecting only a single 17-character word on one line? I'd also like Tesseract to prefer, but not require, that the last 5 characters be digits. There are a few other preferences that may help too, but I want to start with these. I'm not sure how to go about setting up those preferences.
Beyond that, any suggestions for cleaning up the OCR so it reads more accurately would be helpful. I can't post the full data and image here (they're VINs, so I'd need permission to do so), but I can say that in one instance WM is coming back as 6W6M, and in another the digits 67258 are coming back as 572S5.
Any guidance would be appreciated. I'll provide whatever information I can.
Thanks!
/path/to/eng.user-patterns:
1-\d\d\d-GOOG-411
www.\n\\\*.com
I haven't tried this personally, though.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1766c3a2-f13d-407b-a474-ad1fa8c7868c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
The user-patterns file looks helpful, but I can't find any documentation on its format or how it works. Is this documented somewhere?
On Tuesday, November 11, 2014 10:50:57 AM UTC-6, ste...@fortyau.com wrote:
-c configvar=value: Set value for control parameter. Multiple -c arguments are allowed.
configfile: The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value.
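To make the two options above concrete, here is a rough Python sketch (not something I have run against Tesseract itself) that renders the same setting both ways: as config-file text with one "variable value" pair per line, and as repeated -c arguments. The helper names render_config and as_c_arguments are made up for illustration; tessedit_char_whitelist is the variable mentioned elsewhere in this thread, and the character set is the one described in the original post (0-9 and A-Z without I and O).

```python
import string

# Characters the original post allows in a VIN: 0-9 and A-Z without I and O.
VIN_CHARS = string.digits + "".join(
    c for c in string.ascii_uppercase if c not in "IO"
)


def render_config(variables):
    """Return config-file text: one 'name value' pair per line.

    Hypothetical helper; it only mirrors the format described above
    (a space separating variable from value, one variable per line).
    """
    lines = []
    for name, value in variables.items():
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid variable name: {name!r}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def as_c_arguments(variables):
    """Return the same settings as repeated '-c name=value' arguments."""
    args = []
    for name, value in variables.items():
        args += ["-c", f"{name}={value}"]
    return args


settings = {"tessedit_char_whitelist": VIN_CHARS}
config_text = render_config(settings)
cli_args = as_c_arguments(settings)

print(config_text, end="")
print(cli_args)
```

The text in config_text could be saved to a file and named as the config on the command line, or the list in cli_args could be appended to the tesseract invocation instead. Either way it is worth checking the result against your own Tesseract version.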
[tesseract setVariableValue:@"0123456789" forKey:@"tessedit_char_whitelist"];
  // Inserts the list of patterns from the given file into the Trie.
  // The pattern list file should contain one pattern per line in UTF-8 format.
  //
  // Each pattern can contain any non-whitespace characters, however only the
  // patterns that contain characters from the unicharset of the corresponding
  // language will be useful.
  // The only meta character is '\'. To be used in a pattern as an ordinary
  // string it should be escaped with '\' (e.g. string "C:\Documents" should
  // be written in the patterns file as "C:\\Documents").
  // This function supports a very limited regular expression syntax. One can
  // express a character, a certain character class and a number of times the
  // entity should be repeated in the pattern.
  //
  // To denote a character class use one of:
  // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
  // \d - unichar for which UNICHARSET::get_isdigit() is true
  // \n - unichar for which UNICHARSET::get_isdigit() and
  //      UNICHARSET::isalpha() are true
  // \p - unichar for which UNICHARSET::get_ispunct() is true
  // \a - unichar for which UNICHARSET::get_islower() is true
  // \A - unichar for which UNICHARSET::get_isupper() is true
  //
  // \* could be specified after each character or pattern to indicate that
  // the character/pattern can be repeated any number of times before the next
  // character/pattern occurs.
  //
  // Examples:
  // 1-8\d\d-GOOG-411 will be expanded to strings:
  // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
  //
  // http://www.\n\*.com will be expanded to strings like:
  // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
  //
  // Note: In choosing which patterns to include please be aware of the fact
  // providing very generic patterns will make tesseract run slower.
  // For example \n\* at the beginning of the pattern will make Tesseract
  // consider all the combinations of proposed character choices for each
  // of the segmentations, which will be unacceptably slow.
  // Because of potential problems with speed that could be difficult to
  // identify, each user pattern has to have at least kSaneNumConcreteChars
  // concrete characters from the unicharset at the beginning.
  bool read_pattern_list(const char *filename, const UNICHARSET &unicharset);

  // Initializes the values of *_pattern_ unichar ids.
  // This function should be called before calling read_pattern_list().
  void initialize_patterns(UNICHARSET *unicharset);

  // Fills in the given unichar id vector with the unichar ids that represent
  // the patterns of the character classes of the given unichar_id.
  void unichar_id_to_patterns(UNICHAR_ID unichar_id,
                              const UNICHARSET &unicharset,
                              GenericVector<UNICHAR_ID> *vec) const;
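
For sanity-checking a patterns file before handing it to Tesseract, the syntax described in the comment above can be approximated with an ordinary regular expression. The Python sketch below is only that, an approximation, and it rests on several assumptions of mine: the character classes are plain ASCII stand-ins (Tesseract consults its own unicharset), \n is read as "digit or letter" because that is what the http://www.\n\*.com example suggests, and \* is treated as "one or more", which fits the examples but is not spelled out in the comment. The function name pattern_to_regex is made up, and it does not model the kSaneNumConcreteChars requirement.

```python
import re
import string

# ASCII stand-ins for the unicharset-based classes (an assumption; Tesseract
# asks its own unicharset whether a unichar is alpha, digit, and so on).
CLASS_TO_REGEX = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",  # read as "digit or letter", per the www example
    "p": "[" + re.escape(string.punctuation) + "]",
    "a": "[a-z]",
    "A": "[A-Z]",
}


def pattern_to_regex(pattern):
    """Translate one user-pattern line into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            # Ordinary character: match it literally.
            parts.append(re.escape(ch))
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash at end of pattern")
        nxt = pattern[i + 1]
        if nxt == "*":
            if not parts:
                raise ValueError("\\* has nothing to repeat")
            # Assumption: repetition means one or more occurrences.
            parts[-1] += "+"
        elif nxt == "\\":
            # Escaped backslash stands for a literal backslash.
            parts.append(re.escape("\\"))
        elif nxt in CLASS_TO_REGEX:
            parts.append(CLASS_TO_REGEX[nxt])
        else:
            raise ValueError(f"unknown escape: \\{nxt}")
        i += 2
    return re.compile("".join(parts))


phone = pattern_to_regex(r"1-8\d\d-GOOG-411")
url = pattern_to_regex(r"http://www.\n\*.com")

print(bool(phone.fullmatch("1-800-GOOG-411")))
print(bool(phone.fullmatch("1-900-GOOG-411")))
print(bool(url.fullmatch("http://www.a123.com")))
```

Running candidate strings through fullmatch gives a quick idea of what a pattern admits, but it says nothing about how Tesseract will weigh the pattern during recognition.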