Re: [tesseract-ocr] OCR failing on simple and clear text codes


Dmitri Silaev

May 20, 2015, 6:29:08 AM

to tesser...@googlegroups.com

One no-brainer method to try would be to turn off all dictionaries and use your own custom "user-patterns" file. Since you mention "your application", I assume you can program, so take a look at the comment preceding the read_pattern_list() declaration in "dict/trie.h" for more details.

It seems all your strings are of the same format:

\A\A\d\d\d\d\d\d\d\d\d\d

(Tess understands very limited pattern syntax).
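To get a feel for what such a pattern accepts, here is a minimal Python sketch that turns a pattern string into a regular expression and checks OCR output against it. This is not Tesseract's own parser: the function name pattern_to_regex is hypothetical, and the mapping (\A upper-case, \a lower-case, \d digit, \* repeating the previous element) is my reading of the trie.h comment, reduced to ASCII.

```python
import re

# Assumed subset of the escape classes from the trie.h comment,
# reduced to ASCII for illustration only.
PATTERN_CLASSES = {
    "A": "[A-Z]",  # upper-case letter
    "a": "[a-z]",  # lower-case letter
    "d": "[0-9]",  # digit
}


def pattern_to_regex(pattern):
    """Translate a user-patterns style string into a compiled regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            code = pattern[i + 1]
            if code == "*":
                # "\*" lets the previous element repeat any number of times
                if not parts:
                    raise ValueError("'\\*' has nothing to repeat")
                parts[-1] += "*"
            elif code in PATTERN_CLASSES:
                parts.append(PATTERN_CLASSES[code])
            else:
                # Unknown escape: treat the escaped character as a literal
                parts.append(re.escape(code))
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


# Two upper-case letters followed by ten digits, as in the pattern above
CODE_RX = pattern_to_regex(r"\A\A" + r"\d" * 10)
```

CODE_RX.fullmatch(text) then returns a match only for strings of that exact shape, so it can also serve as a cheap sanity check on whatever Tesseract returns.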

But if accuracy is critical in your app, in the long run I would absolutely avoid using any part of Tesseract except the character classifier. That is, crop every single character out of your source image and run Tess in the single-char PSM. I think it should be easy as long as the location of every character is fairly stable across your source images. ImageMagick and shell scripts would suffice.
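A rough Python sketch of that idea, assuming the characters sit in equal-width cells across the image (a strong assumption that has to be checked against the real images). The helper names are hypothetical, the actual cropping is left to whatever image tool is at hand, and the command is only built here, not executed.

```python
def char_boxes(image_width, image_height, n_chars, margin=0):
    """Split the image into n_chars equal-width cells.

    Returns (left, top, right, bottom) tuples with a top-left origin.
    Assumes evenly spaced characters; adjust for the real layout.
    """
    usable = image_width - 2 * margin
    boxes = []
    for i in range(n_chars):
        left = margin + (usable * i) // n_chars
        right = margin + (usable * (i + 1)) // n_chars
        boxes.append((left, 0, right, image_height))
    return boxes


def single_char_command(crop_path, out_base, config_file):
    """Build a tesseract call for one cropped character.

    Page segmentation mode 10 treats the image as a single character.
    "-psm" is the 3.x spelling; newer releases spell it "--psm".
    """
    return ["tesseract", crop_path, out_base, "-psm", "10", config_file]
```

Each box could be passed to an image library's crop function (Pillow's Image.crop takes the same left, upper, right, lower order), and each command run with subprocess.run once the crops exist on disk.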

Best regards,
Dmitri Silaev
www.CustomOCR.com

On Wed, May 20, 2015 at 12:52 PM, Yoann Nicod <th3.t...@gmail.com> wrote:

Hello,

As a Tesseract beginner, I'm facing a problem that I hope experienced Tesseract users will have a simple or obvious solution to.
I am running Tesseract on codes I want to read. I run tesseract.exe with this command line: "tesseract.exe in.png out configfile"
Here is the content of my configfile:

tessedit_create_boxfile 1
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

I run it on images that look like this one:
Most of the time, the characters read and the boxes are OK. But I have identified 3 different issues that happen from time to time.

I - Wrong character read, confusion between '0', 'O' and 'D'.

For example, for this image:
Tesseract gives me: "UFO05D424091"
I am aware that training would improve recognition, but for reasons I don't want to go into here I cannot do that, and I was hoping the recognition engine would work well on such a simple font. Are there any parameters I could set to improve the results? I should add that since D, 0 and O are all likely to appear in the codes, I can't exclude D and O with the whitelist.

II - Threshold artifacts disturb the recognition.

When my threshold operation leaves some black pixels, like in this picture:
The resulting boxes are:
The recognized code is right, but the wrong box is very problematic in my application. I know I could improve my pre-processing, with a morphological operation for example, but I want to know if there is a setting that would make Tesseract ignore these black pixels. It is strange that Tesseract is not bothered by one character of a word being far bigger than the others.
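Independently of any Tesseract setting, a box that is much bigger than its neighbours can be detected after the fact. Below is a minimal Python sketch that parses box file text (one "symbol left bottom right top page" line per character, origin at the bottom-left) and flags boxes that are much taller than the median. The function names and the 1.5 ratio are arbitrary assumptions for illustration.

```python
from statistics import median


def parse_box_line(line):
    """Parse one box file line: symbol left bottom right top page."""
    parts = line.split()
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return parts[0], left, bottom, right, top


def oversized_boxes(box_text, max_ratio=1.5):
    """Return the boxes whose height exceeds max_ratio times the median height."""
    boxes = [parse_box_line(l) for l in box_text.splitlines() if l.strip()]
    if not boxes:
        return []
    heights = [top - bottom for _, _, bottom, _, top in boxes]
    typical = median(heights)
    return [box for box, h in zip(boxes, heights) if h > max_ratio * typical]
```

A flagged box could then be re-cropped or the image rejected, depending on what the application needs.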

III - Wrong character segmentation.

Whereas the first two problems are understandable, I don't get how this one can happen.
Let's take the first example:
it leads to these boxes:
and the following recognised code: UM050409017.
Here is the second example:
leading to:
and the code is: UAZZO51717151.
How is this possible? The input images are perfectly clear, so I don't see the problem. Again, is there a setting I could use to avoid this?

I hope I am missing something obvious for at least one of my problems. I have to admit that the list of all the possible parameters (which I found here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) is hard to master, and since I am a beginner I don't know what to do next.
Thanks in advance for your help. I have attached an archive containing all the images.

Regards

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ba001838-4465-4bea-ab83-782af58c2c01%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Dmitri Silaev

May 20, 2015, 8:48:03 AM

to tesser...@googlegroups.com

I can't really use pre defined patterns since the code pattern and font can change over time.

Think of using somewhat more flexible patterns, by means of '*'. Second, you can use more than one pattern in "user-patterns". And fonts have nothing to do with patterns.

Implementing your own char-by-char segmentation is relatively easy even with ImageMagick and shell scripts, given that you receive nicely binarized and cleaned source images. As far as I can see, this is indeed the case. I suggest connected-component (CC) labeling. For one possible implementation, see my reply here: https://groups.google.com/d/msg/tesseract-ocr/STHaLGYsiCo/pYZyAG2AuMAJ
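For readers who prefer not to follow the link, here is a small pure-Python sketch of CC labeling on a binarized image. It is not the implementation from the linked reply: the image is assumed to be a list of rows of 0/1 values (1 = ink), connected_components is a hypothetical helper, and min_area is an assumed knob for dropping small threshold specks.

```python
from collections import deque


def connected_components(grid, min_area=1):
    """Return bounding boxes of 8-connected foreground blobs.

    Boxes are (left, top, right, bottom), inclusive, sorted left to right.
    Blobs with fewer than min_area pixels are dropped.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not grid[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            area = 0
            while queue:
                cx, cy = queue.popleft()
                area += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if area >= min_area:
                boxes.append((left, top, right, bottom))
    return sorted(boxes)
```

With upper-case letters and digits each character is normally a single blob, so every box maps to one character. Touching characters or broken strokes would need extra merging or splitting logic that this sketch does not attempt.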

In my experience, a problem like your #3 cannot be solved reliably by parameter tweaking alone. You defeat one issue, and eventually another arises. Then you waste your time investigating whether it was caused by a recent parameter change or is independent of it. Change back, tweak another, fight a new issue. Repeat.

A better way is to *force* conditions for reliable OCR: preprocessing, white-/blacklists, your own segmentation using layout priors, etc.

Or, at the very least, *postprocess* the OCR output. E.g. at some positions your O's are definitely zeros. I know people who ended up with *thousands* of such rules for Tess output in an app that allows much more diverse input than yours.
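One such positional rule can be sketched in a few lines of Python. The layout string ('A' for a letter position, 'd' for a digit position), the confusion tables and the name fix_by_position are all assumptions made for illustration; in particular, treating a 'D' in a digit position as '0' is a guess that has to be validated against real data.

```python
# Assumed confusion tables; extend them from errors actually observed.
TO_DIGIT = {"O": "0", "D": "0"}
TO_LETTER = {"0": "O"}


def fix_by_position(text, layout):
    """Apply per-position corrections.

    layout holds one 'A' (letter) or 'd' (digit) per character.
    Returns None when the length does not match, which signals a
    segmentation error rather than a classification error.
    """
    if len(text) != len(layout):
        return None
    fixed = []
    for ch, kind in zip(text, layout):
        if kind == "d":
            ch = TO_DIGIT.get(ch, ch)
        elif kind == "A":
            ch = TO_LETTER.get(ch, ch)
        fixed.append(ch)
    return "".join(fixed)
```

Called with "UFO05D424091" and a layout of two letters followed by ten digits, this turns the 'O' and the 'D' in digit positions into zeros. Whether that matches the code actually printed on the image cannot be told from the thread.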

-Dmitri

On Wed, May 20, 2015 at 2:52 PM, Yoann Nicod <th3.t...@gmail.com> wrote:

Thanks for your reply,

I can't really use pre defined patterns since the code pattern and font can change over time.
I like the idea of segmenting the characters myself before giving them to Tesseract one by one, but it looks time-consuming (coding it, I mean).
Isn't there any other suitable method? In particular for the 3rd issue, which I think must be easy to solve.

