Ugly behavior when recognizing – advice requirement

Andres

May 3, 2013, 12:24:13 PM
to tesser...@googlegroups.com

Dear people,

I trained Tesseract for my font (FE-Schrift: http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad results. I am using Tesseract 3.01 under Windows.

In this image:

https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing

The text is SAA5298 but I’m getting SM529B. This is being done from inside a program, and I know that the “M” in the result comes from the “AA” in the source. So Tesseract is segmenting these two characters very badly, even though they are clearly separated, as you can see. Do you have any idea why this is happening? Also, is there a way to give Tesseract a hint for this (e.g., telling it the character width)?

The other problem is with this one:

https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing

The text is LDA6244, but Tesseract recognizes a “5” instead of a “6”, even though the image is very good.

Here is my fonts training file:

https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing

Here is my box file:

https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing

Here is my .traineddata file:

https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing

And here is a .cmd file for testing these 2 images:

https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing

Thanks,

Andres

Dmitri Silaev

May 3, 2013, 3:05:50 PM
to tesser...@googlegroups.com
Andres,

First of all, your first link seems to point to a "traineddata" file
instead of an image. Second, without diving deep into your problem,
I can suggest specifying the single-line psm mode on the command
line. And finally, you can use the user patterns feature to restrict
Tesseract's possible output (for the format, see the comments in
dict/trie.h on read_pattern_list()). Another way of achieving the
latter, which is what we do in CustomOCR and which seems to be more
reliable, is to use the API to get a number of character variants for
each blob along with their confidence levels and match them against a
set of possible patterns. You can find out how to do this by searching
this forum.
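
A minimal sketch of that last idea, for illustration only (it is not CustomOCR's code and it does not call Tesseract). It assumes the per-blob alternatives have already been collected into plain structs. The names Choice, matches_class and best_match, and the pattern alphabet ('A' for an uppercase letter, 'D' for a digit), are hypothetical:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One candidate reading for a blob: a character and its confidence
// (higher means more likely). Hypothetical type, filled in by the caller.
struct Choice {
    char ch;
    float confidence;
};

// Hypothetical pattern alphabet: 'A' = uppercase letter, 'D' = digit.
bool matches_class(char ch, char pattern_class) {
    const unsigned char u = static_cast<unsigned char>(ch);
    switch (pattern_class) {
        case 'A': return std::isupper(u) != 0;
        case 'D': return std::isdigit(u) != 0;
        default:  return false;
    }
}

// For every blob, keep the highest-confidence choice that the pattern allows
// at that position. Returns an empty string when the number of blobs differs
// from the pattern length or when some blob has no allowed choice.
std::string best_match(const std::vector<std::vector<Choice>>& blobs,
                       const std::string& pattern) {
    if (blobs.size() != pattern.size()) return "";
    std::string out;
    for (std::size_t i = 0; i < blobs.size(); ++i) {
        const Choice* best = nullptr;
        for (const Choice& c : blobs[i]) {
            if (!matches_class(c.ch, pattern[i])) continue;
            if (best == nullptr || c.confidence > best->confidence) best = &c;
        }
        if (best == nullptr) return "";
        out += best->ch;
    }
    return out;
}
```

With made-up alternatives where the last blob offers "B" (confidence 80) ahead of "8" (confidence 75), the pattern "AAADDDD" rejects the "B" at a digit position and the "8" is kept. A fuller version would try several patterns and score whole strings rather than choosing each position independently.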

HTH and good luck with Tesseract!

Warm regards,
Dmitri Silaev
www.CustomOCR.com

Andres

May 6, 2013, 12:50:42 AM
to tesser...@googlegroups.com
Hi Dmitri,

Many thanks for your hints, as always.

Regarding the links in my previous message, sorry about that. I'll repost the entire message below, with the links fixed.

I like the method that you say you use in CustomOCR. Is there a way of getting the character variants without resorting to a hack? As far as I can see, the API only exposes the confidence level for each character. Am I right about this?

Regarding psm mode, I'm setting it from inside my code with value 7, which is "Treat the image as a single text line". Is that the parameter you are suggesting?

Anyway, I think I might have made big newbie errors in my training, so I would be grateful if you could take a look at my training image and my problematic images and tell me whether you see an obvious error at first sight.

My training image:
https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing

Problematic image (a "6" recognized as a "5"):
https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing

Another problematic image ("A A" recognized as "M")
https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit

The following is my original message with the links fixed:


Dear people,

I trained Tesseract for my font (FE-Schrift: http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad results. I am using Tesseract 3.01 under Windows.

In this image:

https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing

The text is SAA5298 but I’m getting SM529B. This is being done from inside a program, and I know that the “M” in the result comes from the “AA” in the source. So Tesseract is segmenting these two characters very badly, even though they are clearly separated, as you can see. Do you have any idea why this is happening? Also, is there a way to give Tesseract a hint for this (e.g., telling it the character width)?

The other problem is with this one:

https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing

The text is LDA6244, but Tesseract recognizes a “5” instead of a “6”, even though the image is very good.

Here is my fonts training file:

https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing

Here is my box file:

https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing

Here is my .traineddata file:

https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing

Andres

May 6, 2013, 1:28:56 AM
to tesser...@googlegroups.com
To answer part of my last question: I've found a way of getting the alternatives for each character, but it does not seem to work in 3.01, according to my tests and http://code.google.com/p/tesseract-ocr/issues/detail?id=714
My snippet:

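#include <iostream>
#include <api/resultiterator.h>

...

// Ask Tesseract to keep the alternative choices for each blob.
tess_api.SetVariable("save_blob_choices", "T");

...

tesseract::ResultIterator* it = tess_api.GetIterator();
if (it != NULL)
{
    do
    {
        // GetUTF8Text() on the result iterator returns a string that the caller must delete[].
        char* uval = it->GetUTF8Text(tesseract::RIL_SYMBOL);
        std::cout << (uval == NULL ? "#" : uval) << "(" << it->Confidence(tesseract::RIL_SYMBOL) << "){";
        delete[] uval;

        // Print every alternative for this symbol with its confidence.
        tesseract::ChoiceIterator ci(*it);
        do
        {
            // This string is owned by the choice iterator; do not delete it.
            const char* val = ci.GetUTF8Text();
            std::cout << " " << (val == NULL ? "#" : val) << " " << ci.Confidence();
        }
        while (ci.Next());
        std::cout << "}";
    }
    while (it->Next(tesseract::RIL_SYMBOL));
    delete it;  // the iterator returned by GetIterator() must be deleted after use
}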

Dmitri Silaev

May 7, 2013, 4:50:12 AM
to tesser...@googlegroups.com
Andres,

Your code seems to be correct. I personally use a few more lines right
after the call to GetIterator():
it->Begin();
if(it->IsAtFinalElement(RIL_BLOCK, RIL_SYMBOL))
    return;
if(!it->IsAtBeginningOf(RIL_SYMBOL))
    return;
But you don't need to worry about this if you only deal with
non-degenerate cases.

Well, I suggest using revision 724. It is battle-tested by me and
probably contains fewer bugs and has a better balance between accuracy
and speed than any newer revision. Although newer ones may
introduce many fancy features, I'd refrain from using them in
production. Maybe this can help you.

Warm regards,
Dmitri Silaev
www.CustomOCR.com


Andres

May 21, 2013, 1:54:23 AM
to tesser...@googlegroups.com

Hi Dmitri,

Many thanks for your help.

I tried setting PageSegMode to PSM_SINGLE_BLOCK_VERT_TEXT and, surprisingly, I got very good results.

But then I switched from Tesseract 3.01 to 3.02 (revision 724) and the behavior of Tesseract changed significantly, and not for the better in my case. It began to detect two characters where there is only one, one in a higher position and another in a lower position.

So I tested calling Tesseract once per character (PSM_SINGLE_CHAR), since I do the segmentation myself. The results for some characters were OK, but for others it detected the inner contours of characters like Q as a character (please see the red rectangle in this image: https://docs.google.com/file/d/0BxkuvS_LuBAzeDJQRWg2aHBnNFU/edit?usp=sharing )

Do you have any suggestions on this?

I’ve been thinking that there may be a variable that restricts Tesseract a little ( http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version ), but the list is so long that it discourages me.

I have also been thinking of doing something with RunAdaptiveClassifier, which is exposed by the API, but I’m not sure whether that function can be used to OCR a single character.

What is particular about my use case is that I already have the text segmented, so I would expect this to be easy. That’s why I think I may be making a big mistake somewhere.

Best regards,

Andres

Dmitri Silaev

May 23, 2013, 1:34:17 AM
to tesser...@googlegroups.com
Andres,

By design, Tesseract detects both normal and inverted text, possibly
within the same image. This is often what confuses it about which
pixels are background and which are foreground: sometimes the
interior of a closed character is treated as a character, and the
foreground pixels as its surrounding background. That's why it is
sometimes impractical to pass isolated character images or images
with very little text: they can confuse Tesseract. I suggest passing
a whole text line and then iterating over the results, reading the
recognized characters and their confidence levels.

Warm regards,
Dmitri Silaev
www.CustomOCR.com

