Why do I get such poor results from Tesseract for simple single-character recognition?

Yuliana Zigangirova

Oct 15, 2018, 4:44:16 PM

to tesseract-ocr

Hi everyone,

I am trying to use Tesseract for single-character recognition and the results are awful.

"h" is recognized as "n", "4" as "/i", "O" as "()".

Single character mode does not seem to take effect, as many characters are recognized as two characters,
not just one. My images are simple bilevel black-and-white TIFF images of
Latin characters. They come from a bitmap font, not from scans, so they are absolutely clean and
need no improvement.
Only about half of the characters are recognized correctly, which seems to be
a very low rate for such a simple task.

The Tesseract library version I am using is "4.0.0-beta.3".
This is how I call Tesseract:

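#include <cstdio>
#include <cstdlib>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

// Recognizes a single character from in-memory TIFF data.
// Returns the first byte of the UTF-8 result, or -1 on failure.
int CharRecognizer::recognizeTIFFData(char* data, int datalength){
            tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
            // Initialize tesseract-ocr with German ("deu"), without specifying tessdata path
            if (api->Init(NULL, "deu")) {
                    fprintf(stderr, "Could not initialize tesseract.\n");
                    exit(1);
            }
            api->SetPageSegMode(tesseract::PSM_SINGLE_CHAR);
            // Decode the TIFF held in memory into a Leptonica Pix
            Pix *image = pixReadMem(reinterpret_cast<const l_uint8*>(data), datalength);
            if (image == NULL) {
                    fprintf(stderr, "Could not decode image data.\n");
                    api->End();
                    delete api;
                    return -1;
            }
            api->SetImage(image);
            // Get OCR result
            char *outText = api->GetUTF8Text();
            int utf8 = -1;
            if (outText != NULL) {
                    printf("\nOCR output:\n%s", outText);
                    // Only the first byte; the cast avoids negative values for non-ASCII
                    utf8 = static_cast<unsigned char>(outText[0]);
            }
            // Destroy used objects and release memory
            api->End();
            delete api;
            delete[] outText;
            pixDestroy(&image);
            return utf8;
}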
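
A side note on the return value of the function above: `outText[0]` is only the first byte of the UTF-8 output, so with the "deu" model a character such as "ä" (two bytes in UTF-8) would not come back as a usable value. Below is a minimal sketch of decoding the first full code point instead. It is plain C++ with no Tesseract dependency; the helper name `firstCodePoint` is made up for this illustration, and it does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Tesseract): decode the first Unicode code
// point of a UTF-8 string. Returns -1 for empty or malformed input.
int32_t firstCodePoint(const std::string& utf8) {
    if (utf8.empty()) return -1;
    const unsigned char b0 = static_cast<unsigned char>(utf8[0]);
    int extra;    // number of continuation bytes expected after the lead byte
    int32_t cp;   // code point accumulated so far
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return -1;  // stray continuation byte or invalid lead byte
    if (utf8.size() < static_cast<std::size_t>(extra) + 1) return -1;
    for (int i = 1; i <= extra; ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80) return -1;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
```

Called as `firstCodePoint(outText)` before `outText` is freed, this would return the code point of the recognized character and ignore any trailing newline in the OCR output.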

I am new to Tesseract, so I am probably missing something. Do I have to train
the library first somehow? Maybe I should set a different OcrEngineMode? I expected no
problems recognizing a simple bitmap font and am quite at a loss now.
Thank you very much in advance,
Yuliana

Lorenzo Bolzani

Oct 15, 2018, 5:32:27 PM

to tesser...@googlegroups.com

Try to use psm 7 or 13 (SINGLE_LINE and RAW_LINE). In my case 7 works best.

I'm not 100% sure, but it may be easier to recognize full words than single characters. I do not know, though, whether this is just a test or what you actually need to do.

The default oem mode (LSTM) should be the best, but you may also try the old (legacy) one and see which works better in this case.

You can train (fine-tune) the LSTM models, but it is not mandatory.

Bye

Lorenzo


Zdenko Podobny

Oct 16, 2018, 2:04:04 AM

to tesser...@googlegroups.com

If you have a quality problem, it is good to experiment with the tesseract executable instead of the API ;-)
It is known that passing tightly cropped text (in your case just one letter) is not the best idea. Please try adding a small white border, e.g. 10 px.
Please set the DPI for the image after calling SetImage.
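
The border advice above can be sketched without any imaging library. This is only an illustration under stated assumptions: the image is held as an 8-bit grayscale buffer (1 byte per pixel, 255 = white), and the `GrayImage` struct and `addWhiteBorder` function are hypothetical names, not part of Tesseract or Leptonica.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for an 8-bit grayscale image (assumption for this sketch).
struct GrayImage {
    int width;
    int height;
    std::vector<uint8_t> pixels;  // row-major, 1 byte per pixel, 255 = white
};

// Returns a copy of src surrounded by `border` white pixels on every side.
GrayImage addWhiteBorder(const GrayImage& src, int border) {
    assert(border >= 0);
    assert(src.pixels.size() ==
           static_cast<std::size_t>(src.width) * src.height);
    GrayImage dst;
    dst.width = src.width + 2 * border;
    dst.height = src.height + 2 * border;
    // Start from an all-white canvas, then copy the original into the middle.
    dst.pixels.assign(static_cast<std::size_t>(dst.width) * dst.height, 255);
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            dst.pixels[static_cast<std::size_t>(y + border) * dst.width + (x + border)] =
                src.pixels[static_cast<std::size_t>(y) * src.width + x];
        }
    }
    return dst;
}
```

The padded buffer could then be handed to the raw-buffer overload of `SetImage` (`api->SetImage(dst.pixels.data(), dst.width, dst.height, 1, dst.width)`), followed by `api->SetSourceResolution(...)` to set the DPI. When the image is already a Leptonica `Pix`, `pixAddBorder` is presumably the more direct route.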

See attachment for improved images.

$ tesseract char_4_b.png - --psm 10 -c page_separator=""

4

For single-character recognition the legacy engine is better, and it can process your images without modification (but the rules above are generally good to follow!):

$ tesseract char_0.png - --psm 10 --oem 0 --dpi 800 -c page_separator=""

0

$ tesseract char_4.png - --psm 10 --oem 0 --dpi 800 -c page_separator=""

4

$ tesseract char_h.png - --psm 10 --oem 0 --dpi 800 -c page_separator=""

h
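
To repeat the commands above over a whole set of glyph images, or to compare them with the psm 7 and 13 settings suggested earlier in the thread, the argument list can be built programmatically. A rough sketch follows; `OcrOptions` and `tesseractArgs` are made-up names, and the defaults simply mirror the flags shown above.

```cpp
#include <string>
#include <vector>

// Hypothetical option bundle; defaults mirror the commands in this post.
struct OcrOptions {
    int psm = 10;   // page segmentation mode (10 = single character)
    int oem = 0;    // OCR engine mode (0 = legacy engine)
    int dpi = 800;  // resolution passed via --dpi
};

// Builds the argument vector for the tesseract executable, writing to stdout ("-").
std::vector<std::string> tesseractArgs(const std::string& imagePath,
                                       const OcrOptions& opt = OcrOptions()) {
    return {"tesseract", imagePath, "-",
            "--psm", std::to_string(opt.psm),
            "--oem", std::to_string(opt.oem),
            "--dpi", std::to_string(opt.dpi),
            // The shell form page_separator="" reaches the program as an empty value.
            "-c", "page_separator="};
}
```

Through the API, the rough equivalent would presumably be passing `tesseract::OEM_TESSERACT_ONLY` as the third argument of `Init` and calling `SetSourceResolution(800)` after `SetImage`. Note that the legacy engine only works with traineddata files that include the legacy model.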

Zdenko

On Mon, Oct 15, 2018 at 22:44, 'Yuliana Zigangirova' via tesseract-ocr <tesser...@googlegroups.com> wrote:

char_h_b.png

char_0_b.png

char_4_b.png

Yuliana Zigangirova

Oct 16, 2018, 7:00:43 AM

to tesseract-ocr

Thank you very much, I'll try all suggested changes. I have already tried borders