Why do I get such poor results from Tesseract for simple single-character recognition?

Yuliana Zigangirova

Oct 15, 2018, 4:44:16 PM
to tesseract-ocr
Hi everyone,

I am trying to use Tesseract for single-character recognition and the results are awful:
"h" is recognized as "n", "4" as "/i", and "O" as "()".

1testtiff.png
6testtiff.png
2testtiff.png

Single-character mode does not seem to take effect, as many characters are recognized as
two characters, not just one. My images are simple bilevel black-and-white TIFF images of
Latin characters. They are rendered from a bitmap font, not scanned, so they are absolutely
clean and need no improvement.
Only about half of the characters are recognized correctly, which seems to be
a very low rate for such a simple task.

 The Tesseract library version I am using is "4.0.0-beta.3".
This is how I call Tesseract:

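 #include <cstdio>
 #include <cstdlib>
 #include <leptonica/allheaders.h>
 #include <tesseract/baseapi.h>

 int CharRecognizer::recognizeTIFFData(char* data, int datalength){
             tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
             // Initialize tesseract-ocr with German ("deu"), without specifying tessdata path
             if (api->Init(NULL, "deu")) {
                     fprintf(stderr, "Could not initialize tesseract.\n");
                     exit(1);
             }
             api->SetPageSegMode(tesseract::PSM_SINGLE_CHAR);
             // Decode the in-memory image; pixReadMem returns NULL on failure
             Pix *image = pixReadMem(reinterpret_cast<const l_uint8*>(data), datalength);
             if (image == NULL) {
                     fprintf(stderr, "Could not read image data.\n");
                     exit(1);
             }
             api->SetImage(image);
             // Get OCR result
             char *outText = api->GetUTF8Text();
             printf("\nOCR output:\n%s", outText ? outText : "");
             // Note: this is only the first byte of the UTF-8 text, not a full code point
             int utf8 = outText ? static_cast<unsigned char>(outText[0]) : -1;
             // Destroy used objects and release memory
             api->End();
             delete api;
             delete[] outText;
             pixDestroy(&image);
             return utf8;
 }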
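
A side note on the return value: GetUTF8Text() returns UTF-8 encoded text, so outText[0] is only the first byte of the result. With the "deu" model, letters such as ä, ö, ü and ß occupy two bytes each, so the first byte alone does not identify them. Below is a minimal sketch of decoding the first code point instead. The helper name firstCodePoint is hypothetical (it is not part of Tesseract), and the decoder is deliberately simplified: it does not reject overlong encodings or surrogates.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the Tesseract API.
// Decodes the first Unicode code point of a NUL-terminated UTF-8 string.
// Returns -1 for a null pointer, an empty string or a malformed sequence.
std::int32_t firstCodePoint(const char* text) {
    if (text == nullptr || text[0] == '\0') return -1;
    const unsigned char* s = reinterpret_cast<const unsigned char*>(text);
    int extra;        // number of continuation bytes expected
    std::int32_t cp;  // code point being assembled
    if (s[0] < 0x80)                { return s[0]; }  // plain ASCII
    else if ((s[0] & 0xE0) == 0xC0) { extra = 1; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { extra = 2; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { extra = 3; cp = s[0] & 0x07; }
    else                            { return -1; }    // stray continuation byte
    for (int i = 1; i <= extra; ++i) {
        // A NUL byte fails this check too, so we never read past the end.
        if ((s[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```

In the function above, this would be called as firstCodePoint(outText) before outText is freed.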


 I am new to Tesseract, so I am probably missing something. Do I have to train
 the library first somehow? Maybe I should set another OcrEngineMode? I expected no
 problems with simple bitmap font recognition and am quite at a loss now.
Thank you very much in advance,
Yuliana 

Lorenzo Bolzani

Oct 15, 2018, 5:32:27 PM
to tesser...@googlegroups.com

Try psm 7 or 13 (SINGLE_LINE and RAW_LINE). In my case 7 works best.

I'm not 100% sure, but it may be easier for Tesseract to recognize full words than single characters. I do not know whether this is just a test or what you actually need to do.

The default oem mode (LSTM) should be the best, but you may also try the old one and see which works better in this case.

You can train (fine-tune) the LSTM models, but it is not mandatory.


Bye

Lorenzo



Zdenko Podobny

Oct 16, 2018, 2:04:04 AM
to tesser...@googlegroups.com
  1. If you have a quality problem, it is good to experiment with the tesseract executable instead of the API ;-)
  2. It is known that passing bare text with no margin (in your case just one letter) is not the best idea. Please try adding a small white border, e.g. 10 px.
  3. Please set the DPI for the image after SetImage.
See the attachments for improved images.
$ tesseract char_4_b.png - --psm 10 -c page_separator=""
4
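
The border advice in point 2 can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: the Bitmap struct and addWhiteBorder are hypothetical names, not Tesseract or Leptonica API, and the image is assumed to be an 8-bit grayscale buffer in which 255 is white. When working with a Leptonica Pix directly, pixAddBorder would be the natural counterpart, and TessBaseAPI::SetSourceResolution covers point 3.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: a row-major 8-bit grayscale bitmap,
// where 0 is black and 255 is white.
struct Bitmap {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // width * height values
};

// Returns a copy of `src` surrounded by `border` white pixels on every side.
Bitmap addWhiteBorder(const Bitmap& src, int border) {
    assert(border >= 0 && src.width >= 0 && src.height >= 0);
    assert(src.pixels.size() ==
           static_cast<std::size_t>(src.width) * static_cast<std::size_t>(src.height));

    Bitmap out;
    out.width = src.width + 2 * border;
    out.height = src.height + 2 * border;
    // Start from an all-white canvas, then copy the glyph into the middle.
    out.pixels.assign(
        static_cast<std::size_t>(out.width) * static_cast<std::size_t>(out.height), 255);
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            const std::size_t from = static_cast<std::size_t>(y) * src.width + x;
            const std::size_t to =
                static_cast<std::size_t>(y + border) * out.width + (x + border);
            out.pixels[to] = src.pixels[from];
        }
    }
    return out;
}
```

With border set to 10, a 2x1 glyph becomes a 22x21 image whose outer 10 pixels on each side are white.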

For single-character recognition the legacy engine is better, and it can process your images without modification (but the rules above are generally good to follow!):
$ tesseract char_0.png - --psm 10 --oem 0 --dpi 800 -c page_separator=""
0

$ tesseract char_4.png - --psm 10 --oem 0 --dpi 800 -c page_separator=""
4

$ tesseract char_h.png - --psm 10 --oem 0 --dpi 800 -c page_separator=""
h
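
To compare several settings systematically (for example psm 7, 10 and 13 with both oem 0, the legacy engine, and oem 1, the LSTM engine), the command lines can be generated rather than typed by hand. The sketch below only builds the command strings in the form used above; it does not run them. The function names buildCommand and buildComparison are hypothetical, and the image name is assumed to contain no spaces or shell metacharacters, since no quoting is done.

```cpp
#include <string>
#include <vector>

// Builds one command line in the form used in this thread.
// Assumption: `image` needs no shell quoting.
std::string buildCommand(const std::string& image, int psm, int oem, int dpi) {
    return "tesseract " + image + " -" +
           " --psm " + std::to_string(psm) +
           " --oem " + std::to_string(oem) +
           " --dpi " + std::to_string(dpi) +
           " -c page_separator=\"\"";
}

// One command per (oem, psm) pair, so the outputs can be compared side by side.
std::vector<std::string> buildComparison(const std::string& image,
                                         const std::vector<int>& psms,
                                         const std::vector<int>& oems,
                                         int dpi) {
    std::vector<std::string> commands;
    for (int oem : oems) {
        for (int psm : psms) {
            commands.push_back(buildCommand(image, psm, oem, dpi));
        }
    }
    return commands;
}
```

For instance, buildComparison("char_h.png", {7, 10, 13}, {0, 1}, 800) yields six command lines, the first three for the legacy engine and the last three for LSTM.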

Zdenko


char_h_b.png
char_0_b.png
char_4_b.png

Yuliana Zigangirova

Oct 16, 2018, 7:00:43 AM
to tesseract-ocr
Thank you very much, I'll try all the suggested changes. I have already tried borders
and they seem to work!
Yuliana