Simple OCR fails - why and what can be done

Levent Serbesatik

May 6, 2014, 4:46:54 AM

to tesser...@googlegroups.com

Hello,

I just tried tesseract-ocr for the first time but could not get it to work on a simple case. I have a 400x45-pixel .bmp image in which the characters are about 28 pixels high. The characters will not always be digits, but they are always machine-printed.

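#include <stdexcept>
#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>

// Load the image as 8-bit grayscale (flag 0).
cv::Mat gray = cv::imread("sample.bmp", 0);

tesseract::TessBaseAPI tess;
// Init returns 0 on success and -1 on failure.
int tesscreated = tess.Init("C:/Program Files/Tesseract-OCR/tessdata", "eng", tesseract::OEM_DEFAULT);
if (tesscreated != 0) {
    throw std::runtime_error("Could not initialize tesseract.");
}

// 1 byte per pixel; gray.step is the row stride in bytes.
tess.SetImage((uchar*)gray.data, gray.cols, gray.rows, 1, static_cast<int>(gray.step));

// The caller owns the returned buffer and must release it with delete[].
char* text = tess.GetUTF8Text();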

result: "&ﬂ"

using

tess.SetPageSegMode(tesseract::PSM_SINGLE_LINE);

result: "4Lu2A0—UJP"

using

tess.SetVariable("tessedit_char_whitelist", "0123456789");

result: "43"

using

tess.SetPageSegMode(tesseract::PSM_SINGLE_LINE);
tess.SetVariable("tessedit_char_whitelist", "0123456789");

result: "4 32523133"

None of these results is close to the ground truth ("8712400764278").

Pre-processing such as binarization helps, but the results are still far from perfect for this simple case, and I am planning to introduce noise in the next stage.
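To illustrate the kind of binarization step mentioned above, here is a minimal sketch of global thresholding with Otsu's method on a raw 8-bit grayscale buffer. This is an assumption about one common approach, not necessarily the exact pre-processing I applied; the function names `otsuThreshold` and `binarize` are made up for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pick the gray level that maximizes the
// between-class variance of the two pixel groups (Otsu's method).
// Pixels with value <= threshold form the "low" class.
int otsuThreshold(const std::vector<std::uint8_t>& pixels) {
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t p : pixels) ++hist[p];

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int v = 0; v < 256; ++v) sumAll += v * static_cast<double>(hist[v]);

    double sumLow = 0.0, weightLow = 0.0, bestVar = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightLow += static_cast<double>(hist[t]);
        sumLow += t * static_cast<double>(hist[t]);
        if (weightLow == 0.0) continue;          // no pixels in the low class yet
        const double weightHigh = total - weightLow;
        if (weightHigh == 0.0) break;            // all pixels are in the low class

        const double meanLow = sumLow / weightLow;
        const double meanHigh = (sumAll - sumLow) / weightHigh;
        const double diff = meanLow - meanHigh;
        const double betweenVar = weightLow * weightHigh * diff * diff;
        if (betweenVar > bestVar) {
            bestVar = betweenVar;
            best = t;
        }
    }
    return best;
}

// Hypothetical helper: map every pixel to 0 or 255 around the threshold.
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& pixels, int threshold) {
    std::vector<std::uint8_t> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out[i] = pixels[i] > threshold ? 255 : 0;
    }
    return out;
}
```

The binarized buffer is still one byte per pixel, so it could be handed to SetImage in the same way as the grayscale data above. With OpenCV, cv::threshold with the cv::THRESH_OTSU flag should do the same job without hand-written code.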

Does anybody know why I am getting such poor results, and how I can improve them?

Thank you,

Levent

sample.bmp
