Simple OCR fails - why and what can be done


Levent Serbesatik

May 6, 2014, 4:46:54 AM
to tesser...@googlegroups.com
Hello,

I just tried tesseract-ocr for the first time but could not get it to work even for a simple case. I have a 400x45-pixel .bmp image in which the characters are about 28 pixels high. The characters are machine-printed; in this sample they are digits, but in general they don't have to be.

cv::Mat gray = cv::imread("sample.bmp", 0);  // 0 = load as 8-bit grayscale

tesseract::TessBaseAPI tess;
int tesscreated = tess.Init("C:/Program Files/Tesseract-OCR/tessdata", "eng", tesseract::OEM_DEFAULT);
if (tesscreated == -1) {
    // Init returns -1 on failure; needs <stdexcept>
    throw std::runtime_error("Could not initialize tesseract.");
}

// 1 byte per pixel, gray.cols bytes per line (assumes the Mat has no row padding)
tess.SetImage((uchar*)gray.data, gray.cols, gray.rows, 1, gray.cols);

char* text = tess.GetUTF8Text();
result: "&fl"

using
tess.SetPageSegMode(tesseract::PSM_SINGLE_LINE);
result: "4Lu2A0—UJP"

using
tess.SetVariable("tessedit_char_whitelist", "0123456789");
result: "43"

using
tess.SetPageSegMode(tesseract::PSM_SINGLE_LINE);
tess.SetVariable("tessedit_char_whitelist", "0123456789");
result: "4 32523133"

The results are not even close to the ground truth ("8712400764278").
Pre-processing such as binarization helps, but the output is still far from perfect, even for this simple case, and I am planning to introduce noise in the next stage.
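
To make the binarization step concrete, here is a minimal sketch of what such pre-processing could look like. It assumes Otsu's global threshold, which is only one possible choice; the post does not say which method was actually used. The names otsuThreshold and binarize are hypothetical helpers written for illustration, not part of Tesseract or OpenCV, and the image is assumed to be an 8-bit grayscale buffer with one byte per pixel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: Otsu's method. Picks the threshold that maximizes the
// between-class variance of the grayscale histogram. Pixels <= threshold are
// treated as the "dark" class.
int otsuThreshold(const std::vector<std::uint8_t>& pixels) {
    if (pixels.empty()) return 0;

    double hist[256] = {0.0};
    for (std::uint8_t p : pixels) hist[p] += 1.0;

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * hist[i];

    double weightDark = 0.0, sumDark = 0.0, bestVariance = -1.0;
    int bestThreshold = 0;
    for (int t = 0; t < 256; ++t) {
        weightDark += hist[t];
        if (weightDark == 0.0) continue;          // no dark pixels yet
        const double weightLight = total - weightDark;
        if (weightLight == 0.0) break;            // all pixels are dark
        sumDark += t * hist[t];
        const double meanDark = sumDark / weightDark;
        const double meanLight = (sumAll - sumDark) / weightLight;
        const double diff = meanDark - meanLight;
        // Between-class variance up to a constant factor of 1/total^2.
        const double variance = weightDark * weightLight * diff * diff;
        if (variance > bestVariance) {
            bestVariance = variance;
            bestThreshold = t;
        }
    }
    return bestThreshold;
}

// Hypothetical helper: maps every pixel to 0 (dark) or 255 (light).
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& pixels) {
    const int t = otsuThreshold(pixels);
    std::vector<std::uint8_t> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out[i] = pixels[i] > t ? 255 : 0;
    }
    return out;
}
```

The resulting buffer has the same layout as the input (one byte per pixel, width bytes per line), so it could be handed to tess.SetImage with the same arguments as the grayscale image above. A global threshold like this will probably not be enough once noise is added, so it should be read as a baseline only.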

Does anybody know why I am getting such poor results, and can you suggest how to improve them?

Thank you,
Levent



sample.bmp