Tesseract API - hOCR output doesn't match what I get using console


Przemysław Woźniak

May 24, 2014, 7:11:32 AM
to tesser...@googlegroups.com
Hey guys,


I'm currently working on a university project whose goal is to process invoice documents.

The problem I've run into is that the hOCR output I produce from C++ code isn't the same as what I get from tesseract.exe in the Windows console. I mean the word recognition accuracy, of course. For example: NIP vs NIp (console vs. code). See the attachments: hocrOut.html is the console output, and fvathOCR.html is the code output.

I suspect it has something to do with Leptonica, as I read that tesseract.exe uses it to preprocess the image automatically. The question is: how do I implement that in code?

Here is the console command I'm using (pol for Polish):

tesseract fvat.jpg hocrOut -l pol

The message "Tesseract v. 3.02 with Leptonica" is displayed, and I get the output for fvat.jpg, which is a scanned image of an invoice.

Here is C++ code I use to achieve the desired result:

cv::Mat img = cv::imread(filename, CV_LOAD_IMAGE_GRAYSCALE);  // load the image as 8-bit grayscale
tesseract::TessBaseAPI tess;
tess.Init(NULL, "pol", tesseract::OEM_DEFAULT);   // matches -l pol from the command line; I guess OEM_DEFAULT is what the command line uses?
tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);   // equivalent to -psm 6 on the command line (the command above passes no -psm)
tess.SetImage((uchar*)img.data, img.cols, img.rows, 1, img.cols); // 1 byte per pixel, img.cols bytes per row
char *hocr = tess.GetHOCRText(0);   // the caller owns the returned buffer
std::ofstream fileOp;
fileOp.open(resultFile);
fileOp << hocr << std::endl;
fileOp.close();
delete [] hocr;   // release the hOCR text

I've also tried this one; the results were the same as for the code above:

char *outText;

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

if (api->Init(NULL, "pol")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}

// Open input image with leptonica library
Pix *image = pixRead(filename.c_str());
api->SetImage(image);
// Get OCR result
outText = api->GetHOCRText(0);
//printf("OCR output:\n%s", outText);
std::ofstream fileOp;
fileOp.open(resultFile);
fileOp << outText << std::endl;
fileOp.close();

// Destroy used object and release memory
api->End();
delete api;
delete [] outText;
pixDestroy(&image);


I'll be very grateful for your help, guys! Cheers,
Przemek
fvathOCR.html
hocrOut.html

Nick White

May 24, 2014, 10:14:50 AM
to tesser...@googlegroups.com
Hi Przemysław,

On Sat, May 24, 2014 at 04:11:32AM -0700, Przemysław Woźniak wrote:
> The problem which I encountered is that hOCR output that I produce using C++
> code isn't the same as what I get using tesseract.exe from Windows console. I'm
> speaking of course about the accuracy of words recognition.

The easy thing to do would be to read how the tesseract binary does it;
see api/tesseractmain.cpp. It doesn't do anything secret or odd, so
just go through it and see where your code differs.

Nick