Hey guys,
I'm currently working on a university project whose goal is to process invoice documents.
The problem I've run into is that the hOCR output I produce from C++ code isn't the same as what I get from tesseract.exe in the Windows console. I mean the accuracy of word recognition, of course. For example: NIP vs NIp (console vs. code). See the attachments, where hocrOut.html is the console output and fvathOCR.html is the code output.
I suspect it has something to do with Leptonica, as I read that it preprocesses the image automatically when tesseract.exe is called. The question is: how do I implement that in code?
Here is the console command I'm using (pol for Polish):
tesseract fvat.jpg hocrOut -l pol
The info "Tesseract v. 3.02 with Leptonica" is displayed, and I get the output for fvat.jpg, which is a scanned invoice image.
Here is the C++ code I use to achieve the desired result:
cv::Mat img = cv::imread(filename, CV_LOAD_IMAGE_GRAYSCALE); // load the image as 8-bit grayscale
tesseract::TessBaseAPI tess;
tess.Init(NULL, "pol", tesseract::OEM_DEFAULT); // matches -l pol from the command line; I guess OEM_DEFAULT is what the command line uses?
tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK); // PSM_SINGLE_BLOCK corresponds to -psm 6; the command above passes no -psm option
tess.SetImage((uchar*)img.data, img.cols, img.rows, 1, img.cols); // 1 byte per pixel, img.cols bytes per row
char *hocr = tess.GetHOCRText(0); // the caller owns the returned buffer
std::ofstream fileOp;
fileOp.open(resultFile);
fileOp << hocr << std::endl;
fileOp.close();
delete [] hocr; // release the buffer returned by GetHOCRText
I've also tried this version; the results were the same as with the code above:
char *outText;
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "pol")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
// Open input image with leptonica library
Pix *image = pixRead(filename.c_str());
api->SetImage(image);
// Get OCR result
outText = api->GetHOCRText(0);
//printf("OCR output:\n%s", outText);
std::ofstream fileOp;
fileOp.open(resultFile);
fileOp << outText << std::endl;
fileOp.close();
// Destroy used objects and release memory
api->End();
delete api;
delete [] outText;
pixDestroy(&image);
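In case it helps anyone reproduce the comparison, here is a rough sketch (plain C++ standard library, no Tesseract calls) of how the words in the two hOCR files could be compared. The function names extractHocrWords and differingWords are made up for this post. It assumes that every word sits in a span with class ocrx_word and that both files contain the words in the same order, so it is only a quick check, not a real hOCR parser:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Collects the text of every element whose opening tag mentions "ocrx_word".
// Assumption: each word looks like <span class='ocrx_word' ...>text</span>,
// possibly with inline tags such as <strong> around the text.
// HTML entities (e.g. &amp;) are left undecoded.
std::vector<std::string> extractHocrWords(const std::string& hocr) {
    std::vector<std::string> words;
    const std::string marker = "ocrx_word";
    const std::string closing = "</span>";
    std::size_t pos = 0;
    while ((pos = hocr.find(marker, pos)) != std::string::npos) {
        std::size_t open = hocr.find('>', pos);         // end of the opening tag
        if (open == std::string::npos) break;
        std::size_t close = hocr.find(closing, open);   // start of the closing tag
        if (close == std::string::npos) break;
        std::string word;
        bool inTag = false;
        for (std::size_t i = open + 1; i < close; ++i) {
            char c = hocr[i];
            if (c == '<') inTag = true;                 // skip nested inline tags
            else if (c == '>') inTag = false;
            else if (!inTag) word += c;
        }
        words.push_back(word);
        pos = close;
    }
    return words;
}

// Pairs up words by position and returns the pairs that differ.
// Assumption: both outputs were split into the same sequence of words;
// extra words at the end of the longer list are ignored.
std::vector<std::pair<std::string, std::string>> differingWords(
        const std::vector<std::string>& a,
        const std::vector<std::string>& b) {
    std::vector<std::pair<std::string, std::string>> diffs;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) diffs.emplace_back(a[i], b[i]);
    }
    return diffs;
}
```

The idea would be to read hocrOut.html and fvathOCR.html into two strings, run extractHocrWords on each, and pass the two lists to differingWords to see pairs like NIP / NIp.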
I'll be very grateful for your help, guys! Cheers,
Przemek