Inconsistent behavior on full/cropped image

Paul P

Oct 17, 2024, 12:53:50 AM

to tesseract-ocr

Here is my code:

api_ = new tesseract::TessBaseAPI();

api_->Init(tessdata_path.c_str(), "eng", tesseract::OEM_LSTM_ONLY);

Pix* image = pixRead(m_img_filename.c_str());
api_->SetImage(image);

api_->Recognize(nullptr);

auto i = 0;
tesseract::ResultIterator* ri = api_->GetIterator();
tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
if (ri != nullptr) {
    do {
        i++;
        // GetUTF8Text() returns a heap-allocated string owned by the caller.
        const char* word = ri->GetUTF8Text(level);
        float conf = ri->Confidence(level);
        int x1, y1, x2, y2;
        ri->BoundingBox(level, &x1, &y1, &x2, &y2);

        ...

        /* do something with the word and coordinates */
        ...
        delete[] word;
    } while (ri->Next(level));
    delete ri;  // the iterator must also be deleted by the caller
}
pixDestroy(&image);
api_->End();

I am working with floor plans that contain text, lines, and other objects. Here is what happens: if I cut a smaller piece from a large image, save it to a file (e.g. image_crop.jpeg), and run the above code, all of the text blocks are detected and OCRed. Here is an example:

Note the three "HDU2" blocks (the border color represents the confidence).

If I run the exact same code against the whole image, which is 5400x3600 pixels, here is what I get (this is not the whole image, just the problematic part):

Only one of the three "HDU2" pieces got detected.

Most of the text in the large image is recognized correctly (it has the same size and font), but there are problematic parts like this.

DPI is the same (150) on both the whole drawing and the crop.

I've tried all the engine modes, all page segmentation modes and a few random variables from the "tesseract --print-parameters" list.

There must be a trick to make it work. I mean it obviously can detect this text and yet for some reason it won't.

Any suggestion would be much appreciated.

Paul P

Oct 17, 2024, 6:29:51 PM

to tesseract-ocr

Here is a full image with everything deleted from it except the problematic area. Works perfectly:

There is also a block of dense text in the image, a few paragraphs of it. If some of it is present, then it immediately gets all of tesseract's attention and all the smaller blocks are left unrecognized:

Page seg mode is set to PSM_SPARSE_TEXT, which, in theory, should prevent this from happening but it doesn't.

There must be some parameter that would force tesseract to return ALL text blocks, not just the ones it considers more significant (which the large paragraphs are).

Tom Morris

Oct 18, 2024, 12:04:37 PM

to tesseract-ocr

On Thursday, October 17, 2024 at 6:29:51 PM UTC-4 paul...@gmail.com wrote:

There must be some parameter that would force tesseract to return ALL text blocks, not just the ones it considers more significant (which the large paragraphs are).

Your investigations seem to confirm what has been widely reported previously - that Tesseract's page segmentation performs poorly on use cases which diverge greatly from what it was designed for, namely, large blocks of book style text.

I would suggest that you do your own page segmentation first and then feed the resulting text segments to Tesseract for recognition. The search phrase "scene text detection" might give you some starting points to investigate.
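As a crude illustration of segmenting the page yourself, the sketch below splits a large image into overlapping tiles, mirroring the observation earlier in the thread that a saved crop is recognized correctly. It is only a stand-in for a real text detector, and `Tile` and `MakeTiles` are hypothetical names that are not part of Tesseract. Whether recognizing tile by tile behaves like recognizing a saved crop is an assumption that has not been verified here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Axis-aligned region in pixels: top-left corner plus size (hypothetical type).
struct Tile {
    int x, y, width, height;
};

// Cover an img_w x img_h image with tiles of at most `tile` pixels per side.
// Neighbouring tiles share `overlap` pixels so that a word cut by one tile
// border can still appear whole in the adjacent tile.
std::vector<Tile> MakeTiles(int img_w, int img_h, int tile, int overlap) {
    assert(img_w > 0 && img_h > 0 && overlap >= 0 && tile > overlap);
    const int step = tile - overlap;
    std::vector<Tile> tiles;
    for (int y = 0; y < img_h; y += step) {
        const int h = std::min(tile, img_h - y);
        for (int x = 0; x < img_w; x += step) {
            const int w = std::min(tile, img_w - x);
            tiles.push_back(Tile{x, y, w, h});
            if (x + w >= img_w) break;  // this tile already reaches the right edge
        }
        if (y + h >= img_h) break;      // this row already reaches the bottom edge
    }
    return tiles;
}
```

Each tile could then be passed to api_->SetRectangle(t.x, t.y, t.width, t.height) before calling Recognize(). Words that fall in an overlap region may be reported by two tiles and would need de-duplication.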

Tom

Paul P

Oct 24, 2024, 5:49:54 AM

to tesseract-ocr

Thanks for the reply. I am now doing the text detection with OpenCV/EAST and then passing the bounding boxes to Tesseract like this:

api_->SetPageSegMode(tesseract::PSM_SPARSE_TEXT);

api_->SetRectangle(rect->x - 1, rect->y - 1, rect->width + 2, rect->height + 2);

api_->Recognize(nullptr);

Adding an extra pixel on each side is a trick that, for some reason, increases the recognition accuracy a lot, even though the original bounding box detected by EAST already has some space around the characters. Adding more space, however, decreases the accuracy. The best margin will obviously change from image to image, so I have to make multiple attempts with different settings, which makes the overall process very slow.
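The padding step can be written as a small helper that also keeps the padded box inside the image, which matters when a detected box touches an image edge. This is a minimal sketch: `Box` and `PadAndClamp` are hypothetical names, and the image width and height are assumed to be known to the caller.

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned box in pixels: top-left corner plus size (hypothetical type).
struct Box {
    int x, y, width, height;
};

// Grow `box` by `margin` pixels on every side, then clip the result to an
// image of img_w x img_h pixels so it never extends past the image borders.
Box PadAndClamp(const Box& box, int margin, int img_w, int img_h) {
    assert(img_w > 0 && img_h > 0 && margin >= 0);
    const int left   = std::max(0, box.x - margin);
    const int top    = std::max(0, box.y - margin);
    const int right  = std::min(img_w, box.x + box.width + margin);
    const int bottom = std::min(img_h, box.y + box.height + margin);
    // A box lying entirely outside the image collapses to zero size.
    return Box{left, top, std::max(0, right - left), std::max(0, bottom - top)};
}
```

For a box away from the image edges and a margin of 1, this yields the same values as the SetRectangle call above (x - 1, y - 1, width + 2, height + 2). Trying several margins then becomes a loop over `margin` that calls SetRectangle with the returned box.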

Setting PSM to "RAW_LINE" or "SINGLE_BLOCK" doesn't really make a difference.

Am I missing something?

Tom Morris

Oct 27, 2024, 1:17:21 PM

to tesseract-ocr

Have you tried either of the following two PSM modes? They would seem to be closest to what your original image shows (depending on what EAST is generating for output).

7 Treat the image as a single text line.
8 Treat the image as a single word.
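The numbers above are the values of the command-line --psm option, while the C++ API used earlier in this thread takes enumerators of tesseract::PageSegMode. The lookup below lists the enumerator name for each number as I understand the enum to be declared; `kPsmNames` and `PsmName` are hypothetical helpers written for illustration, not part of Tesseract.

```cpp
#include <string>

// Names of the tesseract::PageSegMode enumerators, indexed by --psm number.
const char* const kPsmNames[] = {
    "PSM_OSD_ONLY",                // 0
    "PSM_AUTO_OSD",                // 1
    "PSM_AUTO_ONLY",               // 2
    "PSM_AUTO",                    // 3
    "PSM_SINGLE_COLUMN",           // 4
    "PSM_SINGLE_BLOCK_VERT_TEXT",  // 5
    "PSM_SINGLE_BLOCK",            // 6
    "PSM_SINGLE_LINE",             // 7
    "PSM_SINGLE_WORD",             // 8
    "PSM_CIRCLE_WORD",             // 9
    "PSM_SINGLE_CHAR",             // 10
    "PSM_SPARSE_TEXT",             // 11
    "PSM_SPARSE_TEXT_OSD",         // 12
    "PSM_RAW_LINE",                // 13
};

// Return the enumerator name for a --psm number, or an empty string if the
// number is outside the range covered by the table.
std::string PsmName(int psm) {
    const int count = static_cast<int>(sizeof(kPsmNames) / sizeof(kPsmNames[0]));
    return (psm >= 0 && psm < count) ? kPsmNames[psm] : "";
}
```

In the API, modes 7 and 8 would therefore be selected with api_->SetPageSegMode(tesseract::PSM_SINGLE_LINE) or api_->SetPageSegMode(tesseract::PSM_SINGLE_WORD) before calling Recognize().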
