Are character bboxes trustworthy?

Robin Watts

Jul 24, 2020, 1:01:37 PM

to tesseract-ocr

Hi all,

I'm using tesseract as a library, and broadly it seems to be working well. I am having some very strange problems with the character boxes I get back from the iterator though.

The attached image is a png made from the 8bpp greyscale image that I feed it, overlaid with boxes to show all the 'b' characters I get back.

Only one of the 4 'b' characters I get appears to have the box in the right place.

The code I'm using to extract the data is:

tesseract::ResultIterator *res_it = api->GetIterator();

while (!res_it->Empty(tesseract::RIL_BLOCK))
{
    /* Skip positions that hold no word. */
    if (res_it->Empty(tesseract::RIL_WORD))
    {
        res_it->Next(tesseract::RIL_WORD);
        continue;
    }

    /* Each bbox array is filled as {left, top, right, bottom}. */
    res_it->BoundingBox(tesseract::RIL_TEXTLINE,
                        line_bbox, line_bbox+1,
                        line_bbox+2, line_bbox+3);
    res_it->BoundingBox(tesseract::RIL_WORD,
                        word_bbox, word_bbox+1,
                        word_bbox+2, word_bbox+3);
    font_name = res_it->WordFontAttributes(&bold,
                                           &italic,
                                           &underlined,
                                           &monospace,
                                           &serif,
                                           &smallcaps,
                                           &pointsize,
                                           &font_id);

    /* Walk the symbols of the current word. */
    do
    {
        const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
        if (graph && graph[0] != 0)
        {
            int unicode;
            res_it->BoundingBox(tesseract::RIL_SYMBOL,
                                char_bbox, char_bbox+1,
                                char_bbox+2, char_bbox+3);
            fz_chartorune(&unicode, graph);
            callback(ctx, arg, unicode, font_name, line_bbox, word_bbox, char_bbox, pointsize);
        }
        /* GetUTF8Text returns a new[]-allocated string that the caller must free. */
        delete[] graph;
        res_it->Next(tesseract::RIL_SYMBOL);
    }
    while (!res_it->Empty(tesseract::RIL_BLOCK) &&
           !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
}

The characters are coming back correctly, and *most* are in the correct position. Just a few are shifted.

Is this to be expected? Am I doing something stupid?

(Even being told "It's reliably correct for me" would be helpful at this point.)

Thanks,

Robin

b.png
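One cheap way to flag grossly shifted boxes like the ones described above is to check that each symbol box stays within its word box. The following is only a sketch: `Box`, `box_inside`, `suspicious_symbols` and the `slack` tolerance are hypothetical names introduced here, not part of Tesseract, and the {left, top, right, bottom} layout is assumed from the order in which the code above passes its out-parameters to BoundingBox. It will not catch a box that is shifted but still lies inside the word.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Assumed layout: {left, top, right, bottom} in image pixels.
using Box = std::array<int, 4>;

// True when `inner` lies inside `outer`, allowing `slack` pixels of overhang
// on every side.
inline bool box_inside(const Box &inner, const Box &outer, int slack = 0)
{
    return inner[0] >= outer[0] - slack &&
           inner[1] >= outer[1] - slack &&
           inner[2] <= outer[2] + slack &&
           inner[3] <= outer[3] + slack;
}

// Returns the indices of the symbol boxes that stray outside the word box
// by more than `slack` pixels.
inline std::vector<std::size_t> suspicious_symbols(const std::vector<Box> &symbols,
                                                   const Box &word,
                                                   int slack = 2)
{
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < symbols.size(); ++i)
    {
        if (!box_inside(symbols[i], word, slack))
            bad.push_back(i);
    }
    return bad;
}
```

In the loop above, the symbol boxes of one word could be collected into a vector and passed to `suspicious_symbols` together with the word box once the word is finished; any returned index points at a box worth inspecting by eye.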

Zdenko Podobny

Jul 24, 2020, 2:01:30 PM

to tesser...@googlegroups.com

Do you use lstm or legacy engine?

If LSTM: search the issue tracker, the pull requests, and possibly the forum for bounding box problems (and for Noah Metzger's patches).

There are rumours that if you need really good bounding boxes you have to use the latest 3.05 release, because changes in 4.x (and later) also affected the legacy engine's bounding box accuracy compared to version 3. But I have never seen a comparison test, especially on a large volume of images.

Zdenko
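For anyone wanting to run the kind of comparison mentioned above, one common way to score a reported box against a reference box is intersection-over-union (IoU). This is a sketch, not something proposed in the thread: `Box`, `box_area` and `box_iou` are hypothetical names, the {left, top, right, bottom} layout is an assumption, and the reference boxes would have to come from somewhere else (for example, hand labelling).

```cpp
#include <algorithm>
#include <array>

// Assumed layout: {left, top, right, bottom} in image pixels.
using Box = std::array<int, 4>;

// Area of a box; degenerate or inverted boxes count as empty.
inline long long box_area(const Box &b)
{
    const long long w = std::max(0, b[2] - b[0]);
    const long long h = std::max(0, b[3] - b[1]);
    return w * h;
}

// Intersection-over-union of two boxes: 1.0 for identical non-empty boxes,
// 0.0 when they do not overlap.
inline double box_iou(const Box &a, const Box &b)
{
    const Box overlap = {std::max(a[0], b[0]), std::max(a[1], b[1]),
                         std::min(a[2], b[2]), std::min(a[3], b[3])};
    const long long inter = box_area(overlap);
    const long long uni = box_area(a) + box_area(b) - inter;
    return uni > 0 ? static_cast<double>(inter) / static_cast<double>(uni) : 0.0;
}
```

Averaging `box_iou` over all symbols of a test set, once per engine or version, would give a single number to compare; what threshold counts as "good enough" depends on the application.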

On Fri, Jul 24, 2020 at 19:01, 'Robin Watts' via tesseract-ocr <tesser...@googlegroups.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a1ff6999-1fb8-490d-9da2-0964c3ec3b62o%40googlegroups.com.

robinw...@googlemail.com

Jul 24, 2020, 6:44:46 PM

to tesseract-ocr

> Do you use lstm or legacy engine?

lstm.

I can find a couple of Noah Metzger patches:

https://github.com/tesseract-ocr/tesseract/commit/c350077b96077fa50fefe97fbaed04014407f0f1

and

https://github.com/tesseract-ocr/tesseract/pull/2576

etc., but they've all been merged into master. As far as I can tell from his GitHub profile, all of his patches have been pulled in.

I'm using master.

Crap bounding boxes really knock the effectiveness of Tesseract as a library :(

Thanks.

Zdenko Podobny

Jul 25, 2020, 11:25:59 AM

to tesser...@googlegroups.com

As I mentioned, if you need good bounding boxes you have to use the legacy engine.

There are several issues and comments explaining why it is a problem to get accurate bounding boxes, e.g.

https://github.com/tesseract-ocr/tesseract/issues/2825#issuecomment-579220987

Zdenko

On Sat, Jul 25, 2020 at 0:44, 'robinw...@googlemail.com' via tesseract-ocr <tesser...@googlegroups.com> wrote:

