Are character bboxes trustworthy?

Robin Watts

Jul 24, 2020, 1:01:37 PM
to tesseract-ocr
Hi all,

I'm using tesseract as a library, and broadly it seems to be working well. I am having some very strange problems with the character boxes I get back from the iterator though.

The attached image is a png made from the 8bpp greyscale image that I feed it, overlaid with boxes to show all the 'b' characters I get back.

Only one of the 4 'b' characters I get appears to have the box in the right place.

The code I'm using to extract the data is:

tesseract::ResultIterator *res_it = api->GetIterator();
while (!res_it->Empty(tesseract::RIL_BLOCK))
{
    // Skip empty words.
    if (res_it->Empty(tesseract::RIL_WORD))
    {
        res_it->Next(tesseract::RIL_WORD);
        continue;
    }

    // Each bbox is an int[4] filled in as left, top, right, bottom.
    res_it->BoundingBox(tesseract::RIL_TEXTLINE,
                        line_bbox, line_bbox+1,
                        line_bbox+2, line_bbox+3);
    res_it->BoundingBox(tesseract::RIL_WORD,
                        word_bbox, word_bbox+1,
                        word_bbox+2, word_bbox+3);
    font_name = res_it->WordFontAttributes(&bold,
                                           &italic,
                                           &underlined,
                                           &monospace,
                                           &serif,
                                           &smallcaps,
                                           &pointsize,
                                           &font_id);
    // Walk the symbols of the current word.
    do
    {
        // GetUTF8Text returns a string the caller must free with delete[].
        const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
        if (graph && graph[0] != 0)
        {
            int unicode;
            res_it->BoundingBox(tesseract::RIL_SYMBOL,
                                char_bbox, char_bbox+1,
                                char_bbox+2, char_bbox+3);
            fz_chartorune(&unicode, graph);
            callback(ctx, arg, unicode, font_name, line_bbox, word_bbox, char_bbox, pointsize);
        }
        delete[] graph;
        res_it->Next(tesseract::RIL_SYMBOL);
    }
    while (!res_it->Empty(tesseract::RIL_BLOCK) &&
           !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
}

The characters are coming back correctly, and *most* are in the correct position. Just a few are shifted.
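
For what it's worth, one cheap way to put a number on "a few are shifted" is to count symbol boxes that poke outside their word box. A minimal sketch, assuming each box is stored as left, top, right, bottom (the order BoundingBox fills them in above); Box, contains and count_outliers are hypothetical names for illustration, not part of the Tesseract API. It will not catch a box that is shifted but still lies inside its word.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: {left, top, right, bottom} in image pixels.
using Box = std::array<int, 4>;

// True if `inner` lies within `outer`, allowing `slack` pixels of tolerance.
bool contains(const Box &outer, const Box &inner, int slack = 0)
{
    assert(slack >= 0);
    return inner[0] >= outer[0] - slack &&
           inner[1] >= outer[1] - slack &&
           inner[2] <= outer[2] + slack &&
           inner[3] <= outer[3] + slack;
}

// Counts the symbol boxes that escape their word box.
std::size_t count_outliers(const Box &word, const std::vector<Box> &symbols, int slack = 0)
{
    std::size_t n = 0;
    for (const Box &s : symbols)
        if (!contains(word, s, slack))
            ++n;
    return n;
}
```

In the loop above this would mean collecting char_bbox for each symbol of a word and comparing against word_bbox before moving on to the next word.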

Is this to be expected? Am I doing something stupid?

(Even being told "It's reliably correct for me" would be helpful at this point.)

Thanks,

Robin

b.png

Zdenko Podobny

Jul 24, 2020, 2:01:30 PM
to tesser...@googlegroups.com
Do you use lstm or legacy engine?

If lstm: search the issue tracker, PRs, and possibly the forum for the bounding box problem (and for Noah Metzger's patches).

There are rumours that if you need really good bounding boxes you have to use the latest 3.05 release, because changes in 4.x and later also affected the legacy engine's bounding box accuracy (compared to version 3). But I have never seen a comparison test, especially on a high volume of images.

Zdenko



robinw...@googlemail.com

Jul 24, 2020, 6:44:46 PM
to tesseract-ocr
> Do you use lstm or legacy engine?  

lstm.

I can find a couple of Noah Metzger patches:

and 

etc, but they've all been merged into master. As far as I can tell from his github, all his patches have been pulled in.

I'm using master.

Crap bounding boxes really knock the effectiveness of Tesseract as a library :(

Thanks.

Zdenko Podobny

Jul 25, 2020, 11:25:59 AM
to tesser...@googlegroups.com
As I mentioned, if you need good bounding boxes you have to use the legacy engine.
There are several issues and comments explaining why it is a problem to get accurate bounding boxes, e.g.


Zdenko

