What part of the code performs Box Identification?


Danny

Sep 3, 2024, 9:02:32 AM
to tesseract-ocr
I'm still trying to improve recognition of television subtitles, especially Traditional Chinese (see here).

With either the stock chi_tra or our own trained model, recognition fails on certain text. To investigate, I used the API to render box outlines on the input image. Something like:

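mpTessApi = new tesseract::TessBaseAPI();
// Init() returns 0 on success; a null datapath uses the default tessdata location
mpTessApi->Init(nullptr, mLanguage.c_str()); // chi_tra, eng, etc.
mpTessApi->SetImage(image);

// Get character and box rect for each detected character (page number 0)
const char *bt = mpTessApi->GetBoxText(0);
// ... parse bt, then release it; the caller owns the returned buffer
delete[] bt;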

Then plot the boxes over the original input image.
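
The plotting step depends on how the box text is parsed. Here is a minimal sketch of that step, assuming the buffer follows the box-file layout (one line per symbol: glyph, left, bottom, right, top, page) with y measured from the bottom edge of the image, so it has to be flipped for a top-left-origin canvas. `GlyphBox` and `ParseBoxText` are hypothetical names for illustration, not part of the Tesseract API:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type (not Tesseract API): one glyph and its
// rectangle in top-left-origin pixel coordinates.
struct GlyphBox {
    std::string glyph;  // UTF-8 text of the symbol
    int x, y, width, height;
};

// Parse box text of the assumed form "glyph left bottom right top page",
// one symbol per line, and flip y using the image height.
std::vector<GlyphBox> ParseBoxText(const std::string& box_text, int image_height) {
    std::vector<GlyphBox> boxes;
    std::istringstream lines(box_text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string glyph;
        int left, bottom, right, top, page;
        // Skip lines that do not match the expected six fields.
        if (!(fields >> glyph >> left >> bottom >> right >> top >> page)) continue;
        // Assumption: bottom/top are distances from the bottom edge.
        boxes.push_back({glyph, left, image_height - top, right - left, top - bottom});
    }
    return boxes;
}
```

If the plotting code instead treats the y values as measured from the top edge, the boxes will land mirrored vertically, so that is worth ruling out before digging into the segmentation code.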

With the language set to 'eng', the text is not recognized properly, but the boxes are pretty close:
sub_2.png

But with chi_tra or our own model, the boxes are all over the place. Changing the Page Segmentation Mode (PSM) changes the results a bit, but none are even close.

With stock chi_tra:
PSM 6
sub_2.png

PSM 7
sub_2.png

PSM 13
sub_2.png
I'm planning to fix this but the in-code documentation is almost non-existent.

Can anyone tell me where in the code the box identification gets done? That would help a lot with debugging. We're running in non-legacy (LSTM) mode.
Thanks.

PS. While the boxes are wildly all over the place, the output text is mostly accurate:

你可以來接我嗎?

How is that possible? Does that mean GetBoxText() is unreliable?
 