What part of the code performs Box Identification?

Danny

Sep 3, 2024, 9:02:32 AM

to tesseract-ocr

I'm still trying to improve recognition of television subtitles, especially Traditional Chinese (see here).

With either the stock chi_tra or our own trained model, it fails on certain text. To investigate, I used the API to render box outlines on the input image. Something like:

mpTessApi = new tesseract::TessBaseAPI();
mpTessApi->Init(nullptr, mLanguage.c_str()); // chi_tra, eng, etc.
mpTessApi->SetImage(image);

// Get the character and box rect for each detected character.
// GetBoxText() takes a page number and returns a buffer the caller frees with delete[].
char *bt = mpTessApi->GetBoxText(0);

Then plot the boxes over the original input image.
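
One thing worth checking in the plotting step is the coordinate origin: box-file coordinates are measured from the bottom-left corner of the image, while most drawing APIs use a top-left origin. Below is a minimal sketch of the parsing step, assuming each line of the box text has the form "<symbol> <left> <bottom> <right> <top> <page>". The names CharBox and ParseBoxText are hypothetical helpers for illustration, not part of the Tesseract API.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One box-text entry, converted to top-left-origin pixel coordinates.
// (Hypothetical helper type, not part of Tesseract.)
struct CharBox {
    std::string symbol;  // UTF-8 text of the character
    int x, y;            // top-left corner, y measured from the top of the image
    int width, height;
};

// Parse box text with one "<symbol> <left> <bottom> <right> <top> <page>"
// entry per line (assumed format) and flip y from a bottom-left origin
// to a top-left origin using the image height.
std::vector<CharBox> ParseBoxText(const std::string &boxText, int imageHeight) {
    std::vector<CharBox> boxes;
    std::istringstream lines(boxText);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string symbol;
        int left, bottom, right, top, page;
        if (!(fields >> symbol >> left >> bottom >> right >> top >> page)) {
            continue;  // skip blank or malformed lines
        }
        boxes.push_back({symbol, left, imageHeight - top, right - left, top - bottom});
    }
    return boxes;
}
```

Each resulting CharBox can then be passed to whatever rectangle-drawing call the plotting code uses. If the boxes are already being flipped this way, the origin is not the cause of the misplacement.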

Setting the language to 'eng' does not recognize the text properly, but the boxes are pretty close:

But selecting chi_tra or our own model puts the boxes all over the place. Results vary a bit with the Page Segmentation Mode (PSM), but none are even close.

With stock chi_tra:

PSM 6

PSM 7

PSM 13

I'm planning to fix this, but the in-code documentation is almost non-existent.

Can anyone tell me where in the code box identification is done? That would help a lot with debugging. We're running in non-legacy mode.

Thanks.

P.S. Although the boxes are all over the place, the output text is mostly accurate:

你可以來接我嗎？

How is that possible? Does that mean GetBoxText() is unreliable?