I'm still trying to improve recognition of television subtitles, especially traditional Chinese (see here).
With either the stock chi_tra or our own trained model, it fails on certain text. To investigate, I used the API to render box outlines on the input image. Something like:
mpTessApi = new tesseract::TessBaseAPI();
mpTessApi->Init(nullptr, mLanguage.c_str()); // chi_tra, eng, etc.
mpTessApi->SetImage(image);
// Get the character and bounding box for each detected character
const char *bt = mpTessApi->GetBoxText(0); // 0 = page number
Then plot the boxes over the original input image.
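For reference, the plotting step is just parsing the box text and flipping the y-axis: GetBoxText() emits one line per glyph in the classic box-file format ("glyph left bottom right top page"), with the origin at the image's bottom-left corner, while most drawing APIs use a top-left origin. A minimal self-contained sketch of that conversion (the names ParseBoxText, CharBox, and PixelRect are mine, not Tesseract's):

```cpp
#include <sstream>
#include <string>
#include <vector>

// One parsed entry from Tesseract's box-text output.
// Box format: "glyph left bottom right top page", origin at the
// image's BOTTOM-left corner.
struct CharBox {
    std::string glyph;
    int left, bottom, right, top, page;
};

std::vector<CharBox> ParseBoxText(const std::string &boxText) {
    std::vector<CharBox> boxes;
    std::istringstream lines(boxText);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        CharBox b;
        if (fields >> b.glyph >> b.left >> b.bottom >> b.right >> b.top >> b.page)
            boxes.push_back(b);
    }
    return boxes;
}

// Rectangle in top-left-origin pixel coordinates, ready for drawing.
struct PixelRect { int x, y, w, h; };

PixelRect ToPixelRect(const CharBox &b, int imageHeight) {
    // Flip the y-axis: box "top" is measured up from the bottom edge.
    return { b.left, imageHeight - b.top, b.right - b.left, b.top - b.bottom };
}
```

Each resulting PixelRect can then be drawn over the original frame with whatever imaging library is at hand.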
Setting the language to 'eng' does not properly recognize the text, but the boxes are pretty close:
But selecting chi_tra or our own model puts the boxes all over the place. Results vary somewhat with the Page Segmentation Mode, but none are even close.
With stock chi_tra:
PSM 6
PSM 7
PSM 13
I'm planning to fix this, but the in-code documentation is almost non-existent.
Can anyone tell me where in the code this gets done? That would help a lot with debugging. We're running in non-legacy (LSTM-only) mode.
Thanks.
PS. While the boxes are wildly off, the output text is mostly accurate:
你可以來接我嗎?
How is that possible? Does that mean GetBoxText() is unreliable?