How to get para & line boxes of word from ResultIterator

Lakshman Kumar

Jul 20, 2023, 2:18:13 AM
to tesseract-ocr
Hi All,

I am currently running OCR line by line and getting word details from the ResultIterator, like this:

tessAPI->SetPageSegMode(tesseract::PageSegMode::PSM_SINGLE_LINE);
// The line box is calculated by our own pre-processing and segmentation code.
tessAPI->SetRectangle(iXmin, iYmin, iW, iH);
tessAPI->Recognize(nullptr);
tesseract::ResultIterator* rst_iter = tessAPI->GetIterator();
tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
if (nullptr != rst_iter)
{
    do
    {
        const char* text = rst_iter->GetUTF8Text(level);
        rst_iter->WordFontAttributes(&is_bold, &is_italic, &is_underlined, &is_monospace, &is_serif, &is_smallcaps, &pointsize, &font_id);
        // Here I want to get the line and paragraph that the current word belongs to from the Tesseract API.

        delete[] text; // GetUTF8Text returns a string the caller must free with delete[]
    } while (rst_iter->Next(level));
    delete rst_iter; // the caller owns the iterator returned by GetIterator()
}

I can get paragraphs, lines, and words using the tessAPI->GetComponentImages() function, but for words it only gives me the block and paragraph, not the line. I am mapping those words to lines myself, but the mapping still produces some garbage.
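To illustrate the kind of word-to-line mapping I mean, here is a minimal sketch, assuming each word is assigned to the line box it shares the most vertical overlap with. The Box struct and the VerticalOverlap and FindLineForWord helpers are hypothetical names for illustration only; they are not part of the Tesseract API, and this is not necessarily the exact mapping in our code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned box in image coordinates (not a Tesseract type).
struct Box {
    int left;
    int top;
    int right;
    int bottom;
};

// Height of the vertical overlap between two boxes, or 0 if they do not overlap.
int VerticalOverlap(const Box& a, const Box& b) {
    return std::max(0, std::min(a.bottom, b.bottom) - std::max(a.top, b.top));
}

// Returns the index of the line box that overlaps the word the most vertically,
// considering only lines that also intersect the word horizontally.
// Returns -1 when no line overlaps the word at all.
int FindLineForWord(const Box& word, const std::vector<Box>& lines) {
    int best_index = -1;
    int best_overlap = 0;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const bool intersects_horizontally =
            word.left < lines[i].right && lines[i].left < word.right;
        if (!intersects_horizontally) {
            continue;
        }
        const int overlap = VerticalOverlap(word, lines[i]);
        if (overlap > best_overlap) {
            best_overlap = overlap;
            best_index = static_cast<int>(i);
        }
    }
    return best_index;
}
```

A mapping like this can go wrong when lines are skewed or tightly spaced, because a word may then overlap two line boxes by a similar amount.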

Is there any way to get the line and paragraph that the current word belongs to?

Thanks in advance,
Lakshman.