Including removed lines in hOCR output

Ewan Mellor

Apr 12, 2018, 1:14:34 AM
to tesseract-dev
Hi,

I'm considering extending the hOCR output from Tesseract to include information about which lines were removed before page segmentation. My idea is that a visible line on the page is a more meaningful separator than a mere gap between detected blocks. For example, I might decide that all the text below a visible line is a footer rather than body text.

Apologies if this has been discussed before, but it's basically impossible to search for the mention of "line" and I haven't found anything under "horizontal rule" or similar terms.

I don't see anything in the hOCR spec that would be a good representation of these removed lines.  I'm thinking that it would be something like <div class='ocr_hrule' title='bbox x0 y0 x1 y1'></div> where hrule means "horizontal rule" and there would be a corresponding ocr_vrule class.  It feels a bit weird to use an empty div though.  Alternatively, we could use the <hr> tag, but since there's no <vr> tag in HTML that would be a bit odd too.
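To make the intended use concrete, here is a rough sketch of how a consumer might read such markup. It assumes the proposed ocr_hrule class (which is not part of the hOCR spec) and uses a deliberately naive heuristic: any ocr_line that starts at or below the bottom edge of the lowest horizontal rule is treated as footer text. The names HocrRuleParser and split_footer are hypothetical, and the parser assumes well-formed markup with explicit closing tags.

```python
import re
from html.parser import HTMLParser

# hOCR stores the bounding box in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrRuleParser(HTMLParser):
    """Collect ocr_line boxes/text and boxes of the PROPOSED ocr_hrule class."""

    def __init__(self):
        super().__init__()
        self.lines = []   # list of (bbox, [text fragments])
        self.hrules = []  # list of bbox tuples
        self._depth = 0   # nesting depth inside the current ocr_line; 0 = outside

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested element (e.g. a word span) inside the current line.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if match is None:
            return
        bbox = tuple(int(v) for v in match.groups())
        if "ocr_hrule" in classes:  # hypothetical class, not in the hOCR spec
            self.hrules.append(bbox)
        elif "ocr_line" in classes:
            self.lines.append((bbox, []))
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.lines[-1][1].append(data)


def split_footer(hocr):
    """Return (body, footer) text lists, split at the lowest horizontal rule."""
    parser = HocrRuleParser()
    parser.feed(hocr)
    parser.close()
    texts = [(bbox, "".join(parts).strip()) for bbox, parts in parser.lines]
    if not parser.hrules:
        return [text for _, text in texts], []
    # hOCR coordinates have the origin at the top left, so y grows downward.
    cutoff = max(bbox[3] for bbox in parser.hrules)
    body = [text for bbox, text in texts if bbox[1] < cutoff]
    footer = [text for bbox, text in texts if bbox[1] >= cutoff]
    return body, footer


SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <span class='ocr_line' title='bbox 100 100 900 140'>Body text, first line.</span>
  <span class='ocr_line' title='bbox 100 150 900 190'>Body text, second line.</span>
  <div class='ocr_hrule' title='bbox 100 1200 900 1203'></div>
  <span class='ocr_line' title='bbox 100 1220 900 1250'>Page 3 of 10</span>
</div>
"""

body, footer = split_footer(SAMPLE)
print(body)
print(footer)
```

An empty div parses without trouble here, so the choice between div and hr seems to be mostly a matter of taste; a vertical rule could be handled the same way with a second class check.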

Any preferences?

I've only looked at the code briefly, but it looks to me like the line info is destroyed along with the ColumnFinder at the end of AutoPageSeg, so I'd need to carry the line info through to the block list that SegmentPage populates.  Does that sound plausible?

Thanks,

Ewan.
