ocr_line vs. ocrx_line

Zdenko Podobný

May 30, 2012, 10:49:03 AM
to hOCR
I need clarification on ocr_line vs. ocrx_line.

The hOCR spec defines ocrx_line as:
* any kind of "line" returned by an OCR system that differs from the
standard ocr_line above
* might be some kind of "logical" line


hocr-tools provides this example of ocr_line [1]:

<span class='ocr_line' title='bbox 461 648 2077 707'>Alice was beginning to get very tired of sitting by her sister on the bank,</span>

And tesseract-ocr (r729) produces this hOCR output:

<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
<span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
<span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
...
<span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
</span>
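
To make the structural difference concrete, here is a rough sketch (Python, standard library only) of how a consumer might read the tesseract-style output above, where the line contains nested ocrx_word spans instead of plain text. The names parse_bbox, HocrLines and parse_hocr are made up for illustration and are not part of hocr-tools or tesseract-ocr. SAMPLE keeps only the three words quoted in this post; the words elided by "..." are left out.

```python
from html.parser import HTMLParser

# Only the words quoted in the post; the elided ones are not reconstructed.
SAMPLE = """
<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
<span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
<span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
<span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
</span>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title like 'bbox 464 651 2074 704'."""
    # A title may carry several ';'-separated properties; pick the bbox one.
    for prop in title.split(';'):
        parts = prop.split()
        if parts and parts[0] == 'bbox':
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrLines(HTMLParser):
    """Hypothetical helper: collects line spans and the word spans inside them."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._stack = []  # class attribute of every currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        attr = dict(attrs)
        cls = attr.get('class', '')
        self._stack.append(cls)
        bbox = parse_bbox(attr.get('title', ''))
        if cls in ('ocr_line', 'ocrx_line'):
            self.lines.append({'class': cls, 'bbox': bbox, 'words': []})
        elif cls == 'ocrx_word' and self.lines:
            self.lines[-1]['words'].append({'bbox': bbox, 'text': ''})

    def handle_endtag(self, tag):
        if tag == 'span' and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Keep text only when the innermost open span is a word.
        in_word = bool(self._stack) and self._stack[-1] == 'ocrx_word'
        if in_word and self.lines and self.lines[-1]['words']:
            self.lines[-1]['words'][-1]['text'] += data


def parse_hocr(html):
    parser = HocrLines()
    parser.feed(html)
    parser.close()
    return parser.lines


lines = parse_hocr(SAMPLE)
for line in lines:
    print(line['class'], line['bbox'], [w['text'] for w in line['words']])
```

In both examples the line span carries a bbox in its title attribute; the only difference this sketch relies on is that tesseract-ocr wraps each word in its own ocrx_word span.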

Does the ocr_line produced by tesseract-ocr meet the criteria of a "standard
ocr_line", or should it use ocrx_line instead?

[1] http://code.google.com/p/hocr-tools/source/browse/sample.html#13

--
Zdenko