I need clarification on ocr_line vs. ocrx_line.
The hOCR spec defines ocrx_line as:
* any kind of "line" returned by an OCR system that differs from the standard ocr_line above
* might be some kind of "logical" line
hocr-tools provides this example of ocr_line [1]:
<span class='ocr_line' title='bbox 461 648 2077 707'>Alice was
beginning to get very tired of sitting by her sister on the bank,</span>
And tesseract-ocr (r729) produces this hOCR output:
<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
  <span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
  <span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
  ...
  <span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
</span>
Does tesseract-ocr's ocr_line meet the criteria of a "standard ocr_line", or
should it use ocrx_line instead?
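To make the structural difference concrete, here is a minimal sketch (Python standard library only) of how a consumer might read a line element that wraps ocrx_word children, as in the tesseract-ocr output quoted earlier. The names LineCollector, parse_bbox and SAMPLE are hypothetical and are not part of hocr-tools or tesseract-ocr. The sample is abbreviated to the three words quoted in this message, and the sketch assumes every element of interest is a span whose class attribute holds a single class name.

```python
from html.parser import HTMLParser

# Assumption: a consumer treats both line classes the same way.
LINE_CLASSES = ("ocr_line", "ocrx_line")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class LineCollector(HTMLParser):
    """Collect line spans and the text of ocrx_word spans nested in them."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one dict per line element found
        self._open = []        # class attribute of each open <span>
        self._in_word = False  # True while inside a word that belongs to a line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        cls = a.get("class") or ""
        inside_line = any(c in LINE_CLASSES for c in self._open)
        self._open.append(cls)
        if cls in LINE_CLASSES:
            bbox = parse_bbox(a.get("title") or "")
            self.lines.append({"class": cls, "bbox": bbox, "words": []})
        elif cls == "ocrx_word" and inside_line:
            self.lines[-1]["words"].append("")
            self._in_word = True

    def handle_endtag(self, tag):
        if tag == "span" and self._open:
            if self._open.pop() == "ocrx_word":
                self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self.lines[-1]["words"][-1] += data.strip()


# Abbreviated copy of the tesseract-ocr output quoted above (elided words omitted).
SAMPLE = """
<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
<span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
<span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
<span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
</span>
"""

collector = LineCollector()
collector.feed(SAMPLE)
for line in collector.lines:
    print(line["class"], line["bbox"], line["words"])
```

A reader written this way does not care which of the two class names is used, so the question is only about which name the spec intends for a line that contains word elements rather than plain text.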
[1] http://code.google.com/p/hocr-tools/source/browse/sample.html#13
--
Zdenko