Using tesseract 3.01.
I am working with a tool that parses tesseract's hOCR output to generate PDF files. The tool is choking because it expects the ocrx_word element to carry the bbox position information. Before I patch the tool, I want to confirm a few things.
The hOCR output generated by tesseract 3.01 wraps each word in two nested span tags:
<span class='ocr_word' title="bbox x0 y0 x1 y1"><span class='ocrx_word' id='xword_1_1' title="x_wconf -2"><strong>Text</strong></span></span>
The hOCR spec (https://docs.google.com/a/touzon.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0) doesn't mention ocr_word, and ocrx_word is considered engine-specific markup.
Is there a disconnect between the hOCR spec and tesseract or am I reading too much into this? If ocr_word isn't part of the spec, why not drop it and place the bbox position information in the ocrx_word element? This would make parsing slightly easier and reduce the size of the generated hOCR.
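For reference, here is a rough sketch of the kind of patch I have in mind for the tool: read the bbox from ocrx_word when it is there, and otherwise fall back to the enclosing ocr_word. This is not the tool's actual code. The names words_with_bbox, bbox_of, and SAMPLE are hypothetical, the coordinates in the sample are made up, and the fragment has no XML namespace or DOCTYPE, which a full hOCR file may have.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment in the shape tesseract 3.01 emits; coordinates are made up.
SAMPLE = (
    "<p>"
    "<span class='ocr_word' title=\"bbox 10 20 110 45\">"
    "<span class='ocrx_word' id='xword_1_1' title=\"x_wconf -2\">"
    "<strong>Text</strong></span></span>"
    "</p>"
)

# Matches "bbox x0 y0 x1 y1" anywhere inside a title attribute.
BBOX_RE = re.compile(r"bbox\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)")


def bbox_of(element):
    """Return (x0, y0, x1, y1) from the element's title, or None if absent."""
    match = BBOX_RE.search(element.get("title", ""))
    return tuple(int(g) for g in match.groups()) if match else None


def words_with_bbox(root):
    """Return (text, bbox) for every ocrx_word span under root."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for elem in root.iter("span"):
        if elem.get("class") != "ocrx_word":
            continue
        bbox = bbox_of(elem)
        parent = parent_of.get(elem)
        # Fall back to the wrapping ocr_word span when ocrx_word has no bbox.
        if bbox is None and parent is not None and parent.get("class") == "ocr_word":
            bbox = bbox_of(parent)
        text = "".join(elem.itertext()).strip()
        results.append((text, bbox))
    return results


print(words_with_bbox(ET.fromstring(SAMPLE)))
```

Written this way, the parser would keep working whether the bbox sits on ocr_word or on ocrx_word, so it should not matter which way the output goes in later versions.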
Carlos