hocr output - ocr_word and ocrx_word


Carlos

Apr 19, 2012, 12:15:17 PM
to tesser...@googlegroups.com
using tesseract 3.01

I am working with a tool that generates PDF files by parsing tesseract's hOCR output.  This tool is choking because it expects the ocrx_word element to include the bbox position information.  Before I patch the tool, I just wanted to confirm a few things.

The hOCR output generated by tesseract 3.01 wraps each word in two tags:

<span class='ocr_word' title="bbox x0 y0 x1 y1"><span class='ocrx_word' id='xword_1_1' title="x_wconf -2"><strong>Text</strong></span></span>

The hOCR spec (https://docs.google.com/a/touzon.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0) doesn't mention ocr_word, and ocrx_word is considered engine-specific markup.

Is there a disconnect between the hOCR spec and tesseract or am I reading too much into this?  If ocr_word isn't part of the spec, why not drop it and place the bbox position information in the ocrx_word element?  This would make parsing slightly easier and reduce the size of the generated hOCR.
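
For reference, here is a minimal sketch of how a parser could cope with the nested structure in the meantime. It assumes Python and the standard-library html.parser module; the names HocrWordParser and parse_title are hypothetical and not taken from any existing tool. It reads the bbox from the ocrx_word span when present and otherwise falls back to the enclosing ocr_word span, so it should tolerate either layout.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title attribute ("name v1 v2; name v1") into a dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collect words from ocr_word/ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished words, in document order
        self._bbox = None     # bbox seen on the enclosing ocr_word span
        self._current = None  # word being assembled inside an ocrx_word span
        self._depth = 0       # span nesting depth inside the ocrx_word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        props = parse_title(attrs.get("title") or "")
        if self._current is not None:
            # A span nested inside the word; only track the depth.
            self._depth += 1
        elif cls == "ocr_word" and "bbox" in props:
            self._bbox = tuple(int(v) for v in props["bbox"][:4])
        elif cls == "ocrx_word":
            # Prefer a bbox on the ocrx_word itself, else use the outer one.
            if "bbox" in props:
                bbox = tuple(int(v) for v in props["bbox"][:4])
            else:
                bbox = self._bbox
            conf = float(props["x_wconf"][0]) if props.get("x_wconf") else None
            self._current = {"id": attrs.get("id"), "bbox": bbox,
                             "conf": conf, "text": ""}
            self._depth = 1

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The ocrx_word span has closed; the word is complete.
            self.words.append(self._current)
            self._current = None
            self._bbox = None
```

Feeding the sample line above to an instance with feed() and then reading its words list should yield one entry that combines the bbox from the outer span with the x_wconf value from the inner one.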

Carlos

Muster Mann

Nov 28, 2013, 6:44:18 AM
to tesser...@googlegroups.com
Hi Carlos!

Thank you for pointing this out. I am also trying to get my head around it. I may be wrong, but as far as I understand, the "x" marks engine-specific markup, which can differ from the standard and will always differ between engines because each engine recognizes text differently:
  • ocrx_word
    • any kind of "word" returned by an OCR system
    • engine specific because the definition of a "word" depends on the engine

Here you can see some examples where, even with the same engine, the results can differ:  https://sourceforge.net/p/tess4j/bugs/7/
I can understand why the raw text recognition results differ even with the same engine, but I do not understand why the hOCR is not consistent when it comes from the same engine, such as tesseract.

Anyone?
