hocr output - ocr_word and ocrx_word

Carlos

Apr 19, 2012, 12:15:17 PM

to tesser...@googlegroups.com

Using tesseract 3.01.

I am working with a tool that parses tesseract's hOCR output to generate PDF files. The tool is choking because it expects the ocrx_word element to include the bbox position information. Before I patch the tool, I just wanted to confirm a few things.

The hOCR output generated by tesseract 3.01 wraps each word in two tags:

<span class='ocr_word' title="bbox ..."><span class='ocrx_word'>Text</span></span>

The hOCR spec (https://docs.google.com/a/touzon.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0) doesn't mention ocr_word, and ocrx_word is considered engine-specific markup.

Is there a disconnect between the hOCR spec and tesseract or am I reading too much into this? If ocr_word isn't part of the spec, why not drop it and place the bbox position information in the ocrx_word element? This would make parsing slightly easier and reduce the size of the generated hOCR.
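In the meantime, a parser can be written to accept either layout. The following is only a sketch, not the tool's actual code: it assumes words are marked up as span elements, that the bbox sits in the title attribute as "bbox x0 y0 x1 y1", and that a word span without its own bbox should inherit one from an enclosing word span. The names HocrWordParser and extract_words, and the sample coordinates, are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed title format: "bbox x0 y0 x1 y1", possibly followed by other properties.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from ocr_word / ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.open_spans = []  # one record per currently open <span>
        self.words = []       # results: (text, (x0, y0, x1, y1) or None)

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        match = BBOX_RE.search(attrs.get("title") or "")
        self.open_spans.append({
            "is_word": bool(classes & WORD_CLASSES),
            "bbox": tuple(map(int, match.groups())) if match else None,
            "text": [],
        })

    def handle_data(self, data):
        # Text belongs to the innermost open span, if that span is a word.
        if self.open_spans and self.open_spans[-1]["is_word"]:
            self.open_spans[-1]["text"].append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self.open_spans:
            return
        span = self.open_spans.pop()
        text = "".join(span["text"]).strip()
        if not (span["is_word"] and text):
            return
        bbox = span["bbox"]
        if bbox is None:
            # No bbox on this span: fall back to the nearest enclosing word span.
            for outer in reversed(self.open_spans):
                if outer["is_word"] and outer["bbox"]:
                    bbox = outer["bbox"]
                    break
        self.words.append((text, bbox))


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical sample with the nested layout; the coordinates are invented.
nested = ('<span class="ocr_word" title="bbox 10 20 30 40">'
          '<span class="ocrx_word">Text</span></span>')
print(extract_words(nested))  # [('Text', (10, 20, 30, 40))]
```

Because the bbox lookup falls back to the enclosing word span, the same code should keep working if the bbox is ever moved onto ocrx_word itself.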

Carlos

Muster Mann

Nov 28, 2013, 6:44:18 AM

to tesser...@googlegroups.com

Hi Carlos!

Thank you for pointing this out. I am also trying to get my head around it. Maybe I am wrong, but as far as I understand, the "x" in ocrx_word marks engine-specific markup, which differs from the standard and will always differ between engines because each engine recognizes text differently:

ocrx_word
any kind of "word" returned by an OCR system
engine specific because the definition of a "word" depends on the engine

Here you can see some examples where even the same engine can produce different results: https://sourceforge.net/p/tess4j/bugs/7/
I can understand why the raw text recognition results differ even when using the same engine, but I do not understand why the hOCR output is not consistent when using the same engine, such as tesseract.

Anyone?