tesseract hocr character output


Chang Alden

Nov 11, 2015, 8:43:42 PM
to tesseract-ocr
This is an extension of my spacing problem, in case someone missed that thread. I want to analyze the spacing using hOCR, but hOCR output only gives bounding boxes at the word level. Is there a file in tessdata/configs that I can modify to get character-level bounding boxes in the hOCR output? So far I have not found a post about this through Google Search, so I am not sure such a technique exists. I am ignoring the API route for now.
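
For reference, the word-level boxes mentioned above can be read out of an hOCR file roughly as follows. This is only a minimal Python sketch, assuming the usual hOCR layout in which each word is a `span` of class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1`. The names `WordBoxParser`, `parse_words`, and the sample markup are made up for illustration, and nested spans inside a word are not handled.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._text = []     # text fragments of that word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word span contains no nested span elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_words(hocr_text):
    parser = WordBoxParser()
    parser.feed(hocr_text)
    return parser.words


# Hypothetical hOCR fragment, not taken from a real page.
sample = """<span class='ocr_line' title='bbox 36 92 500 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 88'>world</span>
</span>"""

words = parse_words(sample)
# Gap between consecutive words: next x0 minus previous x1.
word_gaps = [b[1][0] - a[1][2] for a, b in zip(words, words[1:])]
```

This only recovers spacing between words, which is exactly the limitation described above: spacing inside a word needs character-level boxes.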

Chang Alden

Nov 12, 2015, 8:55:12 AM
to tesseract-ocr
It seems to have to do with enabling the api.GetBoxText option. Does anyone know how to get it to work?


On Thursday, November 12, 2015 at 9:43:42 AM UTC+8, Chang Alden wrote:

Chang Alden

Nov 12, 2015, 10:18:06 AM
to tesseract-ocr
Alright, I got it: just type makebox as an option. It seems everything else in the configs folder can be used this way as well.

On Thursday, November 12, 2015 at 9:55:12 PM UTC+8, Chang Alden wrote:

Helmut Wollmersdorfer

Nov 12, 2015, 1:13:11 PM
to tesseract-ocr
Sorry, in which option do you write it? It sounds like the shell console, where you get a box file. Or have you found a way to get single-character boxes in hOCR?

Chang Alden

Nov 12, 2015, 11:15:18 PM
to tesseract-ocr
Hi,
With "makebox" you get the box coordinates for each character of the input image that Tesseract scanned. The output is not in HTML format (sorry about the confusion). I have not worked on training, so I did not know such an option existed.
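
To tie this back to the spacing question, here is a rough Python sketch of reading such a box file and measuring the horizontal gap between consecutive characters. It assumes the Tesseract 3.x box format of one character per line, `<glyph> <left> <bottom> <right> <top> <page>`. The names `CharBox`, `parse_box_text`, and `horizontal_gaps`, and the sample coordinates, are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> CharBox:
    # Split from the right so a glyph made of several code points stays whole.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return CharBox(char, int(left), int(bottom), int(right), int(top), int(page))


def parse_box_text(text: str) -> List[CharBox]:
    # Skip blank lines; every other line is assumed to be a valid box entry.
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


def horizontal_gaps(boxes: List[CharBox]) -> List[Tuple[str, str, int]]:
    # Gap = left edge of the next box minus right edge of the previous one.
    # A large negative value usually means the next box starts a new text line.
    return [
        (prev.char, cur.char, cur.left - prev.right)
        for prev, cur in zip(boxes, boxes[1:])
    ]


# Hypothetical box-file content, not taken from a real page.
sample = "T 10 20 30 60 0\no 34 20 50 45 0\nm 60 20 90 45 0\n"

boxes = parse_box_text(sample)
gaps = horizontal_gaps(boxes)
```

The box file does not mark word or line boundaries, so deciding which gaps count as spaces is left to whatever threshold suits the page.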


On Friday, November 13, 2015 at 2:13:11 AM UTC+8, Helmut Wollmersdorfer wrote:

Helmut Wollmersdorfer

Nov 15, 2015, 4:50:14 PM
to tesseract-ocr
Ok, then it's like what I always do:

   $ tesseract isis_0153.png isis_0153 -l deu-frak+deu  makebox hocr tessedit_write_images


This way I can extract blocks, lines, words and character images from the clean page tiff.
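
One detail to watch when cutting character images or comparing the two outputs: box files measure y from the bottom edge of the image, while hOCR bboxes measure y from the top. A small Python sketch of the conversion, assuming the image height in pixels is already known (the function name `box_to_top_left` is made up for illustration):

```python
def box_to_top_left(left, bottom, right, top, image_height):
    """Convert a box-file rectangle (origin at bottom-left) to a
    (left, upper, right, lower) rectangle with the origin at top-left."""
    return (left, image_height - top, right, image_height - bottom)


# Hypothetical values: a character box on an image 100 pixels tall.
crop_rect = box_to_top_left(10, 20, 30, 60, image_height=100)
```

The result is in the same (left, upper, right, lower) order that hOCR uses for `bbox`, and that image libraries such as Pillow expect for a crop rectangle.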
