Convert HOCR output to HTML with positioning


gonx

Aug 30, 2015, 4:30:12 AM
to tesseract-ocr
Hi,

Is there a way to turn the hOCR that Tesseract generates into a good HTML5 page, complete with the text's positioning and font style?

Or is it best to just read the bbox coordinates as they are and write them out to an HTML5 page myself?

<div class='ocr_carea' id='block_2_8' title="bbox 1165 1335 1644 1358">
  <p class='ocr_par' dir='ltr' id='par_2_8' title="bbox 1165 1335 1644 1358">
    <span class='ocr_line' id='line_2_21' title="bbox 1165 1335 1644 1358; baseline 0 -1">
      <span class='ocrx_word' id='word_2_122' title='bbox 1165 1335 1275 1358; x_wconf 98' lang='eng' dir='ltr'>TOTAL</span>
      <span class='ocrx_word' id='word_2_123' title='bbox 1302 1335 1412 1358; x_wconf 82' lang='eng' dir='ltr'>AMoUNT</span>
      <span class='ocrx_word' id='word_2_124' title='bbox 1439 1335 1644 1357; x_wconf 89' lang='eng' dir='ltr'>TAKEN</span>
    </span>
  </p>
</div>
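
For reference, the "read the bbox coordinates as is" route could look roughly like the Python sketch below. It is only an illustration, not something Tesseract ships: the names WordCollector and hocr_to_html and the scale parameter are made up for this example. It assumes each bbox is "x0 y0 x1 y1" in image pixels with the origin at the top left, and it guesses the font size from the bbox height because the sample above carries no font information.

```python
import re
from html import escape
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Hypothetical helper: collects (text, bbox) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested <span> elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_to_html(hocr, scale=1.0):
    """Return an HTML fragment with one absolutely positioned span per word."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    spans = []
    for text, (x0, y0, x1, y1) in parser.words:
        style = (
            f"position:absolute;left:{x0 * scale:.0f}px;top:{y0 * scale:.0f}px;"
            # Assumption: bbox height is used as a rough font size.
            f"font-size:{(y1 - y0) * scale:.0f}px;white-space:nowrap"
        )
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    return '<div style="position:relative">\n' + "\n".join(spans) + "\n</div>"
```

Calling hocr_to_html(open("page.hocr", encoding="utf-8").read(), scale=0.5) would give a fragment to drop into an HTML5 page, where page.hocr is a placeholder file name. Positioning word by word keeps the layout close to the scan, at the cost of losing normal text flow; font family and weight are not recovered at all.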


oguzhang...@gmail.com

Aug 20, 2016, 6:31:37 AM
to tesseract-ocr
Any update on this?