tesseract character level details

mit

Nov 23, 2020, 8:01:31 AM

to tesseract-ocr

I am trying to get character-level details for a file using the hocr_char_boxes=1 option, but the bounding boxes in the generated output overlap between adjacent characters.

<div class='ocr_page' id='page_1' title='image "file-0.png"; bbox 0 0 1653 2336; ppageno 0'>
 <div class='ocr_carea' id='block_1_1' title="bbox 111 203 930 219">
 
 
 
 S
 e
 e

How can two characters have the same starting point? For S the box is 'x_bboxes 111 204 117 216' and for e it is 'x_bboxes 111 204 119 216', so both start at x = 111.
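To check how widespread the overlap is, the boxes can be compared programmatically. Below is a minimal Python sketch, not a definitive implementation. It assumes the per-character spans look like `<span class='ocrx_cinfo' title='x_bboxes ...; x_conf ...'>` (the markup was stripped from the output pasted above, so this shape is an assumption), that the text runs left to right, and that the `x_conf` values in the sample are placeholders. The helper names `char_boxes` and `overlapping_neighbours` are hypothetical.

```python
import re

# Assumed shape of a per-character span in the hOCR output; the real markup
# was lost in the pasted snippet, so adjust this pattern to the actual file.
CHAR_SPAN = re.compile(
    r"<span class='ocrx_cinfo' "
    r"title='x_bboxes (\d+) (\d+) (\d+) (\d+)[^']*'>(.*?)</span>"
)


def char_boxes(hocr):
    """Return a list of (char, (x0, y0, x1, y1)) in document order."""
    result = []
    for m in CHAR_SPAN.finditer(hocr):
        box = tuple(int(m.group(i)) for i in range(1, 5))
        result.append((m.group(5), box))
    return result


def overlapping_neighbours(boxes):
    """Return consecutive character pairs whose boxes overlap horizontally.

    Assumes left-to-right text: a pair overlaps when the second box
    starts before the first one ends.
    """
    pairs = []
    for (c1, b1), (c2, b2) in zip(boxes, boxes[1:]):
        if b2[0] < b1[2]:
            pairs.append(((c1, b1), (c2, b2)))
    return pairs


# The two boxes quoted in the question; x_conf values are placeholders.
sample = (
    "<span class='ocrx_cinfo' title='x_bboxes 111 204 117 216; x_conf 99'>S</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 111 204 119 216; x_conf 99'>e</span>"
)

for first, second in overlapping_neighbours(char_boxes(sample)):
    print(first, "overlaps", second)
```

Running the same check over the full hOCR file would show whether the overlap is limited to the first characters of a word or occurs throughout.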

Tesseract details:

tesseract v4.1.0-elag2019
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE

I have attached the image file.

file-0.png

mit

Nov 24, 2020, 12:27:07 AM

to tesseract-ocr

Hi,

Does anyone have an update on this?

Thanks
