Inconsistent outputs between TEXT and hOCR formats

Matthew Getzin

Jan 11, 2020, 4:44:05 PM

to tesseract-ocr

Hello,

I created an issue (see below) on GitHub. I'm not sure whether it is a bug or something better suited to the discussion forum.

### Environment

* **Tesseract Version**: tesseract 4.0.0-beta.1

leptonica-1.75.3

libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX2

Found AVX

Found SSE

* **Platform**: Linux getzinmw-XPS-15-9550 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
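
The version line above carries a `-beta.1` suffix, which marks a pre-release build. As a rough sketch, a check like the following could flag such builds from the first line of the `tesseract --version` banner. The function names `parse_tesseract_version` and `is_prerelease` are hypothetical helpers, and the banner format is assumed to match the one pasted above.

```python
import re


def parse_tesseract_version(banner):
    """Return (major, minor, patch, tag) from a 'tesseract X.Y.Z[-tag]' banner.

    Assumption: the banner contains a line shaped like the one in this report,
    e.g. 'tesseract 4.0.0-beta.1'. The tag is None when there is no suffix.
    """
    match = re.search(
        r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.]+))?", banner
    )
    if match is None:
        raise ValueError("no tesseract version found in banner")
    major, minor, patch, tag = match.groups()
    return int(major), int(minor), int(patch), tag


def is_prerelease(banner):
    """True when the version carries a suffix such as 'beta.1'."""
    return parse_tesseract_version(banner)[3] is not None
```

The banner text can be captured by running `tesseract --version` and passing its output to `is_prerelease`.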

### Current Behavior:

Tesseract's hOCR output differs from its default .txt output for the same image. For the attached image, for example, the hOCR output is missing most of the numbers on the left side of the page, although they appear in the .txt output file.

Commands tried:

tesseract input.png output -l eng --psm 6

tesseract input.png output -l eng --psm 6 hocr
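
To make the difference between the two outputs concrete, here is a minimal sketch that lists the words present in the plain-text output but absent from the hOCR output. It assumes the hOCR file marks each word with an element whose class includes `ocrx_word`, as hOCR files normally do; `HocrWordParser`, `hocr_words`, and `missing_from_hocr` are hypothetical names chosen for illustration, and the comparison is a simple whitespace-token match rather than a layout-aware diff.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text of every element whose class includes 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0  # nesting depth inside the current word element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word, e.g. <strong> or <em>.
            self._depth += 1
        elif "ocrx_word" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                word = "".join(self._buffer).strip()
                if word:
                    self.words.append(word)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def hocr_words(hocr_markup):
    """Return the recognized words from hOCR markup, in document order."""
    parser = HocrWordParser()
    parser.feed(hocr_markup)
    parser.close()
    return parser.words


def missing_from_hocr(plain_text, hocr_markup):
    """Return words that occur in the text output more often than in hOCR."""
    remaining = Counter(plain_text.split()) - Counter(hocr_words(hocr_markup))
    return sorted(remaining.elements())
```

With the attached files, passing the contents of `output.txt` and `output.hocr` to `missing_from_hocr` should list the left-column numbers that the hOCR run dropped.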

### Expected Behavior:

I would expect the recognized text to be the same in both modes, with the output format being the only difference.

### Suggested Fix:

Ensure that the recognized text is consistent across the output formats.

input.png

output.txt

output-hocr.txt

output.hocr

Zdenko Podobny

Jan 11, 2020, 4:45:51 PM

to tesser...@googlegroups.com

You are using an old Tesseract version.

On Sat, Jan 11, 2020, 22:43, Matthew Getzin <matthew...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/62a1f36b-cfd2-4c4d-901d-337d6bbcc12d%40googlegroups.com.

Matthew Getzin

Jan 11, 2020, 6:54:18 PM

to tesser...@googlegroups.com

Thanks. I'll try upgrading and see whether the issue is resolved.
