Inconsistent outputs between TEXT and hOCR formats


Matthew Getzin

Jan 11, 2020, 4:44:05 PM
to tesseract-ocr
Hello,

I created an issue (see below) on GitHub. I'm not sure whether it is a bug or something better suited to the discussion forum.

### Environment

* **Tesseract Version**: tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

* **Platform**: Linux getzinmw-XPS-15-9550 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

### Current Behavior:
I am currently having issues with the hOCR output from tesseract as compared to the default .txt output. In the attached image, for example, my hOCR output does not register the majority of the numbers on the left side of the page, while they are registered in the .txt output file.

Commands tried:
tesseract input.png output -l eng --psm 6
tesseract input.png output -l eng --psm 6 hocr

### Expected Behavior:
I would expect that the recognition of text would be consistent between the two modes with the output format being the only difference.
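To make the discrepancy concrete, the two output files can be compared token by token. The following is only a minimal sketch, assuming the hOCR file marks each recognized word with the `ocrx_word` class and that whitespace-separated tokens in the .txt file are comparable to those words. The names `HocrWordParser`, `hocr_words`, and `missing_from_hocr` are made up for this illustration and are not part of Tesseract.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text of every element whose class list includes 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0    # nesting depth inside the current word element
        self._buffer = []  # text fragments of the current word element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup (for example <strong>) inside a word element.
            self._depth += 1
        elif "ocrx_word" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Split so tokens line up with whitespace-split .txt tokens.
                self.words.extend("".join(self._buffer).split())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def hocr_words(hocr_text):
    """Return the recognized words found in an hOCR document, in order."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def missing_from_hocr(txt_text, hocr_text):
    """Return tokens that occur more often in the .txt output than in the hOCR output."""
    diff = Counter(txt_text.split()) - Counter(hocr_words(hocr_text))
    return sorted(diff.elements())
```

With the files produced by the two commands above, something like `missing_from_hocr(open("output.txt").read(), open("output.hocr").read())` should list the tokens (here, presumably the numbers on the left side of the page) that appear only in the text output.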

### Suggested Fix:
Ensure that recognition results are consistent across the output formats.

input.png
output.txt
output-hocr.txt
output.hocr

Zdenko Podobny

Jan 11, 2020, 4:45:51 PM
to tesser...@googlegroups.com
You are using an old version of Tesseract.

On Sat, Jan 11, 2020, 22:43 Matthew Getzin <matthew...@gmail.com> wrote:

Matthew Getzin

Jan 11, 2020, 6:54:18 PM
to tesser...@googlegroups.com
Thanks. I'll try upgrading and see whether the issue is resolved.
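
For anyone wanting to confirm which version is actually installed before and after an upgrade, the first line of `tesseract --version` can be parsed. This is a rough sketch, assuming the banner looks like the `tesseract 4.0.0-beta.1` line quoted in the environment section. The helper names are hypothetical, and since some builds may print the banner on stderr rather than stdout, both streams are read.

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Return (major, minor, patch, prerelease) parsed from `tesseract --version` text."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)(?:-([\w.]+))?", version_output)
    if match is None:
        raise ValueError("no Tesseract version found in the given text")
    major, minor, patch, prerelease = match.groups()
    # prerelease is an empty string for final releases such as 4.1.1.
    return int(major), int(minor), int(patch), prerelease or ""


def installed_tesseract_version():
    """Run `tesseract --version` and parse the banner (requires tesseract on PATH)."""
    result = subprocess.run(["tesseract", "--version"], capture_output=True, text=True)
    # Check both streams, since the banner location is an assumption here.
    return parse_tesseract_version(result.stdout + result.stderr)
```

A non-empty fourth element (for example `"beta.1"`) indicates a pre-release build like the one reported above.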
