New issue 690 by marco.st...@gmail.com: Non-UTF-8 characters in hocr output
http://code.google.com/p/tesseract-ocr/issues/detail?id=690
With the attached file, the hocr output contains some characters that are
not valid UTF-8:
~$ tesseract 024.tif 024 hocr
~$ file 024.html
024.html: data
~$ emacs 024.html
[remove the unrecognized characters]
~$ file 024.html
024.html: HTML document, UTF-8 Unicode text, with very long lines
The default text output is perfectly fine; only the hocr output is affected.
I'm using tesseract 3.02.01 on Debian 7 unstable, 64-bit.
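To pinpoint the offending bytes instead of removing them by hand in an editor, something like the following could be used. This is only a minimal sketch, assuming Python 3 is available; the function name find_invalid_utf8 is hypothetical and not part of tesseract. It reports the byte offset and raw value of each span that fails to decode as UTF-8.

```python
import sys


def find_invalid_utf8(data: bytes):
    """Return a list of (offset, bad_bytes) for each span that is not valid UTF-8."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decoding succeeds only if the rest of the buffer is valid UTF-8.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift by pos.
            start = pos + err.start
            end = pos + err.end
            bad.append((start, data[start:end]))
            pos = end  # resume scanning right after the invalid span
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python find_invalid_utf8.py 024.html
    with open(sys.argv[1], "rb") as handle:
        for offset, raw in find_invalid_utf8(handle.read()):
            print(f"offset {offset}: {raw!r}")
```

Running it against 024.html before and after the manual cleanup should show which bytes the editor pass removed, which may help narrow down where in the hocr output they come from.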
Attachments:
024.html 38.2 KB
024.tif 45.2 KB