Issue 690 in tesseract-ocr: Non utf-8 character output in hocr

tesser...@googlecode.com

Apr 23, 2012, 1:45:07 PM

to tesserac...@googlegroups.com

Status: New
Owner: ----

New issue 690 by marco.st...@gmail.com: Non utf-8 character output in hocr
http://code.google.com/p/tesseract-ocr/issues/detail?id=690

With the attached file, the hOCR output contains some characters with the
wrong encoding:

~$ tesseract 024.tif 024 hocr
~$ file 024.html
024.html: data
~$ emacs 024.html
[remove the unrecognized characters]
~$ file 024.html
024.html: HTML document, UTF-8 Unicode text, with very long lines

The default text output is perfectly fine.
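
The manual cleanup in emacs above can be narrowed down by locating the offending bytes first. The following is a minimal Python sketch, not part of tesseract; the helper name `find_invalid_utf8` is hypothetical, and it assumes the file is small enough to read into memory:

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, bad_bytes) for every undecodable run in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no further bad bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad run
    return problems


# Hypothetical usage with the file from this report:
#     with open("024.html", "rb") as fh:
#         for offset, bad in find_invalid_utf8(fh.read()):
#             print(offset, bad)
```

An empty result means the file decodes cleanly as UTF-8. Control characters such as ^A are valid UTF-8 and are not flagged by this check.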

I'm using tesseract 3.02.01 on Debian 7 unstable, 64-bit.

Attachments:
024.html 38.2 KB
024.tif 45.2 KB

tesser...@googlecode.com

May 7, 2012, 7:11:36 AM

to tesserac...@googlegroups.com

Comment #1 on issue 690 by jw...@jwilk.net: Non utf-8 character output in
hocr
http://code.google.com/p/tesseract-ocr/issues/detail?id=690

It should be noted that the hOCR file also contains control characters, such
as ^A.
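
Control characters such as ^A (U+0001) are valid UTF-8, so a plain decoding check will not report them, yet XML 1.0 forbids all C0 controls other than tab, LF, and CR. A minimal Python sketch for spotting them in already-decoded text follows; the helper name `find_control_chars` is hypothetical and this is only an illustration, not what tesseract does:

```python
def find_control_chars(text: str):
    """Return (index, codepoint) for C0 controls other than tab, LF and CR."""
    allowed = {"\t", "\n", "\r"}
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) < 0x20 and ch not in allowed
    ]


# Hypothetical usage on the attached file, replacing undecodable bytes so
# that the scan can proceed past them:
#     with open("024.html", "rb") as fh:
#         text = fh.read().decode("utf-8", errors="replace")
#     print(find_control_chars(text))
```

The returned index is a character position in the decoded text, not a byte offset in the file.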

tesser...@googlecode.com

Aug 1, 2012, 6:01:18 PM

to tesserac...@googlegroups.com

Updates:
Status: Fixed

Comment #2 on issue 690 by zde...@gmail.com: Non utf-8 character output in
hocr
http://code.google.com/p/tesseract-ocr/issues/detail?id=690

Fixed in r736.
