Issue 376 in tesseract-ocr: hOCR output: special characters not escaped

tesser...@googlecode.com

Oct 15, 2010, 5:45:44 PM
to tesserac...@googlegroups.com
Status: New
Owner: ----

New issue 376 by jw...@jwilk.net: hOCR output: special characters not
escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376

Tesseract doesn't escape special characters when outputting hOCR. As a
consequence, it's possible to inject (almost) arbitrary HTML into the
resulting hOCR file using a specially crafted input image. Please see the
attached examples.


Attachments:
xss.tif 3.1 KB
xss.html 693 bytes

tesser...@googlecode.com

Oct 24, 2010, 6:02:13 PM
to tesserac...@googlegroups.com
Updates:
Status: Accepted

Comment #1 on issue 376 by joregan: hOCR output: special characters not
escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376

(No comment was entered for this change.)

tesser...@googlecode.com

Oct 25, 2010, 3:53:15 PM
to tesserac...@googlegroups.com

Comment #2 on issue 376 by aizvorski: hOCR output: special characters not
escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376

Aside from the security issues, this also makes it impossible to process
the hOCR output from many normal page images that just happen to contain a
special character (such as < or >).

tesser...@googlecode.com

Oct 29, 2010, 11:42:56 AM
to tesserac...@googlegroups.com

Comment #3 on issue 376 by aizvorski: hOCR output: special characters not
escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376

Patch attached which fixes this. I tested it on several examples that used
to fail (or crash hocr2pdf), and they now work as far as I can tell. The
patch is against tesseract-3.00.

Remaining todo:
- cleanup: replace repeated use of choice->unichar_string()[i] with a local
variable
- escape more: characters with ascii codes > 128 to numeric character
references? newlines as <br/>? run of spaces to single space followed by
&nbsp;?
- is this unicode-friendly?

Attachments:
tesseract-3.00-hocr-escaping.patch 1.0 KB

tesser...@googlecode.com

Oct 29, 2010, 3:05:37 PM
to tesserac...@googlegroups.com
Updates:
Status: Fixed

Comment #4 on issue 376 by zde...@gmail.com: hOCR output: special characters
not escaped

fixed in r515

tesser...@googlecode.com

Oct 29, 2010, 6:47:02 PM
to tesserac...@googlegroups.com

Comment #5 on issue 376 by joregan: hOCR output: special characters not
escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376

First of all, thanks for the patch.

> Remaining todo:
> - cleanup: replace repeated use of choice->unichar_string()[i] with a
> local variable

Why not use a switch instead?
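
A minimal sketch of what a switch-based version could look like. This is
not the code from the attached patch or from r515; the function name
EscapeHocrText and the exact set of escaped characters are assumptions made
for illustration. The loop variable also takes care of the "local variable"
cleanup item.

```cpp
#include <string>

// Hypothetical helper (not Tesseract's actual code): returns a copy of
// `text` with HTML-special characters replaced by entity references.
std::string EscapeHocrText(const std::string& text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {  // one local instead of repeated indexing
    switch (c) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += c;        break;  // everything else passes through
    }
  }
  return out;
}
```

With this, recognized text such as <script> would be written to the hOCR
file as &lt;script&gt; and so could no longer be parsed as markup.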

> - escape more: characters with ascii codes > 128 to numeric character
> references?

That would be nice, if optional, but there are already tools (xmllint,
tidy) that can do that, so I wouldn't place a high priority on it.
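
For reference, a rough sketch of what such an optional conversion might
involve, assuming the input is well-formed UTF-8. The function name
ToNumericCharRefs is hypothetical, and the decoder is deliberately simple:
it copies malformed bytes through unchanged and does not reject overlong
encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: rewrites every non-ASCII code point in a UTF-8
// string as a decimal numeric character reference. ASCII bytes and any
// malformed bytes are copied through unchanged.
std::string ToNumericCharRefs(const std::string& utf8) {
  std::string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    std::size_t len = 0;  // 0 means "not a multi-byte lead byte"
    std::uint32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

    // Check that the expected continuation bytes (10xxxxxx) are present.
    bool ok = len != 0 && i + len <= utf8.size();
    for (std::size_t k = 1; ok && k < len; ++k) {
      const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
      ok = (next & 0xC0) == 0x80;
      cp = (cp << 6) | (next & 0x3F);
    }

    if (ok) {
      out += "&#" + std::to_string(cp) + ";";
      i += len;
    } else {
      out += utf8[i];  // ASCII or malformed byte: copy as-is
      ++i;
    }
  }
  return out;
}
```

This only handles the non-ASCII part; the markup characters themselves
would still need to be escaped separately.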

> newlines as <br/>? run of spaces to single space followed by &nbsp;?

Hmm. It might be better, especially if the image uses a fixed width font,
to use a <pre> block instead, but that would be much more complicated.
Tesseract 3.01 has font detection, which it would be nice to retain in
hOCR, so it may be worth waiting until 3.01 hits SVN (tomorrow) before
thinking about it.

> - is this unicode-friendly?

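Tesseract uses UTF-8, which is backward compatible with ASCII: every byte of
a multi-byte sequence has its high bit set, so the bytes for <, > and &
never appear inside another character, and escaping byte by byte is safe. If
it had been UTF-16, for example, there would be a problem, but there's
nothing to worry about.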
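
A small illustration of the difference, using the code point U+3C3E as an
example. The helper name CountMarkupBytes and both constants are made up for
this sketch, and the byte values were written out by hand.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts bytes that a byte-wise escaper would treat
// as markup characters.
std::size_t CountMarkupBytes(const std::string& bytes) {
  std::size_t count = 0;
  for (char c : bytes) {
    if (c == '<' || c == '>' || c == '&') ++count;
  }
  return count;
}

// U+3C3E encoded two ways.
const std::string kUtf8("\xE3\xB0\xBE", 3);  // every byte is >= 0x80
const std::string kUtf16Be("\x3C\x3E", 2);   // bytes equal '<' and '>'
```

A byte-wise escaper leaves the UTF-8 form alone, but it would mangle the
UTF-16 form, because both of its bytes look like markup characters.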

tesser...@googlecode.com

May 2, 2011, 10:22:56 AM
to tesserac...@googlegroups.com

Comment #6 on issue 376 by zde...@gmail.com: hOCR output: special characters
not escaped

Issue 482 has been merged into this issue.
