New issue 376 by jw...@jwilk.net: hOCR output: special characters not
escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376
Tesseract doesn't escape special characters when outputting hOCR. As a
consequence, it's possible to inject (almost) arbitrary HTML into the
resulting hOCR file using a specially crafted input image. Please see the
attached examples.
Attachments:
xss.tif 3.1 KB
xss.html 693 bytes
Comment #1 on issue 376 by joregan: hOCR output: special characters not
escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376
(No comment was entered for this change.)
Aside from the security issues, this also makes it impossible to process
the hOCR output from many normal page images that just happen to contain a
special character (such as < or >).
The attached patch fixes this. I tested it on several examples that used to
fail (or crash hocr2pdf); they now work as far as I can tell. The patch is
against tesseract-3.00.
Remaining todo:
- cleanup: replace repeated use of choice->unichar_string()[i] with a local
variable
- escape more: characters with ascii codes > 128 to numeric character
references? newlines as <br/>? run of spaces to single space followed by
&nbsp;?
- is this unicode-friendly?
Attachments:
tesseract-3.00-hocr-escaping.patch 1.0 KB
Comment #4 on issue 376 by zde...@gmail.com: hOCR output: special
characters not escaped
http://code.google.com/p/tesseract-ocr/issues/detail?id=376
fixed in r515
First of all, thanks for the patch.
> Remaining todo:
> - cleanup: replace repeated use of choice->unichar_string()[i] with a
> local variable
Why not use a switch instead?
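To illustrate the switch suggestion, here is a minimal sketch. EscapeForHocr is a hypothetical name, and this is neither the attached patch nor the code committed in r515; it only shows the shape of a per-character switch over the usual HTML special characters.

```cpp
#include <string>

// Hypothetical helper (not Tesseract's actual code): returns a copy of
// `text` with the characters that are special in HTML replaced by entities.
std::string EscapeForHocr(const std::string& text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    switch (c) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += c;        break;  // everything else is copied as-is
    }
  }
  return out;
}
```

Reading each character once inside the loop also takes care of the "local variable" cleanup item, since the switch operates on a single copy of the character.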
> - escape more: characters with ascii codes > 128 to numeric character
> references?
That would be nice, if optional, but there are already tools (xmllint,
tidy) that can do that, so I wouldn't place a high priority on it.
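For anyone who does want this done in code rather than with xmllint or tidy, the conversion could look roughly like the sketch below. ToNumericReferences is a hypothetical name and nothing like it is claimed to exist in Tesseract. It assumes the input is UTF-8, copies bytes it cannot decode unchanged, and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: rewrites every non-ASCII code point in a UTF-8
// string as a decimal numeric character reference such as "&#233;".
std::string ToNumericReferences(const std::string& utf8) {
  std::string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    std::size_t len = 0;
    unsigned long cp = 0;
    if (lead < 0x80) {                      // plain ASCII: copy through
      out += utf8[i++];
      continue;
    } else if ((lead & 0xE0) == 0xC0) {     // 2-byte sequence
      len = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {     // 3-byte sequence
      len = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {     // 4-byte sequence
      len = 4; cp = lead & 0x07;
    }
    // Check that all expected continuation bytes (10xxxxxx) are present.
    bool ok = len != 0 && i + len <= utf8.size();
    for (std::size_t k = 1; ok && k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
      if ((b & 0xC0) != 0x80) {
        ok = false;
      } else {
        cp = (cp << 6) | (b & 0x3F);
      }
    }
    if (!ok) {                              // undecodable byte: copy unchanged
      out += utf8[i++];
      continue;
    }
    out += "&#" + std::to_string(cp) + ";";
    i += len;
  }
  return out;
}
```

Making this optional, as suggested, would keep the default output readable for people who open the hOCR file directly.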
> newlines as <br/>? run of spaces to single space followed by &nbsp;?
Hmm. It might be better, especially if the image uses a fixed width font,
to use a <pre> block instead, but that would be much more complicated.
Tesseract 3.01 has font detection, which it would be nice to retain in
hOCR, so it may be worth waiting until 3.01 hits SVN (tomorrow) before
thinking about it.
> - is this unicode-friendly?
Tesseract uses UTF-8, which degrades properly for ASCII characters -- if it
had been UTF-16, for example, there would be a problem, but there's nothing
to worry about.
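The reason byte-wise escaping is safe here can be shown with a small sketch. In UTF-8, every byte of a multi-byte sequence has its high bit set (0x80 or above), so none of them can be mistaken for '<', '>' or '&'. EscapeMarkupBytes is a hypothetical name used only for this illustration.

```cpp
#include <string>

// Hypothetical helper: escapes &, < and > one byte at a time. Bytes that
// belong to multi-byte UTF-8 sequences are all >= 0x80, so they never
// match the ASCII cases and are copied through untouched.
std::string EscapeMarkupBytes(const std::string& utf8) {
  std::string out;
  out.reserve(utf8.size());
  for (char c : utf8) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;";  break;
      case '>': out += "&gt;";  break;
      default:  out += c;       break;  // includes every non-ASCII byte
    }
  }
  return out;
}
```

With UTF-16, by contrast, an ASCII-range byte value can appear as one half of an unrelated code unit, which is why the same byte-wise loop would not be safe there.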
Issue 482 has been merged into this issue.