tesseract export into txt vs. into pdf (issues with some characters)


Jan

Mar 13, 2015, 8:29:13 AM
to tesser...@googlegroups.com
Hi,

I noticed that when I use tesseract to create a searchable PDF (I use pdfsandwich for this), some characters are not displayed and are replaced by blank spaces instead. If, however, I OCR the same file with tesseract only to obtain plain text (I use OCRFeeder), everything is recognized AND displayed properly. It seems as if tesseract has issues exporting some characters specifically to PDFs, even though it is obviously capable of recognizing them. This happens with quotation marks and ligatures ("Th", "ff", etc.) but also, for example, with some special Czech characters such as "ě", "č", or "š" (even when the option "-l ces" is used). Does anybody have an idea what could be wrong?

I've been trying to find out whether anyone has had the same issue, but I have not found a relevant forum thread (yet).

Any advice would be much appreciated!
Jan

Tom Morris

May 14, 2015, 12:08:52 PM
to tesser...@googlegroups.com
On Friday, March 13, 2015 at 8:29:13 AM UTC-4, Jan wrote:
I noticed that when I use tesseract to create a searchable PDF (I use pdfsandwich for this), some characters are not displayed and are replaced by blank spaces instead. If, however, I OCR the same file with tesseract only to obtain plain text (I use OCRFeeder), everything is recognized AND displayed properly. It seems as if tesseract has issues exporting some characters specifically to PDFs, even though it is obviously capable of recognizing them. This happens with quotation marks and ligatures ("Th", "ff", etc.) but also, for example, with some special Czech characters such as "ě", "č", or "š" (even when the option "-l ces" is used). Does anybody have an idea what could be wrong?

There's some discussion on this bug report which may be relevant:


It doesn't sound like the same problem, but it could be related.
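
One way to narrow this down might be to compare the plain-text output with the text layer extracted from the PDF and list which characters go missing. The following is only a rough sketch under some assumptions: the function name missing_characters and the file names in the comments are hypothetical, and it assumes the PDF text layer has already been extracted to a string (for example with pdftotext from poppler).

```python
import unicodedata
from collections import Counter

# Hypothetical workflow that would produce the two inputs (file names are
# placeholders, not taken from the original post):
#   tesseract page.tif out -l ces        -> out.txt (plain text)
#   tesseract page.tif out -l ces pdf    -> out.pdf (searchable PDF)
#   pdftotext out.pdf out_pdf.txt        -> text layer of the PDF


def missing_characters(txt_text: str, pdf_text: str) -> dict[str, int]:
    """Return characters that occur more often in txt_text than in pdf_text.

    The result maps each such character to the number of missing occurrences.
    """
    # Normalize to NFC so that "e" + combining caron and the precomposed
    # "ě" are counted as the same character.
    txt_counts = Counter(unicodedata.normalize("NFC", txt_text))
    pdf_counts = Counter(unicodedata.normalize("NFC", pdf_text))

    # Counter subtraction keeps only positive differences, i.e. characters
    # that are under-represented in the PDF text layer.
    shortfall = txt_counts - pdf_counts

    # Whitespace differs between the two outputs for layout reasons, so it
    # is ignored here.
    return {ch: n for ch, n in shortfall.items() if not ch.isspace()}


if __name__ == "__main__":
    # Inline sample strings standing in for the two extracted texts.
    txt_sample = "něco číst š"
    pdf_sample = "n co  íst  "
    for ch, n in missing_characters(txt_sample, pdf_sample).items():
        print(f"U+{ord(ch):04X} {ch!r} missing {n} time(s)")
```

If the characters reported this way are exactly the quotation marks, ligatures, and Czech letters, that would suggest the problem lies in how the PDF text layer is written rather than in recognition.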

Tom