tesseract export into txt vs. into pdf (issues with some characters)

Jan

Mar 13, 2015, 8:29:13 AM

to tesser...@googlegroups.com

Hi,

I noticed that when I use tesseract to create a searchable PDF (I use pdfsandwich for this), some characters are not displayed and are replaced by blank spaces. However, if I OCR the same file with tesseract alone to obtain plain text (I use OCRfeeder), everything is recognized and displayed properly. It seems that tesseract has trouble exporting some characters specifically to PDF, even though it is clearly capable of recognizing them. This happens with quotation marks and ligatures ("Th", "ff", etc.), but also, for example, with some Czech characters such as "ě", "č", or "š" (even when the option "-l ces" is set). Does anybody have an idea what could be wrong?

I've been trying to find out whether anyone else has had the same issue, but I have not found a relevant forum thread yet.

Any advice would be much appreciated!
Jan

Tom Morris

May 14, 2015, 12:08:52 PM

to tesser...@googlegroups.com

On Friday, March 13, 2015 at 8:29:13 AM UTC-4, Jan wrote:

I noticed that when I use tesseract to create a searchable PDF (I use pdfsandwich for this), some characters are not displayed and are replaced by blank spaces. However, if I OCR the same file with tesseract alone to obtain plain text (I use OCRfeeder), everything is recognized and displayed properly. It seems that tesseract has trouble exporting some characters specifically to PDF, even though it is clearly capable of recognizing them. This happens with quotation marks and ligatures ("Th", "ff", etc.), but also, for example, with some Czech characters such as "ě", "č", or "š" (even when the option "-l ces" is set). Does anybody have an idea what could be wrong?

There's some discussion on this bug report which may be relevant:

http://bugs.ghostscript.com/show_bug.cgi?id=695869

It doesn't sound like the same problem, but it could be related.

Tom
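
One way to narrow this down might be to run tesseract directly for both outputs (bypassing pdfsandwich and OCRfeeder) and then compare the plain text against the text layer of the resulting PDF. The sketch below is only an illustration under a few assumptions: the tesseract build includes the "pdf" output config, Poppler's pdftotext is installed, and the names used here (missing_characters, compare_outputs, the file names) are made up for this example. If the characters are missing here too, the PDF renderer in tesseract is the likely culprit; if they are present, the wrapper tool or the PDF viewer would be the next thing to check.

```python
import subprocess
import sys
from pathlib import Path


def missing_characters(txt_text, pdf_text):
    """Return the non-whitespace characters that occur in the plain-text
    OCR output but nowhere in the text extracted from the PDF."""
    present = set(pdf_text)
    return sorted({ch for ch in txt_text if not ch.isspace() and ch not in present})


def compare_outputs(image, out_base, lang="ces"):
    """Run tesseract twice on the same image and compare the results.

    Assumes 'tesseract' and 'pdftotext' are on PATH (an assumption of
    this sketch, not something confirmed in the thread).
    """
    # Plain text output: writes <out_base>.txt
    subprocess.run(["tesseract", str(image), str(out_base), "-l", lang], check=True)
    # Searchable PDF output: writes <out_base>.pdf
    subprocess.run(["tesseract", str(image), str(out_base), "-l", lang, "pdf"], check=True)

    txt_text = Path(f"{out_base}.txt").read_text(encoding="utf-8")
    # Extract the hidden text layer from the PDF to stdout.
    pdf_text = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", f"{out_base}.pdf", "-"],
        check=True,
        capture_output=True,
    ).stdout.decode("utf-8", errors="replace")

    return missing_characters(txt_text, pdf_text)


if __name__ == "__main__":
    # Hypothetical usage: python compare_ocr.py scan.png out
    for ch in compare_outputs(sys.argv[1], sys.argv[2]):
        print(f"U+{ord(ch):04X} {ch}")
```

The comparison is deliberately crude: it only reports characters that never appear in the PDF text layer at all, which matches the symptom described above (quotation marks, ligatures, and "ě", "č", "š" dropping out), and it says nothing about characters that are merely misplaced.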
