Hi,
I noticed that when I use tesseract to create a searchable PDF (I use pdfsandwich for this), some characters are not displayed and are replaced by blank spaces instead. If, however, I OCR the same file with tesseract just to obtain plain text (I use OCRfeeder), everything is recognized AND displayed properly. It seems as if tesseract has trouble exporting certain characters specifically to PDFs, even though it is obviously capable of recognizing them. This happens with quotation marks and ligatures ("Th", "ff", etc.), but also, for example, with some special Czech characters such as "ě", "č", or "š" (even when the option "-l ces" is activated). Does anybody have an idea what could be wrong?
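In case it helps anyone reproduce this, here is a minimal sketch of how the two outputs could be compared. It assumes the PDF's text layer has already been extracted to a string by some other tool and that the plain-text OCR result is available as a second string. The function name `missing_chars` and the sample strings are made up for illustration; they are not actual output from my files.

```python
import unicodedata
from collections import Counter
from typing import Dict


def missing_chars(plain_text: str, pdf_text: str) -> Dict[str, int]:
    """Count characters that occur more often in plain_text than in pdf_text.

    Both inputs are normalized to NFC first, so that a precomposed "ě" and
    "e" + combining caron are treated as the same character. Whitespace is
    ignored, because the missing characters show up as blanks in the PDF.
    """
    def counts(text: str) -> Counter:
        normalized = unicodedata.normalize("NFC", text)
        return Counter(ch for ch in normalized if not ch.isspace())

    # Counter subtraction keeps only characters with a positive difference.
    return dict(counts(plain_text) - counts(pdf_text))


# Hypothetical sample: the PDF text layer has blanks where "ě" and "š" should be.
plain = "Těšín"
pdf = "T  ín"
print(missing_chars(plain, pdf))
```

Running something like this over a whole page should list exactly which characters get lost on the way into the PDF, which might make it easier to tell whether it is a font or an encoding problem.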
I've been trying to find out whether anyone has had the same issue, but I have not found any relevant forum thread (yet).
Any advice would be much appreciated!
Jan