Text output vs. PDF

Tobias Fritz

Jun 20, 2015, 4:45:45 PM

to tesser...@googlegroups.com

Hi,

I'm using tesseract 3.04 on OS X. It works very well, but I'm having trouble with searchable PDF output.

I ran tesseract on a TIFF file and created PDF output. The text file that is also created is almost accurate except for a few small glitches. However, the text overlay in the PDF is not. In one place it inserted spaces within the words, like this: t h i s i s a n e x a m p l e.

In another place it removed all the spaces like this: thisisanexample.

What's the reason for this when the text file is almost perfect? How can I avoid this behavior?

Many thanks for any advice,

Tobias

Tobias Fritz

Jun 21, 2015, 5:06:08 AM

to tesser...@googlegroups.com

Update: I found out that it is apparently a Preview.app problem, as it works correctly in Adobe Reader. Still, is there anything I can do about it? I read with Skim.app, which depends on PDFKit, so it would be great if it worked correctly there too.

supriya Das

Jun 22, 2015, 12:18:11 AM

to tesser...@googlegroups.com

Hello Tobias,

I am trying to build Tesseract 3.04 in Visual Studio 2010, but there is an issue with the Leptonica version. Which version of Leptonica did you use? Please advise.

Thanks in advance.

Jeff Breidenbach

Jun 29, 2015, 3:45:37 AM

to tesser...@googlegroups.com

Unfortunately, I think there is nothing we can do. I've done everything I can to maximize compatibility with various PDF rendering engines, but Preview uses particularly terrible text extraction heuristics. To be fair, the root problem is the design and complexity of the PDF specification itself.

H. Mijail Antón Quiles

Jul 19, 2016, 10:34:23 AM

to tesseract-ocr

I just spent a couple of hours debugging a workflow, because the final generated PDF seemed to have been OCR'd but with every character being a space.
It turns out that the problem was not in the workflow but in my use of Preview.app, as explained in this thread. Acrobat Reader does extract the correct text when selecting and copying.

I see a number of other questions in the forum that could be related to this same problem, so I've just added a FAQ entry (https://github.com/tesseract-ocr/tesseract/wiki/FAQ#the-produced-searchable-pdf-seems-to-only-contain-spaces).
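
For anyone hitting this while debugging a pipeline, the three symptoms reported in this thread (letters separated by spaces, words run together, text that is nothing but spaces) can be checked on text copied out of a viewer. Below is a rough sketch in Python, assuming the text has already been extracted by some other means. The function name `diagnose_extracted_text` and both thresholds are made up for illustration and are not part of Tesseract.

```python
import string


def diagnose_extracted_text(text, max_word_len=20, single_char_ratio=0.8):
    """Return a list of symptom labels for text copied out of a PDF viewer.

    Hypothetical helper: the labels and thresholds are arbitrary choices,
    not anything defined by Tesseract or by a PDF library.
    """
    stripped = text.strip()

    # Symptom 3: the viewer returned characters, but all of them are whitespace.
    if text and not stripped:
        return ["only_whitespace"]

    tokens = stripped.split()
    if not tokens:
        return []

    symptoms = []

    # Symptom 1: "t h i s i s ..." means nearly every token is one character.
    single = sum(1 for t in tokens if len(t) == 1)
    if len(tokens) >= 4 and single / len(tokens) >= single_char_ratio:
        symptoms.append("letter_spaced")

    # Symptom 2: "thisisanexample..." shows up as an implausibly long
    # alphabetic token once surrounding punctuation is removed.
    for t in tokens:
        core = t.strip(string.punctuation)
        if core.isalpha() and len(core) > max_word_len:
            symptoms.append("missing_spaces")
            break

    return symptoms
```

If the same PDF gives a clean result when the text comes from Adobe Reader but a symptom when it comes from Preview.app, that points at the viewer rather than at the OCR step. The length check is crude: a short run such as "thisisanexample" stays under the default limit, so lower `max_word_len` if that matters.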
