Text output vs. PDF

Tobias Fritz

Jun 20, 2015, 4:45:45 PM
to tesser...@googlegroups.com
Hi,

I'm using Tesseract 3.04 on OS X. It works very well, but I'm having trouble with searchable PDF output.

I ran Tesseract on a TIFF file and generated PDF output. The text file that is created alongside it is almost accurate, apart from a few small glitches. The text overlay in the PDF, however, is not. In one place it inserted spaces inside the words, like this: t h i s  i s  a n  e x a m p l e.

In another place it removed all the spaces, like this: thisisanexample.

What's the reason for this when the text file is almost perfect? How can I avoid this behavior?

Many thanks for any advice,

Tobias

Tobias Fritz

Jun 21, 2015, 5:06:08 AM
to tesser...@googlegroups.com
Update: I found out that this is apparently a Preview.app problem, since it works correctly in Adobe Reader. Still, is there anything I can do about it? I read with Skim.app, which depends on PDFKit, so it would be great if the output worked correctly there too.

supriya Das

Jun 22, 2015, 12:18:11 AM
to tesser...@googlegroups.com
Hello Tobias,

     I am trying to build Tesseract 3.04 in Visual Studio 2010, but there is an issue with the Leptonica version. Which version of Leptonica did you use? Please advise.
Thanks in advance.

Jeff Breidenbach

Jun 29, 2015, 3:45:37 AM
to tesser...@googlegroups.com
Unfortunately, I think there is nothing we can do. I've done everything I can to 
maximize compatibility with various PDF rendering engines, but Preview uses 
particularly terrible text extraction heuristics. To be fair, the root problem is
the design and complexity of the PDF specification itself.

H. Mijail Antón Quiles

Jul 19, 2016, 10:34:23 AM
to tesseract-ocr
I just spent a couple of hours debugging a workflow because the final PDF appeared to have been OCR'd, but with every character replaced by a space.
It turns out the problem was not in the workflow but in my use of Preview.app, as explained in this thread. Acrobat Reader does extract the correct text when selecting and copying.

I see a number of other questions in the forum that could be related to this same problem, so I've just added a FAQ entry ( https://github.com/tesseract-ocr/tesseract/wiki/FAQ#the-produced-searchable-pdf-seems-to-only-contain-spaces ).
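
For anyone trying to tell a viewer problem apart from an OCR problem, here is a minimal Python sketch that compares the text copied out of a PDF viewer against the contents of Tesseract's text output and names the kind of mismatch described in this thread. The function name `classify_extraction` and the category labels are my own, purely illustrative, and not part of Tesseract. It assumes you have already obtained both strings yourself, for example by copying from the viewer and reading the .txt file.

```python
def classify_extraction(reference: str, extracted: str) -> str:
    """Classify how text copied from a PDF viewer differs from reference text.

    `reference` is the text you trust (e.g. Tesseract's .txt output);
    `extracted` is what the PDF viewer gave back. Hypothetical helper,
    not part of Tesseract.
    """
    # Drop all whitespace so only the recognized characters are compared.
    ref_chars = "".join(reference.split())
    ext_chars = "".join(extracted.split())

    if not ext_chars:
        # The viewer returned nothing but whitespace.
        return "empty"
    if ref_chars != ext_chars:
        # The characters themselves differ, so this is more than a spacing issue.
        return "different"

    ref_words = reference.split()
    ext_words = extracted.split()
    if ref_words == ext_words:
        return "match"
    if len(ext_words) > len(ref_words):
        return "extra-spaces"    # e.g. "t h i s  i s ..."
    if len(ext_words) < len(ref_words):
        return "missing-spaces"  # e.g. "thisisanexample"
    return "moved-spaces"        # same word count, different word boundaries


if __name__ == "__main__":
    reference = "this is an example"
    for sample in ("t h i s  i s  a n  e x a m p l e", "thisisanexample", "    "):
        print(repr(sample), "->", classify_extraction(reference, sample))
```

If the result is "extra-spaces", "missing-spaces" or "empty" in one viewer but "match" in another for the same PDF, that points to the viewer's text extraction rather than to the OCR itself.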