PDF output not searchable within SumatraPDF

Chris Cameron

Oct 14, 2014, 9:46:58 PM

to tesser...@googlegroups.com

This command:

$ tesseract.exe 18.jpg test

Gives me "test.txt", which has all the text from 18.jpg, as expected.

This command:

$ tesseract.exe 18.jpg test pdf

Gives me "test.pdf". When I open it in SumatraPDF, most of the sentences that exist in test.txt do not appear to be searchable: all of the PDF text can be highlighted, but a search from within the viewer finds only fragments of sentences. When I open the same file in Adobe Reader, the find function locates all of the text.
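One way to narrow this down is to compare the plain-text output with whatever text a separate extractor pulls out of the PDF. The following is only a minimal sketch, not part of Tesseract: the helper names are hypothetical, and it assumes the PDF text layer has already been dumped to a string by some other tool (for example `pdftotext test.pdf test_from_pdf.txt`, where the output file name is made up for illustration). A viewer's search may not behave exactly like such an extractor, so treat the result as a hint rather than proof.

```python
import re


def normalize(text):
    """Collapse every run of whitespace to one space and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by a space."""
    parts = re.split(r"(?<=[.!?])\s+", normalize(text))
    return [p for p in parts if p]


def unsearchable_sentences(reference_text, pdf_text):
    """Return the sentences of reference_text that do not occur as one
    contiguous run in pdf_text (both compared after normalization)."""
    haystack = normalize(pdf_text)
    return [s for s in split_sentences(reference_text) if s not in haystack]


# Hypothetical usage, assuming both files exist next to the script:
#   reference = open("test.txt", encoding="utf-8").read()
#   extracted = open("test_from_pdf.txt", encoding="utf-8").read()
#   for sentence in unsearchable_sentences(reference, extracted):
#       print(sentence)
```

If whole sentences are reported while their individual words are still present in the extracted text, that would suggest the words are split or ordered differently in the PDF text layer, which could explain why only fragments are found.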

My environment:

$ tesseract.exe -v

tesseract 3.04.00

leptonica-1.71

libjpeg 8d : libpng 1.5.18 : libtiff 4.0.3 : zlib 1.2.8

SumatraPDF v2.5.2

Adobe Reader 11.0.07

Can someone help me out with why this might be happening?

Thanks,

Chris

simon.ei...@vol.at

Oct 15, 2014, 2:55:53 AM

to tesser...@googlegroups.com

Hi,

I have the same issue with Adobe Reader 11 and Tesseract 3.04.00 (most recent git), compiled under Cygwin.

Adobe Reader can't open the PDF; it says the file is broken.

greetings,
simon

--
Simon Eigeldinger
simon.ei...@vol.at

zdenko podobny

Oct 15, 2014, 3:11:59 AM

to tesser...@googlegroups.com

Can you post 18.jpg somewhere?

Zdenko

Chris Cameron

Oct 15, 2014, 12:06:15 PM

to tesser...@googlegroups.com

All the files I mention can be found here:

https://www.dropbox.com/sh/v5w4zl0c2z1wra1/AACxjmomYL4o-iQEhBrLvNgHa

Incidentally, I now see that Chrome's PDF viewer is also unable to search the PDF.

Thanks,

Chris

Simon Eigeldinger

Oct 15, 2014, 12:22:04 PM

to tesser...@googlegroups.com

Hi,

It seems you at least get some output. I wonder if it's just TIFF files, but TIFF files currently seem to be a complete no-go with Tesseract when outputting a PDF.

Additionally, it should be possible to leave out the picture and add only the extracted text to the PDF, which would make the file smaller.

greetings,
simon

--
Simon Eigeldinger
Follow me on Twitter: http://www.twitter.com/domasofan/
E-Mail: simon.ei...@vol.at
MSN: simon_ei...@hotmail.com
ICQ: 121823966
Jabber: doma...@andrelouis.com
