Tesseract from git and pdf output


simon.ei...@vol.at

02.10.2014, 08:25:21
to tesser...@googlegroups.com
Hi all,

I compiled tesseract from git yesterday and played with it a little bit.
Pretty impressive what has happened over the last 2 years or so.
Not only is tesseract smaller in file size, it also seems faster and more accurate.

But to the topic of this message:

I used the following command to read from a tif file and convert it
into a searchable pdf:

$ tesseract image.tif -l eng image pdf

which resulted in a text file with a very accurate OCR result, plus a PDF.
When I opened the PDF in Adobe Reader, I got an error saying that something
is wrong with the file I had just created.

The PDF was also large, around 1 MB, which is a lot for just a little bit of
text.

Am I doing something wrong here?

I have a feature request as well.
More and more copiers/scanners output multipage PDFs that contain
images of the scanned pages.
Is support for those as input files planned as well?

greetings,
simon



--
Simon Eigeldinger
simon.ei...@vol.at

zdenko podobny

02.10.2014, 08:30:07
to tesser...@googlegroups.com
Please post your input and output files somewhere.

Zdenko



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/web-797443707%40stalker1.tele.net.
For more options, visit https://groups.google.com/d/optout.

simon.ei...@vol.at

02.10.2014, 09:00:50
to tesser...@googlegroups.com
hello,

the files are here:
https://www.dropbox.com/s/9u3nkk1hahyu9o7/image.zip?dl=0

and the console output is:


$ tesseract image.tif image -l eng pdf

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found


greetings,
simon

--
Simon Eigeldinger
simon.ei...@vol.at

Shree Devi Kumar

02.10.2014, 09:55:47
to tesser...@googlegroups.com
Usually that error occurs when pdf.ttf and pdf.ttx are not in your tessdata directory.

Please check that the files from https://code.google.com/p/tesseract-ocr/source/browse/#git%2Ftessdata are present in the tessdata directory that TESSDATA_PREFIX points to.
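
As a rough way to run that check, here is a small Python sketch. It only looks for the two file names mentioned above. The function names are made up for illustration, and the assumption that TESSDATA_PREFIX may name either the tessdata directory itself or its parent (depending on the Tesseract version) is mine, so both locations are tried.

```python
import os
from pathlib import Path

# File names taken from the reply above; nothing else is checked.
REQUIRED_FILES = ("pdf.ttf", "pdf.ttx")


def missing_pdf_support_files(tessdata_dir):
    """Return the names from REQUIRED_FILES that are absent from tessdata_dir."""
    directory = Path(tessdata_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


def candidate_tessdata_dirs(prefix):
    """Return the directories worth checking for a TESSDATA_PREFIX value.

    Assumption: the prefix is either the tessdata directory itself or its
    parent, so both are returned.
    """
    base = Path(prefix)
    return [base, base / "tessdata"]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        print("TESSDATA_PREFIX is not set")
    else:
        for directory in candidate_tessdata_dirs(prefix):
            print(directory, "missing:", missing_pdf_support_files(directory))
```

An empty "missing" list for one of the printed directories means both files were found there. It says nothing about whether the files themselves are intact.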

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


simon.ei...@vol.at

02.10.2014, 10:00:37
to tesser...@googlegroups.com
hi,

pdf.ttf and pdf.ttx are in the tessdata directory,
as are all the other language files, which load fine.

greetings,
simon

--
Simon Eigeldinger
simon.ei...@vol.at