Extracting text from PDF


Eitan

Jan 4, 2010, 8:09:33 AM
to tesseract-ocr
Hi

I am a newbie...
Is there a standard way to extract text from a PDF using tesseract-ocr?

Thanks

nguyenq

Jan 4, 2010, 2:24:54 PM
to tesseract-ocr
No, you would have to convert PDF to an image before feeding it to the
OCR engine. Ghostscript supports such PDF conversion tasks.
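
For anyone wanting to script this, here is a rough Python sketch of the Ghostscript step. It is an illustration under assumptions, not a command given in this thread: the `tiffgray` device, the 300 dpi default, and the `page-%03d.tif` naming are my own choices, and `gs` is assumed to be on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a grayscale TIFF."""
    # Assumed naming scheme: page-001.tif, page-002.tif, ...
    out_pattern = Path(out_dir) / "page-%03d.tif"
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",  # run non-interactively and exit when done
        "-sDEVICE=tiffgray",                # assumed output device: 8-bit grayscale TIFF
        f"-r{dpi}",                         # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def rasterize(pdf_path, out_dir, dpi=300):
    """Run Ghostscript and return the TIFF pages it wrote, in page order."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir, dpi), check=True)
    return sorted(Path(out_dir).glob("page-*.tif"))
```

Each file returned by `rasterize("scan.pdf", "pages")` can then be passed to tesseract one page at a time.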

Jeffrey Ratcliffe

Jan 4, 2010, 5:26:29 PM
to tesser...@googlegroups.com
On Mon, Jan 04, 2010 at 11:24:54AM -0800, nguyenq wrote:
> > Is there a standard way to extract text from PDF using tesseract-ocr ?
>
> No, you would have to convert PDF to an image before feeding it to the
> OCR engine. Ghostscript supports such PDF conversion tasks.

I would not recommend that, as it resamples the image. The pdfimages
program extracts the raster images from a PDF, and you can then feed
these to tesseract.

If the text is actually stored as text, rather than as images, then
pdftotext will extract it.
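
A minimal Python sketch of both routes, in case it helps. The helper names and the `img` file root are made up for illustration, and the sketch assumes pdfimages, pdftotext and tesseract are all on the PATH. It also assumes your tesseract build can read the PPM/PBM files that pdfimages writes by default; older builds accepted only TIFF, so a conversion step may be needed there.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root):
    """pdfimages writes each embedded image as <image_root>-NNN.ppm or .pbm."""
    return ["pdfimages", str(pdf_path), str(image_root)]


def tesseract_command(image_path):
    """tesseract takes an image and an output base name; it writes <base>.txt."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def pdftotext_command(pdf_path, txt_path):
    """pdftotext copies the PDF's own text layer into a plain text file."""
    return ["pdftotext", str(pdf_path), str(txt_path)]


def ocr_embedded_images(pdf_path, work_dir):
    """Extract embedded images with pdfimages, OCR each one, return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, work_dir / "img"), check=True)
    pages = []
    for image in sorted(work_dir.glob("img-*.p[bp]m")):
        subprocess.run(tesseract_command(image), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

For a scanned PDF, `ocr_embedded_images("scan.pdf", "work")` returns the OCR text. For a PDF with real text, running `pdftotext_command("doc.pdf", "doc.txt")` through `subprocess.run` is enough and no OCR is involved.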


Chris Faust

Jan 4, 2010, 2:21:20 PM
to tesser...@googlegroups.com
Personally, I would just use Image::Magick or GD to convert the .pdf into a
.tiff and then simply have tesseract ocr it.

Someone else may have a better solution though.
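
As a rough illustration of the ImageMagick route, here is a Python sketch that calls the ImageMagick command-line tool rather than the Perl Image::Magick module mentioned above. The 300 dpi value and the output naming are assumptions. Note that ImageMagick delegates PDF rendering to Ghostscript, so Ghostscript still needs to be installed.

```python
import subprocess


def imagemagick_command(pdf_path, out_pattern="page-%03d.tif", dpi=300, binary="convert"):
    """Build an ImageMagick command that renders PDF pages to TIFF files.

    -density must come before the input file so that it controls the
    resolution at which the PDF is rasterized. On ImageMagick 7 the
    binary is usually "magick" instead of "convert".
    """
    return [binary, "-density", str(dpi), str(pdf_path), "-depth", "8", str(out_pattern)]


def convert_pdf(pdf_path, out_pattern="page-%03d.tif", dpi=300, binary="convert"):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(imagemagick_command(pdf_path, out_pattern, dpi, binary), check=True)
```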


James Le Cuirot

Jan 4, 2010, 2:49:48 PM
to tesser...@googlegroups.com

Unless your PDF consists of images, this is not the way to go. PDF
is a document format, not an image format. Use a tool like pdftotext.

James

Hussein Al-Hussein

Jan 5, 2010, 12:37:42 AM
to tesser...@googlegroups.com
In addition to all that has been suggested, if you have Adobe Acrobat (Writer) installed (version 6 and up), go to the File menu, choose Save As, and select an image type such as JPG; each page will then be saved as a separate image.

Hussein Al-Hussein

Hussein Al-Hussein

Jan 5, 2010, 12:44:59 AM
to tesser...@googlegroups.com

However, if the PDF files you have are structured documents containing real text rather than inserted images, then there are tools to extract all the text. Even Adobe has a free Java toolkit that I have used to access words, images, etc. I used it to search through a document and extract all the words, word frequencies, etc.

See the attached getting started guide and the link below; it has been 3 years since I did that and I remember it was called XPAAJ:

http://blogs.adobe.com/mikepotter/2006/07/download_xpaaj.html

Hussein Al-Hussein


Getting_Started.pdf

bharath bhooshan

Jan 5, 2010, 2:05:18 AM
to tesser...@googlegroups.com
Does the PDF contain images or text?

If it is an image PDF, the answer is yes, and you can try it out with tesseract.

If it is a normal PDF, the answer is no with tesseract, but you can extract the text programmatically using PDFBox. Enjoy!
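
One way to check which kind of PDF you have is to try the text layer first and fall back to OCR only when it comes back empty. Below is a small Python sketch of that idea. It uses pdftotext rather than PDFBox, the 20-character threshold is an arbitrary assumption, and the function names are made up for illustration.

```python
import subprocess


def looks_like_text_pdf(extracted, min_chars=20):
    """Heuristic: a text PDF yields at least min_chars non-whitespace characters."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def extract_text_layer(pdf_path):
    """Run pdftotext with "-" as the output file so the text goes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"], check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def choose_route(pdf_path):
    """Return "text" when the PDF has a usable text layer, otherwise "ocr"."""
    return "text" if looks_like_text_pdf(extract_text_layer(pdf_path)) else "ocr"
```

A scanned document usually produces little more than page-break characters from pdftotext, so `choose_route` would return "ocr" for it, and the page images would then go to tesseract.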
