Extracting text from PDF

Eitan

Jan 4, 2010, 8:09:33 AM

to tesseract-ocr

Hi

I am a newbie...
Is there a standard way to extract text from PDF using tesseract-ocr ?

Thanks

nguyenq

Jan 4, 2010, 2:24:54 PM

to tesseract-ocr

No, you would have to convert PDF to an image before feeding it to the
OCR engine. Ghostscript supports such PDF conversion tasks.
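
A minimal sketch of that route in Python, assuming the `gs` and `tesseract`
executables are on the PATH. The helper names, the `tiffgray` device, the
300 dpi resolution, and the `page-%03d.tif` output pattern are illustrative
choices, not something prescribed in this thread.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_pattern="page-%03d.tif", dpi=300):
    """Build a Ghostscript command that renders each PDF page to a grayscale TIFF."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER", "-q",  # run unattended and quietly
        "-sDEVICE=tiffgray",                      # assumed device; tiffg4 or png16m also exist
        f"-r{dpi}",                               # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",            # %03d expands to the page number
        str(pdf_path),
    ]


def tesseract_command(image_path, lang="eng"):
    """Build a tesseract command; the text goes to <image name without suffix>.txt."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")      # tesseract appends .txt itself
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, workdir="."):
    """Render every page with Ghostscript, then OCR each rendered page."""
    workdir = Path(workdir)
    pattern = str(workdir / "page-%03d.tif")
    subprocess.run(ghostscript_command(pdf_path, pattern), check=True)
    for page in sorted(workdir.glob("page-*.tif")):
        subprocess.run(tesseract_command(page), check=True)
```

Calling `ocr_pdf("scan.pdf")` (a hypothetical file name) would leave one
`page-NNN.txt` file per page next to the rendered TIFFs.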

Jeffrey Ratcliffe

Jan 4, 2010, 5:26:29 PM

to tesser...@googlegroups.com

On Mon, Jan 04, 2010 at 11:24:54AM -0800, nguyenq wrote:
> > Is there a standard way to extract text from PDF using tesseract-ocr ?
>

> No, you would have to convert PDF to an image before feeding it to the
> OCR engine. Ghostscript supports such PDF conversion tasks.

I would not recommend that, as it resamples the image. The pdfimages
program extracts raster images from a PDF. You can then feed these to
tesseract.

If the text is actually stored as text, rather than as images, then
pdftotext will extract it.
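
A rough sketch of how the two tools could be combined, assuming `pdftotext`
and `pdfimages` (from Xpdf or Poppler) are on the PATH. The function names
and the 20-character threshold are hypothetical. The threshold is only a
heuristic for deciding whether the PDF carries a usable text layer.

```python
import subprocess


def pdftotext_command(pdf_path):
    # "-" as the output file makes pdftotext write to standard output.
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def pdfimages_command(pdf_path, image_root):
    # Writes image_root-000.ppm, image_root-001.pbm, ... without resampling.
    return ["pdfimages", str(pdf_path), str(image_root)]


def has_text_layer(extracted, min_chars=20):
    """Heuristic: enough non-whitespace characters means the PDF stores real text."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def extract(pdf_path, image_root="img"):
    """Return the embedded text if there is any; otherwise dump the images for OCR."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, text=True, check=True
    )
    if has_text_layer(result.stdout):
        return result.stdout
    subprocess.run(pdfimages_command(pdf_path, image_root), check=True)
    return None  # the extracted images are on disk, ready to feed to tesseract
```

Whether tesseract can read the extracted PPM/PBM files directly depends on
the tesseract build; older builds may need a conversion to TIFF first.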

Chris Faust

Jan 4, 2010, 2:21:20 PM

to tesser...@googlegroups.com

Personally, I would just use Image::Magick or GD to convert the .pdf into a
.tiff and then simply have tesseract ocr it.

Someone else may have a better solution though.

James Le Cuirot

Jan 4, 2010, 2:49:48 PM

to tesser...@googlegroups.com

Unless your PDF consists of images, this is not the way to go. PDF
is a document format, not an image format. Use a tool like pdftotext.

James

Hussein Al-Hussein

Jan 5, 2010, 12:37:42 AM

to tesser...@googlegroups.com

In addition to everything that has been suggested: if you have Adobe Acrobat (Writer), version 6 or later, installed, go to the File menu, choose Save As, and select an image type such as JPG. Each page will then be saved as a separate image.

Hussein Al-Hussein

Hussein Al-Hussein

Jan 5, 2010, 12:44:59 AM

to tesser...@googlegroups.com

However, if your PDF files are structured documents containing real text rather than inserted images, there are tools that extract all the text. Adobe even has a free Java toolkit, which I have used to access words, images, etc. I used it to search through a document and to extract all the words, word frequencies, and so on.

See the attached getting started guide and the link below. It has been 3 years since I did that, but I remember it was called XPAAJ:

http://blogs.adobe.com/mikepotter/2006/07/download_xpaaj.html

Hussein Al-Hussein

Getting_Started.pdf

bharath bhooshan

Jan 5, 2010, 2:05:18 AM

to tesser...@googlegroups.com

Does the PDF contain images or text?

If it is an image PDF, the answer is yes: you can try it with tesseract.

If it is a normal PDF, the answer is no with tesseract, but you can extract the text programmatically using PDFBox. Enjoy!
