Digitizing to text photocopied PDFs

Skookum

Jun 22, 2009, 4:20:29 PM

I'm so ignorant I'm not even sure I've titled this right ... but
anyway: Here's what I'd like to do but don't know how. As a university
student, I download a lot of research articles as PDFs. I have Acrobat
Professional, and many of those papers are already digitized (if that's
the right word), i.e. you can use the search function to find a
specific string or word in the text. But quite a few are just scans
and are essentially pictures of pages.

Now, if I print such scanned pages and rescan them using some of the
conversion software, I can get a rough text version (lots of errors but
basically usable). It strikes me that there must be a simpler way of
taking a PDF document that is a picture of a page and converting it to
text, but darned if I know how!

Any guidance out there?

Peter Flynn

Jun 22, 2009, 7:09:11 PM

OCR programs work with bitmap files, so you don't need to scan the
printed version: just extract the images (e.g. with pdfimages) and convert
them to an acceptable format (e.g. TIFF) with ImageMagick.

$ pdfimages file.pdf image

(this produces image-000.ppm, image-001.ppm, image-002.ppm, etc, or
perhaps .jpg, depending on what pdfimages finds in there)

$ for f in image-*.ppm; do convert "$f" "${f%.ppm}.tif"; done

(or something similar, depending).
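
If pdfimages hands back a mix of formats, the loop above can be
generalized a little. What follows is only a rough sketch, not a tested
recipe: the function names tif_name and convert_pages are made up for
illustration, and the list of extensions (.ppm, .pbm, .jpg) is an
assumption about what pdfimages might write for a given file.

```shell
#!/bin/sh
# Hypothetical helper: map an extracted image name to a TIFF name by
# replacing whatever extension it has, e.g. image-000.ppm -> image-000.tif.
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Hypothetical helper: convert every extracted page image in the
# current directory to TIFF using ImageMagick's convert.
convert_pages() {
    # Assumed extensions; adjust to whatever pdfimages actually produced.
    for f in image-*.ppm image-*.pbm image-*.jpg; do
        [ -e "$f" ] || continue   # skip a pattern that matched no files
        convert "$f" "$(tif_name "$f")"
    done
}
```

After running pdfimages as shown earlier, calling convert_pages in the
same directory should leave one .tif per page image, ready to feed to
whichever OCR program you use.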

///Peter
