Tools and methods to extract text from PDF files?

Ramon F Herrera

Oct 29, 2009, 12:15:33 PM

I am involved in a project which requires storing some text
(programmatically) in PDF documents. I guess my first step would be to
look at how Adobe does it. I was surprised to see that the text
discovered by the Adobe OCR phase is stored one way in the PDF
file, while the text discovered by another OCR company is stored
differently. Perhaps they are trying to stay out of each other's way?

In any event, some of my questions are: Is the mechanism to store text
in the PDF file documented? Is there some sort of standard?

Tools that extract such words from PDF files could be useful in my
research.

TIA,

-Ramon

ken

Oct 30, 2009, 3:52:06 AM

In article <791c07c4-e66d-4f99-bff1-
d475a7...@m13g2000vbf.googlegroups.com>, ra...@conexus.net says...

> In any event, some of my questions are: Is the mechanism to store text
> in the PDF file documented? Is there some sort of standard?

You need to read the PDF Reference Manual, which is available from the
Adobe web site. Warning: text is stored in an encoded fashion. While it
*may* be ASCII or similar, it equally well may not be, and it depends
(amongst other things) on the font being used.
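
To make the font dependence concrete, here is a minimal Python sketch. The
content-stream fragment, the font name /F1 and the code-to-character table
are all made up for illustration; a real file would carry that mapping in
the font's encoding or a ToUnicode map, and real streams are usually
compressed and use more operators than Tj.

```python
import re

# Hypothetical content-stream fragment: select font /F1, then show a hex
# string of one-byte character codes with the Tj operator.
CONTENT_STREAM = b"BT /F1 12 Tf <0102030304> Tj ET"

# Hypothetical code-to-character table for /F1 (assumed for illustration).
# Note that none of these codes are printable ASCII.
F1_CODE_TO_CHAR = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def decode_hex_text(stream, code_map):
    """Decode every <hex> Tj string in the stream, one byte per character code."""
    chars = []
    for match in re.finditer(rb"<([0-9A-Fa-f\s]*)>\s*Tj", stream):
        hex_digits = re.sub(rb"\s", b"", match.group(1)).decode("ascii")
        if len(hex_digits) % 2:
            # An odd-length PDF hex string is treated as if a final 0 followed.
            hex_digits += "0"
        for code in bytes.fromhex(hex_digits):
            # Codes missing from the table become U+FFFD (replacement character).
            chars.append(code_map.get(code, "\ufffd"))
    return "".join(chars)


if __name__ == "__main__":
    print(decode_hex_text(CONTENT_STREAM, F1_CODE_TO_CHAR))
```

Without the right table for the font, the same bytes decode to nothing
useful, which is why extraction from an arbitrary file can fail.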

This is a complex subject, and in the general case there is no guarantee
of being able to recover text from a PDF file in any way other than
printing and OCR'ing it.

That being said, since you are generating the text, it's perfectly
possible to ensure that you can get it back out again. Just don't assume
that you can do this with any random PDF file.

> Tools that extract such words from PDF files could be useful in my
> research.

Ghostscript has a simple tool, ps2ascii, which can extract text, but it
is not well supported.
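
If you want to drive it from a script, something like the following Python
sketch should do. It assumes your Ghostscript install ships the ps2ascii
script on the PATH and that, given a single input file, it writes the
extracted text to standard output. The function names and the file name
report.pdf are placeholders of mine.

```python
import shutil
import subprocess


def ps2ascii_command(pdf_path):
    """Build the command line: ps2ascii with one input file, text on stdout."""
    return ["ps2ascii", pdf_path]


def extract_text(pdf_path):
    """Run ps2ascii on pdf_path and return whatever text it prints."""
    if shutil.which("ps2ascii") is None:
        # Assumption: the script is installed alongside Ghostscript.
        raise RuntimeError("ps2ascii not found on PATH")
    result = subprocess.run(
        ps2ascii_command(pdf_path),
        capture_output=True,
        text=True,
        errors="replace",  # tolerate bytes that are not valid in the locale encoding
        check=True,        # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


if __name__ == "__main__":
    print(extract_text("report.pdf"))  # placeholder file name
```

Treat the output as a best effort; the caveats about font encodings still
apply.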

Ken
