extract plain text from pdf via acrobat or gsview

Günter Bachmann

May 28, 2004, 11:08:59 AM

Hi Group,
I'm a PDF newbie. I often have to extract plain text from
PDF documents. Sometimes it works fine with the text selection tools of
Acrobat Reader.
Sometimes this does not work. What is the reason? I assume it depends on
the creation method, but I can't find a hint in the document
properties. Here is one example where text selection is disabled and I
don't know how to get the plain text:
http://www.srh.de/internet/ec3bd652-ada4-11d8-9a3f-0002b34c0328 (250k)

I tried it this way: I printed the PDF to a PostScript .prn file, opened the
file with GSview (Ghostscript) and tried to extract the text. Since I
have no idea how this program works (sorry, I have the German version and I'm
not sure whether my wording below matches the English version), I
tried:
- File/Extract, which generates a *.ps file (??)
- various settings in options/PStoText
- edit/extract text, which brings up error -7 (regardless of which PDF file I
try):
--- End offending input ---
gsapi_run_string_continue returns -7
Unrecoverable error: invalidaccess in put
Operand stack:
false false
nND --nostringval-- --nostringval-- --nostringval--
PermitFileReading --nostringval--
true --nostringval-- --nostringval-- --nostringval--
PermitFileReading --nostringval--
Extracting text using pstotext...
Ghostscript returns error code -7
and so on.

I assume that my GSview settings or the way I use it are at fault. Do you
have any hints for solving my problem?
Thanks a lot
Günter

Perhaps you know a way

Larry T.

May 28, 2004, 11:34:06 AM

Hi Günter,

I am not an expert, but I did look at your PDF. It is not searchable with
Adobe Reader 6.0, so I suspect it contains no text layer and is more
like a TIFF file. There is no text, and that is why you cannot extract text
(which is not always easy or even possible with PDFs). Your best bet would be to
run the PDF through OCR software and then edit the output accordingly.
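
One way to check this suspicion without Acrobat is to ask Ghostscript for the text it can find and see whether anything comes back. The following is a minimal sketch, not a definitive method. It assumes a Ghostscript build that includes the `txtwrite` output device and an executable named `gs` (on Windows it is usually `gswin32c` or `gswin64c`). The file name `document.pdf` and all helper functions are hypothetical.

```python
import subprocess


def build_text_probe_command(pdf_path, gs_executable="gs"):
    """Build a Ghostscript command that writes extractable text to stdout."""
    return [
        gs_executable,          # assumed executable name; adjust per platform
        "-q",                   # suppress the banner so stdout holds only text
        "-dNOPAUSE",            # do not pause between pages
        "-dBATCH",              # exit after the last page
        "-dSAFER",              # restrict file access
        "-sDEVICE=txtwrite",    # text extraction device (assumed available)
        "-sOutputFile=-",       # send the output to stdout
        pdf_path,
    ]


def has_text_layer(extracted_text):
    """Return True if the extracted output holds any non-whitespace character."""
    return any(not ch.isspace() for ch in extracted_text)


def probe_pdf(pdf_path, gs_executable="gs"):
    """Run Ghostscript and report whether the PDF seems to contain real text."""
    result = subprocess.run(
        build_text_probe_command(pdf_path, gs_executable),
        capture_output=True,
        text=True,
        check=True,
    )
    return has_text_layer(result.stdout)
```

Calling `probe_pdf("document.pdf")` would be expected to return `False` for a file whose pages contain only images or outlined glyphs, which is the situation where OCR is needed.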

Larry T.

Günter Bachmann

May 29, 2004, 3:13:24 AM

Thanks for your answers. Now I understand why the text is not selectable in
Acrobat. Do you have any idea whether it would be possible with
GSview/Ghostscript? Or is that off topic here, and is there a dedicated
GSview newsgroup? I don't know whether the route PDF -> PS followed by text
extraction can work. Here is what I tried in GSview:

> I tried it this way: I printed the PDF to a PostScript .prn file, opened
> the file with GSview (Ghostscript) and tried to extract the text.
> Since I have no idea how this program works (sorry, I have the German
> version and I'm not sure whether my wording below matches the
> English version), I tried:
> - File/Extract, which generates a *.ps file (??)
> - various settings in options/PStoText
> - edit/extract text, which brings up error -7 (regardless of which PDF file I
> try):
> --- End offending input ---
> gsapi_run_string_continue returns -7
> Unrecoverable error: invalidaccess in put
> Operand stack:
> false false
> nND --nostringval-- --nostringval-- --nostringval--
> PermitFileReading --nostringval--
> true --nostringval-- --nostringval-- --nostringval--
> PermitFileReading --nostringval--
> Extracting text using pstotext...
> Ghostscript returns error code -7
> and so on.
>
> I assume that my GSview settings or the way I use it are at fault. Do you
> have any hints for solving my problem?
Thanks

Günter

Michael Hemmer

Jun 1, 2004, 4:36:45 AM

Günter Bachmann wrote:
> Thanks for your answers. Now I understand why the text is not selectable in
> Acrobat. Do you have any idea whether it would be possible with
> GSview/Ghostscript?

Please reconsider Fabrizio's statement
| The text present in your pdf is a vector image.
| It is not text. It is graphic.
and then explain why you think any software other than Acrobat would be
able to extract text when there's none.

If you have an OCR program that can read from TIFF, BMP or another
bitmapped graphics format (not only from a scanner), conversion to that
format from PDF using Ghostscript *might* be a feasible way.
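
The conversion step could look roughly like the sketch below, which rasterizes each PDF page to a TIFF file that an OCR program can then read. It is an illustration under assumptions, not a tested recipe: the executable name `gs`, the `tiffg4` device, the 300 dpi resolution, the file names, and both helper functions are choices made for this example.

```python
import subprocess


def build_rasterize_command(pdf_path, output_pattern="page-%03d.tif",
                            dpi=300, device="tiffg4", gs_executable="gs"):
    """Build a Ghostscript command that renders each page to a bitmap file."""
    return [
        gs_executable,                     # assumed name; gswin32c on Windows
        "-dNOPAUSE",                       # do not pause between pages
        "-dBATCH",                         # exit after the last page
        "-dSAFER",                         # restrict file access
        f"-sDEVICE={device}",              # bilevel TIFF with G4 compression
        f"-r{dpi}",                        # resolution in dots per inch
        f"-sOutputFile={output_pattern}",  # %03d expands to the page number
        pdf_path,
    ]


def rasterize_pdf(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_rasterize_command(pdf_path, **options), check=True)
```

For example, `rasterize_pdf("document.pdf", dpi=400)` would write `page-001.tif`, `page-002.tif`, and so on, ready to be loaded into the OCR program. For pages with gray or color content, a device such as `tiffgray` or `tiff24nc` may suit better than the black-and-white `tiffg4`.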

Michael