PDF extract text

Philipp Kraus

Apr 7, 2014, 12:53:06 AM
Hello,

how can I extract text, images and other structures can be ignored,
with PHP from a PDF file?
We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
extract only the text content
to create a text analysis of the content eg for LaTeX scripts we would
like the chapter structure as well.

Is there any solution to do this with build-in PHP functions?

Thanks

Phil

Thomas 'PointedEars' Lahn

Apr 7, 2014, 7:44:10 AM
Philipp Kraus wrote:

> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?

For example with “PDF Parser”. You cannot have searched before posting; it
took me less than a minute to find that out with the Google keywords “pdf
php read”.

<http://www.catb.org/~esr/faqs/smart-questions.html>

> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.

PDF files generated with pdflatex usually contain that as TOC metadata.

> Is there any solution to do this with build-in PHP functions?
^t
No.

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
Message has been deleted

Thomas 'PointedEars' Lahn

Apr 7, 2014, 1:38:55 PM
Michael Vilain wrote:

> Philipp Kraus <philip...@flashpixx.de> wrote:
>> how can I extract text, images and other structures can be ignored,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> with PHP from a PDF file?
>> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
^^^^^ ^^^^^^^^^^
>> extract only the text content
>> to create a text analysis of the content eg for LaTeX scripts we would
>> like the chapter structure as well.
>>
>> Is there any solution to do this with build-in PHP functions?
>
> I tried a bunch of stuff to read some bank statements that were in PDF
> format so I could import them via CSV. Didn't work out so well. Adobe's
> OCR feature only works if the PDFs are unlocked to allow it. I found an
> application that would do that but the OCRed text was unusable.
>
> So, my question is "what's generating the PDF files?"

The ability to read can be of advantage sometimes …

> Can you get whomever to do it in text or some other format?

OMG. One can leave it to you to give the worst possible technical advice.

> If they're encrypted images, then you've got a lot of work to do in order
> to get some output. Maybe.

Nobody but you is talking about images and OCR. You really don't have a
clue what PDF is, do you?

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Christoph Michael Becker

Apr 7, 2014, 3:52:47 PM
Thomas 'PointedEars' Lahn wrote:

> Philipp Kraus wrote:
>
>> Is there any solution to do this with build-in PHP functions?
> ^t
> No.

Well, there may not be a solution to do this with built-in PHP functions
(whatever a built-in PHP function might be; actually (almost) all PHP
functions are part of an extension), but at least *theoretically* it
would be possible by processing the PDF file "bytewise". (The PDF
specification is available online for free.)
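
To give a rough idea of what processing a PDF "bytewise" involves, here is a
minimal sketch, written in Python for brevity; the same steps map onto PHP's
string and zlib functions. It assumes a simple, unencrypted file whose page
content sits in plain or Flate-compressed streams and whose text is shown
with the Tj and TJ operators using literal "(...)" strings. The names
extract_text and unescape are made up for this illustration. Font encodings,
hex strings, nested parentheses, octal escapes and object streams are all
ignored, so real-world files will often need a proper parser.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string "(...)" with backslash escapes (nested parentheses unsupported).
LITERAL_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")
# Text-showing operators: "(text) Tj" or "[(te) -20 (xt)] TJ".
SHOW_RE = re.compile(
    rb"(\((?:\\.|[^\\()])*\))\s*Tj|\[((?:\\.|[^\]\\])*)\]\s*TJ", re.DOTALL
)
ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t"}


def unescape(raw: bytes) -> bytes:
    """Resolve simple backslash escapes; octal escapes are not handled."""
    return re.sub(
        rb"\\(.)",
        lambda m: ESCAPES.get(m.group(1), m.group(1)),
        raw,
        flags=re.DOTALL,
    )


def extract_text(pdf: bytes) -> str:
    """Collect the strings passed to Tj/TJ, one output line per operator."""
    pieces = []
    for stream in STREAM_RE.finditer(pdf):
        data = stream.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # assume the stream was not Flate-compressed
        for show in SHOW_RE.finditer(data):
            if show.group(1) is not None:
                strings = [show.group(1)]
            else:
                strings = LITERAL_RE.findall(show.group(2))
            pieces.append(b"".join(unescape(s[1:-1]) for s in strings))
    # Latin-1 maps every byte to a character; real fonts may use other encodings.
    return "\n".join(p.decode("latin-1") for p in pieces)
```

Calling extract_text(open("file.pdf", "rb").read()) on such a file returns
the shown strings in stream order, which is not necessarily reading order.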

--
Christoph M. Becker

Thomas 'PointedEars' Lahn

Apr 7, 2014, 4:12:04 PM
*rolls eyes*

*bags collected trolls’ eyes*