PDF extract text

Philipp Kraus

Apr 7, 2014, 12:53:06 AM

Hello,

How can I extract the text from a PDF file with PHP? Images and other
structures can be ignored.
We have a lot of LaTeX PDFs and PowerPoint PDFs and would like to
extract only the text content
to create a text analysis of the content, e.g. for LaTeX scripts we would
like the chapter structure as well.

Is there any solution to do this with build-in PHP functions?

Thanks

Phil

Thomas 'PointedEars' Lahn

Apr 7, 2014, 7:44:10 AM

Philipp Kraus wrote:

> how can I extract text, images and other structures can be ignored,
> with PHP from a PDF file?

For example with “PDF Parser”. You cannot have searched before posting; it
took me less than a minute to find that out with the Google keywords “pdf
php read”.

<http://www.catb.org/~esr/faqs/smart-questions.html>

> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to
> extract only the text content
> to create a text analysis of the content eg for LaTeX scripts we would
> like the chapter structure as well.

PDF files generated with pdflatex usually contain that as TOC metadata.

> Is there any solution to do this with build-in PHP functions?

                                            ^t
No.

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Message has been deleted

Thomas 'PointedEars' Lahn

Apr 7, 2014, 1:38:55 PM

Michael Vilain wrote:

> Philipp Kraus <philip...@flashpixx.de> wrote:
>> how can I extract text, images and other structures can be ignored,

                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>> with PHP from a PDF file?
>> We have a lot of LaTeX PDFs and Powerpoint PDFs and would like to

                    ^^^^^          ^^^^^^^^^^

>> extract only the text content
>> to create a text analysis of the content eg for LaTeX scripts we would
>> like the chapter structure as well.
>>
>> Is there any solution to do this with build-in PHP functions?
>

> I tried a bunch of stuff to read some bank statements that were in PDF
> format so I could import them via CSV. Didn't work out so well. Adobe's
> OCR feature only works if the PDFs are unlocked to allow it. I found an
> application that would do that but the OCRed text was unusable.
>
> So, my question is "what's generating the PDF files?"

The ability to read can be of advantage sometimes …

> Can you get whomever to do it in text or some other format?

OMG. One can leave it to you to give the worst possible technical advice.

> If they're encrypted images, then you've got a lot of work to do in order
> to get some output. Maybe.

Nobody but you is talking about images and OCR. You really don't have a
clue what PDF is, do you?

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Christoph Michael Becker

Apr 7, 2014, 3:52:47 PM

Thomas 'PointedEars' Lahn wrote:

> Philipp Kraus wrote:
>
>> Is there any solution to do this with build-in PHP functions?
>                                            ^t
> No.

Well, there may not be a solution to do this with built-in PHP functions
(whatever a built-in PHP function might be; actually (almost) all PHP
functions are part of an extension), but at least *theoretically* it
would be possible by processing the PDF file "bytewise". (The PDF
specification is available online for free.)
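
To illustrate what such "bytewise" processing could look like, here is a minimal sketch. It assumes the zlib extension is loaded and that the page content sits in plain or FlateDecode-compressed streams. The function name extractPdfText is hypothetical and not part of PHP or of any library. The sketch ignores font encodings, hex strings, object streams, encryption and the ' and " operators, so it will only produce readable text for simple files.

```php
<?php
declare(strict_types=1);

/**
 * Naive text extraction (illustrative only): scan every stream, inflate it
 * if possible, and collect the string operands of the text-showing
 * operators Tj and TJ. The name extractPdfText is hypothetical.
 */
function extractPdfText(string $pdf): string
{
    // Stream data sits between the "stream" and "endstream" keywords.
    if (!preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $streams)) {
        return '';
    }

    // A PDF literal string: "(...)" with backslash escapes, no nested parentheses.
    $str = '\((?:\\\\.|[^\\\\()])*\)';
    // Operand of Tj (one string) or TJ (array of strings and kerning numbers).
    $showText = '/(' . $str . '|\[(?:' . $str . '|[^\]()])*\])\s*T[jJ]/s';

    $lines = [];
    foreach ($streams[1] as $raw) {
        // FlateDecode streams are zlib data; anything else is scanned as-is.
        $data = @gzuncompress($raw);
        if ($data === false) {
            $data = $raw;
        }

        preg_match_all($showText, $data, $ops);
        foreach ($ops[1] as $operand) {
            preg_match_all('/' . $str . '/s', $operand, $parts);
            $line = '';
            foreach ($parts[0] as $part) {
                // Drop the enclosing parentheses and resolve \(, \), \\, \ddd.
                $line .= stripcslashes(substr($part, 1, -1));
            }
            $lines[] = $line;
        }
    }

    // One output line per text-showing operator.
    return implode("\n", $lines);
}
```

It could be called as `echo extractPdfText(file_get_contents('slides.pdf'));`, where the file name is only a placeholder. Two caveats apply to the use case in this thread. pdflatex output often encodes word spaces as kerning numbers inside TJ arrays, so words may run together unless those numbers are interpreted. Fonts with custom or CID encodings need their ToUnicode maps before the extracted bytes become readable characters.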

--
Christoph M. Becker

Thomas 'PointedEars' Lahn

Apr 7, 2014, 4:12:04 PM

*rolls eyes*

*bags collected trolls’ eyes*