
extracting images from pdf files


Jos van den Oever

Nov 28, 2012, 5:09:39 AM
to dev-p...@lists.mozilla.org
Hi all,

Is there a more or less stable API in pdf.js to extract embedded images
from pdf files? Rendering is not needed but getting embedded CCITTFax
images as original and/or PNG is.

Perhaps there is some example code for achieving this?

/**
 * Read all images from a PDF file and return them via callback.
 * If the returned blob is null, no more images will be reported.
 *
 * @param {!string} uri pdf location
 * @param {!function(string,Blob):undefined} imageHandler image handler
 */
function getImages(uri, imageHandler) {
...
}

Best regards,
Jos

Julian Viereck

Nov 28, 2012, 3:13:51 PM
to Jos van den Oever, dev-p...@lists.mozilla.org
I don't think there is an API to extract images :/

The way it goes right now is like this:

- a single page gets extracted from the PDF and all required objects
for the page are built up (e.g. font data, images, etc.)
- the images are sent to the main thread using the messageHandler. In
the case of images, you want to look at this line:

https://github.com/mozilla/pdf.js/blob/master/src/api.js#L598

If you want a good solution, you would have to implement
something like an imageExtractor in the PartialEvaluator that looks
for images, parses only them (to get good performance) and then sends
them back to the main thread in a way that lets you catch them
more easily.

If you want to implement such an image extractor, I'm happy to give you
guidance to get going.

Best,

Julian

Jos van den Oever

Nov 28, 2012, 5:15:39 PM
to dev-p...@lists.mozilla.org
On 11/28/2012 09:13 PM, Julian Viereck wrote:
> I don't think there is an API to extract images :/
>
> The way it goes right now is like this:
>
> - a single page gets extracted from the PDF and all required objects for
> the page are built up (e.g. font data, images, etc.)
> - the images are sent to the main thread using the messageHandler. In
> the case of images, you want to look at this line:
>
> https://github.com/mozilla/pdf.js/blob/master/src/api.js#L598
>
> If you want a good solution, you would have to implement
> something like an imageExtractor in the PartialEvaluator that looks for
> images, parses only them (to get good performance) and then sends them
> back to the main thread in a way that lets you catch them more easily.

I understand that the API is really laid out for rendering and not for
accessing or even editing parts of the documents.

> If you want to implement such an image extractor, I'm happy to give you
> guidance to get going.

For the moment, I will stick with my Java-based server solution that
uses PdfBox, since the adaptation would take more time than I currently
have available for this.

Thank you for the explanation and the offer of guidance,
Jos

Julian Viereck

Feb 6, 2013, 8:53:56 AM
to dev-p...@lists.mozilla.org, fridaka...@gmail.com
Someone asked me via private mail for some more details on this one, so here I go:


Basically, take a look at this line:

https://github.com/mozilla/pdf.js/blob/master/src/evaluator.js#L247

The `PartialEvaluator_getOperatorList` function goes over the drawing operations. For some of them, images are generated for the drawing, which happens at line #247 linked above. If you only want to get the images, you should loop over the operation list and look only for commands that call into `buildPaintImageXObject`.

Take a look at the `PartialEvaluator_getTextContent` function at

https://github.com/mozilla/pdf.js/blob/master/src/evaluator.js#L676

which is already a "simpler" iteration over the operation list.
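
As a rough illustration of the loop described above, here is a minimal sketch that scans an operation list and keeps only the image-painting entries. It does not call into pdf.js. It assumes the list is a pair of parallel arrays, `fnArray` and `argsArray`, and that image operations carry the names listed in `IMAGE_OPS`. Both are assumptions that should be checked against the evaluator.js revision in use. `collectImageOps` and the sample data are hypothetical names made up for this example.

```javascript
// Operation names that paint an image. These names are an assumption;
// verify them against the evaluator.js revision you are working with.
const IMAGE_OPS = new Set([
  'paintImageXObject',
  'paintJpegXObject',
  'paintInlineImageXObject',
  'paintImageMaskXObject',
]);

// Hypothetical helper: operatorList is assumed to look like
// { fnArray: string[], argsArray: any[][] } with matching indices.
function collectImageOps(operatorList) {
  const found = [];
  operatorList.fnArray.forEach((fn, i) => {
    if (IMAGE_OPS.has(fn)) {
      found.push({ index: i, fn: fn, args: operatorList.argsArray[i] });
    }
  });
  return found;
}

// Hand-written sample data, not output from a real PDF.
const sample = {
  fnArray: ['save', 'transform', 'paintImageXObject', 'restore', 'paintJpegXObject'],
  argsArray: [[], [1, 0, 0, 1, 0, 0], ['img_p0_1', 640, 480], [], ['img_p0_2', 100, 50]],
};

console.log(collectImageOps(sample).map((op) => op.fn));
```

Skipping every non-image operation is what would give an extractor its performance advantage over a full render, since fonts and paths never need to be parsed.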

Once you have that in place, you have to catch the image objects that are sent from the worker to the client. The images are sent to the client from this line:

https://github.com/mozilla/pdf.js/blob/master/src/evaluator.js#L306

On the main thread, the images are then received at one of these two lines:

https://github.com/mozilla/pdf.js/blob/master/src/api.js#L616

or

https://github.com/mozilla/pdf.js/blob/master/src/api.js#L656
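
To make the worker-to-client hand-off more concrete, here is a small self-contained sketch of the pattern: one side sends tagged objects over a channel and the other side registers a handler that keeps only the images. `TinyMessageHandler` and `catchImages` are hypothetical stand-ins written for this example, not the pdf.js classes. The `'obj'` action name and the `'Image'` / `'JpegStream'` / `'Font'` type tags are likewise assumptions about the message layout, so compare them with the api.js lines linked above.

```javascript
// Hypothetical stand-in for the worker/main-thread channel.
// It dispatches synchronously and is NOT the pdf.js MessageHandler.
class TinyMessageHandler {
  constructor() {
    this.handlers = new Map();
  }

  // Register exactly one handler per action name.
  on(action, handler) {
    if (this.handlers.has(action)) {
      throw new Error('Handler already registered for "' + action + '"');
    }
    this.handlers.set(action, handler);
  }

  // Deliver data to the handler registered for the action.
  send(action, data) {
    const handler = this.handlers.get(action);
    if (!handler) {
      throw new Error('No handler for "' + action + '"');
    }
    handler(data);
  }
}

// Forward only image objects to onImage; ignore fonts and everything else.
// The [id, type, payload] layout is an assumption made for this sketch.
function catchImages(channel, onImage) {
  channel.on('obj', (data) => {
    const [id, type, payload] = data;
    if (type === 'Image' || type === 'JpegStream') {
      onImage(id, payload);
    }
  });
}

const channel = new TinyMessageHandler();
const received = [];
catchImages(channel, (id, payload) => received.push({ id: id, payload: payload }));

// Simulate what the worker side might send.
channel.send('obj', ['img_p0_1', 'Image', { width: 2, height: 1, data: new Uint8Array(8) }]);
channel.send('obj', ['font_1', 'Font', { name: 'Helvetica' }]);

console.log(received.map((r) => r.id));
```

In the real code the two sides live in different threads and the payload crosses a `postMessage` boundary, so the handler runs asynchronously. The sketch keeps everything synchronous only so it can run on its own.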

I know I'm simplifying things a lot, and if you don't have a full understanding of how the code works, it might be difficult to get going. My advice is to add some breakpoints at the lines I've listed above, look at the stack trace and figure out from there how things work together.

If you have any further questions, let me know!

Best,

Julian
