Replacing non-JPEG images with JPEG images...

nerdy...@gmail.com

Sep 2, 2013, 11:48:20 PM

to pdfhummus-in...@googlegroups.com

I'm making some progress with this but it's slow going.

Please, how do I read all the images of a PDF file and replace them with JPEG versions of themselves?

Thank you...

Gal Kahana

Sep 3, 2013, 7:46:37 AM

to pdfhummus-in...@googlegroups.com

You'll need to change the resource references so that they point to the new images you create.

You can do that either while modifying an existing PDF, or by importing the PDF and the images into a new PDF.

Assuming you are going for modification, you'll have to recreate the resources dictionary with references to the new images.

Follow this process:

1. Open the PDF document for modification.

2. Get the parser and loop over the pages. For each page, get the page dictionary object and get its resources dictionary.

3. Loop over the resources dictionary to get to the image XObjects. For each image XObject:

3.1. If it is an image that you want to replace, import a JPG image into the PDF file. Record the resulting form/image ID alongside the existing image ID.

4. You will need to update the page object now. Create a modified object (using ObjectsContext::StartModifiedIndirectObject) and copy into it all entries of the page dictionary except the resources dictionary.

5. Create a resources dictionary entry in the page (or as an indirect object) and copy into it all entries except the XObject entry. Then create an XObject entry in it and copy the keys of the old XObject dictionary into it; wherever you have a registered replacement, put the new object ID instead of the old one.

6. If you haven't ended the page/resources dictionary objects, do so now.

Now you should have a new PDF with the images updated, and with the pages updated to use new resources dictionaries that point to the new images instead of the old ones.
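The remapping in step 5 can be sketched without the library. This is only a model of the bookkeeping, not Hummus code: `ObjectId`, `XObjectDict`, `ReplacementMap` and `rebuildXObjects` are hypothetical names, and the object ID type is assumed to be an unsigned integer. In a real run the IDs would come from the parser and from the image import calls.

```cpp
#include <map>
#include <string>

// Stand-in for the library's object ID type (assumption: an unsigned integer).
using ObjectId = unsigned long;

// XObject resource name (e.g. "Im1") -> ID of the object it refers to.
using XObjectDict = std::map<std::string, ObjectId>;

// Old image object ID -> object ID of the newly imported JPEG (step 3.1).
using ReplacementMap = std::map<ObjectId, ObjectId>;

// Build the XObject dictionary for the rewritten page (step 5):
// keep every key, swap in the new ID where a replacement was registered,
// and copy all other entries unchanged.
XObjectDict rebuildXObjects(const XObjectDict& oldXObjects,
                            const ReplacementMap& replacements) {
    XObjectDict result;
    for (const auto& entry : oldXObjects) {
        const auto it = replacements.find(entry.second);
        result[entry.first] =
            (it != replacements.end()) ? it->second : entry.second;
    }
    return result;
}
```

For example, with old entries Im1 -> 7, Im2 -> 9, Fm1 -> 12 and registered replacements 7 -> 40 and 9 -> 41, the rebuilt dictionary keeps all three names, points Im1 and Im2 at 40 and 41, and leaves Fm1 at 12.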

It could have been a little shorter if there were a way to define new images as replacements for the old ones while modifying, but there isn't.

There's something a little more comfortable if you are creating a new PDF and importing the one with the original images into it page by page. Then you can simply create the image objects in advance in the new PDF, use the copying context to mark the new images as replacements for the old ones, and then import the pages from the old PDF. The underlying copying mechanism will know not to copy the old images and will use the new representatives instead.

Regards,

Gal.

nerdy...@gmail.com

Sep 3, 2013, 6:57:16 PM

to pdfhummus-in...@googlegroups.com

Thank you! I will likely take your latter suggestion and use PDFDocumentCopyingContext::ReplaceSourceObjects() to replace the images. What's vague to me is how to take existing PDFImageXObjects and turn them into actual images, with pixels and stuff. I don't see any way of doing this. This is necessary in order to take the existing images in a PDF document, in whatever format they may be in, and turn them into JPEGs of some quality and resolution.

Is there a way of doing this?

Thank you...

Gal Kahana

Sep 4, 2013, 7:01:10 AM

to pdfhummus-in...@googlegroups.com

I know that some people have used ImageMagick for this: http://www.imagemagick.org/script/index.php

This seems to show how to do that: http://pario.no/2008/02/02/extracting-images-from-adobe-acrobat-pdf-file/

hth,

Gal.

Gal Kahana

Sep 4, 2013, 7:02:18 AM

to pdfhummus-in...@googlegroups.com

I mean... you could fetch the images' raw data with Hummus, but it will require more work, as the library doesn't provide this as a high-level service.

Gal.
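As a rough illustration of what working with the raw data involves, here is a small sketch of two checks one might run on an image XObject's dictionary values before re-encoding. It does not use the Hummus API; `isDctEncoded` and `decodedImageByteCount` are hypothetical helpers, and the values passed in (the /Filter names, /Width, /Height, /BitsPerComponent and the number of color components) are assumed to have been read from the image dictionary already. It relies on two points from the PDF format: JPEG data is stored with the DCTDecode filter, and each row of decoded samples is padded to a whole number of bytes.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// True if the stream's filter list contains DCTDecode, meaning the image
// data is already JPEG-compressed and may not need replacing.
bool isDctEncoded(const std::vector<std::string>& filters) {
    for (const auto& filter : filters) {
        if (filter == "DCTDecode") {
            return true;
        }
    }
    return false;
}

// Number of bytes expected once the image stream is fully decoded.
// Samples are packed bit by bit, and every row starts on a byte boundary.
std::uint64_t decodedImageByteCount(std::uint64_t width,
                                    std::uint64_t height,
                                    unsigned components,
                                    unsigned bitsPerComponent) {
    const std::uint64_t bitsPerRow = width * components * bitsPerComponent;
    const std::uint64_t bytesPerRow = (bitsPerRow + 7) / 8;  // round up
    return bytesPerRow * height;
}
```

A size that does not match the decoded stream length is a hint that the color space or bit depth was misread, which is worth catching before handing the pixels to a JPEG encoder.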
