extracting objects from pdf

Michał Nowotka

Feb 28, 2013, 9:17:55 AM

to dev-p...@lists.mozilla.org

Hello,

The general problem I have with pdf.js is that it seems to be tailored
to the viewer. I would like to use pdf.js for more than viewing
PDFs online. It would be nice if pdf.js had methods like:

- getImages(page) - extract all images in their original resolution
along with information about their size and coordinates in the
document.
- getText(page) - extract text from pdf without rendering it.
- getObjects(page) - return all objects from the specified page with
information about their type, size and coordinates.
- toXML - outputs the hierarchy of its distinct logical elements in
an XML format, similar to http://pdfx.cs.man.ac.uk/

So far, to extract images with pdf.js, I have been hacking
paintInlineImageXObject, where imgData is ready to use. The problem
with this approach is that the method is only called when an image needs
to be rendered, so to get all images I need to scroll through
the whole document. A second problem is that the image size depends on
the zoom settings.

It would be nice to have something independent of viewing.

Regards,
Michal Nowotka

Julian Viereck

Mar 1, 2013, 6:32:17 AM

to dev-p...@lists.mozilla.org

Hi Michal,

Some functionality is already available in api.js:

- getTextContent: https://github.com/mozilla/pdf.js/blob/master/src/api.js#L419
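As a rough illustration of text extraction without rendering, here is a minimal sketch built on `getTextContent`. It assumes the result has the `{ items: [{ str, transform }] }` shape returned by recent pdf.js releases (the version linked above may return something different), and `itemsToLines` and `getText` are hypothetical helper names, not part of pdf.js.

```javascript
// Group text items into lines by their baseline y-coordinate.
// Assumption: each item looks like { str, transform }, where transform[4]
// and transform[5] are the x and y offsets in PDF user space.
function itemsToLines(items, tolerance = 2) {
  const lines = [];
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    // Reuse an existing line if its baseline is within the tolerance.
    let line = lines.find((l) => Math.abs(l.y - y) <= tolerance);
    if (!line) {
      line = { y, parts: [] };
      lines.push(line);
    }
    line.parts.push({ x, str: item.str });
  }
  // PDF user space has y growing upwards, so a larger y is higher on the page.
  lines.sort((a, b) => b.y - a.y);
  return lines.map((l) =>
    l.parts
      .sort((a, b) => a.x - b.x)
      .map((p) => p.str)
      .join(" ")
  );
}

// `page` is anything with a getTextContent() method that returns a promise,
// for example a page proxy obtained from pdf.js.
async function getText(page) {
  const content = await page.getTextContent();
  return itemsToLines(content.items).join("\n");
}
```

Grouping by baseline is only one possible heuristic; multi-column layouts or rotated text would need more care.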

To extract the images, you have to pass an imageLayer to the CanvasGraphics object. It will record the images as they are rendered to the canvas:

- https://github.com/mozilla/pdf.js/blob/master/src/canvas.js#L217
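A minimal sketch of such an imageLayer object might look like the following. It assumes that CanvasGraphics calls `beginLayout`, `appendImage` and `endLayout` on the layer, and that `appendImage` receives an object with `left`, `top`, `width`, `height` and either `imgData` or `objId`. Please check these names against the canvas.js version you are using. `createImageCollector` and `getImages` are hypothetical names, not part of pdf.js.

```javascript
// Hypothetical collector implementing the hooks that CanvasGraphics is
// assumed to call on an imageLayer while it draws a page.
function createImageCollector() {
  const images = [];
  return {
    // Assumed to be called once before drawing starts.
    beginLayout() {
      images.length = 0;
    },
    // Assumed to be called once per painted image, with position and size
    // expressed in canvas units for the current viewport.
    appendImage(image) {
      images.push({
        left: image.left,
        top: image.top,
        width: image.width,
        height: image.height,
        source: image.imgData || image.objId || null,
      });
    },
    // Assumed to be called once after drawing ends; nothing to do here.
    endLayout() {},
    // Not part of pdf.js: returns a copy of what was recorded.
    getImages() {
      return images.slice();
    },
  };
}
```

In releases that accept it, the collector would be handed over with the render parameters, for example `page.render({ canvasContext, viewport, imageLayer: collector })`. The page still has to be rendered, so the recorded coordinates depend on the viewport scale.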

I don't think there is anything like `getObjects(page)` or `toXML` at this point. Adding a `getObjects(page)` function might not be too complicated, and I'm happy to give you some pointers in case you would like to implement it.
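One possible starting point, sketched under the assumption that the page exposes `getOperatorList()` returning `{ fnArray, argsArray }` (as recent pdf.js releases do) and that the operator table is passed in as `OPS` (for example `pdfjsLib.OPS`). The `getObjects` function below is hypothetical and only lists paint operations. It does not compute sizes or coordinates, which would require tracking the transform operators as well.

```javascript
// Hypothetical getObjects(page, OPS): list the paint operations on a page
// without rendering it to a canvas.
async function getObjects(page, OPS) {
  // Invert the name -> code table so that codes can be reported by name.
  const names = {};
  for (const [name, code] of Object.entries(OPS)) {
    names[code] = name;
  }

  const { fnArray, argsArray } = await page.getOperatorList();
  const objects = [];
  for (let i = 0; i < fnArray.length; i++) {
    const type = names[fnArray[i]];
    // Keep only operators whose name starts with "paint", which covers
    // image and form XObject painting in the assumed operator table.
    if (type && type.startsWith("paint")) {
      objects.push({ type, args: argsArray[i] });
    }
  }
  return objects;
}
```

Usage would be along the lines of `getObjects(page, pdfjsLib.OPS).then(console.log)`, where `page` comes from `pdfDocument.getPage(n)`.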

Let me know if you have any further questions!

Best,

Julian

hc.de...@gmail.com

Mar 3, 2015, 2:41:20 PM

to mozilla-d...@lists.mozilla.org

It's been two years since you posted your response, but I have the same question as the original poster: is it possible to extract the objects? Using PDFPageProxy and api.js, I haven't been able to extract any objects from a PDF document created in InDesign.

If you'd be so kind as to point me in the right direction, I'd be very grateful.

AlQemist

Mar 23, 2015, 11:22:12 AM

to mozilla-d...@lists.mozilla.org

Check out TET from PDFlib.com.