Hello,
The general problem I have with pdf.js is that it seems to be tailored
almost entirely to the viewer. I would like to use pdf.js for more than
viewing PDFs online. It would be nice if pdf.js had methods like:
- getImages(page) - extract all images in their original resolution
along with information about their size and coordinates in the
document.
- getText(page) - extract text from the PDF without rendering it.
- getObjects(page) - return all objects on the specified page with
information about their type, size and coordinates.
- toXML - output the hierarchy of the document's distinct logical
elements in an XML format, similar to
http://pdfx.cs.man.ac.uk/
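
To make the request concrete, here is a rough sketch of how such an
interface could look. Everything in it (PageExtractor, InMemoryExtractor,
the field names) is made up for illustration and is not part of pdf.js.
It is written in TypeScript and runs on hand-built sample data rather
than a real PDF, so it only shows the shape of the results I have in mind.

```typescript
// Hypothetical sketch: none of these types or classes exist in pdf.js.

interface Rect { x: number; y: number; width: number; height: number; }

interface ImageObject {
  type: "image";
  bounds: Rect;        // position and size on the page
  pixelWidth: number;  // original resolution, independent of zoom
  pixelHeight: number;
}

interface TextObject {
  type: "text";
  bounds: Rect;
  text: string;
}

type PageObject = ImageObject | TextObject;

// The methods asked for in the list above (names are assumptions).
interface PageExtractor {
  getObjects(page: number): PageObject[];
  getImages(page: number): ImageObject[];
  getText(page: number): string;
  toXML(): string;
}

// Stand-in implementation backed by a plain object keyed by page number.
class InMemoryExtractor implements PageExtractor {
  private pages: Record<number, PageObject[]>;

  constructor(pages: Record<number, PageObject[]>) {
    this.pages = pages;
  }

  getObjects(page: number): PageObject[] {
    return this.pages[page] ?? [];
  }

  getImages(page: number): ImageObject[] {
    return this.getObjects(page).filter(
      (o): o is ImageObject => o.type === "image"
    );
  }

  getText(page: number): string {
    return this.getObjects(page)
      .filter((o): o is TextObject => o.type === "text")
      .map((o) => o.text)
      .join("\n");
  }

  toXML(): string {
    const escape = (s: string) =>
      s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
    const lines: string[] = ["<document>"];
    for (const key of Object.keys(this.pages)) {
      lines.push(`  <page number="${key}">`);
      for (const o of this.pages[Number(key)]) {
        const b = o.bounds;
        const attrs = `x="${b.x}" y="${b.y}" width="${b.width}" height="${b.height}"`;
        lines.push(
          o.type === "text"
            ? `    <text ${attrs}>${escape(o.text)}</text>`
            : `    <image ${attrs} pixelWidth="${o.pixelWidth}" pixelHeight="${o.pixelHeight}"/>`
        );
      }
      lines.push("  </page>");
    }
    lines.push("</document>");
    return lines.join("\n");
  }
}

// Sample data standing in for a parsed one-page document.
const extractor = new InMemoryExtractor({
  1: [
    { type: "text", bounds: { x: 72, y: 700, width: 200, height: 12 }, text: "Hello" },
    { type: "image", bounds: { x: 72, y: 400, width: 320, height: 240 }, pixelWidth: 1600, pixelHeight: 1200 },
    { type: "text", bounds: { x: 72, y: 380, width: 200, height: 12 }, text: "World" },
  ],
});

console.log(extractor.getText(1));
console.log(extractor.getImages(1).length);
console.log(extractor.toXML());
```

The point of the sketch is that every call works from the page number
alone: nothing has to be rendered, and the image dimensions are reported
in original pixels rather than in whatever size the viewer is showing.
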
So far, to extract images with pdf.js, I have been hacking
paintInlineImageXObject, where imgData is ready to use. The problem
with this approach is that the method is only called when an image
needs to be rendered, so to get all images I have to scroll through
the whole document. A second problem is that the image size depends on
the zoom settings.
It would be nice to have something independent of viewing.
Regards,
Michal Nowotka