> I know that this is a non-trivial task; one option I've considered is to
> wait for the SVG rendering work to be complete and then look into
> extracting formatted content from the SVG DOM - if it's not currently in
> the roadmap, I might throw in a feature request for a way to render only
> the SVG, skipping the canvas-based rendering, and return the SVG DOM root
> from a function call like page.getSVG() - probably this would need to
> accept a callback to be run when the SVG was ready.
Extracting information from a PDF is, as you said, non-trivial. I'm not sure
the SVG backend will help you much here. You can inspect the DOM nodes, but
there won't be anything like a "table" DOM element that makes it easier to
get the information you're looking for. In fact, "tables" are just a bunch
of lines on the screen with some text at the right positions, so that it
looks like a "table" to a human being.
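To make the point about positioned text concrete, here is a rough sketch of how rows might be recovered from text fragments by clustering on their vertical position. The {str, x, y} item shape, the groupIntoRows name and the tolerance value are all assumptions made up for illustration; none of this is an existing pdf.js API, and real PDFs will need more care (rotated text, multi-line cells, column detection).

```javascript
// Hypothetical input: positioned text fragments, e.g. collected from a
// rendering backend. The {str, x, y} shape is an assumption for this sketch,
// not a pdf.js API.
function groupIntoRows(items, tolerance = 2) {
  // Sort top-to-bottom; assumes y grows downward as in screen coordinates.
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    // Items whose y is within `tolerance` of the row's first item share a row.
    if (last && Math.abs(item.y - last[0].y) <= tolerance) {
      last.push(item);
    } else {
      rows.push([item]);
    }
  }
  // Within each row, order cells left-to-right and keep only the text.
  return rows.map(row => row.sort((a, b) => a.x - b.x).map(cell => cell.str));
}

// Made-up sample data: four fragments that a human would read as a 2x2 table.
const sample = [
  { str: 'Bob', x: 10, y: 40.5 },
  { str: 'Age', x: 80, y: 20 },
  { str: 'Name', x: 10, y: 20.4 },
  { str: '42', x: 80, y: 40 },
];
console.log(groupIntoRows(sample)); // [ [ 'Name', 'Age' ], [ 'Bob', '42' ] ]
```

The tolerance is there because fragments on the same visual line rarely share exactly the same y value; picking it well is part of what makes this non-trivial.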
You might find this interesting:
https://github.com/jviereck/pdf.js/blob/svg/svg/irqueue.txt
This is a dump of what the internally used "IR" looks like. It is the
IRQueue for the second page of the TraceMonkey paper. As far as I can
tell, the SVG backend will use the same IRQueue to build the SVG DOM.
> My intended use is to enable (some) PDF scraping with my Pjscrape library
> (https://github.com/nrabinowitz/pjscrape), which runs on top of PhantomJS
> (a headless webkit implementation). I hadn't thought this would be possible
> until I saw the pdf.js library - very cool.
That sounds pretty cool & interesting. I'd like to know what you come up
with, or where you get stuck with this :)
Best,
Julian
On Wed, Dec 21, 2011 at 6:40 PM, Nick Rabinowitz <nick.ra...@gmail.com> wrote:
> Hello Julian -
>
> Thanks, this looks very promising! Being able to call something like
> page.extractTextContent() would be great, at least as a start. I'm
> particularly interested in anything that would give me some elements of the
> page structure as well - my basic use case is extracting structured content
> from a PDF containing a table of information. I know that this is a
> non-trivial task; one option I've considered is to wait for the SVG
> rendering work to be complete and then look into extracting formatted
> content from the SVG DOM - if it's not currently in the roadmap, I might
> throw in a feature request for a way to render only the SVG, skipping the
> canvas-based rendering, and return the SVG DOM root from a function call
> like page.getSVG() - probably this would need to accept a callback to be
> run when the SVG was ready.
>
> My intended use is to enable (some) PDF scraping with my Pjscrape library
> (https://github.com/nrabinowitz/pjscrape), which runs on top of PhantomJS
> (a headless webkit implementation). I hadn't thought this would be possible
> until I saw the pdf.js library - very cool.
>
> I'll try to pull this request locally and see if it meets my immediate
> needs - thanks for sharing!
>
> Yours,
> -Nick