Re: Using pdf.js for extracting PDF text / structured data

Julian Viereck

Dec 21, 2011, 11:27:18 AM
to Nick Rabinowitz, mozilla-d...@lists.mozilla.org
Hi Nick,

At the moment, there is no API to extract text information, but there is
some work going on to add this. You can see the progress here:

https://github.com/mozilla/pdf.js/pull/964

Is this something like what you're looking for?

Best,

Julian

On Sat, Dec 17, 2011 at 5:44 PM, Nick Rabinowitz
<nick.ra...@gmail.com> wrote:

> Hello -
>
> This seems like an amazing project! I had a quick question - I'm
> interested in using pdf.js for extracting text and structured data
> from PDFs on the web (no rendering required). How difficult would it
> be to adapt the library for this use? I've looked around the code a
> bit, but I don't have any sense of the API the PDFJS.PDFDoc object
> exposes.
>
> Any thoughts you have on this would be appreciated - just trying to
> get a quick sense whether this is a reasonable road to start down.
>
> Thanks!
>
> -Nick Rabinowitz
> _______________________________________________
> dev-pdf-js mailing list
> dev-p...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-pdf-js
>

Julian Viereck

Dec 21, 2011, 12:59:20 PM
to Nick Rabinowitz, mozilla-d...@lists.mozilla.org
> I know that this is a non-trivial task; one option I've considered is to
> wait for the SVG rendering work to be complete and then look into
> extracting formatted content from the SVG DOM - if it's not currently in
> the roadmap, I might throw in a feature request for a way to render only
> the SVG, skipping the canvas-based rendering, and return the SVG DOM root
> from a function call like page.getSVG() - probably this would need to
> accept a callback to be run when the SVG was ready.

Extracting information from a PDF is, as you said, non-trivial. I'm not sure
the SVG backend will help you much here. You will be able to inspect the DOM
nodes, but there won't be anything like a "table" DOM element that makes it
easier to get the information you're looking for. In fact, "tables" are
just a bunch of lines on the screen with some text at the right positions,
such that the result looks like a "table" to a human being.
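
To illustrate the point about tables being nothing more than text at the right positions, here is a minimal sketch of how rows could be rebuilt from positioned text. It is not part of pdf.js: the function name groupIntoRows, the {str, x, y} chunk shape, and the tolerance value are all assumptions made for this example, and it assumes a smaller y means higher on the page.

```javascript
// Hypothetical input: positioned text chunks of the shape {str, x, y}.
// This shape is an assumption for illustration, not a pdf.js API.
function groupIntoRows(chunks, tolerance = 2) {
  // Sort top-to-bottom, then left-to-right. PDF user space often has y
  // growing upward instead; flip the comparison if that applies.
  const sorted = [...chunks].sort((a, b) => a.y - b.y || a.x - b.x);

  const rows = [];
  for (const chunk of sorted) {
    const last = rows[rows.length - 1];
    // A chunk joins the current row when its baseline is close enough to
    // the baseline of the chunk that started the row.
    if (last && Math.abs(last.y - chunk.y) <= tolerance) {
      last.cells.push(chunk);
    } else {
      rows.push({ y: chunk.y, cells: [chunk] });
    }
  }

  // Order the cells of each row by x and keep only their text.
  return rows.map((row) =>
    row.cells.sort((a, b) => a.x - b.x).map((cell) => cell.str)
  );
}

const sampleChunks = [
  { str: "Alice", x: 50, y: 120 },
  { str: "Name", x: 50, y: 100 },
  { str: "Age", x: 200, y: 100.5 },
  { str: "31", x: 200, y: 119 },
];
const tableRows = groupIntoRows(sampleChunks);
console.log(tableRows);
```

Grouping by baseline only recovers rows. Finding columns, merged cells, or ruled lines would need more heuristics, which is why a "table" element cannot simply be read off the page.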

You might find this interesting:

https://github.com/jviereck/pdf.js/blob/svg/svg/irqueue.txt

This is a dump of what the internally used "IR" looks like. This is the
IRQueue for the second page of the trace monkey paper. As far as I can
tell, the SVG backend will use the same IRQueue to build the SVG DOM.

> My intended use is to enable (some) PDF scraping with my Pjscrape library
> (https://github.com/nrabinowitz/pjscrape), which runs on top of PhantomJS
> (a headless webkit implementation). I hadn't thought this would be possible
> until I saw the pdf.js library - very cool.

That sounds pretty cool and interesting. I'd like to know what you come up
with, or where you get stuck with this :)

Best,

Julian

On Wed, Dec 21, 2011 at 6:40 PM, Nick Rabinowitz
<nick.ra...@gmail.com> wrote:

> Hello Julian -
>
> Thanks, this looks very promising! Being able to call something like
> page.extractTextContent() would be great, at least as a start. I'm
> particularly interested in anything that would give me some elements of the
> page structure as well - my basic use case is extracting structured content
> from a PDF containing a table of information. I know that this is a
> non-trivial task; one option I've considered is to wait for the SVG
> rendering work to be complete and then look into extracting formatted
> content from the SVG DOM - if it's not currently in the roadmap, I might
> throw in a feature request for a way to render only the SVG, skipping the
> canvas-based rendering, and return the SVG DOM root from a function call
> like page.getSVG() - probably this would need to accept a callback to be
> run when the SVG was ready.
>
> My intended use is to enable (some) PDF scraping with my Pjscrape library (
> https://github.com/nrabinowitz/pjscrape), which runs on top of PhantomJS
> (a headless webkit implementation). I hadn't thought this would be possible
> until I saw the pdf.js library - very cool.
>
> I'll try to pull this request locally and see if it meets my immediate
> needs - thanks for sharing!
>
> Yours,
> -Nick

Nick Rabinowitz

Dec 21, 2011, 12:40:42 PM
to Julian Viereck, mozilla-d...@lists.mozilla.org

Nick Rabinowitz

Dec 22, 2011, 10:06:09 AM
to Julian Viereck, mozilla-d...@lists.mozilla.org
Well, I'm more interested in supporting "by example" scraping of
identically-formatted documents than semantic scraping of arbitrary
documents - so just being able to break up the text into uniquely
reference-able chunks is a good step in this direction. Even just splitting
your extractTextContent value into an array of text chunks might be enough.
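
As a rough sketch of what "by example" scraping over an array of text chunks could look like: record the positions of the interesting values in one example document, then read the same positions from other documents with the same layout. The names buildTemplate and applyTemplate are hypothetical, and the code assumes the extraction step yields a plain array of strings in a stable order, which the thread does not confirm.

```javascript
// Learn which chunk index holds each field, using one example document.
// `wanted` maps a field name to the value it has in the example.
function buildTemplate(exampleChunks, wanted) {
  const template = {};
  for (const [field, value] of Object.entries(wanted)) {
    const index = exampleChunks.indexOf(value);
    if (index === -1) {
      throw new Error(`Example value not found: ${value}`);
    }
    template[field] = index;
  }
  return template;
}

// Read the same indices from another, identically formatted document.
// Fields whose index is out of range come back as null.
function applyTemplate(template, chunks) {
  const record = {};
  for (const [field, index] of Object.entries(template)) {
    record[field] = index < chunks.length ? chunks[index] : null;
  }
  return record;
}

const exampleDoc = ["Invoice", "No.", "1001", "Total", "$25.00"];
const otherDoc = ["Invoice", "No.", "1002", "Total", "$40.50"];

const template = buildTemplate(exampleDoc, { number: "1001", total: "$25.00" });
const record = applyTemplate(template, otherDoc);
console.log(record);
```

This only works while the chunk order is identical across documents. A value that wraps onto a second line in one document would shift every later index.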

Thanks again!

-Nick
