API Documentation

API Documentation pdfjsNewbie 8/21/12 4:47 AM
I'm new to pdfjs. Where can I find more documentation on how to use it (perhaps with code examples)? I've downloaded the code from github; I'm just wondering if there are examples I can look at. Thanks.
Re: API Documentation Yury Delendik 8/21/12 7:57 AM
On 8/21/2012 6:47 AM, pdfjsNewbie wrote:
> I'm new to pdfjs. Where can I find more documentation on how to use it (perhaps with code examples)? I've downloaded the code from github; I'm just wondering if there are examples I can look at. Thanks.
>

On the https://github.com/mozilla/pdf.js page (and README.md file), the
following examples are listed:

* Hello world: http://jsbin.com/pdfjs-helloworld-v2/edit#html,live
* Simple reader with prev/next page controls:
http://jsbin.com/pdfjs-prevnext-v2/edit#html,live

The project also contains a few examples in the examples/
folder (https://github.com/mozilla/pdf.js/tree/master/examples).

The JavaScript code itself contains the API documentation; see the
documentation comments in the
https://github.com/mozilla/pdf.js/blob/master/src/api.js file.
Re: API Documentation ata.met...@gmail.com 11/26/12 1:00 AM
That documentation is not complete enough.
For example, I wanna know how to load a .PDF file in a way that makes it selectable.
Online samples like this one (http://mozilla.github.com/pdf.js/web/viewer.html) are selectable.
Re: API Documentation Julian Viereck 11/27/12 6:48 AM
> That documentation is not complete enough.
> For example, I wanna know how to load a .PDF file in a way that makes it selectable.
> Online samples like this one (http://mozilla.github.com/pdf.js/web/viewer.html) are selectable.

The API is complete in the sense that the core PDF.JS library is only
about extracting data from a PDF and rendering a page to an HTML canvas
element. Anything beyond that (e.g. text selection) might be provided
in the viewer that ships with the project, but there is no API
for it at this point.

It would be nice to make the viewer more modular and reusable, and
contributions toward this are highly welcome.

Best,

Julian
> _______________________________________________
> dev-pdf-js mailing list
> dev-p...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-pdf-js
Re: API Documentation Julian Viereck 11/28/12 12:17 AM
> One more thing left, and that is I need to have my own viewer for pdf.js.
> I wanna customize the UI, better to say I wanna remove whole toolbar
> and call Events like "page--" and "zoomin" by myself.

You can remove the toolbar from the HTML code. Look at the viewer.html.
I guess there is some JS code that assumes the toolbar to be present.
You will have to change that code.

An easy (but very hacky) way is to set the toolbar's style to
"display:none" to just hide it from the user.
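A minimal sketch of the "display:none" approach, assuming the toolbar element can be looked up by an id. The id 'toolbar' used here is hypothetical; check viewer.html for the real one. The function takes the document as a parameter only so that it can be tried outside a browser.

```javascript
// Hide a toolbar element without removing it from the DOM.
// `doc` is the page's document object; `toolbarId` is a hypothetical element id.
function hideToolbar(doc, toolbarId) {
  var toolbar = doc.getElementById(toolbarId);
  if (!toolbar) {
    return false; // no such element, nothing to hide
  }
  toolbar.style.display = 'none';
  return true;
}

// In a browser: hideToolbar(document, 'toolbar');
```

Any viewer code that reads from the toolbar keeps working with this approach, because the element still exists; it is just not shown.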

Very best,

Julian

On Wed Nov 28 06:57:06 2012, Ata Iravani wrote:
> Thanks for reply
> I found a good sample at "web" directory of downloadable zip file
> (https://github.com/mozilla/pdf.js/archive/master.zip)
> It is really close to what I need.
> One more thing left, and that is I need to have my own viewer for pdf.js.
> I wanna customize the UI, better to say I wanna remove whole toolbar
> and call Events like "page--" and "zoomin" by myself.
> Any way thanks , I really appreciate PDF.JS
> Regards
> Ata Iravani
Re: API Documentation Geoffrey Booth 11/30/12 10:42 PM
I agree with the above, the example viewer is very valuable and deserves some documentation treatment all in itself. I agree that I wish it were more modular so that it would be easier to understand and to take it apart to make custom viewers. Perhaps another example in the pdf.js examples folder could be a viewer with no UI? No toolbar, no keyboard shortcuts, none of that; just the PDF displayed and selectable. That would be an example much easier to analyze, without all the UI stuff in the way, and such a viewer would fill a lot of use cases (such as blog posts that currently have embedded PDFs using Scribd, where the point is really only to scroll through the document and nothing else).

My own usage case is I want to have the browser load a PDF into a viewer that generates selectable divs (in other words, so far everything that web/viewer.html does)—and then the viewer uses Ajax to send the divs to a server (by "divs" I mean the created divs' text plus properties like x-y coordinates and font styling). Basically I want to use pdf.js + the viewer to serve as a client-side PDF parser, and then on the server side I'll do some fun stuff with the div data. Like the other poster I don't want my viewer to have any UI; if I show the PDF at all it would only be as a thumbnail for progress purposes, or I might have it completely hidden.

Julian could you be so kind as to perhaps explain the basics of the div generation functions? Any pointers on where I should look to get started?

Thanks much!
Re: API Documentation Leonard Rosenthol 12/3/12 8:25 AM
On your usage case, you do realize that the conversion from PDF->HTML is a lossy one, yes?  So that if you are only looking at the data in the HTML produced from pdf.js, then you are potentially missing HUGE amounts of information contained in the PDF about the content (as well as metadata, etc.).

Depending on what you are doing with the information - perhaps that's OK - but you should be aware of it.

Leonard

Re: API Documentation Julian Viereck 12/3/12 12:58 PM
> Julian could you be so kind as to perhaps explain the basics of the
> div generation functions? Any pointers on where I should look to get
> started?

What you're looking for is the TextLayerBuilder:

   https://github.com/mozilla/pdf.js/blob/master/web/viewer.js#L2399

You need to pass in the textLayer when rendering the page as you can
find here:

   https://github.com/mozilla/pdf.js/blob/master/web/viewer.js#L2064

> That would be an example much easier to analyze, without all the UI
> stuff in the way, and such a viewer would fill a lot of use cases (such
> as blog posts that currently have embedded PDFs using Scribd, where the
> point is really only to scroll through the document and nothing else).

Having "just something to scroll through" is already very hard to get
right. But I agree that an example that uses the textLayer could be useful.

> Basically I want to use pdf.js + the viewer to serve as a client-side
> PDF parser, and then on the server side I'll do some fun stuff with the
> div data. Like the other poster I don't want my viewer to have any UI;
> if I show the PDF at all it would only be as a thumbnail for progress
> purposes, or I might have it completely hidden.

I don't really get the idea behind this. Is running PDF.JS on the server
not an easier solution?

Best,

Julian

Re: API Documentation Geoffrey Booth 12/3/12 2:32 PM
Hi Julian and Leonard,

Thank you for your guidance. To answer your questions, I have in fact looked at the pdf2json project (https://github.com/modesty/pdf2json) and gotten it to work to the point of generating lots of JavaScript objects representing the text blocks, including the text content and X/Y. Unfortunately, the text blocks I've managed to generate so far cover roughly one letter each, occasionally a few letters or a whole word, but never a whole line like the divs in the pdf.js viewer on the client side (using the same example PDF as in the viewer). I was hoping to find the code in the viewer or pdf.js that joins these one-letter blocks into more sensible longer chunks.

I'm open to parsing the PDF either on the client or server side, whatever works. The goal is to send data to the server very similar to the divs that are created in the viewer, with text content, X/Y and font properties. The use case is to parse PDFs that are created in a systematic way, for example a PDF invoice generated by QuickBooks that always has the payee at a certain X/Y and a table of costs at another X/Y and lines of expenses within the table that can be parsed by X/Y, etc.
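To make the idea concrete, here is a rough sketch of the kind of post-processing described above: joining small text blocks into lines by their Y coordinate, then picking out the lines that fall inside a known region of the page. This is not code from pdf.js or pdf2json. The block shape `{ str, x, y }`, the function names, and the tolerance value are all assumptions made for illustration.

```javascript
// Each block is assumed to look like { str: 'A', x: 10, y: 100 } (hypothetical shape).

// Group blocks whose y coordinates are within `tolerance` of the first block
// of a line, then join each line's blocks from left to right.
function joinBlocksIntoLines(blocks, tolerance) {
  var sorted = blocks.slice().sort(function (a, b) {
    return a.y - b.y || a.x - b.x;
  });
  var lines = [];
  sorted.forEach(function (block) {
    var last = lines[lines.length - 1];
    if (last && Math.abs(block.y - last.y) <= tolerance) {
      last.blocks.push(block);
    } else {
      lines.push({ y: block.y, blocks: [block] });
    }
  });
  return lines.map(function (line) {
    line.blocks.sort(function (a, b) { return a.x - b.x; });
    return {
      x: line.blocks[0].x,
      y: line.y,
      str: line.blocks.map(function (b) { return b.str; }).join('')
    };
  });
}

// Keep only the lines whose origin falls inside a rectangular region,
// e.g. the area where an invoice always prints the payee.
function linesInRegion(lines, region) {
  return lines.filter(function (line) {
    return line.x >= region.left && line.x <= region.right &&
           line.y >= region.top && line.y <= region.bottom;
  });
}
```

Joining with an empty string assumes that spaces arrive as their own blocks; if they do not, gaps between neighbouring X positions would have to be turned into spaces.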

Any direction is much appreciated. Thanks!
Re: API Documentation Julien Bourdon 12/5/12 4:17 AM
Hi everyone. First post in this group.

I just started to use pdf.js for a project which is described in more detail here: http://stackoverflow.com/questions/13497639/scanned-and-text-pdf-parallel-scrolling-and-selection-in-web-application#comment18648170_13497639

So I might have found an answer to your question.

First you need access to the page you want to extract text from. If you load it as in the hello world example, you can proceed as follows:

var pageToExtract;

function loadPDFInCanvas(pdfname, canvasid, pageNumber, scale) {
  PDFJS.getDocument(pdfname).then(function(pdf) {
    // Using promise to fetch the page
    pdf.getPage(pageNumber).then(function(page) {
      pageToExtract = page;
      ...
    });
  });
}

Then you can access the complete text of each page, split by line (at least in the test PDF I am using), as follows:

// getTextContent() returns a promise, so read the result in its callback.
pageToExtract.getTextContent().then(function(textContent) {
  // bidiTexts is an array of objects containing the text and the direction of reading.
  var bidiTexts = textContent.bidiTexts;
});

I'm still looking for a way to extract the X/Y but the solution might lie in the geom object used by the textLayerBuilderAppendText of the viewer:
https://github.com/mozilla/pdf.js/blob/master/web/viewer.js#L2474

I just do not know yet where this geom object comes from.

Hope that helps.

Julien.
Re: API Documentation Julian Viereck 12/5/12 8:17 AM
> I just do not know yet where this geom object comes from.

The textLayer is built up from two parts. One is the content of the
divs. This is what you get from the `getTextContent()` function. The
other part needed is the position. This information is generated during
rendering of the page. The `showSpacedText` function in the `canvas.js`
file calls a method on the TextLayer to provide the required geometry
information. This is done here:

  https://github.com/mozilla/pdf.js/blob/master/src/canvas.js#L981

Let me know if this is what you're looking for or if you have any other
issues.

Best,

Julian
Re: API Documentation Julien Bourdon 12/7/12 8:19 AM
Julian,

Thank you for the answer.

Let me explain what I am trying to achieve in more detail. If the explanation is too long to read, there is a short version at the bottom of the post. If you want the explanation with a figure, I posted this question on StackOverflow some time ago: http://stackoverflow.com/questions/13497639/scanned-and-text-pdf-parallel-scrolling-and-selection-in-web-application

Basically I have two sets of PDFs in Malay, one in Latin characters in text form and another in Arabic characters from scanned images. I want to implement some text search tools on the Arabic PDF by using the Latin transcription.

In practice, I would like to work on the paragraph level, two corresponding paragraphs having roughly the same position in both documents. In other words, if a user clicks on the canvas containing the arabic document at (x,y), it is very likely that the corresponding paragraph in the latin transcript will be located at (canvasWidth-x,y), since Arabic is a RTL script.
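The coordinate mapping described above can be sketched as a one-line helper. The function name is hypothetical, and the sketch assumes both documents are rendered on canvases of the same width.

```javascript
// Map a click at (x, y) on the Arabic (RTL) canvas to the position where the
// corresponding paragraph is expected on the Latin (LTR) canvas.
// Assumes both canvases share the same width.
function mirrorPoint(x, y, canvasWidth) {
  return { x: canvasWidth - x, y: y };
}

// Example: a click 100px from the left edge of a 600px-wide canvas
// maps to a point 500px from the left edge, at the same height.
var target = mirrorPoint(100, 40, 600);
```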

In order to do that, I would need to store the list of paragraphs in the Latin transcription document and their corresponding bounding boxes. As information about the paragraph division is not stored in the PDF, I need to check whether each line starts with an indentation (alinéa) to detect the beginning of a new paragraph.
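A possible sketch of that indentation check, assuming the lines are already available in reading order as objects of the form `{ str, x }` (a hypothetical shape; pdf.js does not hand out lines like this directly). A line counts as indented when its X position exceeds the left margin by more than a chosen threshold.

```javascript
// Group lines into paragraphs: a new paragraph starts at the first line and
// at every line indented by more than `indentThreshold` from the left margin.
// The left margin is taken to be the smallest x among all lines.
function groupIntoParagraphs(lines, indentThreshold) {
  if (lines.length === 0) {
    return [];
  }
  var leftMargin = Math.min.apply(null, lines.map(function (l) { return l.x; }));
  var paragraphs = [];
  lines.forEach(function (line, i) {
    var indented = line.x - leftMargin > indentThreshold;
    if (i === 0 || indented) {
      paragraphs.push([]);
    }
    paragraphs[paragraphs.length - 1].push(line.str);
  });
  return paragraphs.map(function (p) { return p.join(' '); });
}
```

This would miss paragraphs that are separated only by vertical spacing, so a real implementation would probably also have to look at the gap between consecutive Y positions.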

I managed to extract the text and write it to a div using the jQuery code below (complete code available here: http://pastebin.com/N3jJi8KW ):

page.getTextContent().then(function(text) {
  // jQuery's .map() passes (index, element) to the callback
  extractedString = $.makeArray($(text.bidiTexts).map(function(index, item) { return item.str; })).join(' ');
  $('div#extractedText').text(extractedString);
});

Now I need to get the bounding box of each line or, if that is not possible, reconstruct it from the bounding box of each character. The problem is that I do not know how to get this information, even though I know where it is computed in the source code. I tried to find a way to access the geom object without success. I suspected that I could access the geom object through the renderer, but all I get is a promise object with no data.

----------------
SHORT VERSION

Is there a way to get the bounding box of each line of a PDF document in an array, similarly to the way to get the text content from a pdf via getTextContent()?

Thank you in advance for your answer.

Julien.
Re: API Documentation Julian Viereck 12/11/12 9:26 AM
Julien,

In most cases PDFs don't store paragraph/line information. AFAIK there
is no code in PDF.JS at the moment to detect lines or paragraphs.

The way to get the position information for text is by specifying a
textLayer on the renderContext object that is used for rendering a
page using PDF.JS. While the page is rendered, the textLayer's
appendText function is called. From the passed-in `geom` object, you
can get information about the position of some text on the canvas.
See the code here:

https://github.com/mozilla/pdf.js/blob/master/web/viewer.js#L2499
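As a rough sketch of this approach, a text layer that only records what it is given could look like the following. The three method names (beginLayout, endLayout, appendText) are the ones named in the api.js comment cited in this thread; the factory name, getItems, and isFinished are hypothetical helpers added for illustration, and the sketch makes no assumption about which fields `geom` carries.

```javascript
// A text layer that only records what the renderer hands to it.
function createCollectingTextLayer() {
  var items = [];
  var finished = false;
  return {
    // Called before the text of a page is laid out.
    beginLayout: function () { items.length = 0; finished = false; },
    // Called once all text of the page has been handed over.
    endLayout: function () { finished = true; },
    // Called for each piece of text; `geom` carries the position information.
    appendText: function (geom) { items.push(geom); },
    // Hypothetical helpers for reading the collected data afterwards.
    getItems: function () { return items.slice(); },
    isFinished: function () { return finished; }
  };
}

// Usage in the browser (assumes `page`, `context` and `viewport` already exist):
// var textLayer = createCollectingTextLayer();
// page.render({ canvasContext: context, viewport: viewport, textLayer: textLayer });
```

Once rendering has finished, getItems() would return the collected geom objects in the order in which the renderer produced them.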

Does this help?

very best,

Julian
Re: API Documentation kousi...@gmail.com 12/16/12 8:43 AM
Hello,

Can anyone please let me know if I can disable right click on the rendered pdf by pdf.js. Otherwise is it possible to customize right click menu ?
Thanks
Kousik
Re: API Documentation Julien Bourdon 12/20/12 7:17 AM
Julian,

Sorry for the late reply as I was sucked into another project for a while. I dived into the code and the key was in a comment in api.js: https://github.com/mozilla/pdf.js/blob/master/src/api.js#L244 (textLayer: An object that has beginLayout, endLayout, and appendText functions)

So far, I was focused on examining the code in viewer.js but since I do not need the search functions, I just needed to create a simple TextLayerBuilder class and to store the rendered objects on the fly in the appendText(geom) function.

Thank you again for guiding me in the right direction. I'll be sure to post the code for reference when it is completed.

Julien.
Re: API Documentation jith...@gmail.com 12/5/13 12:31 AM
Go to the link below, or open the source JS file, to find all the options available in pdf.js.

No official documentation is available.

https://raw.github.com/mozilla/pdf.js/gh-pages/build/pdf.js