
Why does pdf.js run slowly and use a lot of memory?


Nicholas Nethercote

Jun 19, 2014, 11:34:21 PM
to dev-p...@lists.mozilla.org
Hi,

Despite recent improvements, pdf.js still runs more slowly and uses
more memory than a native PDF viewer. I've been thinking about why
this is, and my attempt at an explanation is below. I'd be interested
to hear if people think this is correct.

Thanks!

Nick


Why does pdf.js run slowly and use more memory than a native PDF viewer?

Part of the reason is simply that it's written in JavaScript, which means
that it uses garbage collection, and that it's difficult to represent data as
compactly as in C++.

But pdf.js's design also has a big impact.

PDF is basically a bytecode format which describes how to render a document:
print this text here, draw this line here, etc.

The simple, obvious way to implement a PDF viewer is like this: when you need
to render a page (because it becomes visible) you read the data for that page
from the source (file or network) and immediately execute the bytecode (i.e.
render the document).

But pdf.js doesn't do this. It instead reads the data in a worker thread and
constructs an in-memory representation of it. (The original data is fairly
complex, so this representation is too -- arrays of arrays of objects, that
kind of thing.) It then passes this representation to the main thread, which
renders it.
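The two phases can be sketched like this (a minimal illustration, not pdf.js's actual code; the operator names and function signatures are invented):

```javascript
// Phase 1 (worker): parse raw page "bytecode" into an operator list,
// the in-memory representation described above.
function parseContent(raw) {
  return raw.trim().split("\n").map((line) => {
    const [op, ...args] = line.split(" ");
    return { op, args: args.map(Number) };
  });
}

// Phase 2 (main thread): walk the operator list and execute each
// operation against some rendering backend.
function render(opList, backend) {
  for (const { op, args } of opList) backend(op, args);
}

const opList = parseContent("moveTo 72 720\nlineTo 540 720");
render(opList, (op, args) => console.log(op, args));
```

In pdf.js the representation is far richer than this, but the shape is the same: the bytecode is walked once to build the list, and again to execute it.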

This is slower, because you basically have to process the bytecode twice.

And it uses more memory because it has to create this representation. Actually,
it's even worse than it first seems, because of the way data gets copied
between threads in JavaScript: we temporarily end up with *three* copies of
that data: one in the worker thread, another identical copy in the main thread,
and then one in a different, serialized format that's used by the structured
cloning process.

(It's not quite as simple as this. For example, the first pass does some
bytecode transformations to reduce its size and simplify the job of the main
thread code. In particular, some of the data can be stored in typed arrays
which can be transferred without copying. This is done as much as possible for
large homogeneous data elements, such as image pixel data.)
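The copy-versus-transfer behaviour can be demonstrated with structuredClone, which implements the same structured clone algorithm that postMessage uses (a sketch; assumes Node 17+ or a modern browser):

```javascript
// A plain object graph ("arrays of arrays of objects") is deep-copied:
// every node is duplicated on the receiving side.
const ops = [{ op: "showText", args: [[72, 101]] }];
const copy = structuredClone(ops);
console.log(copy !== ops, copy[0] !== ops[0]); // true true

// A typed array's buffer can instead be *transferred*: the receiver
// gets the same memory and the sender's view is detached, so no copy.
const pixels = new Uint8Array(1 << 20); // e.g. decoded image pixel data
const moved = structuredClone(pixels, { transfer: [pixels.buffer] });
console.log(pixels.byteLength, moved.byteLength); // 0 1048576
```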

Allocating and copying those structures also takes yet more time.

Also, when scrolling quickly we can end up with rendering requests for multiple
pages live at the same time, which increases memory usage further.

So why does pdf.js have this two-phase process? Rendering can be slow, and if
all the work is done on the main thread this can cause the entire browser to
freeze. By doing part of the work in a worker, this problem is reduced. A
standalone PDF viewer doesn't have to worry about this issue.

Leonard Rosenthol

Jun 20, 2014, 4:47:20 AM
to Nicholas Nethercote, dev-p...@lists.mozilla.org
Actually, most native PDF viewers use a similar concept of a two-pass
parse of page content. The first pass creates an in-memory representation
(in Acrobat/Reader, it's called a Display List) and then a separate pass
renders this to screen/print/etc. The reason that this is done is because
when the user zooms in, scrolls around, etc. you don't have to reparse the
page content - you need to do a rendering pass (and not necessarily on the
entire list - just the pieces in view). This also works nicely for other
operations such as Editing, Printing and Saving. HOWEVER, no copying is
done - there is a single list that is passed around.
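The display-list approach above can be sketched like this (a toy illustration; the class and operator names are invented, not Acrobat's actual Display List structure):

```javascript
class Page {
  constructor(raw) {
    // Expensive pass, done once: parse page content into a display list.
    this.displayList = raw.trim().split("\n").map((line) => {
      const [op, x, y] = line.split(" ");
      return { op, x: Number(x), y: Number(y) };
    });
  }
  render(zoom) {
    // Cheap pass, done on every zoom/scroll: walk the cached list,
    // scaling each item - no reparsing of the page content.
    return this.displayList.map(
      ({ op, x, y }) => `${op}(${x * zoom},${y * zoom})`
    );
  }
}

const page = new Page("moveTo 72 720\nlineTo 540 720");
console.log(page.render(1)); // first render
console.log(page.render(2)); // zoom to 200%: same list, new rendering pass
```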

Leonard Rosenthol
PDF Architect
Adobe Systems
>_______________________________________________
>dev-pdf-js mailing list
>dev-p...@lists.mozilla.org
>https://lists.mozilla.org/listinfo/dev-pdf-js

Nicholas Nethercote

Jun 22, 2014, 11:55:30 PM
to Leonard Rosenthol, dev-p...@lists.mozilla.org
On Fri, Jun 20, 2014 at 1:47 AM, Leonard Rosenthol <lros...@adobe.com> wrote:
> Actually, most native PDF viewers use a similar concept of a two-pass
> parse of page content. The first pass creates an in-memory representation
> (in Acrobat/Reader, it's called a Display List) and then a separate pass
> renders this to screen/print/etc. The reason that this is done is because
> when the user zooms in, scrolls around, etc. you don't have to reparse the
> page content - you need to do a rendering pass (and not necessarily on the
> entire list - just the pieces in view). This also works nicely for other
> operations such as Editing, Printing and Saving. HOWEVER, no copying is
> done - there is a single list that is passed around.

Thanks for the info! I didn't know that.

Let me amend and boil down my observation to this:
- it's difficult to represent such a data structure compactly in JavaScript;
- it's difficult to avoid making multiple copies when copying this
data structure between threads in JavaScript;
- these two facts contribute significantly to pdf.js's memory usage.
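The compactness point can be illustrated with a small sketch: the same eight numbers stored as an array of objects (idiomatic JavaScript, one heap allocation per element) versus a flat typed array (contiguous, and transferable between threads without copying):

```javascript
const n = 4;
// Idiomatic but heavy: one heap object per point, plus the array slots.
const points = Array.from({ length: n }, (_, i) => ({ x: i, y: 2 * i }));

// Compact: the same values packed into a single 64-byte buffer,
// roughly what a C++ array of structs would occupy.
const flat = new Float64Array(2 * n);
points.forEach((p, i) => { flat[2 * i] = p.x; flat[2 * i + 1] = p.y; });

console.log(points[3].y, flat[7]); // 6 6
```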

Nick

Nicholas Nethercote

Aug 14, 2014, 7:34:09 PM
to Leonard Rosenthol, dev-p...@lists.mozilla.org
Another big factor I didn't appreciate until recently is the textLayer. For
pages with lots of text, creating the textLayer is slow, sometimes
causing freezes. It can also really hurt scrolling speed and take up
lots of memory.

Nick

Christian Hajdú

Jan 16, 2015, 11:02:46 AM
to mozilla-d...@lists.mozilla.org
Interesting observations here. I'm not sure how far this has been explored since.
Has anyone tried processing magazine-style PDFs? I think there might be a performance boost in moving the core work to external processes and calling it through the display layer, so that you get preprocessed PDFs. Unfortunately it would still take a while with large images, but if you can reduce the first load time by referencing, e.g., an image server, I assume this could help reduce the local worker's workload.

Any thoughts or experiences with external systems?

On Friday, August 15, 2014 at 1:34:09 AM UTC+2, Nicholas Nethercote wrote:
> Another big factor I didn't appreciate until recently is the textLayer. For
> pages with lots of text, creating the textLayer is slow, sometimes
> causing freezes. It can also really hurt scrolling speed and take up
> lots of memory.
>
> Nick
>
> On Mon, Jun 23, 2014 at 1:55 PM, Nicholas Nethercote
> <n.n----@g----.com> wrote: