
Including Adobe CMaps


Brendan Dahl

Feb 24, 2014, 5:41:13 PM
to dev-pl...@lists.mozilla.org
PDF.js plans to soon start including and using Adobe CMap files for converting character codes to character IDs (CIDs) and mapping character codes to Unicode values. This will fix a number of bugs in PDF.js and will improve our support for Chinese, Korean, and Japanese (CJK) documents.
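For converting a character code to a CID, a CMap boils down to a list of code ranges. A minimal sketch of that lookup (the tuple layout is illustrative only, not PDF.js's actual data structures):

```python
# Sketch of a CID lookup over "cidrange" entries, the core mapping a
# CMap provides. The (low, high, start_cid) tuple layout is hypothetical.

def make_cid_lookup(ranges):
    """ranges: (low_code, high_code, start_cid) tuples, as listed in a
    CMap's begincidrange/endcidrange sections."""
    def lookup(code):
        for low, high, start_cid in ranges:
            if low <= code <= high:
                # CIDs run consecutively across each range.
                return start_cid + (code - low)
        return 0  # CID 0 is the conventional .notdef fallback
    return lookup

# Toy ranges: one single-byte range and one two-byte range.
lookup = make_cid_lookup([(0x20, 0x7E, 1), (0x8140, 0x817E, 633)])
```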

I wanted to inform dev-platform because there are quite a few files and they are large. The files are loaded lazily as needed so they shouldn’t affect the size of Firefox when running, but they will affect the installation size. There are 168 files with an average size of ~40KB, and all of the files together are roughly:
6.9M
2.2M when gzipped

http://sourceforge.net/adobe/cmap/wiki/Home/

Andreas Gal

Feb 24, 2014, 6:01:32 PM
to Brendan Dahl, dev-pl...@lists.mozilla.org
Is this something we could load dynamically and offline cache?

Andreas

Sent from Mobile.
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform

Kyle Huey

Feb 24, 2014, 6:40:22 PM
to Andreas Gal, Brendan Dahl, dev-pl...@lists.mozilla.org
Alternatively, is this data duplicated in ICU?

- Kyle

Ralph Giles

Feb 24, 2014, 6:54:53 PM
to dev-pl...@lists.mozilla.org
On 2014-02-24 2:41 PM, Brendan Dahl wrote:
> There are 168 files with an average size of ~40KB, and all of the files together are roughly:
> 6.9M
> 2.2M when gzipped

IIRC mupdf was able to save significant space by pre-parsing the files.
Their code for that is GPL (and oriented toward compiling-in the tables
as C code) but it might be worth a look.

http://git.ghostscript.com/?p=mupdf.git;a=blob;f=scripts/cmapdump.c

4.2M resources/cmaps/
1008K build/debug/pdf/pdf-cmap-table.o

-r

Rik Cabanier

Feb 24, 2014, 7:23:06 PM
to Andreas Gal, Brendan Dahl, dev-pl...@lists.mozilla.org
On Mon, Feb 24, 2014 at 3:01 PM, Andreas Gal <andre...@gmail.com> wrote:

> Is this something we could load dynamically and offline cache?
>

That should be possible. The CMap name is in the PDF, so Firefox could
download it on demand.
Also, if the user has Acrobat, the CMaps are already on their machine.

> On Feb 24, 2014, at 23:41, Brendan Dahl <bd...@mozilla.com> wrote:
> >
> > PDF.js plans to soon start including and using Adobe CMap files for
> converting character codes to character IDs (CIDs) and mapping character
> codes to Unicode values. This will fix a number of bugs in PDF.js and will
> improve our support for Chinese, Korean, and Japanese (CJK) documents.
> >
> > I wanted to inform dev-platform because there are quite a few files and
> they are large. The files are loaded lazily as needed so they shouldn't
> affect the size of Firefox when running, but they will affect the
> installation size. There are 168 files with an average size of ~40KB, and
> all of the files together are roughly:
> > 6.9M
> > 2.2M when gzipped
> >

Brendan Dahl

Feb 24, 2014, 7:27:33 PM
to Andreas Gal, dev-pl...@lists.mozilla.org
It’s certainly possible to load dynamically. Do we currently do this for any other Firefox resources?

From what I’ve seen, many PDFs use CMaps even if they don’t necessarily contain CJK characters, so it may be better just to include them. FWIW, both Poppler and MuPDF embed the CMaps.

Brendan

On Feb 24, 2014, at 3:01 PM, Andreas Gal <andre...@gmail.com> wrote:

> Is this something we could load dynamically and offline cache?
>
> Andreas
>
> Sent from Mobile.

Andreas Gal

Feb 24, 2014, 8:01:48 PM
to Brendan Dahl, dev-pl...@lists.mozilla.org

My assumption is that certain users only need certain CMaps, because they tend to read documents only in certain languages. This seems like something we can really optimize, avoiding the ahead-of-time download cost.

The fact that we don’t do this yet doesn’t seem like a good criterion. There are a lot of good things we aren’t doing yet. You can be the first to change that on this particular topic, if it technically makes sense.

Andreas

Rik Cabanier

Feb 25, 2014, 1:14:27 AM
to Andreas Gal, Brendan Dahl, dev-pl...@lists.mozilla.org
On Mon, Feb 24, 2014 at 5:01 PM, Andreas Gal <andre...@gmail.com> wrote:

>
> My assumption is that certain users only need certain CMaps because they
> tend to read only documents in certain languages. This seems like something
> we can really optimize and avoid ahead-of-time download cost for.
>

So, you'd only install the Korean CMaps if the language is Korean?
The problem with that is that a user might install an English version of
Firefox but still open Korean PDFs (which would then display as junk).


>
> The fact that we don't do this yet doesn't seem like a good criterion.
> There are a lot of good things we aren't doing yet. You can be the first to
> change that on this particular topic, if it technically makes sense.


Load-on-demand (with an option to download all of them) seems like a nice
solution. A large majority of users will never need CMaps, or will only
need a very small subset.

Brendan Dahl

Feb 26, 2014, 2:38:07 PM
to dev-pl...@lists.mozilla.org
Yury Delendik worked on reformatting the files a bit and was able to get them down to a 1.1MB binary format, which gzips to 990KB. This seems like a reasonable size to me, and it involves a lot less work than setting up a process for distributing these files via a CDN.
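For a sense of how a binary reformatting can shrink these files, here is a sketch that packs cidranges as fixed-width records; the record layout is invented for illustration and is not Yury's actual format:

```python
import struct

# Hypothetical record: each cidrange packed as <low:u32, high:u32, cid:u16>.
RANGE = struct.Struct(">IIH")

def pack_ranges(ranges):
    return b"".join(RANGE.pack(lo, hi, cid) for lo, hi, cid in ranges)

def unpack_ranges(blob):
    return [RANGE.unpack_from(blob, i) for i in range(0, len(blob), RANGE.size)]

text = b"<8140> <817e> 633\n"                  # one textual cidrange line
binary = pack_ranges([(0x8140, 0x817E, 633)])  # 10 bytes vs. 18
```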

Brendan

Bobby Holley

Feb 26, 2014, 2:53:52 PM
to Brendan Dahl, dev-pl...@lists.mozilla.org
That's still a ton for something that most of our users will not (or will
rarely) use. I think we absolutely need to get an on-demand story for this
kind of stuff. It isn't the first time it has come up.

bholley

Andreas Gal

Feb 26, 2014, 2:56:37 PM
to Brendan Dahl, dev-pl...@lists.mozilla.org

This randomly reminds me that it might be time to review zip as our compression format for omni.ja.

ls -l omni.ja

7862939

ls -l omni.tar.xz (tar and then xz -z)

4814416

LZMA2 is available as a public domain implementation. It uses a bit more memory than zip, but it’s still within reason (the default level 6 needs around 1MB to decode, I believe). A fairly easy way to use it would be to add support for a custom compression format in our version of libjar.
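One reason LZMA2 can beat deflate is redundancy beyond deflate's 32KB window. A self-contained sketch with synthetic data (not omni.ja itself), using only the Python stdlib:

```python
import lzma
import random
import zlib

# A 64KB block repeated four times: the repeats sit 64KB apart, outside
# deflate's 32KB match window but well inside LZMA2's dictionary.
block = random.Random(0).randbytes(64 * 1024)
payload = block * 4

deflated = zlib.compress(payload, 9)     # the algorithm zip/jar uses
xzed = lzma.compress(payload, preset=6)  # LZMA2 at xz's default level

print(len(payload), len(deflated), len(xzed))
```

With this payload, deflate finds no matches at all (every repeat is 64KB away), while LZMA2 collapses the copies.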

Andreas

Andreas Gal

Feb 26, 2014, 2:57:27 PM
to Bobby Holley, Brendan Dahl, dev-pl...@lists.mozilla.org

Let’s turn this question around. If we had an on-demand way to load stuff like this, what else would we want to load on demand?

Andreas

On Feb 26, 2014, at 8:53 PM, Bobby Holley <bobby...@gmail.com> wrote:

> That's still a ton for something that most of our users will not (or will
> rarely) use. I think we absolutely need to get an on-demand story for this
> kind of stuff. It isn't the first time it has come up.
>
> bholley
>
>

Jonathan Kew

Feb 26, 2014, 3:21:12 PM
to dev-pl...@lists.mozilla.org
On 26/2/14 19:57, Andreas Gal wrote:
>
> Let's turn this question around. If we had an on-demand way to load stuff like this, what else would we want to load on demand?

A few examples:

Spell-checking dictionaries
Hyphenation tables
Fonts for additional scripts

JK

Benjamin Smedberg

Feb 26, 2014, 3:44:46 PM
to Jonathan Kew, dev-pl...@lists.mozilla.org
On 2/26/2014 3:21 PM, Jonathan Kew wrote:
> On 26/2/14 19:57, Andreas Gal wrote:
>>
>> Let's turn this question around. If we had an on-demand way to load
>> stuff like this, what else would we want to load on demand?
>
> A few examples:
>
> Spell-checking dictionaries
> Hyphenation tables
> Fonts for additional scripts
Yes!

Also maybe ICU data tables, although the current web-facing APIs don't
support asynchronous download very well.

--BDS

Andreas Gal

Feb 26, 2014, 3:49:36 PM
to Benjamin Smedberg, Jonathan Kew, dev-platform

This sounds like quite an opportunity to shorten download times and reduce CDN load. Who wants to file the bug? :)

Andreas

Wesley Hardman

Feb 26, 2014, 3:58:09 PM
to
Here are a few things to think about:

1. Not everyone will have internet access. Businesses sometimes have limited access, or access to specific sites only. There was discussion on this in the add-ons file registration a little while ago.
What if you are travelling without internet and working on something locally? What if you are connecting to a hotspot, but need one of those items to view the "accept terms" page to get internet access?

2. Speed. If you are on a slow connection, like cell wireless, having to download another file is going to cause the page to load that much slower.

3. Data caps. Yes, it is small, but if it can be downloaded while you are on uncapped wifi when the program updates, as opposed to being downloaded over a capped network when needed, wouldn't that be preferred?

If it is big enough to make a difference in download times, would it be big enough that you wouldn't want to download it over a slow capped network on-demand?

Personally, I would prefer to have it already available. I tend to live by "It's better to have it and not need it than to not have it and need it" (or have to fetch it). This is just my opinion, though.

Jonathan Kew

Feb 26, 2014, 3:58:08 PM
to dev-platform, andre...@gmail.com
On 26/2/14 20:49, Andreas Gal wrote:
>
> This sounds like quite an opportunity to shorten download times and reduce CDN load. Who wants to file the bug? :)

Re fonts, see bug 619521 and bug 648548; there's even an old
proof-of-concept patch there, but it's been stalled for a while....

If we're going to do this more broadly, though, I expect we'll want to
back up and design a more general framework to manage such resources.

JK

Gregory Szorc

Feb 26, 2014, 3:57:51 PM
to Andreas Gal, Benjamin Smedberg, Jonathan Kew, dev-platform
https://bugzilla.mozilla.org/show_bug.cgi?id=977292

Assigned to nobody.

On 2/26/2014 12:49 PM, Andreas Gal wrote:
>
> This sounds like quite an opportunity to shorten download times and reduce CDN load. Who wants to file the bug? :)
>

Nick Alexander

Feb 26, 2014, 4:23:40 PM
to dev-pl...@lists.mozilla.org
On 2/26/2014, 11:56 AM, Andreas Gal wrote:
>
> This randomly reminds me that it might be time to review zip as our compression format for omni.ja.
>
> ls -l omni.ja
>
> 7862939
>
> ls -l omni.tar.xz (tar and then xz -z)
>
> 4814416
>
> LZMA2 is available as a public domain implementation. It uses a bit more memory than zip, but it's still within reason (the default level 6 needs around 1MB to decode, I believe). A fairly easy way to use it would be to add support for a custom compression format in our version of libjar.

Is there a meta ticket for this? I'm interested in evaluating how much
this would trim the mobile/android APK size.

Nick

Bobby Holley

Feb 26, 2014, 4:36:56 PM
to Wesley Hardman, dev-pl...@lists.mozilla.org
On Wed, Feb 26, 2014 at 12:58 PM, Wesley Hardman <whard...@gmail.com> wrote:

> Personally, I would prefer to have it already available. I tend to live
> by "It's better to have it and not need it than to not have it and need it."
> (or have to fetch it.) This is my opinion though.
>

It seems like it would be trivial to add a button in the Preferences UI to
let people precache all dynamically-loaded data.

bholley

Boris Zbarsky

Feb 26, 2014, 4:40:51 PM
to
On 2/26/14 3:58 PM, Wesley Hardman wrote:
> Personally, I would prefer to have it already available.

We have several deployment targets with different tradeoffs. Broadly
speaking:

Phones: expensive data, limited storage. Want to not use up the
storage, so download lazily.

Consumer laptops/desktops: cheap data, plentiful storage. Probably ok
to download opportunistically after initial install even if not
immediately needed.

Locked-down corporate laptops/desktops: Need a way to push out an
install with everything already included.

Limited-connectivity kiosks and whatnot: Need a way to push out an
install with whatever components are desired already included.

> I tend to live by "It's better to have it and not need it than to not have it and need it."

If you have unlimited storage, sure. We don't, on phones.

-Boris

Mike Hommey

Feb 26, 2014, 6:25:00 PM
to Andreas Gal, Brendan Dahl, dev-pl...@lists.mozilla.org
On Wed, Feb 26, 2014 at 08:56:37PM +0100, Andreas Gal wrote:
>
> This randomly reminds me that it might be time to review zip as our
> compression format for omni.ja.
>
> ls -l omni.ja
>
> 7862939
>
> ls -l omni.tar.xz (tar and then xz -z)
>
> 4814416
>
> LZMA2 is available as a public domain implementation. It uses a bit
> more memory than zip, but it's still within reason (the default level 6
> needs around 1MB to decode, I believe). A fairly easy way to use it
> would be to add support for a custom compression format in our version
> of libjar.

IIRC, it's also slower both to compress and decompress. Note you're
comparing oranges with apples, too: jars use per-file compression, while
tar.xz is per-archive compression. This is what I get:

$ stat -c %s ../omni.ja
8609399

$ unzip -q ../omni.ja
$ find . -type f -not -name '*.xz' | while read f; do a=$(stat -c %s "$f"); xz --keep -z "$f"; b=$(stat -c %s "$f.xz"); if [ "$a" -lt "$b" ]; then rm "$f.xz"; else rm "$f"; fi; done
# The above compresses each file individually and keeps either the
# decompressed file or the compressed file, depending on which is smaller,
# which is essentially what we do when creating omni.ja

$ find . -type f | while read f; do stat -c %s "$f"; done | awk '{t+=$1}END{print t}'
# Sum all file sizes, excluding directories that du would add.
7535827

That is, obviously, without jar headers.
$ unzip -lv ../omni.ja 2>/dev/null | tail -1
27696753 8260243 70% 2068 files
$ echo $((8609399 - 8260243))
349156

Thus, that same omni.ja that is 8609399 bytes would, with xz compression,
be 7884983. Not much of a win, and I doubt it's worth it considering the
runtime implications.
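The packing rule the shell script above emulates (deflate each member, but keep whichever representation is smaller) can be sketched as follows; this is illustrative, not the real omni.ja packager:

```python
import random
import zlib

def pick_smaller(data):
    """Deflate a member, but keep it uncompressed if deflate
    does not actually shrink it (the omni.ja packing rule)."""
    compressed = zlib.compress(data, 9)
    if len(compressed) < len(data):
        return False, compressed  # store deflated
    return True, data             # store as-is

js = b"function f() { return 1; }\n" * 500  # very compressible
blob = random.Random(1).randbytes(100)      # effectively incompressible

js_stored, js_data = pick_smaller(js)
blob_stored, blob_data = pick_smaller(blob)
```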

However, there is probably room for improvement on the installer side.

Mike

Mike Hommey

Feb 26, 2014, 7:07:54 PM
to Andreas Gal, Brendan Dahl, dev-pl...@lists.mozilla.org
Well, the overhead would be different because of different alignments,
but the order of magnitude should be the same.

> Thus, that same omni.ja that is 8609399, with xz compression would be
> 7884983. Not much of a win, and i doubt it's worth it considering the
> runtime implication.
>
> However, there is probably room for improvement on the installer side.
>
> Mike

Andreas Gal

Feb 26, 2014, 7:30:58 PM
to Mike Hommey, Brendan Dahl, dev-pl...@lists.mozilla.org

Could we compress major parts of omni.ja en bloc? We could, for example, stick all the JS we load at startup into a zip with zero compression and then compress that into an outer zip. I think we already support nested containers like that. Assuming your math is correct, even without adding LZMA2 (just sticking with zip) we should get better compression and likely better load times. Wdyt?
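The nested-container idea can be sketched with Python's zipfile: store the members uncompressed in an inner zip, then deflate that inner zip as a single member of an outer zip, so redundancy across members gets compressed once. A sketch over synthetic data, not the actual omni.ja layout:

```python
import io
import zipfile

# Synthetic members with lots of cross-file redundancy.
members = {"mod%d.js" % i: b"Components.utils.import('resource://x');\n" * 40
           for i in range(40)}

def zip_bytes(data_by_name, compression):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression) as z:
        for name, data in sorted(data_by_name.items()):
            z.writestr(name, data)
    return buf.getvalue()

# Conventional jar: every member deflated independently.
per_member = zip_bytes(members, zipfile.ZIP_DEFLATED)

# Nested scheme: inner zip with zero compression, deflated as one
# member of an outer zip.
inner = zip_bytes(members, zipfile.ZIP_STORED)
outer = zip_bytes({"startup.zip": inner}, zipfile.ZIP_DEFLATED)
```

On data like this the nested form wins, since cross-member matches fall within deflate's 32KB window; redundancy further apart than that would not be captured.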

Andreas

Axel Hecht

Feb 27, 2014, 3:30:22 AM
to
The feature of zip we want is the index, which lets us seek to a
position in the bundle and start unpacking, given just the filename.

How hard would it be to create a data structure for the same purpose for
a tar.xz or so? I don't really know anything about the decompression
algorithms, so I don't know if we could do something like:
- seek to position N in bundle
- set state to X, if applicable
- decompress, skip M bytes
-- you get your file contents, L bytes long

Or so. Yes, it'd be a new file format, I guess, at least as far as I can
tell? Maybe it's worth it.
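One way to get zip-style random access with a stronger codec is to compress each member independently and keep an index of offsets, so reading one file is a seek plus a single-member decompress. A minimal sketch (hypothetical format, Python stdlib only):

```python
import lzma

def pack(files):
    """Return (index, blob). Each member is compressed independently so
    it can be read back with one seek; index maps name -> (offset, size)."""
    index, parts, offset = {}, [], 0
    for name, data in sorted(files.items()):
        c = lzma.compress(data)
        index[name] = (offset, len(c))
        parts.append(c)
        offset += len(c)
    return index, b"".join(parts)

def read_member(index, blob, name):
    off, size = index[name]
    return lzma.decompress(blob[off:off + size])

idx, bundle = pack({"a.txt": b"hello" * 100, "b.txt": b"world" * 100})
```

Per-member compression gives random access but forfeits cross-member redundancy; solid blocks (as in 7z) trade the reverse.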

Axel

Mike Hommey

Feb 27, 2014, 5:32:50 AM
to Axel Hecht, dev-pl...@lists.mozilla.org
On Thu, Feb 27, 2014 at 09:30:22AM +0100, Axel Hecht wrote:
> The feature of zip we want is the index, which lets us seek to a
> position in the bundle and start unpacking, given just the filename.
>
> How hard would it be to create a data structure for the same purpose
> for a tar.xz or so? I don't really know anything about the
> decompression algorithms, so I don't know if we could do something like:
> - seek to position N in bundle
> - set state to X, if applicable
> - decompress, skip M bytes
> -- you get your file contents, L bytes long
>
> Or so. Yes, it'd be a new file format, I guess, at least as far as I
> can tell? Maybe it's worth it.

That's essentially what the 7z format does... if we're ready to take the
decompression overhead.

Mike

Mike Hommey

Feb 27, 2014, 5:33:39 AM
to Andreas Gal, Brendan Dahl, dev-pl...@lists.mozilla.org
On Thu, Feb 27, 2014 at 01:30:58AM +0100, Andreas Gal wrote:
>
> Could we compress major parts of omni.ja en bloc? We could, for
> example, stick all the JS we load at startup into a zip with zero
> compression and then compress that into an outer zip. I think we
> already support nested containers like that. Assuming your math is
> correct, even without adding LZMA2 (just sticking with zip) we should
> get better compression and likely better load times. Wdyt?

I doubt this would benefit much, considering the limited window size of
the deflate algorithm.

Mike

Neil

Feb 27, 2014, 7:52:44 AM
to
Andreas Gal wrote:

>Could we compress major parts of omni.ja en bloc? We could for example stick all the JS we load at startup into a zip with zero compression and then compress that into an outer zip. I think we already support nested containers like that. Assuming your math is correct, even without adding LZMA2 (just sticking with zip) we should get better compression and likely better load times. Wdyt?
>
You could compress the whole of omni.ja en bloc and stick the startup
cache at the beginning so that you don't have to completely decompress
omni.jar until such time as you use the last entry, if at all.

--
Warning: May contain traces of nuts.

Benjamin Smedberg

Feb 27, 2014, 9:27:59 AM
to Bobby Holley, Wesley Hardman, dev-pl...@lists.mozilla.org
I don't think that would be trivial. In particular, which spellchecking
dictionaries would we download? All hundreds of them? Typically people
are only ever going to use one or a few. The different kinds of data
here may need different defaults.

I think this whole thing is valuable, but it's going to require some
significant thought about how we manage this data and probably a whole
new release mechanism for updating the data if necessary. If we did
decide to automatically stream-download some or all of the data for
desktop builds after the initial install, that might nullify the cost
savings associated with reducing the initial download size.

--BDS

Bobby Holley

Feb 27, 2014, 12:35:01 PM
to Benjamin Smedberg, dev-pl...@lists.mozilla.org, Wesley Hardman
On Thu, Feb 27, 2014 at 6:27 AM, Benjamin Smedberg <benj...@smedbergs.us> wrote:

> On 2/26/2014 4:36 PM, Bobby Holley wrote:
>
>> On Wed, Feb 26, 2014 at 12:58 PM, Wesley Hardman <whard...@gmail.com>
>> wrote:
>>
>> It seems like it would be trivial to add a button in the Preferences UI to
>> let people precache all dynamically-loaded data.
>>
>
> I don't think that would be trivial. In particular, which spellchecking
> dictionaries would we download? All hundreds of them? Typically people are
> only ever going to use one or a few. The different kinds of data here may
> need different defaults.
>

Well, there are two kinds of data here. One is stuff (like CMaps and ICU)
that we're currently shipping (or considering shipping) in the browser, for
which this would be a download-size optimization. The other is stuff (like
langpacks) that we wouldn't ship in the browser, for which this would be a
UX optimization to avoid sending the user to AMO et al.

We could very easily label the data sources per the above, at which point
users have the option to revert to the behavior they would have gotten if
we hadn't implemented this feature. An addon could probably provide
additional UI to let users pick what to precache.

bholley

Nick Alexander

Feb 27, 2014, 12:38:29 PM
to dev-pl...@lists.mozilla.org
On 2/27/2014, 12:30 AM, Axel Hecht wrote:
> The feature of zip we want is the index, which lets us seek to a
> position in the bundle and start unpacking, given just the filename.
>
> How hard would it be to create a data structure for the same purpose for
> a tar.xz or so? I don't really know anything about the decompression
> algorithms, so I don't know if we could do something like:
> - seek to position N in bundle
> - set state to X, if applicable
> - decompress, skip M bytes
> -- you get your file contents, L bytes long
>
> Or so. Yes, it'd be a new file format, I guess, at least as far as I can
> tell? Maybe it's worth it.

Big -1 to a new format. glandium's testing earlier in this thread was
made trivial by the fact that we're using zip and other standard
formats; that kind of tooling is the first thing to go when we introduce
a custom format. As an Android developer particularly interested in
packaging, I inspect and rebuild omni.ja with a whole bunch of different
tools.

Nick

Mike Hommey

Feb 27, 2014, 5:15:31 PM
to Andreas Gal, Neil, Brendan Dahl, dev-pl...@lists.mozilla.org
On Thu, Feb 27, 2014 at 07:33:39PM +0900, Mike Hommey wrote:
> On Thu, Feb 27, 2014 at 01:30:58AM +0100, Andreas Gal wrote:
> >
> > Could we compress major parts of omni.ja en bloc? We could, for
> > example, stick all the JS we load at startup into a zip with zero
> > compression and then compress that into an outer zip. I think we
> > already support nested containers like that. Assuming your math is
> > correct, even without adding LZMA2 (just sticking with zip) we
> > should get better compression and likely better load times. Wdyt?
>
> I doubt this would benefit much, considering the limited window size of
> the deflate algorithm.

That's actually easily verified by recreating a zip with no
compression and gzipping it... the result is 7568209 bytes if you do
that for the entire omni.ja.

On Thu, Feb 27, 2014 at 12:52:44PM +0000, Neil wrote:
> You could compress the whole of omni.ja en bloc and stick the
> startup cache at the beginning so that you don't have to completely
> decompress omni.jar until such time as you use the last entry, if at
> all.

Note we do order omni.ja following how it's used, but that's only done
on PGO builds.

That said, the problem with this is that either you keep every file you
ever extracted in memory (or in a disk cache), or you risk having to
decompress the archive from the beginning again when you need to read
something that is no longer in memory.

Mike

Gervase Markham

Feb 28, 2014, 6:44:53 AM
to Jonathan Kew
On 26/02/14 20:21, Jonathan Kew wrote:
>> Let's turn this question around. If we had an on-demand way to load
>> stuff like this, what else would we want to load on demand?
>
> A few examples:
>
> Spell-checking dictionaries
> Hyphenation tables
> Fonts for additional scripts

If this came with an update system (i.e. a way for Firefox to know the
data is out of date) then the Public Suffix List would benefit. It's a
small amount of data, but non-ideal if it goes stale.

But maybe that's scope creep.

Gerv


Jonathan Kew

Feb 28, 2014, 7:37:52 AM
to Gervase Markham, dev-pl...@lists.mozilla.org
Presumably we always want the complete PSL available. So it really
should be part of the base product, not a [try-to-]load-on-demand resource.

Isn't it sufficient to update that with each new Firefox release?

If there is data such as this that is always included, but would benefit
from being updated separately from the regular release schedule (without
actually pushing out a dot release or chemspill), I think that's a
rather different use-case, even if a common underlying mechanism could
perhaps end up serving both.

JK

Gervase Markham

Feb 28, 2014, 8:50:51 AM
to Jonathan Kew
On 28/02/14 12:37, Jonathan Kew wrote:
> Presumably we always want the complete PSL available. So it really
> should be part of the base product, not a [try-to-]load-on-demand resource.

I was proposing it be part of the base product, but updated on demand.

> Isn't it sufficient to update that with each new Firefox release?

Not everyone takes those. :-|

> If there is data such as this that is always included, but would benefit
> from being updated separately from the regular release schedule (without
> actually pushing out a dot release or chemspill), I think that's a
> rather different use-case, even if a common underlying mechanism could
> perhaps end up serving both.

Fair enough; it is scope creep, then :-)

Gerv


Robert Kaiser

Feb 28, 2014, 11:07:11 AM
to
Boris Zbarsky wrote:
> On 2/26/14 3:58 PM, Wesley Hardman wrote:
>> Personally, I would prefer to have it already available.
>
> We have several deployment targets with different tradeoffs. Broadly
> speaking:
>
> Phones: expensive data, limited storage. Want to not use up the
> storage, so download lazily.

Especially on phones I would expect downloads to be *very* expensive and
slow (as they are for many Firefox OS users, from what I'm told), so it's
actually there that I would say we cannot download in most cases,
especially as we want installed apps to work the same on all devices.

Robert Kaiser
