
An analysis of content process memory overhead


Nicholas Nethercote

Mar 16, 2016, 10:22:40 AM
to dev-platform
Greetings,

erahm recently wrote a nice blog post with measurements showing the overhead
of enabling multiple content processes:

http://www.erahm.org/2016/02/11/memory-usage-of-firefox-with-e10s-enabled/

The overhead is high -- 8 content processes *doubles* our physical memory
usage -- which limits the possibility of increasing the number of content
processes beyond a small number. Now I've done some follow-up
measurements to find out what is causing the per-content-process overhead.

I did this by measuring memory usage with four trivial web pages open, first
with a single content process, then with four content processes, and then
diffing the content-process memory reports of the two runs. (about:memory's
diff algorithm normalizes PIDs and addresses in memory reports as "NNN", so
multiple content processes naturally get collapsed together, which in this
case is exactly what we want.) I call this the "small processes" measurement.
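
For intuition, that normalization is roughly the following -- a sketch, not
about:memory's actual code:

    // Normalize the variable parts of a memory-report path so that
    // equivalent reports from different processes compare equal.
    function normalizePath(path) {
      return path
        .replace(/\(pid \d+\)/g, "(pid NNN)")   // process IDs
        .replace(/0x[0-9a-fA-F]+/g, "0xNNN");   // heap addresses
    }

With that, "js-non-window/zones/zone(0x7f3c2e408000)" from two different
processes both become "js-non-window/zones/zone(0xNNN)", and the diff can
aggregate them.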

If we divide the memory-usage increase by 3 (the increase in the number of
content processes) we get a rough measure of the minimum per-content-process
overhead.

I then did a similar thing but with four more complex web pages (Gmail,
Google Docs, TreeHerder, Bugzilla). I call this the "large processes"
measurement.

-----------------------------------------------------------------------------
LINUX (64-bit), small processes
-----------------------------------------------------------------------------

Some top-level numbers from the "small processes" diff are as follows.

> 68.54 MB (100.0%) -- explicit
> ├──33.54 MB (48.94%) ++ js-non-window
> │ ├──22.97 MB (33.52%) -- zones/zone(0xNNN)
> │ │ ├──18.54 MB (27.05%) ++ (92 tiny)
> │ │ ├───1.94 MB (02.84%) ── unused-gc-things [12]
> │ │ ├───1.71 MB (02.49%) ++ strings/string(<non-notable strings>)
> │ │ └───0.78 MB (01.14%) ++ object-groups
> │ ├───6.97 MB (10.17%) -- runtime
> │ │ ├──3.72 MB (05.42%) ── script-data [4]
> │ │ ├──1.34 MB (01.95%) -- gc
> │ │ │ ├──1.00 MB (01.46%) ── nursery-committed [4]
> │ │ │ └──0.34 MB (00.49%) ++ (3 tiny)
> │ │ ├──1.05 MB (01.54%) ── atoms-table [4]
> │ │ └──0.86 MB (01.26%) ++ (7 tiny)
> │ └───3.60 MB (05.25%) -- gc-heap
> │ ├──3.00 MB (04.38%) ── unused-chunks [4]
> │ └──0.60 MB (00.87%) ++ (2 tiny)
> ├──13.58 MB (19.82%) ── heap-unclassified
> ├──11.51 MB (16.79%) ++ heap-overhead
> │ ├───7.64 MB (11.15%) ── page-cache [4]
> │ ├───3.03 MB (04.42%) ── bin-unused [4]
> │ └───0.84 MB (01.22%) ── bookkeeping [4]
> ├───2.84 MB (04.14%) ── xpti-working-set [4]
> ├───2.05 MB (03.00%) ++ layout
> ├───1.33 MB (01.95%) ++ (10 tiny)
> ├───1.09 MB (01.58%) ── preferences [4]
> ├───1.02 MB (01.49%) ++ xpconnect
> ├───0.80 MB (01.17%) ++ atom-tables
> └───0.77 MB (01.13%) ++ xpcom
>
> 48.36 MB (100.0%) -- heap-committed
> ├──36.86 MB (76.21%) ── allocated [4]
> └──11.51 MB (23.79%) ── overhead [4]
>
> 33.54 MB (100.0%) -- js-main-runtime
> ├──17.76 MB (52.94%) ++ compartments
> ├───6.97 MB (20.78%) ── runtime [4]
> ├───5.22 MB (15.55%) ++ zones
> └───3.60 MB (10.73%) ++ gc-heap
>
> 261 (100.0%) -- js-main-runtime-compartments
> ├──255 (97.70%) ++ system
> └────6 (02.30%) ++ user
>
> 310.06 MB ── resident [4]
> 114.39 MB ── resident-unique [4]

The "[4]" annotations just indicate that these measurements are all repeated
four times in the second case, due to the four content processes.

Among the internal measurements, "explicit" increases by 69 MiB, which
indicates a 23 MiB overhead per content process.

As for the OS measurements, "resident" is not a good metric here because it
will quadruple-count any memory shared between processes. "resident-unique"
shouldn't suffer from that problem, and it suggests a 38 MiB overhead.
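
Spelled out, the arithmetic behind both per-process numbers is simply:

    // Per-process overhead = (increase going from 1 to 4 processes) / 3.
    const explicitIncrease = 68.54;         // MiB, "explicit" in the diff
    const residentUniqueIncrease = 114.39;  // MiB, "resident-unique" in the diff
    const extraProcesses = 3;
    console.log(explicitIncrease / extraProcesses);        // ~22.8, i.e. 23 MiB
    console.log(residentUniqueIncrease / extraProcesses);  // ~38.1, i.e. 38 MiB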

The 15 MiB gap between these two surprised me. The only thing that would
account for that difference is unshared (non-read-only) static data,
including vtables, lookup tables that contain pointers, etc. I've started
digging, and it actually seems plausible. Some of this data will be in our
own code, and some is in external libraries that we rely on. Some small
improvements are possible, but there's an incredibly long tail, so it's
unlikely to improve a lot. See bug 1254777 for more details.

Digging into the "explicit" numbers some more:

- The "js-non-window" memory (11 MiB per process) is all system JS code and
data, mostly modules in resource://gre/modules/. We create about 85 JS
system
compartments for these. A fraction of this is per-compartment overhead,
which
might be avoidable by merging them into a single compartment (see bug
1186409). (B2G did something similar a long time ago and saw big
improvements.)

Even if we can fix that, it's just a lot of JS code. We can lazily import
JSMs; I wonder if we are failing to do that as much as we could, i.e. are
all these modules really needed at start-up? It would be great if we
could instrument module-loading code in some way that answers this
question.

- "heap-unclassified" memory is 4.5 MiB per process. I've analyzed this with
DMD and this is mostly GTK and glib memory that we can't measure in our
memory reporters. I haven't investigated closely to see if any of this
could
be avoided.

- "heap-overhead" is 4 MiB per process. I've looked at this closely.
The numbers tend to be noisy.

- "page-cache" is pages that jemalloc holds onto for fast recycling. It is
capped at 4 MiB per process and we can reduce that with a jemalloc
configuration, though this may make allocation slightly slower.

- "bin-unused" is fragmentation in smaller allocations and very hard to
reduce.

- "bookkeeping" is jemalloc's internal data structures and very hard to
reduce.

- Then there's the not-so-long tail of things less than 1 MiB per process.
Some of these may be shrinkable with effort, or made shareable between
processes with effort. (E.g. I reduced xpti-working-set by 216 KiB per
process in bug 1249174, and I've heard that making it shared was considered
for B2G but never implemented.) It's getting into diminishing returns,
though.
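
As for the lazy-import sketch promised above: the mechanism we already have
is XPCOMUtils.defineLazyModuleGetter, which defers loading a JSM until its
getter is first touched. A minimal example ("Foo.jsm" is a made-up module
name):

    Components.utils.import("resource://gre/modules/XPCOMUtils.jsm");

    // Eager: parses and runs Foo.jsm at start-up in every content process.
    // Components.utils.import("resource://gre/modules/Foo.jsm");

    // Lazy: Foo.jsm is loaded only the first time this.Foo is used.
    XPCOMUtils.defineLazyModuleGetter(this, "Foo",
                                      "resource://gre/modules/Foo.jsm");

Every module converted this way moves its cost from each content process's
start-up to first use, or avoids it entirely if the module is never needed
in that process.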

-----------------------------------------------------------------------------
LINUX (64-bit), large processes
-----------------------------------------------------------------------------

> 115.98 MB (100.0%) -- explicit
> ├───66.80 MB (57.60%) -- js-non-window
> │ ├──39.31 MB (33.90%) -- runtime
> │ │ ├──32.69 MB (28.19%) -- gc
> │ │ │ ├──32.00 MB (27.59%) ── nursery-committed [4]
> │ │ │ └───0.69 MB (00.59%) ++ (3 tiny)
> │ │ ├───4.01 MB (03.46%) ── script-data [4]
> │ │ ├───1.80 MB (01.56%) ── atoms-table [4]
> │ │ └───0.80 MB (00.69%) ++ (9 tiny)
> │ ├──24.04 MB (20.73%) -- zones/zone(0xNNN)
> │ │ ├──19.59 MB (16.90%) ++ (98 tiny)
> │ │ ├───2.35 MB (02.03%) ++ strings
> │ │ └───2.10 MB (01.81%) ── unused-gc-things [12]
> │ └───3.45 MB (02.97%) -- gc-heap
> │ ├──3.00 MB (02.59%) ── unused-chunks [4]
> │ └──0.45 MB (00.38%) ++ (2 tiny)
> ├───19.93 MB (17.19%) -- heap-overhead
> │ ├──11.53 MB (09.94%) ── bin-unused [4]
> │ ├───6.96 MB (06.00%) ── page-cache [4]
> │ └───1.44 MB (01.24%) ── bookkeeping [4]
> ├───15.44 MB (13.31%) ── heap-unclassified
> ├────3.16 MB (02.73%) ++ window-objects
> ├────4.40 MB (03.80%) ++ (12 tiny)
> ├────2.84 MB (02.45%) ── xpti-working-set [4]
> ├────2.24 MB (01.93%) ++ layout
> └────1.17 MB (01.01%) ++ xpconnect
>
> 362.36 MB ── resident [4]
> 157.92 MB ── resident-unique [4]

The "explicit" overhead is now 39 MiB per process, and for "resident-unique"
it's 53 MiB per process. The gap between the two is 14 MiB, similar to
before,
so that's additional evidence that static data accounts for the gap.

Both of those overheads are about 16 MiB higher than in the "small
processes" case. It's mostly JS, especially "nursery-committed" -- it looks
like all four content processes have 8 MiB nurseries. I know that on B2G we
allowed much smaller nurseries (256 KiB?), so shrinking the nursery as we
increase the number of content processes would also be wise.

Other than JS, "heap-overhead" is a bit higher, and most of the other
buckets are relatively stable.

-----------------------------------------------------------------------------
WINDOWS (32-bit), small processes
-----------------------------------------------------------------------------

> 47.79 MB (100.0%) -- explicit
> ├──25.14 MB (52.60%) -- js-non-window
> │ ├──15.30 MB (32.02%) -- zones/zone(0xNNN)
> │ │ ├──12.36 MB (25.85%) ++ (94 tiny)
> │ │ ├───1.61 MB (03.37%) -- strings/string(<non-notable strings>)
> │ │ │ ├──1.22 MB (02.55%) -- gc-heap
> │ │ │ │ ├──1.22 MB (02.55%) ── latin1 [8]
> │ │ │ │ └──0.00 MB (00.00%) ── two-byte [4]
> │ │ │ └──0.39 MB (00.82%) ── malloc-heap/latin1 [8]
> │ │ └───1.34 MB (02.80%) ── unused-gc-things [12]
> │ ├──10.49 MB (21.96%) -- runtime
> │ │ ├───5.32 MB (11.13%) -- gc
> │ │ │ ├──5.00 MB (10.46%) ── nursery-committed [4]
> │ │ │ └──0.32 MB (00.67%) ++ (3 tiny)
> │ │ ├───3.36 MB (07.02%) ── script-data [4]
> │ │ ├───1.00 MB (02.10%) ── atoms-table [4]
> │ │ ├───0.52 MB (01.09%) ++ script-sources
> │ │ └───0.30 MB (00.62%) ++ (6 tiny)
> │ └──-0.66 MB (-1.37%) ++ gc-heap
> ├──11.47 MB (23.99%) -- heap-overhead
> │ ├───8.51 MB (17.80%) ── page-cache [4]
> │ ├───2.59 MB (05.42%) ── bin-unused [4]
> │ └───0.37 MB (00.77%) ── bookkeeping [4]
> ├───4.43 MB (09.26%) ── heap-unclassified
> ├───2.27 MB (04.76%) ── xpti-working-set [4]
> ├───1.29 MB (02.70%) ++ layout
> ├───0.81 MB (01.69%) ++ (10 tiny)
> ├───0.81 MB (01.69%) ── preferences [4]
> ├───0.56 MB (01.18%) ++ xpcom
> ├───0.53 MB (01.11%) ++ xpconnect
> └───0.49 MB (01.02%) ++ atom-tables
>
> 33.35 MB (100.0%) -- heap-committed
> ├──21.89 MB (65.62%) ── allocated [4]
> └──11.47 MB (34.38%) ── overhead [4]
>
> 25.14 MB (100.0%) -- js-main-runtime
> ├──11.57 MB (46.05%) ++ compartments
> ├──10.49 MB (41.74%) ── runtime [4]
> ├───3.73 MB (14.82%) ++ zones
> └──-0.66 MB (-2.61%) ++ gc-heap
>
> 264 (100.0%) -- js-main-runtime-compartments
> ├──258 (97.73%) ++ system
> └────6 (02.27%) ++ user
>
> 21.89 MB ── heap-allocated [4]
> 151.89 MB ── private [4]
> 222.57 MB ── resident [4]
> 119.76 MB ── resident-unique [4]

The numbers are lower here than for Linux because it's 32-bit and so
pointers are smaller, but the same basic patterns apply. Differences of note:

- The difference between "explicit" and "resident-unique" per process is 25
MiB, as opposed to the 15 MiB we saw on Linux. I don't know why. Does
Windows have some inherent per-process memory cost that is higher than
Linux's?

- "private" and "resident-unique" are significantly different. I'm not sure
what to make of that.

- "heap-unclassified" is a lot lower.

The "large processes" numbers for Windows don't show much interesting beyond
what we've already seen.

-----------------------------------------------------------------------------
MAC (64-bit), small processes
-----------------------------------------------------------------------------

> 64.31 MB (100.0%) -- explicit
> ├──33.40 MB (51.93%) -- js-non-window
> │ ├──23.00 MB (35.76%) -- zones/zone(0xNNN)
> │ │ ├──16.53 MB (25.70%) ++ (90 tiny)
> │ │ ├───1.97 MB (03.07%) ── unused-gc-things [12]
> │ │ ├───1.71 MB (02.66%) ++ strings/string(<non-notable strings>)
> │ │ ├───0.78 MB (01.22%) ++ object-groups
> │ │ ├───0.67 MB (01.05%) ++ compartment([System Principal], Addon-SDK (from: resource://gre/modules/commonjs/toolkit/loader.js:249))
> │ │ ├───0.67 MB (01.04%) ++ compartment(moz-nullprincipal:{NNNNNNNN-NNNN-NNNN-NNNN-NNNNNNNNNNNN}, XPConnect Compilation Compartment)
> │ │ └───0.66 MB (01.03%) ++ compartment([System Principal], resource://gre/modules/commonjs/toolkit/loader.js)
> │ ├───6.97 MB (10.84%) -- runtime
> │ │ ├──3.72 MB (05.78%) ── script-data [4]
> │ │ ├──1.34 MB (02.08%) ++ gc
> │ │ ├──1.05 MB (01.64%) ── atoms-table [4]
> │ │ └──0.86 MB (01.34%) ++ (7 tiny)
> │ └───3.43 MB (05.34%) ++ gc-heap
> ├──10.92 MB (16.98%) ── heap-unclassified
> ├──10.06 MB (15.65%) -- heap-overhead
> │ ├───6.38 MB (09.92%) ── page-cache [4]
> │ ├───2.89 MB (04.50%) ── bin-unused [4]
> │ └───0.79 MB (01.22%) ── bookkeeping [4]
> ├───2.84 MB (04.41%) ── xpti-working-set [4]
> ├───2.01 MB (03.12%) ++ layout
> ├───1.37 MB (02.14%) ++ (9 tiny)
> ├───1.12 MB (01.74%) ── preferences [4]
> ├───1.02 MB (01.59%) ++ xpconnect
> ├───0.81 MB (01.25%) ++ atom-tables
> └───0.77 MB (01.20%) ++ xpcom
>
> 44.27 MB (100.0%) -- heap-committed
> ├──34.21 MB (77.27%) ── allocated [4]
> └──10.06 MB (22.73%) ── overhead [4]
>
> 33.40 MB (100.0%) -- js-main-runtime
> ├──17.76 MB (53.16%) ++ compartments
> ├───6.97 MB (20.87%) ── runtime [4]
> ├───5.24 MB (15.69%) ++ zones
> └───3.43 MB (10.28%) ++ gc-heap
>
> 261 (100.0%) -- js-main-runtime-compartments
> ├──255 (97.70%) ++ system
> └────6 (02.30%) ++ user
>
> 282.06 MB ── resident [4]
> 147.94 MB ── resident-unique [4]

The difference between "explicit" and "resident-unique" per process is 27
MiB, as opposed to the 15 MiB we saw on Linux and the 25 MiB we saw on
Windows. Again, I don't know why.

Other than that, the numbers are quite similar to the Linux numbers.

-----------------------------------------------------------------------------
Conclusion
-----------------------------------------------------------------------------

The overhead per content process is significant. I can see scope for
moderate improvements, but I'm having trouble seeing how big improvements
can be made. Without big improvements, scaling the number of content
processes beyond 4 (*maybe* 8) won't be possible.

- JS overhead is the biggest factor. We execute a lot of JS code just
starting up for each content process -- can that be reduced? We should also
consider a smaller nursery size limit for content processes.

- Heap overhead is significant. Reducing the page-cache size could save a
couple of MiBs. Improvements beyond that are hard. Turning on jemalloc4
*might* help a bit, but I wouldn't bank on it, and there are other
complications with that.

- Static data is a big chunk. It's hard to make much of a dent there because
it has a *very* long tail.

- The remaining buckets are a lot smaller.

I'm happy to give copies of the raw data files to anyone who wants to look
at them in more detail.

Nick

Thinker Li

Mar 17, 2016, 4:05:15 AM
On Wednesday, March 16, 2016 at 10:22:40 PM UTC+8, Nicholas Nethercote wrote:
> Even if we can fix that, it's just a lot of JS code. We can lazily import
> JSMs; I wonder if we are failing to do that as much as we could, i.e. are
> all these modules really needed at start-up? It would be great if we
> could instrument module-loading code in some way that answers this
> question.

B2G also dropped JS source, on the Tarako branch, since the source is useless for a loaded module except for stringifying functions. (Gecko compresses in-memory source.) But I am not sure whether that ever landed on m-c.

>
> - "heap-unclassified" memory is 4.5 MiB per process. I've analyzed this with
> DMD and this is mostly GTK and glib memory that we can't measure in our
> memory reporters. I haven't investigated closely to see if any of this
> could
> be avoided.
>
> - "heap-overhead" is 4 MiB per process. I've looked at this closely.
> The numbers tend to be noisy.
>
> - "page-cache" is pages that jemalloc holds onto for fast recycling. It is
> capped at 4 MiB per process and we can reduce that with a jemalloc
> configuration, though this may make allocation slightly slower.
>
> - "bin-unused" is fragmentation in smaller allocations and very hard to
> reduce.
>
> - "bookkeeping" is jemalloc's internal data structures and very hard to
> reduce.
>
> - Then there's the not-so-long tail of things less than 1 MiB per process.
> Some of these may be shrinkable with effort, or made shareable between
> processes with effort. (E.g. I reduced xpti-working-set by 216 KiB per
> process in bug 1249174, and I've heard that making it shared was
> considered
> for B2G but never implemented.) It's getting into diminishing returns,
> though.

xpti sharing was implemented for B2G. It would be easy to enable on Linux and Mac, but I am not sure about Windows.

I guess preferences would be worth sharing too. Maybe the atoms-table as well!

Nicolas B. Pierron

Mar 17, 2016, 9:50:23 AM
On 03/17/2016 08:05 AM, Thinker Li wrote:
> On Wednesday, March 16, 2016 at 10:22:40 PM UTC+8, Nicholas Nethercote wrote:
>> Even if we can fix that, it's just a lot of JS code. We can lazily import
>> JSMs; I wonder if we are failing to do that as much as we could, i.e. are
>> all these modules really needed at start-up? It would be great if we
>> could instrument module-loading code in some way that answers this
>> question.
>
> B2G also did dropping JS source, for Tarako branch, since source is useless for loaded module save for stringify functions. (Gecko compress in-memory source.) But, I am not sure if it was landed on m-c then.

Note, this worked on B2G, but it would not work for Gecko. For example, all
tab add-ons have to use toSource to patch JS functions.

Source compression should already be enabled. I think we skip it for small
sources, and for huge sources, as the compression would either be useless or
take a noticeable amount of time.

--
Nicolas B. Pierron

Boris Zbarsky

Mar 17, 2016, 9:59:50 AM
On 3/17/16 9:50 AM, Nicolas B. Pierron wrote:
> Note, this worked on B2G, but this would not work for Gecko. For example
> all tabs addons have to use toSource to patch the JS functions.

Note that we do have the capability to lazily load the source from disk
when someone does this, and we do use it in Gecko for some things. We
could use it for more things....

-Boris

Ben Kelly

Mar 17, 2016, 10:04:27 AM
to Nicolas B. Pierron, dev-pl...@lists.mozilla.org
On Thu, Mar 17, 2016 at 9:50 AM, Nicolas B. Pierron <nicolas....@mozilla.com> wrote:

> Source compressions should already be enabled. I think we do not do it
> for small sources, and for Huge sources, as the compression would either be
> useless, or it would take a noticeable amount of time.
>

I think Luke suggested that we could compress larger JS sources off the
main thread if we implemented this bug:

https://bugzilla.mozilla.org/show_bug.cgi?id=1001231

It's been in my queue for 2 years, unfortunately. If anyone wants to make
that happen, please feel free to steal it. :-)

Ben

Till Schneidereit

Mar 17, 2016, 10:08:22 AM
to Nicholas Nethercote, Gabor Krizsanits, dev-platform
I filed bug 876173[1] about this a long time ago. Recently, I talked to
Gabor, who's started looking into enabling multiple content processes.

One other thing we should be able to do is share the self-hosting
compartment, as we already do between runtimes within a process. It's not
that big, but it's not nothing, either.

till


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=876173
[ lots of analysis omitted to not get caught in the 40kb+ moderation queue ]

Gabriele Svelto

Mar 17, 2016, 10:33:36 AM
to Nicholas Nethercote, dev-platform
On 15/03/2016 04:34, Nicholas Nethercote wrote:
> - "heap-overhead" is 4 MiB per process. I've looked at this closely.
> The numbers tend to be noisy.
>
> - "page-cache" is pages that jemalloc holds onto for fast recycling. It is
> capped at 4 MiB per process and we can reduce that with a jemalloc
> configuration, though this may make allocation slightly slower.

We aggressively got rid of that on B2G by sending memory-pressure events to
apps that were unused. We did have the advantage there that we had only one
page per process, so establishing whether one was not being used was very
easy. On desktop Firefox we might consider trying to minimize the memory
usage of processes which do not have active tabs (e.g. none of the tabs is
visible, or none of the tabs has received input for a while).
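
The notification itself is just an observer-service broadcast; from
chrome-privileged JS it amounts to something like this sketch:

    Components.utils.import("resource://gre/modules/Services.jsm");

    // Ask all "memory-pressure" observers (caches, jemalloc, etc.) to
    // release what they can; "heap-minimize" requests the aggressive path.
    Services.obs.notifyObservers(null, "memory-pressure", "heap-minimize");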

Besides the immediate memory-usage reduction, this had the important
side-effect of reducing steady-state consumption. A lot of the structures
and caches that were purged had often been bloated by transient data
required only during startup. Once minimized they would start to grow again
when a process became active, but never as much as before the minimization.

Gabriele


David Rajchenbach-Teller

Mar 17, 2016, 11:30:05 AM
to dev-pl...@lists.mozilla.org
I seem to remember that our ChromeWorkers (SessionWorker,
PageThumbsWorker, OS.File Worker) were pretty memory-hungry, but I don't
see any workers there. Does this mean that they have negligible overhead
or that they are only in the parent process?

Cheers,
David

Nicholas Nethercote

Mar 17, 2016, 6:18:39 PM
to David Rajchenbach-Teller, dev-platform
On Fri, Mar 18, 2016 at 2:29 AM, David Rajchenbach-Teller <dte...@mozilla.com> wrote:
>
> I seem to remember that our ChromeWorkers (SessionWorker,
> PageThumbsWorker, OS.File Worker) were pretty memory-hungry, but I don't
> see any workers there. Does this mean that they have negligible overhead
> or that they are only in the parent process?

I checked the data again: they are only in the parent process, so they
don't affect content process scaling. And they're not *that* big -- here's
the biggest I saw in my data (from the Mac "large processes" data):

> 6.33 MB (04.00%) -- workers/workers(chrome)
> ├──2.15 MB (01.36%) ++ worker(resource://gre/modules/osfile/osfile_async_worker.js, 0x113881800)
> ├──2.11 MB (01.33%) ++ worker(resource:///modules/sessionstore/SessionWorker.js, 0x1297e7800)
> └──2.06 MB (01.30%) ++ worker(resource://gre/modules/PageThumbsWorker.js, 0x1169c1000)

Nick

Nicholas Nethercote

Mar 21, 2016, 12:51:10 AM
to dev-platform
On Tue, Mar 15, 2016 at 2:34 PM, Nicholas Nethercote <n.neth...@gmail.com> wrote:
>
> -----------------------------------------------------------------------------
> Conclusion
> -----------------------------------------------------------------------------
>
> The overhead per content process is significant. I can see scope for
> moderate improvements, but I'm having trouble seeing how big improvements
> can be made. Without big improvements, scaling the number of content
> processes beyond 4 (*maybe* 8) won't be possible.
>
> - JS overhead is the biggest factor. We execute a lot of JS code just
>   starting up for each content process -- can that be reduced? We should
>   also consider a smaller nursery size limit for content processes.
>
> - Heap overhead is significant. Reducing the page-cache size could save a
>   couple of MiBs. Improvements beyond that are hard. Turning on jemalloc4
>   *might* help a bit, but I wouldn't bank on it, and there are other
>   complications with that.
>
> - Static data is a big chunk. It's hard to make much of a dent there
>   because it has a *very* long tail.
>
> - The remaining buckets are a lot smaller.

Just to expand upon that, here are the top-level numbers for all three
platforms, both small and large processes. For this computation I assumed
that "explicit" memory is entirely a subset of "resident-unique", which is
probably true or very close to it. (Note: this data looks best with a
fixed-width font.)

Linux64, small processes
- resident-unique      38.1 MiB (100%)
  - explicit           22.8 MiB ( 60%)
    - js-non-window    11.2 MiB ( 29%)
    - other             7.8 MiB ( 20%)
    - heap-overhead     3.8 MiB ( 10%)
  - static?            15.3 MiB ( 40%)

Linux64, large processes
- resident-unique      52.6 MiB (100%)
  - explicit           38.7 MiB ( 74%)
    - js-non-window    22.3 MiB ( 42%)
    - other             9.8 MiB ( 19%)
    - heap-overhead     6.6 MiB ( 13%)
  - static?            13.9 MiB ( 26%)

Mac64, small processes
- resident-unique      49.3 MiB (100%)
  - static?            27.9 MiB ( 57%)
  - explicit           21.4 MiB ( 43%)
    - js-non-windows   11.1 MiB ( 23%)
    - other             6.9 MiB ( 14%)
    - heap-overhead     3.4 MiB (  7%)

Mac64, large processes
- resident-unique      59.4 MiB (100%)
  - explicit           30.1 MiB ( 51%)
    - js-non-windows   15.7 MiB ( 26%)
    - heap-overhead     7.7 MiB ( 13%)
    - other             6.7 MiB ( 11%)
  - static?            29.3 MiB ( 49%)

Win32, small processes
- resident-unique      39.3 MiB (100%)
  - static?            23.4 MiB ( 60%)
  - explicit           15.9 MiB ( 40%)
    - js-non-windows    8.4 MiB ( 21%)
    - heap-overhead     3.8 MiB ( 10%)
    - other             3.7 MiB (  9%)

Win32, large processes
- resident-unique      51.6 MiB (100%)
  - explicit           28.5 MiB ( 55%)
    - js-non-windows   16.1 MiB ( 31%)
    - heap-overhead     6.8 MiB ( 13%)
    - other             5.6 MiB ( 11%)
  - static?            23.1 MiB ( 45%)

The "resident-unique" increases by 38--59 MiB per content process. That's a
bit
lower than erahm got in his measurements, possibly because his methodology
involved doing a lot more work in each content process.

Of that increase:
- "static?" accounts for 26--60%
- "explicit/js-non-windows" accounts for 21--42%
- "explicit/heap-overhead" accounts for 7--13%
- "explicit/other" (everything not accounted for by the above three lines)
accounts for 9--20%

About the "static?" measure -- On Linux64, libxul contains about 5.3 MiB of
static data. Other libraries used by Firefox contain much less. So I don't
know
what else is being measured in the "static?" number (i.e. what accounts for
the
change in difference between "resident-unique" and "explicit").
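
To make the puzzle concrete with the Linux64 "small processes" numbers:

    // "static?" here is whatever resident-unique memory the memory
    // reporters don't classify as "explicit".
    const residentUniquePerProc = 38.1;  // MiB, from the table above
    const explicitPerProc       = 22.8;  // MiB, from the table above
    const staticGuess = residentUniquePerProc - explicitPerProc;  // 15.3 MiB
    const libxulData = 5.3;              // MiB of libxul static data
    console.log(staticGuess - libxulData);  // ~10 MiB still unaccounted for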

Nick

Nicholas Nethercote

Apr 14, 2016, 2:35:55 AM
to dev-platform
On Mon, Mar 21, 2016 at 3:50 PM, Nicholas Nethercote <n.neth...@gmail.com> wrote:
>
> - Heap overhead is significant. Reducing the page-cache size could save a
> couple of MiBs. Improvements beyond that are hard. Turning on jemalloc4
> *might* help a bit, but I wouldn't bank on it, and there are other
> complications with that.

I reduced the page-cache size from 4 MiB to 1 MiB in bug 1258257,
saving up to 3 MiB per process. There was no discernible performance
impact.

> On Linux64, libxul contains about 5.3 MiB of static data.

I've done some work to reduce this, as has Nathan Froyd, mostly under
bug 1254777.

I just did local Linux64 builds of the release branch and
mozilla-inbound. The 'data' measurement provided by the |size| utility
has dropped from 5,515,676 to 4,683,616 bytes, a reduction of 832,060
bytes.

I also double-checked this by enabling memory.system_memory_reporter
(which provides detailed OS-level memory measurements, Linux-only) and
then looking at the appropriate "libxul.so/[rw-p]" entry in
about:memory. The change there was from 5,472,256 to 4,685,824 bytes,
a reduction of 786,432 bytes. I'm not sure why these numbers don't
quite match the |size| numbers -- a difference between the on-disk and
in-memory representations, perhaps? Nonetheless, they're similar.
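
For anyone reproducing that second measurement, flipping the pref from the
browser console is enough -- a sketch:

    Components.utils.import("resource://gre/modules/Services.jsm");
    // Enable the Linux-only OS-level memory reporters, then reload
    // about:memory and look for the libxul.so "[rw-p]" entries.
    Services.prefs.setBoolPref("memory.system_memory_reporter", true);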


So that's the good news. The bad news is that (a) there's still a long way
to go before we can reasonably ship with more than 2 or perhaps 4 content
processes enabled, (b) the changes above represent the lowest-hanging fruit
I could find, and (c) I will have very limited time to work further on this
in the medium term.

Nick