Reacting more strongly to low-memory situations in Firefox 25


Benjamin Smedberg

Nov 25, 2013, 12:02:50 PM
to dev-pl...@lists.mozilla.org, Bas Schouten, David Major, Nathan Froyd, Firefox Dev
In crashkill we have been tracking crashes that occur in low-memory
situations for a while. However, we are seeing a troubling uptick of
issues in Firefox 23 and then 25. I believe that some people may not be
able to use Firefox because of these bugs, and I think that we should be
reacting more strongly to diagnose and solve these issues and get any
fixes that already exist sent up the trains.

Followup to dev-platform, please.

= Data and Background =

See, as some anecdotal evidence:

Bug 930797 is a user who just upgraded to Firefox 25 and is seeing these
a lot.
Bug 937290 is another user who just upgraded to Firefox 25 and is seeing
a bunch of crashes, some of which are empty-dump and some of which are
all over the place (maybe OOM crashes).
See also a recent thread "How to track down why Firefox is crashing so
much." in firefox-dev, where two additional users are reporting
consistent issues (one mac, one windows).

Note that in many cases, the user hasn't actually run out of memory:
they have plenty of physical memory and page file available. In most
cases they also have enough available VM space! Often, however, this VM
space is fragmented to the point where normal allocations (64k jemalloc
heap blocks, or several-megabyte graphics or network buffers) cannot be
made. Because of work done during the recent tree closure, we now have
this measurement in about:memory (on Windows) as vsize-max-contiguous.
It is also being computed for Windows crashes on crash-stats for clients
that are new enough (win7+).
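
To illustrate the difference between total free VM and the largest contiguous free block, here is a minimal Python sketch. The region list is made-up data and the names (free_vm_stats, Region) are hypothetical; the real about:memory measurement inspects the process address space on Windows rather than a Python list.

```python
from typing import List, Tuple

# Each region is (base_address, size_in_bytes, is_free). Hypothetical layout,
# sorted by base address.
Region = Tuple[int, int, bool]

MB = 1024 * 1024

def free_vm_stats(regions: List[Region]) -> Tuple[int, int]:
    """Return (total_free_bytes, largest_contiguous_free_bytes)."""
    total_free = 0
    largest = 0
    run = 0          # size of the current run of adjacent free regions
    prev_end = None  # end address of the previous region
    for base, size, is_free in regions:
        if is_free:
            # Adjacent free regions merge into a single contiguous run.
            if prev_end is not None and base == prev_end and run > 0:
                run += size
            else:
                run = size
            total_free += size
            largest = max(largest, run)
        else:
            run = 0
        prev_end = base + size
    return total_free, largest

# A fragmented layout: 200 free holes of 512 KB, each followed by a
# 512 KB block that is in use.
HOLE = 512 * 1024
fragmented: List[Region] = []
addr = 0
for _ in range(200):
    fragmented.append((addr, HOLE, True))
    addr += HOLE
    fragmented.append((addr, HOLE, False))
    addr += HOLE

total_free, largest = free_vm_stats(fragmented)
```

In this made-up layout 100 MB is free in total, yet the largest contiguous block is only 512 KB, so a single 1 MB request would fail even though "enough" VM space is available.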

Unfortunately, often when we are out of memory crash reports come back
as empty minidumps (because the crash reporter has to allocate memory
and/or VM space to create minidumps). We believe that most of the
empty-minidump crashes present on crash-stats are in fact also
out-of-memory crashes.

I've been creating reports about OOM crashes using crash-stats and found
some startling data:
Looking just at the Windows crashes from last Friday (22-Nov):
* probably not OOM: 91565
* probably OOM: 57841
* unknown (not enough data because they are running an old version of
Windows that doesn't report VM information in crash reports): 150874

The criteria for "probably OOM" are:
* Has an OOMAllocationSize annotation, meaning jemalloc aborted in an
infallible allocation
* Has "ABORT: OOM" in the app notes meaning XPCOM aborted in infallible
string/hashtable/array code
* Has <50MB of contiguous free VM space
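
A minimal sketch of how these criteria could be applied to one crash report follows. It is not the linked oom-classifier.py script. It assumes that any single criterion is sufficient, that 50MB means 50 MiB, and that the report is a plain dictionary; the keys app_notes and largest_free_vm_block are hypothetical names, while OOMAllocationSize is the annotation name used later in this message.

```python
from typing import Any, Dict

MB = 1024 * 1024
LOW_CONTIGUOUS_VM = 50 * MB  # "<50MB" criterion; MiB is an assumption

def classify(report: Dict[str, Any]) -> str:
    """Classify one crash report as 'oom', 'not-oom', or 'unknown'.

    Key names other than OOMAllocationSize are hypothetical.
    """
    # jemalloc aborted in an infallible allocation.
    if "OOMAllocationSize" in report:
        return "oom"
    # XPCOM aborted in infallible string/hashtable/array code.
    if "ABORT: OOM" in report.get("app_notes", ""):
        return "oom"
    largest_free = report.get("largest_free_vm_block")
    if largest_free is None:
        # Older Windows versions do not report VM information.
        return "unknown"
    if largest_free < LOW_CONTIGUOUS_VM:
        return "oom"
    return "not-oom"
```

Checking the explicit OOM markers first means a report from an old Windows version can still be classified as OOM even when the VM data is missing.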

This data seems to indicate that almost 40% of the Firefox crashes we
can classify (57841 of 149406) are due to OOM conditions.
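
As a quick check of that figure, the share can be recomputed from the three counts listed above. The variable names are illustrative only.

```python
# Windows crash counts for 22-Nov, as listed above.
not_oom = 91565
oom = 57841
unknown = 150874

# Share among crashes that could be classified at all.
share_of_classified = oom / (oom + not_oom)
# Lower bound: share among all Windows crashes, counting every
# "unknown" report as not OOM.
share_of_all = oom / (oom + not_oom + unknown)

print(f"OOM share of classifiable crashes: {share_of_classified:.1%}")  # 38.7%
print(f"OOM share of all Windows crashes:  {share_of_all:.1%}")         # 19.3%
```

The "almost 40%" figure therefore holds among classifiable reports; if none of the unknown reports were OOM, the share would be about 19%.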

Because one of the long-term possibilities discussed for solving this
issue is releasing a 64-bit version of Firefox, I additionally broke
down the "OOM" crashes into users running a 32-bit version of Windows
and users running a 64-bit version of Windows:

OOM,win64,15744
OOM,win32,42097

I did this by checking the "TotalVirtualMemory" annotation in the crash
report: if it reports 4G of TotalVirtualMemory, then the user has a
64-bit Windows, and if it reports either 2G or 3G, the user is running a
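32-bit Windows. Roughly 73% of these OOM crashes (42097 of 57841) come
from 32-bit Windows, which cannot run a 64-bit build, so I do not expect
that shipping Firefox for win64 will help most users who are already
experiencing memory issues, although it may well help new users and
users who are running memory-intensive applications such as games.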
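
The TotalVirtualMemory check described above can be sketched as follows. The function name is hypothetical, and rounding to the nearest GiB is an assumption made so that values slightly below the nominal 2G/3G/4G size still classify correctly.

```python
GB = 1024 ** 3

def windows_bitness(total_virtual_memory: int) -> str:
    """Infer the Windows bitness a 32-bit Firefox process is running on.

    4G of address space implies 64-bit Windows; 2G or 3G implies
    32-bit Windows. Anything else is left unclassified.
    """
    nominal_gb = round(total_virtual_memory / GB)
    if nominal_gb == 4:
        return "win64"
    if nominal_gb in (2, 3):
        return "win32"
    return "unknown"

# OOM crash counts from the breakdown above.
oom_win64 = 15744
oom_win32 = 42097
win32_share = oom_win32 / (oom_win32 + oom_win64)
print(f"OOM crashes on 32-bit Windows: {win32_share:.1%}")  # 72.8%
```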

Scripts for this analysis at
https://github.com/mozilla/jydoop/blob/master/scripts/oom-classifier.py
if you want to see what it's doing.

= Next Steps =

As far as I can tell, there are several basic problems that we should be
tackling. For now, I'm going to brainstorm some ideas and hope that
people will react or take some of these items.

== Measurement ==

* Move minidump collection out of the Firefox process. This is something
we've been talking about for a while but apparently never filed, so it's
now filed as https://bugzilla.mozilla.org/show_bug.cgi?id=942873
* Develop a tool/instructions for users to profile the VM allocations in
their Firefox process. We know that many of the existing VM problems are
graphics-related, but we're not sure exactly who is making the
allocations, and whether they are leaks, cached textures, or other
things, and whether it's Firefox code, Windows code, or driver code
causing problems. I know dmajor is working on some xperf logging for
this, and we should probably try to expand that out into something that
we can ask end users who are experiencing problems to run.
* The about:memory patches which add contiguous-vm measurement should
probably be uplifted to Fx26, along with any other measurement tools
that would be valuable diagnostics.

== VM fragmentation ==

Bug 941837 identified a bad VM allocation pattern in our JS code which
was causing 1MB VM fragmentation. Getting this patch uplifted seems
important. But I know that several other things landed as a part of
fixing the recent tree closure: has anyone identified whether any of the
other patches here could be affecting release users and should be uplifted?

== Graphics Solutions ==

The issues reported in bug 930797 at least appear to be related to HTML5
<video> rendering. The STR aren't precise, but it seems that we should
try and understand and fix the issue reported by that user. Disabling
hardware acceleration does not appear to help.

Bas has a bunch of information in bug 859955 about degenerate behavior
of graphics drivers: they often map textures into the Firefox process,
and sometimes cache the latest N textures (N=200 in one test) no matter
what the texture size is. I have a feeling that we need to do something
here, but it's not clear what. Perhaps it's driver-specific workarounds,
or blacklisting old driver versions, or working with driver vendors to
have better behavior.

== Dealing with OOM crash sites ==

Currently we still have a fair number of call sites that crash with
infallible allocation or after allocation failure where the allocations
are potentially large or huge. In general, infallible allocation should
only be used for fixed-size quantities (C++ classes). Any arrays where
the count is controlled by content, or large buffers for graphics or
networking data should be allocated using fallible allocators,
null-checked, and the system should propagate failure.
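
The allocate, null-check, propagate pattern can be illustrated with a small Python analogue. This is not Gecko code (which is C++); try_alloc and decode_image are hypothetical names, and the injectable alloc parameter exists only so that failure can be simulated.

```python
from typing import Callable, Optional

def try_alloc(size: int,
              alloc: Callable[[int], bytearray] = bytearray) -> Optional[bytearray]:
    """Fallible allocation: return None instead of aborting on failure."""
    try:
        return alloc(size)
    except (MemoryError, OverflowError):
        return None

def decode_image(width: int, height: int,
                 alloc: Callable[[int], bytearray] = bytearray) -> Optional[bytearray]:
    """Hypothetical caller whose buffer size is controlled by content."""
    buf = try_alloc(width * height * 4, alloc)  # 4 bytes per RGBA pixel
    if buf is None:
        # Propagate the failure so the caller can skip this image
        # rather than take down the whole process.
        return None
    return buf

def always_fails(size: int) -> bytearray:
    """Stand-in allocator that simulates an out-of-memory condition."""
    raise MemoryError

ok = decode_image(2, 2)
failed = decode_image(10000, 10000, always_fails)
```

Here ok is a 16-byte buffer and failed is None; the point is that the content-sized request reports failure to its caller instead of aborting.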

I am working on generating some reports on existing crashes where
OOMAllocationSize is variable, and also crash signatures that correlate
highly with OOM conditions. We should fix these sites.
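
One possible shape for such a report, as a sketch only: group crashes by signature and keep the signatures whose OOMAllocationSize takes several distinct values. The input format, the function name, and the sample signatures are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def variable_size_signatures(
    crashes: Iterable[Tuple[str, int]], min_distinct: int = 2
) -> Dict[str, List[int]]:
    """Map each signature to its sorted distinct OOMAllocationSize values,
    keeping only signatures with at least min_distinct different sizes."""
    sizes = defaultdict(set)
    for signature, alloc_size in crashes:
        sizes[signature].add(alloc_size)
    return {
        sig: sorted(vals)
        for sig, vals in sizes.items()
        if len(vals) >= min_distinct
    }

# Made-up (signature, OOMAllocationSize) pairs.
sample = [
    ("sig_a", 56), ("sig_a", 56),
    ("sig_b", 4096), ("sig_b", 8388608), ("sig_b", 67108864),
]
report = variable_size_signatures(sample)
```

In this sample sig_a always fails on the same small allocation and is dropped, while sig_b fails on varying sizes and would be a candidate for a fallible allocator.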

This is only a stopgap measure, because we see plenty of crashes where
OOMAllocationSize is very small (56 bytes), but it will help keep the
browser alive for longer and also foil some trivial DoS attacks.

== Regression ranges ==

Some of the issues appear to be recently introduced in Firefox 25. We
need to jump on regression ranges ASAP. I could really use help working
with users such as those identified at the top of this message to see if
there are regression ranges in nightly builds that cause more issues.

== Last-ditch UI ==

When contiguous VM starts getting low, we should probably warn the user
and ask them to restart Firefox soon or risk crashing. I know that this
sucks, but a warning before you crash at least gives you a chance to
save things. I have filed this as
https://bugzilla.mozilla.org/show_bug.cgi?id=942892

--BDS

_______________________________________________
firefox-dev mailing list
firef...@mozilla.org
https://mail.mozilla.org/listinfo/firefox-dev

Mike Hommey

Nov 25, 2013, 5:58:03 PM
to dev-pl...@lists.mozilla.org, David Major, Bas Schouten, Firefox Dev, Nathan Froyd
On Mon, Nov 25, 2013 at 12:02:50PM -0500, Benjamin Smedberg wrote:
> Note that in many cases, the user hasn't actually run out of memory:
> they have plenty of physical memory and page file available. In most
> cases they also have enough available VM space! Often, however, this
> VM space is fragmented to the point where normal allocations (64k
> jemalloc heap blocks, or several-megabyte graphics or network
> buffers)

jemalloc heap blocks are 1MB.

Mike

Bas Schouten

Nov 25, 2013, 8:15:44 PM
to dev-pl...@lists.mozilla.org, David Major, Firefox Dev, Nathan Froyd
I'm a little confused: when I force an OOM in my Firefox build on 64-bit Windows, it -definitely- goes down before it reaches more than 3GB of working set. Am I missing something here?

>
> Scripts for this analysis at
> https://github.com/mozilla/jydoop/blob/master/scripts/oom-classifier.py
> if you want to see what it's doing.
>
> = Next Steps =
>
> As far as I can tell, there are several basic problems that we should be
> tackling. For now, I'm going to brainstorm some ideas and hope that
> people will react or take of these items.
>

...

>
> == Graphics Solutions ==
>
> The issues reported in bug 930797 at least appear to be related to HTML5
> <video> rendering. The STR aren't precise, but it seems that we should
> try and understand and fix the issue reported by that user. Disabling
> hardware acceleration does not appear to help.
>
> Bas has a bunch of information in bug 859955 about degenerate behavior
> of graphics drivers: they often map textures into the Firefox process,
> and sometimes cache the latest N textures (N=200 in one test) no matter
> what the texture size is. I have a feeling that we need to do something
> here, but it's not clear what. Perhaps it's driver-specific workarounds,
> or blacklisting old driver versions, or working with driver vendors to
> have better behavior.

I should highlight something here: caching the last N textures only occurs in drivers which do -not- map into your address space, as far as I have seen in my testing. Intel stock drivers seem to map into your address space, but do -not- seem to do any caching. The most likely cause of the OOM here is simply that we currently keep both the texture and a RAM copy of any image in our image cache that has been drawn. This means that for users using Direct2D with these drivers, an image will use twice as much address space as for users using software rendering. We should probably alter imagelib to discard the RAM copy when hardware acceleration is in use; in that case actual address space usage should be the same for users with and without hardware acceleration.

For what it's worth, just to add some info to this: in my own experience on my machines, in most cases Firefox seems to climb to about 1.1-1.3 GB of memory usage fairly quickly (i.e. < 2 days of keeping it open), and then sort of stabilizes around that number. Usually when I look at about:memory in this case, it reports about 500 MB+ in JS and a surprising amount (150 MB) in VRAM usage for DrawTargets (this would be in our address space on the affected Intel machines). We should investigate the latter. This is usually with about 20 tabs open.

Bas

Bas Schouten

Nov 28, 2013, 8:20:41 PM
to dev-pl...@lists.mozilla.org, David Major, Firefox Dev, Nathan Froyd

----- Original Message -----
> From: "Bas Schouten" <bsch...@mozilla.com>
> To: dev-pl...@lists.mozilla.org
> Cc: "David Major" <dma...@mozilla.com>, "Nathan Froyd" <fro...@mozilla.com>, "Firefox Dev" <firef...@mozilla.org>
> Sent: Tuesday, November 26, 2013 1:15:44 AM
> Subject: Re: Reacting more strongly to low-memory situations in Firefox 25
>
>
> ----- Original Message -----
> > From: "Benjamin Smedberg" <benj...@smedbergs.us>
> > To: dev-pl...@lists.mozilla.org, "Bas Schouten" <bsch...@mozilla.com>,
> > "David Major" <dma...@mozilla.com>,
> > "Nathan Froyd" <fro...@mozilla.com>, "Firefox Dev"
> > <firef...@mozilla.org>
> > Sent: Monday, November 25, 2013 5:02:50 PM
> > Subject: Reacting more strongly to low-memory situations in Firefox 25
> >

...

> >
> > == Graphics Solutions ==
> >
> > The issues reported in bug 930797 at least appear to be related to HTML5
> > <video> rendering. The STR aren't precise, but it seems that we should
> > try and understand and fix the issue reported by that user. Disabling
> > hardware acceleration does not appear to help.
> >

...

>
> For what it's worth, just to add some info to this, in my own experience on
> my machines in most cases Firefox seems to climb to about 1.1-1.3 GB of
> memory usage fairly quickly (i.e. < 2 days of keeping it open), and sort of
> stabilize around that number. Usually when I do an about memory in this case
> my memory reports about 500 MB+ in JS, a surprising amount (150 MB) in VRAM
> usage for DrawTargets (this would be in our address space on the affected
> intel machines), we should investigate the latter. This is usually with
> about 20 tabs open.

I've filed a bug for the latter part of this (the higher than expected DrawTarget memory usage), bug 944562. I'll look into it there. Although I don't think the number is high enough to cause a lot of trouble, I don't understand it and would like to.

Best regards,