In crashkill we have been tracking crashes that occur in low-memory
situations for a while. However, we are seeing a troubling uptick of
issues in Firefox 23 and then 25. I believe that some people may not be
able to use Firefox because of these bugs, and I think that we should be
reacting more strongly to diagnose and solve these issues and get any
fixes that already exist sent up the trains.
Followup to dev-platform, please.
= Data and Background =
Some anecdotal evidence:
Bug 930797 is a user who just upgraded to Firefox 25 and is seeing these
crashes a lot.
Bug 937290 is another user who just upgraded to Firefox 25 and is seeing
a bunch of crashes, some of which are empty-dump and some of which are
all over the place (maybe OOM crashes).
See also a recent thread "How to track down why Firefox is crashing so
much." in firefox-dev, where two additional users are reporting
consistent issues (one on Mac, one on Windows).
Note that in many cases, the user hasn't actually run out of memory:
they have plenty of physical memory and page file available. In most
cases they also have enough available VM space! Often, however, this VM
space is fragmented to the point where normal allocations (64k jemalloc
heap blocks, or several-megabyte graphics or network buffers) cannot be
made. Because of work done during the recent tree closure, we now have
this measurement in about:memory (on Windows) as vsize-max-contiguous.
It is also being computed for Windows crashes on crash-stats for clients
that are new enough (win7+).
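
To make the fragmentation point concrete, here is a toy sketch of the difference between total free VM space and the largest contiguous free block. It is not how about:memory computes vsize-max-contiguous; the function name and the address-space layout are invented for illustration.

```python
MB = 1024 * 1024


def largest_free_block(used_regions, address_space_size):
    """Return (total_free, largest_contiguous_free) in bytes.

    used_regions is an iterable of (start, size) pairs describing
    reserved address ranges. This is a hypothetical model, not real
    process data.
    """
    total_free = 0
    largest = 0
    cursor = 0  # end of the highest reserved range seen so far
    for start, size in sorted(used_regions):
        if start > cursor:
            gap = start - cursor
            total_free += gap
            largest = max(largest, gap)
        cursor = max(cursor, start + size)
    # Account for free space after the last reserved range.
    if cursor < address_space_size:
        gap = address_space_size - cursor
        total_free += gap
        largest = max(largest, gap)
    return total_free, largest


# Hypothetical layout: a 1 MB reservation every 4 MB across a 2 GB space.
used = [(i * 4 * MB, 1 * MB) for i in range(512)]
total_free, largest = largest_free_block(used, 2048 * MB)
print(total_free // MB, "MB free in total;", largest // MB, "MB largest block")
```

In this made-up layout three quarters of the address space is free, yet no allocation larger than a few megabytes can succeed.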
Unfortunately, when we are out of memory, crash reports often come back
as empty minidumps (because the crash reporter has to allocate memory
and/or VM space to create minidumps). We believe that most of the
empty-minidump crashes present on crash-stats are in fact also
out-of-memory crashes.
I've been creating reports about OOM crashes using crash-stats and found
some startling data:
Looking just at the Windows crashes from last Friday (22-Nov):
* probably not OOM: 91565
* probably OOM: 57841
* unknown (not enough data because they are running an old version of
Windows that doesn't report VM information in crash reports): 150874
The criteria for "probably OOM" are:
* Has an OOMAllocationSize annotation, meaning jemalloc aborted an
infallible allocation
* Has "ABORT: OOM" in the app notes meaning XPCOM aborted in infallible
string/hashtable/array code
* Has <50MB of contiguous free VM space
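
The criteria above can be sketched roughly as follows. This is not the linked jydoop script. The dictionary keys `AppNotes` and `LargestFreeVMBlock` are placeholder names, and treating the three criteria as "any one is sufficient" is an assumption.

```python
# The "<50MB of contiguous free VM space" criterion, in bytes.
CONTIGUOUS_VM_THRESHOLD = 50 * 1024 * 1024


def classify(report):
    """Classify one crash report, given as a dict of annotations.

    Key names other than OOMAllocationSize are hypothetical, and
    checking the criteria in this order is an assumption.
    """
    if "OOMAllocationSize" in report:
        return "probably OOM"
    if "ABORT: OOM" in report.get("AppNotes", ""):
        return "probably OOM"
    largest = report.get("LargestFreeVMBlock")  # hypothetical key name
    if largest is None:
        # No VM information in the report (e.g. older Windows versions).
        return "unknown"
    if int(largest) < CONTIGUOUS_VM_THRESHOLD:
        return "probably OOM"
    return "probably not OOM"


print(classify({"OOMAllocationSize": "56"}))
print(classify({"LargestFreeVMBlock": str(500 * 1024 * 1024)}))
print(classify({}))
```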
Among the crashes we can classify, this data seems to indicate that
almost 40% (57841 of 149406) are due to OOM conditions.
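
For anyone who wants to check the arithmetic, a short sketch using only the three counts listed above (the variable names are mine):

```python
probably_not_oom = 91565
probably_oom = 57841
unknown = 150874

classified = probably_not_oom + probably_oom
total = classified + unknown

# Share of OOM among crashes that could be classified at all.
oom_share_of_classified = probably_oom / classified
# Lower bound on the OOM share if no "unknown" crash were OOM.
oom_share_of_total = probably_oom / total
unknown_share_of_total = unknown / total

print(f"OOM share of classified crashes: {oom_share_of_classified:.1%}")
print(f"OOM share of all Windows crashes: {oom_share_of_total:.1%}")
print(f"Unknown share of all Windows crashes: {unknown_share_of_total:.1%}")
```

The "almost 40%" figure is the share among classified crashes. Because about half of the day's crashes are unclassified, the true share across all Windows crashes could be noticeably different.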
Because one of the long-term possibilities discussed for solving this
issue is releasing a 64-bit version of Firefox, I additionally broke
down the "OOM" crashes into users running a 32-bit version of Windows
and users running a 64-bit version of Windows:
OOM,win64,15744
OOM,win32,42097
I did this by checking the "TotalVirtualMemory" annotation in the crash
report: if it reports 4G of TotalVirtualMemory, then the user has a
64-bit Windows, and if it reports either 2G or 3G, the user is running a
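32-bit Windows. Roughly 73% of the OOM crashes (42097 of 57841) come
from users on 32-bit Windows, who cannot run a 64-bit build at all. So I
do not expect that shipping Firefox for win64 will help most of the
users who are already experiencing memory issues, although it may well
help new users and users who are running memory-intensive applications
such as games.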
Scripts for this analysis at
https://github.com/mozilla/jydoop/blob/master/scripts/oom-classifier.py
if you want to see what it's doing.
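
As a rough illustration of the TotalVirtualMemory check described above (separate from the linked script), here is a minimal sketch. It assumes the annotation is a byte count and rounds to the nearest gigabyte in case the reported value sits slightly below an exact boundary; the function name is made up.

```python
GB = 1024 ** 3


def windows_bitness(total_virtual_memory):
    """Guess the OS bitness seen by a 32-bit Firefox process.

    total_virtual_memory is the TotalVirtualMemory annotation, assumed
    to be a byte count (int or numeric string).
    """
    gb = round(int(total_virtual_memory) / GB)
    if gb == 4:
        return "win64"  # 32-bit process on 64-bit Windows
    if gb in (2, 3):
        return "win32"  # 32-bit Windows (3G with the larger user-space setting)
    return "unknown"


oom_counts = {"win64": 15744, "win32": 42097}
win32_share = oom_counts["win32"] / sum(oom_counts.values())
print(windows_bitness(4 * GB), windows_bitness(2 * GB))
print(f"win32 share of OOM crashes: {win32_share:.1%}")
```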
= Next Steps =
As far as I can tell, there are several basic problems that we should be
tackling. For now, I'm going to brainstorm some ideas and hope that
people will react or take on some of these items.
== Measurement ==
* Move minidump collection out of the Firefox process. This is something
we've been talking about for a while but apparently never filed, so it's
now filed as
https://bugzilla.mozilla.org/show_bug.cgi?id=942873
* Develop a tool/instructions for users to profile the VM allocations in
their Firefox process. We know that many of the existing VM problems are
graphics-related, but we're not sure exactly who is making the
allocations, and whether they are leaks, cached textures, or other
things, and whether it's Firefox code, Windows code, or driver code
causing problems. I know dmajor is working on some xperf logging for
this, and we should probably try to expand that out into something that
we can ask end users who are experiencing problems to run.
* The about:memory patches which add contiguous-vm measurement should
probably be uplifted to Fx26, along with any other measurement tools
that would be valuable diagnostics.
== VM fragmentation ==
Bug 941837 identified a bad VM allocation pattern in our JS code which
was causing 1MB VM fragmentation. Getting this patch uplifted seems
important. But I know that several other things landed as a part of
fixing the recent tree closure: has anyone identified whether any of the
other patches here could be affecting release users and should be uplifted?
== Graphics Solutions ==
The issues reported in bug 930797 at least appear to be related to HTML5
<video> rendering. The STR aren't precise, but it seems that we should
try and understand and fix the issue reported by that user. Disabling
hardware acceleration does not appear to help.
Bas has a bunch of information in bug 859955 about degenerate behavior
of graphics drivers: they often map textures into the Firefox process,
and sometimes cache the latest N textures (N=200 in one test) no matter
what the texture size is. I have a feeling that we need to do something
here, but it's not clear what. Perhaps it's driver-specific workarounds,
or blacklisting old driver versions, or working with driver vendors to
have better behavior.
== Dealing with OOM crash sites ==
Currently we still have a fair number of call sites that crash in an
infallible allocation or after an allocation failure, even though the
allocations are potentially large or huge. In general, infallible allocation should
only be used for fixed-size quantities (C++ classes). Any arrays where
the count is controlled by content, or large buffers for graphics or
networking data should be allocated using fallible allocators,
null-checked, and the system should propagate failure.
I am working on generating some reports on existing crashes where
OOMAllocationSize is variable, and also crash signatures that correlate
highly with OOM conditions. We should fix these sites.
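
One possible shape for such a report, as a sketch only: group crashes by signature and keep the signatures whose OOMAllocationSize values differ between reports. Reading "variable" as "min and max differ" is my assumption here, and the signatures in the sample data are invented.

```python
from collections import defaultdict


def variable_size_signatures(crashes, min_reports=2):
    """Find signatures whose OOM allocation sizes vary between reports.

    crashes is an iterable of (signature, oom_allocation_size) pairs.
    Returns {signature: (min_size, max_size, report_count)}.
    """
    sizes = defaultdict(list)
    for signature, size in crashes:
        sizes[signature].append(int(size))
    result = {}
    for signature, values in sizes.items():
        if len(values) >= min_reports and min(values) != max(values):
            result[signature] = (min(values), max(values), len(values))
    return result


# Invented sample data for illustration only.
sample = [
    ("hypothetical::GrowBuffer", 1048576),
    ("hypothetical::GrowBuffer", 67108864),
    ("hypothetical::NewNode", 56),
    ("hypothetical::NewNode", 56),
]
report = variable_size_signatures(sample)
for signature, (low, high, count) in report.items():
    print(signature, low, high, count)
```

Signatures with a wide size range are the likely candidates for fallible allocation, while fixed small sizes such as the 56-byte case point at general memory exhaustion instead.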
This is only a stopgap measure, because we see plenty of crashes where
OOMAllocationSize is very small (56 bytes), but it will help keep the
browser alive for longer and also foil some trivial DoS attacks.
== Regression ranges ==
Some of the issues appear to be recently introduced in Firefox 25. We
need to jump on regression ranges ASAP. I could really use help working
with users such as those identified at the top of this message to see if
there are regression ranges in nightly builds that cause more issues.
== Last-ditch UI ==
When contiguous VM starts getting low, we should probably warn the user
and ask them to restart Firefox soon or risk crashing. I know that this
sucks, but a warning before you crash at least gives you a chance to
save things. I have filed this as
https://bugzilla.mozilla.org/show_bug.cgi?id=942892
--BDS