Quantum Flow Engineering Newsletter #9

Ehsan Akhgari

unread,

May 12, 2017, 2:27:55 AM5/12/17

to dev-pl...@lists.mozilla.org, Firefox Dev

Hi everyone,

It's been 10 weeks since I have started writing these newsletters (the
number in the title isn't an off by one error, there was a one week hiatus
due to a work week!). We still have quite a bit of work ahead of us, but
we have also accomplished a good amount. Finding a good metric for
progress is hard, but we live and breathe in Bugzilla, so we use a bug-based
burn-down chart
<https://people-mozilla.org/%7Ecpeterson/bb/burndown.html?whiteboard=[qf:p1]&Since=2017-01-01>.
As you can see, we are starting to see a decrease in the number of open
bugs, and this is as we are actively adding tens of new bugs to the pool in
the weekly triage meetings.

The other thing that this burn-down chart shows is that we need help! Very
recently Kan-Ru came up with the great idea of creating the
qf-bugs-upforgrabs
<https://bugzilla.mozilla.org/show_bug.cgi?id=qf-bugs-upforgrabs> tracker
bug. These are reasonably self-contained bugs that require less specific
domain knowledge and can be worked on by anyone in a reasonable time
frame. Please consider taking a look at the dependency list of that bug to
see if something interests you! (The similarity of this tacker bug to
photon-perf-upforgrabs
<https://bugzilla.mozilla.org/show_bug.cgi?id=photon-perf-upforgrabs> isn't
an accident!)

On the telemetry hang reports data collection, the new data from hangs of
128ms or longer have been coming in, but there have been some wrinkles in
actually receiving this data, and also in receiving the hang data
correlated to user interactivity
<https://bugzilla.mozilla.org/show_bug.cgi?id=1353440>. Michael Layzell
has been tirelessly at work on the BHR backend to make it suit our needs,
and has been discovering the edges of computation limits in order to
symbolicate the BHR reports on people.mozilla.org (now moved to AWS
<https://s3-us-west-2.amazonaws.com/bhr-data/index.html>!).

I realized we haven't had a performance mini-story for a while -- I sort of
dropped the ball on that. Running over this bug
<https://bugzilla.mozilla.org/show_bug.cgi?id=1362886> made me want to talk
about a pretty well known sort of slowness in C++ code, virtual functions.
The cost of virtual functions comes from several different aspects, firstly
they effectively prevent the compiler from doing inlining the function
which enables a host of compiler optimizations, essentially by enabling the
compiler to see more of the code and optimize more effectively based on
that. But then there is the runtime cost of the function, which mostly
comes from the indirect call. The majority of the performance penalty here
on modern hardware is due to branch midpredictions when different
implementations of a virtual function get called at a call site. You
should remember that on modern <http://www.7-cpu.com/cpu/Haswell.html>
desktop <http://www.7-cpu.com/cpu/SandyBridge.html> processors
<http://www.7-cpu.com/cpu/Skylake.html>, the cost of a branch misprediction
can be around 15-20 cycles (depending on the processor) so if what your
function does is very trivial and it has many overrides that can be called
in hot code chances are that you are spending a considerable amount of time
waiting for the instruction cache misses on the calls to the virtual
function in question. Of course, finding which virtual functions in your
program are these expensive ones requires profiling the workloads you care
about improving, but always keep an eye for this problem as unfortunately
the object-oriented programming model in C++ really encourages writing code
like this. This is the kind of issue that a native profiler is probably
more suitable for discovering, for example if you are using a simple native
sampling profiler these issues typically show up as a long amount of time
being spent on the first instruction of the virtual function being called
(which is typically an inexpensive instruction otherwise.)

Now it's time to acknowledge the work of all of you who have helped in
improving the performance of the browser in the last week. As always, I
hope I'm not forgetting anyone:

-

Doug Thayer ported the Gecko Profiler add-on to be a WebExtension
<https://github.com/devtools-html/Gecko-Profiler-Addon/pull/44>! One
important impact of this work is that this makes it possible to profile
Firefox using this add-on without incurring the performance impact of
having an extension using the add-on SDK installed.
-

Kris Maglione added support for pre-loading scripts during startup on a
background thread <https://bugzilla.mozilla.org/show_bug.cgi?id=1359653>.
This helps improve startup performance for the parent process.
-

David Anderson made us composite asynchronously on Windows when resizing
a widget <https://bugzilla.mozilla.org/show_bug.cgi?id=1361257>. This
can reduce main thread jank for example when opening a window. He also made
PLayerTransaction’s constructor async
<https://bugzilla.mozilla.org/show_bug.cgi?id=1350634> removing a
synchronous IPC message that we used to incur when opening a new window.
-

David Baron ensured that PLDHashTable’s second hash doesn't have padding
with 0 bits for tables with capacity larger than 2^16
<https://bugzilla.mozilla.org/show_bug.cgi?id=1352889>. This hopefully
reduces the risk of encountering long chains in large hash tables, which
could improve some of the hash table performance issues we have noticed
come up in profiles.
-

Cameron McCormack made dom::FontFace cache its gfxCharacterMap instead
of rebuilding it every time
<https://bugzilla.mozilla.org/show_bug.cgi?id=1352531>.
-

William Chen made us reuse StackNodes in HTML parser TreeBuilder in
order to avoid malloc overhead
<https://bugzilla.mozilla.org/show_bug.cgi?id=1355441>.
-

Gabor Krizsanits enabled preallocating content processes by default
<https://bugzilla.mozilla.org/show_bug.cgi?id=1341008>, which should
give us perceived performance wins on new tab and window opens.
-

Nathan Froyd made it possible to profile Stylo Rayon threads using the
Gecko profiler <https://bugzilla.mozilla.org/show_bug.cgi?id=1322656>.
-

Bas Schouten moved pointers to DisplayDataItems directly on nsIFrame
<https://bugzilla.mozilla.org/show_bug.cgi?id=1331718>. This will allow
more efficient access to them by avpiding a lot of hashtable lookups, and
providing better data locality.
-

Michael Layzell made us avoid checking for permissions that almost never
exist unless they do exist to bypass the overhead of nsContentBlocker
<https://bugzilla.mozilla.org/show_bug.cgi?id=1363243>.
-

William Chen flattened attribute storage in the HTML parser
<https://bugzilla.mozilla.org/show_bug.cgi?id=1355479> in order to avoid
the cost of dynamic memory allocation.
-

Thinker Li added a shortcut to nsFrame::BuildDisplayListForChild()
<https://bugzilla.mozilla.org/show_bug.cgi?id=1342009> in order to
improve display list construction speeds by remembering the results of the
previous rounds of computation.
-

Tim Taubert imposed a 2KB limit on the amount of session storage data
preserved by session restore
<https://bugzilla.mozilla.org/show_bug.cgi?id=1362058>.
-

Jan de Mooij optimized Array.prototype.shit to have O(1) rather than
O(n) behavior <https://bugzilla.mozilla.org/show_bug.cgi?id=1348772>.
This is especially nice considering JS libraries using arrays as queues
which tend to call shift() inside a loop which would cause us have
quadratic behavior before this change.

Until next week, happy hacking!

Cheers,
--
Ehsan

Tom Ritter

unread,

May 12, 2017, 12:30:17 PM5/12/17

to Ehsan Akhgari, Mozilla, Firefox Dev

On Fri, May 12, 2017 at 1:27 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
> I realized we haven't had a performance mini-story for a while -- I sort of

> dropped the ball on that. Running over this bug made me want to talk about

> a pretty well known sort of slowness in C++ code, virtual functions. The
> cost of virtual functions comes from several different aspects, firstly they
> effectively prevent the compiler from doing inlining the function which
> enables a host of compiler optimizations, essentially by enabling the
> compiler to see more of the code and optimize more effectively based on
> that. But then there is the runtime cost of the function, which mostly
> comes from the indirect call. The majority of the performance penalty here
> on modern hardware is due to branch midpredictions when different
> implementations of a virtual function get called at a call site. You should

> remember that on modern desktop processors, the cost of a branch

> misprediction can be around 15-20 cycles (depending on the processor) so if
> what your function does is very trivial and it has many overrides that can
> be called in hot code chances are that you are spending a considerable
> amount of time waiting for the instruction cache misses on the calls to the
> virtual function in question. Of course, finding which virtual functions in
> your program are these expensive ones requires profiling the workloads you
> care about improving, but always keep an eye for this problem as
> unfortunately the object-oriented programming model in C++ really encourages
> writing code like this. This is the kind of issue that a native profiler is
> probably more suitable for discovering, for example if you are using a
> simple native sampling profiler these issues typically show up as a long
> amount of time being spent on the first instruction of the virtual function
> being called (which is typically an inexpensive instruction otherwise.)

This reminded me of
https://bugzilla.mozilla.org/show_bug.cgi?id=1332680 (and
https://bugzilla.mozilla.org/show_bug.cgi?id=1332682 )

Adding -Wsuggest-final-types and -Wsuggest-final-methods and looking
at the output seems pretty low-effort to find a lot of more
opportunities for this. (Unless I'm misunderstanding things).

Plus, it benefits security!

-tom

Ehsan Akhgari

unread,

May 12, 2017, 12:55:47 PM5/12/17

to Tom Ritter, Mozilla, Firefox Dev

Yes, this is indeed a good point. Even though this will really only
have a measurable impact on performance if the functions are called in
hot code, it seems like a shame to not listen to the compiler when it
tells you I could make your code faster if you added this keyword in a
bunch of places. :-)

Cameron Kaiser

unread,

May 16, 2017, 8:55:54 PM5/16/17

to

On 5/11/17 11:27 PM, Ehsan Akhgari wrote:

> Jan de Mooij optimized Array.prototype.shit to have O(1) rather than
> O(n) behavior <https://bugzilla.mozilla.org/show_bug.cgi?id=1348772>.

Wow. That's an appropriately named function. ;)

Cameron Kaiser
tier-3 port, tier-1 heart

Chris Peterson

unread,

May 19, 2017, 2:59:22 PM5/19/17

to

On 2017-05-12 9:55 AM, Ehsan Akhgari wrote:
>> This reminded me of
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1332680 (and
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1332682 )
>>
>> Adding -Wsuggest-final-types and -Wsuggest-final-methods and looking
>> at the output seems pretty low-effort to find a lot of more
>> opportunities for this. (Unless I'm misunderstanding things).
>>
>> Plus, it benefits security!
> Yes, this is indeed a good point. Even though this will really only
> have a measurable impact on performance if the functions are called in
> hot code, it seems like a shame to not listen to the compiler when it
> tells you I could make your code faster if you added this keyword in a
> bunch of places. :-)

Should the Mozilla Coding Style document recommend that all new classes
use `final` unless they are designed to be derived? It would be a good
habit even for simple classes that don't derive from a base class.

Also, Herb Sutter recommends [1] that all base classes should either
have a public virtual destructor or protected non-virtual destructor.
This prevents the problem where a derived class's non-virtual destructor
doesn't get called if the object is deleted through a pointer to a base
class.

So all classes would either:

- be a final class,
- have a public virtual destructor, or
- have a protected non-virtual dtor (possibly an empty inline dtor)

[1] http://www.gotw.ca/publications/mill18.htm