Proposed changes to Talos (performance) alerting

William Lachance

Jan 6, 2016, 5:54:39 PM

I'd like to propose some changes in how we report and triage Talos
alerts. Over the past couple years, Joel Maher (with occasional
assistance from myself and others) has taken over the job of triaging
and responding to ("sheriffing") Talos regressions. He's done this
through a bunch of existing systems:

1. Graphserver: https://graphs.mozilla.org - a system for visualizing
the results of Talos tests
2. Graphserver alerts: A system that monitors Graphserver for sustained
regressions and improvements in particular Talos tests, and emails
developers and posts to a newsgroup (m.d.tree-alerts) when it detects them.
3. AlertManager: http://alertmanager.allizom.org:8080/alerts.html - A
system Joel created that ingests the emails produced by Graphserver alerts
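
To make "sustained regressions and improvements" concrete, here is a minimal sketch of one way such a change could be detected. This is not Graphserver's actual algorithm: the function names, the window size, and the threshold are all assumptions made for illustration. It compares the mean of a window of results before each point with the mean of a window after it, using a Welch-style t statistic, and reports the point where the shift is strongest.

```python
from statistics import mean, variance


def t_score(before, after):
    """Welch-style t statistic for the difference in means of two windows."""
    v1 = variance(before) if len(before) > 1 else 0.0
    v2 = variance(after) if len(after) > 1 else 0.0
    denom = (v1 / len(before) + v2 / len(after)) ** 0.5
    delta = abs(mean(after) - mean(before))
    if denom == 0:
        # Both windows are perfectly flat: any difference is a clean step.
        return float("inf") if delta != 0 else 0.0
    return delta / denom


def best_change_point(values, window=5, threshold=7.0):
    """Return the index where the series shifts most strongly, or None.

    `window` and `threshold` are illustrative values, not tuned settings.
    """
    best_index, best_score = None, threshold
    for i in range(window, len(values) - window + 1):
        score = t_score(values[i - window:i], values[i:i + window])
        if score >= best_score:
            best_index, best_score = i, score
    return best_index
```

Requiring a full window of results on both sides of the candidate point is what makes the change "sustained": a single noisy run inflates the variance of its window and lowers the score rather than triggering a report.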

Over the past year, I've been working on a system called Perfherder
which aims to subsume the above functionality, streamlining this process
and making it easier for developers to understand and respond to these
reports. The pieces have been coming together, the last one being
a better interface for triaging and responding to alerts, which you can
see here:

https://treeherder.mozilla.org/perf.html#/alerts

We've found that the automated alert emails have not been an effective
way of getting developers to respond to performance regressions. Indeed,
they might have had the opposite effect: because there are so many
"downstream" alerts (notifications produced due to merges and uplifts),
people have been largely ignoring them, thinking that graphserver alerts
(and perhaps talos in general?) are "just noise".

This isn't true: the alert subsystem *does* in fact produce
correct reports of regressions and improvements in most cases (there are
certainly exceptions), but I think history has shown that we need at
least some hands-on work by a performance sheriff (someone like Joel) to
triage these results and file bugs appropriately. Doing this with the
help of the AlertManager dashboard has proven much more effective than
the automated e-mails for getting results over the past few years, so
we're going to continue with that approach as we transition over to
Perfherder.

Here's my rough roadmap for changing the current system:

1. Effective immediately (well, as soon as possible): Stop emailing
developers graphserver regression reports. These reports will continue
to be sent to mozilla.dev.tree-alerts (mostly so that they can be picked
up by the existing AlertManager system while we transition over to
Perfherder).
2. End of January: Sheriffs will start using Perfherder to triage
performance alerts and file bugs (we're almost ready to do this now,
pending a few important features, like prepopulating a bug based on a
template). Hopefully the only thing developers will notice is
easier-to-understand bugs due to the various improvements of Perfherder
over Graphserver. :)
3. End of Q1: After 2 months of running side-by-side with Perfherder, we
will stop submitting talos data to Graphserver. Graphserver will keep on
running read-only, for historical purposes.

Will

William Lachance

Jan 15, 2016, 1:24:36 PM

On 2016-01-06 5:54 PM, William Lachance wrote:
> 1. Effective immediately (well, as soon as possible): Stop emailing
> developers graphserver regression reports. These reports will continue
> to be sent to mozilla.dev.tree-alerts (mostly so that they can be picked
> up by the existing AlertManager system while we transition over to
> Perfherder).

This part is done now. Thanks in advance for responding promptly to
alert reports on your commits. If you want to watch for performance
regressions, either look at mozilla.dev.tree-alerts or follow the new
performance alert dashboard as it develops:

https://treeherder.mozilla.org/perf.html#/alerts

Will

Nicolas B. Pierron

Jan 18, 2016, 4:42:51 AM

On 01/06/2016 10:54 PM, William Lachance wrote:
> […] thinking that graphserver alerts (and perhaps talos
> in general?) are "just noise".

I think one of the reasons might be a misunderstanding of the causality.

When a modification of the JS engine causes a TS Paint regression, the
report is not directly actionable without a clear understanding of the
causality.

I agree, it should be the developer's job to work that out, but the
TS Paint benchmark is outside the knowledge base of JS developers. I feel
that the problem is reaching developers with wording they already know.

This is just a raw idea, but maybe this would make more sense to provide a
diff of profiles, and show what decreased / increased. At least this would
make these benchmarks less obscure.

--
Nicolas B. Pierron

jmaher

Jan 19, 2016, 5:01:42 AM

>
> This is just a raw idea, but maybe this would make more sense to provide a
> diff of profiles, and show what decreased / increased. At least this would
> make these benchmarks less obscure.
>

Pushing a before/after patch to try with profiling enabled can be done (note that the numbers from such runs are not useful), and there is a simple checkbox for it on trychooser. I am not familiar with any diff tools and wonder how much noise would show up. It does seem worthwhile.
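
As a rough illustration of what a profile diff could look like, here is a minimal sketch. It assumes each profile has already been reduced to a mapping from function name to self time in milliseconds; that input format, the function name `diff_profiles`, and the `min_delta_ms` noise cutoff are all hypothetical, not part of any existing tool.

```python
def diff_profiles(before, after, min_delta_ms=1.0):
    """Compare two {function name: self time in ms} mappings.

    Returns (name, before_ms, after_ms, delta_ms) tuples, largest absolute
    change first. Changes smaller than min_delta_ms are dropped as a crude
    way of suppressing noise.
    """
    rows = []
    for name in set(before) | set(after):
        b = before.get(name, 0.0)  # absent from a profile means no time spent
        a = after.get(name, 0.0)
        delta = a - b
        if abs(delta) >= min_delta_ms:
            rows.append((name, b, a, delta))
    # Sort by size of change, then by name so the order is deterministic.
    rows.sort(key=lambda row: (-abs(row[3]), row[0]))
    return rows


# Made-up numbers, purely for illustration.
before = {"parse_script": 120.0, "tick": 40.0, "paint_frame": 30.0}
after = {"parse_script": 150.0, "tick": 40.5, "draw_layer": 12.0}
for name, b, a, delta in diff_profiles(before, after):
    print(f"{name:15s} {b:7.1f} -> {a:7.1f} ({delta:+.1f} ms)")
```

A fixed cutoff like this only hides small fluctuations. How much real noise would remain between two profiling runs is exactly the open question, and comparing several runs per side would likely be needed to answer it.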

William Lachance

Jan 19, 2016, 10:06:44 AM

On 2016-01-18 4:42 AM, Nicolas B. Pierron wrote:
>
> I agree, this should be the part of the developer to work that out, but
> the TS Paint benchmark is out of the knowledge base of JS developers. I
> feel that the problem is reaching developers with a wording known by the
> developers.

We have a wiki page which describes each Talos test in detail. We
link to it in all perf bugs, but it can be useful reading on its own.
For example, ts_paint is described here:

https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint

I'm sure some of the test descriptions could use some improvement; feel
free to ask on #perf if anything's unclear.

> This is just a raw idea, but maybe this would make more sense to provide
> a diff of profiles, and show what decreased / increased. At least this
> would make these benchmarks less obscure.

As Joel mentioned, it's pretty easy to schedule profiling runs for talos
using trychooser. Scheduling a profiling run as part of the
regression-filing process is something we could consider doing, if
there's a broad consensus it would be useful (I'm always wary of putting
extra load on the machines for something that would be only useful in
10%, say, of cases).

Will

Mike Conley

Jan 20, 2016, 10:22:25 AM

to dev-pl...@lists.mozilla.org

>>
>> As Joel mentioned, it's pretty easy to schedule profiling runs for talos
>> using trychooser. Scheduling a profiling run as part of the
>> regression-filing process is something we could consider doing, if
>> there's a broad consensus it would be useful (I'm always wary of putting
>> extra load on the machines for something that would be only useful in
>> 10%, say, of cases).

I think it would be very useful. Performance profiles are, imo, the best
first step in attempting to drill down into the main issue causing the
regression.

>> (I'm always wary of putting
>> extra load on the machines for something that would be only useful in
>> 10%, say, of cases).

Note that for most cases you'd only need or want a single run (multiple
runs with profiling aren't generally useful unless you're looking for an
intermittent performance problem). You also wouldn't need to do a full
rebuild if you could trigger the single profiling run on the same builds
that we've been doing the retriggers on.

jmaher

Jan 21, 2016, 11:49:04 AM

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1241535 to determine if we can have a more streamlined method for collecting profiles.