The current state of Talos benchmarks

Ehsan Akhgari

Aug 29, 2012, 7:03:17 PM
to dev-pl...@lists.mozilla.org
Hi everyone,

The Talos regression detection emails caught a number of regressions
during the Monday uplift (see [1] for Aurora and [2] for Beta
regressions). To put things into perspective, I prepared a spreadsheet
of the most notable performance regressions [3] (and please do take a
look at the spreadsheet!).

Currently, many developers ignore the Talos regression emails that go to
dev-tree-management, and in many cases regressions of a few percent
slide in without being tracked. This trend of relatively big performance
regressions becomes more evident every time we do an uplift, because
6 weeks' worth of development gets compared to the previous version.

A few people (myself included) have tried to go through these emails and
notify the people responsible in the past. This process has proved to
be ineffective, because (1) the problem is not officially owned by
anyone (currently the only person going through those emails is
mbrubeck), and (2) because of problems such as the difficulty of
diagnosing and reproducing performance regressions, many people think
that their patches are unlikely to have caused a regression, and
therefore no investigation gets done.

Some people have noted in the past that some Talos measurements are not
representative of something that the users would see, that the Talos
numbers are noisy, and that we don't have good tools to deal with these
types of regressions. There might be some truth to all of these, but I
believe that the bigger problem is that nobody owns watching over these
numbers, and as a result we take regressions in some benchmarks which
can actually be representative of what our users experience.

I don't believe that the current situation is acceptable, especially
with the recent focus on performance (through the Snappy project), and I
would like to ask people if they have any ideas on what we can do to fix
this. The fix might be turning off some Talos tests if they're really
not useful, asking someone or a group of people to go over these test
results, getting better tools for them, etc. But _something_ needs to
happen here.

Cheers,
Ehsan



[1]
<https://groups.google.com/forum/?fromgroups=#!topic/mozilla.dev.tree-management/QYvG8wIen6Y>
[2]
<https://groups.google.com/forum/?fromgroups=#!topic/mozilla.dev.tree-management/Sh62iFF4GtY>
[3]
<https://docs.google.com/spreadsheet/pub?key=0AuDE0NKbf0EOdF9uRi1qa1hwNFhneEFLcUt2TzI3WXc&single=true&gid=0&output=html>

Nicholas Nethercote

Aug 29, 2012, 7:32:01 PM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org
On Wed, Aug 29, 2012 at 4:03 PM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>
> Some people have noted in the past that some Talos measurements are not
> representative of something that the users would see, the Talos numbers are
> noisy, and we don't have good tools to deal with these types of regressions.
> There might be some truth to all of these, but I believe that the bigger
> problem is that nobody owns watching over these numbers, and as a result we
> take regressions in some benchmarks which can actually be representative of
> what our users experience.

In my experience, a lot of those emails say "there was a regression
caused by one of the following 100 patches", and I will have written 1
of those patches. I usually ignore those ones (though it depends on
the nature of the patch).

But if I get an email saying something like "there was a regression
caused by one of the following 3 commits", I'll look into it.

Nick

Matt Brubeck

Aug 29, 2012, 7:38:54 PM
On 08/29/2012 04:32 PM, Nicholas Nethercote wrote:
> In my experience, a lot of those emails say "there was a regression
> caused by one of the following 100 patches", and I will have written 1
> of those patches. I usually ignore those ones (though it depends on
> the nature of the patch).
>
> But if I get an email saying something like "there was a regression
> caused by one of the following 3 commits", I'll look into it.

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=752002 to stop
sending mail to every single author when the regression range includes a
large number of patches.
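
A minimal sketch of the rule that bug asks for, assuming a simple cutoff on the number of distinct authors in the regression range. The function name, the cutoff value, and the addresses are hypothetical and do not come from the actual Talos alert code.

```python
# Hypothetical sketch: none of these names come from the real alert scripts.
MAX_AUTHORS_TO_MAIL = 5  # assumed cutoff; the bug does not specify a number


def alert_recipients(authors, list_address, max_authors=MAX_AUTHORS_TO_MAIL):
    """Return the addresses a regression alert should be sent to.

    The list always gets the alert. Individual authors are only added
    when the regression range is small enough to be actionable.
    """
    unique_authors = sorted(set(authors))
    if len(unique_authors) > max_authors:
        return [list_address]
    return [list_address] + unique_authors


small_range = alert_recipients(["a@example.org", "b@example.org"], "list@example.org")
large_range = alert_recipients(
    [f"dev{i}@example.org" for i in range(100)], "list@example.org"
)
```

With this rule a merge push containing 100 patches would notify only the list, while a 3-commit range would still reach each author directly.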

Ehsan Akhgari

Aug 29, 2012, 7:53:22 PM
to Nicholas Nethercote, dev-pl...@lists.mozilla.org
On 12-08-29 7:32 PM, Nicholas Nethercote wrote:
> On Wed, Aug 29, 2012 at 4:03 PM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>>
>> Some people have noted in the past that some Talos measurements are not
>> representative of something that the users would see, the Talos numbers are
>> noisy, and we don't have good tools to deal with these types of regressions.
>> There might be some truth to all of these, but I believe that the bigger
>> problem is that nobody owns watching over these numbers, and as a result we
>> take regressions in some benchmarks which can actually be representative of
>> what our users experience.
>
> In my experience, a lot of those emails say "there was a regression
> caused by one of the following 100 patches", and I will have written 1
> of those patches. I usually ignore those ones (though it depends on
> the nature of the patch).
>
> But if I get an email saying something like "there was a regression
> caused by one of the following 3 commits", I'll look into it.

Yeah, I know that those emails are not perfect, and I know that some
people do look into these problems, but I don't think that is the
general trend, as is evident from the 6-week cycle results on the latest
uplift.

Cheers,
Ehsan

Matt Brubeck

Aug 29, 2012, 8:00:53 PM
On 08/29/2012 04:03 PM, Ehsan Akhgari wrote:
> I don't believe that the current situation is acceptable, especially
> with the recent focus on performance (through the Snappy project), and I
> would like to ask people if they have any ideas on what we can do to fix
> this. The fix might be turning off some Talos tests if they're really
> not useful, asking someone or a group of people to go over these test
> results, get better tools with them, etc. But _something_ needs to
> happen here.

Thanks for starting this discussion. I have some suggestions:

* Less is more. We can pay more attention to tests if every alert is
for something we care about. We can *track* stuff like Trace Malloc
Allocs if there are people who find the data useful in their work, but
we should not *alert* on it unless it is a key user-facing metric.

* I don't like our reactive approach that focuses on trying to identify
regressions, and then decide whether to fix them in place, back them
out, or ignore them. Instead we should proactively set goals for what
our performance should be, and focus on the best way to get it there (or
keep it there). The goals should be based on the desired user experience,
and we should focus only on metrics that reflect those user experience
goals.

* Engineering teams should have to answer for these metrics; for example
they should be included in quarterly goals. At Amazon, item #1 in the
quarterly goals for each team was always to meet our metrics
commitments. Slipping a key metric past a certain threshold should stop
other work for the team until it's rectified.

* We need staff whose job includes deciding which regressions are
meaningful, identifying the cause, following up to make sure it's backed
out or fixed, and refining the process and tools used to make all this
possible. Too much slips through the cracks when we leave this to
volunteers (including "employeeteers" like Ehsan or me).

Justin Lebar

Aug 29, 2012, 8:10:03 PM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org, Nicholas Nethercote
After getting an e-mail every single time m-c was merged for a day or
two, I filtered the e-mails and completely forgot about them. I
imagine most other people did the same. If we fix bug 752002, we'd
also need to change the e-mails so as to get around everyone's
existing filters.

More on topic: I think the essential problem is, you can spend a week
chasing down a perf regression when there's a good chance it's not
your fault (and also a good chance it's not a regression). So people
are making a reasonable trade-off here when they ignore these
problems.

This is the essential problem that the Signal from Noise project is
tackling. But it's a medium-term play at best, since they're
rewriting large pieces of infrastructure.

So the question is: Are we willing to sit back and wait for SfN, or do
we need improvements Right Now? It's my feeling that incremental
improvements are often more beneficial than the "stop fixing all but
critical bugs in existing code, focus on v2 code" approach that we
seem to be taking with SfN, but I understand that there are trade-offs
here.

Anyway there are simple things we could do to improve the current
situation without taking too much of the focus away from SfN, such as
unbreaking compare-talos (e.g. [1], but also in general applying real
statistics so it's simple to find true regressions and rule out false
ones). It's just a question of priorities.

-Justin

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=696196
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
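
One way to read "applying real statistics" above is a significance test on two sets of runs. The following is a rough sketch only, using a one-sided permutation test chosen for this illustration because it needs nothing beyond the standard library. It is not what compare-talos does, and the sample numbers are invented.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(before, after, trials=10000, seed=0):
    """One-sided permutation test for a slowdown (higher numbers are worse).

    Returns the fraction of random relabelings whose mean difference is at
    least as large as the observed one. A small value suggests the slowdown
    is unlikely to be noise.
    """
    rng = random.Random(seed)
    observed = mean(after) - mean(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = mean(pooled[n_before:]) - mean(pooled[:n_before])
        if diff >= observed:
            hits += 1
    return hits / trials


# Made-up timings in milliseconds, not real Talos data.
before = [301, 298, 305, 299, 302, 300, 297, 303]
after = [309, 311, 306, 312, 308, 310, 307, 313]
p = permutation_p_value(before, after)
```

Comparing a sample against itself gives a large value, which is the behaviour wanted for ruling out false regressions.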

Robert O'Callahan

Aug 29, 2012, 8:41:23 PM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org
Some of the 16->17 regressions are known and due to DLBI patches (bug
539356). Since we don't have full DLBI on trunk yet, those changes should
just be preffed off on Aurora for 17. We should do that and see how that
affects the numbers. Matt Woodrow will take care of that :-).

Rob
--
“You have heard that it was said, ‘Love your neighbor and hate your enemy.’
But I tell you, love your enemies and pray for those who persecute you,
that you may be children of your Father in heaven. ... If you love those
who love you, what reward will you get? Are not even the tax collectors
doing that? And if you greet only your own people, what are you doing more
than others?" [Matthew 5:43-47]

Anthony Jones

Aug 29, 2012, 8:56:19 PM
to dev-pl...@lists.mozilla.org
On 30/08/12 12:10, Justin Lebar wrote:
> More on topic: I think the essential problem is, you can spend a week
> chasing down a perf regression when there's a good chance it's not
> your fault (and also a good chance it's not a regression). So people
> are making a reasonable trade-off here when they ignore these
> problems.

It doesn't seem realistic to try to break down performance regressions
to individual commits or developers. Perhaps performance should be
reported weekly to teams so they can prioritise resources in terms of
who is working on performance.

Ehsan Akhgari

Aug 29, 2012, 8:58:52 PM
to Justin Lebar, dev-pl...@lists.mozilla.org, Nicholas Nethercote
On 12-08-29 8:10 PM, Justin Lebar wrote:
> After getting an e-mail every single time m-c was merged for a day or
> two, I filtered the e-mails and completely forgot about them. I
> imagine most other people did the same. If we fix bug 752002, we'd
> also need to change the e-mails so as to get around everyone's
> existing filters.

And hope that they do not add new filters. ;-)

> More on topic: I think the essential problem is, you can spend a week
> chasing down a perf regression when there's a good chance it's not
> your fault (and also a good chance it's not a regression). So people
> are making a reasonable trade-off here when they ignore these
> problems.

I don't necessarily agree. The common practice as far as I can tell is
for people to not investigate, *assuming* the above. I don't believe we
have data to support any of these propositions in the average case.

> This is the essential problem that the Signal from Noise project is
> tackling. But it's a medium-term play at best, since they're
> rewriting large pieces of infrastructure.
>
> So the question is: Are we willing to sit back and wait for SfN, or do
> we need improvements Right Now? It's my feeling that incremental
> improvements are often more beneficial than the "stop fixing all but
> critical bugs in existing code, focus on v2 code" approach that we
> seem to be taking with SfN, but I understand that there are trade-offs
> here.

I agree. I don't think that the SfN project will be a magic bullet here.

> Anyway there are simple things we could do to improve the current
> situation without taking too much of the focus away from SfN, such as
> unbreaking compare-talos (e.g. [1], but also in general applying real
> statistics so it's simple to find true regressions and rule out false
> ones). It's just a question of priorities.

Absolutely.

Ehsan

Ehsan Akhgari

Aug 29, 2012, 9:00:52 PM
to Anthony Jones, dev-pl...@lists.mozilla.org
On 12-08-29 8:56 PM, Anthony Jones wrote:
> On 30/08/12 12:10, Justin Lebar wrote:
>> More on topic: I think the essential problem is, you can spend a week
>> chasing down a perf regression when there's a good chance it's not
>> your fault (and also a good chance it's not a regression). So people
>> are making a reasonable trade-off here when they ignore these
>> problems.
>
> It doesn't seem realistic to try to break down performance regressions
> to individual commits or developers. Perhaps performance should be
> reported weekly to teams so they can prioritise resources in terms of
> who is working on performance.

I agree with that if we talk about performance in general. But this
thread is about specific regressions in performance as a result of
changesets going into our tree. I don't think the same argument applies
here, unless we decide that we don't care about some of the things that
we measure on Talos (in which case we should stop running those tests if
we're not going to act on their results.)

Ehsan

Ehsan Akhgari

Aug 29, 2012, 9:02:43 PM
to rob...@ocallahan.org, dev-pl...@lists.mozilla.org
On 12-08-29 8:41 PM, Robert O'Callahan wrote:
> Some of the 16->17 regressions are known and due to DLBI patches (bug
> 539356). Since we don't have full DLBI on trunk yet, those changes should
> just be preffed off on Aurora for 17. We should do that and see how that
> affects the numbers. Matt Woodrow will take care of that :-).

That's good to hear. Is there a bug on file for that?

Ehsan

Ehsan Akhgari

Aug 29, 2012, 9:07:05 PM
to Matt Brubeck, dev-pl...@lists.mozilla.org
On Wed, Aug 29, 2012 at 8:00 PM, Matt Brubeck <mbru...@mozilla.com> wrote:

> On 08/29/2012 04:03 PM, Ehsan Akhgari wrote:
>
>> I don't believe that the current situation is acceptable, especially
>> with the recent focus on performance (through the Snappy project), and I
>> would like to ask people if they have any ideas on what we can do to fix
>> this. The fix might be turning off some Talos tests if they're really
>> not useful, asking someone or a group of people to go over these test
>> results, get better tools with them, etc. But _something_ needs to
>> happen here.
>>
>
> Thanks for starting this discussion. I have some suggestions:
>
> * Less is more. We can pay more attention to tests if every alert is for
> something we care about. We can *track* stuff like Trace Malloc Allocs if
> there are people who find the data useful in their work, but we should not
> *alert* on it unless it is a key user-facing metric.
>

Yes. I think one major problem is that we have some Talos tests which
measure useless things, and people tend to think that Talos measurements in
general are useless because of that bad reputation. And it also increases
the cognitive load of dealing with the problem, as you state.


> * I don't like our reactive approach that focuses on trying to identify
> regressions, and then decide whether to fix them in place, back them out,
> or ignore them. Instead we should proactively set goals for what our
> performance should be, and focus on the best way to get it there (or keep
> it there). The goals should be based on the desired user experience, and we
> should focus only on metrics that reflect those user experience goals.
>

+1.


> * Engineering teams should have to answer for these metrics; for example
> they should be included in quarterly goals. At Amazon, item #1 in the
> quarterly goals for each team was always to meet our metrics commitments.
> Slipping a key metric past a certain threshold should stop other work for
> the team until it's rectified.
>

We should definitely do that if we want to care about performance as an
important measure of quality (which is my impression). :-)


> * We need staff whose job includes deciding which regressions are
> meaningful, identifying the cause, following up to make sure it's backed
> out or fixed, and refining the process and tools used to make all this
> possible. Too much slips through the cracks when we leave this to
> volunteers (including "employeeteers" like Ehsan or me).


This, a thousand times!

--
Ehsan
<http://ehsanakhgari.org/>

Dave Mandelin

Aug 29, 2012, 9:20:24 PM
to dev-pl...@lists.mozilla.org
On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
> Hi everyone,
>
> The way the current situation happens is that many of the developers
> ignore the Talos regression emails that go to dev-tree-management,

Talos is widely disliked and distrusted by developers, because it's hard to understand what it's really measuring, and there are lots of false alarms. Metrics and A-Team have been doing a ton of work to improve this. In particular, I told them that some existing Talos JS tests were not useful to us, and they deleted them. And v2 is going to have exactly the tests we want, with regression alarms. So Talos can (and will) be fixed for developers.

> and in many cases regressions of a few percents slide in without being
> tracked. This trend of relatively big performance regressions becomes
> more evident every time we do an uplift, which means that 6 weeks worth
> of development get compared to the previous version.
>
> A few people (myself included) have tried to go through these emails and
> notify the people responsible in the past. This process has proved to
> be ineffective, because (1) the problem is not officially owned by
> anyone (currently the only person going through those emails is
> mbrubeck), and (2) because of problems such as the difficulty of
> diagnosing and reproducing performance regressions, many people think
> that their patches are unlikely to have caused a regression, and
> therefore no investigation gets done.

Yeah, that's no good.

> Some people have noted in the past that some Talos measurements are not
> representative of something that the users would see, the Talos numbers
> are noisy, and we don't have good tools to deal with these types of
> regressions. There might be some truth to all of these, but I believe
> that the bigger problem is that nobody owns watching over these numbers,
> and as a result we take regressions in some benchmarks which can
> actually be representative of what our users experience.

The interesting thing is that we basically have no idea if that's true for any given Talos alarm.

> I don't believe that the current situation is acceptable, especially
> with the recent focus on performance (through the Snappy project), and I
> would like to ask people if they have any ideas on what we can do to fix
> this. The fix might be turning off some Talos tests if they're really
> not useful, asking someone or a group of people to go over these test
> results, get better tools with them, etc. But _something_ needs to
> happen here.

I would say:

- First, and most important, fix the test suite so that it measures only things that are useful and meaningful to developers and users. We can easily take a first cut at this if engineering teams go over the tests related to their work, and tell A-Team which are not useful. Over time, I think we need to get a solid understanding of what performance looks like to users, what things to test, and how to test them soundly. This may require dedicated performance engineers or a performance product manager.

- Second, as you say, get an owner for performance regressions. There are lots of ways we could do this. I think it would integrate fairly easily into our existing processes if we (automatically or by a designated person) filed a bug for each regression and marked it tracking (so the release managers would own followup). Alternately, we could have a designated person own followup. I'm not sure if that has any advantages, but release managers would probably know. But doing any of this is going to severely annoy engineers unless we get the false positive rate under control.

- Speaking of false positives, we should seriously start tracking them. We should keep track of each Talos regression found and its outcome. (It would be great to track false negatives too but it's a lot harder to catch them and record them accurately.) That way we'd actually know whether we have a few false positives or a lot, or whether the false positives were coming up on certain tests. And we could use that information to improve the false positive rate over time.

Dave
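
The bookkeeping suggested in the last point could start as small as the sketch below. It assumes each alert is eventually given one of three outcome labels. The labels, the function name, and the sample records are all invented for illustration.

```python
from collections import Counter, defaultdict

# Outcome labels are assumptions made for this sketch.
REAL = "real_regression"
FALSE_POSITIVE = "false_positive"
UNRESOLVED = "unresolved"


def false_positive_rates(alerts):
    """Map each test name to its false-positive fraction.

    `alerts` is an iterable of (test_name, outcome) pairs. Unresolved
    alerts are ignored, and tests with no resolved alerts are left out.
    """
    counts = defaultdict(Counter)
    for test_name, outcome in alerts:
        counts[test_name][outcome] += 1

    rates = {}
    for test_name, outcome_counts in counts.items():
        resolved = outcome_counts[REAL] + outcome_counts[FALSE_POSITIVE]
        if resolved:
            rates[test_name] = outcome_counts[FALSE_POSITIVE] / resolved
    return rates


# Invented records for illustration only.
log = [
    ("Ts", REAL),
    ("Ts", FALSE_POSITIVE),
    ("Tp5", FALSE_POSITIVE),
    ("Tp5", FALSE_POSITIVE),
    ("Tp5", REAL),
    ("Tp5", UNRESOLVED),
    ("Trace Malloc Allocs", UNRESOLVED),
]
rates = false_positive_rates(log)
```

Grouping by test name is what would show whether false positives cluster on particular tests.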

Joshua Cranmer

Aug 29, 2012, 9:46:17 PM
On 8/29/2012 6:03 PM, Ehsan Akhgari wrote:
> Hi everyone,
>
> The Talos regression detection emails caught a number of regressions
> during the Monday uplift (see [1] for Aurora and [2] for Beta
> regressions). To put things into perspective, I prepared a
> spreadsheet of the most notable performance regressions [3] (and
> please do take a look at the spreadsheet!).
>
> The way the current situation happens is that many of the developers
> ignore the Talos regression emails that go to dev-tree-management, and
> in many cases regressions of a few percents slide in without being
> tracked. This trend of relatively big performance regressions becomes
> more evident every time we do an uplift, which means that 6 weeks
> worth of development get compared to the previous version.

Here's another question worth asking: suppose every commit introduces a
very tiny regression. How big would this regression have to be to add up
to a 10% regression over a 6-week cycle? Assuming 400 commits per cycle,
I calculate that you need to consistently have about 0.024% regression
per commit to add up to that 10% regression. 400 commits is far fewer
than we actually land in a cycle, but it illustrates that some regressions
might not be because of one big regression that everyone missed but a death
by a thousand papercuts.
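
The arithmetic can be checked with a few lines, assuming each commit slows things down by the same factor and the slowdowns compound multiplicatively. The function name is made up for this sketch.

```python
def per_commit_regression(total, commits):
    """Per-commit slowdown that compounds to `total` after `commits` commits.

    Solves (1 + r) ** commits == 1 + total for r.
    """
    return (1.0 + total) ** (1.0 / commits) - 1.0


rate = per_commit_regression(0.10, 400)
print(f"{rate * 100:.4f}% per commit")
```

The result is a little under 0.024% per commit, close to the figure quoted in the message and well below what a single noisy Talos run could detect.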

Robert O'Callahan

Aug 29, 2012, 9:54:13 PM
to Ehsan Akhgari, Anthony Jones, dev-pl...@lists.mozilla.org
On Thu, Aug 30, 2012 at 1:00 PM, Ehsan Akhgari <ehsan....@gmail.com> wrote:

> I agree with that if we talk about performance in general. But this
> thread is about specific regressions in performance as a result of
> changeset going into our tree. I don't think the same argument applies
> here, unless we decide that we don't care about some of the things that we
> measure on Talos (in which case we should stop running those tests if we're
> not going to act on their results.)
>

Yeah. Troubled as Talos is, we have had a lot of cases where Talos reported
a regression that we tracked down to a real bug in some checkin, and
backing out the checkin and fixing it was absolutely the right thing to do.
Talos is useful.

Justin Lebar

Aug 30, 2012, 4:39:40 AM
to Mike Hommey, Ehsan Akhgari, dev-pl...@lists.mozilla.org
On Thu, Aug 30, 2012 at 3:34 AM, Mike Hommey <m...@glandium.org> wrote:
> Ideally, we should make talos regressions visible on tbpl as oranges,
> and star them as other oranges.

FWIW, making this possible is an explicit goal of the SfN effort.

-Justin

Matt Woodrow

Aug 30, 2012, 6:05:22 AM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org, rob...@ocallahan.org
I've just filed bug 786978 for this.

- Matt

Mark Finkle

Aug 30, 2012, 8:44:59 AM
to rob...@ocallahan.org, Ehsan Akhgari, Anthony Jones
On 08/29/2012 09:54 PM, Robert O'Callahan wrote:
> On Thu, Aug 30, 2012 at 1:00 PM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>
>> I agree with that if we talk about performance in general. But this
>> thread is about specific regressions in performance as a result of
>> changeset going into our tree. I don't think the same argument applies
>> here, unless we decide that we don't care about some of the things that we
>> measure on Talos (in which case we should stop running those tests if we're
>> not going to act on their results.)
>>
>
> Yeah. Troubled as Talos is, we have had a lot of cases where Talos reported
> a regression that we tracked down to a real bug in some checkin, and
> backing out the checkin and fixing it was absolutely the right thing to do.
> Talos is useful.

Agreed. Talos does find real regressions from individual developer commits.

Mark Finkle

Aug 30, 2012, 8:47:41 AM
to Nicholas Nethercote, Ehsan Akhgari
On 08/29/2012 07:32 PM, Nicholas Nethercote wrote:

> In my experience, a lot of those emails say "there was a regression
> caused by one of the following 100 patches", and I will have written 1
> of those patches. I usually ignore those ones (though it depends on
> the nature of the patch).
>
> But if I get an email saying something like "there was a regression
> caused by one of the following 3 commits", I'll look into it.

I usually go back to mozilla-inbound when I see a talos regression on a
mozilla-central merge push. Looking on inbound makes it easier to spot
the individual push that may have triggered the regression.

Ehsan Akhgari

Aug 30, 2012, 11:53:02 AM
to Mike Hommey, dev-pl...@lists.mozilla.org
On 12-08-30 2:34 AM, Mike Hommey wrote:
> On Wed, Aug 29, 2012 at 07:03:17PM -0400, Ehsan Akhgari wrote:
>> I don't believe that the current situation is acceptable, especially
>> with the recent focus on performance (through the Snappy project),
>> and I would like to ask people if they have any ideas on what we can
>> do to fix this. The fix might be turning off some Talos tests if
>> they're really not useful, asking someone or a group of people to go
>> over these test results, get better tools with them, etc. But
>> _something_ needs to happen here.
>
> Ideally, we should make talos regressions visible on tbpl as oranges,
> and star them as other oranges.
> The problem is that with the current state of signal/noise of talos
> results, this would lead to
> - a whole lot of oranges
> - oranges showing up long after the fact (by the time the email
> notification comes, plenty more pushes have happened already)
>
> The latter makes it hard to handle such oranges as other oranges.

The way that we handle similar situations with test failures is to close
inbound, back out enough patches so that things go green again, and then
reopen, and ask the people that had been backed out to retry landing
after testing on try.

This is entirely possible for Talos regressions as well if we want to;
it just takes longer, and will mean more patches will get backed out.

Ehsan

Ehsan Akhgari

Aug 30, 2012, 12:11:17 PM
to Dave Mandelin, dev-pl...@lists.mozilla.org
On 12-08-29 9:20 PM, Dave Mandelin wrote:
> On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
>> Hi everyone,
>>
>> The way the current situation happens is that many of the developers
>> ignore the Talos regression emails that go to dev-tree-management,
>
> Talos is widely disliked and distrusted by developers, because it's hard to understand what it's really measuring, and there are lots of false alarms. Metrics and A-Team have been doing a ton of work to improve this. In particular, I told them that some existing Talos JS tests were not useful to us, and they deleted them. And v2 is going to have exactly the tests we want, with regression alarms. So Talos can (and will) be fixed for developers.

In my opinion, one of the reasons Talos is disliked is that many
people don't know where its code lives (hint:
http://hg.mozilla.org/build/talos/) and can't run those tests like other
test suites. I think this would be very valuable to fix, so that
developers can read Talos tests like any other test, and fix or improve
them where needed.

>> Some people have noted in the past that some Talos measurements are not
>> representative of something that the users would see, the Talos numbers
>> are noisy, and we don't have good tools to deal with these types of
>> regressions. There might be some truth to all of these, but I believe
>> that the bigger problem is that nobody owns watching over these numbers,
>> and as a result we take regressions in some benchmarks which can
>> actually be representative of what our users experience.
>
> The interesting thing is that we basically have no idea if that's true for any given Talos alarm.

That's something that I think should be judged per benchmark. For
example, the Ts measurements will probably correspond very directly to
the startup time that our users experience. The Tp5 measurements don't
directly correspond to anything like that, since nobody loads those
pages sequentially, but it could be an indication of average page load
performance.

>> I don't believe that the current situation is acceptable, especially
>> with the recent focus on performance (through the Snappy project), and I
>> would like to ask people if they have any ideas on what we can do to fix
>> this. The fix might be turning off some Talos tests if they're really
>> not useful, asking someone or a group of people to go over these test
>> results, get better tools with them, etc. But _something_ needs to
>> happen here.
>
> I would say:
>
> - First, and most important, fix the test suite so that it measures only things that are useful and meaningful to developers and users. We can easily take a first cut at this if engineering teams go over the tests related to their work, and tell A-Team which are not useful. Over time, I think we need to get a solid understanding of what performance looks like to users, what things to test, and how to test them soundly. This may require dedicated performance engineers or a performance product manager.

Absolutely. I think developers need to act more proactively to address
this. I've heard so many times that the measurement X is useless. I
think it's time for us to even consider stopping some of the Talos tests
if we don't think they're useful. We could use the machine time to run
other tests at least!

> - Second, as you say, get an owner for performance regressions. There are lots of ways we could do this. I think it would integrate fairly easily into our existing processes if we (automatically or by a designated person) filed a bug for each regression and marked it tracking (so the release managers would own followup). Alternately, we could have a designated person own followup. I'm not sure if that has any advantages, but release managers would probably know. But doing any of this is going to severely annoy engineers unless we get the false positive rate under control.

Note that some of the work to differentiate between false positives
and real regressions needs to be done by the engineers, similar to the
work required to investigate correctness problems. And people need to
accept that seemingly benign changes may also cause real performance
regressions, so it's not always possible to glance over a changeset and
say "nah, this can't be my fault." :-)

> - Speaking of false positives, we should seriously start tracking them. We should keep track of each Talos regression found and its outcome. (It would be great to track false negatives too but it's a lot harder to catch them and record them accurately.) That way we'd actually know whether we have a few false positives or a lot, or whether the false positives were coming up on certain tests. And we could use that information to improve the false positive rate over time.

I agree. Do you have any suggestions on how we would track them?

Thanks!
Ehsan

Ehsan Akhgari

Aug 30, 2012, 1:45:00 PM8/30/12
to Joshua Cranmer, dev-pl...@lists.mozilla.org
We do see regressions which cannot be attributed to a few changesets,
but I think we can all agree that every single changeset contributing
the same amount of performance regression in the 6 week cycle is highly
unlikely. I'm not sure what you're trying to suggest here.

Ehsan

Dave Mandelin

Aug 30, 2012, 5:28:01 PM8/30/12
to Dave Mandelin, dev-pl...@lists.mozilla.org
On Thursday, August 30, 2012 9:11:25 AM UTC-7, Ehsan Akhgari wrote:
> On 12-08-29 9:20 PM, Dave Mandelin wrote:
>
> > On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
>
> In my opinion, one of the reasons why Talos is disliked is because many
> people don't know where its code lives (hint:
> http://hg.mozilla.org/build/talos/) and can't run those tests like other
> test suites. I think this would be very valuable to fix, so that
> developers can read Talos tests like any other test, and fix or improve
> them where needed.

It is hard to find. And beyond that, it seems hard to use. It's been a while since I've run Talos locally, but last time I did it was a pain to set up and difficult to run, and I hear it's still kind of like that. For testing tools, "convenient for the developer" is a critical requirement, but has been neglected in the past.

js/src/jit-test/ is an example of something that is very convenient for developers: creating a test is just adding a .js file to a directory (no manifest or extra files; by default an error or crash is a fail, but you can change that for a test), the harness is a Python file with nice options, the test configuration and basic usage are documented in a README, and it lives in the tree.

> >> [...] I believe
> >> that the bigger problem is that nobody owns watching over these numbers,
> >> and as a result as take regressions in some benchmarks which can
> >> actually be representative of what our users experience.
>
> > The interesting thing is that we basically have no idea if that's true for any given Talos alarm.
>
> That's something that I think should be judged per benchmark. For
> example, the Ts measurements will probably correspond very directly to
> the startup time that our users experience. The Tp5 measurements don't
> directly correspond to anything like that, since nobody loads those
> pages sequentially, but it could be an indication of average page load
> performance.

I exaggerated a bit--yes, some tests like Ts are pretty easy to understand and do correspond to user experience. With Tp5, I just don't know--I haven't spent any time trying to use it or looking at regressions, since JS doesn't affect it.

> >> I don't believe that the current situation is acceptable, especially
> >> with the recent focus on performance (through the Snappy project), and I
> >> would like to ask people if they have any ideas on what we can do to fix
> >> this. The fix might be turning off some Talos tests if they're really
> >> not useful, asking someone or a group of people to go over these test
> >> results, get better tools with them, etc. But _something_ needs to
> >> happen here.


> > - Second, as you say, get an owner for performance regressions. There are lots of ways we could do this. I think it would integrate fairly easily into our existing processes if we (automatically or by a designated person) filed a bug for each regression and marked it tracking (so the release managers would own followup). Alternately, we could have a designated person own followup. I'm not sure if that has any advantages, but release managers would probably know. But doing any of this is going to severely annoy engineers unless we get the false positive rate under control.
>
> Note that some of the work of to differentiate between false positives
> and real regressions needs to be done by the engineers, similar to the
> work required to investigate correctness problems. And people need to
> accept that seemingly benign changes may also cause real performance
> regressions, so it's not always possible to glance over a changeset and
> say "nah, this can't be my fault." :-)

Agreed.

> > - Speaking of false positives, we should seriously start tracking them. We should keep track of each Talos regression found and its outcome. (It would be great to track false negatives too but it's a lot harder to catch them and record them accurately.) That way we'd actually know whether we have a few false positives or a lot, or whether the false positives were coming up on certain tests. And we could use that information to improve the false positive rate over time.
>
> I agree. Do you have any suggestions on how we would track them?

The details would vary according to the preferences of the person doing it, but I'd sketch it out something like this: when Talos detects a regression, file a bug to "resolve" it (i.e., show that it's not a real regression, show that it's an acceptable regression for the patch, or fix the regression). Then keep a file listing those bugs (with metadata for each: tests regressed, date, component, etc), and as each is closed, mark down the result: false positive, allowed, backed out, or fixed. That's your data set. Of course, various parts of this could be automated but that's not required.
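
As a rough illustration of the bookkeeping described above, here is a minimal Python sketch. The record fields and function names are hypothetical, and the bug numbers used with it would be placeholders; only the four outcome labels come from the description.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional

# Outcome labels taken from the description above.
OUTCOMES = {"false positive", "allowed", "backed out", "fixed"}


@dataclass
class RegressionBug:
    """One filed regression (all field names are illustrative)."""
    bug_id: int                    # placeholder bug number
    test: str                      # Talos test that regressed, e.g. "Ts"
    date: str                      # date the regression was reported
    component: str
    outcome: Optional[str] = None  # None while the bug is still open


def close_bug(bug, outcome):
    """Record the final outcome of a regression bug."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    bug.outcome = outcome


def false_positive_rate_by_test(bugs):
    """Return {test: fraction of closed bugs that were false positives}."""
    closed = defaultdict(Counter)
    for bug in bugs:
        if bug.outcome is not None:
            closed[bug.test][bug.outcome] += 1
    return {
        test: counts["false positive"] / sum(counts.values())
        for test, counts in closed.items()
    }
```

Open bugs are left out of the rate on purpose, so a test with many unresolved alarms does not look better than it is.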

Dave

Taras Glek

Aug 30, 2012, 5:42:06 PM8/30/12
to
Hi,
We had a quick meeting focused on how to not regress Talos.


Attendance: Taras Glek, Ehsan Akhgari, Clint Talbert, Nathan Froyd, Dave
Mandelin, Christina Choi, Joel Maher


Notes:
* Clint's Automation&Tools team is improving Talos reporting abilities.
We should have much better tools for comparing performance between 2
different changesets soon.
* Talos is now significantly easier to run locally than it used to be.
Expect blog posts from Joel/Ehsan
* Joel will revisit maintaining Talos within mozilla-central to reduce
developer barriers to understanding what a particular Talos test result
means. This should also make Talos easier to run

* We will implement a formal policy on Talos impact of merges.
** focus on perf tracking on inbound, to avoid merge pains
** We will extend the current merge criteria of last green PGO changeset
to also include "good Talos numbers"

* Nathan Froyd will look at historical data for the last Firefox nightly
release cycle to come up with threshold numbers for backing out commits
* Joel/Ehsan will look into using mozregression with talos so we can
bisect performance regressions locally. We will also consider doing
something similar on try.
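
One possible way to turn historical data into a backout threshold, sketched in Python. This is only an assumption about how such numbers could be derived, not the method anyone in the meeting committed to: it takes the push-to-push percent changes from a period believed to be regression-free and uses a multiple of their standard deviation. The multiplier `k` and all function names are illustrative.

```python
import statistics


def percent_changes(values):
    """Percent change between each consecutive pair of results."""
    return [100.0 * (b - a) / a for a, b in zip(values, values[1:])]


def backout_threshold(history, k=3.0):
    """Percent increase treated as significant.

    Uses k standard deviations of the historical push-to-push changes;
    k=3.0 is an arbitrary illustrative choice.
    """
    return k * statistics.stdev(percent_changes(history))


def exceeds_threshold(before, after, threshold_pct):
    """True if `after` is higher (worse) than `before` by more than the threshold."""
    return 100.0 * (after - before) / before > threshold_pct
```

A noisier test ends up with a larger threshold, which matches the intuition that small shifts on noisy tests should not trigger backouts.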

Taras

Ehsan Akhgari

Aug 30, 2012, 5:54:48 PM8/30/12
to Dave Mandelin, dev-pl...@lists.mozilla.org
On 12-08-30 5:28 PM, Dave Mandelin wrote:
> On Thursday, August 30, 2012 9:11:25 AM UTC-7, Ehsan Akhgari wrote:
>> On 12-08-29 9:20 PM, Dave Mandelin wrote:
>>
>>> On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
>>
>> In my opinion, one of the reasons why Talos is disliked is because many
>> people don't know where its code lives (hint:
>> http://hg.mozilla.org/build/talos/) and can't run those tests like other
>> test suites. I think this would be very valuable to fix, so that
>> developers can read Talos tests like any other test, and fix or improve
>> them where needed.
>
> It is hard to find. And beyond that, it seems hard to use. It's been a while since I've run Talos locally, but last time I did it was a pain to set up and difficult to run, and I hear it's still kind of like that. For testing tools, "convenient for the developer" is a critical requirement, but has been neglected in the past.
>
> js/src/jit-test/ is an example of something that is very convenient for developers: creating a test is just adding a .js file to a directory (no manifest or extra files; by default error or crash is a fail, but you can change that for a test), the harness is a Python file with nice options, the test configuration and basic usage is documented in a README, and it lives in the tree.

Absolutely! We really need to work hard to make them easier to run. I
hear that the Automation team has already been making progress towards
that goal.

>>>> [...] I believe
>>>> that the bigger problem is that nobody owns watching over these numbers,
>>>> and as a result as take regressions in some benchmarks which can
>>>> actually be representative of what our users experience.
>>
>>> The interesting thing is that we basically have no idea if that's true for any given Talos alarm.
>>
>> That's something that I think should be judged per benchmark. For
>> example, the Ts measurements will probably correspond very directly to
>> the startup time that our users experience. The Tp5 measurements don't
>> directly correspond to anything like that, since nobody loads those
>> pages sequentially, but it could be an indication of average page load
>> performance.
>
> I exaggerated a bit--yes, some tests like Ts are pretty easy to understand and do correspond to user experience. With Tp5, I just don't know--I haven't spent any time trying to use it or looking at regressions, since JS doesn't affect it.

Right. I think at the very least, on bigger tests like Tp5 we want to
know if something is regressed by a large amount, because that is very
likely to reflect an actual behavior change which is worth knowing about.

>>> - Speaking of false positives, we should seriously start tracking them. We should keep track of each Talos regression found and its outcome. (It would be great to track false negatives too but it's a lot harder to catch them and record them accurately.) That way we'd actually know whether we have a few false positives or a lot, or whether the false positives were coming up on certain tests. And we could use that information to improve the false positive rate over time.
>>
>> I agree. Do you have any suggestions on how we would track them?
>
> The details would vary according to the preferences of the person doing it, but I'd sketch it out something like this: when Talos detects a regression, file a bug to "resolve" it (i.e., show that it's not a real regression, show that it's an acceptable regression for the patch, or fix the regression). Then keep a file listing those bugs (with metadata for each: tests regressed, date, component, etc), and as each is closed, mark down the result: false positive, allowed, backed out, or fixed. That's your data set. Of course, various parts of this could be automated but that's not required.

Oh, sorry, I needed to ask my question better. I'm specifically
wondering who needs to track and investigate the regression if it
happened on a range of, let's say, 5 committers...


Cheers,
Ehsan

Ehsan Akhgari

Aug 30, 2012, 5:56:40 PM8/30/12
to Taras Glek, dev-pl...@lists.mozilla.org
On 12-08-30 5:42 PM, Taras Glek wrote:
> * Joel will revisit maintaining Talos within mozilla-central to reduce
> developer barriers to understanding what a particular Talos test result
> means. This should also make Talos easier to run

I have filed bug 787200 for this discussion.

Ehsan

L. David Baron

Aug 30, 2012, 7:22:22 PM8/30/12
to dev-pl...@lists.mozilla.org
On Thursday 2012-08-30 14:42 -0700, Taras Glek wrote:
> * Joel will revisit maintaining Talos within mozilla-central to
> reduce developer barriers to understanding what a particular Talos
> test result means. This should also make Talos easier to run

This will also solve one of the other problems that leads developers
to distrust talos, which is that a significant portion of the
performance regressions reported are (or at least were at one time)
the result of changes to the tests, but that changes to the tests
don't show up as part of the list of suspected causes of
regressions.

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla http://www.mozilla.org/ 𝄂

Rafael Ávila de Espíndola

Aug 30, 2012, 7:33:53 PM8/30/12
to dev-pl...@lists.mozilla.org
> Some people have noted in the past that some Talos measurements are not
> representative of something that the users would see, the Talos numbers
> are noisy, and we don't have good tools to deal with these types of
> regressions. There might be some truth to all of these, but I believe
> that the bigger problem is that nobody owns watching over these numbers,
> and as a result as take regressions in some benchmarks which can
> actually be representative of what our users experience.

I was recently hit by most of the shortcomings you mentioned while
trying to upgrade clang. Fortunately, I found the issue on try, but I
will admit that comparing talos on try is something I only do when I
expect a problem.

I still intend to write a blog post once I am done with the update and
have more data, but here are some interesting points that have shown up
so far:
* compare-talos and compare.py were out of date. I was really lucky that
one of the benchmarks that still had the old name was the one that
showed the regression. I have started a script that I hope will be more
resilient to future changes. (bug 786504).

* our builds are *really* hard to reproduce. The build I was downloading
from try was faster than the one I was doing locally. In despair I
decided to fix at least part of this first. I found that our build was
depending on the way the bots use ccache (they set CCACHE_BASEDIR which
changes __FILE__), the build directory (shows up in debug info that is
not stripped), and the file system being case sensitive or not.

* testing on Linux showed even more bizarre cases where small changes
cause performance problems. In particular, adding a nop *after the last
ret* in a function would make the JS interpreter faster on SunSpider. The
nop was just enough to make the function size cross the next 16-byte
boundary and that changed the address of every function linked after it.

* the histograms of some benchmarks don't look like a normal distribution
(https://plus.google.com/u/0/108996039294665965197/posts/8GyqMEZHHVR). I
still have to read the paper mentioned in the comments.
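
To make the nop example above concrete, here is a small Python sketch of the address arithmetic. It assumes, purely for illustration, that each function is placed at the next 16-byte boundary after the previous one ends; the function sizes are made up and are not measurements from any real build.

```python
ALIGNMENT = 16  # bytes; the boundary mentioned above


def align_up(address, alignment=ALIGNMENT):
    """Round `address` up to the next multiple of `alignment`."""
    return (address + alignment - 1) // alignment * alignment


def layout(function_sizes, base=0):
    """Return the start address of each function when placed in order,
    with every function starting on an aligned boundary."""
    addresses = []
    cursor = base
    for size in function_sizes:
        start = align_up(cursor)
        addresses.append(start)
        cursor = start + size
    return addresses
```

With sizes `[32, 40, 24]` the functions start at `[0, 32, 80]`. Growing the first function by a single byte to 33 gives `[0, 48, 96]`, so every later function moves by 16 bytes. Growing a 30-byte function to 31 bytes changes nothing, which is why the effect only shows up when the size crosses a boundary.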

> I don't believe that the current situation is acceptable, especially
> with the recent focus on performance (through the Snappy project), and I
> would like to ask people if they have any ideas on what we can do to fix
> this. The fix might be turning off some Talos tests if they're really
> not useful, asking someone or a group of people to go over these test
> results, get better tools with them, etc. But _something_ needs to
> happen here.

There are many things we can do to make perf debugging/testing better,
but I don't think that is the main thing we need to do to solve the
problem. The tools we have do work. Try is slow and talos is noisy, but
it is possible to detect and debug regressions.

What I think we need to do is differentiate tests that we expect to
match user experience and synthetic tests. Synthetic tests *are* useful
as they can much more easily find what changed, even if it is something
as silly as the address of some function. The difference is that we
don't want to regress on the tests that match user experience. IMHO we
*can* regress on synthetic ones as long as we know what is going on. And
yes, if a particular synthetic test is too brittle then we should remove it.

With the distinction in place we can then handle perf regressions in a
similar way to how we handle test failures: revert the offending patch
and make the original developer responsible for tracking it down. If a
patch is known to regress a synthetic benchmark, a comment on the commit
along the lines of "renaming this file causes __FILE__ to change in an
assert message and produces a spurious regression on md5" should be
sufficient. It is not the developer's *fault* that that causes a problem,
but IMHO it should still be his responsibility to track it.

> Cheers,
> Ehsan
>
>

Cheers,
Rafael

Ben Hearsum

Aug 30, 2012, 9:05:36 PM8/30/12
to L. David Baron
On 08/30/12 07:22 PM, L. David Baron wrote:
> On Thursday 2012-08-30 14:42 -0700, Taras Glek wrote:
>> * Joel will revisit maintaining Talos within mozilla-central to
>> reduce developer barriers to understanding what a particular Talos
>> test result means. This should also make Talos easier to run
>
> This will also solve one of the other problems that leads developers
> to distrust talos, which is that a significant portion of the
> performance regressions reported are (or at least were at one time)
> the result of changes to the tests, but that changes to the tests
> don't show up as part of the list of suspected causes of
> regressions.

This isn't true anymore, actually. While Talos itself isn't stored in
mozilla-central, a pointer to a specific version of it is. The test
machines pull the Talos version specified in
https://mxr.mozilla.org/mozilla-central/source/testing/talos/talos.json
at test time. For example (from
https://tbpl.mozilla.org/php/getParsedLog.php?id=14852731&tree=Firefox&full=1):
/tools/buildbot/bin/python talos_from_code.py --talos-json-url
http://hg.mozilla.org/mozilla-central/raw-file/f972f1a71e7e/testing/talos/talos.json

This means that changes to the Talos suite *are* associated with a
mozilla-central revision, have tests run for them, can be backed out,
can ride trains, etc.
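
A tiny Python sketch of the pinning described above: given a mozilla-central revision, it builds the raw-file URL of the talos.json that revision points at, following the URL pattern in the log line quoted in this message. The function name and the revision-format check are illustrative assumptions.

```python
import re

# URL pattern taken from the log excerpt above; {rev} is the
# mozilla-central changeset the test run is associated with.
RAW_FILE_URL = (
    "http://hg.mozilla.org/mozilla-central/raw-file/"
    "{rev}/testing/talos/talos.json"
)


def talos_json_url(revision):
    """Return the talos.json URL pinned to `revision` (a hex changeset id)."""
    if not re.fullmatch(r"[0-9a-f]{12,40}", revision):
        raise ValueError(f"not a changeset id: {revision!r}")
    return RAW_FILE_URL.format(rev=revision)
```

Because the URL contains the changeset id, two test runs on different revisions can end up using different Talos versions, and that difference is visible in the mozilla-central history.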

Dave Mandelin

Aug 30, 2012, 9:13:33 PM8/30/12
to Dave Mandelin, dev-pl...@lists.mozilla.org
On Thursday, August 30, 2012 2:54:55 PM UTC-7, Ehsan Akhgari wrote:
> Oh, sorry, I needed to ask my question better. I'm specifically
> wondering who needs to track and investigate the regression if it
> happened on a range of, let's say, 5 committers...

Ah. I believe that's a job for a bugmaster, a position that we don't have filled at the moment. We need one. Perhaps one or more people in QA can step into part of that role, possibly temporarily.

Otherwise, it seems we just have to share the pain. Bisecting changesets is not necessarily an enjoyable job but it is a necessary one. I would suggest that sheriffs pick one of the 5 committers and ask that person to bisect the change and try not to pick the same person repeatedly (unless that person keeps landing the regressions!).

Dave

Anthony Jones

Aug 30, 2012, 10:39:14 PM8/30/12
to dev-pl...@lists.mozilla.org
On 31/08/12 13:13, Dave Mandelin wrote:
> Otherwise, it seems we just have to share the pain. Bisecting changesets is not necessarily an enjoyable job but it is a necessary one. I would suggest that sheriffs pick one of the 5 committers and ask that person to bisect the change and try not to pick the same person repeatedly (unless that person keeps landing the regressions!).

Finding an offending commit within n commits is scriptable.
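
For instance, a binary search over the range could look like the Python sketch below. It assumes the commits are ordered oldest to newest, that the last one is known to be regressed, and that a caller-supplied `is_regressed` check (hypothetical here; in practice it would build the commit and run the benchmark, probably several times to average out noise) flips from False to True exactly once in the range.

```python
def first_bad_commit(commits, is_regressed):
    """Return the first commit for which is_regressed(commit) is True.

    `commits` is ordered oldest to newest and must be non-empty;
    the last commit is assumed to be regressed.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_regressed(commits[mid]):
            hi = mid        # the culprit is mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return commits[lo]
```

For n commits this needs about log2(n) benchmark runs, so a range of 5 committers costs two or three builds rather than five.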

Anthony Hughes

Aug 30, 2012, 11:41:16 PM8/30/12
to Dave Mandelin, dev-pl...@lists.mozilla.org
I think tracking and investigating is all of our responsibility. QA definitely has a role to play and I think we've been playing that role to a certain extent. We don't always have the skills, knowledge, experience, or time to help but we always try and we are always willing to learn. We rely on Release Management to keep us apprised of what's important and we rely on developers to help us understand the code, tools, and testcases.

Having a Bugmaster will certainly improve things but I don't think it eliminates the necessity, nor the desire for this collaborative dynamic.

----- Original Message -----
From: "Dave Mandelin" <dman...@gmail.com>
To: dev-pl...@lists.mozilla.org
Cc: "Dave Mandelin" <dman...@gmail.com>, dev-pl...@lists.mozilla.org
Sent: Thursday, August 30, 2012 6:13:33 PM
Subject: Re: The current state of Talos benchmarks

jmaher

Aug 31, 2012, 3:35:10 AM8/31/12
to L. David Baron
I have backed out changes made to talos and the tests a few times due to performance regressions. While I might not catch every one, we do treat talos changes as another changeset in m-c.

If there is an expected shift in numbers, we create a new test. This is why there are 5+ versions of all the tests. It really adds a lot of overhead and breakage (e.g. compare-talos), but this way we don't confuse the old test data with the new adjusted tests.

Justin Lebar

Aug 31, 2012, 5:37:19 AM8/31/12
to Taras Glek, dev-pl...@lists.mozilla.org
Sorry to continue beating this horse, but I don't think it's quite dead yet:

One of the best things we could do to make finding these regressions
easier is to never coalesce Talos on mozilla-inbound. It's crazy to
waste developer time bisecting Talos locally when we don't run it on
every push.

Another thing that would help a lot is fixing bug 752002, so people
will stop filtering the e-mails.

Justin Lebar

Aug 31, 2012, 6:01:15 AM8/31/12
to Rafael Ávila de Espíndola, dev-pl...@lists.mozilla.org
> IMHO we *can* regress on synthetic ones as long as we know what is going on.

It's the requirement that we know what is going on that I think is unreasonable.

Indeed, we /have/ a "no not-understood regressions" policy, IIRC. The
extent to which it's being ignored is at least partially indicative of
how difficult these changes can be to track down. Rafael's post has
some great examples of how insane tracking down perf regressions can
be.

I really don't think that the right way to go about fixing our
proclivity to regress Talos is to "get tough on regressions" and make
this every committer's problem. We shouldn't expect committers to
track down the fact that "my change pushes X function down 16 bytes,
which changes some other function's alignment, which, in combination
with a change to __FILE__, affects benchmark Y" as a regular part of
their job. And it's not clear to me that we would have any tests left if
we eliminated from the tree all tests which are affected by this sort
of thing.

I think the right way to go about this is to first investigate which
tests are stable, and how stable they are (*). Then a team of
engineers can gain some experience finding and understanding
regressions which occur over some period of time, so we can understand
how feasible it would be to seriously ask developers to do this as a
part of their day-to-day jobs.

I'm not saying it should be OK to regress our performance tests, as a
rule. But I think we need to acknowledge that hunting regressions can
be time-consuming, and that a policy requiring that all regressions be
understood may hamstring our ability to get anything else done.
There's a trade-off here that we seem to be ignoring.

-Justin

(*) This is essentially SfN.

Ehsan Akhgari

Aug 31, 2012, 11:32:25 AM8/31/12
to Justin Lebar, Rafael Ávila de Espíndola, dev-pl...@lists.mozilla.org
On 12-08-31 6:01 AM, Justin Lebar wrote:
> I'm not saying it should be OK to regress our performance tests, as a
> rule. But I think we need to acknowledge that hunting regressions can
> be time-consuming, and that a policy requiring that all regressions be
> understood may hamstring our ability to get anything else done.
> There's a trade-off here that we seem to be ignoring.

There is definitely a trade-off here, and at least for the past year
(and maybe for the past two years) we have in practice given so much
weight to the difficulty of tracking down performance regressions that
we've been ignoring them (except for perhaps a few people).

It is a mistake to take Rafael's example and extend it to the average
regression that we measure on Talos. It's true that sometimes those
things happen, and in practice we cannot deal with them all, because we
don't have an army of Rafaels. But it bothers me when people take an
example of a very difficult-to-understand regression encountered by a
person who bravely deals with low-level compiler code generation stuff
and extend it to come up with a policy covering all regressions.
Please, let's not do that.

And let's remember the other side of the trade-off too. A lot of blood
and tears have gone into shaving off milliseconds from our startup time.
Taking a ~5% hit on startup time within a 6-week cycle effectively
means that we have undone man-months of optimizations which have
happened to the startup time. So it's not like letting these
regressions in beneath our noses is going to make us all more productive.

There are extremely non-stable Talos tests, and relatively stable ones.
Let's focus on the relatively stable ones. There are extremely hard
to diagnose performance regressions, and extremely easy ones (e.g.,
let's not wait on this lock, do this I/O, run this exponential
algorithm, load tons of XUL/XBL when a window opens, etc.) We have many
great tools for the job, so not all regressions need to be treated the same.

Cheers,
Ehsan

Ehsan Akhgari

Aug 31, 2012, 11:38:05 AM8/31/12
to Justin Lebar, Taras Glek, dev-pl...@lists.mozilla.org
On 12-08-31 5:37 AM, Justin Lebar wrote:
> Sorry to continue beating this horse, but I don't think it's quite dead yet:
>
> One of the best things we could do to make finding these regressions
> easier is to never coalesce Talos on mozilla-inbound. It's crazy to
> waste developer time bisecting Talos locally when we don't run it on
> every push.

In order to help kill that horse, I filed bug 787447 and CCed John on
it. :-)

Ehsan

Chris AtLee

Aug 31, 2012, 11:45:20 AM8/31/12
to
On 31/08/12 11:32 AM, Ehsan Akhgari wrote:
> There are extremely non-stable Talos tests, and relatively stable ones.
> Let's focus on the relatively stable ones. There are extremely hard
> to diagnose performance regressions, and extremely easy ones (i.e.,
> let's not wait on this lock, do this I/O, run this exponential
> algorithm, load tons of XUL/XBL when a window opens, etc.) We have many
> great tools for the job, so not all regressions need to be treated the
> same.

What value do the extremely non-stable Talos tests have? Shouldn't we
stop running them if they're not giving useful information?

Justin Lebar

Aug 31, 2012, 11:46:44 AM8/31/12
to Ehsan Akhgari, Rafael Ávila de Espíndola, dev-pl...@lists.mozilla.org
> There are extremely non-stable Talos tests, and relatively stable ones.
> Let's focus on the relatively stable ones.

It's not exclusively a question of noise in the tests. Even
regressions in stable tests are sometimes hard to track down. I spent
two months trying to figure out why I could not reproduce a Dromaeo
regression I saw on m-i using try, and eventually gave up (bug
653961).

It's great if we try to track down this mysterious 5% startup
regression. We shouldn't ignore important regressions. But what I
object to is the idea that if I regress Dromaeo DOM by 2%, I'm
automatically backed out and prevented from doing any work until I
prove that the problem is that I changed a filename somewhere.

On Fri, Aug 31, 2012 at 12:32 PM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
> On 12-08-31 6:01 AM, Justin Lebar wrote:
>>
>> I'm not saying it should be OK to regress our performance tests, as a
>> rule. But I think we need to acknowledge that hunting regressions can
>> be time-consuming, and that a policy requiring that all regressions be
>> understood may hamstring our ability to get anything else done.
>> There's a trade-off here that we seem to be ignoring.
>
>
> There is definitely a trade-off here, and at least for the past year (and
> maybe for the past two years) we have in practice been weighing on the side
> of the difficulty of tracking down performance regression to the point that
> we've been ignoring them (except for perhaps a few people.)
>
> It is a mistake to take Rafael's example and extend it to the average
> regression that we measure on Talos. It's true that sometimes those things
> happen, and in practice we cannot deal with them all, because we don't have
> an army of Rafaels. But it bothers me when people take an example of a very
> difficult to understand regression encountered by a person who bravely
> dwells with low-level compiler code generation stuff and extend it to come
> up with a policy covering all regressions. Please, let's not do that.
>
> And let's remember the other side of the trade-off too. A lot of blood and
> tears has gone into shaving off milliseconds from our startup time. Taking
> a ~5% hit on startup time within a 6-week cycle effectively means that we
> have undone man-months of optimizations which have happened to the startup
> time. So it's not like letting these regressions in beneath our noses is
> going to make us all more productive.
>
> There are extremely non-stable Talos tests, and relatively stable ones.
> Let's focus on the relatively stable ones. There are extremely hard to
> diagnose performance regressions, and extremely easy ones (i.e., let's not
> wait on this lock, do this I/O, run this exponential algorithm, load tons of
> XUL/XBL when a window opens, etc.) We have many great tools for the job, so
> not all regressions need to be treated the same.
>
> Cheers,
> Ehsan

jmaher

Aug 31, 2012, 11:49:50 AM8/31/12
to dev-pl...@lists.mozilla.org
There are a few issues here which should be easy to address and a few other issues which are not so easy to address.

First off, everybody who is interested in Talos should read the wiki page:
https://wiki.mozilla.org/Buildbot/Talos

This explains where the code lives, what tests we run and on which branches and many other details on what is done to run talos. If you want to run talos locally, check out these instructions:
https://wiki.mozilla.org/Buildbot/Talos#Running_locally_-_Source_Code


Another concern I have read in this thread and heard over the last few months is why we are even running these tests, since they are old, irrelevant and nobody looks at them. It is a valid concern and something I have asked myself many times while working on Talos. Earlier this summer I took it upon myself to find a developer to be the point of contact for each and every test we run. Then we figured out whether the tests were relevant and testing things we care about. Many tests have been updated, added or disabled in the last couple of months.

A similar complaint is about the noise in the numbers and how we can realistically detect a regression or get value out of the data. For minor regressions our current toolchain will not be very effective. A lot of work has been done to look into how we run tests, the tools we use and whether we can apply different models to the numbers to get more reliable data. Most of that work is documented in the Signal From Noise project: https://wiki.mozilla.org/Auto-tools/Projects/Signal_From_Noise. I encourage folks to join the public meetings we have to learn more about how we are actually solving this problem.


Back on subject, we want to trace regressions to the exact changeset as well as reduce the false positives that get mailed to dev.tree-management. There is probably no silver bullet or policy we can create today which will fix our problems. There is a big lag between a patch's run of Talos and when we get a notification in dev.tree-management. Large regressions can be detected by visually looking at graph server (we have links to everything from tbpl), but small regressions have to be watched over time, as a minor increase could look like the regular noise we have in our numbers.

Coming from a Talos tool maintainer's perspective, I am committed to making Talos easy to run and well documented so we can all work on fixing regressions instead of offering sacrifices to the try server. When there are requests for features, fixes or test adjustments, somebody on the A*Team will usually resolve them quickly. While this only solves some of the pain, it is a step in the right direction until Signal From Noise can come out and solve a large portion of the other problems.
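
To make the point about small regressions hiding in noise concrete, here is a minimal sketch in Python. It is not how Talos or graph server actually decides to send an alert; the function name, the threshold and the sample numbers are all assumptions made up for illustration. It compares the runs before and after a push and only flags a change when the difference in means is large relative to the noise.

```python
from statistics import mean, stdev

def looks_like_regression(before, after, threshold=3.0):
    """Return True if `after` is slower than `before` by more than
    `threshold` standard errors (assumes higher numbers mean slower).

    This is an illustrative heuristic, not the Talos alerting algorithm.
    """
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two samples on each side")
    # Standard error of the difference between the two means.
    se = (stdev(before) ** 2 / len(before)
          + stdev(after) ** 2 / len(after)) ** 0.5
    if se == 0:
        return mean(after) > mean(before)
    return (mean(after) - mean(before)) / se > threshold

# Hypothetical startup times in milliseconds. Both pairs shift by 10 ms,
# but only the quiet test makes that shift stand out from the noise.
quiet_before = [500, 502, 499, 501, 500]
quiet_after = [510, 511, 509, 512, 510]
noisy_before = [480, 520, 495, 510, 500]
noisy_after = [490, 530, 505, 520, 510]

print(looks_like_regression(quiet_before, quiet_after))
print(looks_like_regression(noisy_before, noisy_after))
```

The same 2% shift is flagged for the quiet test and missed for the noisy one, which is why a small regression on a noisy test only becomes visible after many more runs.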

Dave Mandelin

unread,
Aug 30, 2012, 5:28:01 PM8/30/12
to mozilla.de...@googlegroups.com, Dave Mandelin, dev-pl...@lists.mozilla.org
On Thursday, August 30, 2012 9:11:25 AM UTC-7, Ehsan Akhgari wrote:
> On 12-08-29 9:20 PM, Dave Mandelin wrote:
>
> > On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
>
> In my opinion, one of the reasons why Talos is disliked is because many
> people don't know where its code lives (hint:
> http://hg.mozilla.org/build/talos/) and can't run those tests like other
> test suites. I think this would be very valuable to fix, so that
> developers can read Talos tests like any other test, and fix or improve
> them where needed.

It is hard to find. And beyond that, it seems hard to use. It's been a while since I've run Talos locally, but last time I did it was a pain to set up and difficult to run, and I hear it's still kind of like that. For testing tools, "convenient for the developer" is a critical requirement, but has been neglected in the past.

js/src/jit-test/ is an example of something that is very convenient for developers: creating a test is just adding a .js file to a directory (no manifest or extra files; by default error or crash is a fail, but you can change that for a test), the harness is a Python file with nice options, the test configuration and basic usage is documented in a README, and it lives in the tree.

> >> [...] I believe
> >> that the bigger problem is that nobody owns watching over these numbers,
> >> and as a result we take regressions in some benchmarks which can
> >> actually be representative of what our users experience.
>
> > The interesting thing is that we basically have no idea if that's true for any given Talos alarm.
>
> That's something that I think should be judged per benchmark. For
> example, the Ts measurements will probably correspond very directly to
> the startup time that our users experience. The Tp5 measurements don't
> directly correspond to anything like that, since nobody loads those
> pages sequentially, but it could be an indication of average page load
> performance.

I exaggerated a bit--yes, some tests like Ts are pretty easy to understand and do correspond to user experience. With Tp5, I just don't know--I haven't spent any time trying to use it or looking at regressions, since JS doesn't affect it.

> >> I don't believe that the current situation is acceptable, especially
> >> with the recent focus on performance (through the Snappy project), and I
> >> would like to ask people if they have any ideas on what we can do to fix
> >> this. The fix might be turning off some Talos tests if they're really
> >> not useful, asking someone or a group of people to go over these test
> >> results, get better tools with them, etc. But _something_ needs to
> >> happen here.


> > - Second, as you say, get an owner for performance regressions. There are lots of ways we could do this. I think it would integrate fairly easily into our existing processes if we (automatically or by a designated person) filed a bug for each regression and marked it tracking (so the release managers would own followup). Alternately, we could have a designated person own followup. I'm not sure if that has any advantages, but release managers would probably know. But doing any of this is going to severely annoy engineers unless we get the false positive rate under control.
>
> Note that some of the work to differentiate between false positives
> and real regressions needs to be done by the engineers, similar to the
> work required to investigate correctness problems. And people need to
> accept that seemingly benign changes may also cause real performance
> regressions, so it's not always possible to glance over a changeset and
> say "nah, this can't be my fault." :-)

Agreed.

> > - Speaking of false positives, we should seriously start tracking them. We should keep track of each Talos regression found and its outcome. (It would be great to track false negatives too but it's a lot harder to catch them and record them accurately.) That way we'd actually know whether we have a few false positives or a lot, or whether the false positives were coming up on certain tests. And we could use that information to improve the false positive rate over time.
>
> I agree. Do you have any suggestions on how we would track them?
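
One possible shape for such tracking, sketched in Python under stated assumptions: each alert is recorded with its test name and an outcome once someone has looked into it. The outcome labels and the function name below are hypothetical, not an existing tool. The per-test false positive rate then falls out of a simple count.

```python
from collections import defaultdict

# Outcome labels are assumptions made up for this sketch.
REAL, FALSE_POSITIVE, UNRESOLVED = "real", "false_positive", "unresolved"

def false_positive_rates(alerts):
    """Map each test name to the fraction of its resolved alerts that
    turned out to be false positives.

    `alerts` is an iterable of (test_name, outcome) pairs. Alerts that
    are still unresolved are ignored, and a test with no resolved
    alerts does not appear in the result.
    """
    counts = defaultdict(lambda: {REAL: 0, FALSE_POSITIVE: 0})
    for test, outcome in alerts:
        if outcome in (REAL, FALSE_POSITIVE):
            counts[test][outcome] += 1
    return {
        test: c[FALSE_POSITIVE] / (c[REAL] + c[FALSE_POSITIVE])
        for test, c in counts.items()
    }

# Hypothetical alert log.
alerts = [
    ("Ts", REAL),
    ("Ts", FALSE_POSITIVE),
    ("Tp5", FALSE_POSITIVE),
    ("Tp5", FALSE_POSITIVE),
    ("Tp5", REAL),
    ("Tp5", FALSE_POSITIVE),
    ("Dromaeo", UNRESOLVED),
]
print(false_positive_rates(alerts))
```

Keeping the log per test is what would show whether false positives cluster on certain tests, as suggested in the quoted text.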

Dave Mandelin

unread,
Aug 30, 2012, 9:13:33 PM8/30/12
to mozilla.de...@googlegroups.com, Dave Mandelin, dev-pl...@lists.mozilla.org
On Thursday, August 30, 2012 2:54:55 PM UTC-7, Ehsan Akhgari wrote:
> Oh, sorry, I needed to ask my question better. I'm specifically
> wondering who needs to track and investigate the regression if it
> happened on a range of, let's say, 5 committers...

Ehsan Akhgari

unread,
Aug 31, 2012, 3:59:22 PM8/31/12
to Chris AtLee, dev-pl...@lists.mozilla.org
Either that, or find some way of making them more stable, such as not
measuring the wall clock time.

Ehsan

Chris AtLee

unread,
Aug 31, 2012, 4:03:44 PM8/31/12
to
Sure, that sounds like a great project. Until that's finished, is there
any value to running these suites, or are they expensive random number
generators?

Ehsan Akhgari

unread,
Sep 1, 2012, 10:08:40 AM9/1/12
to Chris AtLee, dev-pl...@lists.mozilla.org
I think that is something that needs to be evaluated on a per-test
per-platform basis, hopefully by someone who knows a bit about
statistics. :-)

Cheers,
Ehsan

jmaher

unread,
Sep 1, 2012, 4:18:33 PM9/1/12
to Chris AtLee, dev-pl...@lists.mozilla.org
We are detecting regressions with this despite the large levels of noise. So while it might appear to be a waste of machine resources to some, Talos serves a purpose. Having people look at the results more frequently will solve many of the problems.

I would say a handful of tests/counters on certain platforms are not very useful in the current way we are reporting numbers.
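
As a rough illustration of evaluating noise on a per-test, per-platform basis, here is a minimal Python sketch. The 5% cutoff, the platform names and the sample values are assumptions, not real Talos data; the idea is simply to rank each (test, platform) pair by its coefficient of variation.

```python
from statistics import mean, stdev

def noise_by_test_and_platform(results):
    """Return the coefficient of variation (stdev / mean) for every
    (test, platform) key that has at least two measurements.

    Assumes all measurements are positive, e.g. times in milliseconds.
    """
    return {
        key: stdev(values) / mean(values)
        for key, values in results.items()
        if len(values) >= 2
    }

def too_noisy(results, max_cv=0.05):
    """List the (test, platform) pairs whose run-to-run variation
    exceeds `max_cv`. The default cutoff is an arbitrary assumption."""
    noise = noise_by_test_and_platform(results)
    return sorted(key for key, cv in noise.items() if cv > max_cv)

# Hypothetical measurements for the same build on two platforms.
results = {
    ("Ts", "win7"): [500, 502, 499, 501, 500],
    ("Tp5", "osx"): [300, 360, 280, 340, 320],
}
print(too_noisy(results))
```

A pair that shows up here is a candidate for a different reporting method or for being switched off, rather than a reason to distrust every test.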

Justin Wood (Callek)

unread,
Sep 2, 2012, 12:33:59 AM9/2/12
to Taras Glek
Taras Glek wrote:
> * Joel will revisit maintaining Talos within mozilla-central to reduce
> developer barriers to understanding what a particular Talos test result
> means. This should also make Talos easier to run

To call out this point explicitly.

I'm not convinced that folding it into m-c is the necessary way forward,
and I think before folding in any other of our "stable" but
external-to-m-c repos we should start a community discussion on general
guidelines as to why/why not we would do that, and THEN evaluate those
against WHY we want talos, what goals are we solving, etc.

I don't feel that "reduce developer barriers to understanding what a
particular Talos test result means." is helped by this. If you [anyone]
think so, can you try to articulate why here in this thread?

[I note that jhammel and I, at least, were discussing this in the bug
about moving talos to m-c as well, which we both agree does not belong
as an in-bug discussion -- and I do feel the move, if the talos module
owner feels it is necessary, should not get blocked on a need for an
external process, but I do feel we should think hard on this]

--
~Justin Wood (Callek)

Asa Dotzler

unread,
Sep 3, 2012, 4:46:08 PM9/3/12
to
On 8/29/2012 5:00 PM, Matt Brubeck wrote:

> * I don't like our reactive approach that focuses on trying to identify
> regressions, and then decide whether to fix them in place, back them
> out, or ignore them. Instead we should proactively set goals for what
> our performance should be, and focus on the best way to get it there (or
> keep it there). The goals should be based on the desired user experience,
> and we should focus only on metrics that reflect those user experience
> goals.

I agree 100%. We need staff whose job includes understanding what users
need in terms of performance and turning those into goals for
engineering. I'm trying to hire this person.
http://careers.mozilla.org/en-US/position/oH5GWfw8 If you know of anyone
who fits the bill, please email me.

> * We need staff whose job includes deciding which regressions are
> meaningful, identifying the cause, following up to make sure it's backed
> out or fixed, and refining the process and tools used to make all this
> possible. Too much slips through the cracks when we leave this to
> volunteers (including "employeeteers" like Ehsan or me).

Also very true. I think this role is probably a Program Manager (or
three.) Sheila Mooney's team is small but growing. I'd sure love to have
another one or two hired into her team focused on this.

- A

Ed Morley

unread,
Sep 4, 2012, 12:01:10 PM9/4/12
to
Thank you for posting this - I've become increasingly worried by the number of real regressions that we are likely missing due to the current situation.

Like yourself I used to read every single dev.tree-management mail & try to act on those that looked real, however after many months of not feeling like it was making any difference at all, I stopped checking regularly since it was a very unproductive use of my time.

The main issues were:

1) Many seemingly irrelevant tests (eg codesize, number of constructors, ...?) and noisy tests causing the SNR of dev.tree-management to be very low, resulting in dev.tree-management receiving hundreds of regression emails a week. This meant it was both a struggle to find time for them all on a daily basis and also to remember which were reliable and so should be more urgently acted upon.

2) Large amounts of coalescing causing the regression range to be wide, resulting in either vast amounts of sheriff time spent trying retriggers/TryServer runs, or else developers believing the regression must have been caused by someone else in the range and so not looking into it themselves.

3) Lack of documentation for even working out what the test in question does & how to run it locally. I filed bug 770460 to deal with this, which jhammel has kindly done, so the situation is a bit better than it was: https://wiki.mozilla.org/Buildbot/Talos

4) The combination of the above causing:
a) Devs to either not believe the regression could be anything other than noise/due to someone else's patch, or else try really hard to investigate the regression but be unable to due to gaps in the documentation/difficulty in reproducing locally.
b) Sheriffs to not have enough conviction in the reliability of a reported regression to feel able to hassle said devs about (a).

As an example, I filed bug 778718 (30% Windows Ts regression since 1st March) a month ago, yet it has sat inactive for a while now and I sadly don't see us making any progress on it before we release the regressed builds :-(

--

On Thursday, 30 August 2012 01:01:26 UTC+1, Matt Brubeck wrote:
> * Less is more. We can pay more attention to tests if every alert is
> for something we care about. We can *track* stuff like Trace Malloc
> Allocs if there are people who find the data useful in their work, but
> we should not *alert* on it unless it is a key user-facing metric.

On Thursday, 30 August 2012 02:20:24 UTC+1, Dave Mandelin wrote:
> - First, and most important, fix the test suite so that it measures only things that are useful and meaningful to developers and users. We can easily take a first cut at this if engineering teams go over the tests related to their work, and tell A-Team which are not useful. Over time, I think we need to get a solid understanding of what performance looks like to users, what things to test, and how to test them soundly. This may require dedicated performance engineers or a performance product manager.

On Thursday, 30 August 2012 17:11:25 UTC+1, Ehsan Akhgari wrote:
> Note that some of the work to differentiate between false positives
> and real regressions needs to be done by the engineers, similar to the
> work required to investigate correctness problems. And people need to
> accept that seemingly benign changes may also cause real performance
> regressions, so it's not always possible to glance over a changeset and
> say "nah, this can't be my fault." :-)

I agree entirely with each of the above.


On Thursday, 30 August 2012 16:53:08 UTC+1, Ehsan Akhgari wrote:
> The way that we handle similar situations with test failures is to close
> inbound, back out enough patches so that things go green again, and then
> reopen, and ask the people that had been backed out to retry landing
> after testing on try.
>
> This is entirely possible for Talos regressions as well if we want to,
> it just takes longer, and will mean more patches will get backed out.

On Thursday, 30 August 2012 22:42:07 UTC+1, Taras Glek wrote:
> ** We will extend the current merge criteria of last green PGO changeset
> to also include "good Talos numbers"

Once we have the main issues resolved, then I agree this will be the best way to victory, and in fact I will be happy to take the lead on pro-actively checking dev.tree-management before merging.

However at this precise point in time, I feel that blocking merges on talos "green" is not yet viable, since the reasons at the start of my post still hold. I'm hopeful that many of them can be mitigated over the next couple of weeks, but for now I think we're going to have to deal with just the most obvious talos regressions in the days shortly after each merge, rather than blocking the merges on them.

In addition, as far as I'm aware, it takes several talos runs (after the regression) before the regression mails are triggered (presumably more so for noisy tests), so we would have to wait several hours longer than the usual "PGO green on all platforms" before we could be sure it was safe to merge. Add to this the end-to-end time of retriggers to counteract coalescing, and I suspect this would cause a fair amount of grief between sheriffs and devs, if our experience with delaying merges for unit test failures is anything to go by. Can someone who knows talos better than I do clarify when the regression emails are sent?
I would absolutely love it if we were able to do this, but with our current capacity issues (which really do need resolving, but that's another thread, to which I'm replying after this), we just don't have anywhere near the spare machine time.

What would be useful, however, is an easy way to mark a regression range (eg from that given in a dev.tree-management email) as needing retriggers on every push to combat coalescing, which then schedules them for our quieter machine times, eg weekends/outside PDT hours. If we did this, we still wouldn't be able to block mozilla-inbound merges on 'green' talos, but we would at least be in a better position than we are now, where the regression ranges are so wide that they are not easily actionable.
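
A minimal Python sketch of that idea, with everything in it assumed for illustration: the busy hours (09:00 to 18:00 on weekdays, Pacific time), the function names, the changeset ids and the number of retriggers per push are all made up, and nothing here talks to the real scheduling infrastructure. It only works out when the retriggers for a marked range could start.

```python
from datetime import datetime

# Assumed busy window on weekdays, in local (Pacific) time.
BUSY_START, BUSY_END = 9, 18

def is_quiet(moment):
    """Weekends and weekday hours outside the busy window count as quiet."""
    return moment.weekday() >= 5 or not (BUSY_START <= moment.hour < BUSY_END)

def next_quiet_time(now):
    """Return `now` if it is already quiet, else the end of today's busy window."""
    if is_quiet(now):
        return now
    return now.replace(hour=BUSY_END, minute=0, second=0, microsecond=0)

def plan_retriggers(changesets, now, per_push=3):
    """Plan `per_push` Talos retriggers for every push in a regression
    range, all deferred to the next quiet period.

    Returns a list of (changeset, start_time, count) tuples.
    """
    start = next_quiet_time(now)
    return [(cs, start, per_push) for cs in changesets]

# Hypothetical regression range of three pushes, marked on a Tuesday at noon.
regression_range = ["rev-a", "rev-b", "rev-c"]
for job in plan_retriggers(regression_range, datetime(2012, 9, 4, 12, 1)):
    print(job)
```

Retriggering every push in the range is what narrows the range back down to one changeset, and deferring the jobs keeps that from competing with regular daytime load.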

Best wishes,

Ed

Taras Glek

unread,
Sep 4, 2012, 6:00:34 PM9/4/12
to
On 8/31/2012 8:46 AM, Justin Lebar wrote:
>> There are extremely non-stable Talos tests, and relatively stable ones.
>> Let's focus on the relatively stable ones.
>
> It's not exclusively a question of noise in the tests. Even
> regressions in stable tests are sometimes hard to track down. I spent
> two months trying to figure out why I could not reproduce a Dromaeo
> regression I saw on m-i using try, and eventually gave up (bug
> 653961).
>
> It's great if we try to track down this mysterious 5% startup
> regression. We shouldn't ignore important regressions. But what I
> object to is the idea that if I regress Dromaeo DOM by 2%, I'm
> automatically backed out and prevented from doing any work until I
> prove that the problem is I changed filename somewhere.

I think we can solve this by having go-to people for Talos performance
help. If you have spent a sufficient amount of time tracking down a
regression, you should be able to ask someone like Rafael* to help.
Same people should also have the final say on whether we should give up
on tracking down a regression.

Letting every individual committer decide 'this is too hard to track
down' makes it too easy to let the regression slide and is one of the
reasons we ended up in the current situation.


Taras


* I'm using Rafael as a hypothetical go-to person because he is
participating in this thread. We have other people in the project who
are well-versed in tracking down odd perf regressions.

Ehsan Akhgari

unread,
Sep 4, 2012, 8:28:48 PM9/4/12
to Justin Wood (Callek), dev-pl...@lists.mozilla.org
I have distilled the gist of what I have to say on this subject in
<https://bugzilla.mozilla.org/show_bug.cgi?id=787200#c29>. While
discussing a global strategy might be interesting, I am not really
interested in that, since I don't believe that it will actually result
in any concrete results as it is too abstract, and more importantly I
don't believe that the question at hand requires us to have a global
policy covering all of the future similar cases.

I would appreciate if you or someone else who's interested in that
discussion would start it in the appropriate forum. :-)

Thanks!
Ehsan

Mounir Lamouri

unread,
Sep 19, 2012, 10:32:57 AM9/19/12
to dev-pl...@lists.mozilla.org
On 08/31/2012 12:33 AM, Rafael Ávila de Espíndola wrote:
> I was recently hit by most of the shortcomings you mentioned while
> trying to upgrade clang. Fortunately, I found the issue on try, but I
> will admit that comparing talos on try is something I only do when I
> expect a problem.

I'm a bit late to the game here but I would like to mention that I've
never been able to use talos correctly on try.

The few times one of my patches was suspected of causing a regression, I
sent it to try, and every time the results were useless. The delta between
min and max across different runs was so huge that I could have used those
results to argue that my patches were improving performance just as easily
as that they were reducing it.
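
The situation described above can be sketched in a few lines of Python. This is only an illustration with made-up numbers and hypothetical function names, and it assumes higher numbers mean slower: when the min-to-max ranges of the baseline and patched runs overlap, the comparison supports neither conclusion.

```python
def ranges_overlap(baseline, patched):
    """True if the [min, max] intervals of the two sets of runs overlap."""
    return min(baseline) <= max(patched) and min(patched) <= max(baseline)

def verdict(baseline, patched):
    """Classify a try comparison as 'slower', 'faster' or 'inconclusive'.

    Assumes higher numbers mean slower. This is a deliberately crude
    check, not a statistical test.
    """
    if ranges_overlap(baseline, patched):
        return "inconclusive"
    return "slower" if min(patched) > max(baseline) else "faster"

# Hypothetical try results in milliseconds.
print(verdict([480, 520, 500], [490, 530, 505]))
print(verdict([500, 502], [510, 512]))
```

With ranges as wide as the first pair, more runs or a less noisy measurement are needed before the numbers say anything about the patch.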

As a result, my process has always been to pretend to use try, re-land
my patch and hope it was actually not guilty. So far, I have never had to
fix a patch that actually regressed something.

Anyway, I don't think an indicator like talos is really useful if it's
not reliable at all on try and locally. At the end of the day, you might
easily end up doing what Justin did and push a startup regression, or
simply give up on your patch (depending on the cost/benefit).

In my opinion, trying to make talos reliable on try should be the first
step if we want developers to care more about those tests.

Cheers,
--
Mounir