Turn off Firefox3.0 Talos machines?

Samuel Sidler

Feb 25, 2009, 11:22:46 AM

to dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org, John O'Duinn

(follow up to dev-planning)

Hello,

The more I've thought about it, the more it seems like we don't need
the Talos machines on the Firefox3.0 tree running any more and I'd
like to see if other people agree.

There are a couple of major reasons we have Talos machines on
Firefox3.0 in the first place:
1) To ensure no performance regressions on the 1.9.0 branch.
2) To compare performance numbers between 1.9.0 and 1.9.1 (and
mozilla-central).

Handling those in reverse order...

Since bug 463323 was fixed, turning off Talos machines doesn't mean we
lose the baseline of performance numbers for those machines. A
flatline is created for previous Talos machines that are no longer
running, which allows comparisons to 1.9.1 and m-c. This isn't a good
reason to keep Talos machines running on Firefox3.0.

As for performance regressions, we've been pretty good about only
taking security and stability fixes and (usually) only after they've
baked on 1.9.1 and/or m-c. If a performance regression is discovered
due to the landing of a security fix, we likely would *not* back out
the security fix unless the perf regression was incredibly severe (the
need for security fixes will almost always outweigh the need for flat
performance numbers). But most of those regressions are already caught
when checking into 1.9.1 and m-c. By the time they get to us, patches
are pretty perf-safe.

Unless I'm missing other reasons why we run these machines, I think
it's time we turn them off. The effort to maintain these machines
isn't worth the benefit.

Thoughts?

-Sam

Dave Townsend

Feb 25, 2009, 11:55:49 AM

Aside from it giving build back some resources that could be allocated
elsewhere, what is the need for turning them off?

Mike Beltzner

Feb 25, 2009, 12:16:45 PM

to dtow...@mozilla.com, dev-pl...@lists.mozilla.org

Are active branch drivers even looking for performance regressions from
security and stability checkins? While I agree that we would be unlikely to
refuse a fix that was deemed as required for security or stability reasons,
I believe that there is often more than one way to write a patch. It's
entirely possible that a fix may have an unintended performance impact, and
by catching and reporting that, the patch author may be able to address it
while retaining their fix.

Of course, we should be catching and baking that on the trunk, first.

cheers,
mike

Mike Shaver

Feb 25, 2009, 12:25:05 PM

to Dave Townsend, dev-pl...@lists.mozilla.org

On Wed, Feb 25, 2009 at 8:55 AM, Dave Townsend <dtow...@mozilla.com> wrote:
> Aside from it giving build back some resources that could be allocated
> elsewhere, what is the need for turning them off?

As Sam said:

"The effort to maintain these machines isn't worth the benefit."

I would much rather have the Talos team's maintenance time and the
hardware put where it can do more good.

Mike

Samuel Sidler

Feb 25, 2009, 12:47:32 PM

to Mike Beltzner, dev-pl...@lists.mozilla.org

On Feb 25, 2009, at 9:16 AM, Mike Beltzner wrote:

> Are active branch drivers even looking for performance regressions
> from security and stability checkins?

Not actively, no. We glance periodically, but tend not to do it
regularly.

> While I agree that we would be unlikely to refuse a fix that was
> deemed as required for security or stability reasons, I believe that
> there is often more than one way to write a patch. It's entirely
> possible that a fix may have an unintended performance impact, and by
> catching and reporting that, the patch author may be able to address
> it while retaining their fix.
>
> Of course, we should be catching and baking that on the trunk, first.

Your first paragraph is true here, but your last sentence tells the
real story. We should be catching these problems long before we see
them on 1.9.0 and if we don't, they're likely to stay in anyway (to
get the security fix).

-Sam

Samuel Sidler

Feb 25, 2009, 12:52:20 PM

to Mike Shaver, dev-pl...@lists.mozilla.org, Dave Townsend

On Feb 25, 2009, at 9:25 AM, Mike Shaver wrote:
> On Wed, Feb 25, 2009 at 8:55 AM, Dave Townsend
> <dtow...@mozilla.com> wrote:
>> Aside from it giving build back some resources that could be
>> allocated elsewhere, what is the need for turning them off?
>
> As Sam said:
>
> "The effort to maintain these machines isn't worth the benefit."
>
> I would much rather have the Talos team's maintenance time and the
> hardware put where it can do more good.

Adding to that, we've had several machines go orange and red for
"random" reasons (read: RE knows what they are, but they're still
fairly random). The net result of that is that people tend to ignore
orange on Firefox3.0 in general because it wasn't caused by their
checkin(s).

While it's not necessarily unreasonable to take any and all fixes
needed to make Talos more reliable on Firefox3.0 (which we might be
doing anyway), it *is* unreasonable to pull the Talos team off of
other, more important work like 1.9.1 and m-c Talos machines. I,
personally, would rather have them focused on improving Talos than
worrying about sporadic oranges on Firefox3.0.

-Sam

Blake Kaplan

Feb 25, 2009, 2:21:55 PM

Samuel Sidler <s...@mozilla.com> wrote:
> Unless I'm missing other reasons why we run these machines, I think
> it's time we turn them off. The effort to maintain these machines
> isn't worth the benefit.

I'm not necessarily against this, but I would like to point out that one
advantage that we get from running Talos against the 3.0 branch is additional
automated testing. For a while, Talos was the only tool that we had that was
able to reproduce crash bug 474537 (and on trunk, I've seen other unexplained
Tp crashes). Turning these machines off will reduce the chances of one of
these types of crashes being noticed and acted on.
--
Blake Kaplan

Karl Tomlinson

Feb 25, 2009, 3:07:04 PM

Samuel Sidler writes:

> There are a couple of major reasons we have Talos machines on
> Firefox3.0 in the first place:
> 1) To ensure no performance regressions on the 1.9.0 branch.
> 2) To compare performance numbers between 1.9.0 and 1.9.1 (and
> mozilla-central).
>
> Handling those in reverse order...
>
> Since bug 463323 was fixed, turning off Talos machines doesn't
> mean we lose the baseline of performance numbers for those
> machines. A flatline is created for previous Talos machines that
> are no longer running, which allows comparisons to 1.9.1 and
> m-c. This isn't a good reason to keep Talos machines running on
> Firefox3.0.

I don't think this argument holds.

This has been discussed before, and I thought the conclusion was
that bug 463323 does not give us apples to apples comparison
between 1.9.0 and 1.9.1.

If we rely on a flatline extension, we are likely to lose the
ability to compare 1.9.0 and 1.9.1.

The reason discussed before was that the talos infrastructure
changes so test results change.

One suggestion was to scale down 1.9.0 performance tests (number
of machines and/or frequency of test runs), so that we still have
some numbers. This seems sensible.

http://groups.google.com/group/mozilla.dev.planning/browse_frm/thread/a889d711a7956768/fa03e1322cc9953b#fa03e1322cc9953b

There was also the suggestion of shutting down the 1.9.0 perf
machines and only firing them up when we felt there was a need for
a fresh run. My two concerns about this are:

* I don't know whether tests would actually end up being started again.

* We wouldn't necessarily know when they needed to be started.

> If a performance regression is discovered due to the landing of
> a security fix, we likely would *not* back out the security fix
> unless the perf regression was incredibly severe (the need for
> security fixes will almost always outweigh the need for flat
> performance numbers).

If we rely on the flatline extension of previous data (without the
1.9.0 regression), again we would not have apples to apples comparison
of the branches.

> Unless I'm missing other reasons why we run these machines, I
> think it's time we turn them off.

Are all people being good about postponing 1.9.1 checkins until
trunk talos machines have cycled?

If not, we would be losing the control for determining whether
performance changes are due to code or testing changes.

http://groups.google.com/group/mozilla.dev.planning/tree/browse_frm/thread/a889d711a7956768/c0732c9021fcfd42?rnum=1&_done=%2Fgroup%2Fmozilla.dev.planning%2Fbrowse_frm%2Fthread%2Fa889d711a7956768%2Ffa03e1322cc9953b%3F#doc_b2dbc5f4bdd3eea8

> The effort to maintain these machines isn't worth the benefit.

(I'm not able to comment on the effort involved, I just want to
make sure that people know what we would lose.)

Nick Thomas

Feb 25, 2009, 3:38:52 PM

Samuel Sidler wrote:

> Mike Beltzner wrote:
>> Of course, we should be catching and baking that on the trunk, first.
>
> Your first paragraph is true here, but your last sentence tells the real
> story. We should be catching these problems long before we see them on
> 1.9.0 and if we don't, they're likely to stay in anyway (to get the
> security fix).

Except that a proportion of fixes are different on 1.9.0 than they are
on 1.9.1 & later, due to changes in the code base. A perf-safe patch on
1.9.1 isn't necessarily safe on 1.9.0.

-Nick

Mike Shaver

Feb 25, 2009, 3:49:48 PM

to Nick Thomas, dev-pl...@lists.mozilla.org

On Wed, Feb 25, 2009 at 12:38 PM, Nick Thomas
<nrth...@spamgmail.suxorscom> wrote:
> Except that a proportion of fixes are different on 1.9.0 than they are on
> 1.9.1 & later, due to changes in the code base. A perf-safe patch on 1.9.1
> isn't necessarily safe on 1.9.0.

Can you give some examples of patches that have had undesirable and
unexpected perf characteristics in a backport, but not on the main
branch? It would be good to know more specifically what we think
we're protecting against, since in the abstract we can create doomsday
scenarios all day.

Related, since we have this data due to Talos coverage I believe:
what's the largest range of "real" performance change we've seen along
a maintenance release stream? "real" excluding machine noise or
testing changes. My suspicion is that we historically don't see a lot
of variance there, but numbers would know better than I do.

Mike

Nick Thomas

Feb 25, 2009, 4:39:39 PM

I don't think I was creating a doomsday scenario at all, just pointing
out a gap in the assertion that "all patches are perf tested before
hitting 1.9.0". Sam tells me that the proportion of patches that differ
between 1.9.1 and 1.9.0 is very much smaller than for 1.9.0 and 1.8.1,
so I'd have to conclude that it's a small gap and not a big source of risk.

Jonas Sicking

Feb 26, 2009, 5:59:45 PM

to Mike Shaver, Nick Thomas, dev-pl...@lists.mozilla.org

While I don't know of a single instance where a backport has caused a
perf regression that didn't happen on trunk, it still scares me to not
do any perf monitoring on changing code.

It would make a lot of sense to me to drastically reduce the number of
talos boxes on branches though. First of all the number of patches
landing is much lower which means that it's much easier to catch a
regression. Easier both in the sense that it's easier to see which patch
caused the regression, and easier since we get many more measuring
cycles per patch so we can spot trends even through noise.

So I'm all for reducing the number of talos boxes, but I do think
keeping one per platform would be a good idea.

/ Jonas
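
The point about extra measuring cycles can be made concrete with a minimal sketch. It is not part of Talos; the function name `mean_shift` and the sample timings are made up for illustration. It compares the average of the runs before a patch with the average after it, and reports a rough standard error that shrinks as more runs per patch are collected:

```python
import statistics


def mean_shift(before, after):
    """Return (difference of means, rough standard error of that difference).

    `before` and `after` are lists of timings (say, page-load ms) from the
    cycles before and after a single patch landed. Each needs >= 2 samples.
    """
    diff = statistics.mean(after) - statistics.mean(before)
    # Standard error of a difference of two independent sample means.
    se = (
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    ) ** 0.5
    return diff, se


# Hypothetical numbers: the same noisy signal sampled with 4 and 8 cycles.
few_before, few_after = [100, 102, 98, 100], [105, 107, 103, 105]
diff_few, se_few = mean_shift(few_before, few_after)
diff_many, se_many = mean_shift(few_before * 2, few_after * 2)

print(diff_few, se_few)
print(diff_many, se_many)
```

With the same underlying shift, the run with more cycles per patch reports a smaller standard error, which is why a slow-moving branch can get by with fewer machines and still see a trend.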

Ben Hearsum

Feb 26, 2009, 6:10:55 PM

On 2/26/09 5:59 PM, Jonas Sicking wrote:
> It would make a lot of sense to me to drastically reduce the number of
> talos boxes on branches though. First of all the number of patches
> landing is much lower which means that it's much easier to catch a
> regression. Easier both in the sense that it's easier to see which patch
> caused the regression, and easier since we get many more measuring
> cycles per patch so we can spot trends even through noise.
>
> So I'm all for reducing the number of talos boxes, but I do think
> keeping one per platform would be a good idea.
>

I think we'd be worse off if we did this. The whole idea of having 3
talos boxes per platform is to have tie breakers. If we only had 1
machine per platform and say, the Mac Leopard machine went orange we'd
have a much more difficult time knowing if it was real or not.

It also helps in cases where one machine from a set disappears - we can
still keep the tree open and not have to rush down to the colo to fix it.
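
The tie-breaker idea can be sketched as a simple majority vote. This is an illustration only, not how Tinderbox or Talos actually decide anything; the function name `platform_verdict` and the status strings are assumptions:

```python
from collections import Counter


def platform_verdict(statuses):
    """Return the status reported by a strict majority of a platform's
    machines, or "unknown" when no status has a strict majority."""
    if not statuses:
        return "unknown"
    status, count = Counter(statuses).most_common(1)[0]
    return status if count * 2 > len(statuses) else "unknown"


# Three machines: one stray orange is outvoted by two greens.
print(platform_verdict(["green", "orange", "green"]))
# One machine: its lone result is the verdict, real or not.
print(platform_verdict(["orange"]))
# Two machines that disagree cannot break the tie.
print(platform_verdict(["green", "orange"]))
```

With a single machine per platform the lone result always wins the vote, so there is nothing to distinguish a real regression from a flaky box.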

Jonas Sicking

Feb 26, 2009, 9:51:21 PM

Worse than what? Not having the boxes at all wouldn't mean fewer oranges,
it'd just mean that we don't know about them which hardly seems better.
And not having boxes at all seems no different from having one box which
temporarily is out.

I do agree that we'd be worse off than having the number of boxes we
currently do though, but the idea here was to sacrifice the branch a
little bit to free up resources for newer branches and moz-central, no?

I think random oranges are less cause for panic on a branch as stable as
the 1.9.0 branch. If something goes orange and there is a reason to
suspect it was random, just wait a cycle or two. It's generally not
going to hold up a significant amount of work.

/ Jonas

Robert Kaiser

Feb 27, 2009, 5:10:58 AM

Samuel Sidler wrote:
> As for performance regressions, we've been pretty good about only taking
> security and stability fixes and (usually) only after they've baked on
> 1.9.1 and/or m-c.

So you suggest that something having no perf impact on 1.9.1 because
TraceMonkey optimizes it away or something like that means that we just
don't care if it slows down 1.9.0 because it does have a different
feature set and reacts differently?

I remember we had the same discussion for 1.8.1 and IIRC we went away
with needing to keep a minimal set of Talos machines alive as long as
the branch is actively maintained.

Robert Kaiser

Daniel Veditz

Feb 28, 2009, 1:06:06 AM

to Robert Kaiser

Robert Kaiser wrote:
> Samuel Sidler wrote:
>> As for performance regressions, we've been pretty good about only taking
>> security and stability fixes and (usually) only after they've baked on
>> 1.9.1 and/or m-c.
>
> So you suggest that something having no perf impact on 1.9.1 because
> TraceMonkey optimizes it away or something like that means that we just
> don't care if it slows down 1.9.0 because it does have a different
> feature set and reacts differently?

TraceMonkey only optimizes JavaScript and our security fixes very rarely
touch JavaScript code. In all of 3.0.7 we touched 2 lines of JS code; in
3.0.6 bug 449027 changed nsBlocklistService.js and another bug touched
three lines in nsSessionStore.js

> I remember we had the same discussion for 1.8.1 and IIRC we went away
> with needing to keep a minimal set of Talos machines alive as long as
> the branch is actively maintained.

The difference between the 1.8 and 1.9 branches often required
significant back-porting. That's mostly not the case yet between 1.9.0
and 1.9.1. That said, the "1.8 drivers" weren't watching Talos on that
branch. If you were a 1.8 user you got the security fix, perf be damned.
Users who cared about perf had the option of upgrading to the
performance-focused Firefox 3 release.

I guess not SeaMonkey users, and I'm sorry if we caused any perf hits
(though I don't think we did). But there's nothing that slows your
machine down more than a spambot infection so it was still the right
tradeoff.

Daniel Veditz

Feb 28, 2009, 1:12:31 AM

to Jonas Sicking

Jonas Sicking wrote:
> While I don't know of a single instance where a backport has caused a
> perf regression that didn't happen on trunk, it still scares me to not
> do any perf monitoring on changing code.

Once 3.1 ships I don't care so much, but until then 3.0.x is our only
production release and it would be irresponsible not to keep half an eye
on it.

> So I'm all for reducing the number of talos boxes, but I do think
> keeping one per platform would be a good idea.

Me too.

Robert Kaiser

Feb 28, 2009, 5:50:23 AM

Daniel Veditz wrote:
> Robert Kaiser wrote:
>> Samuel Sidler wrote:
>>> As for performance regressions, we've been pretty good about only taking
>>> security and stability fixes and (usually) only after they've baked on
>>> 1.9.1 and/or m-c.
>> So you suggest that something having no perf impact on 1.9.1 because
>> TraceMonkey optimizes it away or something like that means that we just
>> don't care if it slows down 1.9.0 because it does have a different
>> feature set and reacts differently?
>
> TraceMonkey only optimizes JavaScript and our security fixes very rarely
> touch JavaScript code. In all of 3.0.7 we touched 2 lines of JS code; in
> 3.0.6 bug 449027 changed nsBlocklistService.js and another bug touched
> three lines in nsSessionStore.js

TM was just meant as one example of the significant changes that actually
have been made between 1.9.0 and 1.9.1 - the .1 is not really as small a
difference as it originally was planned.

> I guess not SeaMonkey users, and I'm sorry if we caused any perf hits
> (though I don't think we did).

That wasn't the case, and the 1.9.0 story doesn't even touch SeaMonkey,
but I remember we had that discussion back with 1.8.1 and AFAIK still
decided to leave some Talos coverage alive so a few interested people
could at least monitor it from time to time.

Robert Kaiser

John O'Duinn

Mar 4, 2009, 9:32:28 PM

to Mike Beltzner, dev-pl...@lists.mozilla.org, dtow...@mozilla.com

hi;

There are a few reasons to want to remove these FF3.0 talos machines:
- only being used infrequently, by very few people
- could be better used in mozilla-central, by many people
- would allow us to simplify our RelEng code; there's a bunch of "if
mozilla18 then... if mozilla190 then... if mozilla191/3.next then...".
This would also simplify our testing work whenever we roll out new talos
changes.

If we're going to leave any talos machines on FF3.0, we're already at
the minimum set of machines. We need three machines per o.s. to give us
tiebreakers in the variable results. Removing *some* machines would only
make it harder to figure out if something was wrong with the results,
while also still leaving us the machine+code support headaches anyway -
the worst of both worlds!!

Given that a few people, however infrequently, are using these machines
to make sure we don't hurt our FF3.0 users, I suggest we continue to
leave these talos machines all on for now - after all they reduce the
risk of shipping a huge performance regression in a FF3.0.x release that
wasn't noticed in FF3.1. Given the infrequency of checking these
machines, it's not clear to me just how much they reduce the risk, so
maybe we need to monitor these more closely going forward?

As soon as we release FF3.1, I suspect the need for these FF3.0
Talos machines will diminish, and I think that's an ideal time to revisit
this discussion. When we do, I will happily come carrying my trusty axe.
:-)

Does that seem reasonable?

tc
John.
=====

Mike Beltzner wrote:
> Are active branch drivers even looking for performance regressions from
> security and stability checkins? While I agree that we would be unlikely to
> refuse a fix that was deemed as required for security or stability reasons,
> I believe that there is often more than one way to write a patch. It's
> entirely possible that a fix may have an unintended performance impact, and
> by catching and reporting that, the patch author may be able to address it
> while retaining their fix.
>
> Of course, we should be catching and baking that on the trunk, first.
>
> cheers,
> mike
>
> ----- Original Message -----
> From: dev-planning-bounces+beltzner=mozil...@lists.mozilla.org
> <dev-planning-bounces+beltzner=mozil...@lists.mozilla.org>
> To: dev-pl...@lists.mozilla.org <dev-pl...@lists.mozilla.org>
> Sent: Wed Feb 25 08:55:49 2009
> Subject: Re: Turn off Firefox3.0 Talos machines?
>
> Aside from it giving build back some resources that could be allocated
> elsewhere, what is the need for turning them off?
>
> On 25/2/09 08:22, Samuel Sidler wrote:
>> (follow up to dev-planning)
>>
>> Hello,
>>
>> The more I've thought about it, the more it seems like we don't need the
>> Talos machines on the Firefox3.0 tree running any more and I'd like to
>> see if other people agree.
>>
>> There are a couple of major reasons we have Talos machines on Firefox3.0
>> in the first place:
>> 1) To ensure no performance regressions on the 1.9.0 branch.
>> 2) To compare performance numbers between 1.9.0 and 1.9.1 (and
>> mozilla-central).
>>
>> Handling those in reverse order...
>>
>> Since bug 463323 was fixed, turning off Talos machines doesn't mean we
>> lose the baseline of performance numbers for those machines. A flatline
>> is created for previous Talos machines that are no longer running, which
>> allows comparisons to 1.9.1 and m-c. This isn't a good reason to keep
>> Talos machines running on Firefox3.0.
>>
>> As for performance regressions, we've been pretty good about only taking
>> security and stability fixes and (usually) only after they've baked on
>> 1.9.1 and/or m-c. If a performance regression is discovered due to the
>> landing of a security fix, we likely would *not* back out the security
>> fix unless the perf regression was incredibly severe (the need for
>> security fixes will almost always outweigh the need for flat performance
>> numbers). But most of those regressions are already caught when checking
>> into 1.9.1 and m-c. By the time they get to us, patches are pretty
>> perf-safe.
>>
>> Unless I'm missing other reasons why we run these machines, I think it's
>> time we turn them off. The effort to maintain these machines isn't worth
>> the benefit.
>>

Johnathan Nightingale

Mar 5, 2009, 11:50:52 AM

to dev. planning

On 4-Mar-09, at 9:32 PM, John O'Duinn wrote:

> Given the infrequency of checking these
> machines, its not clear to me just how much they reduce the risk, so
> maybe we need to monitor these more closely going forward?

Sam - would it change how often branch drivers checked performance if
the 3.0 boxes were added to http://people.mozilla.org/~johnath/pdb2/ ?
That is pretty straightforward to do if you think it would help spot
any regressions, but obviously there's no point if no one's going to
use it.

Cheers,

Johnathan

---
Johnathan Nightingale
Human Shield
joh...@mozilla.com