Hello,
The more I've thought about it, the more it seems like we don't need
the Talos machines on the Firefox3.0 tree running any more and I'd
like to see if other people agree.
There are a couple of major reasons we have Talos machines on
Firefox3.0 in the first place:
1) To ensure no performance regressions on the 1.9.0 branch.
2) To compare performance numbers between 1.9.0 and 1.9.1 (and
mozilla-central).
Handling those in reverse order...
Since bug 463323 was fixed, turning off Talos machines doesn't mean we
lose the baseline of performance numbers for those machines. A
flatline is created for previous Talos machines that are no longer
running, which allows comparisons to 1.9.1 and m-c. This isn't a good
reason to keep Talos machines running on Firefox3.0.
As for performance regressions, we've been pretty good about only
taking security and stability fixes and (usually) only after they've
baked on 1.9.1 and/or m-c. If a performance regression is discovered
due to the landing of a security fix, we likely would *not* back out
the security fix unless the perf regression was incredibly severe (the
need for security fixes will almost always outweigh the need for flat
performance numbers). But most of those regressions are already caught
when checking into 1.9.1 and m-c. By the time they get to us, patches
are pretty perf-safe.
Unless I'm missing other reasons why we run these machines, I think
it's time we turn them off. The effort to maintain these machines
isn't worth the benefit.
Thoughts?
-Sam
Of course, we should be catching and baking that on the trunk, first.
cheers,
mike
As Sam said:
"The effort to maintain these machines isn't worth the benefit."
I would much rather have the Talos team's maintenance time and the
hardware put where it can do more good.
Mike
> Are active branch drivers even looking for performance regressions
> from security and stability checkins?
Not actively, no. We glance periodically, but tend not to do it
regularly.
> While I agree that we would be unlikely to refuse a fix that was
> deemed as required for security or stability reasons, I believe that
> there is often more than one way to write a patch. It's entirely
> possible that a fix may have an unintended performance impact, and by
> catching and reporting that, the patch author may be able to address
> it while retaining their fix.
>
> Of course, we should be catching and baking that on the trunk, first.
Your first paragraph is true here, but your last sentence tells the
real story. We should be catching these problems long before we see
them on 1.9.0 and if we don't, they're likely to stay in anyway (to
get the security fix).
-Sam
Adding to that, we've had several machines go orange and red for
"random" reasons (read: RE knows what they are, but they're still
fairly random). The net result of that is that people tend to ignore
orange on Firefox3.0 in general because it wasn't caused by their
checkin(s).
While it's not necessarily unreasonable to take any and all fixes
needed to make Talos more reliable on Firefox3.0 (which we might be
doing anyway), it *is* unreasonable to pull the Talos team off of
other, more important work like 1.9.1 and m-c Talos machines. I,
personally, would rather have them focused on improving Talos than
worrying about sporadic oranges on Firefox3.0.
-Sam
I'm not necessarily against this, but I would like to point out that one
advantage that we get from running Talos against the 3.0 branch is additional
automated testing. For a while, Talos was the only tool that we had that was
able to reproduce crash bug 474537 (and on trunk, I've seen other unexplained
Tp crashes). Turning these machines off will reduce the chances of one of
these types of crashes being noticed and acted on.
--
Blake Kaplan
> There are a couple of major reasons we have Talos machines on
> Firefox3.0 in the first place:
> 1) To ensure no performance regressions on the 1.9.0 branch.
> 2) To compare performance numbers between 1.9.0 and 1.9.1 (and
> mozilla-central).
>
> Handling those in reverse order...
>
> Since bug 463323 was fixed, turning off Talos machines doesn't
> mean we lose the baseline of performance numbers for those
> machines. A flatline is created for previous Talos machines that
> are no longer running, which allows comparisons to 1.9.1 and
> m-c. This isn't a good reason to keep Talos machines running on
> Firefox3.0.
I don't think this argument holds.
This has been discussed before, and I thought the conclusion was
that bug 463323 does not give us apples to apples comparison
between 1.9.0 and 1.9.1.
If we rely on a flatline extension, we are likely to lose the
ability to compare 1.9.0 and 1.9.1.
The reason discussed before was that the Talos infrastructure itself
changes over time, so test results change even when the code does not.
One suggestion was to scale down 1.9.0 performance tests (number
of machines and/or frequency of test runs), so that we still have
some numbers. This seems sensible.
There was also the suggestion of shutting down the 1.9.0 perf
machines and only firing them up when we felt there was a need for
a fresh run. My two concerns about this are:
* I don't know whether tests would actually end up being started again.
* We wouldn't necessarily know when they needed to be started.
> If a performance regression is discovered due to the landing of
> a security fix, we likely would *not* back out the security fix
> unless the perf regression was incredibly severe (the need for
> security fixes will almost always outweigh the need for flat
> performance numbers).
If we rely on the flatline extension of previous data (without the
1.9.0 regression), again we would not have apples to apples comparison
of the branches.
> Unless I'm missing other reasons why we run these machines, I
> think it's time we turn them off.
Are all people being good about postponing 1.9.1 checkins until
trunk talos machines have cycled?
If not, we would be losing the control for determining whether
performance changes are due to code or testing changes.
> The effort to maintain these machines isn't worth the benefit.
(I'm not able to comment on the effort involved, I just want to
make sure that people know what we would lose.)
Except that a proportion of fixes are different on 1.9.0 than they are
on 1.9.1 & later, due to changes in the code base. A perf-safe patch on
1.9.1 isn't necessarily safe on 1.9.0.
-Nick
Can you give some examples of patches that have had undesirable and
unexpected perf characteristics in a backport, but not on the main
branch? It would be good to know more specifically what we think
we're protecting against, since in the abstract we can create doomsday
scenarios all day.
Related, since we have this data due to Talos coverage I believe:
what's the largest range of "real" performance change we've seen along
a maintenance release stream? "real" excluding machine noise or
testing changes. My suspicion is that we historically don't see a lot
of variance there, but numbers would know better than I do.
Mike
While I don't know of a single instance where a backport has caused a
perf regression that didn't happen on trunk, it still scares me to not
do any perf monitoring on changing code.
It would make a lot of sense to me to drastically reduce the number of
talos boxes on branches though. First of all the number of patches
landing is much lower which means that it's much easier to catch a
regression. Easier both in the sense that it's easier to see which patch
caused the regression, and easier since we get many more measuring
cycles per patch so we can spot trends even through noise.
So I'm all for reducing the number of talos boxes, but I do think
keeping one per platform would be a good idea.
/ Jonas
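The point above about getting more measuring cycles per patch can be sketched roughly as follows. This is only an illustration, not how Talos or the graph server actually flags regressions: the function name, the threshold of three standard errors, and the sample timings are all made up for the example.

```python
import math
import statistics


def looks_like_regression(before, after, threshold=3.0):
    """Hypothetical check: is the mean of `after` higher than the mean of
    `before` by more than `threshold` standard errors?

    `before` and `after` are lists of timings (higher is slower) from runs
    before and after a patch landed. Each list needs at least two values.
    """
    diff = statistics.mean(after) - statistics.mean(before)
    # Standard error of the difference between the two means; it shrinks
    # as the number of runs on either side grows.
    std_err = math.sqrt(
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    )
    return diff > threshold * std_err


# Same 4-unit slowdown, same noise level, different number of runs.
few_runs = looks_like_regression([100, 102], [104, 106])
many_runs = looks_like_regression(
    [100, 102, 98, 101, 99], [104, 106, 102, 105, 103]
)
print(few_runs, many_runs)
```

With only two runs on each side the slowdown stays within the noise, while five runs on each side are enough to flag it, which is the sense in which a quiet branch makes trends easier to spot.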
I think we'd be worse off if we did this. The whole idea of having 3
talos boxes per platform is to have tie breakers. If we only had 1
machine per platform and say, the Mac Leopard machine went orange we'd
have a much more difficult time knowing if it was real or not.
It also helps in cases where one machine from a set disappears - we can
still keep the tree open and not have to rush down to the colo to fix it.
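To make the tie-breaker argument concrete, here is a minimal sketch of a majority vote across the machines for one platform. It is an assumption-laden illustration: the status strings and the `platform_verdict` helper are hypothetical and do not reflect how Tinderbox or Talos actually report results.

```python
from collections import Counter


def platform_verdict(statuses):
    """Hypothetical majority vote over one platform's machine statuses.

    Returns (status, corroborated). `status` is the most common status,
    or None if there is a tie or no data. `corroborated` is True only
    when at least two machines agree on that status.
    """
    if not statuses:
        return (None, False)
    counts = Counter(statuses).most_common()
    top_status, top_votes = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_votes
    if tied:
        return (None, False)
    return (top_status, top_votes >= 2)


# Three machines: one orange is outvoted by two greens.
print(platform_verdict(["green", "orange", "green"]))
# One machine: an orange cannot be confirmed or dismissed.
print(platform_verdict(["orange"]))
```

With three machines a single odd result is outvoted, and losing one machine still leaves two that can agree. With one machine, every orange is uncorroborated.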
Worse than what? Not having the boxes at all wouldn't mean fewer oranges,
it'd just mean that we don't know about them which hardly seems better.
And not having boxes at all seems no different from having one box which
temporarily is out.
I do agree that we'd be worse off than having the number of boxes we
currently do though, but the idea here was to sacrifice the branch a
little bit to free up resources for newer branches and moz-central, no?
I think random oranges are less cause for panic on a branch as stable as
the 1.9.0 branch. If something goes orange and there is a reason to
suspect it was random, just wait a cycle or two. It's generally not
going to hold up a significant amount of work.
/ Jonas
So you suggest that something having no perf impact on 1.9.1 because
TraceMonkey optimizes it away or something like that means that we just
don't care if it slows down 1.9.0 because it does have a different
feature set and reacts differently?
I remember we had the same discussion for 1.8.1 and IIRC we went away
with needing to keep a minimal set of Talos machines alive as long as
the branch is actively maintained.
Robert Kaiser
TraceMonkey only optimizes JavaScript and our security fixes very rarely
touch JavaScript code. In all of 3.0.7 we touched 2 lines of JS code; in
3.0.6 bug 449027 changed nsBlocklistService.js and another bug touched
three lines in nsSessionStore.js
> I remember we had the same discussion for 1.8.1 and IIRC we went away
> with needing to keep a minimal set of Talos machines alive as long as
> the branch is actively maintained.
The difference between the 1.8 and 1.9 branches often required
significant back-porting. That's mostly not the case yet between 1.9.0
and 1.9.1. That said, the "1.8 drivers" weren't watching Talos on that
branch. If you were a 1.8 user you got the security fix, perf be damned.
Users who cared about perf had the option of upgrading to the
performance-focused Firefox 3 release.
I guess not SeaMonkey users, and I'm sorry if we caused any perf hits
(though I don't think we did). But there's nothing that slows your
machine down more than a spambot infection so it was still the right
tradeoff.
Once 3.1 ships I don't care so much, but until then 3.0.x is our only
production release and it would be irresponsible not to keep half an eye
on it.
> So I'm all for reducing the number of talos boxes, but I do think
> keeping one per platform would be a good idea.
Me too.
TM was just meant as one example of the significant changes that actually
have been made between 1.9.0 and 1.9.1 - the .1 is not really as small a
difference as it originally was planned.
> I guess not SeaMonkey users, and I'm sorry if we caused any perf hits
> (though I don't think we did).
That wasn't the case, and the 1.9.0 story doesn't even touch SeaMonkey,
but I remember we had that discussion back with 1.8.1 and AFAIK still
decided to leave some Talos coverage alive so a few interested people
could at least monitor it from time to time.
Robert Kaiser
There are a few reasons to want to remove these FF3.0 talos machines:
- only being used infrequently, by very few people
- could be better used in mozilla-central, by many people
- would allow us to simplify our RelEng code; there's a bunch of "if
mozilla18 then... if mozilla190 then... if mozilla191/3.next then...".
This would also simplify our testing work whenever we roll out new talos
changes.
If we're going to leave any talos machines on FF3.0, we're already at
the minimum set of machines. We need three machines per o.s. to give us
tiebreakers in the variable results. Removing *some* machines would only
make it harder to figure out if something was wrong with the results,
while also still leaving us the machine+code support headaches anyway -
the worst of both worlds!!
Given that a few people, however infrequently, are using these machines
to make sure we don't hurt our FF3.0 users, I suggest we continue to
leave these talos machines all on for now - after all they reduce the
risk of shipping a huge performance regression in a FF3.0.x release that
wasn't noticed in FF3.1. Given the infrequency of checking these
machines, it's not clear to me just how much they reduce the risk, so
maybe we need to monitor these more closely going forward?
As soon as we release FF3.1, I suspect the need for these FF3.0
Talos machines will reduce, and I think that's an ideal time to revisit
this discussion. When we do, I will happily come carrying my trusty axe.
:-)
Does that seem reasonable?
tc
John.
=====
Mike Beltzner wrote:
> Are active branch drivers even looking for performance regressions from
> security and stability checkins? While I agree that we would be unlikely to
> refuse a fix that was deemed as required for security or stability reasons,
> I believe that there is often more than one way to write a patch. It's
> entirely possible that a fix may have an unintended performance impact, and
> by catching and reporting that, the patch author may be able to address it
> while retaining their fix.
>
> Of course, we should be catching and baking that on the trunk, first.
>
> cheers,
> mike
>
> ----- Original Message -----
> From: dev-planning-bounces+beltzner=mozil...@lists.mozilla.org
> <dev-planning-bounces+beltzner=mozil...@lists.mozilla.org>
> To: dev-pl...@lists.mozilla.org <dev-pl...@lists.mozilla.org>
> Sent: Wed Feb 25 08:55:49 2009
> Subject: Re: Turn off Firefox3.0 Talos machines?
>
> Aside from it giving build back some resources that could be allocated
> elsewhere, what is the need for turning them off?
>
> On 25/2/09 08:22, Samuel Sidler wrote:
>> (follow up to dev-planning)
> Given the infrequency of checking these
> machines, it's not clear to me just how much they reduce the risk, so
> maybe we need to monitor these more closely going forward?
Sam - would it change how often branch drivers checked performance if
the 3.0 boxes were added to http://people.mozilla.org/~johnath/pdb2/ ?
That is pretty straightforward to do if you think it would help spot
any regressions, but obviously there's no point if no one's going to
use it.
Cheers,
Johnathan
---
Johnathan Nightingale
Human Shield
joh...@mozilla.com