As mentioned in an earlier post:
We are ready to migrate the last set of tinderboxes to VMs. Since these
tinderboxes run performance tests we'd like to close the 1.8 and trunk
trees while they are down. Plan is to migrate them from 10am->10pm PDT
on this Wednesday, May 17. We'll obviously re-open as fast as we can,
but the migrations do take about that long.
There is never a good time to do this with all the releases going on -
but this week seems as good as any since many folks are at XTech or
otherwise out.
Any major objections?
Schrep
Do we plan to then tune the VMs to give performance numbers comparable to the
numbers those machines give now? Is this even a worthwhile exercise? e.g. are
the migrated machines going to be running the same software versions, etc?
If we do decide to do this, it's just worth keeping in mind that this would take
some time; possibly more than one day.
Are there any ready-to-go (reviewed and such) patches for 1.8a3 that are
not landed yet? Might be a good idea to get those landed tomorrow just
in case...
-Boris
Even if they are not stable and not comparable - this is a worthwhile
exercise as we'll then have an image to rebuild the boxes in case of
catastrophic failure.
Schrep
I hope not (in the latter case); we should be testing a configuration
(compiler, toolkit, etc.) that matches what we ship and support.
I'm not entirely sold on the practical value of trying to calibrate
the timing numbers. All those numbers should only ever go down, so
unless we have outstanding T? regressions that we need to repair, I
think we can probably just switch over. If we have outstanding T?
regressions, we should defer the migration until they're resolved.
Mike
Sounds good! Are we planning to at least limit the VMs so that they're not "too
fast" (a perennial problem we have with modern hardware and
millisecond-resolution-at-best measurements)?
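(A toy illustration of the resolution problem, in Python, with made-up
timings: a clock that only ticks in milliseconds can't distinguish a 10%
regression in a sub-millisecond operation, but has no trouble when the same
operation takes tens of milliseconds.)

    def measured(true_ms, resolution_ms=1.0):
        # What a coarse clock would report for a given true duration.
        return round(true_ms / resolution_ms) * resolution_ms

    # Baseline vs. a 10% regression, on a "too fast" box and a slow one.
    for true_ms in (0.4, 0.44, 40.0, 44.0):
        print(true_ms, "->", measured(true_ms))
    # 0.4 and 0.44 both read as 0.0 ms; 40.0 vs. 44.0 stay distinguishable.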
-Boris
It's worth it if we care about historical trends (like "how did Tp change over
the last year?").
> All those numbers should only ever go down
Not necessarily. For example, see the checkin that made the memory cache
smaller recently.
-Boris
True, but we should maybe do this on completely new VM images that are
built upon the reference platforms currently being discussed elsewhere.
We should still test other not-so-common configurations or environments
as well (as long as they are supported at all) so that we know if/when
they break. The recent gcc 2.95 breakage was one such issue - as long as
we support that compiler, it's good to have a tinderbox running with it
(and from what I know, some current systems still depend on that old
compiler suite).
> I'm not entirely sold on the practical value of trying to calibrate
> the timing numbers. All those numbers should only ever go down, so
> unless we have outstanding T? regressions that we need to repair, I
> think we can probably just switch over. If we have outstanding T?
> regressions, we should defer the migration until they're resolved.
Sometimes data like we have on e.g.
http://build-graphs.mozilla.org/graph/query.cgi?tbox=btek&testname=pageload&autoscale=1&size=&units=&ltype=&points=&showpoint=2006%3A05%3A16%3A10%3A41%3A04%2C935&avg=1&days=300
is useful to see if we're going the right way in the long term.
Also, you might notice in this graph that over the last few months we
had a slow but steady increase in those numbers (not by a single
trackable regression though) that was about as large as the Tp win we
had from the thread manager changes recently - and currently we aren't
really better than a year ago.
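(A crude way to make that kind of slow drift visible is to compare
long-window averages instead of checkin-to-checkin deltas; a small Python
sketch, with invented Tp numbers standing in for what the graph server
would give us:)

    # Invented Tp series: roughly 0.2 ms of creep per day over 300 days.
    tp_samples = [850 + 0.2 * i for i in range(300)]

    def window_mean(samples, start, days):
        chunk = samples[start:start + days]
        return sum(chunk) / len(chunk)

    first_month = window_mean(tp_samples, 0, 30)
    last_month = window_mean(tp_samples, len(tp_samples) - 30, 30)
    drift = last_month - first_month
    print("drift over the range: %.1f ms (%.1f%%)"
          % (drift, 100 * drift / first_month))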
Such data can be quite interesting, as they open up a different
perspective on certain wins and/or losses we might see in short-term
numbers.
Robert Kaiser
I hope the build people will not take the route they took with comet, where
the graphs show a sudden bump because of the change and the old machine
disappears from one moment to the next, but instead create the new machine
with a slightly different name (like they did for creature-vm) and maybe
even leave the old one running in parallel for a short period (some hours
or so)...
Robert Kaiser
We will take the create-vm approach for these machines - namely
prometheus, pacifica, gaius, and btek will be offline while we do the
migration. Once the migration is complete the old tinderboxes will be
turned back on unscathed. The new VMs: prometheus-vm, pacifica-vm,
gaius-vm, and btek-vm will then show up in the trees as new tinderboxes.
Cheers,
Schrep
I care about "how fast were we on 1.0?" and "how fast were we on
1.5?". What decision would we make differently with richer historical
data than that? And why would we want to make it on the basis of "how
fast were we on a configuration we don't care about"?
Better question: what decisions have we made in the last 2 years on
the basis of richer historical data than "versus previous release"?
If we want to find a specific regression in a deeply historical sense,
we should isolate that as we would non-performance bugs, IMO, rather
than impede this transition "just in case".
> > All those numbers should only ever go down
>
> Not necessarily. For example, see the checkin that made the memory cache
> smaller recently.
OK, sure, but we can analyze those effects locally as well as
globally, I think. If we are willing to take that performance hit for
it, we're willing, and I think it's only the comparison against
previous general-user releases that we would use in terms of
historical data.
Mike
Sure, when they're ready. Moving these to VMs today solves a problem
that we have today related to tbox management, and moving to another
standard VM image when they're ready isn't impaired by this near-term
plan. So why is this a "but"?
> We should still test other not-so-common configurations or environments
> as well (as long as they are supported at all) so that we know if/when
> they break.
"They break"? If 2.95 has a performance regression and the compiler
we use to ship builds to millions of users doesn't, I'm totally
uninterested in us stopping everyone to peer at the results.
> The recent gcc 2.95 breakage was one such issue - as long as
> we support that compiler, it's good to have a tinderbox running with it
> (and from what I know, some current systems still depend on that old
> compiler suite).
I think we should drop gcc 2.95 support, actually. From what I know
only _one_ tier-infinity system (BeOS) requires gcc 2.95, and it's not
worth holding other things back.
But "can produce working builds on compiler X" is different from
"monitor performance on compiler X to gate and evaluate changes to our
software", and we're talking about the latter. A gcc 2.95 tinderbox
is welcome on some ports-of-ports page alongside VC6 and gtk1, I
guess, but if someone says "I can't work on X, I have to fix the 2.95
breakage first" I think we're all losing.
> > I'm not entirely sold on the practical value of trying to calibrate
> > the timing numbers. All those numbers should only ever go down, so
> > unless we have outstanding T? regressions that we need to repair, I
> > think we can probably just switch over. If we have outstanding T?
> > regressions, we should defer the migration until they're resolved.
>
> Sometimes data like we have on e.g.
> http://build-graphs.mozilla.org/graph/query.cgi?tbox=btek&testname=pageload&autoscale=1&size=&units=&ltype=&points=&showpoint=2006%3A05%3A16%3A10%3A41%3A04%2C935&avg=1&days=300
> is useful to see if we're going the right way in the long term.
Can you give me an example of a decision that was made based on such
long term data? I can't think of one, but you probably follow
different bugs than I do. "Useful" is misleading there, if it's only
useful for something that doesn't affect the world.
> Also, you might notice in this graph that over the last few months we
> had a slow but steady increase in those numbers (not by a single
> trackable regression though) that was about as large as the Tp win we
> had from the thread manager changes recently - and currently we aren't
> really better than a year ago.
We aren't really better than a year ago on _that_ system, with its
archaic compiler and kernel, and wholly unrepresentative I/O costs.
When changing the length of a directory name is indistinguishable from
adding nsSolveHaltingProblem.js to the startup sequence, I have a
really hard time getting worked up about a slow slide.
(But I *would* be interested in running 1.0.latest and 1.5.0.latest
and even EOMB against our test suite all the time, so that we can see
deltas without having to freeze our content set in carbonite.
Comparisons against previous products, based on capabilities that are
ever-closer to the activities of real users, are data that we can use
to estimate user reaction, and I think provide enough historical
context for evaluating our progress.)
> Such data can be quite interesting, as they open up a different
> perspective on certain wins and/or losses we might see in short-term
> numbers.
If we want that data, someone can generate it by pulling by date over
a large historical range, even using some incremental-refinement
technique until they're satisfied with the resolution. But if nobody
has an important question that we need those data to answer, then I
don't think we should spend energy generating and preserving that data
now that we could instead spend on more valuable pursuits.
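(To sketch that refinement idea in Python: sample Tp at coarse date
intervals and only subdivide the intervals where the number moved by more
than some threshold. tp_for_date() is a made-up stand-in for "pull by date,
build, run the pageload test", and the regression date is invented.)

    from datetime import date

    def tp_for_date(day):
        # Stand-in: pretend a 60 ms regression landed on 2005-11-10.
        return 900 if day >= date(2005, 11, 10) else 840

    def refine(start, end, threshold, results):
        a = results.setdefault(start, tp_for_date(start))
        b = results.setdefault(end, tp_for_date(end))
        if abs(b - a) < threshold or (end - start).days <= 1:
            return
        mid = start + (end - start) // 2
        refine(start, mid, threshold, results)
        refine(mid, end, threshold, results)

    results = {}
    refine(date(2005, 5, 1), date(2006, 5, 1), threshold=20, results=results)
    for day in sorted(results):
        print(day, results[day])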
Mike
There's over a year of trunk development in between; it might be nice to have
slightly better granularity than that. But I'm not sure; see below.
> What decision would we make differently with richer historical
> data than that? And why would we want to make it on the basis of "how
> fast were we on a configuration we don't care about"?
The answer to the latter question is, "we wouldn't", frankly.
> Better question: what decisions have we made in the last 2 years on
> the basis of richer historical data than "versus previous release"?
I don't know, but then again I wasn't exactly heavily involved in the
decision-making process 2 years ago.
And frankly, what decisions have we made on the basis of previous release data?
I'm not aware of any....
> If we want to find a specific regression in a deeply historical sense,
> we should isolate that as we would non-performance bugs, IMO, rather
> than impede this transition "just in case".
Agreed. Did I suggest otherwise?
-Boris
_This_ would be lovely; if we had this I would see little reason to keep the
sort of historical data we have from btek now (past the 2-3 months we need to
spot recent regressions, of course). Might I suggest the following set of test
builds:
1.0
1.0.latest
1.5.0
1.5.0.latest
for this purpose? In particular, comparison of 1.5.0.latest vs 1.5.0 might be
instructive; I can think of at least several checkins that took perf hits
(clear from code analysis) to fix security issues. I would dearly love to
know whether said perf hits were noticeable and, if so, how big.
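(A rough sketch of the comparison I have in mind, in Python; the Tp numbers
below are made up, and run_pageload_suite() just stands in for running the
same fixed content set against each build:)

    def run_pageload_suite(build_label):
        # Made-up numbers purely for illustration.
        fake_tp = {"1.0": 980.0, "1.0.latest": 1005.0,
                   "1.5.0": 910.0, "1.5.0.latest": 930.0}
        return fake_tp[build_label]

    for base, patched in (("1.0", "1.0.latest"), ("1.5.0", "1.5.0.latest")):
        delta = run_pageload_suite(patched) - run_pageload_suite(base)
        print("%s vs %s: %+.1f ms" % (patched, base, delta))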
And yes, the ability to add to (or subtract from) our tests would be great.
-Boris
Nice!
BTW, could someone remember to turn logscraping on for creature-vm while
you're at it (you'll need to do that for the other *-vm boxen as well)?
Thanks!
Robert Kaiser
Done. For what it's worth, this doesn't need access to the tinderbox itself;
anyone with the tinderbox password (e.g. anyone sheriffing) can do this.
-Boris