Proposal to Change Talos Data Calculations

Clint Talbert

Nov 3, 2011, 9:32:58 PM
to Stephen Lewchuk, Joel Maher, jod...@mozilla.com
For the last few weeks, Stephen Lewchuk from the Metrics team has been
analyzing some of our Talos data to determine if there are better ways
we could collect or report the data so that we have more deterministic
results, faster runs, etc.

He's not finished yet, but he gave Joel and me an overview today. I'll
talk about the take-aways first and then we can discuss the "why" behind
them. Take-aways:
* Stop discarding the maximum value
* Start discarding the first iteration of the test run.

He found that the maximum value for a test run often comes from a
single HTML test during a larger test of several pages. Discarding that
value is therefore often akin to discarding that entire segment of the
test, so we should simply stop doing that. See his graphs for that [1]
and [2].

He also found that on almost every single test, our first iteration of
the test produces a wonky number. Successive iterations produce more
stable results. This showed up in tsvg, tdhtml [3], and tp5 [4], [5]
(the run index in these results - the x-axis - is the iteration number
of the run). So, he recommends throwing out the first iteration of the
test.
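
For illustration, here is a minimal Python sketch of the two suite-level aggregation schemes as I read the descriptions in this thread: the existing one (per-page number, drop the single maximum page value, average) versus the proposal (keep the maximum, drop each page's first iteration). The function names, the toy pageset, and the exact order of operations are assumptions for illustration, not the actual Talos code.

import statistics

def current_suite_score(pages):
    """Existing scheme as described above: compute a per-page number from
    all iterations, then drop the single highest page value before
    averaging.  If one page is always the maximum, its results never
    influence the reported number."""
    per_page = [statistics.median(replicates) for replicates in pages.values()]
    per_page.remove(max(per_page))  # discard the maximum value
    return statistics.mean(per_page)

def proposed_suite_score(pages):
    """Proposed scheme: keep every page, but drop each page's first
    iteration (cold caches, one-time initialization) before taking the
    per-page median."""
    per_page = [statistics.median(replicates[1:]) for replicates in pages.values()]
    return statistics.mean(per_page)

# Hypothetical pageset: the long page regresses, but the current scheme
# always drops it as the maximum, so only the proposed score would move.
pages = {
    "short-a": [55.0, 50.0, 51.0, 49.0, 50.0],
    "short-b": [60.0, 52.0, 53.0, 51.0, 52.0],
    "long":    [900.0, 840.0, 845.0, 838.0, 842.0],
}
print(current_suite_score(pages), proposed_suite_score(pages))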

Now, these are two changes we could make during the upcoming rollout of
the RSS measurement for Android, because the RSS rollout will mean that
we have to rebaseline tp5 anyway. Do we want to take these two changes
at the same time?

To help us answer that question, Stephen is currently working on a
graph of the tp5 results calculated as we do today, with the same
results calculated under the new proposal overlaid on top. This will
give us an idea of the changes we can expect to see. But if his data
holds, it looks like this should give us more accurate, less noisy
numbers.

Thoughts on the above proposal? Should we change the reporting
mechanism as we roll out RSS? Since we must rebaseline anyway, doing
it at the same time saves on test-slave bandwidth. (We must run all
affected Talos tests in both the new and old versions on all branches,
so it's a significant impact, and something we'd prefer not to do twice).

Clint

[1]:
http://people.mozilla.org/~ctalbert/talos_presentation/02-tvsg_medians.pdf
[2]:
http://people.mozilla.org/~ctalbert/talos_presentation/04-tsvg-composite-scale-rotate.svg
(some of the tests are multi-modal, so sometimes they are high,
sometimes they are low, and if we always throw out the high value we are
losing valuable information.)
[3]:
http://people.mozilla.org/~ctalbert/talos_presentation/13-tdhtml-chrome_chrome_mac_runs.pdf
[4]: http://people.mozilla.org/~ctalbert/talos_presentation/16-tp5_runs.pdf
[5]: http://people.mozilla.org/~ctalbert/talos_presentation/17-tp5_runs.pdf

slew...@mozilla.com

Nov 4, 2011, 8:29:02 PM
to Stephen Lewchuk, Joel Maher, jod...@mozilla.com
Some additional graphs/insights. The proposed changes are expected to slightly increase the noise and variance in the data, since high-valued anomalies may no longer be discarded. The benefit, however, is the elimination of systematic biases in the data. The proposed scheme increases the reported value by a similar amount across most of the builds in the week of Oct 24-30, with a few larger changes (88cd8e9287c8 for fedora64 and 532d27c289de for snowleopard-r4) [1]. In that same week, the same set of small pages was discarded, meaning the existing scheme hides the impact of those tests [2].

[1] http://people.mozilla.org/~slewchuk/graphs/tp5_aggregation_differences.pdf
The first graph shows the existing aggregation scheme (blue) and the proposed scheme (red). The second graph shows the difference between the two schemes.

[2] http://people.mozilla.org/~slewchuk/graphs/tp5_discarded_pages.pdf
This graph shows the counts of the pages being discarded by the current scheme. Raw counts below:

                                fedora  fedora64  leopard  snowleopard  snowleopard-r4  win7   xp
alipay.com/.../index.html            8        11        3            3               4     2    1
bild.de/.../index.html               0        25        0           24              62    22    0
dailymail.co.uk/.../index.html     118       102      158          149             110   117  135
xinhuanet.com/.../index.html         0         0       14            3               3     0    0
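
As an illustration only, counts like the ones above could be tallied with something like the following; the input layout (platform -> builds -> per-page scores), the helper name, and the toy page names are hypothetical, not the actual graph server data format.

from collections import Counter, defaultdict

def count_discarded_pages(results):
    """`results` is assumed to map platform -> list of builds, where each
    build maps page name -> per-page score.  Returns, per platform, how
    often each page was the maximum (and therefore discarded)."""
    counts = defaultdict(Counter)
    for platform, builds in results.items():
        for build in builds:
            discarded = max(build, key=build.get)  # page with highest score
            counts[platform][discarded] += 1
    return counts

# Toy example (made-up numbers):
results = {
    "fedora64": [
        {"alipay": 310.0, "dailymail": 980.0, "bild": 410.0},
        {"alipay": 305.0, "dailymail": 955.0, "bild": 960.0},
    ],
}
print(count_discarded_pages(results)["fedora64"])
# Counter({'dailymail': 1, 'bild': 1})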

If anyone has questions about the graphs/data/proposals, please ping me and I'll happily answer them,

Stephen

Justin Lebar

Nov 6, 2011, 12:20:28 PM
to mozilla.de...@googlegroups.com, Joel Maher, dev-pl...@lists.mozilla.org, Stephen Lewchuk, jod...@mozilla.com
Towards the end of [1] it looks like there was a spike in the proposed
data across operating systems which wasn't caught by our existing
infrastructure. Do you have any idea whether this represents a real
regression that we missed? (For example, does the spike correspond to
something being checked in and then backed out?)

Aside from this one event, the effect on variance seems minimal,
except for the few Fedora 64 spikes...

Anyway, this seems like a reasonable thing to do!

Shawn Wilsher

Nov 6, 2011, 8:38:51 PM
to dev-pl...@lists.mozilla.org
On 11/3/2011 6:32 PM, Clint Talbert wrote:
> Thoughts on the above proposal? Should we change the reporting mechanism
> as we roll out RSS? Since we must rebaseline anyway, doing it at the
> same time saves on test-slave bandwidth. (We must run all affected Talos
> tests in both the new and old versions on all branches, so it's a
> significant impact, and something we'd prefer not to do twice).
I am so happy to see this happening, and I think it's great that our
metrics team is looking at our performance numbers :)

Cheers,

Shawn

Stephen Lewchuk

Nov 7, 2011, 1:15:57 PM
to Justin Lebar, Joel Maher, mozilla dev platform, dev-pl...@lists.mozilla.org, jod...@mozilla.com
The spike on [1] is an example (an unfortunately large one) of the additional noise this change will create. The associated change: https://hg.mozilla.org/integration/mozilla-inbound/rev/88cd8e9287c8. The spike appears to be caused by an anomaly in the test runs of alipay.com/www.alipay.com/index.html. This page is often the longest running page [2] and therefore most of the time is hidden by the existing scheme. The large nature of this spike also reveals another issue that we will be addressing going forward: that the averaging process heavily favors changes in the long running pages compared to short pages.
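
To illustrate the weighting issue in that last point: with an arithmetic mean over per-page times, the same relative regression moves the suite number far more when it lands on a long page than on a short one. The numbers below are made up, and the geometric mean is shown only as one possible scale-free alternative, not a decision from this thread.

import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

baseline   = [50.0, 60.0, 900.0]   # two short pages, one long page
short_regr = [55.0, 60.0, 900.0]   # +10% on a short page
long_regr  = [50.0, 60.0, 990.0]   # +10% on the long page

for label, values in [("baseline", baseline),
                      ("short +10%", short_regr),
                      ("long  +10%", long_regr)]:
    print(f"{label:11s}  arith={arithmetic_mean(values):7.1f}  "
          f"geo={geometric_mean(values):6.1f}")
# The arithmetic mean barely moves for the short-page regression (+1.7)
# but jumps for the long-page one (+30); the geometric mean rises by the
# same ~3.2% in both cases.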

Justin Lebar

Nov 7, 2011, 2:14:54 PM
to Stephen Lewchuk, Joel Maher, mozilla dev platform, dev-pl...@lists.mozilla.org, jod...@mozilla.com
Hm. So this looks like noise -- that changeset didn't do anything,
but if this is noise in the test run itself, why did the spike appear
on multiple platforms?

> The large nature of this spike also reveals another issue that we will be addressing going forward: that the averaging process heavily favors
> changes in the long running pages compared to short pages.

Yes, using arithmetic mean is nutso. I'll echo Shawn's sentiment:
It's great that you guys are looking at this!

-Justin

Stephen Lewchuk

Nov 7, 2011, 4:04:32 PM
to Justin Lebar, Joel Maher, dev-pl...@lists.mozilla.org, jod...@mozilla.com
That's a good question, as that build is a spike in fedora, fedora64, and win7. I'll dig a bit deeper and see if I can find anything interesting/different about it.

Stephen

Taras Glek

Nov 7, 2011, 8:13:25 PM
to
On 11/7/2011 10:15 AM, Stephen Lewchuk wrote:
> The spike on [1] is an example (an unfortunately large one) of the additional noise this change will create. The associated change: https://hg.mozilla.org/integration/mozilla-inbound/rev/88cd8e9287c8. The spike appears to be caused by an anomaly in the test runs of alipay.com/www.alipay.com/index.html. This page is often the longest running page [2] and therefore most of the time is hidden by the existing scheme. The large nature of this spike also reveals another issue that we will be addressing going forward: that the averaging process heavily favors changes in the long running pages compared to short pages.
>
> ----- Original Message -----
> From: "Justin Lebar"<justin...@gmail.com>
> To: "mozilla dev platform"<mozilla.de...@googlegroups.com>
> Cc: dev-pl...@lists.mozilla.org, "Joel Maher"<jma...@mozilla.com>, "Stephen Lewchuk"<slew...@mozilla.com>, jod...@mozilla.com
> Sent: Sunday, November 6, 2011 9:20:28 AM
> Subject: Re: Proposal to Change Talos Data Calculations
>
> Towards the end of [1] it looks like there was a spike in the proposed
> data across operating systems which wasn't caught by our existing
> infrastructure. Do you have any idea whether this represents a real
> regression that we missed? (For example, does the spike correspond to
> something being checked in and then backed out?)
>
> Aside from this one event, the effect on variance seems minimal,
> except for the few Fedora 64 spikes...
In this case the outlier is beyond variability, but...

Is there any chance we could keep track of variance? It's painful to get
a new faster/slower number out of our infrastructure only to discover
that the difference is noise. Are there biases introduced by particular
machines, and if so, are they correctable?

Taras

Stephen Lewchuk

Nov 7, 2011, 8:48:41 PM
to dev-pl...@lists.mozilla.org
Trying to reduce variance is one of the goals of the next stage of the work. I'm not sure whether the current infrastructure would support some type of variance tracking. As for machine bias, the data seems to point to little or no machine bias, and in the next few days I hope to have numbers to back up that assertion.
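
(For illustration only: one way to put numbers behind the machine-bias question would be to group raw results by the slave that produced them and compare per-machine means with the overall mean. The data layout and machine names below are made up, not pulled from graph server.)

from collections import defaultdict
from statistics import mean, pstdev

def per_machine_bias(samples):
    """`samples` is assumed to be a list of (machine_id, value) pairs for
    one test on one platform.  Returns each machine's deviation from the
    overall mean, expressed in overall standard deviations."""
    overall = [v for _, v in samples]
    mu, sigma = mean(overall), pstdev(overall)
    by_machine = defaultdict(list)
    for machine, value in samples:
        by_machine[machine].append(value)
    return {m: (mean(vals) - mu) / sigma for m, vals in by_machine.items()}

# Toy data with hypothetical slave names: the second machine looks
# slightly slow in this made-up sample.
samples = [("slave-001", v) for v in (180, 182, 179, 181)] + \
          [("slave-002", v) for v in (188, 190, 187, 189)]
print(per_machine_bias(samples))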

Stephen

Justin Lebar

Nov 7, 2011, 10:10:54 PM
to Taras Glek, Stephen Lewchuk, dev-pl...@lists.mozilla.org
> Is there any chance we could keep track of variance? It's painful to get a
> new faster/slower number out of our infrastructure only to discover that the
> difference is noise.

Doing something like this is very important. We don't necessarily
need to track the variance in graphserver itself, so long as we have a
uniform way of computing the variance from the raw graphserver data.
Unfortunately, figuring out the variance of a test is very difficult
to do automatically.

You have outliers like the one here, with no apparent cause, plus you
have changes caused by code which is backed out. You need to
distinguish a momentary 5% bump in test scores due to a botched
checkin from a 5% bump caused by natural randomness.

Changes in infra, even changes which aren't expected to affect
variance, sometimes have an effect as well. For example, in the bug
where I looked at Dromaeo scores [1], an infra change which affected
which machines got which jobs may have affected the variance, because
the test score was affected by which object files were rebuilt! I
also saw in that bug that test scores were different on different
trees, again due to infra weirdness.

You also have to model the variance. Most tests' results aren't
normally distributed. Are they uniform in a range? Are they
something else? Some are bimodal. And the kind of distribution can
change over time; bimodality can appear or disappear as we fix or
regress things.

Since the variance can change over time, you have to somehow deal with
these points in time where there was a hard shift in the test result
distribution.

And all of this has to be automatic and work without intervention,
because we have a gazillion tests and are adding more.

Anyway, I'm not saying we shouldn't do this. It is, in fact,
extremely important that we figure this out. Not only is it currently
difficult to tell how your patch affects performance, but all sorts of
changes to the distribution of benchmark results are missed by our
current baby-statistics approach to monitoring.

But I don't think the solution is going to be simple, unfortunately.

-Justin

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=653961
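
Not a proposal from this thread, just a sketch of the kind of uniform, after-the-fact computation described above: pull the per-build scores in push order and flag a build that sits well outside recent variation, without assuming a normal distribution. The function name, window size, and threshold are arbitrary assumptions for illustration.

from statistics import median

def flag_outlier_builds(series, window=12, threshold=4.0):
    """`series` is assumed to be a list of per-build scores in push order.
    A build is flagged when it sits more than `threshold` median absolute
    deviations (MADs) away from the median of the preceding `window` builds."""
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        med = median(recent)
        mad = median(abs(v - med) for v in recent) or 1e-9  # avoid /0
        if abs(series[i] - med) / mad > threshold:
            flagged.append(i)
    return flagged

# Toy series: stable scores with one spike, like the outlier discussed above.
scores = [340, 342, 339, 341, 340, 343, 338, 341, 340, 342, 339, 341,
          340, 395, 341, 340]
print(flag_outlier_builds(scores))  # -> [13]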

Justin Dolske

Nov 8, 2011, 8:33:59 PM
to
On 11/3/11 6:32 PM, Clint Talbert wrote:
> For the last few weeks, Stephen Lewchuk from the Metrics team has been
> analyzing some of our Talos data to determine if there are better ways
> we could collect or report the data so that we have more deterministic
> results, faster runs, etc.

Yay! I suspect we're sorely overdue for an intervention from someone who
understands statistics well.

> Take-aways:
> * Stop discarding the maximum value

Yes! IIRC, this has resulted in a "whuck?!" from previous examinations
of how data is reported.

> * Start discarding the first iteration of the test run.

Hmm, yes and no.

It makes sense to me to do _something_ differently with the first
iteration of a test run, because it's inherently different. Caches are
cold, and we often do extra work the first time a code path is hit (eg
initializing services).

My gut (a certified professional statistician) tells me we want to
ignore the first run for the purposes of an "overall" number, but still
report it somewhere else to watch as a "worst case" kind of thing.

As a simple hypothetical, suppose someone (let's call him "Dave")
introduces some kind of dumb perf killer that makes a service take
seconds to initialize instead of the usual milliseconds. Mr. Townsend,
erruhh, Dave, wouldn't notice the problem if the first-run values are
always ignored by our perf measurements. [A more realistic example would
be a series of, say, small 15% regressions that add up over time.]
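
In the spirit of the suggestion above, a minimal sketch of reporting both numbers: the first iteration is excluded from the headline score but surfaced separately as a cold-start figure, so a regression like Dave's stays visible. The structure and field names are hypothetical, not part of any existing Talos report.

from statistics import median

def summarize_page(replicates):
    """Return a headline score that ignores the first iteration, plus the
    first-iteration value itself so cold-start regressions stay visible."""
    return {
        "overall": median(replicates[1:]),  # warm iterations only
        "cold_start": replicates[0],        # first iteration, reported as-is
    }

# A dumb service-initialization regression shows up only in cold_start:
before = summarize_page([420.0, 180.0, 178.0, 181.0, 179.0])
after  = summarize_page([2400.0, 180.0, 178.0, 181.0, 179.0])
print(before, after)  # overall unchanged, cold_start 420 -> 2400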

Justin

Taras Glek

Nov 10, 2011, 2:53:37 PM
to Justin Dolske
I think there are 2 independent issues here: repeatability and
representability*. I think on our testsuites we want to focus on
measuring significant differences in performance, ie focus on repeatable
results. On the other hand ensuring that we have a test that's
representative of real world is something that should be tackled during
the [re]design phase of the benchmark.
I think the best way to see if a benchmark is representative is to track
the particular feature via telemetry** and see how that compares to our
synthetic benchmark. Arguing on which number to drop will not get us as
close to the truth.

Taras

* at least one of these words is made up.
** We should look into statistically tracking certain telemetry reports
to catch regressions from real life

>
> Justin

Clint Talbert

Nov 10, 2011, 4:25:10 PM
to Taras Glek, Justin Dolske
On 11/10/2011 11:53 AM, Taras Glek wrote:
> On 11/8/2011 5:33 PM, Justin Dolske wrote:
>> On 11/3/11 6:32 PM, Clint Talbert wrote:
>>> For the last few weeks, Stephen Lewchuk from the Metrics team has been
>>> analyzing some of our Talos data to determine if there are better ways
>>> we could collect or report the data so that we have more deterministic
>>> results, faster runs, etc.
>>
>> Yay! I suspect we're sorely overdue for an intervention from someone who
>> understands statistics well.
>>
>>> Take-aways:
>>> * Stop discarding the maximum value
>>
>> Yes! IIRC, this has resulted in a "whuck?!" from previous examinations
>> of how data is reported.
>>
>>> * Start discarding the first iteration of the test run.
>>
>> Hmm, yes and no.
>>
>> It makes sense to me to do _something_ differently with the first
>> iteration of a test run, because it's inherently different. Caches are
>> cold, and we often do extra work the first time a code path is hit (eg
>> initializing services).
>>
>> My gut (a certified professional statistician) tells me we want to
>> ignore the first run for the purposes of an "overall" number, but still
>> report it somewhere else to watch as a "worst case" kind of thing.
>>
We do report all the numbers to the graph server, so we have the data
there. We can craft views using that data to show all the numbers if we
want. We'd just leave out the first run when reporting the "overall"
number; that's what we're suggesting.

> I think there are 2 independent issues here: repeatability and
> representability*. I think on our testsuites we want to focus on
> measuring significant differences in performance, ie focus on repeatable
> results. On the other hand ensuring that we have a test that's
> representative of real world is something that should be tackled during
> the [re]design phase of the benchmark.
Right, with Talos right now, I want to get better reliability on the
numbers we are currently reporting.

> I think the best way to see if a benchmark is representative is to track
> the particular feature via telemetry** and see how that compares to our
> synthetic benchmark. Arguing on which number to drop will not get us as
> close to the truth.
I completely agree. But, getting better numbers for talos helps to
ensure that we have good comparisons here.

Clint

Stephen Lewchuk

Nov 10, 2011, 7:42:50 PM
to dev-pl...@lists.mozilla.org


----- Original Message -----
> From: "Clint Talbert" <ctal...@mozilla.com>
> To: dev-pl...@lists.mozilla.org
> Cc: "Justin Dolske" <dol...@mozilla.com>
> Sent: Thursday, November 10, 2011 1:25:10 PM
> Subject: Re: Proposal to Change Talos Data Calculations
I believe we only report the median (after max clipping) of each page in a test to graph server. That is partially why the additional pipeline was needed to do this analysis.
>
> > I think there are 2 independent issues here: repeatability and
> > representability*. I think on our testsuites we want to focus on
> > measuring significant differences in performance, ie focus on
> > repeatable
> > results. On the other hand ensuring that we have a test that's
> > representative of real world is something that should be tackled
> > during
> > the [re]design phase of the benchmark.
> Right, with Talos right now, I want to get better reliability on the
> numbers we are currently reporting.
>
> > I think the best way to see if a benchmark is representative is to
> > track
> > the particular feature via telemetry** and see how that compares to
> > our
> > synthetic benchmark. Arguing on which number to drop will not get us
> > as
> > close to the truth.
> I completely agree. But, getting better numbers for talos helps to
> ensure that we have good comparisons here.
I think that comparing to a Telemetry metric would be really useful for identifying which pages represent good signals for real world performance.
>
> Clint

The goal of the current set of analyses is to identify any systematic changes we can make to the test framework, based on an analysis of the actual results. This will hopefully give us more stable results, which could then be used in future, more complex work.

Stephen