Thoughts/observations on Talos/PerfHerder


Mike Hommey

May 29, 2015, 9:18:55 PM
to dev-tree-...@lists.mozilla.org
Hi,

I've been using compare-talos at times, in the past. Not enough to have
much to say about it, though. I'm however currently tracking a perf
regression across the board which made me use talos locally, as well as
compare-talos, which recently pointed me at perfherder. Some things have
been pleasant (for example, running the talos script was really easy,
deeply appreciated), but I won't detail them. I will only focus, in the
paragraphs below, on what doesn't help me in my regression tracking.

- Compare-talos and Perfherder don't display the same values. While
Jmaher gave me an explanation on IRC that makes some sense, I'm still
puzzled by e.g. the consistent ten-fold difference between v8_7 values
in compare-talos vs. perfherder.

- When the delta is below some percentage(1.5%?), no coloring is done
(which I guess is meant to be "not significant").
As some tests are higher-is-better and others the opposite, this
doesn't tell me which direction is good or bad, which is exactly what
the colors otherwise convey. You might say "if it's insignificant,
why do you care?".
Well, I'm currently looking at something that affects essentially all
tests. In some configurations, I'm seeing insignificant differences,
but still differences. I'd like to get a feeling whether in those
cases the general trend is all in the "red" direction, which would
be indicative that while it's much more controlled, there is still a
small regression, or that it's just mostly business as usual.

- Relatedly, even when a delta is not significant, subdeltas might be,
and that information is entirely lost in the "all tests and platforms"
view. Clicking each and every "details" is tedious.
See for example how on
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f
cart is not colored, but on
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f&originalSignature=7d1b47c5dbf2507dbc5fe10639baed65638f56af&newSignature=7d1b47c5dbf2507dbc5fe10639baed65638f56af
three subdeltas are.
See also
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e4ec7d3ae44e&newProject=try&newRevision=3f5ef465f6df
where a 4.11% delta on tsvgr_opacity opt hides a 24.62% delta on
tsvgr_opacity big-optimizable-group-opacity-2500.svg opt. While in
this particular case the delta on the main list is significant, I
wouldn't be surprised if we can find cases where an insignificant
regression hides a significant one.
I guess a view that shows you *all* details at once could be useful.
Combined with a "show only significant deltas" checkbox.
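
For scale: a single subtest change of ratio r only moves an n-subtest
geometric-mean summary by a factor of r**(1/n), so large subdeltas
dilute quickly. A quick illustrative sketch (made-up numbers, not
actual talos data):

```python
import math

def geomean(xs):
    """Geometric mean: exp of the mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

base = [100.0] * 10                  # ten hypothetical subtest scores
regressed = [124.62] + [100.0] * 9   # one subtest regresses by ~24.6%

# The summary only moves by 1.2462**(1/10), i.e. about 2.2%, so a
# large subtest regression can easily hide below a coloring threshold.
ratio = geomean(regressed) / geomean(base)
```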

- I understand what red and green are. But what is orange?
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6a3982c14382&newProject=try&newRevision=6ada351bdafc&originalSignature=b1490f9999855ac71faa954e2bb35a24f62fe7c2&newSignature=b1490f9999855ac71faa954e2bb35a24f62fe7c2
Maybe "within stddev"? But I'm sure I've seen deltas within stddev
that weren't orange.

- I guess the green or red bars between the delta % and the confidence
columns are a representation of the delta %. In their current form,
it's hard to know why they're there, and they add plenty of empty
space. I, for one, would rather have the intensity of the red/green
change depending on the delta %, so that a pale red/green says it's
insignificant (and that would solve the problem of always knowing if
higher is better), dark red would say "we have a problem", and vivid
green would say "yay". Although... aren't red and green a problem for
some kinds of color blindness?

- Having run talos locally and having run some stats on my local
results, I'm not sure of the methodology being used, and I'd like
clarifications.
My understanding of how talos results are aggregated on perfherder
was the following:
- each talos run generates n sample scores. n=20 when I ran tpaint
locally.
- the geometric mean for those n samples is taken as the score for a
given talos run
- perfherder takes the results of m talos runs, and gives the mean and
stddev on those m samples.
However, if I take tpaint results on talos runs from one of my recent
try runs, the above doesn't yield the results perfherder is giving me.
Which is actually good, because the above seemed fishy to me. So I took
all the tpaint data I found in the corresponding talos logs, munged it
in many different ways, and failed to get the same numbers as
perfherder. The closest I got was with the geometric mean of all the
samples of all the talos runs[1], but I couldn't get anywhere close to
the displayed stddev with either the standard deviation or the
geometric standard deviation.
A link on perfherder giving some details would be welcome. One thing
is sure, the specific use of "geomean" and "stddev" in the column
titles is confusing, except if we *are* doing geomean and stddev, but
I'd hope that we're using a geometric stddev.
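
To make the assumed pipeline concrete, here's a sketch of the
aggregation described above (hypothetical helper names, and only my
guess at what perfherder does, not its actual code):

```python
import math
import statistics

def run_score(samples):
    """Per-run score: geometric mean of the n samples from one talos
    run (step 2 of the assumed pipeline)."""
    return math.exp(sum(math.log(s) for s in samples) / len(samples))

def aggregate(runs):
    """Across m runs: arithmetic mean and stddev of the per-run
    scores (step 3 of the assumed pipeline)."""
    scores = [run_score(r) for r in runs]
    return statistics.mean(scores), statistics.stdev(scores)
```

As noted above, this does not reproduce the numbers perfherder
displays, which is precisely the question.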

- Relatedly, my tpaint results are clearly much closer to a log-normal
distribution than to a normal distribution. Which means using geomean
is clearly better than mean, but I'd like to be sure: is it the case on
all talos tests? Maybe that's my control-freak side showing, but would
it make sense to have distribution histograms/plots on perfherder?
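
For reference, the geometric mean/stddev are just the arithmetic ones
computed in log space, which is why they're the natural summary for
log-normal data (a generic sketch, nothing perfherder-specific):

```python
import math

def geo_stats(samples):
    """Geometric mean and geometric stddev: compute mean and stddev of
    the logs, then exponentiate back. The geometric stddev is a
    dimensionless multiplicative factor (>= 1), not an offset."""
    logs = [math.log(s) for s in samples]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / (len(logs) - 1)
    return math.exp(mu), math.exp(math.sqrt(var))
```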

Sorry if some of these are already documented somewhere I didn't find,
but I guess I'd have been sent there during our recent IRC discussion
if they were.

Cheers,

Mike

1. Interestingly, I had a 0.03 difference between the geomean of all
samples and the value perfherder displayed... but it turns out there is
*also* a 0.03 difference between the value displayed on the "all tests
and platforms" page and the "details" page for that particular try I
looked at. See "Old geomean" for:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f&originalSignature=a9aa9ea937c35962b71a6d8dcb69c9415da1f2c6&newSignature=a9aa9ea937c35962b71a6d8dcb69c9415da1f2c6
vs.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f
In fact, the new geomean differs even more between the two pages.

William Lachance

Jun 1, 2015, 4:54:59 PM
Hi! Thanks for your comments, it's great to get this level of feedback.

On 2015-05-29 9:18 PM, Mike Hommey wrote:
> - Compare-talos and Perfherder don't display the same values. While
> Jmaher gave me an explanation on IRC that makes some sense, I'm still
> puzzled by e.g. the consistent ten-fold difference between v8_7 values
> in compare-talos vs. perfherder.

So I had to look back to the IRC conversation to understand what was
going on here:

http://logs.glob.uno/?c=mozilla%23ateam&s=29+May+2015&e=29+May+2015&h=glandium#c936471

Based on my understanding of how the old compare-talos works, I don't
think you can compare its numbers (an average which drops a bunch of
data) with a geometric mean. They're completely different calculations
which will yield completely different results.

If compare-talos shows a regression where perfherder doesn't, that's
obviously interesting. Otherwise I wouldn't worry about it.

> - When the delta is below some percentage(1.5%?), no coloring is done
> (which I guess is meant to be "not significant").
> As some tests are higher-is-better and others the opposite, this
> doesn't tell me which direction is good or bad, which is exactly what
> the colors otherwise convey. You might say "if it's insignificant,
> why do you care?".
> Well, I'm currently looking at something that affects essentially all
> tests. In some configurations, I'm seeing insignificant differences,
> but still differences. I'd like to get a feeling whether in those
> cases the general trend is all in the "red" direction, which would
> be indicative that while it's much more controlled, there is still a
> small regression, or that it's just mostly business as usual.

It might be worth considering coloring results if we're confident
there's a regression, even if the regression is small.

> - Relatedly, even when a delta is not significant, subdeltas might be,
> and that information is entirely lost in the "all tests and platforms"
> view. Clicking each and every "details" is tedious.
> See for example how on
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f
> cart is not colored, but on
> https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f&originalSignature=7d1b47c5dbf2507dbc5fe10639baed65638f56af&newSignature=7d1b47c5dbf2507dbc5fe10639baed65638f56af
> three subdeltas are.
> See also
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e4ec7d3ae44e&newProject=try&newRevision=3f5ef465f6df
> where a 4.11% delta on tsvgr_opacity opt hides a 24.62% delta on
> tsvgr_opacity big-optimizable-group-opacity-2500.svg opt. While in
> this particular case the delta on the main list is significant, I
> wouldn't be surprised if we can find cases where an insignificant
> regression hides a significant one.

The idea behind the geometric mean is that it's designed to magnify
differences (at least sustained differences across retriggers and/or
more pushes) so that doesn't happen.

If you haven't already, it's worth reading the wikipedia article on the
geometric mean:

http://en.wikipedia.org/wiki/Geometric_mean

> I guess a view that shows you *all* details at once could be useful.
> Combined with a "show only significant deltas" checkbox.

This would be an awful lot of data at once, at least in many cases. :)

> - I understand what red and green are. But what is orange?
> Maybe "within stddev"? But I'm sure I've seen deltas within stddev
> that weren't orange.

It means we think there's a regression, but we're unsure (i.e. we need
more retriggers).

https://github.com/mozilla/treeherder/blob/master/ui/js/perf.js#L220

I agree this could be better explained. I filed a bug to pop up a
tooltip when we hover over these things:

https://bugzilla.mozilla.org/show_bug.cgi?id=1170301

Feel free to comment in the bug if you have a better idea.

> - I guess the green or red bars between the delta % and the confidence
> columns are a representation of the delta %. In their current form,
> it's hard to know why they're there, and they add plenty of empty
> space. I, for one, would rather have the intensity of the red/green
> change depending on the delta %, so that a pale red/green says it's
> insignificant (and that would solve the problem of always knowing if
> higher is better), dark red would say "we have a problem", and vivid
> green would say "yay". Although... aren't red and green a problem for
> some kinds of color blindness?

Yeah, this could probably be improved. It might be a good idea to have
faded out versions of the bars on either side, so it was more obvious
which directions things could go in. Might be worth asking some data
visualization gurus about this sort of thing.

I filed a bug proposing the above, again feel free to comment if you
have other ideas:

https://bugzilla.mozilla.org/show_bug.cgi?id=1170305

> - Having run talos locally and having run some stats on my local
> results, I'm not sure of the methodology being used, and I'd like
> clarifications.
> My understanding of how talos results are aggregated on perfherder
> was the following:
> - each talos run generates n sample scores. n=20 when I ran tpaint
> locally.
> - the geometric mean for those n samples is taken as the score for a
> given talos run
> - perfherder takes the results of m talos runs, and gives the mean and
> stddev on those m samples.
> However, if I take tpaint results on talos runs from one of my recent
> try runs, the above doesn't yield the results perfherder is giving me.
> Which is actually good, because the above seemed fishy to me. So I took
> all the tpaint data I found in the corresponding talos logs, munged it
> in many different ways, and failed to get the same numbers as
> perfherder. The closest I got was with the geometric mean of all the
> samples of all the talos runs[1], but I couldn't get anywhere close to
> the displayed stddev with either the standard deviation or the
> geometric standard deviation.
> A link on perfherder giving some details would be welcome. One thing
> is sure, the specific use of "geomean" and "stddev" in the column
> titles is confusing, except if we *are* doing geomean and stddev, but
> I'd hope that we're using a geometric stddev.

So what you're describing is basically a bug -- we currently create a
summary series for tpaint, which does use a geometric mean -- however,
that geometric mean is taken over a single value, so it's essentially
meaningless (albeit in a harmless way, I think -- in the sense that the
numbers being generated are still "valid").

It only really makes sense to use a geometric mean for summary tests
which provide an aggregated number based on the results of a bunch of
smaller tests (e.g. the CART example above that you linked to). For
things like tpaint, we should just be doing calculations on the average
(or maybe the median?). :jmaher may have more to say on this topic.
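
To spell out why it's harmless: the geometric mean of a single value
is (numerically) the value itself, so a single-subtest summary series
just mirrors the raw score. A trivial sketch:

```python
import math

def geomean(xs):
    """Geometric mean: exp of the mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# With one subtest, the summary collapses to the score itself
# (up to floating-point round-trip through log/exp):
score = 273.5
summary = geomean([score])
```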

We have a bug open to remove summary series for tests which only have
one subtest -- https://bugzilla.mozilla.org/show_bug.cgi?id=1162522

> - Relatedly, my tpaint results are clearly much closer to a log-normal
> distribution than to a normal distribution. Which means using geomean
> is clearly better than mean, but I'd like to be sure: is it the case on
> all talos tests? Maybe that's my control-freak side showing, but would
> it make sense to have distribution histograms/plots on perfherder?

You're not the first person to ask for this. :) We filed a bug to
implement this already:

https://bugzilla.mozilla.org/show_bug.cgi?id=1164891

We should get to this sometime this summer, I'd guess.

Thanks again for the feedback, let me know if I didn't address something...

Will

Mike Hommey

Jun 2, 2015, 1:44:25 AM
to William Lachance, dev-tree-...@lists.mozilla.org
On Mon, Jun 01, 2015 at 04:54:58PM -0400, William Lachance wrote:
> Hi! Thanks for your comments, it's great to get this level of feedback.
>
> On 2015-05-29 9:18 PM, Mike Hommey wrote:
> >- Compare-talos and Perfherder don't display the same values. While
> > Jmaher gave me an explanation on IRC that makes some sense, I'm still
> > puzzled by e.g. the consistent ten-fold difference between v8_7 values
> > in compare-talos vs. perfherder.
>
> So I had to look back to the IRC conversation to understand what was going
> on here:
>
> http://logs.glob.uno/?c=mozilla%23ateam&s=29+May+2015&e=29+May+2015&h=glandium#c936471
>
> Based on my understanding of how the old compare-talos works, I don't think
> you can compare its numbers (an average which drops a bunch of data) with a
> geometric mean. They're completely different calculations which will yield
> completely different results.

(snip)

Ok, let me stop you here. Reading further in your message, I /think/ I'm
getting a grasp at what the talos results shown on perfherder are, but
really, that should all be written in some document linked from there,
because even with what I /think/ I may have grasped, I have not the
slightest idea how to make sense of local results of running talos in a
similar, useful, way. Nor have I the slightest idea how to reproduce
the results perfherder gives from the test logs of those talos runs on
automation.

> > I guess a view that shows you *all* details at once could be useful.
> > Combined with a "show only significant deltas" checkbox.
>
> This would be an awful lot of data at once, at least in many cases. :)

And it would be fine because it wouldn't be the default.

Mike

William Lachance

Jun 2, 2015, 1:20:35 PM
On 15-06-02 01:43 AM, Mike Hommey wrote:
>> >Based on my understanding of how the old compare-talos works, I don't think
>> >you can compare its numbers (an average which drops a bunch of data) with a
>> >geometric mean. They're completely different calculations which will yield
>> >completely different results.
> (snip)
>
> Ok, let me stop you here. Reading further in your message, I /think/ I'm
> getting a grasp at what the talos results shown on perfherder are, but
> really, that should all be written in some document linked from there,
> because even with what I /think/ I may have grasped, I have not the
> slightest idea how to make sense of local results of running talos in a
> similar, useful, way. Nor have I the slightest idea how to reproduce
> the results perfherder gives from the test logs of those talos runs on
> automation.

Ok, so after discussing this with jmaher and mconley, I think there are
two things we should do here:

1. Make it possible to use compareperf with locally generated data (so
that people doing stuff with talos locally can use the hopefully
well-developed and familiar interface). Filed:
https://bugzilla.mozilla.org/show_bug.cgi?id=1170639

2. We need better documentation (probably inside perfherder) of what
calculations we're doing and what they mean. Filed:
https://bugzilla.mozilla.org/show_bug.cgi?id=1170648

Will