Hi,
I've used compare-talos occasionally in the past, though not enough to
have much to say about it. However, I'm currently tracking a perf
regression across the board, which has had me running talos locally as
well as using compare-talos, which recently pointed me at perfherder.
Some things have been pleasant (for example, running the talos script
was really easy, deeply appreciated), but I won't detail them. In the
paragraphs below, I will only focus on what doesn't help me in my
regression tracking.
- Compare-talos and Perfherder don't display the same values. While
Jmaher gave me an explanation on IRC that makes some sense, I'm still
puzzled by e.g. the consistent ten-fold difference between v8_7 values
in compare-talos vs. perfherder.
- When the delta is below some percentage (1.5%?), no coloring is done
(which I guess is meant to be "not significant").
As some tests are higher-is-better and others the opposite, the lack
of coloring makes it impossible to tell which direction is good or
bad, which the colors otherwise indicate. You might say "if it's
insignificant, why do you care?".
Well, I'm currently looking at something that affects essentially all
tests. In some configurations, I'm seeing insignificant differences,
but still differences. I'd like to get a feeling for whether in those
cases the general trend is all in the "red" direction, which would
indicate that, while it's much more controlled, there is still a
small regression, or whether it's just mostly business as usual.
- Relatedly, even when a delta is not significant, subdeltas might be,
and that information is entirely lost in the "all tests and platforms"
view. Clicking each and every "details" is tedious.
See for example how on
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f
cart is not colored, but on
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f&originalSignature=7d1b47c5dbf2507dbc5fe10639baed65638f56af&newSignature=7d1b47c5dbf2507dbc5fe10639baed65638f56af
three subdeltas are.
See also
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e4ec7d3ae44e&newProject=try&newRevision=3f5ef465f6df
where a 4.11% delta on tsvgr_opacity opt hides a 24.62% delta on
tsvgr_opacity big-optimizable-group-opacity-2500.svg opt. While in
this particular case the delta on the main list is significant, I
wouldn't be surprised if we can find cases where an insignificant
regression hides a significant one.
I guess a view that shows you *all* details at once could be useful.
Combined with a "show only significant deltas" checkbox.
- I understand what red and green are. But what is orange?
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6a3982c14382&newProject=try&newRevision=6ada351bdafc&originalSignature=b1490f9999855ac71faa954e2bb35a24f62fe7c2&newSignature=b1490f9999855ac71faa954e2bb35a24f62fe7c2
Maybe "within stddev"? But I'm sure I've seen deltas within stddev
that weren't orange.
- I guess the green or red bars between the delta % and the confidence
columns are a representation of the delta %. In their current form,
it's hard to see what they add beyond plenty of empty space. I, for one,
would rather have the intensity of the red/green change depending on
the delta %, so that a pale red/green says it's insignificant (and
that would solve the problem of always knowing if higher is better),
dark red would say "we have a problem", vivid green would say "yay".
Although... aren't red and green a problem for some kinds of color
blindness?
- Having run talos locally and having run some stats on my local
results, I'm not sure of the methodology being used, and I'd like
clarifications.
My understanding of how talos results are aggregated on perfherder
was the following:
- each talos run generates n sample scores. n=20 when I ran tpaint
locally.
- the geometric mean for those n samples is taken as the score for a
given talos run
- perfherder takes the results of m talos runs, and gives the mean and
stddev on those m samples.
However, if I take tpaint results on talos runs from one of my recent
try runs, the above doesn't yield the results perfherder is giving
me. Which is actually good, because the above
seemed fishy to me. So I took all the tpaint data I found in the
corresponding talos logs, munged them in many different ways, and
failed to get the same numbers as perfherder. The closest I got was
with the geometric mean of all the samples of all the talos runs[1]
but I couldn't get anywhere close to the displayed stddev with either
standard deviation or geometric standard deviation.
A link on perfherder giving some details would be welcome. One thing
is sure: the specific use of "geomean" and "stddev" in the column
titles is confusing, unless we *are* doing a plain geomean and
stddev, but I'd hope that we're using a geometric stddev.
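For reference, here's a quick Python sketch of the two aggregation
schemes I've been comparing; the helper names and the example numbers
are mine, not anything from the talos or perfherder code:

```python
import math

def geomean(xs):
    # Geometric mean: exp of the arithmetic mean of the logs.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geo_stddev(xs):
    # Geometric standard deviation: exp of the stddev of the logs.
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    var = sum((l - mu) ** 2 for l in logs) / len(logs)
    return math.exp(math.sqrt(var))

def per_run_then_mean(runs):
    # Scheme (a), my original understanding: geomean per talos run,
    # then arithmetic mean/stddev across the m run scores.
    scores = [geomean(run) for run in runs]
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / len(scores)
    return mu, math.sqrt(var)

def all_samples(runs):
    # Scheme (b), the closest I got: pool all samples from all runs,
    # then take the geomean and geometric stddev of the pool.
    pooled = [x for run in runs for x in run]
    return geomean(pooled), geo_stddev(pooled)
```

Note that the two schemes agree on the central value only by accident
(e.g. when every run has the same geomean), and their spread measures
aren't even in the same units: scheme (a)'s stddev is additive, while
a geometric stddev is a multiplicative factor.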
- Relatedly, my tpaint results are clearly much closer to a log-normal
distribution than to a normal distribution. Which means using geomean
is clearly better than mean, but I'd like to be sure: is it the case on
all talos tests? Maybe that's my control-freak side showing, but would
it make sense to have distribution histograms/plots on perfherder?
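To illustrate what I mean by "closer to log-normal": short of plotting
a histogram, one can compare the skewness of the raw samples with that
of their logs. The data below is synthetic, just standing in for
tpaint timings:

```python
import math
import random

def skewness(xs):
    # Sample skewness; near zero for symmetric (e.g. normal) data,
    # clearly positive for right-skewed data.
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical log-normal samples standing in for tpaint timings.
random.seed(0)
samples = [math.exp(random.gauss(5.0, 0.3)) for _ in range(1000)]

# The raw samples are right-skewed; their logs are roughly symmetric,
# which is the signature of a log-normal distribution.
raw_skew = skewness(samples)
log_skew = skewness([math.log(x) for x in samples])
```

If talos data generally looks like this, the log of the samples is
what's (roughly) normally distributed, which is exactly the case where
geomean and geometric stddev are the natural summary statistics.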
Sorry if some of those are already documented somewhere I didn't find,
but I guess I'd have been pointed there during our recent IRC
discussion if they were.
Cheers,
Mike
1. Interestingly, I had a 0.03 difference between the geomean of all
samples and the value perfherder displayed... but it turns out there is
*also* a 0.03 difference between the value displayed on the "all tests
and platforms" page and the "details" page for that particular try I
looked at. See "Old geomean" for:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f&originalSignature=a9aa9ea937c35962b71a6d8dcb69c9415da1f2c6&newSignature=a9aa9ea937c35962b71a6d8dcb69c9415da1f2c6
vs.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f29f6ab77403&newProject=try&newRevision=8f550ccabe2f
In fact, the new geomean differs even more between the two pages.