I've used compare-talos occasionally in the past, though not enough to
have much to say about it. I'm currently tracking a performance
regression across the board, which led me to run talos locally and to
use compare-talos, which in turn recently pointed me at perfherder.
Some things have been pleasant (for example, running the talos script
was really easy, deeply appreciated), but I won't detail them. In the
paragraphs below, I will only focus on what doesn't help me in my
regression tracking.
- Compare-talos and Perfherder don't display the same values. While
Jmaher gave me an explanation on IRC that makes some sense, I'm still
puzzled by e.g. the consistent ten-fold difference between v8_7 values
in compare-talos vs. perfherder.
- When the delta is below some percentage (1.5%?), no coloring is done
(which I guess is meant to be "not significant").
As some tests are higher-is-better and others the opposite, the lack of
coloring doesn't help tell which direction is good or bad, which is
exactly what the colors otherwise help with. You might say "if it's
insignificant, why do you care?".
Well, I'm currently looking at something that affects essentially all
tests. In some configurations, I'm seeing insignificant differences,
but still differences. I'd like to get a feeling for whether, in those
cases, the general trend is all in the "red" direction, which would
indicate that, while the regression is much more controlled, there is
still a small one, or whether it's mostly business as usual.
- Relatedly, even when a delta is not significant, subdeltas might be,
and that information is entirely lost in the "all tests and platforms"
view. Clicking each and every "details" is tedious.
See for example how on
cart is not colored, but on
three subdeltas are.
where a 4.11% delta on tsvgr_opacity opt hides a 24.62% delta on
tsvgr_opacity big-optimizable-group-opacity-2500.svg opt. While in
this particular case the delta on the main list is significant, I
wouldn't be surprised if we can find cases where an insignificant
regression hides a significant one.
I guess a view that shows you *all* details at once could be useful.
Combined with a "show only significant deltas" checkbox.
- I understand what red and green are. But what is orange?
Maybe "within stddev"? But I'm sure I've seen deltas within stddev
that weren't orange.
- I guess the green or red bars between the delta % and the confidence
columns are a representation of the delta %. In their current form,
it's hard to see what they're there for; they mostly add empty space.
I, for one, would rather have the intensity of the red/green vary with
the delta %, so that a pale red/green says the delta is insignificant
(which would also solve the problem of always knowing whether higher is
better): dark red would say "we have a problem", vivid green would say
"yay".
Although... aren't red and green a problem for people with some kinds
of color blindness?
- Having run talos locally and having run some stats on my local
results, I'm not sure of the methodology being used, and I'd like some
clarification.
My understanding of how talos results are aggregated on perfherder
was the following:
- each talos run generates n sample scores. n=20 when I ran tpaint
- the geometric mean for those n samples is taken as the score for a
given talos run
- perfherder takes the results of m talos runs, and gives the mean and
stddev on those m samples.
However, if I take tpaint results from talos runs in one of my recent
try pushes, the above doesn't yield the results perfherder is giving
me. Which is actually good, because the above seemed fishy to me. So I
took all the tpaint data I found in the corresponding talos logs,
munged them in many different ways, and failed to get the same numbers
as perfherder. The closest I got was with the geometric mean of all the
samples of all the talos runs, but I couldn't get anywhere close to the
displayed stddev with either the standard deviation or the geometric
standard deviation.
A link on perfherder giving some details would be welcome. One thing is
sure: the specific use of "geomean" and "stddev" in the column titles
is confusing, unless we *are* doing a geomean and a plain stddev,
though I'd hope we're using a geometric stddev.
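To make the comparison concrete, here is a minimal Python sketch of the
aggregation I described above, plus the geometric stddev I tried to
match against the displayed numbers. All function names are mine, and
this is my guess at the pipeline, not Perfherder's actual code.

```python
import math
import statistics

def geomean(samples):
    # Geometric mean computed via logs, to avoid overflow on long lists.
    return math.exp(statistics.fmean(math.log(x) for x in samples))

def geometric_stddev(samples):
    # exp of the stddev of the logs: a dimensionless multiplicative factor.
    return math.exp(statistics.stdev(math.log(x) for x in samples))

def perfherder_score_guess(runs):
    """My guess at the aggregation: geomean per run (n samples each),
    then a plain mean/stddev over the m per-run scores."""
    scores = [geomean(r) for r in runs]
    return statistics.mean(scores), statistics.stdev(scores)

# Two fake tpaint-like runs (made-up numbers, in ms).
runs = [[250.0, 260.0, 255.0, 252.0], [251.0, 258.0, 254.0, 256.0]]
mean_score, sd = perfherder_score_guess(runs)
```

As noted above, none of the ways I munged the real logs with formulas
like these reproduced the displayed values, which is why a pointer to
the actual methodology would help.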
- Relatedly, my tpaint results are clearly much closer to a log-normal
distribution than to a normal distribution, which means using the
geomean is clearly better than the mean. But I'd like to be sure: is
that the case for all talos tests? Maybe that's my control-freak side
showing, but would it make sense to have distribution histograms/plots
on perfherder?
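For what it's worth, a crude way to check the log-normal hunch on any
test's samples (a heuristic of mine, not anything perfherder does) is
to compare the skewness of the raw samples with that of their logs: if
taking logs makes the distribution noticeably more symmetric, a
log-normal model, and therefore the geomean, is the better fit.

```python
import math
import statistics

def skewness(xs):
    # Standardized third moment (population flavor).
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

def log_transform_helps(samples):
    # True if the log-transformed samples are closer to symmetric than
    # the raw ones, suggesting log-normal fits better than normal.
    logged = abs(skewness([math.log(x) for x in samples]))
    return logged < abs(skewness(samples))

# Made-up right-skewed samples: their logs are evenly spaced, hence
# symmetric, so the check should favor the log-normal model.
samples = [math.exp(0.1 * i) for i in range(1, 21)]
```

A proper normality test on the logs would be more rigorous, but even
this cheap check, run per test, would answer the "is it log-normal
everywhere?" question.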
Sorry if some of those are already documented somewhere I didn't find,
but I guess I'd have been pointed there during our recent IRC
discussion if that were the case.
1. Interestingly, I had a 0.03 difference between the geomean of all
samples and the value perfherder displayed... but it turns out there is
*also* a 0.03 difference between the value displayed on the "all tests
and platforms" page and the "details" page for that particular try push
I looked at. See "Old geomean" for:
In fact, the new Geomean differs even more between the two pages.