> Some people have noted in the past that some Talos measurements are not
> representative of something that the users would see, the Talos numbers
> are noisy, and we don't have good tools to deal with these types of
> regressions. There might be some truth to all of these, but I believe
> that the bigger problem is that nobody owns watching over these numbers,
> and as a result we take regressions in some benchmarks which can
> actually be representative of what our users experience.
I was recently hit by most of the shortcomings you mentioned while
trying to upgrade clang. Fortunately, I found the issue on try, but I
will admit that comparing Talos results on try is something I only do
when I expect a problem.
I still intend to write a blog post once I am done with the update and
have more data, but here are some interesting points that have shown up
so far:
* compare-talos and compare.py were out of date. I was really lucky that
one of the benchmarks that still had the old name was the one that
showed the regression. I have started a script (bug 786504) that I hope
will be more resilient to future changes.
* our builds are *really* hard to reproduce. The build I was downloading
from try was faster than the one I was building locally. In despair, I
decided to fix at least part of this first. I found that our build
depends on the way the bots use ccache (they set CCACHE_BASEDIR, which
changes __FILE__), on the build directory (it shows up in debug info
that is not stripped), and on whether the file system is case sensitive
(see the first sketch after this list).
* testing on Linux showed even more bizarre cases where small changes
cause performance problems. In particular, adding a nop *after the last
ret* in a function would make the JS interpreter faster on SunSpider.
The nop was just enough to make the function size cross the next
16-byte boundary, and that changed the address of every function linked
after it (see the second sketch below).
* the histograms of some benchmarks don't look like a normal distribution
(https://plus.google.com/u/0/108996039294665965197/posts/8GyqMEZHHVR).
I still have to read the paper mentioned in the comments.
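
To make the __FILE__ point from the second bullet concrete, here is a
minimal sketch. The file name and the commands are made up for
illustration; this is not the code that actually differed. Compiling
the same source with a different path on the command line, which is
exactly what CCACHE_BASEDIR makes ccache do, changes the string that
__FILE__ and assert() embed, and with it the bytes of the object file:

  /* file_demo.c - toy example of a build output that depends on the
   * path the compiler was invoked with. */
  #include <assert.h>
  #include <stdio.h>

  int main(void) {
    /* __FILE__ expands to the file name exactly as it was given to
     * the compiler, so "cc -c src/file_demo.c" and
     * "cd src && cc -c file_demo.c" produce different strings. */
    printf("compiled as: %s\n", __FILE__);

    /* assert() embeds __FILE__ in its failure message, so even an
     * assert that never fires is enough to change the object file. */
    assert(1 + 1 == 2);
    return 0;
  }

Compile it from two different directories and cmp the two .o files to
see the difference. Scale that up to a whole tree and you get builds
that differ byte for byte depending on where they were made.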
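
The alignment effect in the third bullet is easier to believe with a
toy program. This is only a sketch (hypothetical file and function
names, x86 with GCC/clang inline asm, and the exact addresses depend on
your toolchain): padding one function so it crosses the next 16-byte
boundary shifts the start address of everything linked after it, even
though none of that later code changed.

  /* layout_demo.c - toy example of how growing one function moves the
   * address of every function placed after it. Hypothetical code, not
   * the SpiderMonkey sources. */
  #include <stdio.h>

  static int helper(int x) {
    int r = x * 2 + 1;
  #ifdef PAD
    /* 16 one-byte nops: enough to guarantee the function crosses the
     * next 16-byte boundary. */
    __asm__ volatile("nop; nop; nop; nop; nop; nop; nop; nop;"
                     "nop; nop; nop; nop; nop; nop; nop; nop");
  #endif
    return r;
  }

  static int hot_loop(int n) {
    /* Stands in for the code whose performance moved (the interpreter
     * loop): its body is unchanged, only its address is. */
    int s = 0;
    for (int i = 0; i < n; i++)
      s += helper(i);
    return s;
  }

  int main(void) {
    printf("helper   at %p\n", (void *)helper);
    printf("hot_loop at %p\n", (void *)hot_loop);
    printf("result: %d\n", hot_loop(1000));
    return 0;
  }

Build it twice, with and without -DPAD (e.g. "cc -O0 layout_demo.c" and
"cc -O0 -DPAD layout_demo.c"; -O0 keeps the source order and avoids
inlining), and the printed address of hot_loop changes even though its
code did not. That is the same mechanism as the nop after the ret: the
benchmark code stays the same, only where it lands in memory changes.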
> I don't believe that the current situation is acceptable, especially
> with the recent focus on performance (through the Snappy project), and I
> would like to ask people if they have any ideas on what we can do to fix
> this. The fix might be turning off some Talos tests if they're really
> not useful, asking someone or a group of people to go over these test
> results, get better tools with them, etc. But _something_ needs to
> happen here.
There are many things we can do to make perf debugging/testing better,
but I don't think that is the main thing we need to do to solve the
problem. The tools we have do work. Try is slow and Talos is noisy, but
it is possible to detect and debug regressions.
What I think we need to do is differentiate between tests that we
expect to match user experience and synthetic tests. Synthetic tests
*are* useful
as they can much more easily find what changed, even if it is something
as silly as the address of some function. The difference is that we
don't want to regress on the tests that match user experience. IMHO we
*can* regress on synthetic ones as long as we know what is going on. And
yes, if a particular synthetic test is too brittle then we should remove it.
With that distinction in place we can then handle perf regressions in a
similar way to how we handle test failures: revert the offending patch
and make the original developer responsible for tracking it down. If a
patch is known to regress a synthetic benchmark, a comment on the commit
along the lines of "renaming this file causes __FILE__ to change in an
assert message and produces a spurious regression on md5" should be
sufficient. It is not the developer's *fault* that this causes a
problem, but IMHO it should still be his responsibility to track it.
> Cheers,
> Ehsan
>
>
Cheers,
Rafael