> On Thursday, August 30, 2012 9:11:25 AM UTC-7, Ehsan Akhgari wrote:
>> On 12-08-29 9:20 PM, Dave Mandelin wrote:
>>> On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
>> In my opinion, one of the reasons why Talos is disliked is because many
>> people don't know where its code lives (hint:
>> http://hg.mozilla.org/build/talos/) and can't run those tests like other
>> test suites. I think this would be very valuable to fix, so that
>> developers can read Talos tests like any other test, and fix or improve
>> them where needed.
> It is hard to find. And beyond that, it seems hard to use. It's been a while since I've run Talos locally, but last time I did it was a pain to set up and difficult to run, and I hear it's still kind of like that. For testing tools, "convenient for the developer" is a critical requirement, but has been neglected in the past.
> js/src/jit-test/ is an example of something that is very convenient for developers: creating a test is just adding a .js file to a directory (no manifest or extra files; by default error or crash is a fail, but you can change that for a test), the harness is a Python file with nice options, the test configuration and basic usage is documented in a README, and it lives in the tree.
Absolutely! We really need to work hard to make them easier to run. I hear that the Automation team has already been making progress towards that goal.
>>>> [...] I believe
>>>> that the bigger problem is that nobody owns watching over these numbers,
>>>> and as a result we take regressions in some benchmarks which can
>>>> actually be representative of what our users experience.
>>> The interesting thing is that we basically have no idea if that's true for any given Talos alarm.
>> That's something that I think should be judged per benchmark. For
>> example, the Ts measurements will probably correspond very directly to
>> the startup time that our users experience. The Tp5 measurements don't
>> directly correspond to anything like that, since nobody loads those
>> pages sequentially, but it could be an indication of average page load
>> performance.
> I exaggerated a bit--yes, some tests like Ts are pretty easy to understand and do correspond to user experience. With Tp5, I just don't know--I haven't spent any time trying to use it or looking at regressions, since JS doesn't affect it.
Right. I think at the very least, on bigger tests like Tp5 we want to know if something is regressed by a large amount, because that is very likely to reflect an actual behavior change which is worth knowing about.
>>> - Speaking of false positives, we should seriously start tracking them. We should keep track of each Talos regression found and its outcome. (It would be great to track false negatives too but it's a lot harder to catch them and record them accurately.) That way we'd actually know whether we have a few false positives or a lot, or whether the false positives were coming up on certain tests. And we could use that information to improve the false positive rate over time.
>> I agree. Do you have any suggestions on how we would track them?
> The details would vary according to the preferences of the person doing it, but I'd sketch it out something like this: when Talos detects a regression, file a bug to "resolve" it (i.e., show that it's not a real regression, show that it's an acceptable regression for the patch, or fix the regression). Then keep a file listing those bugs (with metadata for each: tests regressed, date, component, etc), and as each is closed, mark down the result: false positive, allowed, backed out, or fixed. That's your data set. Of course, various parts of this could be automated but that's not required.
Oh, sorry, I needed to ask my question better. I'm specifically wondering who needs to track and investigate the regression if it happened on a range of, let's say, 5 committers...
On Thursday 2012-08-30 14:42 -0700, Taras Glek wrote:
> * Joel will revisit maintaining Talos within mozilla-central to
> reduce developer barriers to understanding what a particular Talos
> test result means. This should also make Talos easier to run
This will also solve one of the other problems that leads developers
to distrust talos, which is that a significant portion of the
performance regressions reported are (or at least were at one time)
the result of changes to the tests, but that changes to the tests
don't show up as part of the list of suspected causes of
regressions.
> Some people have noted in the past that some Talos measurements are not
> representative of something that the users would see, the Talos numbers
> are noisy, and we don't have good tools to deal with these types of
> regressions. There might be some truth to all of these, but I believe
> that the bigger problem is that nobody owns watching over these numbers,
> and as a result we take regressions in some benchmarks which can
> actually be representative of what our users experience.
I was recently hit by most of the shortcomings you mentioned while trying to upgrade clang. Fortunately, I found the issue on try, but I will admit that comparing talos on try is something I only do when I expect a problem.
I still intend to write a blog post once I am done with the update and have more data, but here are some interesting points that have shown up so far:
* compare-talos and compare.py were out of date. I was really lucky that one of the benchmarks that still had the old name was the one that showed the regression. I have started a script that I hope will be more resilient to future changes. (bug 786504).
* our builds are *really* hard to reproduce. The build I was downloading from try was faster than the one I was doing locally. In despair I decided to fix at least part of this first. I found that our build was depending on the way the bots use ccache (they set CCACHE_BASEDIR, which changes __FILE__), the build directory (it shows up in debug info that is not stripped), and whether the file system is case sensitive.
* testing on Linux showed even more bizarre cases where small changes cause performance problems. In particular, adding a nop *after the last ret* in a function would make the js interpreter faster on sunspider. The nop was just enough to make the function size cross the next 16-byte boundary, and that changed the address of every function linked after it.
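The alignment effect in the last point is plain address arithmetic. Here is a minimal sketch, assuming a linker that places functions back to back and rounds each start address up to a 16-byte boundary. The function sizes are made up for illustration and are not taken from the real build.

```python
ALIGN = 16  # assumed function alignment in bytes


def align_up(addr, align=ALIGN):
    """Round addr up to the next multiple of align."""
    return (addr + align - 1) // align * align


def layout(sizes, base=0):
    """Return the start address of each function when placed in order."""
    addrs = []
    addr = base
    for size in sizes:
        addr = align_up(addr)
        addrs.append(addr)
        addr += size
    return addrs


# Hypothetical function sizes in bytes; the first one exactly fills 32 bytes.
before = layout([32, 40, 24])
# Same functions, with a 1-byte nop appended to the first one.
after = layout([33, 40, 24])

print(before)  # [0, 32, 80]
print(after)   # [0, 48, 96]
```

In this toy layout one extra byte moves every later function by 16 bytes, which is the kind of shift that can change cache and branch behaviour in code the patch never touched.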
> I don't believe that the current situation is acceptable, especially
> with the recent focus on performance (through the Snappy project), and I
> would like to ask people if they have any ideas on what we can do to fix
> this. The fix might be turning off some Talos tests if they're really
> not useful, asking someone or a group of people to go over these test
> results, get better tools with them, etc. But _something_ needs to
> happen here.
There are many things we can do to make perf debugging/testing better, but I don't think that is the main thing we need to do to solve the problem. The tools we have do work. Try is slow and talos is noisy, but it is possible to detect and debug regressions.
What I think we need to do is differentiate between tests that we expect to match user experience and synthetic tests. Synthetic tests *are* useful as they can much more easily find what changed, even if it is something as silly as the address of some function. The difference is that we don't want to regress on the tests that match user experience. IMHO we *can* regress on synthetic ones as long as we know what is going on. And yes, if a particular synthetic test is too brittle then we should remove it.
With the distinction in place we can then handle perf regressions in a similar way to how we handle test failures: revert the offending patch and make the original developer responsible for tracking it down. If a patch is known to regress a synthetic benchmark, a comment on the commit along the lines of "renaming this file causes __FILE__ to change in an assert message and produces a spurious regression on md5" should be sufficient. It is not the developer's *fault* that that causes a problem, but IMHO it should still be his responsibility to track it.
> On Thursday 2012-08-30 14:42 -0700, Taras Glek wrote:
>> * Joel will revisit maintaining Talos within mozilla-central to
>> reduce developer barriers to understanding what a particular Talos
>> test result means. This should also make Talos easier to run
> This will also solve one of the other problems that leads developers
> to distrust talos, which is that a significant portion of the
> performance regressions reported are (or at least were at one time)
> the result of changes to the tests, but that changes to the tests
> don't show up as part of the list of suspected causes of
> regressions.
This means that changes to the Talos suite *are* associated with a
mozilla-central revision, have tests run for them, can be backed out,
can ride trains, etc.
On Thursday, August 30, 2012 2:54:55 PM UTC-7, Ehsan Akhgari wrote:
> Oh, sorry, I needed to ask my question better. I'm specifically wondering who needs to track and investigate the regression if it happened on a range of, let's say, 5 committers...
Ah. I believe that's a job for a bugmaster, a position that we don't have filled at the moment. We need one. Perhaps one or more people in QA can step into part of that role, possibly temporarily.
Otherwise, it seems we just have to share the pain. Bisecting changesets is not necessarily an enjoyable job but it is a necessary one. I would suggest that sheriffs pick one of the 5 committers and ask that person to bisect the change and try not to pick the same person repeatedly (unless that person keeps landing the regressions!).
> Otherwise, it seems we just have to share the pain. Bisecting changesets is not necessarily an enjoyable job but it is a necessary one. I would suggest that sheriffs pick one of the 5 committers and ask that person to bisect the change and try not to pick the same person repeatedly (unless that person keeps landing the regressions!).
Finding an offending commit within n commits is scriptable.
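As a rough illustration of why this is scriptable, here is a minimal sketch of a binary search over a push range. It assumes the commits are in landing order, that the last one is known to be bad, and that once a commit is bad every later one is too. The commit names, the Ts numbers and the 515 ms cutoff are all hypothetical; Mercurial's own `hg bisect` performs the same kind of search over real history.

```python
def find_first_bad(commits, is_bad):
    """Binary-search for the first commit where is_bad(commit) is true.

    Assumes commits are in landing order, the last one is known bad, and
    once a commit is bad every later one is bad too.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Hypothetical range of 5 pushes with made-up Ts numbers in milliseconds.
commits = ["a1", "b2", "c3", "d4", "e5"]
ts_ms = {"a1": 500, "b2": 502, "c3": 530, "d4": 529, "e5": 531}

culprit = find_first_bad(commits, lambda c: ts_ms[c] > 515)
print(culprit)  # c3
```

With 5 commits this needs at most 3 measurements. The expensive part in practice is `is_bad`, which has to build the commit and run the benchmark enough times to beat the noise.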
I think tracking and investigating is all of our responsibility. QA definitely has a role to play and I think we've been playing that role to a certain extent. We don't always have the skills, knowledge, experience, or time to help but we always try and we are always willing to learn. We rely on Release Management to keep us apprised of what's important and we rely on developers to help us understand the code, tools, and testcases.
Having a Bugmaster will certainly improve things but I don't think it eliminates the necessity, nor the desire for this collaborative dynamic.
----- Original Message -----
From: "Dave Mandelin" <dmande...@gmail.com>
To: dev-platf...@lists.mozilla.org
Cc: "Dave Mandelin" <dmande...@gmail.com>, dev-platf...@lists.mozilla.org
Sent: Thursday, August 30, 2012 6:13:33 PM
Subject: Re: The current state of Talos benchmarks
On Thursday, August 30, 2012 2:54:55 PM UTC-7, Ehsan Akhgari wrote:
> Oh, sorry, I needed to ask my question better. I'm specifically wondering who needs to track and investigate the regression if it happened on a range of, let's say, 5 committers...
Ah. I believe that's a job for a bugmaster, a position that we don't have filled at the moment. We need one. Perhaps one or more people in QA can step into part of that role, possibly temporarily.
Otherwise, it seems we just have to share the pain. Bisecting changesets is not necessarily an enjoyable job but it is a necessary one. I would suggest that sheriffs pick one of the 5 committers and ask that person to bisect the change and try not to pick the same person repeatedly (unless that person keeps landing the regressions!).
> This means that changes to the Talos suite *are* associated with a
> mozilla-central revision, have tests run for them, can be backed out,
> can ride trains, etc.
I have backed out changes made to talos and the tests a few times due to performance regressions. While I might not catch every one, we do treat talos changes as another changeset in m-c.
If there is an expected shift in numbers, we create a new test. This is why there are 5+ versions of all the tests. It really adds a lot of overhead and breakage (e.g. compare-talos), but this way we don't confuse the old test data with the new adjusted tests.
Sorry to continue beating this horse, but I don't think it's quite dead yet:
One of the best things we could do to make finding these regressions
easier is to never coalesce Talos on mozilla-inbound. It's crazy to
waste developer time bisecting Talos locally when we don't run it on
every push.
Another thing that would help a lot is fixing bug 752002, so people
will stop filtering the e-mails.
On Thu, Aug 30, 2012 at 6:42 PM, Taras Glek <tg...@mozilla.com> wrote:
> Hi,
> We had a quick meeting focused on how to not regress Talos.
> Attendance: Taras Glek, Ehsan Akhgari, Clint Talbert, Nathan Froyd, Dave
> Mandelin, Christina Choi, Joel Maher
> Notes:
> * Clint's Automation&Tools team is improving Talos reporting abilities. We
> should have much better tools for comparing performance between 2 different
> changesets soon.
> * Talos is now significantly easier to run locally than it used to be.
> Expect blog posts from Joel/Ehsan
> * Joel will revisit maintaining Talos within mozilla-central to reduce
> developer barriers to understanding what a particular Talos test result
> means. This should also make Talos easier to run
> * We will implement a formal policy on Talos impact of merges.
> ** focus on perf tracking on inbound, to avoid merge pains
> ** We will extend the current merge criteria of last green PGO changeset to
> also include "good Talos numbers"
> * Nathan Froyd will look at historical data for the last Firefox nightly
> release cycle to come up with threshold numbers for backing out commits
> * Joel/Ehsan will look into using mozregression with talos so we can bisect
> performance regressions locally. We will also consider doing something
> similar on try.
> IMHO we *can* regress on synthetic ones as long as we know what is going on.
It's the requirement that we know what is going on that I think is unreasonable.
Indeed, we /have/ a "no not-understood regressions" policy, IIRC. The
extent to which it's being ignored is at least partially indicative of
how difficult these changes can be to track down. Rafael's post has
some great examples of how insane tracking down perf regressions can
be.
I really don't think that the right way to go about fixing our
proclivity to regress Talos is to "get tough on regressions" and make
this every committer's problem. We shouldn't expect committers to
track down the fact that "my change pushes X function down 16 bytes,
which changes some other function's alignment, which, in combination
with a change to __FILE__, affects benchmark Y" as a regular part of
their job. And it's not clear to me that we would have any tests left if
we eliminated from the tree all tests which are affected by this sort
of thing.
I think the right way to go about this is to first investigate which
tests are stable, and how stable they are (*). Then a team of
engineers can gain some experience finding and understanding
regressions which occur over some period of time, so we can understand
how feasible it would be to seriously ask developers to do this as a
part of their day-to-day jobs.
I'm not saying it should be OK to regress our performance tests, as a
rule. But I think we need to acknowledge that hunting regressions can
be time-consuming, and that a policy requiring that all regressions be
understood may hamstring our ability to get anything else done.
There's a trade-off here that we seem to be ignoring.
> I'm not saying it should be OK to regress our performance tests, as a
> rule. But I think we need to acknowledge that hunting regressions can
> be time-consuming, and that a policy requiring that all regressions be
> understood may hamstring our ability to get anything else done.
> There's a trade-off here that we seem to be ignoring.
There is definitely a trade-off here, and at least for the past year (and maybe for the past two years) we have in practice given so much weight to the difficulty of tracking down performance regressions that we've been ignoring them (except for perhaps a few people).
It is a mistake to take Rafael's example and extend it to the average regression that we measure on Talos. It's true that sometimes those things happen, and in practice we cannot deal with them all, because we don't have an army of Rafaels. But it bothers me when people take an example of a very difficult to understand regression encountered by a person who bravely dwells with low-level compiler code generation stuff and extend it to come up with a policy covering all regressions. Please, let's not do that.
And let's remember the other side of the trade-off too. A lot of blood and tears has gone into shaving off milliseconds from our startup time. Taking a ~5% hit on startup time within a 6-week cycle effectively means that we have undone man-months of optimizations which have happened to the startup time. So it's not like letting these regressions in beneath our noses is going to make us all more productive.
There are extremely non-stable Talos tests, and relatively stable ones. Let's focus on the relatively stable ones. There are extremely hard to diagnose performance regressions, and extremely easy ones (e.g., let's not wait on this lock, do this I/O, run this exponential algorithm, load tons of XUL/XBL when a window opens, etc.) We have many great tools for the job, so not all regressions need to be treated the same.
> Sorry to continue beating this horse, but I don't think it's quite dead yet:
> One of the best things we could do to make finding these regressions
> easier is to never coalesce Talos on mozilla-inbound. It's crazy to
> waste developer time bisecting Talos locally when we don't run it on
> every push.
In order to help kill that horse, I filed bug 787447 and CCed John on it. :-)
On 31/08/12 11:32 AM, Ehsan Akhgari wrote:
> There are extremely non-stable Talos tests, and relatively stable ones.
> Let's focus on the relatively stable ones. There are extremely hard
> to diagnose performance regressions, and extremely easy ones (i.e.,
> let's not wait on this lock, do this I/O, run this exponential
> algorithm, load tons of XUL/XBL when a window opens, etc.) We have many
> great tools for the job, so not all regressions need to be treated the
> same.
What value do the extremely non-stable Talos tests have? Shouldn't we stop running them if they're not giving useful information?
> There are extremely non-stable Talos tests, and relatively stable ones.
> Let's focus on the relatively stable ones.
It's not exclusively a question of noise in the tests. Even
regressions in stable tests are sometimes hard to track down. I spent
two months trying to figure out why I could not reproduce a Dromaeo
regression I saw on m-i using try, and eventually gave up (bug
653961).
It's great if we try to track down this mysterious 5% startup
regression. We shouldn't ignore important regressions. But what I
object to is the idea that if I regress Dromaeo DOM by 2%, I'm
automatically backed out and prevented from doing any work until I
prove that the problem is that I changed a filename somewhere.
On Fri, Aug 31, 2012 at 12:32 PM, Ehsan Akhgari <ehsan.akhg...@gmail.com> wrote:
> On 12-08-31 6:01 AM, Justin Lebar wrote:
>> I'm not saying it should be OK to regress our performance tests, as a
>> rule. But I think we need to acknowledge that hunting regressions can
>> be time-consuming, and that a policy requiring that all regressions be
>> understood may hamstring our ability to get anything else done.
>> There's a trade-off here that we seem to be ignoring.
> There is definitely a trade-off here, and at least for the past year (and
> maybe for the past two years) we have in practice been weighing on the side
> of the difficulty of tracking down performance regression to the point that
> we've been ignoring them (except for perhaps a few people.)
> It is a mistake to take Rafael's example and extend it to the average
> regression that we measure on Talos. It's true that sometimes those things
> happen, and in practice we cannot deal with them all, because we don't have
> an army of Rafaels. But it bothers me when people take an example of a very
> difficult to understand regression encountered by a person who bravely
> dwells with low-level compiler code generation stuff and extend it to come
> up with a policy covering all regressions. Please, let's not do that.
> And let's remember the other side of the trade-off too. A lot of blood and
> tears has gone into shaving off milliseconds from our startup time. Taking
> a ~5% hit on startup time within a 6-week cycle effectively means that we
> have undone man-months of optimizations which have happened to the startup
> time. So it's not like letting these regressions in beneath our noses is
> going to make us all more productive.
> There are extremely non-stable Talos tests, and relatively stable ones.
> Let's focus on the relatively stable ones. There are extremely hard to
> diagnose performance regressions, and extremely easy ones (i.e., let's not
> wait on this lock, do this I/O, run this exponential algorithm, load tons of
> XUL/XBL when a window opens, etc.) We have many great tools for the job, so
> not all regressions need to be treated the same.
Another concern I have read in this thread and have heard over the last few months is why are we even running these tests as they are old, irrelevant and nobody looks at them. A valid concern and something I have asked myself many times while working on Talos. I took it upon myself earlier this summer to find a developer who is a point of contact for each and every test we run. Then we figured out if the tests were relevant and testing things we care about. Many tests have been updated/added/disabled in the last couple months.
A similar complaint is about the noise in the numbers and how we can realistically detect a regression or gain value. For minor regressions our current toolchain will not be very effective. A lot of work has been done to look into how we run tests, the tools we use and whether we can apply different models to the numbers to gain more reliable data. Most of that work is documented in the Signal from Noise project: https://wiki.mozilla.org/Auto-tools/Projects/Signal_From_Noise. I encourage folks to join the public meetings we have to learn more about how we are actually solving this problem.
Back on subject, we want to pin regressions down to the exact changeset as well as reduce the false positives that get mailed to dev.tree-management. There is probably no silver bullet or policy we can create today which will fix our problems. There is a big lag between a current patch's run of talos and when we get a notification in dev.tree-management. Large regressions can be detected by visually looking at graph server (we have links to everything from tbpl), but small regressions only show up over time, as a minor increase could look like the regular noise we have in our numbers.
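To make the "small regression versus noise" problem concrete, here is a minimal sketch of one simple heuristic: compare the mean of the runs after a push with the mean of the runs before it, and only flag a shift that exceeds `k` standard deviations of the earlier runs. This is an illustrative assumption, not how Talos, graph server or the Signal From Noise work actually computes anything, and all numbers are made up.

```python
from statistics import mean, stdev


def looks_like_regression(before, after, k=3.0):
    """Flag a shift when the mean of `after` exceeds the mean of `before`
    by more than k standard deviations of `before`."""
    threshold = mean(before) + k * stdev(before)
    return mean(after) > threshold


# Made-up timings in milliseconds (higher is worse).
baseline = [500, 503, 498, 501, 499, 502]
small_shift = [502, 504, 501, 503, 502, 503]   # about +2 ms
large_shift = [520, 523, 519, 521, 522, 520]   # about +20 ms

print(looks_like_regression(baseline, small_shift))  # False
print(looks_like_regression(baseline, large_shift))  # True
```

The 2 ms shift is real in this data set but hides inside the threshold, which is why small regressions only become visible once many more runs have accumulated.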
Coming from a talos tool maintainer perspective, I am committed to making talos easy to run and documented so we can all work on fixing regressions instead of offering sacrifices to the try server. When there are requests for features, fixes or test adjustments somebody on the A*Team usually will resolve it quickly. While this only solves some of the pain, it is a step in the right direction until Signal From Noise can come out and solve a large portion of the other problems.
On Thursday, August 30, 2012 9:11:25 AM UTC-7, Ehsan Akhgari wrote:
> On 12-08-29 9:20 PM, Dave Mandelin wrote:
> > On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
> In my opinion, one of the reasons why Talos is disliked is because many people don't know where its code lives (hint: http://hg.mozilla.org/build/talos/) and can't run those tests like other test suites. I think this would be very valuable to fix, so that developers can read Talos tests like any other test, and fix or improve them where needed.
It is hard to find. And beyond that, it seems hard to use. It's been a while since I've run Talos locally, but last time I did it was a pain to set up and difficult to run, and I hear it's still kind of like that. For testing tools, "convenient for the developer" is a critical requirement, but has been neglected in the past.
js/src/jit-test/ is an example of something that is very convenient for developers: creating a test is just adding a .js file to a directory (no manifest or extra files; by default error or crash is a fail, but you can change that for a test), the harness is a Python file with nice options, the test configuration and basic usage is documented in a README, and it lives in the tree.
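For illustration only, the "drop a file in a directory" convention can be sketched in a few lines of Python. This is not the jit-test harness; `run_tests` and its `shell_cmd` parameter are hypothetical names, and the sketch only covers the default rule described above (error or crash, seen here as a nonzero exit status, is a fail).

```python
import subprocess
from pathlib import Path


def run_tests(test_dir, shell_cmd):
    """Run every .js file in test_dir with shell_cmd; nonzero exit = fail.

    shell_cmd is a list such as ["/path/to/js"], where the path is a
    placeholder for whatever JS shell is being tested.
    """
    results = {}
    for path in sorted(Path(test_dir).glob("*.js")):
        proc = subprocess.run(shell_cmd + [str(path)], capture_output=True)
        results[path.name] = (proc.returncode == 0)
    return results
```

Because the test list is just the directory contents, adding a test needs no manifest edit, which is the convenience being described.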
> >> [...] I believe
> >> that the bigger problem is that nobody owns watching over these numbers,
> >> and as a result we take regressions in some benchmarks which can
> >> actually be representative of what our users experience.
> > The interesting thing is that we basically have no idea if that's true for any given Talos alarm.
> That's something that I think should be judged per benchmark. For example, the Ts measurements will probably correspond very directly to the startup time that our users experience. The Tp5 measurements don't directly correspond to anything like that, since nobody loads those pages sequentially, but it could be an indication of average page load performance.
I exaggerated a bit--yes, some tests like Ts are pretty easy to understand and do correspond to user experience. With Tp5, I just don't know--I haven't spent any time trying to use it or looking at regressions, since JS doesn't affect it.
> >> I don't believe that the current situation is acceptable, especially
> >> with the recent focus on performance (through the Snappy project), and I
> >> would like to ask people if they have any ideas on what we can do to fix
> >> this. The fix might be turning off some Talos tests if they're really
> >> not useful, asking someone or a group of people to go over these test
> >> results, get better tools with them, etc. But _something_ needs to
> >> happen here.
> > - Second, as you say, get an owner for performance regressions. There are lots of ways we could do this. I think it would integrate fairly easily into our existing processes if we (automatically or by a designated person) filed a bug for each regression and marked it tracking (so the release managers would own followup). Alternately, we could have a designated person own followup. I'm not sure if that has any advantages, but release managers would probably know. But doing any of this is going to severely annoy engineers unless we get the false positive rate under control.
> Note that some of the work to differentiate between false positives and real regressions needs to be done by the engineers, similar to the work required to investigate correctness problems. And people need to accept that seemingly benign changes may also cause real performance regressions, so it's not always possible to glance over a changeset and say "nah, this can't be my fault." :-)
Agreed.
> > - Speaking of false positives, we should seriously start tracking them. We should keep track of each Talos regression found and its outcome. (It would be great to track false negatives too but it's a lot harder to catch them and record them accurately.) That way we'd actually know whether we have a few false positives or a lot, or whether the false positives were coming up on certain tests. And we could use that information to improve the false positive rate over time.
> I agree. Do you have any suggestions on how we would track them?
The details would vary according to the preferences of the person doing it, but I'd sketch it out something like this: when Talos detects a regression, file a bug to "resolve" it (i.e., show that it's not a real regression, show that it's an acceptable regression for the patch, or fix the regression). Then keep a file listing those bugs (with metadata for each: tests regressed, date, component, etc), and as each is closed, mark down the result: false positive, allowed, backed out, or fixed. That's your data set. Of course, various parts of this could be automated but that's not required.
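A minimal sketch of what such a data set could look like, assuming one record per regression bug with the metadata and outcomes listed above. The class name, field names and bug ids are placeholders for illustration, not real Bugzilla data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RegressionBug:
    bug_id: int            # placeholder id, not a real Bugzilla bug
    test: str
    component: str
    date: str
    outcome: str = "open"  # "false positive", "allowed", "backed out", "fixed"


def false_positive_rate(bugs):
    """Per-test fraction of closed bugs that were false positives."""
    closed = Counter()
    false_pos = Counter()
    for bug in bugs:
        if bug.outcome == "open":
            continue  # still being investigated, so not counted yet
        closed[bug.test] += 1
        if bug.outcome == "false positive":
            false_pos[bug.test] += 1
    return {test: false_pos[test] / n for test, n in closed.items()}


log = [
    RegressionBug(1, "Ts", "Startup", "2012-08-01", "fixed"),
    RegressionBug(2, "Ts", "Startup", "2012-08-09", "false positive"),
    RegressionBug(3, "Tp5", "Layout", "2012-08-15", "false positive"),
    RegressionBug(4, "Tp5", "Layout", "2012-08-28"),  # still open
]

print(false_positive_rate(log))  # {'Ts': 0.5, 'Tp5': 1.0}
```

Even a hand-maintained file in this shape would answer the question raised earlier of whether false positives cluster on particular tests.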
> On 31/08/12 11:32 AM, Ehsan Akhgari wrote:
> > There are extremely non-stable Talos tests, and relatively stable ones.
> > Let's focus on the relatively stable ones. There are extremely hard
> > to diagnose performance regressions, and extremely easy ones (i.e.,
> > let's not wait on this lock, do this I/O, run this exponential
> > algorithm, load tons of XUL/XBL when a window opens, etc.) We have many
> > great tools for the job, so not all regressions need to be treated the
> > same.
> What value do the extremely non-stable Talos tests have? Shouldn't we
> stop running them if they're not giving useful information?
Either that, or find some way of making them more stable, such as not measuring the wall clock time.
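As a rough illustration of the difference (a sketch only; nothing here is Talos code, and `busy_then_wait` is a made-up workload): Python's `time.perf_counter()` reports elapsed wall clock time, while `time.process_time()` counts only the CPU time used by the process, so time spent sleeping or blocked is left out.

```python
import time


def busy_then_wait(wait_seconds=0.05):
    """Hypothetical workload: a bit of computation, then a blocking wait."""
    total = sum(i * i for i in range(50_000))
    time.sleep(wait_seconds)  # stands in for I/O or scheduler delays
    return total


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu
```

Calling `measure(busy_then_wait)` should give a wall time that includes the sleep and a much smaller CPU time. Whether CPU time is the right metric for a given test is a separate question, since some regressions are exactly about waiting.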
> On 12-08-31 11:45 AM, Chris AtLee wrote:
>> On 31/08/12 11:32 AM, Ehsan Akhgari wrote:
>> > There are extremely non-stable Talos tests, and relatively stable ones.
>> > Let's focus on the relatively stable ones. There are extremely hard
>> > to diagnose performance regressions, and extremely easy ones (i.e.,
>> > let's not wait on this lock, do this I/O, run this exponential
>> > algorithm, load tons of XUL/XBL when a window opens, etc.) We have many
>> > great tools for the job, so not all regressions need to be treated the
>> > same.
>> What value do the extremely non-stable Talos tests have? Shouldn't we
>> stop running them if they're not giving useful information?
> Either that, or find some way of making them more stable, such as not
> measuring the wall clock time.
Sure, that sounds like a great project. Until that's finished, is there any value to running these suites, or are they expensive random number generators?
> On 31/08/12 03:59 PM, Ehsan Akhgari wrote:
>> On 12-08-31 11:45 AM, Chris AtLee wrote:
>>> On 31/08/12 11:32 AM, Ehsan Akhgari wrote:
>>> > There are extremely non-stable Talos tests, and relatively stable ones.
>>> > Let's focus on the relatively stable ones. There are extremely hard
>>> > to diagnose performance regressions, and extremely easy ones (i.e.,
>>> > let's not wait on this lock, do this I/O, run this exponential
>>> > algorithm, load tons of XUL/XBL when a window opens, etc.) We have many
>>> > great tools for the job, so not all regressions need to be treated the
>>> > same.
>>> What value do the extremely non-stable Talos tests have? Shouldn't we
>>> stop running them if they're not giving useful information?
>> Either that, or find some way of making them more stable, such as not
>> measuring the wall clock time.
> Sure, that sounds like a great project. Until that's finished, is there
> any value to running these suites, or are they expensive random number
> generators?
I think that is something that needs to be evaluated on a per-test per-platform basis, hopefully by someone who knows a bit about statistics. :-)
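A very first pass at such an evaluation could be sketched like this: compute the coefficient of variation (standard deviation divided by mean) for each test/platform pair and sort the noisiest to the top. The data layout and the function name are assumptions made for this example, and a real analysis would need far more care than a single ratio.

```python
import statistics


def noise_report(samples):
    """samples maps (test, platform) -> list of results from runs of the
    same build. Returns [((test, platform), cv), ...], noisiest first,
    where cv = stdev / mean."""
    report = []
    for key, values in samples.items():
        mean = statistics.mean(values)
        cv = statistics.stdev(values) / mean
        report.append((key, cv))
    report.sort(key=lambda item: item[1], reverse=True)
    return report
```

A pair whose coefficient of variation is larger than the size of the regressions we care about cannot flag those regressions from a single run, which is one way to decide where more runs or a different metric are needed.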
On Saturday, September 1, 2012 10:08:53 AM UTC-4, Ehsan Akhgari wrote:
> On 12-08-31 4:03 PM, Chris AtLee wrote:
> > On 31/08/12 03:59 PM, Ehsan Akhgari wrote:
> >> On 12-08-31 11:45 AM, Chris AtLee wrote:
> >>> On 31/08/12 11:32 AM, Ehsan Akhgari wrote:
> >>> > There are extremely non-stable Talos tests, and relatively stable ones.
> >>> > Let's focus on the relatively stable ones. There are extremely hard
> >>> > to diagnose performance regressions, and extremely easy ones (i.e.,
> >>> > let's not wait on this lock, do this I/O, run this exponential
> >>> > algorithm, load tons of XUL/XBL when a window opens, etc.) We have many
> >>> > great tools for the job, so not all regressions need to be treated the
> >>> > same.
> >>> What value do the extremely non-stable Talos tests have? Shouldn't we
> >>> stop running them if they're not giving useful information?
> >> Either that, or find some way of making them more stable, such as not
> >> measuring the wall clock time.
> > Sure, that sounds like a great project. Until that's finished, is there
> > any value to running these suites, or are they expensive random number
> > generators?
> I think that is something that needs to be evaluated on a per-test
> per-platform basis, hopefully by someone who knows a bit about
> statistics. :-)
> Cheers,
> Ehsan
We are detecting regressions with Talos despite the high level of noise. So while it may look like a waste of machine resources to some, Talos serves a purpose. Having people look at the results more often will solve many of the problems.
I would say a handful of tests/counters on certain platforms are not very useful given the way we currently report numbers.
Taras Glek wrote:
> * Joel will revisit maintaining Talos within mozilla-central to reduce
> developer barriers to understanding what a particular Talos test result
> means. This should also make Talos easier to run
To call out this point explicitly:
I'm not convinced that folding it into m-c is the necessary way forward. Before folding any of our "stable" but external-to-m-c repos into the tree, I think we should start a community discussion on general guidelines for why or why not to do that, and THEN evaluate those against why we want Talos, what goals we are trying to achieve, etc.
I don't feel that the goal to "reduce developer barriers to understanding what a particular Talos test result means" is helped by this. If anyone thinks it is, can you try to articulate why here in this thread?
[I note that jhammel and I, at least, were also discussing this in the bug about moving Talos to m-c, and we both agree it does not belong as an in-bug discussion. If the Talos module owner feels the move is necessary, it should not be blocked on an external process, but I do feel we should think hard about this.]