proposal: replace talos with inline tests


Jim Mathies

Mar 4, 2013, 8:15:56 AM
to dev-pl...@lists.mozilla.org
For metrofx we’ve been working on getting omtc and apzc running in the browser. One of the things we need to be able to do is run performance tests that tell us whether or not the work we’re doing is having a positive effect on perf. We currently don’t have automated tests up and running for metrofx and talos is even farther off.

So to work around this I’ve been putting together some basic perf tests I can use to measure performance using the mochitest framework. I’m wondering if this might be a useful long-term answer to our perf testing problems.

Putting together talos tests is a real pain. You have to write a new test using the talos framework (which is a separate repo from mc), test the test to be sure it’s working, file rel eng bugs to get it integrated into talos test runs, populated in graph server, and tested via staging to be sure everything is working right. Overall the overhead here seems way too high.

Maybe we should consider changing this system so devs can write performance tests that suit their needs that are integrated into our main repo? Basically:

1) rework graphs server to be open ended so that it can accept data from test runs within our normal test frameworks.
2) develop a test module that can be included in tests that allows test writers to post performance data to graph server.
3) come up with a good way to manage the life cycle of active perf tests so graph server doesn’t become polluted.
4) port existing talos tests over to the mochitest framework.
5) drop talos.

Curious what people think of this idea.

Jim

Ed Morley

Mar 4, 2013, 8:42:39 AM
to Jim Mathies, auto-...@mozilla.com, dev-pl...@lists.mozilla.org
(CCing auto-...@mozilla.com)

jmaher and jhammel will be able to comment more on the talos specifics,
but few thoughts off the top of my head:

It seems like we're conflating multiple issues here:
1) "[talos] is a separate repo from mc"
2) "[it's a hassle to] test the test to be sure it’s working"
3) "[it's a hassle to get results] populated in graph server"
4) "[we need to] come up with a good way to manage the life cycle of
active perf tests so graph server doesn’t become polluted"

Switching from the talos harness to mochitest doesn't fix #2 (we still
have to test, and I don't see how it magically becomes any easier
without extra work - that could have been applied to talos instead) or
#3/#4 (orthogonal problem). It also seems like a brute force way of
fixing #1 (we could just check talos into mozilla-central).

Instead, I think we should be asking:
1) Is the best test framework for performance testing: [a] talos (with
improvements), [b] mochitest (with a significant amount of work to make
it compatible), or [c] a brand new framework?
2) Regardless of framework used, would checking it into mozilla-central
improve dev workflow enough to outweigh the downsides (see bug 787200
for history on that discussion)?
3) Regardless of framework used, how can we make the
development/testing/staging cycle less painful?
4) Regardless of framework used, who should be responsible for ensuring
we actively prune performance tests that are no longer relevant?

Note also that graphs.mozilla.org will be deprecated soon, in favour
of datazilla - which afaik is less painful for adding new test suites
(eg doesn't need manual database changes); jeads can say more on that
front.

Best wishes,

Ed
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform

Joel Maher

Mar 4, 2013, 8:59:41 AM
to Ed Morley, auto-...@mozilla.com, Jim Mathies, dev-pl...@lists.mozilla.org
Some thoughts on the subject-

I would argue against running performance tests inside of mochitest. The main reason is that mochitest has a lot of profile stuff set up for testing, as well as many other tests bundled inside the same browser session. For a standalone metric unrelated to a user scenario, though, we could consider putting performance-style tests into mochitest.

In the process of creating Datazilla, we have found endless little quirks in the end-to-end system of how performance testing works. As time goes on we have continued to push forward with the goal of making a performance system that can detect regressions automatically when the test finishes.

For the last few months we have had data going both to Datazilla and graph server and have been refining our assumptions and tools along the way. When graph server is deprecated in the near future, it will be REALLY EASY to add new tests to the collection and reporting system. That doesn't solve the problem of making it easy to add or adjust a test in the test runners (buildbot scripts), but it solves half the problem.

Many of the talos tests are old and outdated, and while we have tried to find owners for the tests, it has been a failing effort. To that end, we have disabled some Talos tests which nobody had interest in anymore. If there are tests which people feel are not useful, we should disable those ASAP to reduce the load on our infrastructure and work on creating tests which people care about.

-Joel

Jim Mathies

Mar 4, 2013, 9:16:31 AM
to dev-pl...@lists.mozilla.org
Good points, comments below.

"Ed Morley" <emo...@mozilla.com> wrote in message
news:<mailman.1992.13624045...@lists.mozilla.org>...
> (CCing auto-...@mozilla.com)
>
> jmaher and jhammel will be able to comment more on the talos specifics,
> but few thoughts off the top of my head:
>
> It seems like we're conflating multiple issues here:
> 1) "[talos] is a separate repo from mc"
> 2) "[it's a hassle to] test the test to be sure it’s working"
> 3) "[it's a hassle to get results] populated in graph server"
> 4) "[we need to] come up with a good way to manage the life cycle of
> active perf tests so graph server doesn’t become polluted"
>
> Switching from the talos harness to mochitest doesn't fix #2 (we still
> have to test, and I don't see how it magically becomes any easier without
> extra work - that could have been applied to talos instead)

I disagree here; very few devs are familiar with the talos framework and
what it takes to get a new test written, while everyone is very familiar with
mochitest and the other related test frameworks on mc. I can write a mochitest
to test perf in something simple like scrolling in about an hour. Putting
together a talos scroll test would take much longer. If Talos were on mc it
would help, but integrating into the existing test frameworks we have and use on
a regular basis seems like the simplest approach with the least amount of
overhead.

> Instead, I think we should be asking:
> 1) Is the best test framework for performance testing: [a] talos (with
> improvements), [b] mochitest (with a significant amount of work to make it
> compatible), or [c] a brand new framework?

On [b] there might be a significant amount of work in getting the infra pieces
to work (like graph server, or whatever we plan to replace it with), but
not in writing an import module that devs would use to post data.

> 2) Regardless of framework used, would checking it into mozilla-central
> improve dev workflow enough to outweigh the downsides (see bug 787200 for
> history on that discussion)?

We might want to keep talos around for "big, important tests". But I
think devs need a way to run perf tests on a smaller scale that doesn't
involve infra changes. I think having this ability would be a big win for
us.

Jim


Boris Zbarsky

Mar 4, 2013, 10:00:35 AM
On 3/4/13 8:15 AM, Jim Mathies wrote:
> So to work around this I’ve been putting together some basic perf tests I can use to measure performance using the mochitest framework.

How are you dealing with the fact that mochitest runs on heterogeneous
hardware (including VMs and the like last I checked, which could have
arbitrarily bad (or good!) performance characteristics depending on what
else is happening with the host system)?

> Maybe we should consider changing this system so devs can write performance tests that suit their needs that are integrated into our main repo? Basically:
>
> 1) rework graphs server to be open ended so that it can accept data from test runs within our normal test frameworks.
> 2) develop of test module that can be included in tests that allows test writers to post performance data to graph server.
> 3) come up with a good way to manage the life cycle of active perf tests so graph server doesn’t become polluted.
> 4) port existing talos tests over to the mochitest framework.
> 5) drop talos.

This sounds plausible, modulo the inability to port Tp in its current
state to a setup that involves the tests living in m-c, as long as the
problem above is kept in mind. Basically, reusing something
mochitest-like for developer familiarity may make sense, but it would
need to be a separate test suite run on completely separate test slaves
that are actually set up with performance testing in mind. A separate
test suite which is like mochitest is not a problem per se (we have the
ipcplugins, chrome, browserchrome, a11y tests already).

So the main win would be making it easier to add new tests in terms of
number of actions to be taken (something it seems like we could improve
with the current Talos setup too) and easier for developers to add tests
because the framework is already similar, right?

-Boris

Gregory Szorc

Mar 4, 2013, 12:36:03 PM
to Jim Mathies, dev-pl...@lists.mozilla.org
On 3/4/13 5:15 AM, Jim Mathies wrote:
> For metrofx we’ve been working on getting omtc and apzc running in the browser. One of the things we need to be able to do is run performance tests that tell us whether or not the work we’re doing is having a positive effect on perf. We currently don’t have automated tests up and running for metrofx and talos is even farther off.
>
> So to work around this I’ve been putting together some basic perf tests I can use to measure performance using the mochitest framework. I’m wondering if this might be a useful answer to our perf tests problems long term.
>
> Putting together talos tests is a real pain. You have to write a new test using the talos framework (which is a separate repo from mc), test the test to be sure it’s working, file rel eng bugs on getting it integrated into talos test runs, populated in graph server, and tested via staging to be sure everything is working right. Overall the overhead here seems way too high.
>
> Maybe we should consider changing this system so devs can write performance tests that suit their needs that are integrated into our main repo? Basically:
>
> 1) rework graphs server to be open ended so that it can accept data from test runs within our normal test frameworks.
> 2) develop of test module that can be included in tests that allows test writers to post performance data to graph server.
> 3) come up with a good way to manage the life cycle of active perf tests so graph server doesn’t become polluted.
> 4) port existing talos tests over to the mochitest framework.
> 5) drop talos.
>
> Curious what people think of this idea.

Generally speaking, I think we should have a generic framework for
declaring tests. i.e. test files for xpcshell, mochitest, Talos, etc
would all look very similar from a JS perspective. I've been wanting to
unify the in-test code for a while and over the weekend I put together a
very rough draft of what I think this should look like [1]. Please
criticize it.

If all your tests are declared the same way, then presumably the test
running code would be similar and capturing performance data would
require a single implementation affecting all test suites instead of N
1-off solutions.

I'm of the opinion that we should generally collect tons of data from
all of our testing frameworks and then sort out the meaning of that data
later (e.g. ignore data from tests running on non-homogeneous or
unreliable hardware). Maybe we don't care about things like rev X-Y
comparison of CPU cycles on an individual mochitest. But, we'd certainly
be interested if we saw an individual mochitest's CPU cycle count or
wall time double over the span of a month! You can't even raise eyebrows
unless you have data. We don't have this data today. Even if we did, it
would require separate implementations for each testing flavor
(xpcshell, mochitest, etc).

We should unify our test running code as much as possible. Then, we
should make decisions on whether it makes sense to collect and/or assess
performance data in each execution context/test flavor.

[1] https://gist.github.com/indygreg/5073810

Jim Mathies

Mar 4, 2013, 2:50:42 PM
to dev-pl...@lists.mozilla.org
"Boris Zbarsky" <bzba...@mit.edu> wrote in message news:<o7ydnYp6N66OKqnM...@mozilla.org>...
> On 3/4/13 8:15 AM, Jim Mathies wrote:
> > So to work around this I’ve been putting together some basic perf tests I can use to measure performance using the mochitest framework.
>
> How are you dealing with the fact that mochitest runs on heterogeneous
> hardware (including VMs and the like last I checked, which could have
> arbitrarily bad (or good!) performance characteristics depending on what
> else is happening with the host system)?

That sounds like a rel eng problem that could be solved. I don’t know enough about our test slaves to say for sure.

> This sounds plausible, modulo the inability to port Tp in its current
> state to a setup that involves the tests living in m-c, as long as the
> problem above is kept in mind. Basically, reusing something
> mochitest-like for developer familiarity may make sense, but it would
> need to be a separate test suite run on completely separate test slaves
> that are actually set up with performance testing in mind. A separate
> test suite which is like mochitest is not a problem per se (we have the
> ipcplugins, chrome, browserchrome, a11y tests already).

That's fine, I'm not married to mochitest, but something with similar run characteristics would be best.

> So the main win would be making it easier to add new tests in terms of
> number of actions to be taken (something it seems like we could improve
> with the current Talos setup too) and easier for developers to add tests
> because the framework is already similar, right?
>
> -Boris

Yes, basically -

1) something checked into mc that anyone can easily author or run (for tracking down regressions) without having to check out a separate repo, or set up and run a custom perf test framework.
2) performance tests that generate data that is printed to the console on local runs and could be posted to a graphs server in automation.
3) no releng overhead for the setup of new perf tests; something that is built into the test framework / infrastructure we set up.

Jim

Justin Lebar

Mar 4, 2013, 3:25:29 PM
to Jim Mathies, dev-pl...@lists.mozilla.org
> 1) something checked into mc anyone can easily author or run (for tracking down regressions) without having to checkout a separate repo, or setup and run a custom perf test framework.

I don't oppose the gist of what you're suggesting here, but please
keep in mind that small perf changes are often very difficult to track
down locally. Small changes in system and toolchain configuration can
have large effects on average build speed and its variance. For
example, I've found observable performance differences between Try and
m-c/m-i builds in the past (bug 653961), despite their build configs
being nearly identical.

In my experience, we spend the majority of our time trying to track
down small perf changes, so a change which makes it easier to track
down the source of large perf changes might not have an outsize
effect.

> 3) no releng overhead for setup of new perf tests. something that is built into the test framework / infrastructure we set up.

If we did this, we'd need to figure out how and when to promote
benchmarks to "we care about them" status.

We already don't back out changes for regressing a benchmark like
we back them out for regressing tests. I think this is at least
partially because of a general sentiment that not all of our benchmarks
correlate strongly to what they're trying to measure.

I suspect if anyone could check in a benchmark, the average quality of
benchmarks would likely stay roughly the same, but the number of
benchmarks would increase. In that case we'd have even more
benchmarks with spurious regressions to deal with.

-Justin

Justin Dolske

Mar 4, 2013, 7:25:45 PM
On 3/4/13 9:36 AM, Gregory Szorc wrote:

> If all your tests are declared the same way, then presumably the test
> running code would be similar and capturing performance data would
> require a single implementation affecting all test suites instead of N
> 1-off solutions.

We've talked about this before (perhaps in this very newsgroup) as a
cheap (?) way to get extra perf measurements beyond our current limited
set of tests, and to avoid having to add a new test suite/framework
whenever someone wants a metric... e.g. measure the run time of each
existing test, use scripts to figure out which ones are fairly stable
over time, then watch for regressions. A chance to begin again in an
orange land of opportunity and adventure!

But I'd also take the general ability to add a new test as a microbenchmark.

> We should unify our test running code as much as possible.

Oh god yes please.

Justin

Dave Mandelin

Mar 4, 2013, 7:47:10 PM
to dev-pl...@lists.mozilla.org, jma...@mozilla.com, Taras Glek
On Monday, March 4, 2013 5:15:56 AM UTC-8, Jim Mathies wrote:
> For metrofx we’ve been working on getting omtc and apzc running in the browser. One of the things we need to be able to do is run performance tests that tell us whether or not the work we’re doing is having a positive effect on perf. We currently don’t have automated tests up and running for metrofx and talos is even farther off.
>
> So to work around this I’ve been putting together some basic perf tests I can use to measure performance using the mochitest framework. I’m wondering if this might be a useful answer to our perf tests problems long term.

I think this is an incredibly interesting proposal, and I'd love to see something like it go forward. Detailed reactions below.

> Putting together talos tests is a real pain. You have to write a new test using the talos framework (which is a separate repo from mc), test the test to be sure it’s working, file rel eng bugs on getting it integrated into talos test runs, populated in graph server, and tested via staging to be sure everything is working right. Overall the overhead here seems way too high.

Yup. And that's a big problem. Not only does this make your life harder, it makes people not do as much performance testing as they otherwise might. The JS team has found that making it incredibly easy to add new correctness tests (with *zero* overhead in the common case) really helped get more tests written and used. So I think it would be great to make it a lot easier to write perf tests.

> Maybe we should consider changing this system so devs can write performance tests that suit their needs that are integrated into our main repo? Basically:
>
> 1) rework graphs server to be open ended so that it can accept data from test runs within our normal test frameworks.

IIUC, something like this is a key requirement: letting any perf test feed into the reporting system. People have pointed out that the Talos tests run on selected machines, so the perf tests should probably run on them as well, rather than on the correctness test machines. But that's only a small change to the basic idea, right?

> 2) develop of test module that can be included in tests that allows test writers to post performance data to graph server.

Does that mean a mochitest module? This part seems optional, although certainly useful. Some tests will require non-mochitest frameworks.

I believe jmaher did some work to get in-browser standard JS benchmarks running automatically and reporting to graph-server. I'm curious how that would fit in with this idea--would the test module help at all, or could there be some other, more general kind of module, so that even things like standard benchmarks can be self-serve?

> 3) come up with a good way to manage the life cycle of active perf tests so graph server doesn’t become polluted.

:-) How about optionally listing an owner for new tests, and then removing tests if no one is looking at them (according to web server logs) and there is no owner of record, or the owner doesn't say the tests are still needed?
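Sketching that policy (the field names are invented, just to show the idea): each test carries an optional owner and a last-viewed date from the web server logs, and anything unowned and unviewed past a cutoff becomes a pruning candidate.

```javascript
// Invented field names - just illustrating the pruning policy above.
const DAY_MS = 24 * 60 * 60 * 1000;

function pruneCandidates(tests, now, maxIdleDays = 90) {
  return tests
    .filter(t => {
      const idleDays = (now - t.lastViewed) / DAY_MS;
      // Keep anything with an owner of record; otherwise flag it for
      // pruning once nobody has looked at it for maxIdleDays.
      return !t.owner && idleDays > maxIdleDays;
    })
    .map(t => t.name);
}

const now = Date.parse("2013-03-04");
const candidates = pruneCandidates([
  { name: "ts", owner: "perf-team", lastViewed: Date.parse("2012-01-01") },
  { name: "old-widget-perf", owner: null, lastViewed: Date.parse("2012-06-01") },
  { name: "scroll-perf", owner: null, lastViewed: Date.parse("2013-02-20") },
], now);
```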

> 4) port existing talos tests over to the mochitest framework.
>
> 5) drop talos.

This seems like it's in the line of "fix Talos". I'm not sure if this particular 4+5 is the right way to go, but the idea has some merit.

To the extent that people don't pay attention to Talos, it seems we really don't need to do anything with it. If people are paying attention to and taking care of performance in their area, then we're covered. To take the example I happen to know best, the JS team uses AWFY to track JS performance on standard benchmarks and additional tests they've decided are useful. So Talos is not needed to track JS performance. Having all the features of the new graph server does sound pretty cool, though.

It appears that there are a few areas that are only covered by Talos for now, though. I think in that category we have warm startup time via Ts, and basic layout performance via Tp. I'm not sure about memory, because we do seem to detect increases via Talos, but we also have AWSY, and I don't know whether AWSY obviates the Talos memory measurements or not.

For that kind of thing, I'm thinking maybe we should go with the same "teams take care of their own perf tests" idea. Performance is a natural owner for Ts. I'm not entirely sure about Tp, but it's probably layout or DOM. Then those teams could decide if they wanted to switch from Talos to a different framework. If everything's working properly, and the difficulty of reproducing Talos tests locally caused enough problems to warrant it, the owning teams would notice and switch.

Dave

Robert O'Callahan

Mar 4, 2013, 7:52:47 PM
to Dave Mandelin, Taras Glek, jma...@mozilla.com, dev-pl...@lists.mozilla.org, mozilla.de...@googlegroups.com
Writing a lot of performance tests creates the problem that those tests
will take a long time to run. The nature of performance tests is that each
test must run for a relatively long time to get meaningful results.
Therefore I doubt writing lots of different performance tests can scale.
(Maybe we can find ways to eliminate noise in very short tests, but that
might be research.)

One other thing to keep in mind if we're going to start doing performance
tests differently is https://bugzilla.mozilla.org/show_bug.cgi?id=846166.
Basically Chris suggests using eideticker for performance tests a lot more.

Rob
--
Wrfhf pnyyrq gurz gbtrgure naq fnvq, “Lbh xabj gung gur ehyref bs gur
Tragvyrf ybeq vg bire gurz, naq gurve uvtu bssvpvnyf rkrepvfr nhgubevgl
bire gurz. Abg fb jvgu lbh. Vafgrnq, jubrire jnagf gb orpbzr terng nzbat
lbh zhfg or lbhe freinag, naq jubrire jnagf gb or svefg zhfg or lbhe fynir
— whfg nf gur Fba bs Zna qvq abg pbzr gb or freirq, ohg gb freir, naq gb
tvir uvf yvsr nf n enafbz sbe znal.” [Znggurj 20:25-28]

Jeff Hammel

Mar 4, 2013, 7:56:46 PM
to dev-pl...@lists.mozilla.org
I'll point out - and really this is about all I have to say on this thread - that while perf testing (that is, recording results) may be... well, not easy, but not too awful, the rigorous analysis of what the data means and whether there is a regression is often hard, since, as evidenced by Talos, the distributions are frequently non-normal and may be multi-modal. While I have no love of Talos, despite/because of sinking a year's worth of effort into it, I fear that any replacement will be done with a loss of all the wisdom harvested from the legacy system, which will then have to be relearned. If each team is responsible for its own perf testing, without a common basis and understanding of the stats analysis problem, I fear this will just multiply the problem. Frankly, one of the problems I've seen time and time again is the duplication of effort around a problem (which isn't a bad thing except...) and a lack of consolidation towards a (moz-)universal solution.
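To put a toy number on the multi-modal point (samples are made up):

```javascript
// Why the mean misleads on multi-modal perf data: a bimodal sample where
// most runs take ~10-12ms but a slow mode hits ~50ms. The mean lands
// between the modes and describes no actual run; the median tracks the
// typical one.
const samples = [10, 10, 10, 11, 11, 12, 50, 51, 52];

const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
const sorted = samples.slice().sort((a, b) => a - b);
const median = sorted[Math.floor(sorted.length / 2)];

console.log(`mean ${mean.toFixed(2)}ms vs median ${median}ms`);
```

A naive mean-based regression check on data like this flags (or misses) changes depending on how many runs fall into the slow mode, which is exactly the kind of wisdom that gets relearned the hard way.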

Dave Mandelin

Mar 4, 2013, 8:03:11 PM
to Jim Mathies, auto-...@mozilla.com, dev-pl...@lists.mozilla.org
On Monday, March 4, 2013 5:42:39 AM UTC-8, Ed Morley wrote:
> (CCing auto-...@mozilla.com)
>
> jmaher and jhammel will be able to comment more on the talos specifics,
> but few thoughts off the top of my head:
>
> It seems like we're conflating multiple issues here:
> 1) "[talos] is a separate repo from mc"

And also

1a) Talos itself is a big pain for developers to use and debug regressions in, not to mention add tests to, which they basically don't.

It seems that some of this may have changed recently, especially around using the new framework--I haven't used it in a while. I think Talos still falls down on creating new tests, though, because lots of things just don't fit its assumptions.

> 2) "[it's a hassle to] test the test to be sure it’s working"
> 3) "[it's a hassle to get results] populated in graph server"
> 4) "[we need to] come up with a good way to manage the life cycle of
> active perf tests so graph server doesn’t become polluted"

> Switching from the talos harness to mochitest doesn't fix #2 (we still
> have to test, and I don't see how it magically becomes any easier
> without extra work - that could have been applied to talos instead) or
> #3/#4 (orthogonal problem). It also seems like a brute force way of
> fixing #1 (we could just check talos into mozilla-central).

I think that part was mostly supposed to address (1a).

> Instead, I think we should be asking:
>
> 1) Is the best test framework for performance testing: [a] talos (with
> improvements), [b] mochitest (with a significant amount of work to make
> it compatible), or [c] a brand new framework?

I think that question doesn't have one answer. For JS, it's clearly "something else", but it's not even really a framework--it's just running standard benchmarks.

For other areas, there are likely different answers. That's why I was so excited about the self-serve idea. (Interestingly, I got schooled on this subject in a similar vein recently on bug tracking. :-) )

> 2) Regardless of framework used, would checking it into mozilla-central
> improve dev workflow enough to outweigh the downsides (see bug 787200
> for history on that discussion)?

Thanks for the bug link. It seems like putting Talos itself into m-c has significant disadvantages. I'm not sure what to do with other/new perf tests.

> 3) Regardless of framework used, how can we make the
> development/testing/staging cycle less painful?

I liked the original proposal a lot for this.

> 4) Regardless of framework used, who should be responsible for ensuring
> we actively prune performance tests that are no longer relevant?

I gave an idea for how to do this in my reply to the original proposal. I didn't say who would do it, but I was assuming the maintainers/operators of graph-server, with the notion that they would be highly empowered to remove anything that no one asked them to keep or that didn't otherwise have a well-documented, easily understood rationale.

Dave

Dave Mandelin

unread,
Mar 4, 2013, 8:09:44 PM
to Jim Mathies, dev-pl...@lists.mozilla.org
> We already don't back out changes for regressing a benchmark like
> we back them out for regressing tests. I think this is at least
> partially because of a general sentiment that not all of our benchmarks
> correlate strongly to what they're trying to measure.

I know this has been a hot topic lately. I think getting more clarity on this would be great, *if* of course we could have an answer that was both operationally beneficial and clear, which seems to be incredibly difficult.

But this thread gives me a new idea. If each test run in automation had an owner (as I suggested elsewhere), how about also making the owners responsible for informing the sheriffs about what to do in case of regression? If the owners know the test is reliable and measures something important, they can ask for monitoring and presumptive backout. If not, they can ask sheriffs to ignore the test, inform and coordinate with the owning team, inform the landing person only, or some other action.

Dave

Gregory Szorc

unread,
Mar 4, 2013, 8:17:29 PM
to Dave Mandelin, Jim Mathies, dev-pl...@lists.mozilla.org, mozilla.de...@googlegroups.com
This should be annotated in the tests themselves, IMO. We could even
have said annotation influence the color on TBPL. A well-written test
harness could also re-run failing tests to see if failures are constant
or intermittent. We could also introduce "expectations" instead of
"assertions" and have soft/expectancy failures for things like assertion
count mismatch. IMO we should be focusing on lessening the burden on the
sheriffs and leaving them to focus on real problems. There's so much
more we can be doing with our test infrastructure...
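The annotation-plus-retry idea above can be made concrete. The sketch below is purely illustrative: the annotation keys (`on-regression`, the policy strings) and the shape of `run_test` are hypothetical, not any existing harness API.

```python
# Sketch only: the annotation key "on-regression" and its policy values are
# hypothetical, illustrating per-test metadata that drives sheriff action.

def classify_failure(run_test, retries=3):
    """Re-run a failing test to distinguish constant from intermittent failures."""
    failures = sum(1 for _ in range(retries) if not run_test())
    if failures == retries:
        return "constant"
    return "intermittent" if failures else "passed-on-retry"

def sheriff_action(annotations, failure_kind):
    """Map a test's own annotation to the action sheriffs should take."""
    if failure_kind == "intermittent":
        return "file-intermittent-bug"
    # Fall back to a conservative default when the test owner said nothing.
    return annotations.get("on-regression", "notify-owner")

always_fail = lambda: False          # a test that fails on every re-run
kind = classify_failure(always_fail)
print(sheriff_action({"on-regression": "backout"}, kind))  # prints "backout"
```

The retry step is what lets the harness, rather than a human, decide whether a failure is worth the owner's stated policy or just an intermittent to be filed.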

Dave Mandelin

unread,
Mar 4, 2013, 8:18:37 PM
to dev-pl...@lists.mozilla.org
On Monday, March 4, 2013 4:56:46 PM UTC-8, Jeff Hammel wrote:
> I'll point out and really this is about all I have to say on this thread
> that while perf testing (that is, recording results) may be....well, not
> easy, but not too awful that rigorous analysis of what the data means
> and if there is a regression is often hard since it is often the case,
> as evidenced by Talos, that distributions are non-normal and may be
> multi-modal. While I have no love of Talos, despite/because of sinking a
> year's worth of effort into it, I fear that any replacement will be done
> with a loss of all wisdom harvested from legacy, and then relearned. If
> each team is responsible for perf testing, without a common basis and
> understanding of the stats analysis problem, I fear this will just
> multiply the problem. Frankly, one of the problems I've seen time and
> time again is the duplication of effort around a problem (which isn't a
> bad thing except...) and a lack of consolidation towards a
> (moz-)universal solution.

Those are real issues, but do you really think they are so serious? AWFY seems to do the job, and the JS team is happy with it, certainly happier with it than any other JS perf testing system we've had. One thing to note about it is that it doesn't have any automatic alarms or other actions. It's fed into human judgment only, so no statistical model is required.

On the general subject of having perf tests collected under one banner or distributed, the experience so far seems pretty clear that tests designed in a distributed way are much more successful at serving their purpose. I'm not convinced that most of these systems really need advanced statistical treatment to be useful. But if it would help, maybe it would be good to set up some kind of "perf testing group" that could meet from time to time and exchange knowledge?
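One piece of shared machinery such a group could hand out: for the non-normal, multi-modal timing data Jeff describes, a permutation test is a distribution-free way to ask whether a shift in means could plausibly be noise. This is a sketch; the sample timings and the 0.01 cutoff below are invented for illustration.

```python
# Fisher-style permutation test on the difference of means: no normality
# assumption, so it tolerates the skewed distributions Talos data shows.
# The timings and the 0.01 threshold are made up for illustration.
import random

def permutation_test(before, after, trials=10000, seed=0):
    """Estimate how often a mean shift this large appears under random relabeling."""
    rng = random.Random(seed)
    observed = abs(sum(after) / len(after) - sum(before) / len(before))
    pooled = list(before) + list(after)
    n = len(before)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        a, b = pooled[:n], pooled[n:]
        if abs(sum(b) / len(b) - sum(a) / len(a)) >= observed:
            hits += 1
    return hits / trials

before = [310, 298, 305, 300, 312, 303, 299, 307]  # ms, hypothetical baseline
after = [330, 322, 328, 325, 331, 327, 320, 329]   # ms, hypothetical new build
p = permutation_test(before, after)
print("likely regression" if p < 0.01 else "within noise")
```

Because the null distribution is built by reshuffling the actual samples, no assumption about their shape is needed, which is exactly the property missing from naive mean-plus-stddev comparisons.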

Dave

Dave Mandelin

unread,
Mar 4, 2013, 8:22:41 PM
to Dave Mandelin, Jim Mathies, dev-pl...@lists.mozilla.org
On Monday, March 4, 2013 5:17:29 PM UTC-8, Gregory Szorc wrote:
> On 3/4/13 5:09 PM, Dave Mandelin wrote:
>
> >> We already don't back out changes for regressing a benchmark like
> >> we back them out for regressing tests. I think this is at least
> >> partially because of a general sentiment that not all of our benchmarks
> >> correlate strongly to what they're trying to measure.
>
> > I know this has been a hot topic lately. I think getting more clarity on this would be great, *if* of course we could have an answer that was both operationally beneficial and clear, which seems to be incredibly difficult.
>
> > But this thread gives me a new idea. If each test run in automation had an owner (as I suggested elsewhere), how about also making the owners responsible for informing the sheriffs about what to do in case of regression? If the owners know the test is reliable and measures something important, they can ask for monitoring and presumptive backout. If not, they can ask sheriffs to ignore the test, inform and coordinate with the owning team, inform the landing person only, or some other action.
>
> This should be annotated in the tests themselves, IMO. We could even
> have said annotation influence the color on TBPL.

I like it. We would need to make sure the annotations reflect active consideration by the test owners, but I suppose failures are likely to self-correct.

> IMO we should be focusing on lessening the burden on the
> sheriffs and leaving them to focus on real problems.

Absolutely.

Dave

Nicholas Nethercote

unread,
Mar 5, 2013, 4:37:22 AM
to Dave Mandelin, Taras Glek, jma...@mozilla.com, dev-pl...@lists.mozilla.org, mozilla.de...@googlegroups.com
On Tue, Mar 5, 2013 at 11:47 AM, Dave Mandelin <dman...@gmail.com> wrote:
>
> It appears that there a few areas that are only covered by Talos for now, though. I think in that category we have warm startup time via Ts, and basic layout performance via Tp. I'm not sure about memory, because we do seem to detect increases via Talos, but we also have AWSY, and I don't know whether AWSY obviates Talos memory measurements or not.

Talos memory measurements aren't very good because Talos cycles through
multiple sites in a single tab. So it sometimes catches start-up
memory consumption regressions (Firefox Health Report was a recent
case) but it doesn't get much beyond that.

In comparison, AWSY cycles through 100 sites with 30 tabs open at
once, which is a much better reflection of typical browsing. It also
does multiple measurements -- start-up, after loading the tabs, after
closing the tabs, etc.

It's worth pointing out that AWSY is sort of built on top of Talos --
its 100 sites are taken from the Talos Tp5 set. The good thing about
this page set is that the pages are stored entirely locally. The downside
is that all the external stuff in the pages (e.g. Facebook "Like"
buttons, Google Ad stuff, Twitter feeds) isn't present, so it's not a
particularly realistic representation of those pages; in particular,
the amount of JS present is much less than real pages have.
(https://bugzilla.mozilla.org/show_bug.cgi?id=679940#c31 is an example
of the effect of this in action.)

Nick

Jim Mathies

unread,
Mar 5, 2013, 5:49:46 AM3/5/13
to dev-pl...@lists.mozilla.org

> Writing a lot of performance tests creates the problem that those tests
> will take a long time to run. The nature of performance tests is that each
> test must run for a relatively long time to get meaningful results.
> Therefore I doubt writing lots of different performance tests can scale.
> (Maybe we can find ways to eliminate noise in very short tests, but that
> might be research.)

Well, we learn what works and what doesn't as we write more tests. A factor
like length of run is something we learn about over time as we experiment.
My whole point here is to provide an easy way for devs to experiment. We
currently do not have something like this available.

What the tests run on and how they integrate into our existing testing
infrastructure is an engineering problem we can solve.
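To make the "test module that posts performance data" idea from the original proposal concrete, here is a minimal sketch of what a test could include. Every name here (`PerfReporter`, the payload fields, the test/platform strings) is hypothetical; no such module exists yet, and a real graph server schema would look different.

```python
# Hypothetical helper a perf test could include: collect repeated timings,
# then summarize them as JSON for some graph-server-like endpoint.
import json
import time
from statistics import median

class PerfReporter:
    def __init__(self, test_name, platform):
        self.test_name = test_name
        self.platform = platform
        self.samples = []

    def measure(self, fn, runs=5):
        """Time fn() several times; perf tests need repeated runs to tame noise."""
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            self.samples.append((time.perf_counter() - start) * 1000.0)  # ms

    def payload(self):
        """Summarize the samples as the JSON an ingestion endpoint might accept."""
        return json.dumps({
            "test": self.test_name,
            "platform": self.platform,
            "unit": "ms",
            "median": median(self.samples),
            "samples": self.samples,
        })

reporter = PerfReporter("metrofx_tab_open", "win8")      # names invented
reporter.measure(lambda: sum(range(10000)))
print(json.loads(reporter.payload())["test"])  # prints "metrofx_tab_open"
```

The point is the shape of the workflow, not the API: a test pulls in one module, records numbers, and the harness ships the payload, so adding a new perf measurement costs minutes instead of a releng cycle.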

> One other thing to keep in mind if we're going to start doing performance
> tests differently is https://bugzilla.mozilla.org/show_bug.cgi?id=846166.
> Basically Chris suggests using eideticker for performance tests a lot
> more.

Eideticker is interesting, but it's also not pliable. We'd love to have
eideticker tests running for metro but the odds of that happening anytime
soon are slim due to the overhead of getting it set up. I imagine making
changes or adding tests is probably not very easy either.

Something like eideticker is great as a research project or something that
is owned by a special team that augments it over time and produces data sets
we can use. But I seriously doubt devs on m-c will ever be able to spend a
few hours writing and then checking in an eideticker test.

Jim

Andrew McCreight

unread,
Mar 5, 2013, 9:16:24 AM
to Nicholas Nethercote, Taras Glek, jma...@mozilla.com, dev-pl...@lists.mozilla.org, Dave Mandelin, mozilla dev platform
----- Original Message -----
> Talos memory measurements aren't very good because it cycles through
> multiple sites in a single tab. So it sometimes catches start-up
> memory consumption regressions (Firefox Health Report was a recent
> case) but it doesn't get much beyond that.

Another problem with the Talos memory tests, in comparison to AWSY, is that Talos opens and closes pages very rapidly, while AWSY proceeds at a more stately pace. This is very important for a memory test, because most of our GC and CC heuristics are time based. The drawback of this is that each test takes hours to run, though it is mostly just sitting around, so many test runs can be done at once on the same machine.
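The pacing point can be shown in a few lines. This is only a sketch of the AWSY-style loop; `load_page` and `measure_memory` are hypothetical stand-ins for whatever the real harness provides, and the 30-second settle time is illustrative.

```python
# Why pacing matters: GC/CC heuristics fire on timers, so a memory test
# that flips pages instantly never lets collection happen before sampling.
# load_page and measure_memory are hypothetical harness hooks.
import time

def cycle_pages(pages, load_page, measure_memory, settle_seconds=30):
    """Load each page, then wait long enough for timer-based GC/CC to run
    before sampling memory, AWSY-style."""
    readings = []
    for url in pages:
        load_page(url)
        time.sleep(settle_seconds)  # give timer-based collectors a chance
        readings.append(measure_memory())
    return readings
```

A Talos-style loop is this same code with `settle_seconds` effectively zero, which is exactly why it measures allocation churn rather than steady-state memory.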

>
> In comparison, AWSY cycles through 100 sites with 30 tabs open at
> once, which is a much better reflection of typical browsing. It also
> does multiple measurements -- start-up, after loading the tabs, after
> closing the tabs, etc.
>
> It's worth pointing out that AWSY is sort of built on top of Talos --
> its 100 sites are taken from the Talos Tp5 set. The good thing about
> this page set is that they're stored entirely locally. The downside
> is that all the external stuff in the pages (e.g. Facebook "Like"
> buttons, Google Ad stuff, Twitter feeds) isn't present, so it's a not
> particularly realistic representation of those pages; in particular,
> the amount of JS present is much less than real pages have.
> (https://bugzilla.mozilla.org/show_bug.cgi?id=679940#c31 is an
> example
> of the effect of this in action.)
>
> Nick