Performance Measurement with Mozmill


Clint Talbert

Mar 11, 2009, 1:36:27 AM
to mozmi...@googlegroups.com
Gary brought up the idea of using Mozmill as a means to measure
performance degradation of an application under duress - for example,
Firefox with 1000 tabs open. This is an interesting idea, and I was
wondering if anyone would be interested in trying to run with it. I
have a couple of big-picture questions to spur some discussion on it.
Gary and I discussed it privately a little bit, but we both thought the
discussion really ought to be on this list, so we're posting it here.
My questions are framed as the hurdles we need to overcome to test this
effectively and the decisions we must make in order to actually get
benefit from such testing.

= The Hurdles =
1. What is the impact of Mozmill on performance in the first place? I'm
not even sure how to determine this.
2. How would you mitigate the noise from the network transactions (cache
the pageset?)
3. How would you decide on a specific site to load, or a specific set of
sites that represents some reasonable sample of the web? Or does it make
sense to load the same site (or even about:blank) 1000 times? This also
depends on the application you're testing, obviously.

= The Decisions =
1. What are you trying to measure? Tab performance? Inbox
performance? Memory use? What does that mean?
2. How do you produce anything actionable from the results? This one is
sticky, and it pertains to all high-end load testing analysis. The
amount of data pumped through any given system in a stress-testing
situation is usually enormous, so how do you distinguish an adequate
response by the application from an incorrect response to that
stimulus? How do you then take your result and point developers at the
issue? One way to do this is to graph trends under steadily increasing
load. I did this once and discovered a system where the time from input
to response grew from a linear relationship in low-stress situations to
an exponential relationship under duress. That was a bug, and similar
analysis could be defined for Mozilla applications and tested using
Mozmill (a rough sketch of that kind of trend check follows below).
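As a minimal illustration of that trend check, here is a sketch in
Python. The load levels, timing numbers, and the 1.2 threshold are
hypothetical placeholders rather than measurements from any real run;
in practice the timings would come from repeated Mozmill runs at each
load level.

    # Sketch: check whether response time grows faster than linearly with load.
    # The (load, median response time in ms) pairs below are hypothetical
    # placeholders; real numbers would come from repeated Mozmill runs.
    import numpy as np

    samples = [(10, 12.0), (50, 60.5), (100, 130.2), (500, 900.7), (1000, 4100.3)]
    loads = np.array([s[0] for s in samples], dtype=float)
    times = np.array([s[1] for s in samples], dtype=float)

    # Fit a line in normal space and in log-log space; a log-log slope well
    # above 1 means response time is growing super-linearly with load.
    linear_slope, _ = np.polyfit(loads, times, 1)
    growth_exponent, _ = np.polyfit(np.log(loads), np.log(times), 1)

    print("linear fit slope: %.3f ms per unit of load" % linear_slope)
    print("log-log growth exponent: %.2f" % growth_exponent)
    if growth_exponent > 1.2:  # arbitrary threshold for "clearly super-linear"
        print("response time grows super-linearly with load - likely a problem")

Graphing the same pairs would show the elbow visually; the fit just
makes the trend easy to flag automatically.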

So, Gary's got a great idea. Neither he nor I have the bandwidth to
lead the project, though we'd both like to help. Do we have anyone
interested in taking a look at this? Are there thoughts about any of
the questions I've raised above? More questions to be raised?

Thanks,
Clint

Adam Christian

Mar 11, 2009, 3:56:09 AM
to mozmi...@googlegroups.com
This sounds like a really super cool project; I wish I had the time to
spearhead it! There are a ton of interesting numbers you need to
collect before you can really even dive into this, e.g. timing
javascript without mozmill running, averaging those numbers out, and
then running the same js code with mozmill, etc.

If there are any mozmill features that we think would make this more
feasible, go ahead and log them and let me know so I can get them in
the queue.

Adam

Mikeal Rogers

Mar 11, 2009, 1:27:14 PM
to mozmi...@googlegroups.com
All of the hurdles described below stem from the same strategy, which
I think is not the best way to approach this problem.

Performance measurements are a comparison of one set of data against
another set of similar data used as some kind of baseline.

There are two ways we seem to be assuming the baseline data is created:

The first way is the way Talos creates its baselines, which is the
accumulation of a long history of the same data against revisions of
the product as it is developed, plus data against a previous version of
the product. This requires that the variable elements of the tests be
eliminated, which is why the web pages are cached locally (a huge pain
in the ass, by the way) and why changes in Talos itself and its tests
are minimized. It seems like we've already decided that this isn't
going to work because mozmill is being developed so rapidly; changes in
the speed of test execution or in the overhead of the framework are
just assumed to happen going forward.

The second way is to create a "clean" baseline from the previous
product release and to remove the variables above at an even deeper
level: not just caching the pages but trying to approximate the
overhead of the tests themselves and of the framework, so that the data
can keep its integrity through changes in the framework.

Both of these strategies assume one thing: that the baseline must be
created in isolation from the new performance data.

By the end of the week, when I finish the mozrunner/jsbridge/mozmill
Python reworking, we'll have the ability to run two test data sets in
parallel. Instead of thinking about the baseline the way we have before,
by removing all the variables, we can create the baseline in parallel
with the new test data, with all the same runtime variables, and just
compare the two.

A mozmill performance test would treat its data as a comparison of the
same test running at the exact same time on the exact same machine
against both the previous product release and the latest build, instead
of against a long history of baseline information or a "clean" baseline.

No need to cache the pages, no need to quantify the framework overhead
(which I think is actually quite minimal); we just have raw comparisons
between products in live environments. Of course we would run the same
tests many times in order to remove any random noise in the perf tests;
this is done by every performance test system I've ever worked with.
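To make that parallel comparison concrete, here is a minimal harness
sketch. It is only an outline of the idea under stated assumptions: the
binary paths, profile names, and the bare subprocess invocation are
placeholders for what would really be driven through the
mozrunner/jsbridge Python pieces described above.

    # Sketch: run the same suite against two builds at the same time on the
    # same machine, repeat a few times, and compare medians. The binary paths,
    # profile names, and invocation are hypothetical placeholders; the real
    # thing would go through mozrunner/jsbridge rather than bare subprocess.
    import subprocess
    import threading
    import time

    BUILDS = {
        "release": "/path/to/firefox-release/firefox",
        "nightly": "/path/to/firefox-nightly/firefox",
    }

    def run_suite(label, binary, results):
        timings = []
        for _ in range(5):  # repeat to smooth out random noise
            start = time.time()
            subprocess.call([binary, "-no-remote", "-P", label])  # placeholder
            timings.append(time.time() - start)
        results[label] = sorted(timings)[len(timings) // 2]  # median

    results = {}
    threads = [threading.Thread(target=run_suite, args=(label, path, results))
               for label, path in BUILDS.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("median run: release %.2fs, nightly %.2fs, diff %+.2fs"
          % (results["release"], results["nightly"],
             results["nightly"] - results["release"]))

The point is only the shape of the comparison: both runs see the same
machine load and network conditions, so it is the difference that gets
charted.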

I think if we go with this strategy we sidestep most of the hurdles
below and only need a minimal set of new features. I think this kind of
performance data also removes one of the larger problems with automated
test data in general: automated data is always collected in relatively
clean environments so that it can be replicated and quantified more
easily, but that inherently loses the more dynamic elements of the
product that cause bugs and performance problems.

Does this make sense?

-Mikeal

Clint Talbert

Mar 11, 2009, 1:50:37 PM
to mozmi...@googlegroups.com
This is a really interesting idea, Mikeal.  Let me boil this down to make sure I'm following you.

You'd run the same version of mozmill on two different versions of the product under test at the same time and then compare the performance of each run?

That sounds cool, and I can see how that does eliminate a lot of the hurdles.  One thing I don't understand is how to provide actionable data from such a run.

If a run on version A is 2x slower than a run on version B, how would you attack that as a developer?  Start bisecting among the nightlies?  Ideally, that would give you a one-day changeset that caused the slowdown.  But more likely, a slowdown between version A and version B is the product of untold interactions within the code, and trying to reduce variables to pinpoint the cause of emergent behavior like that is complicated by this approach.  It's easier to pinpoint something in a more "clean room" environment.  But on the other hand, clean-room environments don't generally exhibit the emergent behaviors that products in the wild do.  Thoughts on this?

Clint

Joel Maher

Mar 11, 2009, 2:03:52 PM
to mozmi...@googlegroups.com
First off, the concept of performance testing/measuring here is really awesome.

I see two categories of stuff here:
1) limit/load/stress testing using mozmill (think 1000+ tabs)
2) measuring smaller metrics (bookmark load times, page load times, application launch times, memory usage during profile creation, etc...)

For each of these types of testing it is best to have a goal of what we would like, and another number of what we find acceptable.  Now it might not be worth our time to do all of this, so developing a bunch of tests that are geared towards measuring performance and comparing against a moving baseline (previous version, etc...) is a good approach as well.

I would like to see if measuring each smaller transaction would be of interest, for example measuring the time for the click of a button.  If we went this route, we could apply some basic rules (no single action should take >1 second except for x, y, z).  Then when all mozmill tests are running (presumably with a -perf flag) we could be measuring perf by default.

To look at the larger picture of something like how long it takes to open the 1001st tab, or how memory behaves with 1000 tabs open, we can craft a specific test for that.

I would think that adding some basic tools into mozmill to get the # of threads, memory used per process, cpu time, and total time taken would be good.  Then we could just query that from our test case and maybe publish a parallel set of data alongside the pass/fail results which includes the raw perf metrics.
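As a sketch of what collecting those numbers could look like on the
harness side, here is a small Python example using the third-party
psutil module; psutil is not part of mozmill, and looking the browser
process up by name is just a stand-in for getting the pid from whatever
launched it.

    # Sketch: poll thread count, memory, and CPU time for the browser under
    # test. Uses the third-party psutil module; finding the process by name
    # is a placeholder for getting the pid from the launcher (e.g. mozrunner).
    import psutil

    def snapshot(pid):
        proc = psutil.Process(pid)
        cpu = proc.cpu_times()
        return {
            "num_threads": proc.num_threads(),
            "rss_bytes": proc.memory_info().rss,
            "cpu_seconds": cpu.user + cpu.system,
        }

    def find_firefox_pid():
        for proc in psutil.process_iter(["pid", "name"]):
            name = proc.info["name"] or ""
            if "firefox" in name.lower():
                return proc.info["pid"]
        raise RuntimeError("no firefox process found")

    if __name__ == "__main__":
        print(snapshot(find_firefox_pid()))

Total time taken would be measured around the test itself, and the
resulting dict is the kind of parallel record that could be published
next to the pass/fail results.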

Those are my initial thoughts

-Joel

Mikeal Rogers

Mar 11, 2009, 2:10:09 PM
to mozmi...@googlegroups.com
I don't think this would differ that greatly from how we use our
current performance data, provided we chart the data based on the
*difference* between the latest release and the current nightly rather
than charting the total-time data for the nightly like we have
traditionally.

If you saw a big jump in the differentials between two revisions, you
could also kick off a test comparing those two versions rather than
just the latest release and the current nightly.

Also, I'm saying nightly here but really you could use _any_ full
build. So if we have a build running for every revision, and we have
enough hardware to run all these tests against each build, we could
get a more granular idea of where the performance problem surfaced.

Another big advantage we have here is that mozmill is designed to run
the same way on a local machine as it does in a continuous environment,
and the setup, even when running with Python, is pretty simple. Since
the tests run in parallel for comparisons, you could run them locally
and get the same kind of comparison that we do in continuous testing,
which would allow for better debugging by developers.

-Mikeal

Mikeal Rogers

Mar 11, 2009, 2:22:20 PM
to mozmi...@googlegroups.com
> For each of these types of testing it is best to have a goal of what
> we would like, and another number of what we find acceptable. Now
> it might not be worth our time to do all of this, so taking an
> approach of developing a bunch of tests that are geared towards
> measuring performance and comparing against a moving baseline
> (previous version, etc...) is a good approach as well.
>
> I would like to see if measuring each smaller transaction would be
> of interest. So for example, measuring the time for the click of a
> button. If we went this route, we could apply some basic rules (no
> single action should take >1 second except for x, y, z). Then when
> all mozmill tests are running (presumably with a -perf flag) we
> could be measuring perf by default.
>

In Windmill we do this by default, but we don't find it nearly as
useful as you would think. For one thing, the click event firing
returns once the event is done propagating, but since javascript is so
asynchronous this doesn't actually measure the time it took to run the
code on the page that was attached to that event.

I think what is more useful is the addition of manual timers. This
way, you could set your own timer before a click event and end it
after a waitForElement() with a really low eval interval.
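As a way to picture those manual timers on the Python side (the click
and wait-for-element stand-ins below are hypothetical names, not the
real mozmill calls, which live in the javascript test), a minimal
sketch:

    # Sketch: a manual timer a perf test could wrap around one interaction,
    # started before a click and stopped once the resulting element shows up.
    # do_click() and wait_for_element() are hypothetical stand-ins for the
    # mozmill-side actions being timed.
    import time
    from contextlib import contextmanager

    timings = {}

    @contextmanager
    def manual_timer(name):
        start = time.time()
        try:
            yield
        finally:
            timings.setdefault(name, []).append(time.time() - start)

    # Usage sketch:
    # with manual_timer("open-bookmarks-menu"):
    #     do_click(bookmarks_button)          # hypothetical
    #     wait_for_element(bookmarks_popup)   # hypothetical, low poll interval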

I know it sounds like it would be nice to double all our functional
tests as performance tests, but what you end up finding is that you
aren't measuring what you would actually like to measure, and the
amount of data you're flooded with ends up not being very useful just
because of the sheer volume.


> To look at the larger picture of something like how long it takes to
> open the 1001th tab, or how the memory works with 1000 tabs open, we
> can craft a specific test for that.
>
> I would think that adding some basic tools into mozmill to get the #
> of threads, memory used/process, cpu time, and total time taken
> would be good. Then we could just query that from our test case and
> maybe publish a parallel set of data to the pass/fail results which
> include the raw perf metrics.


So, the way you would do this is to write some Python code that can
poll for all the local system performance information. Then in mozmill
you fire your own custom event in javascript whenever you want this
information logged and just add a callback for that event in Python.
Then after all the tests are finished you parse out all perf data on
the Python side.
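A rough sketch of that Python side follows. The add_listener() wiring
and the "perf.snapshot" event name are assumptions for illustration
rather than confirmed mozmill/jsbridge API, and poll_system_perf() is a
stand-in for real metric collection:

    # Sketch: log a perf snapshot every time the javascript test fires a
    # custom event. The listener registration and event name are assumptions,
    # not confirmed API; poll_system_perf() is a placeholder for real metric
    # collection (e.g. the psutil snapshot sketched earlier in the thread).
    import time

    perf_log = []

    def poll_system_perf():
        return {"timestamp": time.time()}  # placeholder metrics

    def on_perf_event(event_data):
        perf_log.append({"label": event_data, "metrics": poll_system_perf()})

    # Hypothetical wiring on the Python side:
    #   mozmill_instance.add_listener(on_perf_event, eventType="perf.snapshot")
    # and on the javascript side the test would fire something like:
    #   events.fireEvent("perf.snapshot", "after-opening-1000-tabs");
    # After the run, perf_log is parsed and published next to pass/fail results.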

This brings up a few points I think I failed to make.

1) mozmill performance tests will require being run from Python
2) each set of performance tests will probably have their own Python
script to launch and parse data, but those scripts will mostly use
tools already provided on the mozmill and jsbridge Python side.

-Mikeal

Joel Maher

Mar 11, 2009, 2:27:32 PM
to mozmi...@googlegroups.com

I have done this a couple of times in the past and found many bugs by measuring single actions.  But as you pointed out, with click and return events and asynchronous code it might not make sense.


> To look at the larger picture of something like how long it takes to
> open the 1001th tab, or how the memory works with 1000 tabs open, we
> can craft a specific test for that.
>
> I would think that adding some basic tools into mozmill to get the #
> of threads, memory used/process, cpu time, and total time taken
> would be good.  Then we could just query that from our test case and
> maybe publish a parallel set of data to the pass/fail results which
> include the raw perf metrics.


> So, the way you would do this is to write some Python code that can
> poll for all the local system performance information. Then in mozmill
> you fire your own custom event in javascript whenever you want this
> information logged and just add a callback for that event in Python.
> Then after all the tests are finished you parse out all perf data on
> the Python side.
>
> This brings up a few points I think I failed to make.
>
> 1) mozmill performance tests will require being run from Python
> 2) each set of performance tests will probably have their own Python
> script to launch and parse data, but those scripts will mostly use
> tools already provided on the mozmill and jsbridge Python side.

This sounds reasonable.  I assume we will have a common library/module and template that test authors will follow.
 

> -Mikeal


