
Eideticker automation status


William Lachance

Sep 30, 2011, 7:29:46 PM
to to...@lists.mozilla.org
[ Bccing this to a bunch of people privately, please send followups to
mozilla.tools if you can ]

So I've been spending most of this week working on a framework for
running repeatable test cases with Eideticker
(https://wiki.mozilla.org/Project_Eideticker), under the assumption that
what we have right now in terms of video capture is good enough as a
starting point. After talking a bit with people with experience with
these things (chiefly, Joel Maher), I decided to try and use talos as
the basis for this work. It basically does what we want (load a webpage
on a device with minimum overhead and measure its performance) and there
are already examples of people adding hooks to it to "capture" other
stuff (e.g. talos-xperf captures xperf data on Windows), so adding video
capture support is a fairly natural fit.

It's been a hard slog so far, as Talos isn't really used much outside of
build automation, especially for the mobile case. That said, I now have
at least the "ts" test case working on the LG G2x with video recording
(unexciting demo here:
http://people.mozilla.com/~wlachance/talos-tp-recorded.webm).

That aside, the workflow for setting up eideticker currently looks like
this (assuming you have a correctly configured system with a video
capture card!):

* Create a virtualenv
* Check out the eideticker project
(https://github.com/markrcote/eideticker), build the decklink console
app, insert it into the virtualenv
* Activate the virtualenv
* Clone talos inside the virtualenv
* Configure talos using something like this:

python remotePerfConfigurator.py -v -e org.mozilla.fennec_wlach
--activeTests tp --sampleConfig remote.config --noChrome --output
test.config --remotePort -1 --webServer 192.168.1.3/talos
--resultsServer ' ' --resultsLink ' ' --videoCapture

* Run talos:

python ./run_tests.py test.config

Which I'd say is pretty reasonable. :)
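
For the curious, those last two steps boil down to something you could
drive from a small Python wrapper, like this (just a sketch; the app
name, web server address, and file names are the values from my setup
above):

import subprocess

def configure_and_run(app="org.mozilla.fennec_wlach",
                      webserver="192.168.1.3/talos"):
    # generate test.config with video capture enabled (the same
    # invocation as above, just wrapped up)
    subprocess.check_call(
        ["python", "remotePerfConfigurator.py", "-v", "-e", app,
         "--activeTests", "tp", "--sampleConfig", "remote.config",
         "--noChrome", "--output", "test.config",
         "--remotePort", "-1", "--webServer", webserver,
         "--resultsServer", " ", "--resultsLink", " ",
         "--videoCapture"])
    # run the configured tests
    subprocess.check_call(["python", "./run_tests.py", "test.config"])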

Next steps / open questions:

* I can't seem to get any of the talos tests working on this device
except for "ts" (I've tried using both SUTAgent and the new adb device
manager approaches). Not really sure why -- I probably just need to debug
this with Joel or someone else knowledgeable. This is a high priority as
from my understanding even some of the basic talos tests (tzoom, tpan)
would be interesting to analyze under this framework.
* We obviously need to add a post-process/analysis step. What would be
the best way of doing that? The first thing I can think of is allowing
the user to specify a script/executable to be run upon completion of
each talos cycle (with the video filenames and raw data as arguments);
there's a rough sketch of what that might look like after this list.
* How high a priority is the input automation vs. refining video capture
behaviour? Obviously what we have right now is not what the physical
device is actually outputting, but I'm not sure if that's a deal breaker.
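
As a sketch of that post-processing hook idea (purely hypothetical --
the argument order and names are made up, since none of this exists
yet):

import sys

def main(video_path, raw_data_path):
    # per-cycle analysis would go here: decode the capture, compute
    # statistics, compare against the raw talos numbers, etc.
    print("would analyze %s against %s" % (video_path, raw_data_path))

if __name__ == "__main__":
    # talos would invoke this with the capture filename and raw data
    main(sys.argv[1], sys.argv[2])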

Will

Chris Jones

Sep 30, 2011, 9:52:55 PM
to William Lachance, to...@lists.mozilla.org
----- Original Message -----
> After talking a bit with people with experience with
> these things (chiefly, Joel Maher), I decided to try and use talos as
> the basis for this work. It basically does what we want (load a
> webpage
> on a device with minimum overhead and measure its performance) and
> there
> are already examples of people adding hooks to it to "capture" other
> stuff (e.g. talos-xperf captures xperf data on Windows), so adding
> video
> capture support is a fairly natural fit.
>

Talos isn't minimum overhead and it's inextricably tied to firefox. It doesn't help solve problems like "when have the page pixels reached a stable state after loading", because talos only has access to DOM events fired by firefox, which firefox lies about freely. For example, Firefox will happily fire "onload" before it's even painted, and when Firefox draws using the GPU even "MozAfterPaint" can be fired well before pixels actually appear on screen. These problems are made an order of magnitude worse by out-of-process content. And in fact it was these problems that motivated trying to build a system to complement talos: https://wiki.mozilla.org/Project_Eideticker#Background . See also https://wiki.mozilla.org/Project_Eideticker#Controlling_the_device .
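
To make that concrete: with external capture you can define "visually
done" purely in terms of pixels, with no cooperation from the browser.
Roughly (a sketch only -- assumes frames decoded to numpy arrays, and
the window and threshold are made-up tuning values):

import numpy

def frames_stable(frames, window=30, threshold=1.0):
    # "visually loaded" means the last `window` captured frames are
    # essentially identical, measured as mean absolute pixel difference
    recent = frames[-window:]
    for prev, cur in zip(recent, recent[1:]):
        if numpy.abs(cur.astype(int) - prev.astype(int)).mean() > threshold:
            return False
    return True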

As a means of ramping up, to run pages and record output, talos seems like a good way to start. But its current incarnation isn't what we'd want in the long run, for the HDMI capture tests.

> * I can't seem to get any of the talos tests working on this device
> except for "ts" (I've tried using both SUTAgent and the new adb device
> manager approaches). Not really sure why -- I prolly just need to
> debug
> this with Joel or someone else knowledgeable. This is a high priority
> as
> from my understanding even some of the basic talos tests (tzoom, tpan)
> would be interesting to analyze under this framework.

Ts measures how quickly firefox fires "onload", which is useful but not at all what users see. At the end of a Ts run, firefox may not have painted at all. To fix that, you would have to bolt an entirely new set of feedback and control loops onto talos, and I suspect it would be easier to start from scratch on a simpler substrate (for the longer term, not initial ramp-up).

> * How high a priority is the input automation vs. refining video
> capture
> behaviour? Obviously what we have right now is not what the physical
> device is actually outputting, but I'm not sure if that's a deal
> breaker.

Last I heard we were capturing 4:2:2 YUV at 60fps, which is fantastic, better than I hoped for in the first prototype. I think the focus now should be on three things
- writing good tests: it's really hard, and you really need knowledge of gecko internals to make sure you're testing the right thing. I can help pull in platform people here.
- writing analyses: framerate is pretty trivial. Load histogram/heatmap is a little bit trickier but not too bad. Not sure whose purview these would fall under.
- test controller: driving the browser in a generic way, so that we can run tests on the stock android browser and opera and compare results
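
On the framerate point, the analysis really is close to trivial once
you have a fixed-rate capture: count the frames that actually differ
from their predecessor. Sketch only (frame decoding and the difference
threshold are hand-waved):

import numpy

CAPTURE_FPS = 60  # matches the 4:2:2 YUV @ 60fps capture above

def effective_fps(frames, threshold=0.5):
    # fraction of captured frames that changed, scaled to capture rate
    changed = sum(
        1 for prev, cur in zip(frames, frames[1:])
        if numpy.abs(cur.astype(int) - prev.astype(int)).mean() > threshold)
    return CAPTURE_FPS * changed / float(len(frames) - 1)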

Great news! :D

Cheers,
Chris

William Lachance

Oct 1, 2011, 11:10:31 AM
to to...@lists.mozilla.org
On 11-09-30 09:52 PM, Chris Jones wrote:
>> > After talking a bit with people with experience with
>> > these things (chiefly, Joel Maher), I decided to try and use talos as
>> > the basis for this work. It basically does what we want (load a
>> > webpage
>> > on a device with minimum overhead and measure its performance) and
>> > there
>> > are already examples of people adding hooks to it to "capture" other
>> > stuff (e.g. talos-xperf captures xperf data on Windows), so adding
>> > video
>> > capture support is a fairly natural fit.
>> >
> Talos isn't minimum overhead and it's inextricably tied to firefox. It doesn't help solve problems like "when have the page pixels reached a stable state after loading", because talos only has access to DOM events fired by firefox, which firefox lies about freely. For example, Firefox will happily fire "onload" before it's even painted, and when Firefox draws using the GPU even "MozAfterPaint" can be fired well before pixels actually appear on screen. These problems are made an order of magnitude worse by out-of-process content. And in fact it was these problems that motivated trying to build a system to complement talos: https://wiki.mozilla.org/Project_Eideticker#Background . See also https://wiki.mozilla.org/Project_Eideticker#Controlling_the_device .
>
> As a means of ramping up, to run pages and record output, talos seems like a good way to start. But its current incarnation isn't what we'd want in the long run, for the HDMI capture tests.

Well, I'm definitely not opposed to creating a new wheel if necessary,
but from my short experience so far with talos (especially remote
talos, which is what I'm using here) my instinct is that it's actually
not too far from what we want. In the remote talos case, the harness is
running on a controller, not the device, so there's no issue with the
competition for resources that you mention. You're right that we would
eventually need to add a new layer of feedback and control to actually
measure the sorts of things we want, but my feeling is that it should be
fairly natural to do so as an extension. I also don't see any real a
priori reason why we couldn't get (remote) talos running (e.g.) Opera if
that turns out to be something we want to do. I guess we'll see? My
feeling is that it should become obvious fairly quickly if Talos is
totally the wrong tool for the job.

Will

jmaher

Oct 3, 2011, 1:25:01 PM
to mozill...@lists.mozilla.org
Just some points about measuring mozafterpaint. For 'ts', we have a
ts_paint which waits for mozafterpaint. Likewise all pageloader tests
are modified to support plain load or load+mozafterpaint. So we have
a lot of the basics in place.

I understand the concern about >1 process. We have some handling in
place for that using the pageloader extension
(http://hg.mozilla.org/build/pageloader/file/4dec1e56c677/chrome/pageloader.js), but it might
not be a perfect scenario.

William Lachance

Oct 3, 2011, 6:43:40 PM
to Chris Jones, to...@lists.mozilla.org
On 11-09-30 09:52 PM, Chris Jones wrote:
>> > * How high a priority is the input automation vs. refining video
>> > capture
>> > behaviour? Obviously what we have right now is not what the physical
>> > device is actually outputting, but I'm not sure if that's a deal
>> > breaker.
> Last I heard we were capturing 4:2:2 YUV at 60fps, which is fantastic, better than I hoped for in the first prototype. I think the focus now should be on three things
> - writing good tests: it's really hard, and you really need knowledge of gecko internals to make sure you're testing the right thing. I can help pull in platform people here.
> - writing analyses: framerate is pretty trivial. Load histogram/heatmap is a little bit trickier but not too bad. Not sure whose purview these would fall under.
> - test controller: driving the browser in a generic way, so that we can run tests on the stock android browser and opera and compare results

This sounds about right to me. We had a bit of a discussion today
amongst some members of automation about creating some user stories for
Eideticker, and eventually came up with this work-in-progress on etherpad:

http://etherpad.mozilla.com:9000/ABGUu0FAb3

I think the big thing missing from the above is some specific
user stories about tests that we'd like to run. I tried to throw together
a quick user story about capturing a page with continuously animating DOM
elements (based loosely on the idea of the Microsoft Psychedelic
Browsing demo on the eideticker wiki page:
https://wiki.mozilla.org/Project_Eideticker#Example:_Framerate_analysis), something
maybe like this:

http://wlach.masalalabs.ca/color-cycle.html

(don't open that link if you have epilepsy)

But really I'm kind of just stabbing in the dark at this point. :) I
guess what I'd most like right now is:

1. What's the most basic test case we could implement that would be
useful for performance testing?
2. What other test cases are we pretty sure that we want to implement
for the first release of this thing? From the eideticker document, I
gathered that we wanted to measure: frame rate, frame splitting,
checkerboarding, lag, and behaviour of the application while a page is
being loaded from the network. Do we have a good idea what sorts of test
cases would be good for measuring these sorts of things, or is that
something we're going to have to iteratively develop as we go along? If
the latter, maybe I should frame these user stories more in terms of the
high-level things we want to measure, leaving the details of the exact
test cases to a more detailed function spec (and/or implementation).

It's ok if we don't have all the answers at this point. I really just
need a few things to work towards to guide development over the next few
weeks.

Will


Chris Jones

Oct 14, 2011, 6:41:08 AM
to mozill...@lists.mozilla.org
Sorry for the lag, wasn't subscribed to the mailing list. Fixed.

On Oct 3, 3:43 pm, William Lachance <wlacha...@mozilla.com> wrote:
> 1. What's the most basic test case we could implement that would be
> useful for performance testing?

I assume you're talking about framerate analysis? If so, the main
uses of eideticker there are

- Extract more rigorous results from performance benchmarks. They
usually do their own framerate estimation, and often don't do a good
job of it. Some benchmarks (*cough* Microsoft's) make up silly
metrics other than fps, which are just distracting. So one thing that
would be interesting is comparing benchmarks' own fps results with
what eideticker says. Another useful thing would be
setting the silly-metric benchmarks back on firm footing. A few
interesting ones are
o JSGameBench
o GUIMark2/3 HTML5
o Asteroids HTML5 Canvas 2D
o Psychedelic Browsing
o Hardware Acceleration Stress Test
o FishIE
o WebGL FishIE

(Note that when testing under eideticker, you'd need to remove the fps
estimation and reporting since they can affect the eideticker
results.)

- Regression testing on interesting benchmarks. We don't do that
right now, and that's a bad thing. It would be extremely interesting
to see how the scores on some of the benchmarks above have changed
over time, and where we could have used eideticker to catch
regressions. Or even better, if we're in the middle of a performance
regression, report it! :) That would be a huge find.

- Comparing against other browsers on rigorous grounds. That's a lot
to bite off for a first release though; I would postpone it.

> 2. What other test cases are we pretty sure that we want to implement
> for the first release of this thing? From the eideticker document, I
> gathered that we wanted to measure: frame rate, frame splitting,
> checkerboarding, lag, and behaviour of the application while a page is
> being loaded from the network. Do we have a good idea what sorts of test
> cases would be good for measuring these sorts of things, or is that
> something we're going to have to iteratively develop as we go along? If
> the latter, maybe I should frame these user stories more in terms of the
> high-level things we want to measure, leaving the details of the exact
> test cases to a more detailed function spec (and/or implementation).
>

I would go for one or both of the page-load analyses described on the
wiki. We *do* have problems with extra reflowing and repainting on
android. It would be really cool to show that off with a heat map and/
or histogram. Then developers can start attacking bugs and verifying
that they're fixed, and tuning various knobs to see how the results
change.
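
The heat map itself should only be a few lines once you have decoded
frames: count, per pixel, how many times it changed over the load, so
regions that repainted repeatedly light up. A sketch, assuming
grayscale numpy frames and an arbitrary change threshold:

import numpy

def paint_heatmap(frames, threshold=10):
    # per-pixel count of frame-to-frame changes over the page load
    counts = numpy.zeros(frames[0].shape, dtype=int)
    for prev, cur in zip(frames, frames[1:]):
        counts += numpy.abs(cur.astype(int) - prev.astype(int)) > threshold
    return counts  # hot spots are regions that repainted repeatedly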

Of all the things you listed, checkerboarding would be the most
valuable to implement because we can't measure it atm, but it's also
very hard to do rigorously. I wouldn't wait on that for the first
release. The test harness needs to have fine control of panning the
page, and that's not easy to do well.

For load analysis, I would just start with popular, complicated pages:
your nytimes, cnns, tom's hardwares, engadgets, those sorts of
pages. The talos pageset would be interesting too. We'll definitely
iteratively develop sets of tests, and I'm sure lots of people will
have ideas for new things they'd like to measure. We'll likely want a
benchmark suite like talos', and also a set of simple regression
tests. That is, when we find problems on large complicated pages, we
can distill those problems into small tests and add them to a
regression suite. When we do all this stuff, we definitely want to rope
in platform folks.

In general, I would focus more on releasing a tool that gives us new
measurement capabilities than on particular analyses and tests. We of
course want to show examples where we're gathering
data that other tools can't, but I expect analyses and tests to evolve
quite a bit as people think of new ways to use the new capabilities.

Cheers,
Chris

William Lachance

Oct 17, 2011, 6:32:25 PM
to Chris Jones, mozill...@lists.mozilla.org
On 11-10-14 06:41 AM, Chris Jones wrote:
> Sorry for the lag, wasn't subscribed to the mailing list. Fixed.
>
>> 2. What other test cases are we pretty sure that we want to implement
>> for the first release of this thing? From the eideticker document, I
>> gathered that we wanted to measure: frame rate, frame splitting,
>> checkerboarding, lag, and behaviour of the application while a page is
>> being loaded from the network. Do we have a good idea what sorts of test
>> cases would be good for measuring these sorts of things, or is that
>> something we're going to have to iteratively develop as we go along? If
>> the latter, maybe I should frame these user stories more in terms of the
>> high-level things we want to measure, leaving the details of the exact
>> test cases to a more detailed function spec (and/or implementation).
>>
>
> I would go for one or both of the page-load analyses described on the
> wiki. We *do* have problems with extra reflowing and repainting on
> android. It would be really cool to show that off with a heat map and/
> or histogram. Then developers can start attacking bugs and verifying
> that they're fixed, and tuning various knobs to see how the results
> change.
> ...


Hi Chris,

Thanks for all the ideas. I guess my question could really be boiled
down to "what's the minimum viable product" for Eideticker. It sounds
like this question could have multiple answers. :) I just finished a
very basic test which goes through the color cycle test I mentioned
earlier, results here:

http://wlach.masalalabs.ca/colorcycle.avi (original)
http://wlach.masalalabs.ca/colorcycle.webm (webm compressed)

Notes on this:

1. Unfortunately this only works about 80% of the time because of bugs
in my own stuff and/or Talos, but that can probably be fixed. I can
confirm that Talos is a bit of a pain to deal with sometimes, but I
remain convinced it's best to share work if possible. :)
2. There are some strange banding artifacts even in the uncompressed video
(the color-changing box is supposed to be one solid color, but there's
very visible banding). Is this a show stopper? I assume it isn't.
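
If we want to put a number on the banding: the box is supposed to be a
single solid color per frame, so the per-channel spread inside it
should be near zero. Something like the following, where the box
coordinates are placeholders:

import numpy

BOX = (slice(100, 300), slice(100, 300))  # placeholder box location

def banding_score(frame):
    # per-channel standard deviation inside the box; 0 would mean a
    # perfectly solid color, banding pushes it up
    region = frame[BOX].astype(float)
    return region.reshape(-1, region.shape[-1]).std(axis=0).mean()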

If you're curious, you can find my current source here:

https://github.com/mozilla/eideticker

Also of note, I've been tracking this project on pivotal tracker here:

https://www.pivotaltracker.com/projects/387017

If you're interested in commenting on stuff directly there, let me know.
I think it might be a bit easier, but I can understand not wanting to
deal with yet another new tool.

--

In any case, based on the above, it sounds like you think the next logical
step is to work on some kind of basic page loading test case and the
necessary infrastructure to support analyzing it. I'll hack on that over
the next while unless I hear otherwise from you. If you think the
animation cases would be more useful, it would probably be relatively
easy to bootstrap something like that on top of the framework I built
for the color cycling test. Let me know!

Will
