
Orange is the new Bad (Gij)


Michael Henretty

Nov 4, 2015, 10:39:33 AM
to dev-...@lists.mozilla.org
Hi Gaia Folk,

If you've been doing Gaia core work for any length of time, you are probably aware that we have *many* intermittent Gij test failures on Treeherder [1]. But the problem is even worse than you may know! You see, each Gij test is run 5 times within a test chunk (e.g. Gij4) before it is marked as failing. Then that chunk itself is retried up to 5 times before the whole thing is marked as failing. This means that for a test to be marked as "passing," it only has to run successfully once in 25 times. I'm not kidding. Our retry logic, especially the retries inside the test chunks, makes it hard to know which intermittent tests are our worst offenders. This is bad.
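
To put rough numbers on it (assuming the runs are independent, which they aren't quite): a test only shows up as failing if it fails all 25 attempts. A test that fails half the time does that with probability 0.5^25, about 3 in 100 million, and even a test that fails 90% of the time still comes out green roughly 93% of the time (1 - 0.9^25). The retries hide everything short of a permafail.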

My suggestion is to stop doing the retries inside the chunks. That way, the failures will at least surface on Treeherder, which means we can star more tests, which means we'll have a lot more visibility into the bad intermittents. Sheriffs will complain a lot, so we have to be ready to act on these bugs. But the alternative is that we continue to write tests with a low "raciness" bar which, IMO, have a much lower chance of catching regressions. The longer we wait, the worse this problem becomes.

Thoughts?

Thanks,
Michael

Fabrice Desré

Nov 4, 2015, 10:46:02 AM
to dev-...@lists.mozilla.org
Can we *right now* identify the worst offenders by looking at the test
results/re-runs? You know that sheriffs will very quickly hide and
ignore tests that are really flaky.

Fabrice

On 11/04/2015 07:39 AM, Michael Henretty wrote:
> Hi Gaia Folk,
>
> If you've been doing Gaia core work for any length of time, you are
> probably aware that we have *many* intermittent Gij test failures on
> Treeherder [1]. But the problem is even worse than you may know! You
> see, each Gij test is run 5 times within a test chunk (g. Gij4) before
> it is marked as failing. Then that chunk itself is retried up to 5 times
> before the whole thing is marked as failing. This means that for a test
> to be marked as "passing," it only has to run successfully once in *25*
> times. I'm not kidding. Our retry logic, especially those inside the
> test chunk, make it hard to know which intermittent tests are our worst
> offenders. This is bad.
>
> My suggestion is to stop doing the retries inside the chunks. That way,
> the failures will at least surface on Treeherder, which means we can
> star more test, which means we'll have a lot more visibility on the bad
> intermittents. Sheriffs will complain a lot, so we have to be ready to
> act on these bugs. But the alternative is that we continue to write
> tests with a low "raciness" bar which, IMO, have a much lower chance of
> catching regressions. The longer we wait, the worse this problem becomes.
>
> Thoughts?
>
> Thanks,
> Michael
>
> 1.)
> https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure&keywords_type=allwords&list_id=12657856&resolution=---&query_format=advanced&product=Firefox%20OS
>
>
>


--
Fabrice Desré
b2g team
Mozilla Corporation

Michael Henretty

Nov 4, 2015, 10:48:45 AM
to Fabrice Desré, dev-...@lists.mozilla.org

On Wed, Nov 4, 2015 at 4:45 PM, Fabrice Desré <fab...@mozilla.com> wrote:
Can we *right now* identify the worst offenders by looking at the tests
results/re-runs? You know that sheriffs will very quickly hide and
ignore tests that are really flaky.


Yes, that's an important point. The problem is that you have to actually look at the logs of an individual chunk to see which tests failed. If a certain Gij test passes at least 1 out of its 5 given runs, it will not surface to Treeherder, which means we can't star it. Looking through each chunk log file (of which we have 40 per run) is doable, but more time-consuming and error-prone.


Michael Henretty

Nov 4, 2015, 10:52:42 AM
to Fabrice Desré, dev-...@lists.mozilla.org
Another way forward is just to figure out a way to track intermittent failures even inside the chunk. I agree we don't do ourselves any favors if, instead of trying to fix this, we just get Gij hidden everywhere besides Gaia Try.

Gregor Wagner

Nov 4, 2015, 11:01:48 AM
to mozilla-...@lists.mozilla.org
This sounds like the perfect use-case for a project repository. Let's disable the retry logic and fix/disable tests there.

-Gregor

Johnny Stenback

Nov 4, 2015, 11:11:20 AM
to Michael Henretty, Fabrice Desré, dev-...@lists.mozilla.org
On Wed, Nov 4, 2015 at 7:48 AM, Michael Henretty <mhen...@mozilla.com> wrote:
>
> On Wed, Nov 4, 2015 at 4:45 PM, Fabrice Desré <fab...@mozilla.com> wrote:
>>
>> Can we *right now* identify the worst offenders by looking at the tests
>> results/re-runs? You know that sheriffs will very quickly hide and
>> ignore tests that are really flaky.
>
>
>
> Yes, that's an important point. The problem is that you have to actually
> look at the logs of an individual chunk to see which tests failed. If a
> certain Gij test passes at least 1 out of it's 5 given runs, it will not
> surface to Treeherder, which means we can't start it. Looking through each
> chunk log file (of which we have 40 per run) is doable, but more time
> consuming and error prone.

Jumping in on something I haven't been able to pay much attention to
myself, so I may be missing context, but this sounds like it sets
people up to assume that if something occasionally works we're good to
ship it, rather than that if it occasionally fails we need to fix it.
It seems to me that this needs to be flipped around very aggressively
for these tests to provide much value.

- jst

L. David Baron

Nov 4, 2015, 11:13:54 AM
to Michael Henretty, Fabrice Desré, dev-...@lists.mozilla.org
On Wednesday 2015-11-04 16:48 +0100, Michael Henretty wrote:
> On Wed, Nov 4, 2015 at 4:45 PM, Fabrice Desré <fab...@mozilla.com> wrote:
> > Can we *right now* identify the worst offenders by looking at the tests
> > results/re-runs? You know that sheriffs will very quickly hide and
> > ignore tests that are really flaky.
>
> Yes, that's an important point. The problem is that you have to actually
> look at the logs of an individual chunk to see which tests failed. If a
> certain Gij test passes at least 1 out of it's 5 given runs, it will not
> surface to Treeherder, which means we can't start it. Looking through each
> chunk log file (of which we have 40 per run) is doable, but more time
> consuming and error prone.

Can you write a script to gather the data? It seems like you should
be able to get to it through the various JSON files exposed by
Treeherder.

(The code in
https://hg.mozilla.org/users/dbaron_mozilla.com/buildbot-json-tools/
might not work anymore (although it might), but might provide some
useful hints on how to find the right logs. Treeherder source
should provide better hints, but is larger.)
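
Something along these lines, for instance (rough sketch in Node; the
log URLs and the exact failure marker in Gij logs are assumptions
you'd need to check, e.g. by starting from a chunk's live_backing.log
linked from Treeherder's job details panel):

    'use strict';
    var https = require('https');

    // Assumed input: the live_backing.log artifact URLs for every Gij
    // chunk of one push (collect them from Treeherder's job details).
    var LOG_URLS = [];

    // Assumed failure marker; adjust to whatever the Gij logs really print.
    var FAILURE_RE = /TEST-UNEXPECTED-FAIL \| (.+)/;

    function fetchText(url) {
      return new Promise(function (resolve, reject) {
        https.get(url, function (res) {
          var body = '';
          res.on('data', function (chunk) { body += chunk; });
          res.on('end', function () { resolve(body); });
        }).on('error', reject);
      });
    }

    Promise.all(LOG_URLS.map(fetchText)).then(function (logs) {
      var counts = {};
      logs.join('\n').split('\n').forEach(function (line) {
        var m = FAILURE_RE.exec(line);
        if (m) { counts[m[1]] = (counts[m[1]] || 0) + 1; }
      });
      // Worst offenders first.
      Object.keys(counts)
        .sort(function (a, b) { return counts[b] - counts[a]; })
        .forEach(function (t) { console.log(counts[t] + '\t' + t); });
    }).catch(function (err) { console.error(err); });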

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)

David Flanagan

Nov 4, 2015, 12:45:36 PM
to Johnny Stenback, Michael Henretty, Fabrice Desré, dev-...@lists.mozilla.org
The mismatch between the synchronous nature of Marionette tests and the asynchronous nature of the code we want to test just begs for race conditions. And clearly we've got a lot.

But I agree with jst that we can't make excuses here. We've got to fix the tests.

And, going forward, we've got to figure out a better, less error-prone, way to write them.

As we fix them, let's pay attention to what the underlying issues are. I predict that we'll find a handful of common errors repeated over and over again. Maybe we can improve the Marionette API to make them less common. (Can we improve things with Promises?) Or at least we could end up with some kind of "HOWTO write Marionette tests that are not racy" best practices guide.  It would be nice, for example, if we had a naming convention for functions that would help test writers distinguish those that block until some condition is true from those that do not block.
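
Something in this direction, say (a sketch with made-up helper names and selector, just to illustrate the convention; client.helper.waitForElement is the existing marionette-js helper mentioned elsewhere in this thread):

    // Non-blocking query: reports the current state and nothing more.
    function isDialogDisplayed(client) {
      return client.findElement('#confirm-dialog').displayed();
    }

    // Blocking helper: polls until the element shows up (or times out).
    // The waitFor prefix tells the reader it is safe to assert afterwards.
    function waitForDialog(client) {
      return client.helper.waitForElement('#confirm-dialog');
    }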

David S has set us the task of converting our Python Marionette tests to JS. Maybe we can try to get a handle on the raciness issues as part of that conversion. It would be nice if there were some way to ensure that this new batch of tests we'll be writing will not have the automatic retry that the existing tests have.

  David

On Wed, Nov 4, 2015 at 8:10 AM, Johnny Stenback <j...@mozilla.com> wrote:
On Wed, Nov 4, 2015 at 7:48 AM, Michael Henretty <mhen...@mozilla.com> wrote:
>
> On Wed, Nov 4, 2015 at 4:45 PM, Fabrice Desré <fab...@mozilla.com> wrote:
>>
>> Can we *right now* identify the worst offenders by looking at the tests
>> results/re-runs? You know that sheriffs will very quickly hide and
>> ignore tests that are really flaky.
>
>
>
> Yes, that's an important point. The problem is that you have to actually
> look at the logs of an individual chunk to see which tests failed. If a
> certain Gij test passes at least 1 out of it's 5 given runs, it will not
> surface to Treeherder, which means we can't start it. Looking through each
> chunk log file (of which we have 40 per run) is doable, but more time
> consuming and error prone.

Jumping in on something I haven't been able to pay much attention to
myself here so I may be missing context here, but this sounds like it
sets people up to assume that if something occasionally works we're
good to ship it, as opposed to if it occasionally fails we need to fix
it. Seems to me that this needs to be flipped around very aggressively
for these tests to provide much value.

- jst

Michael Henretty

Nov 4, 2015, 12:46:55 PM
to L. David Baron, Fabrice Desré, dev-...@lists.mozilla.org
Both dbaron's and Gregor's suggestions sound good to me. Of the two, I like Gregor's approach since we can use the Treeherder page of the project repo to have a nice view of the oranges. But dbaron's approach probably requires less manpower to set up. Not sure.

Regardless, to do something about this we would need a serious commitment to fixing the situation. We put this retry logic there in the first place because there are Heisenbugs somewhere in the marionette runner, client, or server stacks. I.e., having just one or two people on the B2G side look at this wasn't enough. We would need a task force that includes Gij experts, Taskcluster experts, and Marionette experts. I'm not sure how much effort this would require, but IMO it's worth it.

So I guess the real question is, is it worth that kind of resource commitment to fix?



On Wed, Nov 4, 2015 at 5:13 PM, L. David Baron <dba...@dbaron.org> wrote:
On Wednesday 2015-11-04 16:48 +0100, Michael Henretty wrote:
> On Wed, Nov 4, 2015 at 4:45 PM, Fabrice Desré <fab...@mozilla.com> wrote:
> > Can we *right now* identify the worst offenders by looking at the tests
> > results/re-runs? You know that sheriffs will very quickly hide and
> > ignore tests that are really flaky.
>
> Yes, that's an important point. The problem is that you have to actually
> look at the logs of an individual chunk to see which tests failed. If a
> certain Gij test passes at least 1 out of it's 5 given runs, it will not
> surface to Treeherder, which means we can't start it. Looking through each
> chunk log file (of which we have 40 per run) is doable, but more time
> consuming and error prone.

David Flanagan

Nov 4, 2015, 12:50:43 PM
to Michael Henretty, dev-...@lists.mozilla.org
I've got no idea how the test runners work, but can we manage the too-many-retries problem by turning off retries by default and adding a whitelist of tests that should be retried on failure? That would give us data about which tests are sound, but would still allow sheriffs to quiet the bad intermittents.

Maybe if a test filename begins with "racy_" then it gets retried, otherwise it doesn't?  That would make it very clear to test maintainers which ones need work!
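
In runner terms that could be as small as something like this (pure sketch, not actual harness code; retriesFor and the retry budget are hypothetical):

    var path = require('path');

    // Tests that opt in by filename keep the retry safety net;
    // everything else fails on its first red.
    function retriesFor(testFile) {
      return path.basename(testFile).indexOf('racy_') === 0 ? 5 : 0;
    }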

  David

On Wed, Nov 4, 2015 at 7:39 AM, Michael Henretty <mhen...@mozilla.com> wrote:
Hi Gaia Folk,

If you've been doing Gaia core work for any length of time, you are probably aware that we have *many* intermittent Gij test failures on Treeherder [1]. But the problem is even worse than you may know! You see, each Gij test is run 5 times within a test chunk (g. Gij4) before it is marked as failing. Then that chunk itself is retried up to 5 times before the whole thing is marked as failing. This means that for a test to be marked as "passing," it only has to run successfully once in 25 times. I'm not kidding. Our retry logic, especially those inside the test chunk, make it hard to know which intermittent tests are our worst offenders. This is bad.


My suggestion is to stop doing the retries inside the chunks. That way, the failures will at least surface on Treeherder, which means we can star more test, which means we'll have a lot more visibility on the bad intermittents. Sheriffs will complain a lot, so we have to be ready to act on these bugs. But the alternative is that we continue to write tests with a low "raciness" bar which, IMO, have a much lower chance of catching regressions. The longer we wait, the worse this problem becomes.

Gareth Aye

Nov 4, 2015, 1:27:39 PM
to Michael Henretty, dev-fxos
On Wed, Nov 4, 2015 at 10:39 AM, Michael Henretty <mhen...@mozilla.com> wrote:
Hi Gaia Folk,

If you've been doing Gaia core work for any length of time, you are probably aware that we have *many* intermittent Gij test failures on Treeherder [1]. But the problem is even worse than you may know! You see, each Gij test is run 5 times within a test chunk (g. Gij4) before it is marked as failing. Then that chunk itself is retried up to 5 times before the whole thing is marked as failing. This means that for a test to be marked as "passing," it only has to run successfully once in 25 times. I'm not kidding. Our retry logic, especially those inside the test chunk, make it hard to know which intermittent tests are our worst offenders. This is bad.

I'm not sure that it is so bad. From my own experience, regressions rarely cause intermittent failures. They mostly pop up as permareds. I think it would make sense to demonstrate that we are, in fact, masking a lot of real broken functionality before making our intermittents noisier for sheriffs.

Michael Henretty

Nov 5, 2015, 10:06:56 AM
to Gareth Aye, dev-fxos

On Wed, Nov 4, 2015 at 7:27 PM, Gareth Aye <garet...@gmail.com> wrote:

I'm not sure that it is so bad. From my own experience, regressions rarely cause intermittent failures. They mostly pop up as permareds. I think it would make sense to demonstrate that we are, in fact, masking a lot of real broken functionality before making our intermittents noisier for sheriffs.


I disagree with this mentality. For one thing, QA files bugs all the time that are themselves intermittent. They even have a template item for it, "Repro rate: XX%". With the current retry count of 25x per test, we simply cannot write an effective Gij test for one of these bugs, since the retries will make the test always pass regardless of whether the fix was effective.

More generally though, all I'm suggesting is that we turn these retries down to something more sane, like maybe 5, and then surface these retries on Treeherder as blue runs. Again though, we can't do this unless we make a concerted effort to decrease our overall level of intermittents (and squash some of those Heisenbugs). Otherwise Gij will just wind up hidden again. With Wilfred saying that 2.6 should be a release focused on quality over new features, now would be a great time to make a big testing push.
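
(Same back-of-the-envelope math as before, assuming independent runs: with 5 attempts, a test that fails half the time still goes green about 97% of the time (1 - 0.5^5). That's exactly why surfacing the retries as blue runs matters: the flakiness stays visible even when the job ends up green.)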


Johnny Stenback

Nov 5, 2015, 11:07:51 AM
to Gareth Aye, Michael Henretty, dev-fxos
On Wed, Nov 4, 2015 at 10:27 AM, Gareth Aye <garet...@gmail.com> wrote:
> On Wed, Nov 4, 2015 at 10:39 AM, Michael Henretty <mhen...@mozilla.com>
> wrote:
>>
>> Hi Gaia Folk,
>>
>> If you've been doing Gaia core work for any length of time, you are
>> probably aware that we have *many* intermittent Gij test failures on
>> Treeherder [1]. But the problem is even worse than you may know! You see,
>> each Gij test is run 5 times within a test chunk (g. Gij4) before it is
>> marked as failing. Then that chunk itself is retried up to 5 times before
>> the whole thing is marked as failing. This means that for a test to be
>> marked as "passing," it only has to run successfully once in 25 times. I'm
>> not kidding. Our retry logic, especially those inside the test chunk, make
>> it hard to know which intermittent tests are our worst offenders. This is
>> bad.
>
>
> I'm not sure that it is so bad. From my own experience, regressions rarely
> cause intermittent failures. They mostly pop up as permareds. I think it
> would make sense to demonstrate that we are, in fact, masking a lot of real
> broken functionality before making our intermittents noisier for sheriffs.

I couldn't disagree more. A decade+ of Firefox and Gecko test
automation has mountains of evidence that intermittent failures are
caused by regressions or exposed by seemingly unrelated changes.

- jst

Naoki Hirata

Nov 5, 2015, 11:35:00 AM
to Johnny Stenback, Michael Henretty, Gareth Aye, dev-fxos
Most of the intermittents, from what I've seen, come from race conditions, so it's hard to get a constant reproduction rate.

Adding logging statements might make the issue disappear, because writing the log changes the timing.

I think there are a few tools to help show the concurrent processes: top, b2g-ps, b2g-info... I guess in some sense we'll need to try some profiling too?

Moreover, is there anything QA can do to help narrow these things down faster?
Regression ranges are hard to find for intermittents.


Gareth Aye

Nov 5, 2015, 11:36:10 AM
to Johnny Stenback, Michael Henretty, dev-fxos
Just to be clear, I meant to ask questions, and you can neither agree nor disagree with a question. The assertion here is that the oranges are masking real issues; my intention was really to ask to what extent we know that. I only added my own experience (that many regressions have resulted in permareds rather than oranges) to support the idea that we might look into quantifying the badness of the situation before creating more noise for sheriffs. That part is falsifiable, and it would make more sense to argue (if you're intent on disagreeing with me : ) that it's not worth quantifying the extent to which oranges mask real issues, for reasons x, y, z, etc.

Johnny Stenback

Nov 5, 2015, 11:43:31 AM
to Gareth Aye, Michael Henretty, dev-fxos
Fair enough. I personally don't think it's worth any more time trying
to prove this one way or another, as we've seen intermittent issues
arise time and time again from seemingly unrelated changes. The A-team
at Mozilla has tons of data on this from years of tracking oranges on
tbpl and now Treeherder; jgriffin can point you to that if needed.

My point is simply that if we care at all about quality then we need a
test harness that brings intermittent issues to light as opposed to
one that tries to hide them. From the OP here it sounds like we have
the latter.

- jst

Michael Henretty

Nov 7, 2015, 6:19:35 AM
to Johnny Stenback, Gareth Aye, dev-fxos
Big thanks to Aus for getting work on this kicked off in bug 1222215. If anyone wants to help out with this effort, please take a look at all the blue runs here [1] and grab any intermittent test bugs that you are familiar with.

1.) https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=80a5920fbf8c49400f457501cf80b81fd30468de

Jonathan Griffin

Nov 7, 2015, 12:30:33 PM
to Johnny Stenback, Michael Henretty, Gareth Aye, dev-fxos
As someone on this thread pointed out, the way that Gij is currently run actually hides many of the intermittents from Treeherder, and thus from our ability to track them. So the picture we see is very incomplete, and not an accurate representation of the stability of that suite.

If people want to get serious about fixing the intermittents, we should stop hiding them from Treeherder so we can track them and see what the worst offenders are. This means both stopping (or at least reducing) the internal reruns and stopping the automatic TaskCluster retries (or at least marking them as retriggers instead of retries), although in order for that not to turn into an explosion of orange on Treeherder, we'd probably only want to do one change at a time.

Jonathan


Julien Wajsberg

Nov 9, 2015, 10:55:27 AM
to dev-...@lists.mozilla.org
On 05/11/2015 16:06, Michael Henretty wrote:
On Wed, Nov 4, 2015 at 7:27 PM, Gareth Aye <garet...@gmail.com> wrote:

I'm not sure that it is so bad. From my own experience, regressions rarely cause intermittent failures. They mostly pop up as permareds. I think it would make sense to demonstrate that we are, in fact, masking a lot of real broken functionality before making our intermittents noisier for sheriffs.
I disagree with this mentality. For one thing, QA files bugs all the time that are themselves intermittent. They even have a template item for it, "Repro rate: XX%". With the current retry count of 25x per test, we simply cannot write an effective Gij test for one of these bugs since the retries will make the test always pass regardless of if the fix was effective.

Actually once you know what the bug is, you can always write a test for it. Maybe it will be a unit test and not an integration test, but it's always possible :)


Julien Wajsberg

Nov 9, 2015, 10:58:29 AM
to dev-...@lists.mozilla.org
I usually have a look at the blues as well when I look at my pull request runs, and most of the time it's an issue that's not related to the test. I think this accounts for most of the issues we see, actually. Do we know why this happens?

See for example Gu13 in [1] (full log is [2]).

[1] https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=0d4437d31e8dd8c81d4617d875c7337d858097a4
[2] https://public-artifacts.taskcluster.net/b32i9Hj1T1qurpfDYWyyQA/0/public/logs/live_backing.log


On 04/11/2015 16:39, Michael Henretty wrote:
Hi Gaia Folk,

If you've been doing Gaia core work for any length of time, you are probably aware that we have *many* intermittent Gij test failures on Treeherder [1]. But the problem is even worse than you may know! You see, each Gij test is run 5 times within a test chunk (g. Gij4) before it is marked as failing. Then that chunk itself is retried up to 5 times before the whole thing is marked as failing. This means that for a test to be marked as "passing," it only has to run successfully once in 25 times. I'm not kidding. Our retry logic, especially those inside the test chunk, make it hard to know which intermittent tests are our worst offenders. This is bad.

My suggestion is to stop doing the retries inside the chunks. That way, the failures will at least surface on Treeherder, which means we can star more test, which means we'll have a lot more visibility on the bad intermittents. Sheriffs will complain a lot, so we have to be ready to act on these bugs. But the alternative is that we continue to write tests with a low "raciness" bar which, IMO, have a much lower chance of catching regressions. The longer we wait, the worse this problem becomes.


Michael Henretty

Nov 9, 2015, 11:16:02 AM
to Julien Wajsberg, dev-fxos

On Mon, Nov 9, 2015 at 4:58 PM, Julien Wajsberg <jwaj...@mozilla.com> wrote:
I usually have a look at the blues as well when I look at my pull request runs. And most of the time it's an issue that's not related with the test. I think this account for most of the issues we see, actually. Do we know why this happens ?

I started this thread specifically about Gij (integration tests). Gu (unit tests) has historically had less flakiness, but you're right that there has been some weirdness there recently too. I'm not looking at Gu yet since I think Gij is in worse shape, and I didn't want to conflate this thread with Gu runner issues.

Now, concerning Gij, to address your statement here: "And most of the time it's an issue that's not related to the test. I think this accounts for most of the issues we see, actually."

I thought this might be the case, and indeed I started this thread so we could have more awareness of any harness issues in Gij. But it turns out that the (excellent) work Gareth and Aus have been doing over the last year might have addressed our biggest Gij harness issues (knock on wood as hard as possible). In any case, I have been able to fix most of the intermittents surfaced by bug 1222215 in the tests themselves. I'm hoping this continues to be the case :)


Julien Wajsberg

Nov 9, 2015, 11:30:15 AM
to Michael Henretty, dev-fxos
On 09/11/2015 17:15, Michael Henretty wrote:

On Mon, Nov 9, 2015 at 4:58 PM, Julien Wajsberg <jwaj...@mozilla.com> wrote:
I usually have a look at the blues as well when I look at my pull request runs. And most of the time it's an issue that's not related with the test. I think this account for most of the issues we see, actually. Do we know why this happens ?

I started this thread specifically about Gij (integration tests). Gu (unit tests) historically have had less flakiness, but you're right that there has been some weirdness there recently too. I'm not looking at Gu yet since I think Gij is in worse shape, and I didn't want to conflate this thread with Gu runner issues.

Mmm, you're right that in this case it was a Gu build, but if you look at the log you'll see it's not Gu-specific at all, and I've seen the same issues on Gij runs as well.

Jonas Sicking

Nov 17, 2015, 7:36:02 PM
to Michael Henretty, dev-fxos
Jumping in on an old thread here.

I 100% agree that getting rid of the intermittent failures is really
important. Especially the retry-three-times thing that we are doing.

One really important problem that we need to solve is that the test
harness is causing the socket that marionette uses to sometimes
disconnect.

This means that any test can and does fail intermittently. And fairly
often as I understand it.

At the very least we should detect that this is the problem and rerun
the test. (This is ok since the broken socket is a marionette bug and
not a product bug). But even better is of course to find the source of
this disconnect and fix it.

Many have tried to find and fix this problem, but it's hard since it
only reproduces intermittently. One possible approach would be to try
to catch this in rr. I don't think that has been tried yet.

/ Jonas

Aus Lacroix

Nov 17, 2015, 7:47:32 PM
to Jonas Sicking, Michael Henretty, dev-fxos
Jonas, that particular error is on the decline. Many went away when we rolled out a series of fixes to run the tests on devices. The error itself was a symptom of a different issue. I would imagine that the ones we still see occurring are likely also not directly related to sockit-to-me.

Even though this is the case, we recognize that synchronous tcp socket usage isn't ideal (we didn't necessarily think it was ideal in the first place; it was just the best way to make the tests easy to write).

FFWD to now, we're adding a promise-based tcp driver for marionette which will enable new tests to be written using promises. Marionette calls would always return a promise which you could .then() to do something else. It's a much nicer, standardized pattern.
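
In other words, where today's tests read as straight-line synchronous code, the promise flavor would look roughly like this (a sketch of the intended shape only, not the final API; the app origin and selector are placeholders, and the promise-returning calls are exactly the part that doesn't exist yet):

    marionette('dialer', function () {
      var client = marionette.client();

      test('call log opens', function () {
        // Placeholder origin for whatever app the test drives.
        var origin = 'app://communications.gaiamobile.org';
        return client.apps.launch(origin)
          .then(function () {
            return client.helper.waitForElement('#call-log-button');
          })
          .then(function (button) {
            return button.tap();
          });
      });
    });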

Note that we *WILL* be seeking *ALL OF YOUR HELP* to port existing tests to the new driver. There are simply too many for any single team to handle.

FYI, Andre Natal is working on the new tcp driver for marionette.

Andrew Sutherland

Nov 17, 2015, 8:40:15 PM
to dev-...@lists.mozilla.org
On 11/17/2015 07:47 PM, Aus Lacroix wrote:
> Even though this is the case, we recognize that synchronous tcp socket
> usage isn't ideal (we didn't think it was in the first place,
> necessarily, it was just the best way to make the tests easy to write).

Do we have a mitigation plan in place as we move towards full-async and
promises? One of the nice things about the (ugly) synchronous-TCP-use
in Marionette JS and the old-school mozmill (ugly) nested event loop
spinning (from inside the gecko process on the main thread) was that
they were foolproof from a control flow perspective. You couldn't
forget to yield on a Promise or otherwise fail to wait for the async
thing you just requested to complete.

Marionette JS has the advantage that it mediates all interaction with
the Gecko process so it can ensure that operations occur in order, but
it might be advantageous for it to return promises that insta-fail the
test if the returned Promise does not have then() called on it before
control flow is yielded.

This type of promise isn't something one could use without false
positives/explosions in non-testing code[1], but it seems like it would
be a good way to keep everyone sane. Specifically, if using a driver
like https://github.com/tj/co that can do generator.throw() when a
yielded Promise rejects, it helps keep line numbers and call stacks
sane. It also helps ensure that calls to console.log() don't end up
misleading as control-flow races ahead of where Marionette is at. It
also is beneficial for tests that involve processes other than Gecko,
such as fake HTTP or email servers. (Versus the case where Marionette
JS is super-smart and has its secret queue of pending operations.)
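
Concretely, something in this shape (sketch only; it presumes the
promise-returning client discussed upthread, and co.wrap is the co
library's generator-to-promise wrapper):

    var co = require('co');

    test('add a contact', co.wrap(function* () {
      // Each yield waits for the previous Marionette call to settle; if
      // that promise rejects, co throws right here, so the stack trace
      // points at this line instead of some later, unrelated step.
      var form = yield client.helper.waitForElement('#contact-form');
      yield form.tap();
      yield client.helper.waitForElement('#save-button');
    }));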

Andrew

1: See discussions on enabling promise debugging related to done() and
others. Some starting points for those interested:
https://github.com/domenic/promises-unwrapping/issues/19
https://code.google.com/p/v8/issues/detail?id=3093

Michael Henretty

Nov 18, 2015, 5:30:00 AM
to Aus Lacroix, Jonas Sicking, dev-fxos
On Wed, Nov 18, 2015 at 1:47 AM, Aus Lacroix <a...@mozilla.com> wrote:
Jonas, that particular error is on the decline. Many went away when we rolled out a series of a fixes to run the tests on devices. The error itself was a symptom of a different issue. I would imagine that the ones that we still see occurring are, likely, also not directly related to sockit-to-me.

FWIW, in working on bug 1222215 I haven't seen the dreaded "Cannot call send of undefined" error once. I didn't want to jinx it, but I think we can safely say this is no longer an issue for us (knock on wood).
 

Even though this is the case, we recognize that synchronous tcp socket usage isn't ideal (we didn't think it was in the first place, necessarily, it was just the best way to make the tests easy to write).

FFWD to now, we're adding a promise based tcp driver for marionette which will enable new tests to be written using promises. Marionette calls would always return a promise which you could .then() to do something else. It's a much nicer, and standardized pattern.

Putting in my two cents, I love the fact that sockit-to-me is synchronous. It allows me to read, write, and understand tests very quickly. I realize there are performance problems with a busy wait, but I don't think this is a problem during marionette tests since they aren't user-facing. What other advantages does switching to a Promise-based driver give us? Do we think the tests will be less intermittent?

Jonas Sicking

Nov 18, 2015, 7:03:00 PM
to Michael Henretty, Aus Lacroix, dev-fxos
Great to hear that socket disconnects aren't a big problem any more.
That is really really awesome!!

As far as synchronous vs. async goes, I'd say that optimizing for
making it easy to write tests is generally much more important than
optimizing for getting tests to run fast.

This is especially true if slowness happens on desktop computers
running the test harness, rather than on the phone hardware that we
are testing. Desktop computers are fairly cheap to scale up in
comparison to engineering manpower.

And engineers (me included) tend to dislike test writing and test
debugging enough that anything we can do to make that more pleasant
has a big payoff in the quality and quantity of tests that we have.

/ Jonas

Michael Henretty

Dec 17, 2015, 2:00:44 PM
to Jonas Sicking, Aus Lacroix, dev-fxos
Resurrecting this old thread since intermittents are still a problem, and now that we are back from Mozlando I want to have a clear path forward here.

As a quick refresher, the retry logic inside the Gij test runner makes it very hard to know which tests are the worst intermittents. In bug 1222215, we have been trying to mark and fix all the intermittents we find so that we can disable this retry logic. But this has proven very time-consuming, and in the meantime new intermittents can be introduced and go unnoticed behind the very retry logic bug 1222215 is trying to remove.

Since these intermittents slow down development for everybody, are a general annoyance to sheriffs and devs alike, and give us little confidence that our features are actually protected, here is the path forward we are going to take:

 - Disable all current intermittents that block bug 1222215.
 - Land bug 1222215.
 - Focus efforts on re-enabling the intermittent tests.

This way, we will at least be protected against introducing new racy tests while we fix the old ones. I have spoken to Julie McCracken, and she agrees with this way forward. We will needinfo (ni?) module owners on these tests to get help re-enabling them. Also, if you want to help out with this effort, please reach out to me or Julie. Let's take this opportunity to get our current tests nice and green while we figure out the direction forward for Firefox OS.

Thanks!
Michael



Armen Zambrano Gasparnian

Dec 18, 2015, 10:56:19 AM
to Michael Henretty, Aus Lacroix, Jonas Sicking, dev-fxos
Would separating the intermittent tests out into their own job help here?
That way the retry logic can be removed for the more stable tests.


--
Zambrano Gasparnian, Armen
Automation & Tools Engineer
http://armenzg.blogspot.ca

Michael Henretty

Dec 18, 2015, 12:01:14 PM
to Armen Zambrano Gasparnian, Aus Lacroix, Jonas Sicking, dev-fxos

On Fri, Dec 18, 2015 at 10:56 AM, Armen Zambrano Gasparnian <arm...@mozilla.com> wrote:
Would separating intermittent test jobs into their own separate job help here?
That way the retry logic can be removed for the more stable tests.


This is not a bad idea. However, I think the worst part about the "internal" retry logic is that it hides how bad an intermittent test really is. So perhaps we could move all the disabled/intermittent tests [1] to a job that gets hidden on b2g-inbound.

1.) https://github.com/mozilla-b2g/gaia/blob/master/shared/test/integration/tbpl-manifest.json

Andreas Tolfsen

Dec 20, 2015, 12:45:51 PM
to Michael Henretty, Aus Lacroix, Armen Zambrano Gasparnian, Jonas Sicking, dev-fxos
On 18 December 2015 at 17:01, Michael Henretty <mhen...@mozilla.com> wrote:
>
> On Fri, Dec 18, 2015 at 10:56 AM, Armen Zambrano Gasparnian
> <arm...@mozilla.com> wrote:
>>
>> Would separating intermittent test jobs into their own separate job help
>> here?
>> That way the retry logic can be removed for the more stable tests.
>
> This is not a bad idea. However, I think the worst part about the "internal"
> retry logic is that it hides how bad an intermittent test really is. So
> perhaps we could move all the disabled/intermittent tests [1] to a job that
> gets hidden on b2g-inbound.

Yes, it is possible to make that specific job a tier-2 (or 3), i.e.
hidden, job on Treeherder.

However, I cannot tell if this approach is better than just disabling
these specific tests. It’s my impression that they have been
intermittent for such a long time that they clearly aren't considered
valuable by developers.

Michael Henretty

Dec 31, 2015, 10:42:58 AM
to Andreas Tolfsen, Sheriffs, Aus Lacroix, Armen Zambrano Gasparnian, Julie McCracken, dev-fxos, Jonas Sicking
We have landed bug 1222215 and the Gaia Marionette JS retry logic is finally gone \o/. To make this happen we:
 - increased the default script/search timeout for the marionette js client [1]
 - made the marionette app.launch API more robust [2]
 - made client.helper.waitForElement and waitForElementToDisappear configurable [3][4]
 - disabled about 40 tests [5]

Special thanks to Aus and Gareth (among others) for helping out with this effort.

With the additional 40 tests disabled, we now have ~90 out of ~380 tests disabled, which is roughly 25%. And since we are currently in between releases (so to speak), now would be a great time to fix this up. I highly encourage you to go into our disabled list [5] and start to pick off the tests that are relevant to your module. A great way to get quick results on Treeherder for an intermittent test is to use this commit [6] (replacing the TEST_FILES var with your test path) on top of your test patch.

Also note that because we disabled the retry logic, Gij will fail more often. Please be a good Gaia citizen and keep an eye on the blue runs on gaia [7] and gaia-master [8]. If we catch intermittents there, we can prevent headaches for other teams once they reach b2g-inbound and m-c.

I'll put some tips and tricks in the P.S. section of this email, but feel free to reach out to Aus, Gareth or myself anytime for help with these pesky intermittents. Long story short, if you are using assert you should probably use waitFor instead. But reality can be a bit more complicated, so we are at your disposal.

Thanks!
Michael

P.S.
In case anyone is interested, here are a couple of common problems we saw while going through these tests. First, writing to the settings or contacts DB is erratically slow. We haven't had time to investigate why yet, so make sure to set large timeouts when waiting for these operations to complete. Similarly, opening a new window (dialog, popup, activity) is always slower than on a device, so make sure to wait for windows to show up before switching to them. Also, all tests can fail with the following message: "RangeError: port should be >= 0 and < 65536: 65537, Error: Not connected. To write data you must call connect first." This is the dreaded harness issue that has been around for a long time, but thankfully it only happens 2-3 times out of a hundred now. And if you see "Crash detected but error running stackwalk," it means it's time to submit a PR with some console.logs in it. Hope that helps :)
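
For the assert-vs-waitFor point, the fix usually looks something like this (illustrative only; the selector is made up, and double-check the exact helper/waitFor signatures in the marionette-js client before copying):

    var assert = require('assert');

    // Racy: asserts immediately, before the app has had time to render.
    assert.ok(client.findElement('#call-log li').displayed());

    // Better: poll until the condition actually holds, then assert.
    client.helper.waitForElement('#call-log li');
    client.waitFor(function () {
      return client.findElements('#call-log li').length > 0;
    });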

Andreas Tolfsen

Jan 8, 2016, 12:09:10 PM
to dev-fxos, Sheriffs, Michael Henretty, Aus Lacroix, Armen Zambrano Gasparnian, Jonas Sicking, Julie McCracken
I’d like to commend the work from mhenretty, aus, and gaye on this.
There are still some releng idiosyncrasies between the Gij runs on
Gaia and Gecko trees that it would be good to resolve; they especially
affect those of us who work on components shared between Gecko and
Gaia.

On 31 December 2015 at 15:42, Michael Henretty <mhen...@mozilla.com> wrote:
> Also note that because we disabled the retry logic, Gij will fail more
> often. Please be a good Gaia citizen and keep an eye on the blue runs on
> gaia [7] and gaia-master [8]. If we catch intermittents there, we can
> prevent headaches for other teams once they reach b2g-inbound and m-c.

The retry logic you speak of here is the one embedded within the test
runner, and not the Taskcluster/Treeherder re-run logic, correct?

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1228079 some time
ago about using the same job re-run logic for Gij on try as on
gaia-master. The re-run logic is what makes the jobs blue in
Treeherder.

It would be good if one of the sheriffs had time to pitch in on that bug.

Aus Lacroix

Jan 19, 2016, 7:18:47 PM
to dev-fxos
Hi All,

Some additional news about fixing your intermittent tests.

We recently fixed bug 1175116 [1], which was preventing us from getting the correct JavaScript stack trace when framework errors occurred in the host JavaScript chrome-level code.

Now you get nice, pretty JavaScript stack traces, and *actual* stack traces if the process crashes.

Enjoy!

-aus

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1175116

Johan Lorenzo

Jan 20, 2016, 9:06:43 AM
to Aus Lacroix, dev-fxos
Thank you so much for the fix, guys!

I've been working on fixing a Gij test in Dialer. I got plenty of different errors, but "Crash detected but error running stackwalk" hasn't come back so far! Yay :)

Johan
