
PSA: Flaky timeouts in mochitest-plain now fail newly added tests


Ehsan Akhgari

Dec 12, 2014, 1:35:25 PM
to dev-pl...@lists.mozilla.org
We had a session on intermittent test failures in Portland <
https://etherpad.mozilla.org/ateam-pdx-intermittent-oranges>, and one of
the things that we discussed was adding analyses to our test suites that
detect known bad test writing practices <
https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges>
and fail the offending tests. As part of that effort, I just landed the
patch for bug 649012, which enables this analysis on mochitest-plain.

What this means for you as a test author:
* Using setTimeout(foo, N); where N is not 0 will cause your tests to
fail. You should double check why you're using such timeouts. They are
almost never necessary.
* If you have a legitimate reason for using timeouts like this (such as if
your test needs to wait a bit to make sure an event doesn't happen in the
future, and there is no other way of checking that), you should call
SimpleTest.requestFlakyTimeout("reason");. The argument to the function is
a string which is meant to document why you need to use these kinds of
timeouts, and why that doesn't lead to intermittent test failures. (A short
sketch of both cases follows below.)
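
To make those two bullets concrete, here is a minimal sketch; the event name
"my-custom-event", the window target, and the 2000 ms value are invented for
the example:

// Inside a mochitest-plain test; SimpleTest is provided by the harness.
SimpleTest.waitForExplicitFinish();

// BAD (now flagged): wait an arbitrary 2 seconds and hope the work
// under test has finished by then.
//   setTimeout(checkResults, 2000);

// Legitimate case: verify that an event does NOT fire. Acknowledge
// and document the flaky timeout first.
SimpleTest.requestFlakyTimeout(
  "we have to wait a fixed interval to check that 'my-custom-event' " +
  "never fires; there is no notification for the absence of an event");

var fired = false;
window.addEventListener("my-custom-event", function() { fired = true; });
setTimeout(function() {
  ok(!fired, "'my-custom-event' should not have fired");
  SimpleTest.finish();
}, 2000);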

What this means for you as a code reviewer:
* If you see a patch which uses SimpleTest.requestFlakyTimeout, please
review the usage of timeouts carefully, make sure they are safe in the
test, and check that the reason argument passed to that function documents
them correctly.

What this means for you as a module owner/peer:
* I didn't have time to triage all of the existing tests that this analysis
caught, so I have annotated them all with
|SimpleTest.requestFlakyTimeout("untriaged");|. See <
https://hg.mozilla.org/integration/mozilla-inbound/rev/fa8bd3194aca> for
examples. Please go through the code in your module and try to triage
these tests, and either fix the timeout usages or, if they are indeed OK,
document the reason in requestFlakyTimeout's argument. I did it this way so
that we could start blocking new bad tests from landing as soon as possible.
(A sketch of what triaging one of these annotations might look like follows
below.)
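
As a rough illustration only (the reason text and the 'statechange' event
below are invented), triaging one of these annotations might look like this:

// Before triage: the blanket annotation from the mass change.
SimpleTest.requestFlakyTimeout("untriaged");

// After triage, either the timeout turns out to be unnecessary and is
// replaced by an event listener / callback, or the reason is spelled out:
SimpleTest.requestFlakyTimeout(
  "the test must let 500ms pass to confirm that no further " +
  "'statechange' event arrives after the last expected one");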

I have turned this on only for mochitest-plain for now, and am planning to
do the same for mochitest-chrome/a11y/browser-chrome very soon as well.

Cheers,
--
Ehsan

Jonas Sicking

Dec 12, 2014, 5:33:41 PM
to Ehsan Akhgari, dev-platform
Awesome! I think this would be great to have for the integration tests
in B2G as well.

/ Jonas
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform

Ehsan Akhgari

Dec 12, 2014, 5:39:15 PM
to Jonas Sicking, dev-platform
On 2014-12-12 5:33 PM, Jonas Sicking wrote:
> Awesome! I think this would be great to have for the integration tests
> in B2G as well.

It is already live for all mochitest-plain runs on B2G, or did you mean
another test suite? If the latter, I'd be happy to file bugs and help
folks adopt the check; the check itself is super simple.

Jonas Sicking

Dec 12, 2014, 5:55:08 PM
to Ehsan Akhgari, dev-platform
On Fri, Dec 12, 2014 at 2:39 PM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
> On 2014-12-12 5:33 PM, Jonas Sicking wrote:
>>
>> Awesome! I think this would be great to have for the integration tests
>> in B2G as well.
>
>
> It is already live for all mochitest-plains run on b2g, or did you mean
> another test suite?

I meant another test suite. We have two "Gaia Integration" test suites
which are essentially our version of "browser chrome" tests.

See the Gij and Gip tests for the "B2G Desktop Linux x64 Opt" platform.

The Gij ones are probably more important to address since we're
migrating to that harness from Gip.

/ Jonas

Ehsan Akhgari

Dec 12, 2014, 6:08:10 PM
to Jonas Sicking, dev-...@lists.mozilla.org, dev-platform
Filed <https://bugzilla.mozilla.org/show_bug.cgi?id=1111068>. I hope I
got the component right!


Martijn

Dec 17, 2014, 4:28:30 PM
to Ehsan Akhgari, dev-platform
What about setInterval? (I guess that's not really used much?)
What about where setTimeout is used as a fallback for when some event
failed to fire and the mochitest is stalled and the setTimeout is then
used to finish the mochitest on time and give some useful debug info?

Regards,
Martijn

Daniel Holbert

Dec 17, 2014, 4:56:25 PM
to Martijn, Ehsan Akhgari, dev-platform
On 12/17/2014 01:27 PM, Martijn wrote:
> What about where setTimeout is used as a fallback for when some event
> failed to fire and the mochitest is stalled and the setTimeout is then
> used to finish the mochitest on time and give some useful debug info?

This exact scenario was called out in the initial message on this
thread as a legitimate reason to use setTimeout, with instructions
on what to do. Quoting ehsan:

> On Fri, Dec 12, 2014 at 10:34 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>> * If you have a legitimate reason for using timeouts like this (such as if
>> your test needs to wait a bit to make sure an event doesn't happen in the
>> future, and there is no other way of checking that), you should call
>> SimpleTest.requestFlakyTimeout("reason");. The argument to the function is
>> a string which is meant to document why you need to use these kinds of
>> timeouts, and why that doesn't lead into intermittent test failures.

~Daniel

Nils Ohlmeier

Dec 17, 2014, 8:33:33 PM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org

> On Dec 12, 2014, at 10:34 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>
> We had a session on intermittent test failures in Portland <
> https://etherpad.mozilla.org/ateam-pdx-intermittent-oranges>, and one of
> the things that we discussed was adding analyses to our test suites that
> detect known bad test writing practices <
> https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges>
> and fails the offending tests. As part of that effort, I just landed the
> patch to bug 649012 which enables this analysis on mochitest-plain.

Great idea.

> What this means for you as a test author:
> * Using setTimeout(foo, N); where N is not 0 will cause your tests to
> fail. You should double check why you're using such timeouts. They are
> almost never necessary.
> * If you have a legitimate reason for using timeouts like this (such as if
> your test needs to wait a bit to make sure an event doesn't happen in the
> future

What about the other way around?
My test is waiting for an event to proceed with the test. All of our WebRTC
mochitests depend heavily on events from the WebRTC APIs. Now the
obvious answer is: mochitest has a generic timeout of 300 seconds (I think)
which will terminate the test in case the expected event never fired. The
problem I have with that is that, as said before, the WebRTC tests depend
on a whole sequence of events happening. And if the generic mochitest timeout
kills a test it is very hard to tell which event exactly was missing, especially
as we are orchestrating two actors in the browser and both wait for the same
events.
For that reason I got into the habit of creating a timeout while the test is
waiting for an event to happen, because if that timer fires I can print a
precise error message about what is missing and why the test is exiting early
(besides the fact that the test then does not block for another 2xx seconds).

So while I totally agree with the description here
https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges
that it is bad to magically wait for time and hope that it is safe to proceed
with your test, I do not agree with the assessment that setTimeout in general
is evil.

> , and there is no other way of checking that), you should call
> SimpleTest.requestFlakyTimeout("reason");. The argument to the function is
> a string which is meant to document why you need to use these kinds of
> timeouts, and why that doesn't lead into intermittent test failures.

If we really want people not to use setTimeout at all in tests, but they have a
legitimate use case for it:
- your test needs to wait for X time and check that an event either happened
or is missing, and exit the test execution if the check failed
Can we then maybe add a convenience function for exactly that purpose?
I would prefer a convenience function, because that would allow us to still
catch the evil “wait magically for a little bit” as bad behavior. Whereas right
now I have no other choice than requesting to lift the whole protection via
requestFlakyTimeout().

What I really do not like is the name chosen for this, as it seems to suggest
that timeouts in general are flaky, which they are not (see above).

Can we please rename this to something more appropriate like
SimpleTest.requestUnreliableTimeout(“reason”);
Because I think with “flaky” you mean unreliable.

And while we are on improving this: please, please, please improve the error
message. Right now someone who writes a new test will see something
like this in their try logs:
"Test attempted to use a flaky timeout value 5000"
And they will be baffled, like me and some colleagues this morning, about the
error message and what it means. Would it have been acceptable if I
had used 4999 instead of 5000? In the case of the WebRTC tests you can
very easily write a legitimate test case without ever calling setTimeout
yourself, but some of the generic helper code will do it for the test writer.
And then the current error message essentially means nothing to them.

Best
Nils Ohlmeier

Ehsan Akhgari

Dec 17, 2014, 9:22:30 PM
to Martijn, dev-platform
On 2014-12-17 4:27 PM, Martijn wrote:
> What about setInterval? (I guess that's not really used much?)

Yes, setInterval is not used that much.

> What about where setTimeout is used as a fallback for when some event
> failed to fire and the mochitest is stalled and the setTimeout is then
> used to finish the mochitest on time and give some useful debug info?

What dholbert said. :-)

> On Fri, Dec 12, 2014 at 10:34 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>> We had a session on intermittent test failures in Portland <
>> https://etherpad.mozilla.org/ateam-pdx-intermittent-oranges>, and one of
>> the things that we discussed was adding analyses to our test suites that
>> detect known bad test writing practices <
>> https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges>
>> and fails the offending tests. As part of that effort, I just landed the
>> patch to bug 649012 which enables this analysis on mochitest-plain.
>>
>> What this means for you as a test author:
>> * Using setTimeout(foo, N); where N is not 0 will cause your tests to
>> fail. You should double check why you're using such timeouts. They are
>> almost never necessary.
>> * If you have a legitimate reason for using timeouts like this (such as if
>> your test needs to wait a bit to make sure an event doesn't happen in the
>> future, and there is no other way of checking that), you should call
>> SimpleTest.requestFlakyTimeout("reason");. The argument to the function is
>> a string which is meant to document why you need to use these kinds of
>> timeouts, and why that doesn't lead into intermittent test failures.
>>

Ehsan Akhgari

Dec 17, 2014, 9:33:12 PM
to Nils Ohlmeier, dev-pl...@lists.mozilla.org
On 2014-12-17 8:33 PM, Nils Ohlmeier wrote:
>
>> On Dec 12, 2014, at 10:34 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>>
>> We had a session on intermittent test failures in Portland <
>> https://etherpad.mozilla.org/ateam-pdx-intermittent-oranges>, and one of
>> the things that we discussed was adding analyses to our test suites that
>> detect known bad test writing practices <
>> https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges>
>> and fails the offending tests. As part of that effort, I just landed the
>> patch to bug 649012 which enables this analysis on mochitest-plain.
>
> Great idea.
>
>> What this means for you as a test author:
>> * Using setTimeout(foo, N); where N is not 0 will cause your tests to
>> fail. You should double check why you're using such timeouts. They are
>> almost never necessary.
>> * If you have a legitimate reason for using timeouts like this (such as if
>> your test needs to wait a bit to make sure an event doesn't happen in the
>> future
>
> What about the other way around?
> My test is waiting for an event to proceed with the test. All of our WebRTC
> mochitests highly depend on events from the WebRTC API’s. Now the
> obvious answer is: mochitest has a generic timeout of 300 seconds (I think)
> which will terminate the test in case the expected event never fired. The
> problem I have with that is that as said before the WebRTC tests depend
> on whole sequence of events to happen. And if the generic mochitest timeout
> kills a test it is very hard to tell which events exactly was missing, especially
> as we are orchestrating two actors in the browser and both wait for the same
> events.

Usually a good way of handling this problem is to log the steps in the
test via info() so that you can go back and look at the log to see how
much of the test was executed. (I think these days you need to call
SimpleTest.requestCompleteLog() to make that actually work.)
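
A minimal sketch of that logging approach, with a placeholder async step
standing in for the real test logic:

SimpleTest.requestCompleteLog();  // keep the full log in the report
SimpleTest.waitForExplicitFinish();

function startCall() {            // placeholder for a real async step
  return new Promise(function(resolve) { setTimeout(resolve, 0); });
}

info("step 1: starting the call");
startCall().then(function() {
  info("step 2: call started, waiting for the next event");
  // ... each subsequent step logs via info(), so a harness timeout
  // shows exactly how far the test got.
  SimpleTest.finish();
});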

> For that reason I got used to the habit of creating a timeout while the test is
> waiting for an event to happen, because if that timer fires I can precisely
> print an error message what is missing and why test is exiting early (besides
> the fact that the test is not blocking for another 2xx seconds).

If the above is not an option, this is a legitimate use of setTimeout.
You can describe this in the string argument to requestFlakyTimeout().

> So while I totally agree with the description here
> https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges
> that it is bad to magically wait for time and hope that it is safe to proceed
> with your test, I do not agree with the assessment that setTimeout in general
> is evil.

I think the text in
<https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges#Using_magical_timeouts_to_cause_delays>
describes the issue with setTimeout well. Nobody is saying
that setTimeout in general is evil, but an automated analysis has no way
of knowing which uses are legitimate, which is why we have the current
requestFlakyTimeout mechanism.

>> , and there is no other way of checking that), you should call
>> SimpleTest.requestFlakyTimeout("reason");. The argument to the function is
>> a string which is meant to document why you need to use these kinds of
>> timeouts, and why that doesn't lead into intermittent test failures.
>
> If we really want to have people not use setTimeout at all in test, but have a
> legitimate use case for it:
> - your test needs to wait for X time and check that either an event happened
> or is missing and exit the test execution if the check failed
> Can we then maybe add a connivence function for exactly that purpose?
> I would prefer a connivence function, because that would allow us to still
> catch the evil wait magically for a little bit as bad behavior. Where right now
> I have no other choice then requesting to lift the whole protection via
> requestFlakyTimeout().

Well, for most people the existing mochitest timeout mechanism plus the
log messages as the test proceeds is enough in that case. I understand
that adding the proper log messages after the fact can be a lot of work
(although I don't know how big the WebRTC test suite that uses this
pattern is.)

> What I really do not like is the name chosen for this, as it seems to suggest
> that timeouts in general are flaky, which they are not (see above).
>
> Can we please rename this to something more appropriate like
> SimpleTest.requestUnreliableTimeout(“reason”);
> Because I think with “flaky” you mean unreliable.

I really don't see either name as clearly better than the other, and
that is a lot of sed'ing (around 580 occurrences of that phrase), but if
you feel strongly enough to write a patch, I'd be happy to review. :-)

> And while we are on improving this: please, please, please improve the error
> message. Right now someone who writes a new test he will see something
> like this in his try logs:
> "Test attempted to use a flaky timeout value 5000"
> And he will be baffled like me and some colleagues this morning about the
> error message and what it means. Would it have been acceptable if I
> would have used 4999 instead of 5000? In case of the WebRTC tests you can
> very easily write a legit test case without ever calling setTimeout yourself,
> but some of the generic helper code will do it for the test writer. And then
> the current error message essentially means nothing to him.

Yes, we can and should definitely improve the error message. Since you
hit this in practice, what error message would have helped you
understand the issue better? Perhaps we should at least link to
<https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges>?

Thanks for the feedback!
Ehsan

Daniel Holbert

Dec 18, 2014, 2:47:24 AM
to Martijn, Ehsan Akhgari, dev-platform
Sorry -- after re-reading, I realized I was wrong here -- your example
scenario is actually different from the legitimate scenario I alluded to
in the first message of this thread.

The legitimate scenario from that first message was:
- We're expecting that an event *will not* fire.
- We wait a bit to see if it fires.
- Fail if it fires before the timeout expires.

(Clearly, regardless of the timeout we choose -- and unexpected delays
-- any failures here are "real" and indicate bugs.)

In contrast, your scenario is:
- We're expecting that an event *will* fire.
- We wait a bit to see if it fires.
- Fail if the event *does not* fire before the timeout expires.

Here, failures are iffy - they may or may not be "real" depending on
delays and whether the timeout was long enough. This is precisely the
sort of thing that results in random-oranges (when e.g. new
test-platforms are added that are way slower than the system that the
test was developed on).

So, the idea now is that there should be a high threshold for adding the
second sort of test (ideally, we should just be using an event-listener
where possible). If a setTimeout is really needed for some reason in
this kind of scenario, the justification (and rationale for why it won't
cause random orange) will need to be explicitly documented in the test.

~Daniel

Martijn

Dec 18, 2014, 12:01:25 PM
to Daniel Holbert, Ehsan Akhgari, dev-platform
On Wed, Dec 17, 2014 at 1:56 PM, Daniel Holbert <dhol...@mozilla.com> wrote:
> On 12/17/2014 01:27 PM, Martijn wrote:
>> What about where setTimeout is used as a fallback for when some event
>> failed to fire and the mochitest is stalled and the setTimeout is then
>> used to finish the mochitest on time and give some useful debug info?
>
> This exact scenario was called out in the initial message on this
> thread, as a legitimate reasons to use setTimeout, and with instructions
> on what to do. Quoting ehsan:

Sorry, I was under the impression that the usage of requestFlakyTimeout was
wrong at all times.
I guess it's only wrong when the comment inside it tells you that it's
not legitimate usage.

Regards,
Martijn

>> On Fri, Dec 12, 2014 at 10:34 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>>> * If you have a legitimate reason for using timeouts like this (such as if
>>> your test needs to wait a bit to make sure an event doesn't happen in the
>>> future, and there is no other way of checking that), you should call
>>> SimpleTest.requestFlakyTimeout("reason");. The argument to the function is
>>> a string which is meant to document why you need to use these kinds of
>>> timeouts, and why that doesn't lead into intermittent test failures.
>
> ~Daniel

Nils Ohlmeier

Dec 18, 2014, 5:28:51 PM
to Daniel Holbert, Martijn, Ehsan Akhgari, dev-platform

> On Dec 17, 2014, at 11:47 PM, Daniel Holbert <dhol...@mozilla.com> wrote:
> In contrast, your scenario is:
> - We're expecting that an event *will* fire.
> - We wait a bit to see if it fires.
> - Fail if it the event *does not* fire before the timeout expires.
>
> Here, failures are iffy - they may or may not be "real" depending on
> delays and whether the timeout was long enough. This is precisely the
> sort of thing that results in random-oranges (when e.g. new
> test-platforms are added that are way slower than the system that the
> test was developed on).
>
> So, the idea now is that there should be a high threshold for adding the
> second sort of test (ideally, we should just be using an event-listener
> where possible).

Well there is an event listener waiting for the event to fire.
But how else than through a timeout, obviously with a high timeout value like
30 or 60 seconds, can your test tell/complain that the event is missing?
The APIs usually do not provide event listeners for events that do NOT fire.

Sure, I can print an info message that my test is now waiting for an event to pop,
but lots of my tests don’t just sit there and wait; they proceed and do other things,
but eventually they have to block waiting for the event callback.
In that scenario only the overall timeout will terminate my test. And somewhere
far up in the log I might find a log message. But as I tried to describe in my other
email, having a long-lived timer which fires and complains that event X is missing
is, I think, a legit use case for setTimeout in a test.

Nils

Ehsan Akhgari

Dec 18, 2014, 7:41:38 PM
to Nils Ohlmeier, Daniel Holbert, Martijn, dev-platform
On 2014-12-18 5:28 PM, Nils Ohlmeier wrote:
>
>> On Dec 17, 2014, at 11:47 PM, Daniel Holbert <dhol...@mozilla.com> wrote:
>> In contrast, your scenario is:
>> - We're expecting that an event *will* fire.
>> - We wait a bit to see if it fires.
>> - Fail if it the event *does not* fire before the timeout expires.
>>
>> Here, failures are iffy - they may or may not be "real" depending on
>> delays and whether the timeout was long enough. This is precisely the
>> sort of thing that results in random-oranges (when e.g. new
>> test-platforms are added that are way slower than the system that the
>> test was developed on).
>>
>> So, the idea now is that there should be a high threshold for adding the
>> second sort of test (ideally, we should just be using an event-listener
>> where possible).
>
> Well there is an event listener waiting for the event to fire.
> But how else then through a timeout, obviously with a high timeout value like
> 30 or 60 seconds, can your test tell/complain that the event is missing?
> The API’s usually do not provide event listeners for NOT firing events.

mochitest already has a per-test timeout value which will mark the test
as failed in that case. You do not need to roll your own timer on top
of it.

> Sure I can print an info message that my test is now waiting for an event to pop,
> but lots of my tests don’t just sit there and wait, they proceed and do other things,
> but eventually they have to block on waiting for the event callback.
> In that scenario only the over all timeout will terminate my test. And somewhere
> far up in the log I might find a log message. But as I tried to describe in my other
> email having a long living timer which pops complaining that event X is missing
> I think is a legit use case for setTimeout in test.

If you have several async operations you need to wait on, here is the
proper way of doing things:

var gRemainingTests = 10; // the number of async ops you're waiting on
function asyncOpFinished() {
  if (--gRemainingTests == 0) {
    SimpleTest.finish(); // we're done
  }
}

// Now, call asyncOpFinished() in the event handler for each of those
// async ops.

If you need to make sure you only get one of a specific type of event,
you can maintain one counter per event type, set the initial value to 1,
and check all values for being zero before calling finish().
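
A sketch of that per-event-type variant (the event names here are invented):

// One counter per event type; each event must fire exactly once.
var gPending = { "loadeddata": 1, "ended": 1 };

function gotEvent(name) {
  ok(gPending[name] > 0, "got an unexpected extra '" + name + "' event");
  gPending[name]--;
  var allDone = Object.keys(gPending).every(function(key) {
    return gPending[key] == 0;
  });
  if (allDone) {
    SimpleTest.finish();
  }
}

// In each event listener, call gotEvent("loadeddata"), gotEvent("ended"), etc.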

HTH,
Ehsan


Boris Zbarsky

Dec 19, 2014, 1:07:19 AM
to
On 12/18/14, 2:28 PM, Nils Ohlmeier wrote:
> Well there is an event listener waiting for the event to fire.
> But how else then through a timeout, obviously with a high timeout value like
> 30 or 60 seconds

We've had quite a number of random oranges from precisely this scenario.
It seems that it's not uncommon for the test VMs to completely lose
the timeslice for 60 seconds or more.

If the test is in fact expecting the event to fire, the right thing to
do is to just have the test time out per the harness timeout (which can
be globally adjusted as needed depending on the characteristics of the
test environment) if the event doesn't fire. Rolling your own shorter
timer just means contributing to random orange.

> Sure I can print an info message that my test is now waiting for an event to pop,

Sure, and another one when the event comes in. Then you can check the
log to see whether the timeout was for this reason in the event of a
harness timeout on this test.

> But as I tried to describe in my other
> email having a long living timer which pops complaining that event X is missing
> I think is a legit use case for setTimeout in test.

Given the number of random oranges we've had from _precisely_ this use
of setTimeout, I don't think it's legit.

-Boris

Ehsan Akhgari

Dec 19, 2014, 9:57:03 AM
to Boris Zbarsky, dev-pl...@lists.mozilla.org
Let me try to rephrase the problem in different terms, to hopefully make
it clearer why using timers like this is a bad idea.

setTimeout(foo, 1000) may seem to suggest "run foo after 1 second", but
that is *not* what that function does, at all. What it does is run foo
after 1 second, or perhaps at any other point in time later than that;
it could even be *minutes* later.

A lot of people are under the assumption that just because their local
timings pan out, and they've added a, let's say, 10x error margin to
their timers, their use of timers is fine. It's not, as has been
demonstrated by intermittent oranges over the years. We have no way of
knowing whether those calculations hold under various types of machine
loads, on new platforms, or on much lower-powered machines (think of
running tests, whose timings were measured on a beefy desktop, on a
low-end phone). And increasing the error margin won't save us; it will
just move the problem further into the future.

Also, please note that letting the test time out if it doesn't succeed
is a perfectly valid failure scenario; we have numerous tests that do
things like fire an async event, set up a listener, wait for the
listener to fire, and call SimpleTest.finish() there, without any timers
to catch the case where the event does not fire. Logging sufficiently is
almost always enough to not have to use these timers, as those tests
have demonstrated in practice.
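
As a self-contained sketch of that pattern (the "call-established" event and
its asynchronous dispatch are stand-ins for a real async operation):

SimpleTest.waitForExplicitFinish();

// No fallback timer here: if the event never comes, the harness
// timeout fails the test, and the info() line shows where it stalled.
info("waiting for 'call-established'");
window.addEventListener("call-established", function onEstablished() {
  window.removeEventListener("call-established", onEstablished);
  ok(true, "the call was established");
  SimpleTest.finish();
});

// Stand-in for the real async operation under test: dispatch the event
// on a later turn of the event loop.
setTimeout(function() {
  window.dispatchEvent(new Event("call-established"));
}, 0);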

Cheers,
Ehsan

Gijs Kruitbosch

Dec 19, 2014, 11:34:35 AM
to Ehsan Akhgari, Boris Zbarsky
On 19/12/2014 14:56, Ehsan Akhgari wrote:
> Logging sufficiently is
> almost always enough to not have to use these timers, as those tests
> have demonstrated in practice.

Who's working on improving the log output from tinderbox? Timestamps get
mashed [0], sometimes only the failing assertion is printed with no
other logs at all [1], ... it's not really such a happy world as you assert.

~ Gijs

[0] https://bugzilla.mozilla.org/show_bug.cgi?id=1049545#c0
[1]
https://treeherder.mozilla.org/ui/logviewer.html#?repo=fx-team&job_id=1494921

Ehsan Akhgari

Dec 19, 2014, 1:09:01 PM
to Gijs Kruitbosch, Boris Zbarsky, dev-pl...@lists.mozilla.org
On Fri, Dec 19, 2014 at 11:34 AM, Gijs Kruitbosch <gijskru...@gmail.com>
wrote:
>
> On 19/12/2014 14:56, Ehsan Akhgari wrote:
>
>> Logging sufficiently is
>> almost always enough to not have to use these timers, as those tests
>> have demonstrated in practice.
>>
>
> Who's working on improving the log output from tinderbox?


I'm not sure.


> Timestamps get mashed [0],


File a bug, I guess?


> sometimes only the failing assertion is printed with no other logs at all
> [1],


This was done by Nathan Froyd, with the goal of reducing the log file
size. This annoys me as well, but there is a remedy:
SimpleTest.requestCompleteLog().


> ... it's not really such a happy world as you assert.
>

Not sure if you're complaining about the state of things or if you're
suggesting that the usage of these types of timeouts is better. I hope
that the discussion in this thread has proved the latter false, at least.
If you have specific suggestions on how to make the logging story better,
please file bugs, but I think holding back changes such as blacklisting
well-known intermittent-failure-prone patterns until we get our logging story
straight is not very reasonable. For one thing, requestCompleteLog() is
*much* easier to use than any of these magical timeouts, if you need it.

Cheers,
Ehsan
--
Ehsan

Nils Ohlmeier

Dec 19, 2014, 4:40:17 PM
to Ehsan Akhgari, Boris Zbarsky, dev-pl...@lists.mozilla.org

> On Dec 19, 2014, at 6:56 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>
> Let me try to rephrase the problem in different terms, to hopefully make it clearer why using timers like this is a bad idea.
>
> setTimeout(foo, 1000) may seem to suggest run foo after 1 second, but that is *not* what that function does, at all. What it does is, run foo after 1 second, or perhaps at any other point in time later than that, and it could even be *minutes* later.
>
> A lot of people are under the assumption that just because their local timings pan out, and they've added a, let's say, 10x error margin to their timers, their use of timers is fine. It's not, as has been demonstrated as intermittent oranges over the years. We have no way of knowing whether those calculations hold under various types of machine loads, on new platforms, on much lower powered machines (think of running the tests where the such measurements have been done on a beefy desktop on a low-end phone.) And increasing the error margin won't save us, it will just move the problem further into the future.
>
Ok. That’s fine by me.
First of all I assume that the slowness of the machine also affects mochitest’s overall timer, right? If not, then we are in big trouble.
Secondly, remember that I’m NOT doing anything useful when my “missing-event-timer” pops. I’m just logging a message that the event is missing and then I’m trying to exit the test early (earlier than the mochitest’s general test timeout).

The only problem I can see with that approach is: if the “missing-event-timer” actually pops more or less precisely after the requested time, BUT the actual work is running in super slow motion for some reason. And so the event would have arrived, but only after the event timeout.
So if my “missing-event-timers” are popping on time because they are for some bizarre reason not affected by the overall slowness of the machine, then we have a problem with this. But in that case we should have a problem with the overall test timeout of mochitest as well, shouldn’t we?

> Also, please note that letting the test time out if it doesn't succeed is a perfectly valid failure scenario, we have numerous tests that do things like fire an async event, setup a listener and wait for the listener to fire and SimpleTest.finish() there, without any timers to catch the case where the event does not fire.

My tests are not about testing this one timer. Doing that would be relatively easy.
In WebRTC I need to orchestrate two participants to establish a real-time audio/video call. Lots of the tests need to establish a full call before assessing whether the call got properly established under the given criteria. The most common scenario for intermittent failures is that the call does not even get established. From start to a full call I need to verify several state machines and lots of events firing, and all of that twice for the two participants of each call.
So yes, I could simply try to establish a call (without worrying about all the intermediary steps and events), then do the verification/assessment, and let call failures be caught by the generic mochitest timeout. But then I would spend hours and hours staring at pretty useless log files (see below) every time a call got stuck.

> Logging sufficiently is almost always enough to not have to use these timers, as those tests have demonstrated in practice.
>

I disagree. Especially as the logging of mochitest is really poor.
- the timestamps of log messages are lost as described in bug 1020516
- SimpleTest only has one log level: info(). Which I think is the main reason the log files get big, because test writers can’t fine-tune the logging.
- Log messages from mochitest are not printed together with the stdout/stderr of the browser (so correlating these two is almost impossible)
- Logging behaves differently on your local machine than on the build servers

BTW I understand that WebRTC has pretty special and probably unique requirements compared to what the rest of the mochitest users do with it. That’s why I was trying to make the case for adding a special convenience function for setting up these special “missing-event-timers”, which BTW would also allow us to increase or decrease the timeout values based on the platform/machine the test currently gets executed on.

But as I don’t seem to be able to make my case here I’ll simply live with requesting flaky timeouts for all WebRTC tests.

Best
Nils

> Cheers,
> Ehsan
>
> On 2014-12-19 1:07 AM, Boris Zbarsky wrote:

Ehsan Akhgari

Dec 19, 2014, 6:59:04 PM
to Nils Ohlmeier, Boris Zbarsky, dev-pl...@lists.mozilla.org
On 2014-12-19 4:40 PM, Nils Ohlmeier wrote:
>
>> On Dec 19, 2014, at 6:56 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
>>
>> Let me try to rephrase the problem in different terms, to hopefully make it clearer why using timers like this is a bad idea.
>>
>> setTimeout(foo, 1000) may seem to suggest run foo after 1 second, but that is *not* what that function does, at all. What it does is, run foo after 1 second, or perhaps at any other point in time later than that, and it could even be *minutes* later.
>>
>> A lot of people are under the assumption that just because their local timings pan out, and they've added a, let's say, 10x error margin to their timers, their use of timers is fine. It's not, as has been demonstrated as intermittent oranges over the years. We have no way of knowing whether those calculations hold under various types of machine loads, on new platforms, on much lower powered machines (think of running the tests where the such measurements have been done on a beefy desktop on a low-end phone.) And increasing the error margin won't save us, it will just move the problem further into the future.
>>
> Ok. That’s fine by my.
> First of all I assume that the slowness of the machine also effects mochitest’s over all timer, right?

That's correct.

> If not, then we are in big trouble.
> Secondly remember that I’m NOT doing anything useful when my “missing-event-timer” pops. I’m just logging a message that the event is missing and then I’m trying to exit the test early (earlier then the mochitests general test timeout).

Your approach is _still_ faulty in that you are assuming that section of
your test will reasonably finish in 60s, but if the test slave is
starved, the 60s timer may fire *before* the event you are waiting for
has a chance to be dispatched. Based on that, your approach is prone to
intermittent failures.

Of course, the exact level of problem depends on what you mean by "log"
above. If you mean something that is *actually* side-effect free such
as dump() or info(), then the worst thing that can happen is that you
may see misleading log messages sometimes. But if you do *anything*
which can have side-effects, you should stop using setTimeout ASAP.
That includes things such as ok(false, "the event didn't arrive"),
SimpleTest.finish(), etc.

At this point though, you should be asking yourself, why go through the
pain of ensuring those timer callbacks are side-effect-free and that
they remain so for the years to come? Really, the problem that you are
trying to protect yourself against doesn't exist. So, please just don't
use setTimeouts like that, ever! :-) That will keep things much
simpler, and less footgun-ish.

> The only problem I can see with that approach is: if the “missing-event-timer” actually pops more or less precisely after the requested time, BUT the actual work is running super slow motion for some reason. And so the event would have arrived, but after the even-timeout.
> So if my “missing-event-timers” are popping on time, because they are for some bizarre reason not affected by the over all slowness of the machine then we have a problem with this. But in that case we should have a problem with the over test timeout of mochitest as well, or?

The default mochitest timeout is 5 minutes. And you're right, if your
test takes longer than that, it will fail. If it does so
intermittently, then it will fail intermittently. And if that happens,
the solution is to either use SimpleTest.requestLongerTimeout() to
request a longer timeout, or to split up the test. Years of experience
has shown that the default 5-minute timeout is sufficient for all of our
platforms in practice. And in the future, if that proves to be too
short, we can change it globally with a one-line change (in
TestRunner.js). If individual test authors roll their own clones of
the harness timeout mechanism, updating them in that situation will
become very very hard, if possible at all.
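
For completeness, the first remedy is a one-liner (the factor of 2 is
arbitrary):

// Ask the harness for twice the default per-test timeout instead of
// rolling a custom timer inside the test.
SimpleTest.requestLongerTimeout(2);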

>> Also, please note that letting the test time out if it doesn't succeed is a perfectly valid failure scenario, we have numerous tests that do things like fire an async event, setup a listener and wait for the listener to fire and SimpleTest.finish() there, without any timers to catch the case where the event does not fire.
>
> My test are not about testing this one timer. Doing that would be relatively easy.
> In WebRTC I need to orchestrate two participants to establish a real time audio video call. Lots of the tests need to establish a full call, before assessing if the call got properly established under the given criteria. The most common scenario for intermittent failures is that the call does not even gets established. From start to a full call I need to verify several state machines and lots of events firing and all of that twice for the two participants of each call.
> So yes I could simply try to establish a call (and not worrying about all the intermediary steps and events) and then do the verification/assessment. And let call failures be caught by the generic mochitest timeout. But then I would spend hours and hours staring at pretty useless log files (see below) every time a call got stuck.

Can you please point me to the code in question?

>> Logging sufficiently is almost always enough to not have to use these timers, as those tests have demonstrated in practice.
>>
>
> I disagree. Especially as the logging of mochitest is really poor.
> - the timestamps of log messages are lost as described in bug 1020516
> - SimpleTest only has one log level: info(). Which I think is the main reason the log files get big, because test writers can’t fine tune the logging.
> - Log messages from mochitest are not printed together with the stdout/stderr of the browser (so correlating these two is almost impossible)
> - Logging behaves differently on your local machine then on the build servers

Everybody (yours truly included!) has their favorite list of logging
issues, and honestly I've not been paying close attention to the recent
changes such as structured logs, so I'm not able to help you on any of
these specifically. Some of these issues are possible to work around in
your own code (for example you can have your own logging levels and
whatnot, based on a quick grep there's even some prior art, for example
see Logger in accessible/jsat/Utils.jsm). Some of these (such as the
latter two) seem to be issues that you'd also have with your own custom
logging in mochitest, so I'm not sure how the timer based logging is not
affected by them.

> BTW I understand that WebRTC has pretty special and probably unique requirements, compared to what the rest of the mochitest users do with it. That’s why I was trying to make the case for adding a special connivence function for setting up these special “missing-event-timer”, which BTW would also allow us to increase or decrease the timeout values based on the platform/machine the test currently gets executed on.

If there were a way to implement your idea of "missing event timer" in a
way that is not prone to intermittent failures, I'd say go for it. But
I don't understand how you can do that.

But thinking of this more, do you find it useful to be able to register
a callback with the harness to be called when the harness decides to
kill your test because it has timed out? (Obviously the callback would
be informational, and would not be able to prevent the harness from
killing the test and moving on.)
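
Purely as a hypothetical sketch (no such hook exists in SimpleTest today), the
idea might look something like this:

// HYPOTHETICAL API: registerTimeoutCallback does not exist; this only
// illustrates the informational hook being proposed above.
SimpleTest.registerTimeoutCallback(function() {
  info("still waiting for 'call-established' from participant B");
  // Informational only: nothing here can stop the harness from
  // failing the test and moving on.
});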

> But as I don’t seem to be able to make my case here I’ll simply live with requesting flaky timeouts for all WebRTC tests.

Well, that would be a terrible conclusion to this thread. :( That
would mean that WebRTC tests will be prone to all sorts of intermittent
test failures caused by setTimeout (both the scenario above, and others
-- as requestFlakyTimeout is a blanket pass) forever.

Please note that intermittent test failures are a real problem. Besides
all of the time wasted by sheriffs/developers/machines to deal with
them, they reduce our ability to tell whether a given checkin is "good",
and *most importantly* they will result in the tests getting disabled.
Tests getting disabled means that areas of our code will be untested, so
we won't be able to detect the regressions that the tests would have
otherwise caught, and we'd end up shipping those bugs to our users.

I'm really trying hard to suggest something that is not flaky for your
needs. Perhaps the above idea would help. If not, maybe I can come up
with something by looking at the code. But simply calling
requestFlakyTimeout in *all* WebRTC tests and moving on is not really an
outcome that I look forward to seeing.

Cheers,
Ehsan

>> Cheers,
>> Ehsan
>>
>> On 2014-12-19 1:07 AM, Boris Zbarsky wrote:

Karl Tomlinson

Dec 26, 2014, 9:15:49 PM
to
On Fri, 19 Dec 2014 18:58:53 -0500, Ehsan Akhgari wrote:

> On 2014-12-19 4:40 PM, Nils Ohlmeier wrote:

>>> On Dec 19, 2014, at 6:56 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:

>>> Logging sufficiently is almost always enough to not have to
>>> use these timers, as those tests have demonstrated in
>>> practice.
>>>
>>
>> I disagree. Especially as the logging of mochitest is really poor.
>> - the timestamps of log messages are lost as described in bug 1020516
>> - SimpleTest only has one log level: info(). Which I think is the main
>> reason the log files get big, because test writers can’t fine tune the
>> logging.
>> - Log messages from mochitest are not printed together with the
>> stdout/stderr of the browser (so correlating these two is almost impossible)
>> - Logging behaves differently on your local machine then on the build servers

> Some of these (such as the latter
> two) seem to be issues that you'd also have with your own custom
> logging in mochitest, so I'm not sure how the timer based logging
> is not affected by them.

I assume multiple custom timers can provide different output that
can support a diagnosis without timestamps.

>
>> BTW I understand that WebRTC has pretty special and probably
>> unique requirements, compared to what the rest of the mochitest
>> users do with it.

FWIW, these issues plague test result analysis in general, not just rtc.

> But thinking of this more, do you find it useful to be able to
> register a callback with the harness to be called when the harness
> decides to kill your test because it has timed out? (Obviously
> the callback would be informational, and would not be able to
> prevent the harness from killing the test and moving on.)

That might be useful, but perhaps it would not be necessary if
logging worked. Perhaps it is better to have an error handler
that only runs when there are errors, reducing spam, but that
means keeping disjoint parts of the test in sync and perhaps putting
more state in global scope, whereas logging is simple to maintain.