Using rr chaos mode to find intermittent bugs

Robert O'Callahan

Feb 10, 2016, 3:05:02 PM

to dev-pl...@lists.mozilla.org

Background:
http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

I just landed on rr master support for a "-h" option which enables a chaos
mode for rr recording. This is designed to help reproduce intermittent test
failures under rr. We already have a few reports of people using this
successfully to find difficult bugs. Even though rr works only on desktop
Linux (including VMs), I've reproduced a bug that only showed up in
automation on Android, and khuey reproduced a bug that only showed up on
OSX 10.6.

I'm continuing to do experiments to try to reproduce more of our top
intermittents, but you may already find rr chaos mode useful. I recommend
running a single test or a small group of tests continuously; one of my
bugs only had a few failing runs out of a thousand. I'm sure there are
still bugs rr can't reproduce, and I'm very interested in hearing about
bugs that eventually get fixed but that rr was not able to reproduce. By
studying such bugs we can improve rr chaos mode so it can find them.
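
A minimal sketch of the "run it continuously" advice above, in Python. Only
`rr record -h` comes from this message; the helper name `record_until_failure`,
the run limit, and the assumption that a failing test exits with a nonzero
status are illustrative choices, not part of rr.

```python
import subprocess
import sys


def record_until_failure(test_cmd, max_runs=1000, rr_args=("rr", "record", "-h")):
    """Record test_cmd under rr chaos mode until one run fails.

    Returns the 1-based number of the first failing run, or None if
    every run passed. Assumes (for this sketch) that a failing test
    exits with a nonzero status.
    """
    for run in range(1, max_runs + 1):
        # Each iteration makes a fresh recording of the same command.
        result = subprocess.run([*rr_args, *test_cmd])
        if result.returncode != 0:
            return run
    return None


# Hypothetical usage (the test command is a placeholder):
#   failing = record_until_failure(["./run-one-test.sh"])
#   if failing is not None:
#       print(f"run {failing} failed; try `rr replay` on the latest trace")
```

Each recording leaves a trace on disk, so a long loop of passing runs may
need cleaning up afterwards.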

Obviously, once rr chaos mode has proved itself, we should get some
automation around it. I'd like a bit more experience with it before we have
that discussion.

Rob
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf
toD
selthor stor edna siewaoeodm or v sstvr esBa kbvted,t
rdsme,aoreseoouoto
o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea
lurpr
.a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr
esn

Ted Mielczarek

Feb 10, 2016, 3:32:48 PM

to dev-pl...@lists.mozilla.org

On Wed, Feb 10, 2016, at 03:04 PM, Robert O'Callahan wrote:
> Background:
> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>
> I just landed on rr master support for a "-h" option which enables a
> chaos
> mode for rr recording. This is designed to help reproduce intermittent
> test
> failures under rr. We already have a few reports of people using this
> successfully to find difficult bugs. Even though rr works only on desktop
> Linux (including VMs), I've reproduced a bug that only showed up in
> automation on Android, and khuey reproduced a bug that only showed up on
> OSX 10.6.
>
> I'm continuing to do experiments to try to reproduce more of our top
> intermittents, but you may already find rr chaos mode useful. I recommend
> running a single test or a small group of tests continuously; one of my
> bugs only had a few failing runs out of a thousand. I'm sure there are
> still bugs rr can't reproduce, and I'm very interested in hearing about
> bugs that eventually get fixed but that rr was not able to reproduce. By
> studying such bugs we can improve rr chaos mode so it can find them.
>
> Obviously, once rr chaos mode has proved itself, we should get some
> automation around it. I'd like a bit more experience with it before we
> have
> that discussion.

This is great! I've kept holding out hope that rr can help us fix
intermittent test failures, but so far we've failed to actually prove
this out. BenWa tried doing some work on this but kept getting hung up
on hitting test failures unrelated to the ones we see in production,
possibly due to environment issues. jmaher and armenzg and others have
been doing some great work lately standing up Linux tests in
Taskcluster, a side effect of which is that we now have a Docker image
for running Linux tests. If anyone wants to prototype reproducing
failures from CI, running rr inside that image would be a good place to
start.

-Ted

Robert O'Callahan

Feb 10, 2016, 3:47:46 PM

to Ted Mielczarek, dev-pl...@lists.mozilla.org

On Thu, Feb 11, 2016 at 9:32 AM, Ted Mielczarek <t...@mielczarek.org> wrote:

> BenWa tried doing some work on this but kept getting hung up
> on hitting test failures unrelated to the ones we see in production,
> possibly due to environment issues.
>

Yes. In this vein, it's possible that in some cases rr chaos mode might
trigger bugs that don't normally happen, that one way or another block you
from finding the bug you care about.

However, bugs found by rr chaos mode should all be "real bugs". I'd
certainly love to hear about any cases where that's not true.

ISHIKAWA,chiaki

Feb 10, 2016, 4:43:43 PM

to dev-pl...@lists.mozilla.org

On 2016/02/11 5:47, Robert O'Callahan wrote:
> On Thu, Feb 11, 2016 at 9:32 AM, Ted Mielczarek <t...@mielczarek.org> wrote:
>
>> BenWa tried doing some work on this but kept getting hung up
>> on hitting test failures unrelated to the ones we see in production,
>> possibly due to environment issues.
>
> Yes. In this vein, it's possible that in some cases rr chaos mode might
> trigger bugs that don't normally happen, that one way or another block you
> from finding the bug you care about.
>
> However, bugs found by rr chaos mode should all be "real bugs". I'd
> certainly love to hear about any cases where that's not true.
>
> Rob

A scheduling change that makes hard-to-reproduce bugs occur more
often sounds interesting.

I have found that running C-C TB (sorry, it is not the browser here)
under valgrind/memcheck, which slows execution down dramatically,
has helped me find a few issues.
Off the top of my head:
- Incremental GC gets re-entered before it finishes the previous
  invocation. This was not handled properly until I noticed the issue,
  but it is now handled OK.
- There are some issues in threading.
  For one, at startup, some threads incorrectly assume that the
  on-screen window is already there, but due to the slowdown, it has
  not been created yet. I see some disturbing warning messages printed
  on the invoking tty window. I have not filed a bug yet since this is
  relatively new. I don't think I saw such messages early last year.

  For the other, at shutdown, C-C TB has a problem with incorrect
  ordering of thread shutdown: some threads seem to request services
  during shutdown from service providers, but the threads that provide
  those services have already shut down. So proper shutdown does not
  happen. There may even be a cyclic dependency. Who knows?
  With the slowdown due to valgrind/memcheck, the issue gets more
  pronounced. Right now, though, there is a timer that monitors the
  shutdown process, and the prolonged timeout of some operations (due
  to the missing threads and the slowdown caused by valgrind/memcheck)
  automatically triggers the assertion for a permanent hang at
  shutdown, so it is difficult to figure out what is going on. But one
  can hope that the check for a permanent hang gets removed temporarily
  to investigate the issue further.
  Crashes in C-C TB are something I have experienced several times in
  the last couple of years in real life.

Another thing this rr framework, or a similar approach, would be useful
for is C-C TB xpcshell testing (and I think it is useful for FF xpcshell
testing as well).

There seem to be a few intermittent test failures in xpcshell tests.
This rr approach may make those tests fail more often.

*HOWEVER*, I am going to file a bug in Bugzilla about the
OVEREAGER ASYNC approach of the current xpcshell test script, which
introduces spurious errors, at least under Windows: a previous test that
still has some files open has not completely shut down before the next
test, which seems to use THOSE files, gets started. Under Windows,
opening such a file may result in a file-locked error (under Linux/OSX,
I think it is OK to open such files unless the first program explicitly
calls |flock| or something).

So whether ALL the intermittent failures in C-C TB xpcshell tests can be
investigated better with the rr approach is anyone's guess, but
I think it does have the potential to trigger more dormant bugs, just as
valgrind/memcheck uncovered a few timing issues.

But one other post suggested that it is not applicable right now outside
Gecko, meaning C-C TB xpcshell testing cannot directly benefit from rr?
(The approach, of course, can be emulated, I suppose.)

TIA

Robert O'Callahan

Feb 10, 2016, 5:10:07 PM

to chiaki ISHIKAWA, dev-pl...@lists.mozilla.org

rr should work fine with c-c xpcshell tests (and most other Linux programs).

ISHIKAWA,chiaki

Feb 10, 2016, 5:18:49 PM

to dev-pl...@lists.mozilla.org

On 2016/02/11 7:04, Robert O'Callahan wrote:
> rr should work fine with c-c xpcshell tests (and most other Linux programs).

This sounds great!

CI

Nicolas B. Pierron

Feb 11, 2016, 5:55:03 AM

to

On 02/10/2016 08:04 PM, Robert O'Callahan wrote:
> Background:
> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>
> I just landed on rr master support for a "-h" option which enables a chaos
> mode for rr recording. This is designed to help reproduce intermittent test

> failures under rr. […]

Thanks Roc, I will give it a try.

On the other hand, I used to rely more on the "-c" option to achieve a
similar thing in the past, instead of the "-e" option.

The reason I did so is that the thread I am interested in does few
syscalls compared to the rest of the program. Thus I felt that using the
"-e" option would give it unfairly large time slices compared to what is
supposed to happen if the threads are running concurrently.

--
Nicolas B. Pierron

Robert O'Callahan

Feb 11, 2016, 2:35:42 PM

to Nicolas B. Pierron, dev-pl...@lists.mozilla.org

The -e option is gone now because the new scheduler (with or without chaos
mode) does not take system calls into account when calculating the length
of a timeslice. We only count conditional branches.

Bugs that require very frequent fine-grained context switching are probably
still hard to find with chaos mode, because very frequent context switching
slows down recording tremendously and I didn't want chaos mode to slow down
execution by more than a bounded amount. So you may find that -c is still
needed.
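
To illustrate the timeslice rule described above, here is a toy model in
Python. It is not rr's scheduler code: the event names and the function
`run_timeslice` are invented for this sketch. The only thing it takes from
the message is that conditional branches count against the timeslice and
system calls do not.

```python
def run_timeslice(events, timeslice_branches):
    """Return how many events a thread executes before it is preempted.

    events is a sequence of "branch" and "syscall" strings (a made-up
    encoding for this sketch). Only "branch" events count against the
    timeslice; "syscall" events are free.
    """
    branches = 0
    for index, event in enumerate(events):
        if event == "branch":
            branches += 1
            if branches >= timeslice_branches:
                # Budget exhausted: preempt after this event.
                return index + 1
    # The thread ran out of work before using up its timeslice.
    return len(events)


# A syscall-heavy thread and a branch-heavy thread get the same
# branch budget; the syscalls do not shorten or lengthen the slice.
syscall_heavy = ["syscall"] * 5 + ["branch"] * 3
branch_heavy = ["branch"] * 8
print(run_timeslice(syscall_heavy, 2), run_timeslice(branch_heavy, 2))
```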

Kyle Huey

Feb 11, 2016, 2:39:09 PM

to Robert O'Callahan, Nicolas B. Pierron, dev-pl...@lists.mozilla.org

On Thu, Feb 11, 2016 at 11:35 AM, Robert O'Callahan <rob...@ocallahan.org>
wrote:

> On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <
> nicolas....@mozilla.com> wrote:
>

> The -e option is gone now because the new scheduler (with or without chaos
> mode) does not take system calls into account when calculating the length
> of a timeslice. We only count conditional branches.
>

So we context switch at a syscall now only when the current thread happens
to become unschedulable?

- Kyle

Robert O'Callahan

Feb 11, 2016, 2:43:07 PM

to Kyle Huey, Nicolas B. Pierron, dev-pl...@lists.mozilla.org

On Fri, Feb 12, 2016 at 8:39 AM, Kyle Huey <m...@kylehuey.com> wrote:

> On Thu, Feb 11, 2016 at 11:35 AM, Robert O'Callahan <rob...@ocallahan.org>
> wrote:
>
>> On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <
>> nicolas....@mozilla.com> wrote:
>>

>> The -e option is gone now because the new scheduler (with or without chaos
>> mode) does not take system calls into account when calculating the length
>> of a timeslice. We only count conditional branches.
>>
>
> So we context switch at a syscall now only when the current thread happens
> to become unschedulable?
>

Or if any higher-priority thread has become runnable. This includes not
just a low-priority thread doing a FUTEX_WAKE to wake a high-priority
thread, but also a thread changing its priority or another thread's
priority, or even a low-priority thread writing to a pipe that a
high-priority thread is reading from. (Though in the latter case the
scheduler *might* not see the high-priority thread become runnable in time
in all cases.)
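
The rule discussed in this exchange can be modelled in a few lines of
Python. This is a toy sketch, not rr's implementation: the `Thread` class,
the function name, and the convention that a lower number means a higher
priority are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int          # assumption of this sketch: lower value = higher priority
    runnable: bool = True


def should_switch_at_syscall(current, others):
    """Decide whether to context switch when `current` makes a syscall.

    Switch if the current thread has become unschedulable, or if any
    higher-priority thread is now runnable (for example, because it was
    woken by a FUTEX_WAKE or had its priority changed).
    """
    if not current.runnable:
        return True
    return any(t.runnable and t.priority < current.priority for t in others)


low = Thread("low", priority=10)
high = Thread("high", priority=0, runnable=False)   # blocked, e.g. on a futex
print(should_switch_at_syscall(low, [high]))        # high is still blocked
high.runnable = True                                # low's syscall woke it
print(should_switch_at_syscall(low, [high]))
```

The pipe case mentioned above is the one this model does not capture: there
the scheduler might not notice in time that the high-priority thread has
become runnable.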