Re: TSan Deadlock Traces (Hangs in Mozilla CI)

Dmitry Vyukov

Apr 23, 2021, 5:03:25 AM
to Christian Holler, Kris Wright, Sylvestre Ledru, thread-sanitizer
+thread-sanitizer mailing list

On Fri, Apr 23, 2021 at 10:48 AM Christian Holler <cho...@mozilla.com> wrote:
>
> Good Morning Dmitry,
>
>
>
> As discussed earlier, we finally managed to grab a full backtrace for
> each thread for the deadlock in our CI (we finally got the process to
> dump core). I've attached logs of two different hangs.
>
> If I'm reading those correctly, your initial guess was right: we are
> reporting a race and at the same time joining/detaching/finalizing a
> thread. I had also seen earlier (through fprintf debugging)
> that one call to `PR_JoinThread` does not return.
>
>
> If this is not enough for you to diagnose or propose a fix, we can try
> to grab additional information from the core dump (it is fairly reliably
> reproducible in CI) and also make changes to TSan easily. Any changes we
> make to Clang will automatically rebuild the toolchain for our test job
> only, so even invasive or slow changes to the toolchain are perfectly fine.
>
>
>
> Thanks in advance and have a good weekend :)
>
> - Chris
>
gdb.log.18613
gdb.log.7434

Dmitry Vyukov

Apr 23, 2021, 5:13:11 AM
to Christian Holler, Kris Wright, Sylvestre Ledru, thread-sanitizer
What llvm revision is this?

Dmitry Vyukov

Apr 23, 2021, 5:21:33 AM
to Christian Holler, Kris Wright, Sylvestre Ledru, thread-sanitizer
This is a deadlock within a single thread:

Thread 44 (Thread 0x7f865856c700 (LWP 7505)):
#0 0x000055bb077f9ee0 in
atomic_exchange<__sanitizer::atomic_uint32_t> () at
compiler-rt/lib/sanitizer_common/sanitizer_atomic_clang.h:67
#1 0x000055bb077f9ee0 in Lock() () at
compiler-rt/lib/sanitizer_common/sanitizer_linux.cpp:658
#2 0x000055bb07881212 in Lock () at
compiler-rt/lib/tsan/../sanitizer_common/sanitizer_thread_registry.h:97
#3 0x000055bb07881212 in GenericScopedLock () at
compiler-rt/lib/tsan/../sanitizer_common/sanitizer_mutex.h:183
#4 0x000055bb07881212 in ReportRace() () at
compiler-rt/lib/tsan/rtl/tsan_rtl_report.cpp:682
#5 0x000055bb07885d9a in __tsan_report_race_thunk () at
compiler-rt/lib/tsan/rtl/tsan_rtl_amd64.S:133
#6 0x000055bb07875190 in HandleRace () at
compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:639
#7 0x000055bb07875190 in MemoryAccessImpl1 () at
compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:715
#8 0x000055bb07875190 in MemoryAccess () at
compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:888
#9 0x000055bb07875190 in MemoryRead () at
compiler-rt/lib/tsan/rtl/tsan_rtl.h:742
#10 0x000055bb07875190 in __tsan_read8() () at
compiler-rt/lib/tsan/rtl/tsan_interface_inl.h:33
#11 0x00007f86f6d70c91 in evsig_handler () at
gecko/ipc/chromium/src/third_party/libevent/signal.c:385
#12 0x000055bb0781dfb7 in CallUserSignalHandler() () at
compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1913
#13 0x000055bb07814b09 in ProcessPendingSignals() () at
compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1958
#14 0x000055bb078620d5 in __tsan_atomic32_store() () at
compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cpp:561
#15 0x000055bb078cf193 in store () at include/c++/7.4.0/bits/atomic_base.h:374
#16 0x000055bb078cf193 in store () at include/mozilla/Atomics.h:195
#17 0x000055bb078cf193 in operator= () at include/mozilla/Atomics.h:297
#18 0x000055bb078cf193 in Lock () at include/mozilla/BaseProfilerDetail.h:58
#19 0x000055bb078cf193 in PSAutoLock () at
gecko/mozglue/baseprofiler/core/platform.cpp:256
#20 0x000055bb078cf193 in paf_parent() () at
gecko/mozglue/baseprofiler/core/platform-linux-android.cpp:517
#21 0x00007f870e1c8cc7 in __libc_fork () at ../sysdeps/nptl/fork.c:241
#22 0x000055bb0781e86b in __interceptor_fork() () at
compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:2105
#23 0x00007f86f6d40430 in LaunchApp() () at
gecko/ipc/chromium/src/base/process_util_linux.cc:246
#24 0x00007f86f6d8ba3d in DoLaunch() () at
gecko/ipc/glue/GeckoChildProcessHost.cpp:1246
#25 0x00007f86f6d89a55 in PerformAsyncLaunch() () at
gecko/ipc/glue/GeckoChildProcessHost.cpp:1016
#26 0x00007f86f6da735c in applyImpl<mozilla::ipc::BaseProcessLauncher,
RefPtr<mozilla::MozPromise<mozilla::ipc::LaunchResults,
mozilla::ipc::LaunchError, false> >
(mozilla::ipc::BaseProcessLauncher::*)()> () at
gecko/xpcom/threads/nsThreadUtils.h:1148
#27 0x00007f86f6da735c in apply<mozilla::ipc::BaseProcessLauncher,
RefPtr<mozilla::MozPromise<mozilla::ipc::LaunchResults,
mozilla::ipc::LaunchError, false> >
(mozilla::ipc::BaseProcessLauncher::*)()> () at
gecko/xpcom/threads/nsThreadUtils.h:1154
#28 0x00007f86f6da735c in Invoke () at include/mozilla/MozPromise.h:1514
#29 0x00007f86f6da735c in Run() () at include/mozilla/MozPromise.h:1534
#30 0x00007f86f6495fa0 in Run() () at gecko/xpcom/threads/TaskQueue.cpp:208
#31 0x00007f86f64a50c2 in ProcessNextEvent() () at
gecko/xpcom/threads/nsThread.cpp:1160
#32 0x00007f86f64abd93 in NS_ProcessNextEvent() () at
gecko/xpcom/threads/nsThreadUtils.cpp:548
#33 0x00007f86f6dbde49 in Run() () at gecko/ipc/glue/MessagePump.cpp:332
#34 0x00007f86f6d4825d in RunInternal () at
gecko/ipc/chromium/src/base/message_loop.cc:335
#35 0x00007f86f6d4825d in RunHandler () at
gecko/ipc/chromium/src/base/message_loop.cc:328
#36 0x00007f86f6d4825d in Run() () at
gecko/ipc/chromium/src/base/message_loop.cc:310
#37 0x00007f86f64a14a9 in ThreadFunc() () at
gecko/xpcom/threads/nsThread.cpp:397
#38 0x00007f870cf9b24c in _pt_root () at
gecko/nsprpub/pr/src/pthreads/ptthread.c:201

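The thread calls fork, and we hold the ThreadRegistry mutex around fork so
that the new process does not start with ThreadRegistry locked by another
thread.
But inside fork the atfork handler (paf_parent, frame #20) does an atomic
store, which makes us process a pending signal; the signal handler
(evsig_handler, frame #11) triggers a race report, and ReportRace tries to
take the same ThreadRegistry mutex that this thread already holds.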
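
The pattern above (a non-recursive lock held around fork, then requested
again by code that runs inside fork) can be sketched in plain C. This is
only an illustration, not TSan's code: `registry_mu`,
`atfork_parent_handler`, `fork_with_lock_held` and
`demo_lock_held_across_fork` are hypothetical names, and the handler uses
`pthread_mutex_trylock` so that the sketch reports the conflict instead of
hanging the way a blocking lock would.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a runtime-internal, non-recursive mutex. */
static pthread_mutex_t registry_mu = PTHREAD_MUTEX_INITIALIZER;

/* Result of the lock attempt made from inside fork(), in the parent. */
static int trylock_result = -1;

/* Runs in the parent, inside fork(), after the child has been created.
 * A blocking pthread_mutex_lock() here would never return, because the
 * calling thread already holds registry_mu. */
static void atfork_parent_handler(void) {
  trylock_result = pthread_mutex_trylock(&registry_mu);
  if (trylock_result == 0) {
    pthread_mutex_unlock(&registry_mu);
  }
}

/* Hold the mutex around fork() so that the child does not start with the
 * mutex locked by some other thread (assumed wrapper, for illustration). */
static pid_t fork_with_lock_held(void) {
  pthread_mutex_lock(&registry_mu);
  pid_t pid = fork();
  /* Both parent and child own the mutex at this point and release it. */
  int rc = pthread_mutex_unlock(&registry_mu);
  assert(rc == 0);
  (void)rc;
  return pid;
}

/* Returns what the atfork handler observed, or -1 on setup/fork failure. */
int demo_lock_held_across_fork(void) {
  if (pthread_atfork(NULL, atfork_parent_handler, NULL) != 0) {
    return -1;
  }
  pid_t pid = fork_with_lock_held();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    _exit(0); /* child: nothing to do */
  }
  waitpid(pid, NULL, 0);
  return trylock_result;
}
```

Calling `demo_lock_held_across_fork()` should return `EBUSY`: the handler
runs on the forking thread while that thread still holds `registry_mu`. In
the trace above the second acquisition is a blocking `Lock()`, so the
thread spins forever instead.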

Christian Holler

Apr 23, 2021, 5:22:35 AM
to Dmitry Vyukov, Kris Wright, Sylvestre Ledru, thread-sanitizer
This is Clang 11, rev 43ff75f2c3feef64f9d73328230d34dac8832a91 with some
patches applied as noted in
https://searchfox.org/mozilla-central/source/build/build-clang/clang-11-linux64.json
(the patches are in the same directory).

We also have Clang 12 (and I believe Clang trunk) in CI. I tried 12 and
it had the same problem.

- Chris

Dmitry Vyukov

Apr 23, 2021, 7:29:00 AM
to Christian Holler, Kris Wright, Sylvestre Ledru, thread-sanitizer
Here is a fix: https://reviews.llvm.org/D101154
I don't know whether you are registered at reviews.llvm.org or under what
name, so I couldn't add you.

Amusingly, we already have tests/tsan/pthread_atfork_deadlock.c and
tests/tsan/pthread_atfork_deadlock2.c, one based on Firefox, the other
on Chromium. Time for pthread_atfork_deadlock3.c. It's always the
fork!

Christian Holler

Apr 26, 2021, 6:01:22 AM
to Dmitry Vyukov, Kris Wright, Sylvestre Ledru, thread-sanitizer
Thank you for that quick fix, Dmitry!

I will test this in our CI today and report back about the results.

Christian Holler

Apr 27, 2021, 5:15:05 AM
to Dmitry Vyukov, Kris Wright, Sylvestre Ledru, thread-sanitizer
We've tested the fix in CI and can confirm that the hangs are gone :)

Thanks again for the quick help.

- Chris

On 23.04.21 13:28, Dmitry Vyukov wrote: