TSAN slowdown then failure on repeat test run


Ben Clayton

Oct 19, 2020, 6:17:24 AM
to thread-sanitizer
Hello,

I've spent a bit of time investigating an issue where repeated runs of the same test (--gtest_repeat=-1) under TSAN become slower and slower until TSAN explodes with a SEGV on an unknown address.

The full issue is described at: https://github.com/google/marl/issues/133

It smells like we're leaking memory or state between test runs, but I see no obvious memory leakage in the library itself.

It's worth mentioning that Marl uses fibers, implemented either via ucontext or via its own assembly. Both implementations appear to behave the same here.

Marl does not currently emit calls to notify the sanitizers of fiber switching. I have experimented with this, but ended up getting a bunch of false positives about state being accessed between fibers in the scheduler (which I know to be safe). I wonder if omitting these calls may be the problem?

I'm at a bit of a loss as to what's going wrong here. Any pointers would be greatly appreciated.

Many thanks,
Ben

Ben Clayton

Oct 19, 2020, 6:20:01 AM
to thread-sanitizer
Immediately after posting this, I see https://groups.google.com/g/thread-sanitizer/c/v6erS15Ft2A, which sounds remarkably similar.

Does it seem likely that this is the same problem?

Cheers,
Ben

Dmitry Vyukov

Oct 19, 2020, 6:23:25 AM
to Ben Clayton, thread-sanitizer
On Mon, Oct 19, 2020 at 12:20 PM 'Ben Clayton' via thread-sanitizer
<thread-s...@googlegroups.com> wrote:
>
> Immediately after posting this, I see https://groups.google.com/g/thread-sanitizer/c/v6erS15Ft2A, which sounds remarkably similar.
>
> Does it seem likely that this is the same problem?

Yes, it looks like the same problem. TSan does not support fibers out
of the box.


Ben Clayton

Oct 19, 2020, 7:24:55 AM
to thread-sanitizer
Thanks for the very quick reply Dmitry.

When you say 'fibers out of the box' - would annotating my library with sanitizer calls help here? If not, is there any intention to support this (i.e. is there a bug I can track)?

Thanks again,
Ben

Dmitry Vyukov

Oct 19, 2020, 7:49:03 AM
to Ben Clayton, thread-sanitizer
On Mon, Oct 19, 2020 at 1:24 PM 'Ben Clayton' via thread-sanitizer
<thread-s...@googlegroups.com> wrote:
>
> Thanks for the very quick reply Dmitry.
>
> When you say 'fibers out of the box' - would annotating my library with sanitizer calls help here?

Yes, it should help.

> If not, is there any intention to support this (i.e. is there a bug I can track)?

None that I know of.

Ben Clayton

Oct 19, 2020, 12:04:24 PM
to thread-sanitizer
> When you say 'fibers out of the box' - would annotating my library with sanitizer calls help here?
>
> Yes, it should help.

So I had a play with adding the __tsan_switch_to_fiber annotations and friends, and it had some impact:
  • Performance now remains stable between test iterations (instead of slowing down each iteration)
  • Previously we only got to around 6 iterations before crashing; we now get to around 18...
  • ... before running out of memory and crashing and burning again.

So it seems like we're still leaking some state somewhere.

The other issue with adding these TSAN annotations is how fibers are now seemingly treated like threads. This is problematic in this project as:
  • There is a job scheduler (marl::Scheduler) which owns multiple worker threads.
  • Each worker thread owns multiple fibers and a single, shared mutex.
  • Fibers are transitioned with the mutex lock held, so fiber A will take the lock, switch to fiber B, and fiber B will release the lock.
As TSAN seems to treat fibers like threads, unlocking the mutex on a different fiber results in "unlock of an unlocked mutex (or by a wrong thread)" errors.
I've tried bodging a hack to unlock pre-fiber-switch and lock post-fiber-switch, but I'm not keen on landing this change as it might introduce subtle issues.
Is there any way to use these TSAN annotations and tell TSAN to ignore a particular std::mutex?
 
Many thanks,
Ben

Dmitry Vyukov

Oct 20, 2020, 4:24:46 AM
to Ben Clayton, thread-sanitizer
On Mon, Oct 19, 2020 at 6:04 PM 'Ben Clayton' via thread-sanitizer
<thread-s...@googlegroups.com> wrote:
>>
>> > When you say 'fibers out of the box' - would annotating my library with sanitizer calls help here?
>>
>> Yes, it should help.
>
>
> So I had a play with adding the __tsan_switch_to_fiber annotations and friends, and it had some impact:
>
> Performance now remains stable between test iterations (instead of slowing down each iteration)
> Previously we only got to around 6 iterations before crashing; we now get to around 18...
> ... before running out of memory and crashing and burning again.
>
> Details here.
>
> So it seems like we're still leaking some state somewhere.

TSAN_OPTIONS=profile_memory=/tmp/mprof can help to identify what's
leaking. With this flag, TSan will dump memory usage periodically to
the specified file.

> The other issue with adding these TSAN annotations is how fibers are now seemingly treated like threads. This is problematic in this project as:
>
> There is a job scheduler (marl::Scheduler) which owns multiple worker threads.
> Each worker thread owns multiple fibers and a single, shared mutex.
> Fibers are transitioned with the mutex lock held, so fiber A will take the lock, switch to fiber B, and fiber B will release the lock.
>
> As TSAN seems to treat fibers like threads, unlocking the mutex on a different fiber results in "unlock of an unlocked mutex (or by a wrong thread)" errors.
> I've tried bodging a hack to unlock pre-fiber-switch and lock post-fiber-switch, but I'm not keen on landing this change as it might introduce subtle issues.
> Is there any way to use these TSAN annotations, and tell TSAN to ignore a particular std::mutex?

TSan does not have anything specifically for this.

However, one thing that I can think of is using these annotations:
https://github.com/llvm-mirror/compiler-rt/blob/master/include/sanitizer/tsan_interface.h#L80-L105
to pretend that the mutex is unlocked right before the switch and then
locked again after the switch. This way you don't need to do the
actual unlock/lock, but TSan will think that these are done. But for
addr you will need to somehow pass the same address that std::mutex
passes to pthread_mutex_lock/unlock functions.

Ben Clayton

Oct 20, 2020, 7:24:26 AM
to thread-sanitizer
> TSAN_OPTIONS=profile_memory=/tmp/mprof can help to identify what's
> leaking. With this flag, TSan will dump memory usage periodically to
> the specified file.

Interestingly, attempting to run the specific tests with the --gtest_filter flag doesn't produce any profile output; I had to run with the --gtest_filter flag removed to get some output. Odd.
Anyway, taking that filter out and letting the whole test suite run does produce a bunch of files. One of these files is much larger than the others, and indeed contains numbers that grow continuously:

    RSS 45 MB: shadow:5 meta:0 file:5 mmap:19 trace:13 heap:1 other:0 stacks=0[1454] nthr=12/28
    RSS 33 MB: shadow:5 meta:0 file:5 mmap:14 trace:5 heap:1 other:0 stacks=0[1657] nthr=1/49
    RSS 33 MB: shadow:5 meta:0 file:5 mmap:14 trace:5 heap:1 other:0 stacks=0[1745] nthr=1/49
    RSS 32 MB: shadow:5 meta:0 file:5 mmap:15 trace:3 heap:1 other:0 stacks=0[1832] nthr=4/49
    RSS 50 MB: shadow:7 meta:1 file:5 mmap:28 trace:4 heap:2 other:0 stacks=0[1890] nthr=30/64
    RSS 76 MB: shadow:13 meta:4 file:5 mmap:35 trace:9 heap:6 other:0 stacks=0[1942] nthr=82/194
    RSS 82 MB: shadow:15 meta:5 file:5 mmap:37 trace:11 heap:6 other:0 stacks=0[1972] nthr=84/194
    RSS 88 MB: shadow:19 meta:6 file:5 mmap:38 trace:10 heap:8 other:0 stacks=0[2061] nthr=64/211
    RSS 87 MB: shadow:19 meta:6 file:5 mmap:37 trace:10 heap:8 other:0 stacks=0[2085] nthr=2/211
    RSS 87 MB: shadow:19 meta:6 file:5 mmap:37 trace:10 heap:8 other:0 stacks=0[2085] nthr=2/211
... some time later...
    RSS 2460 MB: shadow:1489 meta:257 file:5 mmap:641 trace:47 heap:17 other:0 stacks=0[3774] nthr=65/1081
    RSS 2462 MB: shadow:1487 meta:257 file:5 mmap:641 trace:51 heap:18 other:0 stacks=0[3774] nthr=2/1081
    RSS 2466 MB: shadow:1487 meta:257 file:5 mmap:641 trace:55 heap:18 other:0 stacks=0[3774] nthr=3/1081
    RSS 2471 MB: shadow:1487 meta:257 file:5 mmap:642 trace:58 heap:17 other:0 stacks=0[3774] nthr=5/1081
    RSS 2588 MB: shadow:1490 meta:257 file:5 mmap:745 trace:70 heap:18 other:0 stacks=0[3774] nthr=9/1081
    RSS 2833 MB: shadow:1495 meta:258 file:5 mmap:858 trace:193 heap:21 other:0 stacks=0[3774] nthr=834/1081
    RSS 2576 MB: shadow:1494 meta:258 file:5 mmap:741 trace:53 heap:21 other:0 stacks=0[3774] nthr=30/1081
    RSS 2733 MB: shadow:1506 meta:261 file:5 mmap:885 trace:51 heap:23 other:0 stacks=0[3774] nthr=1001/1081
    RSS 2527 MB: shadow:1523 meta:263 file:5 mmap:662 trace:48 heap:23 other:0 stacks=0[3774] nthr=413/1081
    RSS 2561 MB: shadow:1541 meta:265 file:5 mmap:676 trace:47 heap:24 other:0 stacks=0[3775] nthr=428/1081
** bang **


Is there anything from this that suggests marl is doing something stupid?

> However, one thing that I can think of is using these annotations:
> https://github.com/llvm-mirror/compiler-rt/blob/master/include/sanitizer/tsan_interface.h#L80-L105
> to pretend that the mutex is unlocked right before the switch and then
> locked again after the switch. This way you don't need to do the
> actual unlock/lock, but tsan will think that these are done. But for
> addr you will need to somehow pass the same address that std::mutex
> passes to pthread_mutex_lock/unlock functions.

I was thinking along the same lines, but didn't see any obvious way to obtain the handles. I'll do some digging. Thanks!

Dmitry Vyukov

Oct 20, 2020, 7:32:31 AM
to Ben Clayton, thread-sanitizer
1081 threads is definitely a lot of threads. TSan can handle up to
8192, but it needs lots of memory. Perhaps it's some issue with the
fibers impl that leads to effective thread leaks (e.g. an undestroyed
fiber)? However, I see that at some points it gets down to 3/1081
(live/total threads), so the number of live threads drops to almost 0,
which suggests there are no leaks. If you intentionally create that
many threads, TSan will need lots of memory to chew through it.

The other numbers may be just a consequence of the number of threads.
E.g. growing shadow may be a consequence of growing mmap, which in
turn may be thread stacks.

Ben Clayton

Oct 20, 2020, 7:51:07 AM
to thread-sanitizer
The individual tests spawn up to 64 (real) threads at a time. It seems like the fibers are being counted as threads in these stats, which would explain why there are so many (64 x FibersPerThread).

> Tsan can handle up to 8192, but it needs lots of memory.

This machine has 256GB of RAM, so physical memory should not be an issue.

> The other numbers may be just a consequence of the number of threads.
> E.g. growing shadow may be consequence of growing mmap, which is turn
> may be thread stacks.

So I've tried the experiment again with the TSAN fiber annotations removed. We still get a crash after RSS and mmap have reached very large numbers:

    RSS 22675 MB: shadow:272 meta:61 file:6 mmap:22244 trace:68 heap:21 other:0 stacks=21817[239974] nthr=9/1081
    RSS 22670 MB: shadow:274 meta:61 file:6 mmap:22246 trace:59 heap:22 other:0 stacks=21837[240038] nthr=65/1081
    RSS 23254 MB: shadow:272 meta:61 file:6 mmap:22831 trace:59 heap:22 other:0 stacks=22464[241363] nthr=1/1081
    RSS 23899 MB: shadow:272 meta:61 file:6 mmap:23476 trace:59 heap:22 other:0 stacks=23107[242669] nthr=1/1081
    RSS 24575 MB: shadow:272 meta:61 file:6 mmap:24153 trace:59 heap:22 other:0 stacks=23794[244008] nthr=1/1081
    RSS 25205 MB: shadow:272 meta:61 file:6 mmap:24782 trace:59 heap:22 other:0 stacks=24344[245044] nthr=1/1081

Dmitry Vyukov

Oct 20, 2020, 8:01:37 AM
to Ben Clayton, thread-sanitizer
On Tue, Oct 20, 2020 at 1:51 PM Ben Clayton <headles...@gmail.com> wrote:
>
> The individual tests spawn up to 64 (real) threads at a time. It seems like the fibers are being counted as threads in these stats, which would explain why there are so many (64 x FibersPerThread).
>
> > Tsan can handle up to 8192, but it needs lots of memory.
>
> This machine has 256GB of RAM, so physical memory should not be an issue.
>
> > The other numbers may be just a consequence of the number of threads.
> > E.g. growing shadow may be consequence of growing mmap, which is turn
> > may be thread stacks.
>
> So I've tried the experiment again with the TSAN fiber annotations removed. We still get a crash after RSS and mmap have reached very large numbers:
>
>
> Full log here.

Does it crash with the same failure?
ASSERT: Failed to protect page at 0x7f2e481fa000: Cannot allocate memory

Failure to mprotect while there is plenty of RAM may be due to the
kernel limit on the number of VMAs.
Hard to say anything without knowing what actually consumes that
mmap:24782. Maybe contents of /proc/self/maps before the failure will
shed some light on the nature of the problem.

Ben Clayton

Oct 20, 2020, 8:14:16 AM
to thread-sanitizer
> Does it crash with the same failure?

With the TSAN fiber annotations removed, it typically crashes with something like:

ThreadSanitizer:DEADLYSIGNAL
==480860==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000007f0 (pc 0x0000004d5546 bp 0x000000000020 sp 0x7bc402517530 T480860)
==480860==The signal is caused by a READ memory access.
==480860==Hint: address points to the zero page.
    #0 __tsan::TraceSwitch(__tsan::ThreadState*) /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:638 (marl-unittests+0x4d5546)
    #1 ObtainCurrentStack<__sanitizer::BufferedStackTrace> /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:656 (marl-unittests+0x4d5546)
    #2 TraceSwitch /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:572 (marl-unittests+0x4d5546)
    #3 __tsan_trace_switch_thunk /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl_amd64.S:53 (marl-unittests+0x4e9c99)
    #4 __tsan_read1 /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:862 (marl-unittests+0x4d7db3)
    #5 MemoryAccess /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:874 (marl-unittests+0x4d7db3)
    #6 MemoryRead /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:742 (marl-unittests+0x4d7db3)
    #7 __tsan_read1 /home/brian/src/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interface_inl.h:21 (marl-unittests+0x4d7db3)
    #8 (anonymous namespace)::DefaultAllocator::allocate(marl::Allocation::Request const&) /home/ben/src/marl/build/../src/memory.cpp:209 (marl-unittests+0x5ff473)
    #9 marl::TrackedAllocator::allocate(marl::Allocation::Request const&) /home/ben/src/marl/build/../include/marl/memory.h:307 (marl-unittests+0x516024)
    #10 std::unique_ptr<marl::OSFiber, marl::Allocator::Deleter> marl::Allocator::make_unique_n<marl::OSFiber, marl::Allocator*&>(unsigned long, marl::Allocator*&) /home/ben/src/marl/build/../include/marl/memory.h:215 (marl-unittests+0x601f6e)
    #11 std::unique_ptr<marl::OSFiber, marl::Allocator::Deleter> marl::Allocator::make_unique<marl::OSFiber, marl::Allocator*&>(marl::Allocator*&) /home/ben/src/marl/build/../include/marl/memory.h:201 (marl-unittests+0x601f6e)
    #12 marl::OSFiber::createFiber(marl::Allocator*, unsigned long, std::function<void ()> const&) /home/ben/src/marl/build/../src/osfiber_asm.h:168 (marl-unittests+0x601f6e)


However, the number of mmaps might have something to do with all those stacks=24344[245044], right?

Dmitry Vyukov

Oct 20, 2020, 8:31:07 AM
to Ben Clayton, thread-sanitizer
On Tue, Oct 20, 2020 at 2:14 PM Ben Clayton <headles...@gmail.com> wrote:
>
> > Does it crash with the same failure?
>
>
> However, the number of mmaps might have something to do with all those stacks=24344[245044], right?

This does not look like a problem with mmaps/VMA limit:
> ==480860==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000007f0 (pc 0x0000004d5546 bp 0x000000000020 sp 0x7bc402517530 T480860)
This looks more like the original hypothesis related to no fibers annotations.

With fiber annotations failure is more interesting:
ASSERT: Failed to protect page at 0x7f2e481fa000: Cannot allocate memory
The only way I see this can happen is the kernel VMA limit.
If you mmap stacks for 1080 fibers and mprotect the last page for each
stack, this will create lots of VMA regions. Perhaps you are already
close to the limit and tsan just adds some more (it does mmap a lot at
well).
A snapshot of /proc/self/maps on the allocation failure should
prove/disprove this.

Ben Clayton

Oct 20, 2020, 9:52:10 AM
to thread-sanitizer
> If you mmap stacks for 1080 fibers and mprotect the last page for each
> stack, this will create lots of VMA regions. Perhaps you are already
> close to the limit and tsan just adds some more (it does mmap a lot at
> well).

Sure, makes sense - but:
  • I'm munmap()'ing all of marl's mmap() allocations before the end of each test.
  • When the tests are run in isolation, I've never seen a crash.
  • The crash happens nearly always at the same test iteration.
  • Replacing the allocator's mmap() and munmap() calls with malloc() / free() (and without calling mprotect()) still crashes at approximately the same test iteration.
  • There are no reported leaks with either the mmap() or malloc() approach, using either ASAN or my own allocator trackers.
> A snapshot of /proc/self/maps on the allocation failure should
> prove/disprove this.

Here's the proc map for a with-annotations crash, failing with the error message:

    ==773948==ERROR: ThreadSanitizer failed to deallocate 0x43000 (274432) bytes at address 0x0ff231a3d000
    FATAL: ThreadSanitizer CHECK failed: /home/brian/src/final/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_posix.cpp:61 "(("unable to unmap" && 0)) != (0)" (0x0, 0x0)

Cheers,
Ben

Dmitry Vyukov

Oct 21, 2020, 4:39:57 AM
to Ben Clayton, thread-sanitizer
On Tue, Oct 20, 2020 at 3:52 PM Ben Clayton <headles...@gmail.com> wrote:
>
> > If you mmap stacks for 1080 fibers and mprotect the last page for each
> > stack, this will create lots of VMA regions. Perhaps you are already
> > close to the limit and tsan just adds some more (it does mmap a lot at
> > well).
>
> Sure, makes sense - but:
>
> I'm munmap()'ing all of marl's mmap() calls before the end of each test.
> When the tests are run in isolation I've never seen a crash.
> The crash happens nearly always at the same test iteration.
> Replacing the allocator's mmap() and munmap() calls with malloc() / free() (and without calling mprotect()) still crashes at approximately the same test iteration.
> There's no reported leaks either with the mmap() or malloc() approach, using either ASAN or my own allocator trackers.
>
> > A snapshot of /proc/self/maps on the allocation failure should
> > prove/disprove this.
>
> Here's the proc map for a with-annotations crash, failing with the error message:
>
> ==773948==ERROR: ThreadSanitizer failed to deallocate 0x43000 (274432) bytes at address 0x0ff231a3d000
> FATAL: ThreadSanitizer CHECK failed: /home/brian/src/final/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_posix.cpp:61 "(("unable to unmap" && 0)) != (0)" (0x0, 0x0)

A failure to unmap also points towards hitting the limit on the number of VMAs.
However, the dump contains only 20K lines, while the default limit is
64K. You may try dumping /proc/sys/vm/max_map_count, but now I am not
sure whether it will help or not...

It's also weird that these entries are not merged together...

1f100b8a0000-1f100b8b1000 rwxp 00000000 00:00 0
1f100b8b1000-1f100b8f0000 rwxp 00000000 00:00 0
1f100b8f0000-1f100b901000 rwxp 00000000 00:00 0
1f100b901000-1f100b940000 rwxp 00000000 00:00 0
1f100b940000-1f100b951000 rwxp 00000000 00:00 0
1f100b951000-1f100b990000 rwxp 00000000 00:00 0
1f100b990000-1f100b9a1000 rwxp 00000000 00:00 0

They are consecutive, have the same protection, anonymous... a kernel bug?

Ben Clayton

Oct 21, 2020, 5:58:52 AM
to thread-sanitizer
> A failure to unmap also points towards hitting the limit on the number of VMAs.
> However, the dump contains only 20K lines, while the default limit is
> 64K. You may try to dump /proc/sys/vm/max_map_count, but now I am not
> sure if it will help or not...

$ cat /proc/sys/vm/max_map_count
65530

*Shrug*

Oh well. I guess I'll just avoid the --gtest_repeat flag for testing this project, and hope that whatever is leaking doesn't affect projects that use marl.

Thank you for your help!

Cheers,
Ben