From time to time, I see check-all hang while running the lit tests.
The hang always happens past the 90% completion mark, and I'm pretty
sure all tests have been run and check-all is just waiting for
lit/python to exit. I see a single python process running, taking
very little CPU time. An strace of that process shows this:
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32168}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bcc8c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0
futex(0x3bcc8c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
futex(0x3bcc8c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
It appears that python is waiting for some I/O or other event that never
arrives.
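
For what it's worth, select() with zero file descriptors is just a sleep, and the 1000, 2000, 4000, ... 50000 microsecond sequence in the trace looks like a doubling, capped polling loop. Python 2.7's threading.Condition.wait() uses such a loop when given a timeout, so one guess (assuming lit is running under Python 2 here, which the trace does not confirm) is that the process is polling a lock or queue rather than a pipe. A minimal sketch of that delay schedule, with the function name and defaults chosen only for illustration:

```python
def polling_delays(timeout, initial=0.0005, cap=0.05):
    """Yield sleep intervals (seconds) for a doubling, capped polling loop.

    The constants are assumptions modelled on the timed wait in
    Python 2.7's threading module; they are not taken from lit itself.
    """
    remaining = timeout
    delay = initial
    while remaining > 0:
        # Double the previous delay, but never sleep past the deadline
        # or longer than the cap (50 ms by default).
        delay = min(delay * 2, remaining, cap)
        yield delay
        remaining -= delay


if __name__ == "__main__":
    # Print the first few delays in microseconds, the unit strace shows.
    micros = [round(d * 1e6) for d in polling_delays(1.0)]
    print(micros[:8])
```

If that guess is right, each run of doubling delays in the trace marks the start of a new timed wait, and the futex calls in between are the lock being acquired and released.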
Has anyone else seen this before? Any ideas of what is going on or how
to fix it?
-David
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> What you're seeing is just the fact that lit is waiting on
> subprocesses (select is waiting on the pipes, I suspect).
Right. Some digging revealed that it is waiting on
getline_nohang.cc.tmp, a tsan test.
I see that this test has been disabled for NetBSD, due to it sometimes
failing. I'm seeing the same on Linux.
How can we stabilize the sanitizer tests so that check-all can work
reliably? If some sanitizer tests are so flaky, I should think they
should be marked UNSUPPORTED. Who has the authority to make those
determinations?
> On Jan 3, 2019, at 1:21 PM, David Greene via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Chandler Carruth via llvm-dev <llvm...@lists.llvm.org> writes:
>
>> What you're seeing is just the fact that lit is waiting on
>> subprocesses (select is waiting on the pipes, I suspect).
>
> Right. Some digging revealed that it is waiting on
> getline_nohang.cc.tmp, a tsan test.
>
> I see that this test has been disabled for NetBSD, due to it sometimes
> failing. I'm seeing the same on Linux.
>
> How can we stabilize the sanitizer tests so that check-all can work
> reliably? If some sanitizer tests are so flaky, I should think they
> should be marked UNSUPPORTED. Who has the authority to make those
> determinations?
Dmitry Vyukov does. CC'ing him.
Kuba
Are there any special repro instructions? I am running all tsan tests
periodically on linux and none of them flakes.
> Are there any special repro instructions? I am running all tsan tests
> periodically on linux and none of them flakes.
I don't think I'm doing anything especially interesting. I wonder if
lit parallelism has anything to do with it. I tend to run quite wide
(32 or more).
I'm on SLES 12.2, kernel 4.4.21-69-default, x86_64 in case it matters.
I see this test hang pretty frequently.
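
To check whether width matters, something like the following could run the test binary many times in parallel and count runs that exceed a deadline. This is only a sketch: the names run_once and stress, the 30-second timeout and the default width of 32 are all illustrative assumptions, and it is not what lit does internally.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd, timeout):
    """Run cmd once and classify the result as 'ok', 'fail' or 'hang'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "fail"


def stress(cmd, runs=1000, width=32, timeout=30.0):
    """Run cmd `runs` times, `width` at a time, and tally the outcomes."""
    counts = {"ok": 0, "fail": 0, "hang": 0}
    with ThreadPoolExecutor(max_workers=width) as pool:
        for result in pool.map(lambda _: run_once(cmd, timeout), range(runs)):
            counts[result] += 1
    return counts


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: stress.py COMMAND [ARGS...]")
    print(stress(sys.argv[1:]))
```

Comparing the hang count at width=1 against width=32 on the same machine would show whether the hang depends on load.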
-David
Hi David,
The test is specifically a regression test for a deadlock:
// Make sure TSan doesn't deadlock on a file stream lock at program shutdown.
// See https://github.com/google/sanitizers/issues/454
So I wonder if it's not completely fixed.
I am sure it does not reproduce on my machine:
$ clang++ getline_nohang.cc -fsanitize=thread -O1 -g
$ stress ./a.out
192 runs so far, 0 failures
...
17137 runs so far, 0 failures
17377 runs so far, 0 failures
Could you please attach to the hung process with gdb and get a
backtrace of all threads?
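
In case it helps, here is a small sketch of one way to collect that non-interactively. It assumes gdb is on PATH and that you are allowed to ptrace the process; the helper names are made up for illustration.

```python
import subprocess
import sys


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to pid, prints every
    thread's backtrace, and then exits (detaching from the process)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid):
    """Run gdb against pid and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: bt.py PID")
    print(dump_backtraces(int(sys.argv[1])))
```

The pid to pass is that of the hung test binary itself, not of the python process running lit.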