ThreadSanitizer (both the GCC and LLVM variants) has lots of simple
tests like this one:
http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/tsan/lit_tests/simple_race.c?view=markup
These tests pass if tsan finds a race and fail if tsan does not find a race.
Sometimes tsan fails to find the race.
I just tried to reproduce the problem on my machine by doing this:
clang -fsanitize=thread simple_race.c
while true; do ./a.out > /dev/null 2>&1 && echo "NO RACE DETECTED" ; done
And after quite some time I did get "NO RACE DETECTED".
Jakub says that on his machine this happens very frequently (1 in ~5 runs).
So, what's going on?
The only idea we have is that we are hitting a known, by-design race in
the tsan state machine.
A more or less up-to-date description of the algorithm is found here:
https://code.google.com/p/thread-sanitizer/wiki/Algorithm
(for the exact description, see the source: MemoryAccessImpl in tsan_rtl.cc)
The key loop in tsan's state machine looks like this:
  for i in 1..4:
    UpdateOneShadowState(shadow_address, i, new_shadow_word, store_word)
i.e. we update the 4 shadow slots that correspond to the given
application address independently. There is no synchronization here;
with it, tsan would be 10x slower.
If the racy accesses in the application happen within a few cycles of
each other, the tsan state machine may hit its own internal race and
fail to find the user's race.
We believe that this race cannot lead to false positives, only to
false negatives.
Is this a problem for tsan users?
I think not. Happens-before race detection is probabilistic anyway;
there are many more reasons why tsan may fail to find a single race in
a single run.
Is this a problem for tsan developers? It looks like it.
Dmitry has been inserting sleep(1) statements in tests here and there
to make the tests less flaky.
The LLVM bots have been green lately (they got enough sleeps for our
machines), but the GCC bots fail often.
I would start by inserting sleep(1) statements in the tests that
frequently fail on the GCC bots.
This will not guarantee lack of failures, but should reduce the
failure rate dramatically. Like this:
void *Thread1(void *x) {
  sleep(1);
  Global++;
  return 0;
}
Interesting observation: if we make sure the racy accesses in the
application really happen close to each other, tsan starts missing the
race more often (1 in ~20 runs on my box):
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int Global;
pthread_barrier_t B;

void *Thread1(void *x) {
  pthread_barrier_wait(&B);  //<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  Global = 42;
  return NULL;
}

void *Thread2(void *x) {
  pthread_barrier_wait(&B);  //<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  Global = 43;
  return NULL;
}

int main() {
  pthread_barrier_init(&B, NULL, 2);
  pthread_t t[2];
  pthread_create(&t[0], NULL, Thread1, NULL);
  pthread_create(&t[1], NULL, Thread2, NULL);
  pthread_join(t[0], NULL);
  pthread_join(t[1], NULL);
  return 0;
}
--kcc