data race with pthread barriers

Rik Prohaska

unread,

Sep 9, 2015, 5:00:13 PM9/9/15

to thread-sanitizer

Hello,

Using clang 3.8, I see data race reported with barrier_wait and barrier_destroy. Is this a race within TSAN?

Thanks

WARNING: ThreadSanitizer: data race (pid=14924)

Write of size 1 at 0x000001533808 by main thread:

#0 pthread_barrier_destroy /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1236 (barrier-2+0x00000042ed1b)

#1 main /home/rfp/src/tsan.tests/barrier-2.cc:28:9 (barrier-2+0x0000004a93ad)

Previous read of size 1 at 0x000001533808 by thread T1:

#0 pthread_barrier_wait /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1243 (barrier-2+0x00000043e14e)

#1 f(void*) /home/rfp/src/tsan.tests/barrier-2.cc:9:9 (barrier-2+0x0000004a91ae)

Location is global 'b' of size 32 at 0x000001533808 (barrier-2+0x000001533808)

Thread T1 (tid=14926, finished) created by main thread at:

#0 pthread_create /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:848 (barrier-2+0x0000004478d3)

#1 main /home/rfp/src/tsan.tests/barrier-2.cc:21:9 (barrier-2+0x0000004a92d3)

SUMMARY: ThreadSanitizer: data race /home/rfp/src/tsan.tests/barrier-2.cc:28:9 in main

#include <stdio.h>

#include <assert.h>

#include <pthread.h>

pthread_barrier_t b;

void *f(void *arg) {

int r;

r = pthread_barrier_wait(&b);

fprintf(stderr, "%s %u %d\n", __FUNCTION__, __LINE__, r);

assert(r == 0 || r == PTHREAD_BARRIER_SERIAL_THREAD);

return arg;

}

int main(void) {

int r;

r = pthread_barrier_init(&b, nullptr, 2);

assert(r == 0);

pthread_t id;

r = pthread_create(&id, nullptr, f, nullptr);

assert(r == 0);

r = pthread_barrier_wait(&b);

fprintf(stderr, "%s %u %d\n", __FUNCTION__, __LINE__, r);

assert(r == 0 || r == PTHREAD_BARRIER_SERIAL_THREAD);

r = pthread_barrier_destroy(&b);

assert(r == 0);

void *retptr;

r = pthread_join(id, &retptr);

assert(r == 0);

return 0;

}

Alexey Samsonov

unread,

Sep 9, 2015, 5:25:50 PM9/9/15

to thread-s...@googlegroups.com

Looks like this is a legit data race in your code: it's possible that the main thread would proceed and call pthread_barrier_destroy() before T1 would return from pthread_barrier_wait().

--
You received this message because you are subscribed to the Google Groups "thread-sanitizer" group.
To unsubscribe from this group and stop receiving emails from it, send an email to thread-sanitiz...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Alexey Samsonov, Mountain View, CA

Rik Prohaska

unread,

Sep 9, 2015, 5:35:11 PM9/9/15

to thread-sanitizer

But pthread_barrier_wait will not return to any of the threads until the barrier has been entered by all of the threads. So, why is this an app problem?

Alexey Samsonov

unread,

Sep 9, 2015, 5:49:38 PM9/9/15

to thread-s...@googlegroups.com

On Wed, Sep 9, 2015 at 2:35 PM, Rik Prohaska <proh...@gmail.com> wrote:

But pthread_barrier_wait will not return to any of the threads until the barrier has been entered by all of the threads. So, why is this an app problem?

Right, I mean the following execution order:

1) main thread enters pthread_barrier_wait()

2) T1 enters pthread_barrier_wait()

3) main thread returns from pthread_barrier_wait()

4) main thread calls pthread_barrier_destroy()

<--- T1 has not yet returned from pthread_barrier_wait(), but it's already destroyed --->

I *think* that this situation qualifies for what is described in man as "The results are undefined if pthread_barrier_destroy() is called when any thread is blocked on the barrier, <...>"

I am not entirely sure that this reading is correct, though.

Rik Prohaska

unread,

Sep 9, 2015, 6:52:45 PM9/9/15

to thread-sanitizer

If I call sleep(10) on the main thread between the barrier_wait and the barrier_destroy, there is still a race reported. In this case, it is obvious that both threads have exited the barrier.

Alexey Samsonov

unread,

Sep 9, 2015, 6:57:23 PM9/9/15

to thread-s...@googlegroups.com

On Wed, Sep 9, 2015 at 3:52 PM, Rik Prohaska <proh...@gmail.com> wrote:

If I call sleep(10) on the main thread between the barrier_wait and the barrier_destroy, there is still a race reported. In this case, it is obvious that both threads have exited the barrier.

sleep() doesn't create a happens-before relation in TSan. You should use some other means to notify the main thread that T1 has exited from pthread_barrier_wait (or just move the call to pthread_barrier_destroy after pthread_join).

Evgenii Stepanov

unread,

Sep 9, 2015, 6:59:56 PM9/9/15

to thread-s...@googlegroups.com

It's not obvious. It's possible, though very unlikely, that the other
thread has not been scheduled once in the 10 seconds, and is still in
the barrier.

Rik Prohaska

unread,

Sep 9, 2015, 7:32:22 PM9/9/15

to thread-sanitizer

The printf's fire before the sleep is called. Therefore, all threads are out of the barrier before barrier destroy is called.

Alexey Samsonov

unread,

Sep 9, 2015, 7:36:51 PM9/9/15

to thread-s...@googlegroups.com

On Wed, Sep 9, 2015 at 4:32 PM, Rik Prohaska <proh...@gmail.com> wrote:

The printf's fire before the sleep is called. Therefore, all threads are out of the barrier before barrier destroy is called.

TSan detects potential data races (if there is no happens-before relation between two memory accesses, one of which is a write),

not the actual case when two memory accesses happen simultaneously. That is, if TSan reports a race it means that there *could

be* some execution order, that would lead to the data race.

Rik Prohaska

unread,

Sep 9, 2015, 8:06:26 PM9/9/15

to thread-sanitizer

Consider this statement:

"The results are undefined if pthread_barrier_destroy() is called when any thread is blocked on the barrier, <...>"

What exactly does "blocked on the barrier" mean? I don't think that a thread is blocked on the barrier after all of the threads have called barrier wait. I do think it means that an insufficient # of threads have called barrier wait.

Rik Prohaska

unread,

Sep 9, 2015, 8:10:00 PM9/9/15

to thread-sanitizer

I am just tracking down a TSAN warning generated when running the MySQL server (version 5.7.8). MySQL has used pthread barriers in the way that the example program that I supplied. It is either a MySQL bug or a TSAN bug. I am not convinced that TSAN is correct in this case. It appears (to me) that the barrier man pages are not definitive.

Alexey Samsonov

unread,

Sep 9, 2015, 8:50:04 PM9/9/15

to thread-s...@googlegroups.com

I agree it's not 100% definitive, so advice from experts (Dima?) would be nice. I can, however, come up with reasonable barrier implementation that would access the underlying memory of a barrier after the necessary number of threads have entered (but not all of them returned from) pthread_barrier_wait(). The beginning of this article http://locklessinc.com/articles/barriers/ confirms my concerns.

Dmitry Vyukov

unread,

Sep 10, 2015, 5:43:02 AM9/10/15

to thread-s...@googlegroups.com

Hi Rik,

This is moot aspect. Documentation is indeed very vague. As the result
different implementation implement it differently.
Nptl pthead implementation up to 2.22 can indeed crash on your program
due to usage of pthread_barrier_t object after destruction.
For example 2.19 implementation does this:

int
pthread_barrier_wait (barrier)
pthread_barrier_t *barrier;
{
struct pthread_barrier *ibarrier = (struct pthread_barrier *) barrier;
int result = 0;

/* Make sure we are alone. */
lll_lock (ibarrier->lock, ibarrier->private ^ FUTEX_PRIVATE_FLAG);

/* One more arrival. */
--ibarrier->left;

/* Are these all? */
if (ibarrier->left == 0)
{
/* Yes. Increment the event counter to avoid invalid wake-ups and
tell the current waiters that it is their turn. */
++ibarrier->curr_event;

/* Wake up everybody. */
lll_futex_wake (&ibarrier->curr_event, INT_MAX,
ibarrier->private ^ FUTEX_PRIVATE_FLAG);

/* This is the thread which finished the serialization. */
result = PTHREAD_BARRIER_SERIAL_THREAD;
}
else
{
/* The number of the event we are waiting for. The barrier's event
number must be bumped before we continue. */
unsigned int event = ibarrier->curr_event;

/* Before suspending, make the barrier available to others. */
lll_unlock (ibarrier->lock, ibarrier->private ^ FUTEX_PRIVATE_FLAG);

/* Wait for the event counter of the barrier to change. */
do
lll_futex_wait (&ibarrier->curr_event, event,
ibarrier->private ^ FUTEX_PRIVATE_FLAG);
while (event == ibarrier->curr_event);
}

/* Make sure the init_count is stored locally or in a register. */
unsigned int init_count = ibarrier->init_count;

/* If this was the last woken thread, unlock. */
if (atomic_increment_val (&ibarrier->left) == init_count)
/* We are done. */
lll_unlock (ibarrier->lock, ibarrier->private ^ FUTEX_PRIVATE_FLAG);

return result;
}

This implementation can corrupt memory or hang on your program.

There is that patch to nptl:
https://sourceware.org/ml/libc-alpha/2015-07/msg00585.html
it says:
```
The previous barrier implementation did not fulfill the POSIX
requirements for when a barrier can be destroyed. Specifically, it was
possible that threads that haven't noticed yet that their round is
complete still access the barrier's memory, and that those accesses can
happen after the barrier has been legally destroyed.
The new algorithm does not have this issue, and it avoids using a lock
internally.
```

The patch makes your program correct. But it is aimed at inclusion
into only 2.22 (dated by 19 Jul 2015).
For example my up-to-date Ubuntu still features libpthread 2.19.

Bottom line:
I would suggest you to treat this as bug in the program, as it can
realistically break on most existing pthread implementations (I am
sure there are other pthread implementations that are still not safe).
For tsan, this "new reading" of posix standard is something to keep in
mind, so I filed https://github.com/google/sanitizers/issues/598

Thank you

> --
> You received this message because you are subscribed to the Google Groups
> "thread-sanitizer" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to thread-sanitiz...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--

Dmitry Vyukov, Software Engineer, dvy...@google.com
Google Germany GmbH, Dienerstraße 12, 80331, München
Geschäftsführer: Graham Law, Christine Elizabeth Flores
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Diese E-Mail ist vertraulich. Wenn Sie nicht der richtige Adressat
sind, leiten Sie diese bitte nicht weiter, informieren Sie den
Absender und löschen Sie die E-Mail und alle Anhänge. Vielen Dank.
This e-mail is confidential. If you are not the right addressee please
do not forward it, please inform the sender, and please erase this
e-mail including any attachments. Thanks.

Rik Prohaska

unread,

Sep 10, 2015, 8:22:38 AM9/10/15

to thread-sanitizer

What will thread santizer report (if anything) with the 'fixed' barrier implementation?

Dmitry Vyukov

unread,

Sep 10, 2015, 8:35:17 AM9/10/15

to thread-s...@googlegroups.com

Tsan disregards actual implementation of synchronization primitives,
but it knows about _contract_ between synchronization primitives and
user code. The current coded contract is that tsan reports race in
this case.

Reply all

Reply to author

Forward