"StoreStore" and "LoadLoad" barriers in TSAN

Todd Lipcon

Apr 13, 2015, 2:11:47 PM
to thread-s...@googlegroups.com
Hi TSAN experts,

I'm curious if you've given any thought to implementing support for finer-grained memory barriers in TSAN. I have an algorithm with an optimistic single-writer stamped lock... something like this:

struct TLS {
  Atomic32 seq_;
  Blah other_fields_;
};

static __thread TLS* g_tls;

// Called only by the thread that "owns" the TLS data.
// Must be wait-free
void UpdateTLS() {
  g_tls->seq_++;
  StoreStoreBarrier();
  MakeModificationsTo(&g_tls->other_fields_);
  StoreStoreBarrier();
  g_tls->seq_++;
}

// Called by some other thread, which is allowed to block.
void GatherTLS(TLS* other_thread_tls) {
  Blah snapshot_copy;
  while (true) {
    Atomic32 stamp;
    while (((stamp = Acquire_Load(&other_thread_tls->seq_)) & 1) != 0) {  // odd: writer mid-update
      sched_yield();
    }
    Copy(&other_thread_tls->other_fields_, &snapshot_copy);
    LoadLoadBarrier();
    if (NoBarrier_Load(&other_thread_tls->seq_) == stamp) break;
  }
  // do something with the snapshot of state from the other thread.
}

The purpose, as you might imagine, is for the thread that owns the TLS to be able to make updates wait-free and very fast (no interlocked instructions or memory barriers on x86).

I think this is perfectly safe with the barriers as described above, but I'm not sure how best to describe it to TSAN. I've added an ANNOTATE_IGNORE_READS_BEGIN/END around the optimistic snapshot copy, but I'm not sure what annotations I should give TSAN to let it know about the StoreStore and LoadLoad barriers.

On x86 I've implemented the barriers as compiler-only barriers, since TSO already prevents those reorderings.
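Concretely, a compiler-only barrier can be written as an empty GCC/Clang asm with a memory clobber. This is a sketch with illustrative names (COMPILER_BARRIER, UpdateSketch are mine, not from atomicops.h):

```cpp
#include <cassert>

// Illustrative name. On x86/TSO the hardware already keeps stores
// ordered with stores and loads with loads, so only the compiler must
// be stopped from reordering memory accesses across this point.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

int seq = 0;      // stand-ins for seq_ and other_fields_
int payload = 0;

void UpdateSketch() {
  seq++;               // odd: snapshot in progress
  COMPILER_BARRIER();  // payload stores stay after the first bump
  payload = 42;
  COMPILER_BARRIER();  // second bump stays after the payload stores
  seq++;               // even: snapshot consistent again
}
```

The asm emits no instructions; it only tells the compiler it may not cache or reorder memory accesses across it.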

Should I just annotate out the whole thing for TSAN? Or is there some way to do the finer grained LoadLoad and StoreStore barriers?

Thanks
Todd
--
Todd Lipcon
Software Engineer, Cloudera

Dmitry Vyukov

Apr 14, 2015, 11:45:31 AM
to thread-s...@googlegroups.com
Hi Todd,

Here are two problems:
1. Tsan is bad at handling stand-alone memory barriers. I don't know
how to implement them without a performance hit on _all_ relaxed
atomic operations. It can be a severe hit for the programs we are
interested in.
2. I don't think we want to support memory models and atomic
interfaces other than C/C++ atomics. Hopefully, C/C++ atomics is the
dominant interface for native synchronization. As for other
interfaces, some of them are [mostly] equivalent to C/C++ atomics,
and some are way too expressive (like
https://groups.google.com/d/msg/comp.programming.threads/PidOpfQUEb8/VRG25OSVgKAJ).


The correct way to express a seqlock in C/C++ is:

// writer
atomic_store(&seq, seq+1, memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store(&data[0], ..., memory_order_relaxed);
...
atomic_store(&data[N], ..., memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store(&seq, seq+1, memory_order_relaxed);

// reader
atomic_load(&seq, memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
d0 = atomic_load(&data[0], memory_order_relaxed);
...
dN = atomic_load(&data[N], memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
atomic_load(&seq, memory_order_relaxed);

Unfortunately tsan won't understand this, as it does not understand
stand-alone memory fences at all.
If tsan supports them, then it will be support for these C/C++ atomics
rather than store-store/load-load fences.
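For reference, a self-contained `<atomic>` version of the fence-based form above might look like this sketch (a single word of data for brevity; the function names are illustrative):

```cpp
#include <atomic>
#include <cassert>

std::atomic<unsigned> seq{0};
std::atomic<int> data{0};  // relaxed accesses, ordered only by the fences

void seqlock_write(int v) {
  seq.store(seq.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release);
  data.store(v, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release);
  seq.store(seq.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
}

int seqlock_read() {
  for (;;) {
    unsigned s = seq.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (s & 1) continue;  // odd stamp: writer in progress, retry
    int v = data.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (seq.load(std::memory_order_relaxed) == s) return v;  // stamp unchanged
  }
}
```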

And here is a way to express a seqlock that is correct, is
understood by tsan, and has no overhead on x86:

// writer
atomic_store(&seq, seq+1, memory_order_relaxed);
atomic_store(&data[0], ..., memory_order_release);
...
atomic_store(&data[N], ..., memory_order_release);
atomic_store(&seq, seq+1, memory_order_release);

// reader
atomic_load(&seq, memory_order_acquire);
d0 = atomic_load(&data[0], memory_order_acquire);
...
dN = atomic_load(&data[N], memory_order_acquire);
atomic_load(&seq, memory_order_relaxed);
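Spelled out with `<atomic>` (again one data word for brevity; names are illustrative), that tsan-friendly variant is:

```cpp
#include <atomic>
#include <cassert>

std::atomic<unsigned> seq2{0};
std::atomic<int> data2{0};

void seqlock_write2(int v) {
  // Relaxed bump to odd; the release stores below keep it ordered first.
  seq2.store(seq2.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
  data2.store(v, std::memory_order_release);
  // Release bump back to even: all data stores stay before it.
  seq2.store(seq2.load(std::memory_order_relaxed) + 1, std::memory_order_release);
}

int seqlock_read2() {
  for (;;) {
    unsigned s = seq2.load(std::memory_order_acquire);
    if (s & 1) continue;  // odd stamp: writer in progress, retry
    int v = data2.load(std::memory_order_acquire);
    if (seq2.load(std::memory_order_relaxed) == s) return v;
  }
}
```

On x86, release stores and acquire loads compile to ordinary mov instructions, so the only cost is the compiler reorderings they forbid.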

Todd Lipcon

Apr 30, 2015, 12:28:54 AM
to thread-s...@googlegroups.com
Hey Dmitry,

Thanks for your earlier response, and sorry for the slow reply -- for some reason this didn't make it back to my inbox, and I only discovered it upon checking the group on the web. Will double-check my subscription... anyway, responses inline:

On Tuesday, April 14, 2015 at 8:45:31 AM UTC-7, Dmitry Vyukov wrote:


Hi Todd,

Here are two problems:
1. Tsan is bad at handling stand-alone memory barriers. I don't know
how to implement them without a performance hit on _all_ relaxed
atomic operations. It can be a severe hit for the programs we are
interested in.

OK, I can appreciate that these types of barriers are a rarity and not worth penalizing the more common cases for.
 
2. I don't think we want to support memory models and atomic
interfaces other than C/C++ atomics. Hopefully, C/C++ atomics is the
dominant interface for native synchronization. As for other
interfaces, some of them are [mostly] equivalent to C/C++ atomics,
and some are way too expressive (like
https://groups.google.com/d/msg/comp.programming.threads/PidOpfQUEb8/VRG25OSVgKAJ).


The correct way to express a seqlock in C/C++ is:

// writer
atomic_store(&seq, seq+1, memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store(&data[0], ..., memory_order_relaxed);
...
atomic_store(&data[N], ..., memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store(&seq, seq+1, memory_order_relaxed);

// reader
atomic_load(&seq, memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
d0 = atomic_load(&data[0], memory_order_relaxed);
...
dN = atomic_load(&data[N], memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
atomic_load(&seq, memory_order_relaxed);

Yeah, I haven't moved to the C++11 atomic operations yet, and am still using atomicops.h (originally derived from google/base, though I'm using an external version).
 

Unfortunately tsan won't understand this, as it does not understand
stand-alone memory fences at all.
If tsan supports them, then it will be support for these C/C++ atomics
rather than store-store/load-load fences.

And here is a way to express a seqlock that is correct, is
understood by tsan, and has no overhead on x86:

// writer
atomic_store(&seq, seq+1, memory_order_relaxed);
atomic_store(&data[0], ..., memory_order_release);
...
atomic_store(&data[N], ..., memory_order_release);
atomic_store(&seq, seq+1, memory_order_release);

Is that truly "no-overhead"? Imagine that the 'data' array here is a char[]. In that case, the fact that you're using release stores for every write means that you'll get byte-width store instructions when otherwise the compiler could quite possibly aggregate them into fewer, wider instructions, right? Same goes for reordering of the stores (e.g. if the compiler detects that one of the first stores has a data dependency that may not yet be satisfied, it would probably be smart to reorder it a bit later in the sequence).


// reader
atomic_load(&seq, memory_order_acquire);
d0 = atomic_load(&data[0], memory_order_acquire);
...
dN = atomic_load(&data[N], memory_order_acquire);
atomic_load(&seq, memory_order_relaxed);

Same here -- especially since the loads during the 'snapshot' operation are typically implemented by something like memcpy(&local_state, state, sizeof(state)) -- we'd need a separate "memcpy_acquire", and we'd probably lose a bunch of the optimizations the compiler does to substitute __builtin_memcpy() with wide memory operations.

Given the above, I'll probably end up just suppressing TSAN in this section of the code, since the particular use case shows up on hot paths in a lot of places (it's tracing-related code). A few extra instructions could have a noticeable impact on code size, I-cache usage, etc.

Thanks again
-Todd

Dmitry Vyukov

May 5, 2015, 5:46:32 AM
to thread-s...@googlegroups.com
atomicops.h doesn't have StoreStore/LoadLoad barriers.
It has Acquire_Store and Release_Load, though. They are intended for
seqlocks, but are equally unsupported by tsan.


>> Unfortunately tsan won't understand this, as it does not understand
>> stand-alone memory fences at all.
>> If tsan supports them, then it will be support for these C/C++ atomics
>> rather than store-store/load-load fences.
>>
>> And here is a way to express a seqlock that is correct, is
>> understood by tsan, and has no overhead on x86:
>>
>> // writer
>> atomic_store(&seq, seq+1, memory_order_relaxed);
>> atomic_store(&data[0], ..., memory_order_release);
>> ...
>> atomic_store(&data[N], ..., memory_order_release);
>> atomic_store(&seq, seq+1, memory_order_release);
>
>
> Is that truly "no-overhead"? Imagine that the 'data' array here is a
> char[]. In that case, the fact that you're using release stores for every
> write means that you'll get byte-width store instructions when otherwise the
> compiler could quite possibly aggregate them into a fewer number of wider
> instructions, right? Same goes for reordering of the stores (eg if the
> compiler detects that one of the first stores has a data dependency that may
> not yet be satisfied, it would probably be smart to reorder it a bit later
> in the sequence).

You can cast the data to an array of words during copy-in/copy-out
(it has to be POD-like anyway to be usable with a seqlock). I would
say that the rest is virtually zero-overhead.
Storing the data without atomics is undefined behavior in C/C++ anyway.
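A sketch of that word-wise copy-in (the names are illustrative; it assumes the blob is trivially copyable, its size is a multiple of the word size, and std::atomic<std::uintptr_t> shares the representation of the plain integer, which holds on mainstream toolchains):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Blob { char bytes[64]; };  // illustrative POD payload

// Publish src into the shared blob with word-wide release stores
// instead of per-byte ones. Casting raw storage to an array of
// std::atomic words is the "cast to array of words" trick; it relies
// on the atomic type sharing the plain type's layout.
void CopyInWords(Blob* shared, const Blob& src) {
  constexpr std::size_t kWords = sizeof(Blob) / sizeof(std::uintptr_t);
  std::uintptr_t tmp[kWords];
  std::memcpy(tmp, &src, sizeof(Blob));  // plain copy into a local first
  auto* dst = reinterpret_cast<std::atomic<std::uintptr_t>*>(shared);
  for (std::size_t i = 0; i < kWords; i++)
    dst[i].store(tmp[i], std::memory_order_release);
}
```

The copy-out side would mirror this with word-wide relaxed or acquire loads into a local buffer before the stamp re-check.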


>> // reader
>> atomic_load(&seq, memory_order_acquire);
>> d0 = atomic_load(&data[0], memory_order_acquire);
>> ...
>> dN = atomic_load(&data[N], memory_order_acquire);
>> atomic_load(&seq, memory_order_relaxed);
>
>
> Same here -- especially true since the loads during the 'snapshot' operation
> are typically implemented by something like memcpy(&local_state, state,
> sizeof(state)) -- we'd need a separate "memcpy_acquire", and probably lose a
> bunch of the optimizations that the compiler does to substitute
> __builtin_memcpy() with large-size memory operations.
>
> Given the above, I'll probably end up just suppressing TSAN in this section
> of the code, since the particular use case shows up on hot paths in a lot of
> places (it's tracing-related code). A few extra instructions could have a
> noticeable impact on code size, I-cache usage, etc.


What's the size of the protected data? If it is megs of memory, then
you may find other synchronization primitives more suitable.