On Tue, 24 Mar 2015 07:48:41 -0700 (PDT)
kennetha...@gmail.com wrote:
[snip]
> Oh ok. I had hoped that by trying to make my worker threads locally
> proximal to one another that caches could be flushed in a coherent
> way extremely quickly. Like, say threads one and two share a portion
> of memory. Before that memory goes out, what if another core could do
> a data structure coherent merge of the data? I thought that if I
> postured the problem so that work was associative and commutative,
> then no matter the order of thread precedence, the final result would
> be correct.
>
> Now it seems like cache control isn't something you directly manage,
> but is done by the processor for you, and you can only try to get the
> compiler to select specific instructions that are suited to what you
> want and hopefully reduce unnecessary processor work as a result of
> those operations.
>
> Please let me know your thoughts :)

If you are trying to avoid the costs of memory synchronization by
resorting to tricks with caches, you are doomed. It won't work.

The only way out is to design your algorithms to minimize contention by
avoiding shared data as much as possible. Where some data sharing is
inevitable and is adversely affecting scalability, examine whether
there is a lock-free data structure available to do what you want.

Then again, there is lock-free and there is lock-free: release-acquire
atomic loads and stores basically come free at the hardware level on
x86/64 because x86/64 is strongly ordered - the costs over relaxed
memory ordering mainly come down to the additional restrictions imposed
on compiler reordering when optimizing. However, sequential consistency
is the default for C++11 atomics and does not come free (on x86/64 a
sequentially consistent store needs an mfence or an implicitly locked
xchg instruction, which imposes synchronization overhead).
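
Going back to the first point above (avoiding shared data): since you
say your work is associative and commutative, one common shape is to let
each thread accumulate privately and merge the partial results once at
the end. The following is only a minimal sketch under that assumption.
The function name parallel_sum and the use of integer addition as the
merge step are made up for illustration; substitute your own data
structure and merge operation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `nthreads` worker threads.
// Each worker accumulates into a thread-private variable and writes its
// partial result exactly once, so nothing is shared while the work runs.
long long parallel_sum(const std::vector<int>& data, unsigned nthreads) {
    assert(nthreads > 0 && "need at least one worker");
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    // Ceiling division so that every element lands in some chunk.
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;  // thread-private accumulator
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;   // single write per thread
        });
    }
    // join() synchronizes with the completion of each worker, so the
    // partial results are visible here without any atomics or locks.
    for (auto& w : workers) w.join();

    // The merge happens once, on one thread. Because addition is
    // associative and commutative, the order of merging does not matter.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling parallel_sum(v, 4) on a vector holding 1..1000 should give
500500 whatever the thread scheduling. Adjacent slots of `partial` may
share a cache line, but each slot is written only once per thread, so
that should not matter much here; as always, profile before assuming.
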

(As an aside, in my view, 99% of all uses of atomics only need
release-acquire semantics in a well-designed program, and contrary to
what some people say, release-acquire semantics for atomics are
perfectly easy to reason about: you just have to realize that you are
only synchronizing between the two threads carrying out a particular
release (store) and acquire (load) operation, and not enforcing a total
memory order for all reads and writes of atomic variables in the
program.)
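
To make the release (store) / acquire (load) pairing concrete, here is a
minimal sketch of one thread handing ordinary data to another. The names
handoff_demo, payload and ready are made up for illustration and the
spin-wait is only there to keep the example short; a real program would
more likely block on something.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: a producer hands a plain int to a consumer.
// Only these two threads synchronize, and only through `ready`.
int handoff_demo() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // the only atomic variable

    std::thread producer([&] {
        payload = 42;                                  // (1) write the data
        ready.store(true, std::memory_order_release);  // (2) publish it
    });

    int seen = 0;
    std::thread consumer([&] {
        // (3) The acquire load pairs with the release store in (2).
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // (4) guaranteed to observe the write from (1)
    });

    producer.join();
    consumer.join();
    assert(ready.load(std::memory_order_relaxed));  // sanity check
    return seen;
}
```

Once the acquire load in (3) reads the value written by the release
store in (2), everything the producer wrote before (2) is visible to the
consumer, so handoff_demo() returns 42. Nothing is promised about how
other threads order unrelated atomics, which is exactly the part
sequential consistency would add. On x86/64 a compiler will typically
emit plain mov instructions for both (2) and (3).
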

First and foremost, profile to see what the problem really is.

Chris