On Friday, March 22, 2013 3:21:35 PM UTC-5, Paul A. Clayton wrote:
> On Mar 22, 12:54 pm, "John D. McCalpin" <mccal...@tacc.utexas.edu>
>
>
> Recent POWER implementations have tracked locality
> of use in cache block state to avoid unnecessarily
> snooping across package boundaries. Adding four
> bits per block would allow individually identifying
> 16 cores (using exclusive/modified state to indicate
> node size of one), 8 core pairs, etc. (or a more
> complex encoding could be used).
POWER does very scary things....
I developed a "local cache block flush" instruction for
POWER5 (US Patent 7,194,587), to assist with controlling
cache space (not for improving communication), but at the
same time someone else on the team dropped the "invalid
victim select" preference. So I could invalidate lines
that I did not need, but they would stay in the cache
until they became LRU anyway. I did not understand why
anyone would want to do this until I saw how POWER6 uses
invalid entries in the cache to hold information about data
sharing patterns. I am sure it has gotten much more bizarre
since then....
Interestingly, Xeon Phi implements a pair of "local cache
block flush" instructions (CLEVICT0 & CLEVICT1) -- one to
evict data from the L1 and the other to evict data from the L2.
Like the instruction in my patent, they have only local scope. On Xeon Phi they
are useful to control not only cache space, but also the timing
of writebacks, since the caches are (effectively) single-ported.
These don't make any difference for communication, since the
memory latency and cache-to-cache intervention latency are
effectively the same on this processor.
> A last use indicator could also be used to
> support simple producer-consumer communication
> where the microarchitecture would track the
> alternate location. (While directives have some
> advantages, even a predictive mechanism could be
> helpful.)
Hack upon hack upon hack? Why not design the
architecture to allow control over the things that
are important (costly), rather than hacking on
decisions that have not made sense for >20 years?
> A flush-to-home operation might be slightly useful
> (with node ownership and shared last level cache,
> flush-to-home could be just flush to the nearest
> level of cache shared by all users).
I developed a "push for sharing" instruction while
at AMD (US Patent 8,099,557), but I don't think it
was ever implemented.
> By the way, the FIFO latency you indicated in a
> later post is frighteningly high!
I only posted the lowest available numbers because
people often don't believe the actual measurements.
A quick check with lmbench3 shows that "pipe" latency
on my 2-socket Xeon E5 (Sandy Bridge) systems is about
4.7 microseconds (almost 15000 cycles).
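
For readers who want a rough feel for this kind of number without installing lmbench3, here is a minimal sketch of a pipe ping-pong test. It is not lmbench's implementation: the function name `pipe_round_trip_us` and the iteration counts are made up for illustration, it needs a POSIX system for `os.fork`, and Python interpreter overhead will push the result above what a C benchmark reports.

```python
import os
import time


def pipe_round_trip_us(iterations=10000, warmup=100):
    """Mean round-trip time in microseconds for a 1-byte ping-pong
    between a parent and a forked child over two pipes.
    (Hypothetical helper, not part of lmbench.)"""
    p2c_r, p2c_w = os.pipe()  # parent -> child
    c2p_r, c2p_w = os.pipe()  # child -> parent

    pid = os.fork()
    if pid == 0:
        # Child: echo every byte back until the parent closes its write end.
        os.close(p2c_w)
        os.close(c2p_r)
        while True:
            data = os.read(p2c_r, 1)
            if not data:
                break
            os.write(c2p_w, data)
        os._exit(0)

    # Parent: close the ends it does not use.
    os.close(p2c_r)
    os.close(c2p_w)

    # Untimed warm-up round trips.
    for _ in range(warmup):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)

    start = time.perf_counter()
    for _ in range(iterations):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start

    os.close(p2c_w)  # child sees EOF and exits
    os.close(c2p_r)
    os.waitpid(pid, 0)
    return elapsed / iterations * 1e6
```

Calling `pipe_round_trip_us()` returns a time per round trip, so each one includes two pipe writes, two reads, and the associated process wakeups. Multiply by the core clock in MHz to convert to cycles.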
On a 40-core (4 socket) Xeon E7 system, a single instance
of a non-blocking concurrent FIFO was reported (reference 1)
to go from ~1 microsecond (2000 cycle) overhead under no load
to ~50 microseconds (100,000 cycles) overhead under a relatively
heavy load (20 enqueuers and 20 dequeuers, each spinning
for ~2.3 microseconds between operations). I should note
that these are *overheads*, not just latencies, because
the processors are only doing real work for 2.3 microseconds
per enqueue or dequeue operation, while the other ~48
microseconds are spent spinning while trying to access the queue.
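
The overhead arithmetic above can be spelled out in a few lines. This is only a sketch using the figures quoted in this post; the 2 GHz clock is inferred from "~1 microsecond (2000 cycle)" rather than stated, and `queue_overhead` is a hypothetical helper name.

```python
def queue_overhead(total_us, work_us, clock_ghz):
    """Split per-operation time into spinning and useful work.

    Returns (spin time in us, fraction of time doing useful work,
    total time in cycles).
    """
    spin_us = total_us - work_us
    useful_fraction = work_us / total_us
    total_cycles = total_us * clock_ghz * 1e3  # us * GHz * 1000 = cycles
    return spin_us, useful_fraction, total_cycles


# Figures from the post: ~50 us per operation under load, ~2.3 us of
# real work; 2.0 GHz is an assumption inferred from 1 us = 2000 cycles.
spin_us, useful_fraction, total_cycles = queue_overhead(50.0, 2.3, 2.0)
```

With these inputs the useful fraction comes out under 5%, which is why the numbers are better read as overheads than as latencies.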
For a different kind of "communication":
On my Xeon Phi SE10P coprocessors, an OpenMP barrier on
244 threads costs about 22 microseconds (24,000 cycles, or the
time to execute about 23 million double-precision floating-point
operations at the peak rate).
This is just the overhead of the final barrier -- the overhead
of the initial "PARALLEL FOR" is higher, but I need to run
some more experiments before I decide the best way to report
the combined overhead.
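
The conversion from barrier time to cycles and lost floating-point work can be sketched as follows. The hardware parameters are assumptions not stated in this post: a 1.1 GHz clock, 61 cores (244 threads at 4 threads per core), and 16 double-precision operations per cycle per core (an 8-wide fused multiply-add). `barrier_cost` is a hypothetical helper name.

```python
def barrier_cost(barrier_us, clock_ghz=1.1, cores=61,
                 dp_flops_per_cycle_per_core=16):
    """Express a barrier time as cycles and as peak DP operations forgone.

    Defaults are assumed Xeon Phi SE10P parameters, not values
    taken from the post.
    """
    seconds = barrier_us * 1e-6
    cycles = seconds * clock_ghz * 1e9
    peak_flops_per_second = cores * dp_flops_per_cycle_per_core * clock_ghz * 1e9
    lost_flops = seconds * peak_flops_per_second
    return cycles, lost_flops


cycles, lost_flops = barrier_cost(22.0)
```

Under these assumptions the result is a little over 24,000 cycles and roughly 23.6 million operations, in line with the rounded figures quoted above.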
Reference: (1) Christoph Kirsch, Hannes Payer, Harald Rock, and Ana
Sokolova. Performance, Scalability, and Semantics of Concurrent
FIFO Queues. Algorithms and Architectures for Parallel Processing,
pages 273–287, 2012.