Yes, I know. Many of my responses to Robert Myers have been
explanations of this, the state of the world.
However, the reason that I am willing to cheer Robert on as he tilts at
his windmill, and even to try to help out a bit, is that this trend is
not a fundamental limit. I.e. there is no fundamental reason that we
have to keep hurting random accesses as memory systems evolve.
People seem to act as if there are only two design points:
* low latency, small random accesses
* long latency, burst accesses
But it is possible to build a system that supports
* long latency, small random accesses
By the way, it is more appropriate to say that the current trend is towards
* long latency, randomly addressed long sequential burst accesses.
(Noting that you can have random non-sequential burst accesses, as I
have recently posted about.)
The things that seem to be driving the evolution towards long sequential
bursts are
a) tags in caches - the smaller the cache objects, the more area wasted
on tags. But if you don't care about tags for your small random accesses...
b) signalling overhead - long sequential bursts have a ratio of address
bits to data bits of, say, 64:512 = 1:8 for Intel's 64 byte cache lines,
and 64:2048 = 1:32 for IBM's 256 byte cache lines. Whereas
scatter/gather has a signalling ratio of more like 1:1 (one 64-bit
address per 64-bit data word).
Signalling overhead manifests both in bandwidth and power.
One can imagine an interconnect that handles both sequential bursts and
scatter/gather random accesses - so that you don't pay a penalty for
sequential access patterns, but you support small random access patterns
with long latencies well. But...
c) this is complex. More complex than simply supporting sequential bursts.
But I'm not afraid of complexity. I try to avoid complexity when there
are simpler ways of solving a problem. But it appears that this random
access problem is a problem that (a) is solvable (with a bit of
complexity), (b) has customers (Robert, and some other supercomputing
customers I have met, some very important), and (c) isn't getting solved
any other way.
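To make b) and this dual-mode idea concrete, here is a minimal sketch of
what such an interconnect request might look like. The field names,
widths, and the SG_MAX limit are my inventions for illustration, not any
real protocol:

/* Sketch of a dual-mode memory request: one message kind for long
   sequential bursts, one for scatter/gather. All field names, widths,
   and limits are illustrative assumptions, not a real protocol. */
#include <stdint.h>

#define SG_MAX 8   /* gather up to 8 words per message - assumed */

enum req_kind { REQ_BURST, REQ_SCATTER_GATHER };

struct mem_request {
    enum req_kind kind;
    union {
        struct {                    /* one address for many data words:  */
            uint64_t base;          /* 64 address bits per 512-2048 data */
            uint16_t len_bytes;     /* bits, i.e. the 1:8 to 1:32 ratio  */
        } burst;
        struct {                    /* one address per 64-bit data word: */
            uint64_t addr[SG_MAX];  /* 64 address bits per 64 data bits, */
            uint8_t  count;         /* i.e. the roughly 1:1 ratio        */
        } sg;
    } u;
};

A streaming access pattern pays only the one burst header per line; a
pointer chase pays the roughly 1:1 scatter/gather price, but still gets
several independent words in flight per message.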
So much for all our talk of persuading programmers that DRAM is the new disk.
> 768 MB of L4 means your problem size is limited to a little less than
> that, otherwise random access is out.
It may be worse than you think.
I have not been able to read the redbook yet (Google Chrome and Adobe
Reader were conspiring to hang, so I could not view/download the
document; I had to fall back to Internet Explorer).
But I wonder what the cache line size is in the interior caches, the L1,
L2, L3?
With the IBM heritage, it may be a small, sectored cache line. Say 64 bytes.
But, I also recall seeing IBM machines that could transfer a full 2048
bits between cache and registers in a single cycle. Something which I
conjecture is good for context switches on mainframe workloads.
If the 256B cache line is used in the inside caches, then it might be
that only the L1 is really capable of random access.
Or, rather: there is no absolute "capable of random access". Instead,
there are penalties for random access.
I suggest that the main penalty should be measured as the ratio of
the time to transfer N bytes by small random accesses
to
the time to transfer N bytes by a long sequential burst.
Let us talk about 64-bit random accesses.
Inside the L1 cache at Intel, with 64 byte cache lines, this ratio is
close to 1:1.
Accessing data that fits in the L2, this ratio is circa 8:1 - i.e. long
burst sequential is 8X faster, higher bandwidth, than 64b random accesses.
From main memory the 8:1 ratio still approximately holds wire-wise, but
buffering effects tend to crop up which inflate it.
With 256B cache lines, the wire contribution to this ratio is 32:1 -
i.e. long burst sequential is 32X faster, higher bandwidth, than 64b
random accesses. Probably with more slowdowns on top of that.
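To make that metric concrete, here is a rough C sketch of how one might
measure the ratio at a given level of the hierarchy. The buffer size,
the shuffle, and the timing method are my own choices, and a careful
measurement would also have to repeat runs and account for prefetchers:

/* Sketch: measure sequential vs. 64-bit random access time over the
   same N bytes. Illustrative assumptions: the buffer targets the cache
   level under test, clock_gettime() granularity is adequate, and a
   Fisher-Yates shuffle makes the index stream "random enough". */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define N_WORDS (1 << 20)   /* 8 MB of data - adjust per cache level */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    static uint64_t data[N_WORDS];
    static uint32_t idx[N_WORDS];
    uint64_t sum = 0;

    for (uint32_t i = 0; i < N_WORDS; i++) idx[i] = i;
    for (uint32_t i = N_WORDS - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
        uint32_t j = rand() % (i + 1);
        uint32_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    double t0 = seconds();
    for (uint32_t i = 0; i < N_WORDS; i++) sum += data[i];       /* sequential */
    double t1 = seconds();
    for (uint32_t i = 0; i < N_WORDS; i++) sum += data[idx[i]];  /* random 64b */
    double t2 = seconds();

    printf("random:sequential time ratio = %.1f:1 (checksum %llu)\n",
           (t2 - t1) / (t1 - t0), (unsigned long long)sum);
    return 0;
}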
---
What I am concerned about is that it may not be that "DRAM is the new disk".
It may be that "L2 cache is the new disk". More likely "L4 cache is the
new disk".
---
By the way, this is the first post I am making to Robert Myers'
high-bandwid...@googlegroups.com
mailing list.
Robert: is this an appropriate topic?
"Affine" is a good word. I think including general rotations is good.
But "crystalline" sparkles. :-)
Thanks Jason for reminding me about GUPS. Plus, if you look at some of
the recent GUPS champions, they have been doing in software what I
propose doing in hardware, along the lines of scatter/gather bursting.
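For reference, the heart of GUPS is just a stream of 8-byte
read-modify-writes at effectively random addresses. A simplified sketch
(my own, not the official HPCC RandomAccess code; the table size and
the LCG-style generator are stand-ins for the real GF(2) polynomial
generator):

/* Simplified GUPS (RandomAccess) inner loop - a sketch, not the
   official HPCC code. TABLE_BITS and the LCG are illustrative. */
#include <stdint.h>

#define TABLE_BITS 20                    /* 1M-entry (8 MB) table - assumed */
#define TABLE_SIZE (1ULL << TABLE_BITS)

static uint64_t table[TABLE_SIZE];

void gups_kernel(uint64_t n_updates)
{
    uint64_t ran = 1;
    for (uint64_t i = 0; i < n_updates; i++) {
        /* The real benchmark steps a GF(2) polynomial generator; a
           64-bit LCG stands in for it here. */
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        /* One 8-byte read-modify-write at an effectively random
           address: exactly the small random access discussed above. */
        table[(ran >> 32) & (TABLE_SIZE - 1)] ^= ran;
    }
}

The software champions bucket a window of such updates by destination
node or bank and ship each bucket as one message - scatter/gather
bursting done in software.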
I suspect that we will end up in a 3-way discussion:
1) sequential
2) crystalline or affine - regular access patterns, just not supported
well by block sequential
3) random access patterns
My user community is more in the random access patterns than the
crystalline or affine. E.g. I am into pointers, linked data
structures, irregular sparse arrays. Not regularity. More AI than FFT.
At the moment 2) and 3) are allied. A good solution to 3) will also
help 2) a lot, but not vice versa. I am a little bit afraid of
proposals that seem to hardwire support for certain access patterns into
the hardware, at the expense of others not anticipated by the
anticipator of little foresight.
--
I grew up admiring the stunt box scheduling used to coordinate stride
patterns on old machines. But every time I looked closely, it was not
really all that fancy - not as fancy as people in this group suggested.
They weren't solving Diophantine equations for access pattern
optimizations in real time; they were applying simple heuristics,
usually variants of greedy, that delivered reasonable performance.
(Now, the *compiler* might be solving Diophantine equations - but
usually not for scheduling.)
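By "simple heuristics, usually variants of greedy" I mean something on
the order of the following toy scheduler - grant the oldest pending
request whose bank is free. The queue layout and bank model are
illustrative assumptions, not a description of any real stunt box:

/* A toy greedy memory scheduler: grant the oldest pending request
   whose bank is not busy. Illustrative only. */
#include <stdint.h>
#include <stdbool.h>

#define N_BANKS   16
#define QUEUE_LEN 8

typedef struct {
    bool     valid;
    uint64_t addr;
    uint32_t age;    /* cycles spent waiting */
} request_t;

/* Returns the queue index to grant this cycle, or -1 if none. */
int greedy_pick(const request_t q[QUEUE_LEN], const bool bank_busy[N_BANKS])
{
    int best = -1;
    for (int i = 0; i < QUEUE_LEN; i++) {
        if (!q[i].valid)
            continue;
        unsigned bank = (q[i].addr >> 6) % N_BANKS;  /* 64B interleave - assumed */
        if (bank_busy[bank])
            continue;
        if (best < 0 || q[i].age > q[best].age)
            best = i;   /* oldest-first: a simple greedy heuristic */
    }
    return best;
}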
--
I'm sympathetic to the randomization, but not certain about it.