Re: Redbook on the new z196 mainframes.

Andy Glew <newsgroup at comp-arch.net>

Jul 28, 2010, 11:05:50 AM
to high-bandwid...@googlegroups.com
On 7/27/2010 8:08 AM, Terje Mathisen wrote:
> Andy Glew wrote:
>> On 7/27/2010 6:16 AM, Niels Jørgen Kruse wrote:
>>> 24 MB L3 per 4 cores
>>> up to 768 MB L4
>>> 256 byte line sizes at all levels.
>>
>> 256 *BYTE*? [cache line size on new IBM z-Series]
>
> Yes, that one rather screamed at me as well.
>>
>> 2048 bits?
>>
>> Line sizes 4X the typical 64B line size of x86?
>>
>> These aren't cache lines. They are disk blocks.
>
> Yes. So what?
>
> I (and Nick, and you afair) have talked for years about how current CPUs
> are just like mainframes of old:
>
> new old
> DISK -> TAPE : Sequential access only
> RAM -> DISK : HW-controlled, block-based transfer
> CACHE -> RAM : Actual random access, but blocks are still faster
>
>>
>> Won't make Robert Myers happy.

Yes, I know. Many of my responses to Robert Myers have been
explanations of this, the state of the world.

However, the reason that I am willing to cheer Robert on as he tilts at
his windmill, and even to try to help out a bit, is that this trend is
not a fundamental limit. I.e. there is no fundamental reason that we
have to be hurting random accesses as memory systems evolve.

People seem to act as if there are only two design points:

* low latency, small random accesses
* long latency, burst accesses

But it is possible to build a system that supports

* small random accesses with long latencies

By the way, it is more appropriate to say that the current trend is towards

* long latency, randomly placed long sequential burst accesses.

(Noting that you can have random non-sequential burst accesses, as I
have recently posted about.)

The things that seem to be driving the evolution towards long sequential
bursts are

a) tags in caches - the smaller the cache objects, the more area wasted
on tags. But if you don't care about tags for your small random accesses...

b) signalling overhead - long sequential bursts have a ratio of address
bits to data bits of, say, 64:512 = 1:8 for Intel's 64 byte cache lines,
and 64:2048 = 8:256 = 1:32 for IBM's 256 byte cache lines. Whereas
scatter gather has a signalling ratio of more like 1:1.

Signalling overhead manifests both in bandwidth and power.
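
(Back-of-the-envelope, to make that arithmetic concrete: assume one 64-bit
address per request and ignore command, ECC, and framing bits, so treat it
as a sketch rather than a protocol model.)

  # Rough address-to-data signalling ratios.  Assumes one 64-bit address
  # per request; command/ECC/framing bits are ignored.
  def signalling_ratio(transfer_bytes, address_bits=64):
      data_bits = transfer_bytes * 8
      return address_bits / data_bits

  print(signalling_ratio(64))    # 64B line:  64:512  -> 1:8
  print(signalling_ratio(256))   # 256B line: 64:2048 -> 1:32
  print(signalling_ratio(8))     # 8B scatter/gather element: 64:64 -> 1:1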

One can imagine an interconnect that handles both sequential bursts and
scatter/gather random accesses - so that you don't pay a penalty for
sequential access patterns, but you support small random access patterns
with long latencies well. But...

c) this is complex. More complex than simply supporting sequential bursts.

But I'm not afraid of complexity. I try to avoid complexity, when there
are simpler ways of solving a problem. But it appears that this random
access problem is a problem that (a) is solvable (with a bit of
complexity), (b) has customers (Robert, and some other supercomputing
customers I have met, some very important), and (c) isn't getting solved
any other way.

For all that, we talk about persuading programmers that DRAM is the new disk.

> 768 MB of L4 means your problem size is limited to a little less than
> that, otherwise random access is out.

It may be worse than you think.

I have not been able to read the redbook yet (Google Chrome and Adobe
Reader were conspiring to hang, and could not view/download the
document; I had to fall back to Internet Explorer).

But I wonder what the cache line size is in the interior caches, the L1,
L2, L3?

With the IBM heritage, it may be a small, sectored cache line. Say 64 bytes.

But, I also recall seeing IBM machines that could transfer a full 2048
bits between cache and registers in a single cycle. Something which I
conjecture is good for context switches on mainframe workloads.

If the 256B cache line is used in the inside caches, then it might be
that only the L1 is really capable of random access.

Or, rather: there is no absolute "capable of random access". Instead,
there are penalties for random access.

I suggest that the main penalty should be measured as the ratio of

the bandwidth achieved transferring N bytes by long sequential bursts
to
the bandwidth achieved transferring the same N bytes by small random accesses.

Let us talk about 64-bit random accesses.

Inside the L1 cache at Intel, with 64 byte cache lines, this ratio is
close to 1:1.

Accessing data that fits in the L2, this ratio is circa 8:1 - i.e. long
burst sequential is 8X faster, higher bandwidth, than 64b random accesses.

From main memory the 8:1 ratio still approximately holds wire-wise, but
buffering effects tend to crop up, which inflate it further.

With 256B cache lines, the wire contribution to this ratio is 32:1 -
i.e. long burst sequential is 32X faster, higher bandwidth, than 64b
random accesses. Probably with further slowdowns from buffering on top of that.
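
(The wire-level part of that is easy to model, if you assume each small
random access still occupies the interface for a full cache line and ignore
buffering effects entirely - a toy model, nothing more:)

  # Toy model: each 8-byte random access still moves a full cache line
  # across the interface; a long sequential burst uses every byte.
  def seq_vs_random(line_bytes, access_bytes=8):
      return line_bytes / access_bytes

  print(seq_vs_random(64))    # 64B lines:  8.0  -> the ~8:1 above
  print(seq_vs_random(256))   # 256B lines: 32.0 -> the 32:1 wire contribution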

---


What I am concerned about is that it may not be that "DRAM is the new disk".

It may be that "L2 cache is the new disk". More likely "L4 cache is the
new disk".


---


By the way, this is the first post I am making to Robert Myers's

high-bandwid...@googlegroups.com

mailing list


Robert: is this an appropriate topic?

Andy Glew <newsgroup at comp-arch.net>

Jul 28, 2010, 9:49:25 PM
to high-bandwid...@googlegroups.com
On 7/28/2010 12:46 PM, Jason Riedy wrote:
> And Robert Myers writes:
>> Affine might be too general, though, since it includes general
>> rotation.
>
> I don't know if anyone has thought hard about system-wide
> coordination of the stride patterns on modern (less synchronized)
> machines, though, and that could prove critical for nontrivial
> streaming access in "exascale" memories.
>
> The MTA/XMT has one "solution" for global coordination of random
> access patterns: randomize (hash) all addresses. The implications
> for reliability are, um, open for interpretation. Forget about
> direct I/O. And you still must avoid hot-spots. I'm dubious that
> the randomization buys anything but pain.

"Affine" is a good word. I think including general rotations is good.

But "crystalline" sparkles. :-)

Thanks Jason for reminding me about GUPS. Plus, if you look at some of
the recent GUPS champions, they have been doing in software what I
propose doing in hardware, along the lines of scatter/gather bursting.
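
(Roughly, the software trick is to bin the random updates by destination
region and apply each bin as a burst, trading latency for fewer, larger
transfers. A sketch of the general idea only, not any particular GUPS
submission:)

  from collections import defaultdict

  def apply_updates_binned(table, updates, region_bits=12):
      # Bin random (address, value) updates by destination region, then
      # flush each bin; in hardware each bin would become one
      # scatter/gather burst aimed at that region.
      bins = defaultdict(list)
      for addr, val in updates:
          bins[addr >> region_bits].append((addr, val))
      for region, items in bins.items():
          for addr, val in items:
              table[addr] ^= val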

I suspect that we will end up in a 3-way discussion:

1) sequential

2) crystalline or affine - regular access patterns, just not supported
well by block sequential

3) random access patterns

My user community is more in the random access patterns than the
crystalline or affine. E.g. I am into pointers, linked data
structures, irregular sparse arrays. Not regularity. More AI than FFT.

At the moment 2) and 3) are allied. A good solution to 3) will also
help 2) a lot, but not vice versa. I am a little bit afraid of
proposals that seem to hardwire support for certain access patterns into
the hardware, at the expense of others not anticipated by the
anticipator of little foresight.

--

I grew up admiring the stunt box scheduling to coordinate stride
patterns on old machines. But every time I looked closely, it was not
really all that fancy - not as fancy as people in this group suggested.
They weren't solving Diophantine equations for access pattern
optimizations in real time; they were applying simple heuristics,
usually variants of greedy, that delivered reasonable performance.
(Now, the *compiler* might be solving Diophantine equations - but
usually not for scheduling.)
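
(For the flavour of heuristic I mean: something in the FR-FCFS spirit -
take a request that hits an already-open row if one exists, otherwise the
oldest request - is about as fancy as it gets. My sketch, not any
particular machine's stunt box:)

  def pick_next(requests, open_rows):
      # requests: list of (arrival_order, bank, row); open_rows: {bank: row}.
      # Greedy: prefer row-buffer hits, oldest request within the pool.
      hits = [r for r in requests if open_rows.get(r[1]) == r[2]]
      pool = hits if hits else requests
      return min(pool, key=lambda r: r[0])

  reqs = [(0, 0, 5), (1, 1, 7), (2, 0, 3)]
  print(pick_next(reqs, {0: 3, 1: 9}))   # -> (2, 0, 3), a row-buffer hit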

--

I'm sympathetic to the randomization, but not certain about it.

Jacko

Jul 29, 2010, 12:29:17 PM
to high-bandwidth computing
Stored sequence generators.

All access patterns are sequences of addresses, and any pattern can itself
be accessed by an address (a.k.a. index). Thus 'simple' memory-based
programs can generate address sequences for processor use.

What would the memory's instruction set look like?
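
Purely as a sketch (opcode names invented here), it might be as little as a
few sequence-generator opcodes interpreted near the memory:

  def gen_addresses(program):
      # Interpret a tiny "address program" and emit the address stream.
      for op, *args in program:
          if op == "STRIDE":            # base, count, stride (bytes)
              base, count, stride = args
              for i in range(count):
                  yield base + i * stride
          elif op == "GATHER":          # explicit index vector
              yield from args[0]
          # a real unit might also chase linked lists, which would require
          # it to read memory between address emissions

  prog = [("STRIDE", 0x1000, 4, 256), ("GATHER", [0x8008, 0x0040, 0x7f00])]
  print([hex(a) for a in gen_addresses(prog)])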

Robert Myers

Jul 29, 2010, 2:40:06 PM
to high-bandwidth computing
But isn't the first question how *anything*, other than the processor
and all of its attendant machinery, gets to touch memory?

I can imagine all kinds of prestaging strategies for data, but, at the
moment, they all eat bandwidth and trash cache, or so it seems.

Robert.

Jacko

Jul 30, 2010, 11:38:20 AM
to high-bandwidth computing
But that's the point of programmatic creation of the 'burst' address
sequencer's sequence, executed by the memory.

Robert Myers

Aug 3, 2010, 9:06:29 PM
to high-bandwidth computing
Ok.

Are we doing this with a PIM?

Send a program to the memory to generate memory addresses?

Or are you asking for a set of less-general ways of referring to
memory (constant stride, linked list, etc.)?

Download programs to be stored and referred to as user-generated
instructions (or macros or subroutines)?

In the kinds of situations I'm familiar with, it would be better to
have user-definable memory functions, even if the initial process of
definition involved fairly high overhead, since an access pattern
would typically be reused many times.

Robert.

Jacko

Aug 4, 2010, 9:52:56 AM
to high-bandwidth computing
Burst User Macros - so the macro number becomes part of the address on
use, and a section of the memory is mapped to the macro execution unit.
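
(As a sketch of that mapping - encoding and names invented for
illustration - an access into a reserved window selects a stored macro by
number, so a pattern is defined once and re-triggered cheaply many times:)

  MACRO_WINDOW = 0xF0000000     # top nibble set -> macro execution unit
  MACRO_MASK   = 0x0FFFFFFF

  stored_macros = {0: "linear burst"}   # a default macro present from the start

  def decode_access(addr):
      if (addr & MACRO_WINDOW) == MACRO_WINDOW:
          n = addr & MACRO_MASK
          return ("run-macro", n, stored_macros.get(n))
      return ("plain-access", addr, None)

  print(decode_access(0xF0000000))   # triggers stored macro 0
  print(decode_access(0x00001234))   # ordinary memory access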

Jacko

Aug 4, 2010, 9:54:16 AM
to high-bandwidth computing
Only the ROMed linear BUM is present at the start.

Robert Myers

Aug 4, 2010, 11:47:34 PM
to high-bandwidth computing
As with everything else you might think of in computers and computer
architecture, I suspect this or at least related ideas have already
been explored. I'm trying to find what's out there already.

Robert.