Fwd: High-bandwidth computing interest group

Robert Myers

unread,
Jul 23, 2010, 2:50:45 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: Robert Myers <rbmyers...@gmail.com>
Date: Jul 18, 6:02 pm
Subject: High-bandwidth computing interest group
To: comp.arch


I have lamented, at length, the proliferation of flops at the expense of
bytes-per-flop in what are currently styled as supercomputers.

This subject came up recently on the Fedora User's Mailing List when
someone claimed that GPUs are just what the doctor ordered to make
high-end computation pervasively available.  Even I have fallen into
that trap, in this forum, and I was quickly corrected.  In the most
general circumstance, GPUs seem practically to have been invented to
expose bandwidth starvation.

At least one person on the Fedora list got it and says that he has
encountered similar issues in his own work (what is in short supply is
not flops, but bytes per flop).  He also seems to understand that the
problem is fundamental and cannot be made to go away with an endless
proliferation of press releases, photographs of "supercomputers," and an
endless procession of often meaningless color plots.

Since the issue is only tangentially related to the list, he suggested a
private mailing list to pursue the issue further without annoying others
with a topic that most are manifestly not interested in.

The subject is really a mix of micro and macro computer architecture,
the physical limitations of hardware, the realities of what is ever
likely to be funded, and the grubby details of computational
mathematics.

Since I have talked most about the subject here and gotten the most
valuable feedback here, I thought to solicit advice as to what kind of
forum would seem most plausible/attractive to pursue such a subject.  I
could probably host a mailing list myself, but would that be the way to
go about it and would anyone else be interested?

Email me privately if you don't care to respond publicly.

Thanks.

Robert.

Robert Myers

unread,
Jul 23, 2010, 2:51:11 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: Edward Feustel <efeus...@hughes.net>
Date: Jul 19, 5:54 am
Subject: High-bandwidth computing interest group
To: comp.arch


On Sun, 18 Jul 2010 18:02:49 -0400, Robert Myers <rbmyers...@gmail.com> wrote:

>I have lamented, at length, the proliferation of flops at the expense of
>bytes-per-flop in what are currently styled as supercomputers.

---


>Since I have talked most about the subject here and gotten the most
>valuable feedback here, I thought to solicit advice as to what kind of
>forum would seem most plausible/attractive to pursue such a subject.  I
>could probably host a mailing list myself, but would that be the way to
>go about it and would anyone else be interested?

>Email me privately if you don't care to respond publicly.

>Thanks.

>Robert.

This is an important subject. I would suggest that everything be
archived in a searchable environment. Keywords and tightly focused
discussion would be helpful (if possible). Please let me know if you
decide to do a wiki or e-mail list.

Ed Feustel
Dartmouth College

Robert Myers

unread,
Jul 23, 2010, 2:51:32 PM7/23/10
to high-bandwid...@googlegroups.com


---------- Forwarded message ----------
From: jacko <jackokr...@gmail.com>
Date: Jul 19, 11:02 am
Subject: High-bandwidth computing interest group
To: comp.arch


I might click a look see.

Robert Myers

unread,
Jul 23, 2010, 2:54:23 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: MitchAlsup <MitchAl...@aol.com>
Date: Jul 19, 11:36 am
Subject: High-bandwidth computing interest group
To: comp.arch


It seems to me that having less than 8 bytes of memory bandwidth per
flop leads to an endless series of cache exercises.**

It also seems to me that nobody is going to be able to put the
required 100 GB/s/processor pin interface on the part.*

Nor does it seem it would have the latency needed to strip-mine main
memory continuously were the required BW made available.

Thus, we are in essence screwed.

* current bandwidths
a) 3 GHz processors with 2 FP pipes running 128-bit (2 x DP) flops
(a la SSE).  This gives 12 GFlop/s per processor
b) 12 GFlop/s per processor demands ~100 GB/s per processor
c) DDR3 can achieve 17 GB/s per channel
d) high end PC processors can afford 2 memory channels
e) therefore we are screwed:
e.1) The memory system can supply only 1/3rd of what a single processor
wants
e.2) There are 4 (and growing) processors
e.3) therefore the memory system can support less than 1/12 as much BW
as required.

Mitch

** The ideal memBW/flop is 3 memory operations per flop, and back in
the Cray-1 to X-MP transition much of the vectorization gain occurred
from the added memBW and the better chaining.
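
As a back-of-the-envelope check, the arithmetic above can be written out
directly; every number below is simply a figure quoted in this post
(3 GHz, 2 pipes, 2 DP lanes per 128-bit op, 8 bytes/flop desired,
17 GB/s per DDR3 channel, 2 channels, 4 cores), nothing is measured:

#include <stdio.h>

int main(void)
{
    double ghz            = 3.0;   /* core clock, GHz                  */
    double fp_pipes       = 2.0;   /* FP pipes per core                */
    double dp_per_pipe    = 2.0;   /* DP lanes per 128-bit SSE op      */
    double bytes_per_flop = 8.0;   /* desired memory bytes per flop    */

    double gflops = ghz * fp_pipes * dp_per_pipe;           /* 12 GFlop/s */
    double demand = gflops * bytes_per_flop;                /* ~96 GB/s   */

    double ddr3_per_channel = 17.0;                         /* GB/s       */
    int    channels         = 2;
    int    cores            = 4;
    double supply           = ddr3_per_channel * channels;  /* 34 GB/s    */

    printf("demand per core: %.0f GB/s, supply per chip: %.0f GB/s\n",
           demand, supply);
    printf("fraction of one core's demand met: %.2f\n", supply / demand);
    printf("fraction with %d cores active:     %.2f\n",
           cores, supply / (demand * cores));
    return 0;
}

The two printed fractions come out at roughly 1/3 and roughly 1/12,
matching e.1) and e.3) above.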

Robert Myers

unread,
Jul 23, 2010, 2:54:52 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: nik Simpson <ni...@knology.net>
Date: Jul 19, 3:44 pm
Subject: High-bandwidth computing interest group
To: comp.arch


On 7/19/2010 10:36 AM, MitchAlsup wrote:

> d) high end PC processors can afford 2 memory channels

Not quite as screwed as that, the top-end Xeon & Opteron parts have 4
DDR3 memory channels, but still screwed. For the 2-socket space, it's 3
DDR3 memory channels for typical server processors. Of course, the move
to on-chip memory controllers means that scope for additional memory
channels is pretty much "zero" but that's the price you pay for
commodity parts, they are designed to meet the majority of customers,
and it's hard to justify the costs of additional memory channels at the
processor and board layout levels just to satisfy the needs of bandwidth
crazy HPC apps ;-)

--
Nik Simpson

Robert Myers

unread,
Jul 23, 2010, 2:55:17 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: jacko <jackokr...@gmail.com>
Date: Jul 19, 6:21 pm
Subject: High-bandwidth computing interest group
To: comp.arch


Why do memory channels have to be wired by inverter chains and
relatively long track interconnect on the circuit board? Microwave
pipework from chiptop to chiptop is perhaps possible, but maintaining
enough bandwidth over the microwave channel means many GHz; still, it
is close, so of a low radiant power!

Flops or not? Let's generalize and call them nops, said he with a touch
of sarcasm. Non-specific Operations, needing GB/s.

Cheers Jacko

Robert Myers

unread,
Jul 23, 2010, 2:55:39 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: Robert Myers <rbmyers...@gmail.com>
Date: Jul 19, 8:18 pm
Subject: High-bandwidth computing interest group
To: comp.arch


On Jul 19, 3:44 pm, nik Simpson <ni...@knology.net> wrote:

> On 7/19/2010 10:36 AM, MitchAlsup wrote:

> > d) high end PC processors can afford 2 memory channels

> Not quite as screwed as that, the top-end Xeon & Opteron parts have 4
> DDR3 memory channels, but still screwed. For the 2-socket space, it's 3
> DDR3 memory channels for typical server processors. Of course, the move
> to on-chip memory controllers means that scope for additional memory
> channels is pretty much "zero" but that's the price you pay for
> commodity parts, they are designed to meet the majority of customers,
> and it's hard to justify the costs of additional memory channels at the
> processor and board layout levels just to satisfy the needs of bandwidth
> crazy HPC apps ;-)

Maybe the capabilities of high-end x86 are and will continue to be so
compelling that, unless IBM is building the machine, that's what we're
looking at for the foreseeable future.

I don't understand the economics of less mass-market designs, but
maybe the perfect chip would be some iteration of an "open" core,
maybe less heat-intensive, less expensive, and soldered-down with more
attention to memory and I/O resources.

Or maybe you could dual port or route memory, accepting whatever cost
in latency there is, and at least allow some pure DMA device to
perform I/O and gather/scatter chores so as to maximize what processor
bandwidth there is.

I'd like some blue sky thinking.

Robert.

Robert Myers

unread,
Jul 23, 2010, 2:56:00 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: Andrew Reilly <areilly...@bigpond.net.au>
Date: Jul 20, 12:43 am
Subject: High-bandwidth computing interest group
To: comp.arch


On Mon, 19 Jul 2010 08:36:18 -0700, MitchAlsup wrote:
> The memory system can supply only 1/3rd of what a single processor wants

If that's the case (and down-thread Nik Simpson suggests that the best
case might even be twice as "good", or 2/3 of a single processor's worst-
case demand), then that's amazingly better than has been available, at
least in the commodity processor space, for quite a long time.  I
remember when I started moving DSP code onto PCs, and finding anything
with better than 10MB/s memory bandwidth was not easy.  These days my
problem set typically doesn't get out of the cache, so that's not
something I personally worry about much any more.  If your problem set is
driven by stream-style vector ops, then you might as well switch to low-
power critters like Atoms, and match the flops to the available
bandwidth, and save some power.

On the other hand, I have a lot of difficulty believing that even for
large-scale vector-style code, a bit of loop fusion, blocking or code
factoring can't bring value-reuse up to a level where even (0.3/nProcs)
available bandwidth is plenty.
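
To make the value-reuse point concrete, here is a minimal C sketch of a
naive versus a blocked matrix multiply; the sizes and block factor are
arbitrary, and the only point is that the blocked form re-uses each
loaded tile B times, cutting main-memory traffic per flop by roughly
that factor once the tiles fit in cache:

#include <stddef.h>

#define N 1024
#define B 64            /* block edge; assume B divides N */

/* Naive: each element of b is re-fetched from memory ~N times. */
void matmul_naive(const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double s = 0.0;
            for (size_t k = 0; k < N; k++)
                s += a[i * N + k] * b[k * N + j];
            c[i * N + j] = s;
        }
}

/* Blocked: each BxB tile is used B times once loaded, so traffic per
 * flop drops by roughly a factor of B.  Assumes c is zeroed on entry. */
void matmul_blocked(const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t kk = 0; kk < N; kk += B)
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t j = jj; j < jj + B; j++) {
                        double s = c[i * N + j];
                        for (size_t k = kk; k < kk + B; k++)
                            s += a[i * N + k] * b[k * N + j];
                        c[i * N + j] = s;
                    }
}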

That's single-threaded application-think.  Where you *really* need that
bandwidth, I suspect, is for the inter-processor communication between
your hordes of cooperating (ha!) cores.

Cheers,

--
Andrew

Robert Myers

unread,
Jul 23, 2010, 2:56:22 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: jacko <jackokr...@gmail.com>
Date: Jul 20, 1:44 am
Subject: High-bandwidth computing interest group
To: comp.arch


On 20 July, 05:43, Andrew Reilly <areilly...@bigpond.net.au> wrote:

> On Mon, 19 Jul 2010 08:36:18 -0700, MitchAlsup wrote:
> > The memory system can supply only 1/3rd of what a single processor wants

> If that's the case (and down-thread Nik Simpson suggests that the best
> case might even be twice as "good", or 2/3 of a single processor's worst-
> case demand), then that's amazingly better than has been available, at
> least in the commodity processor space, for quite a long time.  I
> remember when I started moving DSP code onto PCs, and finding anything
> with better than 10MB/s memory bandwidth was not easy.  These days my
> problem set typically doesn't get out of the cache, so that's not
> something I personally worry about much any more.  If your problem set is
> driven by stream-style vector ops, then you might as well switch to low-
> power critters like Atoms, and match the flops to the available
> bandwidth, and save some power.

Or run a bigger network off the same power.

> On the other hand, I have a lot of difficulty believing that even for
> large-scale vector-style code, a bit of loop fusion, blocking or code
> factoring can't bring value-reuse up to a level where even (0.3/nProcs)
> available bandwidth is plenty.

Prob(able)ly - sick perverse hanging on to a longer word in the post
quantum age.

> That's single-threaded application-think.  Where you *really* need that
> bandwidth, I suspect, is for the inter-processor communication between
> your hordes of cooperating (ha!) cores.

Maybe. I think much of the problem is not vectors, as these usually
have a single index; it's matrix and tensor problems, which have 2 or n
indexes, e.g. T[a,b,c,d].

The fact is that many product sums run over different indexes, even with
transpose elimination coding (automatic switching between row and
column order based on linear sequencing of a write target, or for best
read/write/read/write etc. performance) in the prefetch context, with
limited gather/scatter.
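
A minimal C illustration of the index-order issue: the same two-index
reduction walks memory either with unit stride or with stride n,
depending only on which index is innermost (sizes and row-major layout
are assumed for the sketch):

/* Same sum over a 2-index array T[i][j]; only the loop order differs. */
double sum_inner_j(const double *t, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += t[i * n + j];      /* unit stride: cache lines fully used */
    return s;
}

double sum_inner_i(const double *t, int n)
{
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += t[i * n + j];      /* stride n: one useful double per line */
    return s;
}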

Maybe even some multi-store (slightly wasteful of memory cells) with
differing address bit swappings? The high bits as an address map
translation selector, with a bank write and read combo 'union' operation
(* or +)?

Ummm.

Robert Myers

unread,
Jul 23, 2010, 2:56:42 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: George Neuner <gneun...@comcast.net>
Date: Jul 20, 10:33 am
Subject: High-bandwidth computing interest group
To: comp.arch


On Mon, 19 Jul 2010 08:36:18 -0700 (PDT), MitchAlsup
<MitchAl...@aol.com> wrote:

>It seems to me that having less than 8 bytes of memory bandwidth per
>flop leads to an endless series of cache exercises.**

>It also seems to me that nobody is going to be able to put the
>required 100 GB/s/processor pin interface on the part.*

>Nor does it seem it would have the latency needed to strip-mine main
>memory continuously were the required BW made available.

>Thus, we are in essence screwed.

>* current bandwidths
>a) 3 GHz processors with 2 FP pipes running 128-bit (2 x DP) flops
>(a la SSE).  This gives 12 GFlop/s per processor
>b) 12 GFlop/s per processor demands ~100 GB/s per processor
>c) DDR3 can achieve 17 GB/s per channel
>d) high end PC processors can afford 2 memory channels
>e) therefore we are screwed:
>e.1) The memory system can supply only 1/3rd of what a single processor
>wants
>e.2) There are 4 (and growing) processors
>e.3) therefore the memory system can support less than 1/12 as much BW
>as required.

>Mitch

>** The ideal memBW/flop is 3 memory operations per flop, and back in
>the Cray-1 to X-MP transition much of the vectorization gain occurred
>from the added memBW and the better chaining.

ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers.  Yes there was a lot of latency (and I
know you [Mitch] and Robert Myers are dead set against latency too)
but the staging data movement provided a lot of opportunity to overlap
with real computation.

YMMV, but I think pipeline vector units need to make a comeback.  I am
not particularly happy at the thought of using them again, but I don't
see a good way around it.

George

Robert Myers

unread,
Jul 23, 2010, 2:57:07 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: n...@cam.ac.uk
Date: Jul 20, 10:41 am
Subject: High-bandwidth computing interest group
To: comp.arch


In article <04cb46947eo6mur14842fqj45pvrqp6...@4ax.com>,
George Neuner  <gneun...@comcast.net> wrote:

>ISTM bandwidth was the whole point behind pipelined vector processors
>in the older supercomputers.  Yes there was a lot of latency (and I
>know you [Mitch] and Robert Myers are dead set against latency too)
>but the staging data movement provided a lot of opportunity to overlap
>with real computation.

Yes.

>YMMV, but I think pipeline vector units need to make a comeback.  I am
>not particularly happy at the thought of using them again, but I don't
>see a good way around it.

NO chance!  It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.

Regards,
Nick Maclaren.

Robert Myers

unread,
Jul 23, 2010, 2:57:31 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: jacko <jackokr...@gmail.com>
Date: Jul 20, 10:54 am
Subject: High-bandwidth computing interest group
To: comp.arch


On 20 July, 15:41, n...@cam.ac.uk wrote:

> In article <04cb46947eo6mur14842fqj45pvrqp6...@4ax.com>,
> George Neuner  <gneun...@comcast.net> wrote:

> >ISTM bandwidth was the whole point behind pipelined vector processors
> >in the older supercomputers.  Yes there was a lot of latency (and I
> >know you [Mitch] and Robert Myers are dead set against latency too)
> >but the staging data movement provided a lot of opportunity to overlap
> >with real computation.

> Yes.

> >YMMV, but I think pipeline vector units need to make a comeback.  I am
> >not particularly happy at the thought of using them again, but I don't
> >see a good way around it.

> NO chance!  It's completely infeasible - they were dropped because
> the vendors couldn't make them for affordable amounts of money any
> longer.

> Regards,
> Nick Maclaren.

Maybe he needs an FPGA card with many single-cycle Booth multipliers
on chip. A bit slow, though, due to routing delays, but very parallel.
There really should be a way to queue mulmac pairs with a reset to zero
(or the nilpotent).

Robert Myers

unread,
Jul 23, 2010, 2:58:04 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: George Neuner <gneun...@comcast.net>
Date: Jul 21, 6:18 pm
Subject: High-bandwidth computing interest group
To: comp.arch


On Tue, 20 Jul 2010 15:41:13 +0100 (BST), n...@cam.ac.uk wrote:
>In article <04cb46947eo6mur14842fqj45pvrqp6...@4ax.com>,
>George Neuner  <gneun...@comcast.net> wrote:

>>ISTM bandwidth was the whole point behind pipelined vector processors
>>in the older supercomputers.  ...
>> ... the staging data movement provided a lot of opportunity to
>>overlap with real computation.

>>YMMV, but I think pipeline vector units need to make a comeback.

>NO chance!  It's completely infeasible - they were dropped because
>the vendors couldn't make them for affordable amounts of money any
>longer.

Hi Nick,

Actually I'm a bit skeptical of the cost argument ... obviously it's
not feasible to make large banks of vector registers fast enough for
multiple GHz FPUs to fight over, but what about a vector FPU with a
few dedicated registers?

There are a number of (relatively) low cost DSPs in the up to ~300MHz
range that have large (32KB and up, 4K double floats) 1ns dual ported
SRAM, are able to sustain 1 or more flops/SRAM cycle, and which match
or exceed the sustainable FP performance of much faster CPUs.  Some of
these DSPs are $5-$10 in industrial quantities and some even are cheap
in hobby quantities.
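
As a rough sanity check on that class of part, the figures quoted above
(1 ns dual-ported SRAM, 8-byte words, roughly 300 MHz, one flop per
cycle) imply the following; none of it refers to a specific device:

#include <stdio.h>

int main(void)
{
    double sram_cycle_ns = 1.0;     /* 1 ns dual-ported SRAM              */
    double ports         = 2.0;
    double width_bytes   = 8.0;     /* one double per port per cycle      */

    double sram_gbs = ports * width_bytes / sram_cycle_ns;  /* 16 GB/s    */
    double gflops   = 300e6 / 1e9;  /* ~300 MHz, 1 flop/cycle = 0.3 GF/s  */

    printf("on-chip SRAM bandwidth: %.0f GB/s\n", sram_gbs);
    printf("sustained flops:        %.1f GFlop/s\n", gflops);
    printf("bytes per flop:         %.0f\n", sram_gbs / gflops);
    return 0;
}

That is about 16 GB/s of local bandwidth against 0.3 GFlop/s - tens of
bytes per flop, which is the point: a modest FPU next to fast SRAM is
far better fed than a GHz FPU behind a shared DDR3 bus.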

Given the economics of mass production, it would seem that creating
some kind of vector coprocessor combining FPU, address units and a few
banks of SRAM with host DMA access should be relatively cheap if the
FPU is kept under 500MHz.

Obviously, it could not have the peak performance of the GHz host FPU,
but a suitable problem could easily keep several such processors
working.  Crays were a b*tch, but when the problem suited them ...
With several vector coprocessors on a plug-in board, this isn't very
different from the GPU model other than having more flexibility in
staging data.

The other issue is this: what exactly are we talking about in this
thread ... are we trying to have the fastest FPUs possible or do we
want a low cost machine with (very|extremely) high throughput?

No doubt I've overlooked something (or many things 8) pertaining to
economics or politics or programming - I don't think there is any
question that there are plenty of problems (or subproblems) suitable
for solving on vector machines.  So please feel free to enlighten me.

George

Robert Myers

unread,
Jul 23, 2010, 2:58:28 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: jacko <jackokr...@gmail.com>
Date: Jul 21, 7:31 pm
Subject: High-bandwidth computing interest group
To: comp.arch

> No doubt I've overlooked something (or many things 8) pertaining to
> economics or politics or programming - I don't think there is any
> question that there are plenty of problems (or subproblems) suitable
> for solving on vector machines.  So please feel free to enlighten me.

I think it's that FPU speed is not the bottleneck at present. It's
keeping it fed with data, and shifting it around memory in suitably
ordered patterns. Maybe not fetching data as a linear cacheline unit,
but maybe a generic step n (not just powers of 2) as a generic scatter/
gather, with n changeable on the virtual cache line before a save, say.

Maybe it's about what an address is and can it specify process to
smart memories on read and write.

It's definitely about reducing latency when this is possible, or how
this may be possible.

And it's about cache structures which may help in any or all of the
above, by preventing an onset of thrashing.

SIMD is part of this, as the program size drops. But even vector units
have to be kept fed with data.

Robert Myers

unread,
Jul 23, 2010, 2:59:11 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: n...@cam.ac.uk
Date: Jul 22, 4:46 am
Subject: High-bandwidth computing interest group
To: comp.arch


In article <b1oe46pjp0lqi30fr03i75tnii94j40...@4ax.com>,
George Neuner  <gneun...@comcast.net> wrote:

>On Tue, 20 Jul 2010 15:41:13 +0100 (BST), n...@cam.ac.uk wrote:

>>In article <04cb46947eo6mur14842fqj45pvrqp6...@4ax.com>,
>>George Neuner  <gneun...@comcast.net> wrote:

>>>ISTM bandwidth was the whole point behind pipelined vector processors
>>>in the older supercomputers.  ...
>>> ... the staging data movement provided a lot of opportunity to
>>>overlap with real computation.

>>>YMMV, but I think pipeline vector units need to make a comeback.

>>NO chance!  It's completely infeasible - they were dropped because
>>the vendors couldn't make them for affordable amounts of money any
>>longer.

>Actually I'm a bit skeptical of the cost argument ... obviously it's
>not feasible to make large banks of vector registers fast enough for
>multiple GHz FPUs to fight over, but what about a vector FPU with a
>few dedicated registers?

'Tain't the computation that's the problem - it's the memory access,
as "jacko" said.

Many traditional vector units had enough bandwidth to keep an AXPY
running at full tilt - nowadays, one would need 1 TB/sec for a low
end vector computer, and 1 PB/sec for a high-end one.  Feasible,
but not cheap.
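
For reference, the AXPY kernel in question is just the loop below; at
2 flops and 24 bytes of traffic per element (two loads, one store, no
reuse), 1 TB/s of memory bandwidth sustains only on the order of
80 GFlop/s of it:

#include <stddef.h>

/* y = a*x + y: per element, 2 flops and 3 * 8 = 24 bytes of traffic,
 * i.e. 12 bytes/flop if nothing is reused from cache. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}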

Also, the usefulness of such things was very dependent on whether
they would allow 'fancy' vector operations, such as strided and
indexed vectors, gather/scatter and so on.  The number of programs
that need only simple vector operations is quite small.
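
For concreteness, the access shapes in question look like the scalar
loops below; the strided and indexed (gather) cases are the ones a
cache-line-oriented memory system serves worst, and they are what most
real programs need:

#include <stddef.h>

void copy_unit_stride(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];                 /* simple vector operation */
}

void copy_strided(double *dst, const double *src, size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];        /* strided: one useful word per line */
}

void copy_gather(double *dst, const double *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];            /* indexed / gather */
}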

I believe that, by the end, 90% of the cost of such machines was in
the memory management and only 10% in the computation.  At very
rough hand-waving levels.

Regards,
Nick Maclaren.

Robert Myers

unread,
Jul 23, 2010, 2:59:35 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: George Neuner <gneun...@comcast.net>
Date: Jul 22, 12:39 pm
Subject: High-bandwidth computing interest group
To: comp.arch


On Thu, 22 Jul 2010 09:46:44 +0100 (BST), n...@cam.ac.uk wrote:

>Also, the usefulness of [vector processors] was very dependent on
>whether they would allow 'fancy' vector operations, such as strided
>and indexed vectors, gather/scatter and so on.  The number of programs
>that need only simple vector operations is quite small.

>I believe that, by the end, 90% of the cost of such machines was in
>the memory management and only 10% in the computation.  At very
>rough hand-waving levels.

I get that ... but the near impossibility (with current technology) of
feeding several FPUs - vector or otherwise - from a shared memory
feeds right back into my argument that they need separate memories.  

Reading further down the thread, Robert seems to be mainly concerned
with keeping his FPUs fed in a shared memory environment.  I don't
really care whether the "Right Thing" architecture is a loose gang of
AL/FP units with programmable interconnects fed by scatter/gather DMA
channels[*].  Data placement and staging are, IMO, mainly software
issues (though, naturally, I appreciate any help the hardware can
give).  

George

[*] Many DSPs can pipe one or more of their DMA channels through the
ALU to do swizzling, packing/unpacking and other rearrangements.  Some
DSPs permit general ALU operations on the DMA stream for use in real
time data capture.

Robert Myers

unread,
Jul 23, 2010, 2:59:51 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: n...@cam.ac.uk
Date: Jul 22, 12:49 pm
Subject: High-bandwidth computing interest group
To: comp.arch


In article <bjqg46hoimkgf5gj1jgedhv2q7h5k2j...@4ax.com>,
George Neuner  <gneun...@comcast.net> wrote:

>>Also, the usefulness of [vector processors] was very dependent on
>>whether they would allow 'fancy' vector operations, such as strided
>>and indexed vectors, gather/scatter and so on.  The number of programs
>>that need only simple vector operations is quite small.

>>I believe that, by the end, 90% of the cost of such machines was in
>>the memory management and only 10% in the computation.  At very
>>rough hand-waving levels.

>I get that ... but the near impossibility (with current technology) of
>feeding several FPUs - vector or otherwise - from a shared memory
>feeds right back into my argument that they need separate memories.  

And, as I said, the killer with that is the very small number of
programs that can make use of such a system.  The requirement for
'fancy' vector operations was largely to provide facilities to
transfer elements between locations in vectors.

Regards,
Nick Maclaren.

Robert Myers

unread,
Jul 23, 2010, 3:00:20 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: jacko <jackokr...@gmail.com>
Date: Jul 22, 1:58 pm
Subject: High-bandwidth computing interest group
To: comp.arch


On 22 July, 17:49, n...@cam.ac.uk wrote:

> In article <bjqg46hoimkgf5gj1jgedhv2q7h5k2j...@4ax.com>,
> George Neuner  <gneun...@comcast.net> wrote:

> >>Also, the usefulness of [vector processors] was very dependent on
> >>whether they would allow 'fancy' vector operations, such as strided
> >>and indexed vectors, gather/scatter and so on.  The number of programs
> >>that need only simple vector operations is quite small.

> >>I believe that, by the end, 90% of the cost of such machines was in
> >>the memory management and only 10% in the computation.  At very
> >>rough hand-waving levels.

> >I get that ... but the near impossibility (with current technology) of
> >feeding several FPUs - vector or otherwise - from a shared memory
> >feeds right back into my argument that they need separate memories.  

> And, as I said, the killer with that is the very small number of
> programs that can make use of such a system.  The requirement for
> 'fancy' vector operations was largely to provide facilities to
> transfer elements between locations in vectors.

> Regards,
> Nick Maclaren.

Let's call this the data shovelling to data processing ratio. SPR.

Robert Myers

unread,
Jul 23, 2010, 3:00:38 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: George Neuner <gneun...@comcast.net>
Date: Jul 22, 11:37 am
Subject: High-bandwidth computing interest group
To: comp.arch


On Wed, 21 Jul 2010 18:18:19 -0400, George Neuner <gneun...@comcast.net> wrote:

> Some stuff

I think the exchange among Robert, Mitch and Andy that just appeared
answered most of my question.

George

Robert Myers

unread,
Jul 23, 2010, 3:01:03 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: Robert Myers <rbmyers...@gmail.com>
Date: Jul 22, 2:41 pm
Subject: High-bandwidth computing interest group
To: comp.arch


On Jul 22, 11:37 am, George Neuner <gneun...@comcast.net> wrote:

> On Wed, 21 Jul 2010 18:18:19 -0400, George Neuner

> <gneun...@comcast.net> wrote:

> > Some stuff

> I think the exchange among Robert, Mitch and Andy that just appeared
> answered most of my question.

I feel like I'm walking a tightrope in some of these discussions.

At a time when vector processors were still a fading memory (even in
the US), an occasional article would mention that "vector computers"
were easier to use for many scientists than thousands of COTS
processors hooked together by whatever.

The real problem is not in how the computation is organized, but in
how memory is accessed.  Replicating the memory access style of the
early Cray architectures isn't possible beyond a very limited memory
size, but it sure would be nice to figure out a way to simulate the
experience.

Robert.

Robert Myers

unread,
Jul 23, 2010, 3:02:39 PM7/23/10
to high-bandwid...@googlegroups.com
---------- Forwarded message ----------
From: Thomas Womack <twom...@chiark.greenend.org.uk>
Date: Jul 23, 1:19 pm
Subject: High-bandwidth computing interest group
To: comp.arch


In article <34ea667e-779a-44d8-ab63-c032df1cb...@q35g2000yqn.googlegroups.com>,
Robert Myers  <rbmyers...@gmail.com> wrote:

>At a time when vector processors were still a fading memory (even in
>the US), an occasional article would mention that "vector computers"
>were easier to use for many scientists than thousands of COTS
>processors hooked together by whatever.

Yes, this is certainly true.  Earth Simulator demonstrated that you
could build a pretty impressive vector processor, which (Journal of
the Earth Simulator - one of the really good resources since it talks
about both the science and the implementation issues) managed 90%
performance on lots of tasks, partly because using it was very
prestigious and you weren't allowed to use the whole machine on jobs
which didn't manage very high performance on a 10% subset.  But it was
a $400 million project to build a 35Tflops machine, and the subsequent
project to spend a similar amount this decade on a heftier machine
came to nothing.

I've worked at an establishment with an X1, and it was woefully
under-used because the problems that came up didn't fit the vector
organisation terribly well; it is not at all clear why they bought the
X1 in the first place.

>The real problem is not in how the computation is organized, but in
>how memory is accessed.  Replicating the memory access style of the
>early Cray architectures isn't possible beyond a very limited memory
>size, but it sure would be nice to figure out a way to simulate the
>experience.

I _think_ this starts, particularly for the crystalline memory access
case, to be almost a language-design issue.

Tom

Andy Glew <newsgroup at comp-arch.net>

unread,
Jul 29, 2010, 11:49:24 PM7/29/10
to high-bandwid...@googlegroups.com
This response, to my own post, talks about how processor support is
necessary, even though the memory subsystem is where the action is.

It also serves to get this post about a scatter/gather memory subsystem
sent to the high-bandwid...@googlegroups.com mailing list.

On 7/20/2010 10:03 PM, Andy Glew wrote:
> On 7/20/2010 11:49 AM, Robert Myers wrote:
>> On Jul 20, 1:49 pm, "David L. Craig"<dlc....@gmail.com> wrote:
>>
>>> If we're talking about custom, never-mind-the-cost
>>> designs, then that's the stuff that should make this
>>> a really fun group.
>>
>> If no one ever goes blue sky and asks: what is even physically
>> possible without worrying what may or may not be already in the works
>> at Intel, then we are forever limited, even in the imagination, to
>> what a marketdroid at Intel believes can be sold at Intel's customary
>> margins.
>
> Coupling this to stuff we said earlier about
>
> a) sequential access patterns, brute force - neither of us consider that
> interesting
>
> b) random access patterns
>
> c) what you, Robert, said you were most interested in, and rather nicely
> called "crystalline" access patterns. By the way, I rather like that
> term: it is much more accurate than saying "stride-N", and encapsulates
> several sorts of regularity.
>
> Now, I think it can be said that a machine that does random access
> patterns efficiently also does "crystalline" access patterns. Yes?
>
> I can imagine optimizations specific to the crystalline access patterns,
> that do not help true random access. But I'd like to kill two birds with
> one stone.
>
> So, how can we make these access patterns more effective?
>
> Perhaps we should lose the cache line orientation - transferring data
> bytes that aren't needed.
>
> I envision an interconnect fabric that is completely scatter/gather
> oriented. We don't do away with burst or block operations: we always
> transfer, say, 64 bytes at a time. But into that 64 bytes we might pack,
> say, 4 pairs of 64 bit address and 64 bit data, for stores. Or perhaps
> bursts of 128 bytes, mixing tuples of 64 bit address and 128 bit data.
> Or maybe... compression, whatever. Stores are the complicated one; reads
> are relatively simple, vectors of, say, 8 64 bit addresses.
>
> By the way, this is where strided or crystalline access patterns might
> have some advantages: they may compress better.
>
> Your basic processing element produces such scatter gather load or store
> requests. Particularly if it has scatter/gather vector instructions like
> Larrabee (per wikipedia), or if it is a CIMT coherent threaded
> architecture like the GPUs. The scatter/gather operations emitted by a
> processor need not be directed at a single target - they may be split
> and merged as they flow through the fabric.
>
> In order to eliminate unnecessary full-cache line flow, we do not
> require read-for-ownership. But we don't go the stupid way of
> write-through. I lean towards having a valid bit per byte, in these
> scatter-gather requests, and possibly in the caches. As I have discussed
> in this newsgroup before, this allows us to have writeback caches where
> multiple processors can write to the same memory location
> simultaneously. The byte valids allows us to live with weak memory
> ordering, but do away with the bad problem of losing data when people
> write to different bytes of the same line simultaneously. In fact,
> depending on the interconnection fabric topology, you might even have
> processor ordering. But basically it eliminates the biggest source of
> overhead in cache coherency.
>
> Of course, you want to handle non-cache friendly memory access patterns.
> I don't think you can safely get rid of caches; but I think that there
> should be a full suite of cache control operations, such as is partially
> listed at
> http://semipublic.comp-arch.net/wiki/Andy_Glew's_List_of_101_Cache_Control_Operations
>
>
> Such a scatter/gather memory subsystem might exist in the fabric. It
> works best with processor support to generate and handle the
> scatter/gather requests and replies. (Yes, the main thing is in the
> interconnect; but some processor support is needed, to get crap out of
> the way of the fabric).
>
> The scatter/gather interconnect fabric might be interfaced to
> conventional DRAMs, with their block transfers of 64 or 128 bytes. If
> so, I would be tempted to create a memory side cache - a cache that is
> in the memory controller, not the processor - seeking to leverage some
> of the wasted parts of cache lines. With cache control, of course.
>
> However, if there is any chance of getting DRAM architectures to be more
> scatter/gather friendly, great. But the people who can really talk about
> that are Tom Pawlowski at Micron, and his counterpart at Samsung. I've
> not been at a company that could influence DRAM much, since Motorola in
> the late 1980s. And I dare say that Mitch didn't make much headway
> there. I've mentioned Tom Pawlowski's vision, as presented at SC09 and
> elsewhere, of an abstract DRAM interface for stacked DRAM+logic units. I
> think the scatter/gather approach I describe above should be a
> candidate for such an abstract interface.
>
> If there is anyone that thinks that there is a great new memory
> technology coming down the pike that will make the bandwidth wars
> easier, I'd love to hear about it. For that matter, the impending
> integration of non-volatile memory is great - but as I understand
> things, it will probably make the memory hierarchy even more sequential
> bandwidth oriented, unfriendly to other access patterns.
>
> --
>
> On this fabric, also pass messages - probably with instruction set
> support to directly produce messages, and mechanisms such as TLBs to
> route them without OS intervention.
>
> --
>
> I.e. my overall approach is - eliminate unnecessary full cache line
> transfers, emphasize scatter gather. Make the most efficient use of what
> we have.
>
> --
>
> Now, I remain an unrepentant mass market computer architect. Some people
> want to design the fastest supercomputer in the world; I want to design
> the computer my mother uses. But, I'm not so far removed from the
> buildings full of stuff supercomputers that Robert Myers describes.
> First, I have worked on such. But, second, I'm interested in much of
> this not just because it is relevant to cost no barrier supercomputers,
> but also because it is relevant to mass markets.
> Most specifically, datacenters. Although datacenters tend not to use
> large scale shared memory, and tend to be unwilling to compromise the
> memory ordering and cache coherency guidelines in their small scale
> shared memory nodes, I suspect that PGAS has applications, e.g. to
> Hadoop like map/reduce. Moreover, much of this scatter/gather is also
> what network routers want - that OTHER form of computing system that can
> occupy large buildings, but which also comes in smaller flavors.
> Finally, the above applies even to moderate sized, say 16 or 32,
> multiprocessor systems in manycore chips.
>
> I.e. I am interested in such scatter/gather memory and interconnect,
> that make the most efficient use of bandwidth, because they apply to the
> entire spectrum,

It looks like I ended in mid-sentence - and I don't know how to finish
that thought.

But I need to add the following:


Some people say "The only important thing (for high bandwidth, or for
supercomputing in general) is the memory subsystem and interconnect."
They think they can take standard microprocessors, paste them together,
and get as good a result as possible.

This is not true.

If you follow my reasoning about the need for a scatter/gather memory
subsystem, you may think that it is possible to build it entirely
outside the processor.

This is true. But it would suck.

It would suck because the processor doesn't provide the support needed
to make scatter/gather work.

It would be nicest to have scatter/gather vector instructions. No Intel
x86 processor has those yet. Although Larrabee proposed scatter/gather,
and I suppose that the HPC Larrabee leftover will have it, this was
scatter/gather from within the cache. It still used cache line
transfers on the interconnect. I.e. in many ways it was the opposite
form of scatter/gather from what you want.

What you really need to make scatter/gather work is non-cache-line
transfers. Either individual, partial, non-cache-line loads and stores.
Or, better yet, such operations "convoyed" as I have described.
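
A rough sketch of what such a "convoyed", non-cache-line transaction
might look like on the wire; the field sizes are the ones suggested in
the quoted post (64-byte bursts, 64-bit addresses and data), but the
layout itself is purely illustrative and matches no real interconnect:

#include <stdint.h>

struct sg_store_burst {          /* 64 bytes on the wire                 */
    struct {
        uint64_t addr;           /* independent target address           */
        uint64_t data;           /* 64-bit payload                       */
    } elem[4];
};

struct sg_gather_request {       /* 64 bytes on the wire                 */
    uint64_t addr[8];            /* 8 independent load addresses         */
};

/* With per-byte valids, a store element could also carry a mask saying
 * which of its 8 bytes are live, avoiding read-for-ownership. */
struct sg_masked_store {
    uint64_t addr;
    uint64_t data;
    uint8_t  byte_valid;         /* bit i set => byte i of data is live  */
};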

Problem: the x86 has two classes of loads

* fast, speculatable, ordinary loads that cause cache line transfers

* slow, at retirement, partial loads

Missing is what you need if an external memory controller is to build
fast scatter/gather:

- fast, speculatable, partial loads

If you want to build a high bandwidth memory subsystem, you have to
change the processor enough to add this. E.g. a new memory type
(UC-MEM), or a new prefix, or a new instruction.

Lacking this, UC loads are at least 20 times slower than WB loads.
There is no way that an external controller can make up for that deficit.

---

This interacts with memory ordering. Basically, you build this, and you
give up on sequential consistency, strong memory ordering, TSO or
processor consistency. You get weak ordering. In order to make code
work, you will need fence instructions. The existing fence instructions
might do, but their implementation is not externally visible. At the
very least you will need to fix them to work with your high bandwidth
supercomputer memory subsystem and interface.
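
To make the fence requirement concrete, here is the standard
producer/consumer idiom under weak ordering, written with C11 atomics;
on a weakly ordered system the two fences (or stronger per-operation
orderings) are what keep the flag from becoming visible before the
payload:

#include <stdatomic.h>

static int payload;
static atomic_int ready;

void producer(void)
{
    payload = 42;                                  /* plain store          */
    atomic_thread_fence(memory_order_release);     /* payload before flag  */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                          /* spin on the flag     */
    atomic_thread_fence(memory_order_acquire);     /* flag before payload  */
    return payload;                                /* sees 42              */
}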

I like the idea of going all the way to bitmask coherency. So that you
can live with caches in the system, without paying the cost in bus traffic.

---

MORAL: yes, the action is in the interconnect and memory subsystem.

But in order to get there, you have to have some processor support.

del

unread,
Aug 29, 2010, 3:18:25 PM8/29/10
to high-bandwidth computing
That is an interesting point as to ordering. How much of an issue is
ordering to the programmer? Is relaxed or sloppy ordering a big
pain?  Does it adversely affect performance?

On Jul 29, 10:49 pm, "Andy Glew <newsgroup at comp-arch.net"
> >http://semipublic.comp-arch.net/wiki/Andy_Glew's_List_of_101_Cache_Co...

Andy Glew (comp-arch.net)

unread,
Aug 29, 2010, 8:57:29 PM8/29/10
to high-bandwid...@googlegroups.com
That is an interesting point as to ordering.  How much of an issue is
ordering to the programmer?  Is relaxed or sloppy ordering a big
pain?  Does it adversely affect performance?


I'm a bit lost wrt context of Del's email - probably something buried in my gmail inbox - but

a) One of the things that I am trying to point out with bitmask coherency is that we can have a memory model that never loses writes, is cache coherent but not instantaneously cache coherent - call it eventually consistent, and which is weakly ordered.

In some sense what I am proposing is to remove the single writer constraint per cache line, and even per byte.  In some sense "location consistency" amounts to TSO on a per location basis - and I am saying "let's have processor consistency on a per byte basis".

I.e. I am suggesting that location consistency at cache line granularity has been holding us back.  So I am suggesting not exactly eliminating location consistency, but relaxing it, while making a strong guarantee of no write lossage.  And overall extending weak memory ordering from being between cache lines to even within cache lines and on a per byte granularity.

b) Hot Chips: IBM PERC's "GUPS" transaction (remote atomic update) allows a single transaction to hold multiple remote atomic updates to multiple cache lines, multiple targets, etc.
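
A toy model of the per-byte-valid ("bitmask") merge described in a)
above, assuming a 64-byte line: as long as the two writers' dirty masks
don't overlap, neither write is lost even though both caches wrote the
line without read-for-ownership (overlapping bytes remain the weakly
ordered part that fences must handle):

#include <stdint.h>

#define LINE 64

struct dirty_line {
    uint8_t  bytes[LINE];
    uint64_t dirty;              /* bit i set => bytes[i] written locally */
};

/* Merge two writeback copies into the memory/home copy.  Where the masks
 * do overlap, 'b' happens to win here - i.e. same-byte races stay
 * weakly ordered. */
void merge_writeback(uint8_t *mem, const struct dirty_line *a,
                     const struct dirty_line *b)
{
    for (int i = 0; i < LINE; i++) {
        if (a->dirty & (1ULL << i))
            mem[i] = a->bytes[i];
        if (b->dirty & (1ULL << i))
            mem[i] = b->bytes[i];
    }
}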

del

unread,
Aug 29, 2010, 11:05:53 PM8/29/10
to high-bandwidth computing
Here is the section I was referring to in your previous post, I think
you said it....

" If you want to build a high bandwidth memory subsystem, you have to
> change the processor enough to add this. E.g. a new memory type
> (UC-MEM), or a new prefix, or a new instruction.

> Lacking this, UC loads are at least 20 times slower than WB loads.
> There is no way that an external controller can make up for that deficit.

> ---

> This interacts with memory ordering. Basically, you build this, and you
> give up on sequential consistency, strong memory ordering, TSO or
> processor consistency. You get weak ordering. In order to make code
> work, you will need fence instructions. The existing fence instructions
> might do, but their implementation is not externally visible. At the
> very least you will need to fix them to work with your high bandwidth
> supercomputer memory subsystem and interface. "

On Aug 29, 7:57 pm, "Andy Glew (comp-arch.net)" <a...@comp-arch.net>
wrote:

Andy Glew (comp-arch.net)

unread,
Aug 30, 2010, 10:56:16 AM8/30/10
to high-bandwid...@googlegroups.com
Yep, I said it.

I stand by it: in order to get high bandwidth, we need to make changes to the processor cores of Intel and AMD x86.  I suspect also IBM Power, although it is possible that IBM has emphasized supercomputing oriented HBC more and may need  fewer changes.

The changes can range from small - a new memory type, and/or tweaking
the existing memory types - to larger, but not really all that large.

Personally, I think that a scatter/gather memory subsystem - with transactions such as IBM PERC's remote atomic update, or GUPS, somehow generatable from normal processor code (i.e. not a GUPS I/O device accessible only from the OS kernel) - is a good way to go.  Possibly in combination with what I call bitmask coherency.
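
For anyone who hasn't met it, GUPS is essentially the random-update
kernel below (serial form, arbitrary sizes): each update is an 8-byte
read-modify-write at an effectively random address, which is exactly the
traffic a remote-atomic-update or scatter/gather transaction is meant to
carry without a full cache-line round trip per update:

#include <stdint.h>
#include <stdlib.h>

#define TABLE_BITS 26ULL                       /* 64 Mi entries = 512 MB   */
#define TABLE_SIZE (1ULL << TABLE_BITS)

static uint64_t xorshift(uint64_t x)           /* cheap random stream      */
{
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return x;
}

void gups(uint64_t *table, uint64_t n_updates)
{
    uint64_t r = 0x123456789abcdef0ULL;
    for (uint64_t i = 0; i < n_updates; i++) {
        r = xorshift(r);
        table[r & (TABLE_SIZE - 1)] ^= r;      /* 8-byte read-modify-write */
    }
}

int main(void)
{
    uint64_t *table = calloc(TABLE_SIZE, sizeof *table);
    if (!table)
        return 1;
    gups(table, 1ULL << 28);                   /* 256 Mi scattered updates */
    free(table);
    return 0;
}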

Robert Myers

unread,
Aug 30, 2010, 4:53:34 PM8/30/10
to high-bandwidth computing
On Aug 30, 10:56 am, "Andy Glew (comp-arch.net)" <a...@comp-arch.net>
wrote:
I have a lot of catching-up to do.

"ibm percs memory ordering" was a good start as a Google search.

Robert.

del

unread,
Aug 31, 2010, 9:31:41 AM8/31/10
to high-bandwidth computing
Thanks, I wasn't accusing anybody of anything.  You just sounded a
little confused about what I was talking about.  I took your comments
to at least imply that it would be helpful from a bandwidth standpoint
to relax the ordering requirements as part of the new memory system.
I am curious as to how this will affect software, given my background
in the kingdom of legacy and backwards compatibility.

On Aug 30, 9:56 am, "Andy Glew (comp-arch.net)" <a...@comp-arch.net>
wrote: