
Larrabee delayed: anyone know what's happening?


Mayan Moudgill

Dec 5, 2009, 8:46:11 PM

All I've come across is the announcement that Larrabee has been delayed,
with the initial consumer version cancelled. Anyone know something more
substantive?

nm...@cam.ac.uk

Dec 6, 2009, 3:57:55 AM
In article <xv2dnXl4eYh1kYbW...@bestweb.net>,

Eh? As far as I know, Intel have NEVER announced any plans for a
consumer version of Larrabee - it always was an experimental chip.
There was a chance that they would commoditise it, for experimental
purposes, but that didn't seem to pan out. Their current plans are
indicated here:

http://techresearch.intel.com/articles/Tera-Scale/1421.htm

They hope to have systems shortly, and to allow selected people
online access from mid-2010, so I would guess that the first ones
that could be bought would be in early 2011. If all goes well.

I have absolutely NO idea of where they are thinking of placing it,
or what scale of price they are considering.


Regards,
Nick Maclaren.

Michael S

Dec 6, 2009, 11:16:17 AM
On Dec 6, 10:57 am, n...@cam.ac.uk wrote:
> In article <xv2dnXl4eYh1kYbWnZ2dnUVZ_q2dn...@bestweb.net>,

Nick, SCC and Larrabee are different species. Both have plenty of
relatively simple x86 cores on a single chip, but that's about the only
thing they have in common.

1. Larrabee cores are cache-coherent, SCC cores are not.
2. Larrabee's interconnect has a ring topology, SCC's is a mesh.
3. Larrabee cores are about vector performance (512-bit SIMD) and SMT
(4 hardware threads per core). SCC cores are supposed to be stronger
than Larrabee on scalar code and much much weaker on vector code.
4. Larrabee was originally intended for consumers, both as a high-end 3D
graphics engine and as a sort-of-GPGPU. Graphics as a target for the 1st
generation chip is canceled, but it is still possible that it will be
shipped to paying customers as a GPGPU. SCC, on the other hand, is
purely experimental.

Michael S

Dec 6, 2009, 12:05:58 PM
On Dec 6, 6:16 pm, Michael S <already5cho...@yahoo.com> wrote:
> 4. Larrabee was originally intended for consumers, both as high-end 3D
> graphics engine and as sort-of-GPGPU. Graphics as target for 1st
> generation chip is canceled, but it still possible that it would be
> shipped to paying customers as GPGPU.

Sorry, I missed the latest round of news. In fact, GPGPU is canceled
together with GPU. So now the 45nm LRB is officially "a prototype".
http://www.anandtech.com/weblog/showpost.aspx?i=659

nm...@cam.ac.uk

Dec 6, 2009, 12:39:39 PM
In article <db0caa7f-6e7f-4fe2...@v37g2000vbb.googlegroups.com>,

Michael S <already...@yahoo.com> wrote:
>
>Nick, SCC and Larrabee are different species. Both have plenty of
>relatively simple x86 cores on a single chip, but that's about the only
>thing they have in common.
>
>1. Larrabee cores are cache-coherent, SCC cores are not.
>2. Larrabee's interconnect has a ring topology, SCC's is a mesh.
>3. Larrabee cores are about vector performance (512-bit SIMD) and SMT
>(4 hardware threads per core). SCC cores are supposed to be stronger
>than Larrabee on scalar code and much much weaker on vector code.

Thanks for the correction. I have been fully occupied with other
matters, and so seem to have missed some developments. Do you have a
pointer to any technical information?

>4. Larrabee was originally intended for consumers, both as high-end 3D
>graphics engine and as sort-of-GPGPU. Graphics as target for 1st
>generation chip is canceled, but it still possible that it would be
>shipped to paying customers as GPGPU. SCC, on the other hand, is
>purely experimental.

Now, there I beg to disagree. I have never seen anything reliable
indicating that Larrabee has ever been intended for consumers,
EXCEPT as a 'black-box' GPU programmed by 'Intel partners'. And
some of that information came from semi-authoritative sources in
Intel. Do you have a reference to a conflicting statement from
someone in Intel?


Regards,
Nick Maclaren.

Andy "Krazy" Glew

Dec 7, 2009, 9:28:19 AM

I can guess.

Part of my guess is that this is related to Pat Gelsinger's departure.
Gelsinger was (a) ambitious, intent on becoming Intel CEO (said so in
his book), (b) publicly very much behind Larrabee.

I'm guessing that Gelsinger was trying to ride Larrabee as his ticket to
the next level of executive power. And when Larrabee did not pan out
as well as he might have liked, he left. And/or conversely: when
Gelsinger left, Larrabee lost its biggest executive proponent. Although
my guess is that it was technology wagging the executive career tail: no
amount of executive positioning can make a technology shippable when it
isn't ready.

However, I would not count Larrabee out yet. Hiccups happen.

Although I remain an advocate of GPU style coherent threading
microarchitectures - I think they are likely to be more power efficient
than simple MIMD, whether SMT/HT or MCMT - the pull of X86 will be
powerful. Eventually we will have X86 MIMD/SMT/HT in-order vs X86 MCMT.
Hetero almost guaranteed. The only question is whether it will be hetero
OOO/in-order, or hetero X86 MCMT/GPU. Could be hetero X86 OOO & X86 w/
GPU-style Coherent Threading. The latter could even be CT/OOO. But these
"could be"s have no sightings.

Andy "Krazy" Glew

Dec 7, 2009, 9:51:49 AM, to nm...@cam.ac.uk
nm...@cam.ac.uk wrote:

> Now, there I beg to disagree. I have never seen anything reliable
> indicating that Larrabee has ever been intended for consumers,
> EXCEPT as a 'black-box' GPU programmed by 'Intel partners'. And
> some of that information came from semi-authoritative sources in
> Intel. Do you have a reference to an conflicting statement from
> someone in Intel?

http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-and-the-future-of-computing/

Just a blog, not official, although of course anything blogged at Intel
is semi-blessed (believe me, I know the flip side).

Del Cecchi

Dec 7, 2009, 1:00:44 PM

"Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message
news:4B1D1685...@patten-glew.net...

Does this mean Larrabee won't be the engine for the PS4?

We were assured that it was, not long ago.

del


Robert Myers

Dec 7, 2009, 1:25:42 PM
On Dec 7, 9:51 am, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

> n...@cam.ac.uk wrote:
> > Now, there I beg to disagree.  I have never seen anything reliable
> > indicating that Larrabee has ever been intended for consumers,
> > EXCEPT as a 'black-box' GPU programmed by 'Intel partners'.  And
> > some of that information came from semi-authoritative sources in
> > Intel.  Do you have a reference to an conflicting statement from
> > someone in Intel?
>
> http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-an...

>
> Just a blog, not official, although of course anything blogged at Intel
> is semi-blest (believe me, I know the flip side.)

The blog post reminded me. I have assumed, for years, that Intel
planned on putting many (>>4) x86 cores on a single die. I'm sure I
can find Intel presentations from the nineties that seem to make that
clear if I dig hard enough.

From the very beginning, Larrabee seemed to be a technology of destiny
in search of a mission, and the first, most obvious mission for any
kind of massive parallelism is graphics. Thus, Intel explaining why
it would introduce Larrabee at Siggraph always seemed a case of
offering an explanation where none would have been needed - unless it
was an explanation they weren't sure they believed themselves (or that
anyone else would). It just seemed like the least implausible mission
for hardware that had been designed to a concept rather than to a
mission. A more plausible claim that they were aiming at HPC probably
wouldn't have seemed like a very attractive business proposition for a
company the size of Intel.

Also from the beginning, I wondered if Intel seriously expected to be
able to compete at the high end with dedicated graphics engines using
x86 cores. Either there was something about the technology I was
missing completely, it was just another Intel bluff, or the "x86"
cores that ultimately appeared on a graphics chip brought to market
would be to an x86 as we know it as, say, a ladybug is to a dalmatian.

Robert.

nm...@cam.ac.uk

Dec 7, 2009, 5:39:27 PM
In article <4B1D1685...@patten-glew.net>,

I don't see anything there that even hints at plans to make
Larrabee available for consumer use. It could just as well be a
probe to test consumer interest - something that even I do!


Regards,
Nick Maclaren.

nm...@cam.ac.uk

Dec 7, 2009, 6:05:02 PM
In article <b81b8239-b43c-46e9...@k13g2000prh.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>
>The blog post reminded me. I have assumed, for years, that Intel
>planned on putting many (>>4) x86 cores on a single-die. I'm sure I
>can find Intel presentations from the nineties that seem to make that
>clear if I dig hard enough.

Yes. But the word "planned" implies a degree of deliberate action
that I believe was absent. They assuredly blithered on about it,
and very probably had meetings about it ....

>From the very beginning, Larrabee seemed to be a technology of destiny
>in search of a mission, and the first, most obvious mission for any

>kind of massive parallelism is graphics. ...

Yes. But what they didn't seem to understand is that they should
have treated it as an experiment. I tried to persuade them that
they needed to make it widely available and cheap, so that the mad
hackers would start to play with it, and see what developed.
Perhaps nothing, but it wouldn't have been Intel's effort that was
wasted.

The same was true of Sun, but they had less margin for selling CPUs
at marginal cost.


Regards,
Nick Maclaren.

Andy "Krazy" Glew

Dec 7, 2009, 11:04:17 PM, to Del Cecchi
Del Cecchi wrote:
> "Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message news:4B1D1685...@patten-glew.net...
>> nm...@cam.ac.uk wrote:
>>
>>> I have never seen anything reliable
>>> indicating that Larrabee has ever been intended for consumers,
>>
>> http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-and-the-future-of-computing/
>>
>> Just a blog, not official, although of course anything blogged at
>> Intel is semi-blest (believe me, I know the flip side.)
>
> Does this mean Larrabee won't be the engine for the PS4?
>
> We were assured that it was not long ago.

My guess is that Intel was pushing for Larrabee to be the PS4 chip.

And, possibly, Sony agreed. Not unreasonably, if Intel had made a
consumer-grade Larrabee. Since Larrabee's big pitch is programmability
- cache coherence, MIMD, vectors, familiar stuff. As opposed to the
Cell's idiosyncrasies and programmer hostility, which are probably in
large part to blame for Sony's lack of success with the PS3.

Given the present Larrabee situation, Sony is probably scrambling. Options:

a) go back to Cell.

b) more likely, eke out a year or so with Cell and a PS4 stretch, and
then look around again - possibly at the next Larrabee

c) AMD/ATI Fusion

d) Nvidia? Possibly with the CPU that Nvidia is widely rumored to be
working on.

AMD/ATI and Nvidia might seem the most reasonable, except that both
companies have had trouble delivering. AMD/ATI look best now, but
Nvidia has more "vision". Whatever good that will do them.

Larrabee's attractions remain valid. It is more programmer friendly.
But waiting until Larrabee is ready may be too painful.

Historically, game consoles have a longer lifetime than PCs. They were
programmed closer to the metal, and hence needed stability in order to
warrant software investment.

But DX10-DX11 and OpenGL are *almost* good enough for games. And they
allow migrating more frequently to the latest and greatest.

Blue-sky possibility: the PS3-PS4 transition breaking with the tradition
of console stability. The console might stay stable form-factor-, UI-
and device-wise - screen pixels, joysticks, etc. - but may start
changing the underlying compute and graphics engine more quickly than in
the past.

Related: net games.

Torben Ægidius Mogensen

Dec 8, 2009, 3:45:09 AM
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:

> Although I remain an advocate of GPU style coherent threading
> microarchitectures - I think they are likely to be more power
> efficient than simple MIMD, whether SMT/HT or MCMT - the pull of X86
> will be powerful.

The main (only?) advantage of the x86 ISA is for running legacy software
(yes, I do consider Windows to be legacy software). And I don't see
this applying for Larrabee -- you can't exploit the parallelism when you
run dusty decks.

When developing new software, you want to use high-level languages and
don't really care too much about the underlying instruction set -- the
programming model you have to use (i.e., shared memory versus message
passing, SIMD vs. MIMD, etc.) is much more important, and that is
largely independent of the ISA.

Torben

nm...@cam.ac.uk

Dec 8, 2009, 4:27:32 AM
In article <4B1DD041...@patten-glew.net>,
Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:

>Del Cecchi wrote:
>>
>> Does this mean Larrabee won't be the engine for the PS4?
>>
>> We were assured that it was not long ago.
>
>My guess is that Intel was pushing for Larrabee to be the PS4 chip.
>
>And, possibly, Sony agreed. Not unreasonably, if Intel had made a
>consumer-grade Larrabee. Since Larrabee's big pitch is programmability
>- cache coherence, MIMD, vectors, familiar stuff. As opposed to the
>Cell's idiosyncrasies and programmer hostility, which are probably in
>large part to blame for Sony's lack of success with the PS3.

Could be. That would be especially relevant if Sony were planning
to break out of the 'pure' games market and produce a 'home
entertainment centre'. Larrabee's pitch implied that it would have
been simple to add general Internet access, probably including VoIP,
and quite possibly online ordering, Email etc. We know that some of
the marketing organisations are salivating at the prospect of being
able to integrate games playing, television and online ordering.

I am pretty sure that both Sun and Intel decided against the end-user
market because they correctly deduced that it would not return a
profit but, in my opinion incorrectly, did not think that it might
open up new opportunities. But why Intel seem to have decided
against the use described above is a mystery - perhaps because, like
Motorola with the 88000 as a desktop chip, every potential partner
backed off. And perhaps for some other reason - or perhaps the
rumour of its demise is exaggerated - I don't know.

I heard some interesting reports about the 48-thread CPU yesterday,
incidentally. It's unclear that it's any more focussed than Larrabee.


Regards,
Nick Maclaren.

Ken Hagan

Dec 8, 2009, 5:45:52 AM
On Tue, 08 Dec 2009 08:45:09 -0000, Torben Ægidius Mogensen
<tor...@diku.dk> wrote:

> The main (only?) advantage of the x86 ISA is for running legacy software
> (yes, I do consider Windows to be legacy software). And I don't see
> this applying for Larrabee -- you can't exploit the parallelism when you
> run dusty decks.

But you can exploit the parallelism where you really needed it and carry
on using the dusty decks for all the other stuff, without which you don't
have a rounded product.

nm...@cam.ac.uk

Dec 8, 2009, 6:44:29 AM
In article <op.u4l76...@khagan.ttx>,
Ken Hagan <K.H...@thermoteknix.com> wrote:
>On Tue, 08 Dec 2009 08:45:09 -0000, Torben Ægidius Mogensen

That was the theory. We don't know how well it would have panned out,
but it is clearly a sane objective.


Regards,
Nick Maclaren.

Andrew Reilly

Dec 8, 2009, 7:14:08 AM
On Tue, 08 Dec 2009 09:27:32 +0000, nmm1 wrote:

> Larrabee's pitch implied that it would have been simple to add general
> Internet access, probably including VoIP, and quite possibly online
> ordering, Email etc.

Why do you suggest that internet access, VoIP or online ordering are
impossible or even hard on existing Cell? It's a full-service Unix
engine, aside from all of the rendering business. Linux runs on it,
which means that all of the interesting browsers run on it just fine.

Sure, there's an advertising campaign (circa NetBurst) that says that
intel makes the internet work better, but we're not buying that, are we?

Cheers,

--
Andrew

nm...@cam.ac.uk

Dec 8, 2009, 7:33:36 AM
In article <7o6u8fF...@mid.individual.net>,

Andrew Reilly <areil...@bigpond.net.au> wrote:
>
>> Larrabee's pitch implied that it would have been simple to add general
>> Internet access, probably including VoIP, and quite possibly online
>> ordering, Email etc.
>
>Why do you suggest that internet access, VoIP or online ordering are
>impossible or even hard on existing Cell? It's a full-service Unix
>engine, aside from all of the rendering business.

Quite a lot of (indirect) feedback from people who have tried using
it, as well as the not-wholly-unrelated Blue Gene. The killer is that
it is conceptually different from 'mainstream' systems, and so each
major version of each product is likely to require extensive work,
possibly including reimplementation or the implementation of a new
piece of infrastructure. That's a long-term sink of effort.

As a trivial example of the sort of problem, a colleague of mine has
some systems with NFS-mounted directories, but where file locking is
disabled (for good reasons). Guess what broke at a system upgrade?

> Linux runs on it,
>which means that all of the interesting browsers run on it just fine.

It means nothing of the sort - even if you mean a fully-fledged system
environment by "Linux", and not just a kernel and surrounding features,
there are vast areas of problematic facilities that most browsers use
that are not needed for a reasonable version of Linux.

>Sure, there's an advertising campaign (circa NetBurst) that says that
>intel makes the internet work better, but we're not buying that, are we?

Of course not.


Regards,
Nick Maclaren.

ChrisQ

Dec 8, 2009, 8:07:18 AM
nm...@cam.ac.uk wrote:

>
>> Linux runs on it,
>> which means that all of the interesting browsers run on it just fine.
>
> It means nothing of the sort - even if you mean a fully-fledged system
> environment by "Linux", and not just a kernel and surrounding features,
> there are vast areas of problematic facilities that most browsers use
> that are not needed for a reasonable version of Linux.
>

For example?

Once you have an OS kernel and drivers on top of the hardware, the
hardware is essentially isolated, and anything that can compile should
run with few problems. OK, it may mean that the code runs on one of the
n available processors under the hood, but it should run...

Regards,

Chris

ChrisQ

Dec 8, 2009, 8:13:32 AM

The obvious question then is: Would one of many x86 cores be fast enough
on its own to run legacy Windows code like Office, Photoshop etc.?...

Regards,

Chris

Andy "Krazy" Glew

Dec 8, 2009, 10:12:51 AM, to Torben Ægidius Mogensen


I wish that this were so.

I naively thought it were so, e.g. for big supercomputers. After all,
they compile all of their code from scratch, right? What do they care
if the actual parallel compute engines are non-x86? Maybe have an x86 in
the box, to run legacy stuff.

Unfortunately, they do care. It may not be the primary concern - after
all, they often compile their code from scratch. But, if not primary,
it is one of the first of the secondary concerns.

Reason: Tools. Ubiquity. Libraries. Applies just as much to Linux as to
Windows. You are running along fine on your non-x86 box, and then
realize that you want to use some open source library that has been
developed and tested mainly on x86. You compile from source, and there
are issues. All undoubtedly solvable, but NOT solved right away. So as
a result, you either can't use the latest and greatest library, or you
have to fix it.

Like I said, this was supercomputer customers telling me this. Not all
- but maybe 2/3rds. Also, especially, the supercomputer customers'
sysadmins.

Perhaps supercomputers are more legacy x86 sensitive than game consoles...

I almost believed this when I wrote it. And then I thought about flash:

... Than game consoles that want to start running living room
mediacenter applications. That want to start running things like x86
binary plugins, and Flash. Looking at

http://www.adobe.com/products/flashplayer/systemreqs/

The following minimum hardware configurations are recommended for
an optimal playback experience: ... all x86, + PowerPC G5.

I'm sure that you can get a version that runs on your non-x86,
non-PowerPC platform. ... But it's a hassle.

===

Since I would *like* to work on chips in the future as I have in the
past, and since I will never work at Intel or AMD again, I *want* to
believe that non-x86s can be successful. I think they can be
successful. But we should not fool ourselves: there are significant
obstacles, even in the most surprising market segments where x86
compatibility should not be that much of an issue.

We, the non-x86 forces of the world, need to recognize those obstacles,
and overcome them. Not deny their existence.

Bernd Paysan

Dec 8, 2009, 12:33:10 PM
Andy "Krazy" Glew wrote:
> I almost believed this when I wrote it. And then I thought about flash:
>
> ... Than game consoles that want to start running living room
> mediacenter applications. That want to start running things like x86
> binary plugins, and Flash. Looking at
>
> http://www.adobe.com/products/flashplayer/systemreqs/
>
> The following minimum hardware configurations are recommended for
> an optimal playback experience: ... all x86, + PowerPC G5.
>
> I'm sure that you can get a version that runs on your non-x86,
> non-PowerPC platform. ... But it's a hassle.

It's mainly a deal between the platform maker and Adobe. Consider another
market, where x86 is non-existent: Smartphones. They are now real
computers, and Flash is an issue. Solution: Adobe ports the Flash plugin
over to ARM, as well. They already have Flash 9.4 ported (runs on the Nokia
N900), and Flash 10 will get an ARM port soon, as well, and spread around to
more smartphones. Or Skype: Also necessary, also proprietary, but also
available on ARM. As long as the device maker cares, it's their hassle, not
the user's (and even on a "free software only" Netbook Ubuntu,
installing the Flash plugin is too much of a hassle to be considered
fine for mere mortals).

This of course would be much less of a problem if Flash wasn't something
proprietary from Adobe, but an open standard (or at least based on an open
source platform), like HTML.

Note however, that even for a console maker, backward compatibility with
the previous platform is an issue. Sony put the complete PS2 logic
(packed into a newer, smaller chip) on the first PS3 generation to allow
people to play PS2 games on their PS3. If they completely change
architecture with the PS4, will they do that again? Or are they now fed
up with this problem, and will they decide to go to x86 and be done with
that recurring problem?

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Del Cecchi

Dec 9, 2009, 12:03:19 AM

"Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message
news:4B1DD041...@patten-glew.net...

> Del Cecchi wrote:
>> "Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message
>> news:4B1D1685...@patten-glew.net...
>>> nm...@cam.ac.uk wrote:
>>>
>>>> I have never seen anything reliable
>>>> indicating that Larrabee has ever been intended for consumers,
>>>
>>> http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-and-the-future-of-computing/
>>>
>>> Just a blog, not official, although of course anything blogged at
>>> Intel is semi-blest (believe me, I know the flip side.)
>>
>> Does this mean Larrabee won't be the engine for the PS4?
>>
>> We were assured that it was not long ago.
>
> My guess is that Intel was pushing for Larrabee to be the PS4 chip.
>
> And, possibly, Sony agreed. Not unreasonably, if Intel had made a
> consumer-grade Larrabee. Since Larrabee's big pitch is
> programmability - cache coherence, MIMD, vectors, familiar stuff.
> As opposed to the Cell's idiosyncrasies and programmer hostility,
> which are probably in large part to blame for Sony's lack of success
> with the PS3.

I believe Cell was Sony's idea in the first place. I could be wrong
about that, but it was sure the vibe at the time. And Sony's lateness
and high price were at least as much due to the included Blu-ray drive,
which did lead to them winning the DVD format war.

Torben Ægidius Mogensen

Dec 9, 2009, 3:47:40 AM
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:

> Torben Ægidius Mogensen wrote:

>> When developing new software, you want to use high-level languages and
>> don't really care too much about the underlying instruction set -- the
>> programming model you have to use (i.e., shared memory versus message
>> passing, SIMD vs. MIMD, etc.) is much more important, and that is
>> largely independent of the ISA.

> I naively thought it were so, e.g. for big supercomputers. After all,


> they compile all of their code from scratch, right? What do they care
> if the actual parallel compute engines are non-x86? Maybe have an x86
> in the box, to run legacy stuff.
>
> Unfortunately, they do care. It may not be the primary concern -
> after all, they often compile their code from scratch. But, if not
> primary, it is one of the first of the secondary concerns.
>
> Reason: Tools. Ubiquity. Libraries. Applies just as much to Linux as
> to Windows. You are running along fine on your non-x86 box, and then
> realize that you want to use some open source library that has been
> developed and tested mainly on x86. You compile from source, and
> there are issues. All undoubtedly solvable, but NOT solved right
> away. So as a result, you either can't use the latest and greatest
> library, or you have to fix it.
>
> Like I said, this was supercomputer customers telling me this. Not
> all - but maybe 2/3rds. Also, especially, the supercomputer
> customers' sysadmins.

Libraries are, of course, important to supercomputer users. But if they
are written in a high-level language and the new CPU uses the same
representation of floating-point numbers as the old (e.g., IEEE), they
should compile to the new platform. Sure, some low-level optimisations
may not apply, but if the new platform is a lot faster than the old,
that may not matter. And you can always address the optimisation issue
later.

Besides, until recently supercomputers were not mainly x86-based.

> Perhaps supercomputers are more legacy x86 sensitive than game consoles...
>
> I almost believed this when I wrote it. And then I thought about flash:
>
> ... Than game consoles that want to start running living room
> mediacenter applications. That want to start running things like x86
> binary plugins, and Flash. Looking at
>
> http://www.adobe.com/products/flashplayer/systemreqs/
>
> The following minimum hardware configurations are recommended for
> an optimal playback experience: ... all x86, + PowerPC G5.
>
> I'm sure that you can get a version that runs on your non-x86,
> non-PowerPC platform. ... But it's a hassle.

Flash is available on ARM too. And if another platform becomes popular,
Adobe will port Flash to this too. But that is not the issue: Flash
doesn't run on the graphics processor, it runs on the main CPU, though
it may use the graphics processor through a standard API that hides the
details of the GPU ISA.

Torben

nm...@cam.ac.uk

Dec 9, 2009, 4:42:14 AM
In article <7zzl5sr...@pc-003.diku.dk>,

Torben Ægidius Mogensen <tor...@diku.dk> wrote:
>"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:
>
>>
>> Reason: Tools. Ubiquity. Libraries. Applies just as much to Linux as
>> to Windows. You are running along fine on your non-x86 box, and then
>> realize that you want to use some open source library that has been
>> developed and tested mainly on x86. You compile from source, and
>> there are issues. All undoubtedly solvable, but NOT solved right
>> away. So as a result, you either can't use the latest and greatest
>> library, or you have to fix it.
>>
>> Like I said, this was supercomputer customers telling me this. Not
>> all - but maybe 2/3rds. Also, especially, the supercomputer
>> customers' sysadmins.
>
>Libraries are, of course, important to supercomputer users. But if they
>are written in a high-level language and the new CPU uses the same
>representation of floating-point numbers as the old (e.g., IEEE), they
>should compile to the new platform. Sure, some low-level optimisations
>may not apply, but if the new platform is a lot faster than the old,
>that may not matter. And you can always address the optimisation issue
>later.

Grrk. All of the above is partially true, but only partially. The
problem is almost entirely with poor-quality software (which is,
regrettably, most of it). Good quality software is portable to
quite wildly different systems fairly easily. It depends on whether
you are talking about performance-critical, numerical libraries
(i.e. what supercomputer users really want to do) or administrative
and miscellaneous software.

For the former, the representation isn't enough, as subtle differences
like hard/soft underflow and exception handling matter, too. And you
CAN'T disable optimisation for supercomputers, because you can't
accept the factor of 10+ degradation. It doesn't help, anyway,
because you will be comparing with an optimised version on the
other systems.
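
(To make the hard/soft underflow point concrete, here is a minimal C++
sketch, using the SSE flush-to-zero control; the numbers are
illustrative only. The same source line gives different answers
depending on how the FP unit handles subnormals:)

  #include <cstdio>
  #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE

  int main() {
      volatile float tiny = 1e-38f;   // near the SP subnormal boundary
      volatile float d = 100.0f;
      printf("gradual underflow: %g\n", (double)(tiny / d)); // ~1e-40
      _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  // 'hard' underflow
      printf("flush-to-zero:     %g\n", (double)(tiny / d)); // 0
  }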

With the latter, porting is usually trivial, provided that the
program has not been rendered non-portable by the use of autoconfigure,
and that it doesn't use the more ghastly parts of the infrastructure.
But most applications that do rely on those areas aren't relevant
to supercomputers, anyway, because they are concentrated around
the GUI area (and, yes, flash is a good example).

I spent a decade managing the second-largest supercomputer in UK
academia, incidentally, and some of the systems I managed were
'interesting'.

>Besides, until recently supercomputers were not mainly x86-based.
>
>> Perhaps supercomputers are more legacy x86 sensitive than game consoles...

Much less so.

Ken Hagan

Dec 9, 2009, 5:54:37 AM
On Wed, 09 Dec 2009 08:47:40 -0000, Torben Ægidius Mogensen
<tor...@diku.dk> wrote:

> Sure, some low-level optimisations
> may not apply, but if the new platform is a lot faster than the old,
> that may not matter. And you can always address the optimisation issue
> later.

I don't think Andy was talking about poor optimisation. Perhaps these
libraries have assumed the fairly strong memory ordering model of an x86,
and in its absence would be chock full of bugs.
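
(A minimal C++ sketch of that kind of latent bug; std::atomic postdates
this thread, but it makes the orderings explicit. On x86's strong model
the relaxed version usually appears to work; on a weakly ordered machine
the reader can observe 'ready' before 'data':)

  #include <atomic>
  #include <cassert>
  #include <thread>

  std::atomic<int>  data{0};
  std::atomic<bool> ready{false};

  void writer() {
      data.store(42, std::memory_order_relaxed);
      ready.store(true, std::memory_order_relaxed);  // should be release
  }

  void reader() {
      while (!ready.load(std::memory_order_relaxed)) {}  // should be acquire
      assert(data.load(std::memory_order_relaxed) == 42); // can fail off-x86
  }

  int main() {
      std::thread w(writer), r(reader);
      w.join(); r.join();
  }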

> Flash is available on ARM too. And if another platform becomes popular,
> Adobe will port Flash to this too.

When hell freezes over. It took Adobe *years* to get around to porting
Flash to x64.

They had 32-bit versions for Linux and Windows for quite a while, but no
64-bit version for either. To me, that suggests the problem was the
int-size rather than the platform, and it just took several years to clean
it up sufficiently. So I suppose it is *possible* that the next port might
not take so long. On the other hand, both of these targets have Intel's
memory model, so I'd be surprised if even this "clean" version was truly
portable.

Ken Hagan

Dec 9, 2009, 6:18:43 AM
On Tue, 08 Dec 2009 13:13:32 -0000, ChrisQ <me...@devnull.com> wrote:

> The obvious question then is: Would one of many x86 cores be fast enough
> on its own to run legacy Windows code like Office, Photoshop etc.?...

Almost certainly. From my own experience, Office 2007 is perfectly usable
on a 2GHz Pentium 4 and only slightly sluggish on a 1GHz Pentium 3. These
applications are already "lightly multi-threaded", so some of the
longer-running operations are spun off on background threads, so if you
had 2 or 3 cores that were even slower, that would probably still be OK
because the application *would* divide the workload. For screen drawing,
the OS plays a similar trick.

I would also imagine that Photoshop had enough embarrassing parallelism
that even legacy versions might run faster on a lot of slow cores, but I'm
definitely guessing here.

Noob

Dec 9, 2009, 7:28:53 AM
Bernd Paysan wrote:

> This of course would be much less of a problem if Flash wasn't something

> proprietary from Adobe [...]

A relevant article:
Free Flash community reacts to Adobe Open Screen Project
http://www.openmedianow.org/?q=node/21

Stefan Monnier

Dec 9, 2009, 9:56:08 AM
> They had 32-bit versions for Linux and Windows for quite a while, but no
> 64-bit version for either. To me, that suggests the problem was the

It's just a question of market share.
Contrary to Free Software, where any idiot can port the code to his
platform if he so wishes, proprietary software first requires collecting
a large number of idiots so as to justify
compiling/testing/marketing/distributing the port.


Stefan

Paul Wallich

Dec 9, 2009, 1:10:35 PM

From an outside perspective, this sounds a lot like the Itanic roadmap:
announce something brilliant and so far out there that your competitors
believe you must have solutions to all the showstoppers up your sleeve.
Major difference being that Larrabee's potential/probable competitors
didn't fold.

paul

Robert Myers

Dec 9, 2009, 3:25:16 PM
On Dec 9, 3:47 am, torb...@diku.dk (Torben Ægidius Mogensen) wrote:

>
> Libraries are, of course, important to supercomputer users.  But if they
> are written in a high-level language and the new CPU uses the same
> representation of floating-point numbers as the old (e.g., IEEE), they
> should compile to the new platform.  Sure, some low-level optimisations
> may not apply, but if the new platform is a lot faster than the old,
> that may not matter.  And you can always address the optimisation issue
> later.
>

But if some clever C programmer or committee of C programmers has made
a convoluted and idiosyncratic change to a definition in a header
file, you may have to unscramble all kinds of stuff hidden under
macros just to get it to compile and link, and that effort can't be
deferred until later.

Robert.

Robert Myers

Dec 9, 2009, 4:49:54 PM
On Dec 9, 1:10 pm, Paul Wallich <p...@panix.com> wrote:

>  From an outside perspective, this sounds a lot like the Itanic roadmap:
> announce something brilliant and so far out there that your competitors
> believe you must have solutions to all the showstoppers up your sleeve.
> Major difference being that Larrabee's potential/probable competitors
> didn't fold.

In American football, "A good quarterback can freeze the opposition’s
defensive secondary with a play-action move, a pump fake or even his
eyes."

http://www.dentonrc.com/sharedcontent/dws/drc/opinion/editorials/stories/DRC_Editorial_1123.2e4a496a2.html

where the analogy is used in a political context.

If I were *any* of the players in this game, I'd be studying the
tactics of quarterbacks who need time to find an open receiver, since
*no one* appears to have the right product ready for prime time. If I
were Intel, I'd be nervous, but if I were any of the other players,
I'd be nervous, too.

Nvidia stock has drooped a bit after the *big* bounce it took on the
Larrabee announcement, but I'm not sure why everyone is so negative on
Nvidia (especially Andy). They don't appear to be in much more
parlous a position than anyone else. If Fermi is a real product, even
if only at a ruinous price, there will be buyers.

N.B. I follow the financial markets for information only. I am not an
active investor.

Robert.

Andy "Krazy" Glew

Dec 9, 2009, 11:12:39 PM, to Robert Myers
Robert Myers wrote:
> Nvidia stock has drooped a bit after the *big* bounce it took on the
> Larrabee announcement, but I'm not sure why everyone is so negative on
> Nvidia (especially Andy). They don't appear to be in much more
> parlous a position than anyone else. If Fermi is a real product, even
> if only at a ruinous price, there will be buyers.

Let me be clear: I'm not negative on Nvidia. I think their GPUs are the
most elegant of the lot. If anything, I am overcompensating: within
Intel, I was probably the biggest advocate of Nvidia style
microarchitecture, arguing against a lot of guys who came to Intel from
ATI. Also on this newsgroup.

However, I don't think that anyone can deny that Nvidia had some
execution problems recently. For their sake, I hope that they have
overcome them.

Also, AMD/ATI definitely overtook Nvidia. I think that Nvidia
emphasized elegance, and GP GPU futures stuff, whereas ATI went the
slightly inelegant way of combining SIMT Coherent Threading with VLIW.
It sounds more elegant when you phrase it my way, "combining SIMT
Coherent Threading with VLIW", than when you have to describe it without
my terminology. Anyway, ATI definitely had a performance per transistor
advantage. I suspect they will continue to have such an advantage over
Fermi, because, after all, VLIW works to some limited extent.

I think Fermi is more programmable and more general purpose, while ATI's
VLIW approach has efficiencies in some areas.

I think that Nvidia absolutely has to have a CPU to have a chance of
competing. One measly ARM chip or Power PC on an Nvidia die. Or maybe
one CPU chip, one GPU chip, and a stack of memory in a package; or a GPU
plus a memory interface with a lousy CPU. Or, heck, a reasonably
efficient way of decoupling one of Nvidia's processors and running 1
thread, non-SIMT, of scalar code. SIMT is great, but there is important
non-SIMT scalar code.

Ultimately, the CPU vendors will squeeze GPU-only vendors out of the
market. AMD & ATI are already combined. If Intel's Larrabee is
stalled, it gives Nvidia some breathing room, but not much. Even if
Larrabee is completely cancelled, which I doubt, Intel would eventually
squeeze Nvidia out with its evolving integrated graphics. Which,
although widely dissed, really has a lot of potential.

Nvidia's best chance is if Intel thrashes, dithering between Larrabee
and Intel's integrated graphics and ... isn't Intel using PowerVR in
some Atom chips? I.e. Intel currently has at least 3 GPU solutions in
flight. *This* sounds like the sort of thrash Intel had -
x86/i960/i860 ... I personally think that Intel's best path to success
would be to go with a big core + the Intel integrated graphics GPU,
evolved, and then jump to Larrabee. But if they focus on Larrabee, or
an array of Atoms + a big core, their success will just be delayed.

Intel is its own biggest problem, with thrashing.

Meanwhile, AMD/ATI are in the best position. I don't necessarily like
Fusion CPU/GPU, but they have all the pieces. But it's not clear they
know how to use it.

And Nvidia needs to get out of the discrete graphics board market niche
as soon as possible. If they can do so, I bet on Nvidia.

Robert Myers

Dec 9, 2009, 11:33:18 PM
On Dec 9, 11:12 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

> And Nvidia needs to get out of the discrete graphics board market niche


> as soon as possible. If they can do so, I bet on Nvidia.

Cringely thinks, well, the link says it all:

http://www.cringely.com/2009/12/intel-will-buy-nvidia/

Robert.

Andy "Krazy" Glew

Dec 10, 2009, 12:25:32 AM, to Ken Hagan
Ken Hagan wrote:
> On Wed, 09 Dec 2009 08:47:40 -0000, Torben Ægidius Mogensen
> <tor...@diku.dk> wrote:
>
>> Sure, some low-level optimisations
>> may not apply, but if the new platform is a lot faster than the old,
>> that may not matter. And you can always address the optimisation issue
>> later.
>
> I don't think Andy was talking about poor optimisation. Perhaps these
> libraries have assumed the fairly strong memory ordering model of an
> x86, and in its absence would be chock full of bugs.

Ken is correct to say that memory ordering is harder to port around
than instruction set or word size.

A surprisingly large number of supercomputer customers use libraries and
tools that have some specific x86 knowledge.

For example, folks who use tools like Pin, the binary instrumentation
tool. Although Intel makes Pin available on some non-x86 machines,
where do you think Pin runs best?

Or the Boehm garbage collector for C++. Although it's fairly portable -

http://www.hpl.hp.com/personal/Hans_Boehm/gc/#where says
The collector is not completely portable, but the distribution includes
ports to most standard PC and UNIX/Linux platforms. The collector should
work on Linux, *BSD, recent Windows versions, MacOS X, HP/UX, Solaris,
Tru64, Irix and a few other operating systems. Some ports are more
polished than others.

again, if your platform is "less polished"...

Plus, there are the libraries and tools like Intel's Thread Building
Blocks.

Personally, I prefer not to use libraries that are tied to one processor
architecture, but many people just want to get their job done.

The list goes on.

Like I said, I was surprised at how many supercomputer customers
expressed this x86 orientation. I expected them to care little about x86.

Andrew Reilly

Dec 10, 2009, 2:22:55 AM
On Wed, 09 Dec 2009 21:25:32 -0800, Andy \"Krazy\" Glew wrote:

> Like I said, I was surprised at how many supercomputer customers
> expressed this x86 orientation. I expected them to care little about
> x86.

I still expect those who use Cray or NEC vector supers, or any of the
scale-up SGI boxes, or any of the Blue-foo systems to care very little
indeed. The folk who seem to be getting mileage from the CUDA systems
probably only care peripherally. I suspect that it depends on how your
focus group self-selects.

Yes there are some big-iron x86 systems now, but they haven't even been a
majority on the top500 for very long.

I suppose that it doesn't take too long for bit-rot to set in, if the
popular crowd goes in a different direction.

Cheers,

--
Andrew

Terje Mathisen

Dec 10, 2009, 3:32:31 AM
Robert Myers wrote:
> Nvidia stock has drooped a bit after the *big* bounce it took on the
> Larrabee announcement, but I'm not sure why everyone is so negative on
> Nvidia (especially Andy). They don't appear to be in much more
> parlous a position than anyone else. If Fermi is a real product, even
> if only at a ruinous price, there will be buyers.

I have seen a report by a seismic processing software firm, indicating
that their first experiments with GPGPU programming had gone very well:

After 8 rounds of optimization, which basically consisted of mapping
their problem (acoustic wave propagation, according to Kirchhoff) onto
the actual capabilities of a GPU card, they went from being a little
slower than the host CPU up to nearly two orders of magnitude faster.

This meant that Amdahl's law started rearing its ugly head: the setup
overhead took longer than the actual processing, so now they are working
on moving at least some of that surrounding code onto the GPU as well.
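
(A quick Amdahl's-law sanity check, with invented numbers: if the kernel
was 90% of the original runtime and became 100x faster, the whole job
only speeds up about 9x, which is why the setup code now dominates:)

  #include <cstdio>

  // Amdahl: accelerate a fraction p of the runtime by a factor s.
  double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

  int main() {
      printf("%.1fx\n", amdahl(0.90, 100.0));  // prints 9.2x
  }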

Anyway, with something like 40-100x speedups, oil companies will be
willing to spend at least $1000+ per chip.

However, I'm guessing that the global oil processing market has no more
than 100 of the TOP500 clusters, so this is 100K to 1M chips if everyone
were to scrap their current setup.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Dec 10, 2009, 3:37:02 AM

A rumor which has re-surfaced at least every year for as long as I can
remember, gaining strength since the AMD/ATI deal was announced.

Yes, it could well happen; Intel does have some spare change laying
around in the couch cushions. :-)

Torben Ægidius Mogensen

Dec 10, 2009, 3:57:02 AM
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:


> I think that Nvidia absolutely has to have a CPU to have a chance of
> competing. One measly ARM chip or Power PC on an Nvidia die.

They do have Tegra, which is an ARM11 core on a chip with a graphics
processor (alas, not CUDA compatible) plus some other stuff. Adding one
or more ARM cores to a Fermi would not be that far a step. It would
require porting CUDA to ARM, though.

> Or, heck, a reasonably efficient way of decoupling one of Nvidia's
> processors and running 1 thread, non-SIMT, of scalar code.

The Nvidia processors lack interrupts and other stuff necessary for
running an OS, so it is probably better to use a different processor.

> isn't Intel using PowerVR in some Atom chips?

I know ARM uses PowerVR, but I hadn't heard of Intel doing so.

Torben

Michael S

Dec 10, 2009, 4:06:53 AM


"8 rounds of optimization", that's impressive.
I wonder how much speed-up they could get from the host CPU after just
3 rounds:
1. double->single, to reduce memory footprint
2. SIMD
3. Exploit all available cores/threads
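
(A hypothetical sketch of what those three rounds might look like on one
simple inner loop - single precision, 4-wide SSE, OpenMP across cores.
The loop and names are invented; arrays are assumed 16-byte aligned with
n a multiple of 4:)

  #include <immintrin.h>

  void scale_add(const float* a, const float* b, float* c, int n, float k) {
      __m128 vk = _mm_set1_ps(k);          // round 1: floats, not doubles
      #pragma omp parallel for             // round 3: all cores/threads
      for (int i = 0; i < n; i += 4) {     // round 2: 4-wide SIMD
          __m128 va = _mm_load_ps(a + i);
          __m128 vb = _mm_load_ps(b + i);
          _mm_store_ps(c + i, _mm_add_ps(_mm_mul_ps(va, vk), vb));
      }
  }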

nm...@cam.ac.uk

Dec 10, 2009, 4:13:14 AM
In article <4B20864...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:
>
>A surprisingly large number of supercomputer customers use libraries and
>tools that have some specific x86 knowledge.

It used to be very few, but is increasing.

>Like I said, I was surprised at how many supercomputer customers
>expressed this x86 orientation. I expected them to care little about x86.

A lot of it is due to the change in 'community' and their applications.
It's not just the libraries, but I have little to add to what you
said on those (well, other examples, but so?)

Traditionally, people were really up against hard limits, and they
were prepared to both spend serious effort in tuning and switch to
whichever system offered them most time. There still are a lot like
that. Fortran and MPI dominate, and few people give a damn about the
architecture.

An increasing number want to use a 'supercomputer' as an alternative
to tuning their code. Some of those codes are good, some are merely
inefficient, some are unnecessarily x86-dependent, and some LOOK
x86-dependent because they are just plain broken. C++ and shared
memory dominate.

And, as usual, nothing is hard and fast, so there are intermediates
and mixtures and ....


Regards,
Nick Maclaren.

Terje Mathisen

Dec 10, 2009, 4:48:31 AM
Michael S wrote:
> On Dec 10, 10:32 am, Terje Mathisen<Terje.Mathi...@tmsw.no> wrote:
>> I have seen a report by a seismic processing software firm, indicating
>> that their first experiments with GPGPU programming had gone very well:
>>
>> After 8 rounds of optimization, which basically consisted of mapping
>> their problem (acoustic wave propagation, according to Kirchoff) onto
>> the actual capabilities of a GPU card, they went from being a little
>> slower than the host CPU up to nearly two orders of magnitude faster.
>>
>> This meant that Amdahl's law started rearing it's ugly head: The setup
>> overhead took longer than the actual processing, so now they are working
>> on moving at least some of that surrounding code on the GPU as well.
>>
>> Anyway, with something like 40-100x speedups, oil companies will be
>> willing to spend at least $1000+ per chip.
>>
>> However, I'm guessing that the global oil processing market has not more
>> than 100 of the TOP500 clusters, so this is 100K to 1M chips if everyone
>> would scrap their current setup.
>>
>> Terje
> "8 rounds of optimization", that's impressive.
> I wonder how much speed-up could they get from the host CPU after just
> 3 rounds:
> 1. double->single, to reduce memory footprint
> 2. SIMD
> 3. Exploit all available cores/threads

I'm pretty sure they are already doing all of those, at least in the lab
where they tested GPGPU.

Thomas Womack

Dec 10, 2009, 5:24:16 AM
In article <4B207537...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:
>Also, AMD/ATI definitely overtook Nvidia. I think that Nvidia
>emphasized elegance, and GP GPU futures stuff, whereas ATI went the
>slightly inelegant way of combining SIMT Coherent Threading with VLIW.
>It sounds more elegant when you phrase it my way, "combining SIMT
>Coherent Threading with VLIW", than when you have to describe it without
>my terminology. Anyway, ATI definitely had a performance per transistor
>advantage.

ATI win on performance, but nVidia win by miles on GPGPU software
development, simply because they've picked a language and stuck with
it, and at some point some high-up insisted that the GPGPU compilers
be roughly synchronised with the hardware releases; I expect to be
able to pick up a Fermi card, download the latest nvidia SDK, build
something linked with cufft, and get a reasonable performance.

ATI's compiler and driver stack, to the best of my knowledge, doesn't
support double precision yet, well after the second generation of
chips with DP on has appeared.

An AMD employee posted in their OpenCL forum about four weeks ago:

"Double precision floating point support is important for us. We are
planning to begin to introduce double precision arithmetic support in
first half of 2010 as well as the start of some built-ins over time."

Tom

Michael S

Dec 10, 2009, 9:01:50 AM

If they are doing all that, I simply can't see how one of the existing
GPUs (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of
>10. Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
significantly above 1 SP TFLOPs? According to Wikipedia there are not.
So either they compared an array of GPUs with a single host CPU or their
host code is very far from optimal. I'd bet on the latter.

Bernd Paysan

Dec 10, 2009, 9:28:06 AM
Michael S wrote:
> If they are doing all that, I simply can't see how one of the existing
> GPUs (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor
> of >10. Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that
> are significantly above 1 SP TFLOPs? According to Wikipedia there are
> not.

AFAIK the ATI 5870 can achieve up to 3.04 SP TFLOPS at 950MHz. That's a
single chip. And the data is on Wikipedia - where did you look? Look here:

http://en.wikipedia.org/wiki/FLOPS
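
(That figure presumably comes from the usual mul+add counting: 1600
stream processors x 2 FLOPs per clock x 0.95 GHz = 3040 SP GFLOPS.)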

Michael S

Dec 10, 2009, 12:11:35 PM


I looked at the same page but didn't pay attention to the fine print at
the bottom :(

Anyway, the Radeon™ HD 5870 has been in the field for 2 months or
something like that? Somehow I don't think Terje's buddies had it for 8
rounds of optimization.
Also, it seems that until very recently very few organizations tried
non-Nvidia GPGPU.

Torbjorn Lindgren

Dec 10, 2009, 1:14:59 PM
Torben Ægidius Mogensen <tor...@diku.dk> wrote:

The chipset for the MID/UMPC Atoms is Poulsbo (aka "Intel System
Controller Hub US15W") and contains "GMA500".

GMA500 is entirely unrelated to all other GMA models and consists of
PowerVR SGX 535 (graphics) and PowerVR VXD (H.264/MPEG-4 AVC
playback)...

IIRC only the special MID/UMPC Atoms can be coupled with Poulsbo (i.e.
Z5xx/Silverthorne)?

The other Intel chipsets use a lot more power than the Atom itself
(extremely bad when you have a low-power CPU) and are pretty anemic to
boot. At the other end is Nvidia Ion, which also uses a bit more power
than one would hope but at least has a useful GPU/HD playback
accelerator (like Poulsbo but faster).

http://en.wikipedia.org/wiki/Poulsbo_(chipset)
http://en.wikipedia.org/wiki/Intel_GMA
http://en.wikipedia.org/wiki/Intel_Atom#Power_requirements

j...@cix.compulink.co.uk

Dec 10, 2009, 4:15:18 PM
In article <1isTm.93647$Pi.2...@newsfe30.ams2>, me...@devnull.com
(ChrisQ) wrote:

> The obvious question then is: Would one of many x86 cores be fast
> enough on its own to run legacy Windows code like Office, Photoshop
> etc.?...

Maybe. But can marketing men convince themselves that this would be the
case? Almost certainly: a few studies about how many apps the average
corporate Windows user has open at a time could work wonders. The
problem, of course, is that most of those apps aren't consuming much CPU
except when they have input focus. But that's the kind of thing that
marketing departments are good at neglecting.

--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.

j...@cix.compulink.co.uk

Dec 10, 2009, 4:15:18 PM
In article <hfk1mu$i7j$1...@smaug.linux.pwf.cam.ac.uk>, nm...@cam.ac.uk ()
wrote:
> Yes. But the word "planned" implies a degree of deliberate action
> that I believe was absent. They assuredly blithered on about it,
> and very probably had meetings about it ....

Indeed. Intel don't seem to have become serious about multi-core until
they discovered that they could not clock the NetBurst above 4GHz, but
that their fab people could readily fit two of them on a single die.

I did some work with the early "Pentium D", which was two NetBursts on
the same die, but with two sets of legs, and no communication between
the cores that didn't go through the legs and the motherboard FSB.
Locking performance was unimpressive, to say the least, and early
Opterons beat it utterly. I'm going to take a lot of convincing that
this was a long-planned product; the design just isn't good enough for
that to be convincing.

Robert Myers

Dec 10, 2009, 4:47:53 PM
On Dec 10, 4:15 pm, j...@cix.compulink.co.uk wrote:

> I did some work with the early "Pentium D", which was two NetBursts on
> the same die, but with two sets of legs, and no communication between
> the cores that didn't go through the legs and the motherboard FSB.
> Locking performance was unimpressive, to say the least, and early
> Opterons beat it utterly. I'm going to take a lot of convincing that
> this was a long-planned product; the design just isn't good enough for
> that to be convincing.

The charts I remember, and I'm sure they were from the last
millennium, observed the rate at which power per unit area was
increasing, had a space shuttle thermal tile number on the same slide
for comparison, and concluded that the trend was not sustainable.

There were, in concept, at least two ways you could beat the trend: go
to multiple cores not running so fast (the proposal in that
presentation) or bet on a miracle. Apparently, the NetBurst team was
betting on a miracle.

From the outside, Intel looks arrogant enough to believe that they
could do multiple cores when they were forced to and no sooner. In
actuality, they weren't far wrong. Most people don't remember Pentium
D.

Robert.

Andy "Krazy" Glew

Dec 11, 2009, 12:23:44 AM, to Michael S
Michael S wrote:

> If they are doing all that, I simply can't see how one of the existing
> GPUs (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor
> of >10. Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that
> are significantly above 1 SP TFLOPs? According to Wikipedia there are
> not. So either they compared an array of GPUs with a single host CPU
> or their host code is very far from optimal. I'd bet on the latter.

Let's see: http://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg
says that Nhm can do 2 128-bit SSE adds and 1 128-bit SSE mul per cycle.
Now, you might count that as 12 SP FLOPs or 6 DP FLOPs per cycle.
Multiplied by 8 cores on a chip, you might get 96 SP FLOPs per cycle.

However, most supercomputer people count a flop as a
multiply-accumulate. By that standard, Nhm is only 4 SP mul-add FLOPs
per cycle. Add a fudge factor for the extra adder, but certainly not
2X, probably not even 1.5X -- and purists won't even give you that. 32
FLOPS. If you are lucky.

Seldom do you get the 100% utilization of the FMUL unit that you would
need to get 32 SP FLOPs. Especially not when you throw in MP bus
contention, thread contention, etc.

Whereas the GPUs tend to have real FMAs. Intel and AMD have both
indicated that they are going the FMA direction. But I don't think that
has shipped yet.

And, frankly, it is easier to tune your code to get good utilization on
a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
for simple codes, yes, CUDA is easier. In my experience. And I'm a
fairly good x86 programmer, and a novice CUDA GPU programmer. I look
forward to Terje reporting his experience tuning code for CUDA (as long
as he isn't tuning wc).

The painful thing about CUDA is the ugly memory model - blocks, blah,
blah, blah. And it is really bad when you have to transfer stuff from
CPU to GPU memory. I hope and expect that Fermi will ameliorate this pain.
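
To make the pain concrete, here is roughly the shape of even a trivial
CUDA program, circa 2009. This is a sketch; the names, sizes, and
grid/block shape are mine, purely illustrative:

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* block/thread math */
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, bytes);                  /* separate GPU memory */
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); /* explicit copy in    */
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     /* blocks, blah, blah  */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* explicit copy out   */
    cudaFree(d);
    free(h);
    return 0;
}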

---

People are reporting 50x and 100x improvements on CUDA all over the
place. Try it yourself. Be sure to google for tuning advice.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 12:26:27 AM12/11/09
to Andrew Reilly
Andrew Reilly wrote:
> On Wed, 09 Dec 2009 21:25:32 -0800, Andy "Krazy" Glew wrote:
>
>> Like I said, I was surprised at how many supercomputer customers
>> expressed this x86 orientation. I expected them to care little about
>> x86.
>
> I still expect those who use Cray or NEC vector supers, or any of the
> scale-up SGI boxes, or any of the Blue-foo systems to care very little
> indeed. The folk who seem to be getting mileage from the CUDA systems
> probably only care peripherally.

Actually some of the CUDA people do care.

They'll use CUDA for the performance critical code, and x86 for all the
rest, in the system it is attached to. With the x86 tools.

Or at least that's what they told me at SC09.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 12:36:12 AM12/11/09
to j...@cix.compulink.co.uk
j...@cix.compulink.co.uk wrote:
> In article <1isTm.93647$Pi.2...@newsfe30.ams2>, me...@devnull.com
> (ChrisQ) wrote:
>
>> The obvious question then is: Would one of many x86 cores be fast
>> enough on it's own to run legacy windows code like office, photoshop
>> etc ?...
>
> Maybe. But can marketing men convince themselves that this would be the
> case? Almost certainly: a few studies about how many apps the average
> corporate Windows user has open at a time could work wonders. The
> problem, of course, is that most of those apps aren't consuming much CPU
> except when they have input focus. But that's the kind of thing that
> marketing departments are good at neglecting.

At SC09 the watchword was heterogeneity.

E.g. a big OOO x86 core, with small efficient cores of your favorite
flavour. On the same chip.

While you could put a bunch of small x86 cores on the side, I think that
you would probably be better off putting a bunch of small non-x86 cores
on the side. Like GPU cores. Like Nvidia. OR AMD/ATI Fusion.

Although this makes sense to me, I wonder if the people who want x86
really want x86 everywhere - on both the big cores, and the small.

Nobody likes the hetero programming model. But if you get a 100x perf
benefit from GPGPU...

Terje Mathisen

unread,
Dec 11, 2009, 1:48:52 AM12/11/09
to
Michael S wrote:
> If they are doing all that I simply can't see how one of the existing GPUs
> (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
> Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> significantly above 1 SP TFLOPs? According to Wikipedia there are not.
> So, either they compare an array of GPUs with a single host CPU or their
> host code is very far from optimal. I'd bet on the latter.

It seems to be a bandwidth problem much more than a flops, i.e. using
the texture units effectively was the key to the big wins.

Take a look yourself:

http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_Seismic_Hess.pdf

Page 15 has the optimization graph, stating that 'Global Memory
Coalescing', 'Optimized Use of Shared Memory' and getting rid of
branches were the main contributors to the speedups.

It is of course possible that the CPU baseline was quite naive code, or
that the cpus used were quite old, but I would hope not.

Noob

unread,
Dec 11, 2009, 7:28:02 AM12/11/09
to
Andy "Krazy" Glew wrote:

> Nobody likes the hetero programming model.

They prefer the homo programming model? :-)

Michael S

unread,
Dec 11, 2009, 8:10:36 AM12/11/09
to
On Dec 11, 7:23 am, "Andy "Krazy" Glew" <ag-n...@patten-glew.net>
wrote:

> Michael S wrote:
> > If they are doing all that I simply can't see how one of the existing GPUs
> > (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
> > Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> > significantly above 1 SP TFLOPs? According to Wikipedia there are not.
> > So, either they compare an array of GPUs with a single host CPU or their
> > host code is very far from optimal. I'd bet on the latter.
>
> Let's see: http://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg
> says that Nhm can do 2 128-bit SSE adds and 1 128-bit SSE mul per cycle.
> Now, you might count that as 12 SP FLOPs or 6 DP FLOPS. Multiplied by 8
> cores on a chip, you might get 96 SP FLOPS.
>

I counted it as 8 SP FLOPs. If Wikipedia claims that Nehalem can do 2
FP128 adds per cycle then they are wrong, but more likely you misread
it. Nehalem has only one 128-bit FP adder, attached to port 1, exactly
like the previous members of the Core2 family. Port 5 is only "move and
logic", not capable of FP arithmetic.
8 FLOPs/core * 4 cores/chip * 2.93 GHz => 94 GFLOPs
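
Spelled out, in case anyone wants to check the arithmetic (plain C,
numbers as stated above):

#include <stdio.h>

int main(void)
{
    double flops_per_cycle = 8.0; /* 4-wide SSE add + 4-wide SSE mul */
    double cores = 4.0;           /* Nehalem: 4 cores per chip       */
    double ghz = 2.93;
    printf("%.1f SP GFLOPs peak\n", flops_per_cycle * cores * ghz); /* 93.8 */
    return 0;
}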

> However, most supercomputer people count a flop as a
> multiply-accumulate.
> By that standard, Nhm is only 4 SP mul-add FLOPs
> per cycle.

Bullshit. Supercomputer people count exactly like everybody else. Look
at "peak flops" in LINPACK reports.

>Add a fudge factor for the extra adder, but certainly not
> 2X, probably not even 1.5X -- and purists won't even give you that. 32
> FLOPS. If you are lucky.
>
> Seldom do you get the 100% utilization of the FMUL unit that you would
> need to get 32 SP FLOPS. Especially not when you through in MP bus
> contention, thread contention, etc.
>
> Whereas the GPUs tend to have real FMAs.

That has nothing to do with the calculation at hand. When AMD says that
their new chip does 2.72 TFLOPs they really mean 1.36 TFMAs.

>Intel and AMD have both indicated that they are going the FMA direction. But I don't think that
> has shipped yet.

Hopefully, Intel is not going in the FMA direction. 3 source operands is a
major PITA for a P6-derived uarch. Most likely it requires coordinated
dispatch via two execution ports, so it would give nothing for peak
throughput. But you sure know more than me about it.
FMA makes sense on Silverthorne, but I'd rather see Silverthorne dead.

>
> And, frankly, it is easier to tune your code to get good utilization on
> a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
> for simple codes, yes, CUDA is easier. In my experience. And I'm a
> fairly good x86 programmer, and a novice CUDA GPU programmer. I look
> forward to Terje reporting his experience tuning code for CUDA (as long
> as he isn't tuning wc).

I'd guess you played with microbenchmarks. Can't imagine it to be true
on real-world code that, yes, is "ugly" but, what can we do, real-
world problems are almost never nice and symmetric.


>
> The painful thing about CUDA is the ugly memory model - blocks, blah,
> blah, blah. And it is really bad when you have to transfer stuff from
> CPU to GPU memory. I hope and expect that Fermi will ameliorate this pain.
>
> ---
>
> People are reporting 50x and 100x improvements on CUDA all over the
> place. Try it yourself. Be sure to google for tuning advice.

First, 95% of the people can't do proper SIMD+multicore on host CPU to
save their lives, and that is already a large proportion of the "people
reporting". Of those who are honest and know what they are doing, the
majority likely had a problem that was not computationally bound to
start with, and they found a way to take advantage of the texture units.
According to Terje (see below) that was the case in the seismic code he
brought as an example.

Still, I have a feeling that a majority (not all) of the PDE-type
problems that are assisted by texture units on a GPU could, on the host
CPU, be reformulated to exploit temporal locality via the on-chip cache.
But that's just a feeling, nothing scientific.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 9:59:07 AM12/11/09
to Michael S
Michael S wrote:
> First, 95% of the people can't do proper SIMD+multicore on host CPU to
> save their lives

Right. If only 5% (probably less) of people can't do SIMD+multicore on
host CPU, but 10% can do it on a coherent threaded microarchitecture,
which is better?

> On Dec 11, 7:23 am, "Andy "Krazy" Glew" <ag-n...@patten-glew.net> wrote:
>> And, frankly, it is easier to tune your code to get good utilization on
>> a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
>> for simple codes, yes, CUDA is easier. In my experience. And I'm a
>> fairly good x86 programmer, and a novice CUDA GPU programmer. I look
>> forward to Terje reporting his experience tuning code for CUDA (as long
>> as he isn't tuning wc).
>
> I'd guess you played with microbenchmarks.

Yep, you're right.

But even on the simplest microbenchmark, DAXPY, I needed to spend less
time tuning it on CUDA than I did tuning it on x86.
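
For reference, the device side of DAXPY is about this small - a sketch;
the grid/block shape is the main tuning knob, my choices here are
illustrative, and double precision assumes a part that supports it:

__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   /* loads/stores coalesce when contiguous */
}

/* launch: daxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y); */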

Now, there are some big real world apps where coherent threading falls
off a cliff. Where SIMT just doesn't work.

But if GPGPU needs less work for simple stuff, and comparable work for
hard stuff, and if the places where it falls off a cliff are no more
common than where MIMD CPU falls off a cliff...

If this was Willamette versus CUDA, there would not even be a question.
CUDA is easier to tune than Wmt. Nhm is easier to tune than Wmt, but it
still has a fairly complex microarchitecture, with lots of features that
get in the way. Sometimes simpler is better.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 10:24:36 AM12/11/09
to Terje Mathisen
Terje Mathisen wrote:
> It seems to be a bandwidth problem much more than a flops, i.e. using
> the texture units effectively was the key to the big wins.

I am puzzled, torn, wondering about the texture units.

There's texture memory, which is just a slightly funky form of
cache/memory, with funky locality, and possible compression.

But then there's also the texture computation capability. Which is just
a funky form of 2 or 3D interpolation.

Most people seem to be getting the benefit from texture memory. But
when people use the texture interpolation compute capabilities, there's
another kicker.

Back in the 1990s on P6, when I was trying to make the case for CPUs to
own the graphics market, and not surrender it to the only-just-nascent
GPUs, the texture units were the oinker: they are just so damned
necessary to graphics, and they are just so damned idiosyncratic. I do
not know of any good way to do texturing in software that doesn't lose
performance, or of any good way to decompose texturing into simpler
instruction set primitives that could reasonably be added to an
instruction set. E.g. I don't know of any good way to express texture
operations in terms of 2, or 3, or even 4, register inputs.

Let's try again: how about an interpolate instruction that takes 3
vector registers, and performs interpolation between tuples in the X
direction along the length of the register, and in the Y direction
between corresponding elements in different registers?
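
In scalar terms, per output element, such an instruction would compute
something like this (a sketch; the function and argument names are mine):

static float lerp(float a, float b, float f)
{
    return a + f * (b - a);
}

static float bilerp(float t00, float t10,   /* X neighbours, register 1 */
                    float t01, float t11,   /* X neighbours, register 2 */
                    float fx, float fy)     /* fractions, register 3    */
{
    return lerp(lerp(t00, t10, fx),         /* interpolate along X...   */
                lerp(t01, t11, fx), fy);    /* ...then along Y          */
}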

But do you want 2, or 3, or 4, or ... arguments to interpolate along?
And what about Z interpolation?

Let alone compression? And skewed sampling? And ...

Textures just seem to be this big mass of stuff, all of which has to be
done in order to be credible.

Although I usually try to decompose complex things into simpler
operations, sometimes it is necessary to go the other way. Maybe we can
make the texture units more general. Make them into generally useful
function interpolation units. Add that capability to general purpose CPUs.

How much of the benefit is texture computation vs texture memory? Can
we separate these two things?

Texture computation is interpolation. (Which, of course, often
translates to memory savings because it changes the amount of memory you
need for lookup tables - higher order interpolation, or multiscale
interpolation => less memory traffic.) It looks like this can be made
general purpose. But how many people need it?

Texture memory is ... a funky sort of cache, with compression. Caches we
can make generically useful. Compression - for read-only data
structures, sure. But how can we write INTO the "compressed texture
cache memory", in such a way that we don't blow out the compression when
it gets kicked out of the cache?

Or, can we safely create a hardware data structure that is mainly useful
for caching read-only, heavily preprocessed, data?

It seems to me that most of the GPGPU codes are not using the compute or
compression aspects of texture units. Indeed, CUDA doesn't really give
access to that. So it is probably just the extra memory ports and cache
behavior.

--

Terje, you're the master of lookup tables. Can you see a way to make
texture units generally useful?

Andy "Krazy" Glew

unread,
Dec 11, 2009, 4:20:02 PM12/11/09
to Michael S
Andy "Krazy" Glew wrote:
> Michael S wrote:
>> First, 95% of the people can't do proper SIMD+multicore on host CPU to
>> save their lives
>
> Right. If only 5% (probably less) of people can't do SIMD+multicore on
> host CPU, but 10% can do it on a coherent threaded microarchitecture,
> which is better?

Urg. Language. Poor typing skills.

If only 5% can do good SIMD+multicore tuning on a host CPU,
but 10% can do it on a coherent threaded GPU-style microarchitecture,
which is better?

Andy "Krazy" Glew

unread,
Dec 11, 2009, 4:50:15 PM12/11/09
to Robert Myers
Robert Myers wrote:
> On Dec 9, 11:12 pm, "Andy "Krazy" Glew" <ag-n...@patten-glew.net>

> wrote:
>
>> And Nvidia needs to get out of the discrete graphics board market niche
>> as soon as possible. If they can do so, I bet on Nvidia.
>
> Cringely thinks, well, the link says it all:
>
> http://www.cringely.com/2009/12/intel-will-buy-nvidia/

Let's have some fun. Not gossip, but complete speculation. Let's think
about what companies might have a business interest in or be capable of
buying Nvidia. Add to this one extra consideration: Jen-Hsun Huang is
rumored to have wanted the CEO position in such merger possibilities in
the past:

http://www.tomsguide.com/us/nvidia-amd-acquisition,news-594.html

The list:

---> Intel + Nvidia:
I almost hope not, but Cringely has described the possibility.
However, Jen-Hsun would be unlikely to get CEO. Would he be happy with
being in charge of all graphics operations at Intel?
PRO: Nvidia is in CA and in OR, with 2 big Oregon sites. CON: Nvidia is
in CA, which is being deprecated by all cost-sensitive companies.
CON: Retention. I suspect that not only would many Larrabee and Intel
integrated GPU guys leave in such a merger, but also many Nvidia guys
would too. For many people, Nvidia's biggest advantage is that it is not
Intel or AMD.

---> AMD + Nvidia:
I know, AMD already has ATI. But this crops up from time to time. I
think that it is unlikely now, but possible if either makes a misstep and
shrinks market-cap-wise.

http://www.tomsguide.com/us/nvidia-amd-acquisition,news-594.html, 2008.

---> IBM + Nvidia:
Also crops up. Maybe marginally more likely than AMD. Perhaps more
likely now that Cell is deprecated. But IMHO unlikely that IBM wants to
be in consumer. Most likely if IBM went for HPC/servers/Tesla.

http://www.tomsguide.com/us/nvidia-amd-acquisition,news-594.html, 2008.

---> Apple + Nvidia:

Now, this is interesting. But Apple has been burned by Nvidia before.

---> Oracle/Sun + Nvidia:

Long shot. Does Larry really want to risk that much money seeking
world domination, over and above Sun?

---> Samsung + Nvidia:

I keep coming around to this being the most likely, although cultural
differences seem to suggest not. Very different company styles.

---> ARM + Nvidia:

???. Actually, ARM could not buy Nvidia, would have to be some other
sort of deal. But the combination would be interesting. Nvidia's
market would probably be cratered by Intel in the short term, but that
might happen anyway.

---> Some unknown Chinese or Taiwanese PC maker + Nvidia ...: ???

---> Micron + Nvidia:

Finance challenged, but might have interesting potential.


OK, I am sipping dregs here. Did I miss anything?

Oh, yes:

---> Cisco and Nvidia:

Already allied in supercomputers. Makes a lot of sense technically, if
you believe in GPGPU for HPC and servers and databases. But would open
Cisco up to Intel counter-attacks.
The more I learn about Cisco routing, the more I believe that a
coherent threaded GPU-style machine would be really good. Particularly
if they use some of the dynamic CT techniques I described at my Berkeley
ParLab talk in August of this year.

nm...@cam.ac.uk

unread,
Dec 11, 2009, 4:54:34 PM12/11/09
to
In article <4B22B782...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:

The "probably less" is a gross understatement. Make it 0.5%. And
the only reason that rather more can do it on a GPU is that they are
tackling simpler tasks. Put them onto Dirichlet tessellation, and
watch them sweat :-)


Regards,
Nick Maclaren.

Andrew Reilly

unread,
Dec 11, 2009, 5:18:14 PM12/11/09
to
On Fri, 11 Dec 2009 07:24:36 -0800, Andy "Krazy" Glew wrote:

> Back in the 1990s on P6, when I was trying to make the case for CPUs to
> own the graphics market, and not surrender it to the only-just-nascent
> GPUs, the texture units were the oinker: they are just so damned
> necessary to graphics, and they are just so damned idiosyncratic. I do
> not know of any good way to do texturing in software that doesn't lose
> performance, or of any good way to decompose texturing into simpler
> instruction set primitives that could reasonably be added to an
> instruction set. E.g. I don't know of any good way to express texture
> operations in terms of 2, or 3, or even 4, register inputs.

Isn't that a fairly damning argument against Larrabee, as a general-
purpose graphics part? Or did Larrabee have equivalent texture units
bolted on to the side of their Atom-ish cores?

Cheers,

--
Andrew

Del Cecchi

unread,
Dec 11, 2009, 7:18:40 PM12/11/09
to

"Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message
news:4B22BE9...@patten-glew.net...

And where would all the GPU guys go after the merger? In this
economy?
What's in it for Intel?


>
> ---> AMD + Nvidia:
> I know, AMD already has ATI. But this crops up from time to time.
> I think that it is unlikely now, but possible if either makes a
> mistep and shrinks market-cap-wise.
>
> http://www.tomsguide.com/us/nvidia-amd-acquisition,news-594.html,
> 2008.
>
> ---> IBM + Nvidia:
> Also crops up. Maybe marginally more likely than AMD. Perhaps more
> likely now that Cell is deprecated. But IMHO unlikely that IBM
> wants to be in consumer. Most likely if IBM went for
> HPC/servers/Tesla.

IBM is slowly getting out of the hardware business, in general. And
IBM certainly doesn't need Nvidia to do multiprocessors or HPC.
Selling graphics cards for a few hundred bucks to go in PCs is close
to the last thing IBM seems to be interested in.
(snip)

I have been reading stories about what IBM is doing and why for going
on 40 years now and very very few have even been close to accurate.

The "raw rumors and random data" from Datamation used to be my
favorite. :-)

del


Andy "Krazy" Glew

unread,
Dec 12, 2009, 1:02:51 AM12/12/09
to Andrew Reilly

Where did you get your information about Larrabee?

Wikipedia (http://en.wikipedia.org/wiki/Larrabee_%28GPU%29) says
(as of the time I am posting this):

Larrabee's x86 cores will be based on the much simpler Pentium P54C design

Larrabee includes one major fixed-function graphics hardware feature:
texture sampling units. These perform trilinear and anisotropic
filtering and texture decompression.

The following seems to be the standard reference for Larrabee:

http://software.intel.com/file/2824/

Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P.,
Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E.,
Juan, T., Hanrahan, P. 2008. Larrabee: A Many-Core x86 Architecture for
Visual Computing. ACM Trans. Graph. 27, 3, Article 18 (August 2008),
15 pages. DOI = 10.1145/1360612.1360617
http://doi.acm.org/10.1145/1360612.1360617

I like their quote on texture units:

Larrabee includes texture filter logic because this operation
cannot be efficiently performed in software on the cores. Our
analysis shows that software texture filtering on our cores would
take 12x to 40x longer than our fixed function logic, depending on
whether decompression is required. There are four basic reasons:
• Texture filtering still most commonly uses 8-bit color
components, which can be filtered more efficiently in
dedicated logic than in the 32-bit wide VPU lanes.
• Efficiently selecting unaligned 2x2 quads to filter requires a
specialized kind of pipelined gather logic.
• Loading texture data into the VPU for filtering requires an
impractical amount of register file bandwidth.
• On-the-fly texture decompression is dramatically more
efficient in dedicated hardware than in CPU code.
The Larrabee texture filter logic is internally quite similar to
typical GPU texture logic. It provides 32KB of texture cache per
core and supports all the usual operations, such as DirectX 10
compressed texture formats, mipmapping, anisotropic filtering,
etc. Cores pass commands to the texture units through the L2
cache and receive results the same way. The texture units perform
virtual to physical page translation and report any page misses to
the core, which retries the texture filter command after the page is
in memory. Larrabee can also perform texture operations directly
on the cores when the performance is fast enough in software

Andrew Reilly

unread,
Dec 12, 2009, 2:28:19 AM12/12/09
to
On Fri, 11 Dec 2009 22:02:51 -0800, Andy "Krazy" Glew wrote:

> Where did you get your information about Larrabee?

Only here. I don't recall it coming up, before. I'm not all that
interested in specialized graphics pipelines. Thanks for the great quote!

Cheers,

--
Andrew

Andy "Krazy" Glew

unread,
Dec 12, 2009, 11:26:18 AM12/12/09
to Andrew Reilly

I guess that part of the reason for this conversation is...

Although I *am* interested in specialized graphics functions,

I am much more interested in operations that are of general use.

If you think of texture units as a generalized interpolation and cache
with compression, then we can think of areas of more general use.

j...@cix.compulink.co.uk

unread,
Dec 13, 2009, 9:02:57 AM12/13/09
to
In article <4B21DA4C...@patten-glew.net>, ag-...@patten-glew.net (
Glew) wrote:

> At SC09 the watchword was heterogeneity.
>
> E.g. a big OOO x86 core, with small efficient cores of your favorite
> flavour. On the same chip.

It's a nice idea, but it leaves some questions unanswered. The small
cores are going to need access to memory, and that means more
controllers in the packages, and more legs on the chip. That costs,
whatever.

Now, are the small cores cache-coherent with the big one? If so, that's
more complexity, if not, it's harder to program. I suspect that if they
share an instruction set with the big core, cache coherency is
worthwhile, but if not, not.

Overall, the main advantage of this idea seems to be having a low-
latency link between main and small cores. That is not to be sneezed at:
we've given up a co-processor project because of the geological ages
needed to communicate across PCI-Express busses. Back-of-the-envelope
calculations made it clear that even if the co-processor took zero time
to do its work, we made a speed loss overall.
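
The envelope looks something like this (all numbers are illustrative,
roughly PCI-Express 1.x era; the conclusion survives large changes to
them):

#include <stdio.h>

int main(void)
{
    double bytes     = 1e6;    /* data shipped each way               */
    double latency_s = 10e-6;  /* software + bus latency, one way     */
    double bw        = 2e9;    /* sustained bytes/second across PCIe  */
    double local_s   = 0.8e-3; /* time to just do the work on the CPU */

    /* floor on offload time, with the co-processor taking ZERO time: */
    double offload_s = 2.0 * (latency_s + bytes / bw);
    printf("offload floor %.2f ms vs local %.2f ms\n",
           offload_s * 1e3, local_s * 1e3);  /* 1.02 ms vs 0.80 ms    */
    return 0;
}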

> While you could put a bunch of small x86 cores on the side, I think
> that you would probably be better off putting a bunch of small
> non-x86 cores on the side. Like GPU cores. Like Nvidia. OR AMD/ATI
> Fusion.
>
> Although this makes sense to me, I wonder if the people who want x86
> really want x86 everywhere - on both the big cores, and the small.
>
> Nobody likes the hetero programming model. But if you get a 100x
> perf benefit from GPGPU...

The stuff I produce is libraries, that get licensed to third parties and
put into a wide range of apps. Those get run on all sorts of machines,
from all sorts of manufacturers; we need to run on whatever the customer
has, rather than simply what the software developers' managers chose to
buy.

That means "small efficient cores of your favourite flavour" are
something of a pain: if there are several different varieties of such
things out there, I have to support (and thus build for and test) most
of them, or plump for one with a significant chance of being wrong, or
wait for a dominant one to emerge. Which is easiest?

That's the attraction of OpenCL as opposed to CUDA: it isn't tied to one
manufacturer's hardware. However, AMD don't seem to be doing a great job
of spreading it around at present.

The great potential advantage, to me, of the small cores being x86 is
not the x86 instruction set, or its familiarity, or its widespread
development tools. It's having them standardised. That doesn't solve the
problem of making good use of them, but it takes some logistic elements
(and thus costs) out of it.

Terje Mathisen

unread,
Dec 13, 2009, 4:07:58 PM12/13/09
to
Andy "Krazy" Glew wrote:
> Terje, you're the master of lookup tables. Can you see a way to make
> texture units generally useful?

No, not really.

I _would_ use them for interpolated lookup tables, for things like
really fast but limited-precision math functions.
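
For example, something like this - one-dimensional, which is what a
texture unit does per axis (a sketch; the table size and the choice of
sin() are arbitrary):

#include <math.h>

#define PI 3.14159265358979323846
#define TBL_SIZE 256

static float sin_tbl[TBL_SIZE + 1];       /* +1 entry so i+1 never wraps */

void init_tbl(void)
{
    for (int i = 0; i <= TBL_SIZE; i++)
        sin_tbl[i] = (float)sin(2.0 * PI * i / TBL_SIZE);
}

float fast_sin(float x)                   /* x in [0,1) turns, not radians */
{
    float f = x * TBL_SIZE;
    int   i = (int)f;                     /* table index              */
    f -= i;                               /* fraction for the interp  */
    return sin_tbl[i] + f * (sin_tbl[i + 1] - sin_tbl[i]);
}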

Most of the texture stuff is, as you say, very much dedicated to one
particular task.

An additional problem is that they only work well for throughput
computing, since typical latency can be rather bad.

Robert Myers

unread,
Dec 13, 2009, 4:16:31 PM12/13/09
to
On Dec 13, 4:07 pm, Terje Mathisen <Terje.Mathi...@tmsw.no> wrote:
> Andy "Krazy" Glew wrote:
> > Terje, you're the master of lookup tables. Can you see a way to make
> > texture units generally useful?
>
> No, not really.
>
> I _would_ use them for interpolated lookup tables, for things like
> really fast but limited-precision math functions.
>
> Most of the texture stuff is, as you say, very much dedicated to one
> particular task.
>
> An additional problem is that they only work well for throughput
> computing, since typical latency can be rather bad.

We went through this lengthy discussion of hyperthreading. One of the
conclusions was that there wasn't much savings because the execution
units that were being maximally utilized didn't take up all that much
space, anyway. Does it matter all that much if texture units are
nothing but dead weight for some applications (so long as you can turn
them off completely)?

Robert.

Andy "Krazy" Glew

unread,
Dec 14, 2009, 9:57:53 AM12/14/09
to j...@cix.compulink.co.uk
j...@cix.compulink.co.uk wrote:
> In article <4B21DA4C...@patten-glew.net>, ag-...@patten-glew.net (
> Glew) wrote:
>
>> At SC09 the watchword was heterogeneity.
>>
>> E.g. a big OOO x86 core, with small efficient cores of your favorite
>> flavour. On the same chip.
>
> It's a nice idea, but it leaves some questions unanswered. The small
> cores are going to need access to memory, and that means more
> controllers in the packages, and more legs on the chip. That costs,
> whatever.
>
> Now, are the small cores cache-coherent with the big one? If so, that's
> more complexity, if not, it's harder to program. I suspect that if they
> share an instruction set with the big core, cache coherency is
> worthwhile, but if not, not.

I must admit that I do not understand your term "legs on the chip". When
I first saw it, I thought that you meant pins. Like, the old two chips
in same package, or on same chip, not sharing a memory controller. But
that does not make sense here.

Whenever you have multicore, you have to arrange for memory access. The
main way this is done is to arrange for all to access the same memory
controller. (Multiple memory controllers are a possibility. Multiple
MCs subdividing the address space, either by address ranges or by
interleaved cache lines or similar blocks, a possibility. Multiple MCs
with separate address spaces, dedicated to separate groups of
processors, are possible. But I don't know what would would motivate
that. Bandwidth - but non-cache coherent shared memory has the same
bandwidth advantages. Security?)

I therefore do not understand you when you say "that means more
controllers in the package". The hetero chips would probably share the
same memory controller.

If you mean cache controllers, yes: if you want cache consistency, you
will need cache controllers for every small processor, or at least every
group of processors.

If you have a scalable interconnect on chip, then both big and small
processors will connect to it. Having N big cores + M small cores is no
more complex in that regard than having N+M big cores. Except... since
the sizes and shapes of the big and small cores are different, the
physical layout will be different. Timing, etc. (But if you are
creating a protocol that is timing and layout sensitive, you deserve to
be cancelled.) Logically, same complexity.

Testing-wise, of course, different complexity. You would have to test
all of the combinations big/big, big/small, small/small, small/small on
the ends of the IC, ...

--

As for cache consistency, that is on and off. Folks like me aren't
afraid to take the cache protocols that work on multichip systems, and
put them on-chip. Integration is obvious. Where you get into problems
is wrt tweaking.

On the other hand, big MP / HPC systems tend to have nodes that consist
of 4-8-16 cache consistent shared memory cores, and then run PGAS style
non-cache-coherent shared memory between them, or MPI message passing.
Since integration is inevitable as well as obvious, inevitably we
will have more than one cache coherent domains on chip, which are PGAS
or MPI non-cache coherent between the domains.

Andy "Krazy" Glew

unread,
Dec 14, 2009, 10:15:09 AM12/14/09
to Terje Mathisen
Terje Mathisen wrote:
> Andy "Krazy" Glew wrote:
>> Terje, you're the master of lookup tables. Can you see a way to make
>> texture units generally useful?
>
> No, not really.
>
> I _would_ use them for interpolated lookup tables, for things like
> really fast but limited-precision math functions.
>
> Most of the texture stuff is, as you say, very much dedicated to one
> particular task.
>
> An additional problem is that they only work well for throughput
> computing, since typical latency can be rather bad.

That, I can improve, if there is motivation. There are several factors
that give the long latency:

a) the typical I/O interface - memory mapped, PCIe, whatever. This is
not necessary. We could easily create a fast interface, e.g. by binding
them into the ISA, with a fast connection if warranted. And a slow
implementation in terms of something like the existing interface if not
warranted.

Please note: there is nothing or little special about an ISA interface
per se. The main advantage is that an ISA binding allows you to have the
option of spending the money to implement a fast aggressive interface,
or a slow low cost interface, without changing the interface to
software. There also are few other potential ISA advantages, such as
knowing exactly what sort of serialization is required.

b) The latency induced by the size`of the texrure mapping unit. This has
two aspects: the compute, and the cache. If the two could be separated...

nm...@cam.ac.uk

unread,
Dec 14, 2009, 10:20:01 AM12/14/09
to
In article <4B265271...@patten-glew.net>,
Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:

>j...@cix.compulink.co.uk wrote:
>>
>>> At SC09 the watchword was heterogeneity.
>>>
>>> E.g. a big OOO x86 core, with small efficient cores of your favorite
>>> flavour. On the same chip.
>>
>> It's a nice idea, but it leaves some questions unanswered. ...
>>
>> Now, are the small cores cache-coherent with the big one? If so, that's
>> more complexity, if not, it's harder to program. I suspect that if they
>> share an instruction set with the big core, cache coherency is
>> worthwhile, but if not, not.
>
>As for cache consistency, that is on and off. Folks like me aren't
>afraid to take the cache protocols that work on multichip systems, and
>put them on-chip. Integration is obvious. Where you get into problems
>is wrt tweaking.

Precisely. Therefore, when considering larger multi-core than today,
one should look at the systems that have already delivered that using
multiple chips, and see how they have done. It's not pretty.

Now, it is POSSIBLE that multi-core coherence is easier to make
reliable and efficient than multi-chip coherence, but a wise man
will not assume that until he has investigated the causes of the
previous problems and seen at least draft solutions.

8-way shouldn't be a big deal, 32-way will be a lot trickier,
128-way will be a serious problem and 512-way will be a nightmare.
All numbers subject to scaling :-)

>On the other hand, big MP / HPC systems tend to have nodes that consist
>of 4-8-16 cache consistent shared memory cores, and then run PGAS style
>non-cache-coherent shared memory between them, or MPI message passing.

The move to that was a response to the reliability, efficiency and
(most of all) cost problems on the previous multi-chip coherent
systems.

> Since integration is inevitable as well as obvious, inevitably we
>will have more than one cache coherent domains on chip, which are PGAS
>or MPI non-cache coherent between the domains.

Extremely likely - nay, almost certain. Whether those domains will
share an address space or not, it's hard to say. My suspicion is
that they will, but there will be a SHMEM-like interface to them
from their non-owning cores.


Regards,
Nick Maclaren.

j...@cix.compulink.co.uk

unread,
Dec 14, 2009, 4:03:23 PM12/14/09
to
In article <4B265271...@patten-glew.net>, ag-...@patten-glew.net (
Glew) wrote:

> I must admit that I do not understand your term "legs on the chip".
> When I first saw it, I thought that you meant pins. Like, the old two
> chips in same package, or on same chip, not sharing a memory
> controller. But that does not make sense here.

That is what I meant. I just wasn't clear enough.

> Whenever you have multicore, you have to arrange for memory access.
> The main way this is done is to arrange for all to access the same
> memory controller. (Multiple memory controllers are a possibility.

I wasn't explaining enough. A single memory controller does not seem
to be enough for today's big OOO x86 cores. A Core 2 Duo has two memory
controllers; a Core i7 has three. This is inevitably pushing up pin
count. If you add a bunch more small cores, you're going to need even
more memory bandwidth, and thus presumably more memory controllers. This
is no doubt achievable, but the price may be a problem.

Robert Myers

unread,
Dec 14, 2009, 7:55:18 PM12/14/09
to
On Dec 14, 4:03 pm, j...@cix.compulink.co.uk wrote:
 
>
> I wasn't explaining enough. A single memory controller does not seem
> to be enough for today's big OOO x86 cores. A Core 2 Duo has two memory
> controllers; a Core i7 has three. This is inevitably pushing up pin
> count. If you add a bunch more small cores, you're going to need even
> more memory bandwidth, and thus presumably more memory controllers. This
> is do doubt achievable, but the price may be a problem.

Bandwidth. Bandwidth. Bandwidth.

It must be in scripture somewhere. It is, but no one reads the Gospel
according to Seymour any more.

Is an optical fat link out of the question? I know that optical on-
chip will take a miracle and maybe a Nobel prize, but just one fat
link. Is that too much to ask?

Robert.

Andy "Krazy" Glew

unread,
Dec 14, 2009, 11:12:23 PM12/14/09
to nm...@cam.ac.uk
nm...@cam.ac.uk wrote:
>> Since integration is inevitable as well as obvious, inevitably we
>> will have more than one cache coherent domains on chip, which are PGAS
>> or MPI non-cache coherent between the domains.
>
> Extremely likely - nay, almost certain. Whether those domains will
> share an address space or not, it's hard to say. My suspicion is
> that they will, but there will be a SHMEM-like interface to them
> from their non-owning cores.

I'm using PGAS as my abbreviation for "shared memory, shared address
space, but not cache coherent, and not memory ordered". I realize,
though, that some people consider Cray SHMEM different from PGAS. Can
you suggest a more generic term?

Hmmm... "shared memory, shared address space, but not cache coherent,
and not memory ordered"
SM-SAS-NCC-NMO ?
No, needs a better name.

--

Let's see, if I have it right,

In strict PGAS (Private/Global Address Space) there are only two forms
of memory access:
1. local private memory, inaccessible to other processors
2. global shared memory, accessible by all other processors,
although implicitly accessible everywhere the same. Not local to anyone.

Whereas SHMEM allows more types of memory accesses, including
a. local memory, that may be shared with other processors
b. remote accesses to memory that is local to other processors
as well as remote access to memory that isn't local to anyone.
And potentially other memory types.

--

Some people seem to assume that PGAS/SHMEM imply a special type of
programmatic memory access. E.g. Kathy Yelick, in one of her SC09
talks, said "PGAS gives programmers access to DMA controllers."

Maybe often so, but tain't necessarily so. There are several different
ways of "binding" such remote memory accesses to an instruction set so
that a programmer can use them, including:

The first two do not involve changes to the CPU microarchitecture:
a) DMA-style
b) Prefetch-style
The last involves making the CPU aware of remote memory
c) CPU-remote-aware


a) DMA-style - ideally user level, non-privileged, access to something
like a DMA engine. The main question is, how do you give user level
access to a DMA engine? Memory mapped command registers?
Virtualization issues. Queues? (Notification issues. E.g. interrupt on
completion? Not everyone has user level interrupts. (And even though x86
does, they are not frequently used.))

b) Prefetch-style - have the programmer issue a prefetch, somehow.
Later, allow the programmer to perform an access. If the prefetch is
complete, allow it. (Notification issues.)


Could be a normal prefetch instruction, that somehow bypasses the CPU
cache prefetch logic (e.g. because of address range.)

Or, the prefetch could be something like an uncached, UC, store:
UC-STORE
to: magic-address
data: packet containing PGAS address Aremote you want to load from

plus maybe a few other things in the store data packet - length, stride,
etc. Plus maybe the actual store data.


Later, you might do a load.

Possibly a real load: UC-LOAD from: PGAS address Aremote

or possibly a fake load, with a transformed address:
UC-LOAD hash(Aremote)


The load result may contain flags that indicate success/failure/not yet
arrived.

Life would be particularly nice if your instruction set had operations
that allowed you to write out a store address and a data packet, and
then read from the same location, atomically. Yes, atomic RMWs. Like
in PCIe. Like in the processor CMPXCHG type instructions.
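
As a shape-of-the-interface sketch only - every name and field below is
made up to show the idea; nothing here is a real API:

struct remote_cmd {                  /* the "data packet" UC-stored to    */
    unsigned long long addr;         /* the magic address: Aremote,       */
    unsigned int       len;          /* length, stride, etc.              */
    unsigned int       stride;
};

struct remote_result {
    unsigned int       flags;        /* success / failure / not-yet       */
    unsigned long long data;
};

#define REMOTE_NOT_YET 1u

/* 1. UC-store the command packet to the magic address... */
void remote_prefetch(volatile struct remote_cmd *magic,
                     unsigned long long aremote, unsigned int len)
{
    magic->addr   = aremote;         /* these must really be UC stores,   */
    magic->len    = len;             /* bypassing the cache hierarchy     */
    magic->stride = 0;
}

/* 2. ...later, UC-load the result slot and see if the data arrived. */
int remote_try_load(volatile struct remote_result *slot,
                    unsigned long long *out)
{
    if (slot->flags & REMOTE_NOT_YET)   /* UC load of the status flags    */
        return 0;                       /* not there yet; retry later     */
    *out = slot->data;
    return 1;
}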


But, the big cost in all of this is that you probably need to make the
operations involved be UC, uncached. And, because on x86 we have only
one main UC memory type, used for legacy I/O, it is not optimized
for the usage models that PGAS/SHMEM expect.


c) Finally, one could make the CPU aware of PGAS/SHMEM remote accesses.
Possibly as new instructions. Or, possibly as a new memory type.

Now, it is a truism that x86 can't add new memory types. No more page
table bits. We'd rather add new instructions. I think this is bogus.

However, I have always liked the idea of being able to specify the
memory type on a per instruction basis. E.g. in x86, having a new
prefix applicable to memory instructions that says "The type of this
memory access is ...REMOTE-ordinary-memory..." Probably with combining
rules for the page tables and MTRR memory types.

If you come from another instruction set, perhaps like Sun's alternate
address space.

In either case, possibly with the new memory type as a literal field in
the instruction, or possibly from a small set of registers.


If you allow normal memory instructions to access remote memory, and
then just use a memory type, then you could use the same libraries for
both local and remote: e.g. the same linked list routine could work in
both. Assuming it made no assumptions about memory ordering that would
work in local but not in remote memory.

Is this worth doing?


I think that it is always a good idea to have the DMA style or prefetch
style interfaces. Particularly if on a RISC ISA that has no block
instructions like REP MOVS. Also if one wants to add extra
instructions for remote access that are not already in local memory.

But, the a) DMA-style and b) prefetch-style interfaces are probably
slower, for small accesses, on many common implementations. We can more
aggressively optimize the c) CPU-remote-aware.

Conversely, if you don't need it, you can always implement the
CPU-remote-aware in terms of the other two.

Andy "Krazy" Glew

unread,
Dec 14, 2009, 11:51:41 PM12/14/09
to j...@cix.compulink.co.uk
j...@cix.compulink.co.uk wrote:
> I wasn't explaining enough. A single memory controller does not seem
> to be enough for today's big OOO x86 cores. A Core 2 Duo has two memory
> controllers; a Core i7 has three. This is inevitably pushing up pin
> count. If you add a bunch more small cores, you're going to need even
> more memory bandwidth, and thus presumably more memory controllers. This
> is do doubt achievable, but the price may be a problem.

Terminology.

Core 2 Duo has 2 memory channels. Core i7 has 3 memory channels.

You may have separate memory controllers for each memory channel. Or
you may have a single memory controller for each memory channel. Or,
something in between.

Actually, there's always a bit of memory channel specific logic - like,
the actual drivers.

The rest - the interface to your on-chip interconnect, buffering for
outstanding transactions, tracking which banks are open or closed per
chip - may be shared between memory channels, or not.

If your point was that a hetero system is more complex than a homo
system, it's immaterial: more cores, whether all big, or all small, or a
mix of both big and small, requires more memory bandwidth. And hence
more memory channels, and hence work at the memory controller level.

Hetero doesn't impact this - unless you are tempted to do things like
track, say, only one outstanding transaction per small core, and not to
allocate memory controller buffers for small core requests.

Just say no.


Andy "Krazy" Glew

unread,
Dec 15, 2009, 12:11:44 AM12/15/09
to nm...@cam.ac.uk
nm...@cam.ac.uk wrote:
>> Since integration is inevitable as well as obvious, inevitably we
>> will have more than one cache coherent domains on chip, which are PGAS
>> or MPI non-cache coherent between the domains.
>
> Extremely likely - nay, almost certain. Whether those domains will
> share an address space or not, it's hard to say. My suspicion is
> that they will, but there will be a SHMEM-like interface to them
> from their non-owning cores.

Actually, it's not an either/or choice. There aren't just two points on
the spectrum. We have already mentioned three, including the MPI space.
I like thinking about a few more:


1) SMP: shared memory, cache coherent, a relatively strong memory
ordering model like SC or TSO or PC. Typically writeback cache.

0) MPI: no shared memory, message passing

0.5) PGAS: shared memory, non-cache coherent. Typically UC, with DMA as
described in other posts.

0.9) SMP-WC: shared memory, cache coherent, a relatively weak memory
ordering model like RC or WC. Typically writeback cache.

0.8) ... with WT, writethrough, caches. Actually, it becomes a partial
order: there's WT-PC, and WT-WC.

0.7) SMP-WB-SWCO: non-cache-coherent, WB (or WT), with software managed
cache coherency via operations such as cache flushes.

I am particularly intrigued by the possibility of

0.6) SMP-WB-bitmask: non-cache-coherent. However, "eventually
coherent". Track which bytes have been written by a bitmask per cache
line. When evicting a cache line, evict with the bitmask, and
write-back only the written bytes. (Or words, if you prefer).

What I like about this is that it avoids one of the hardest aspects of
non-cache-coherent systems: (a) the fact that writes can disappear - not
just be observed in a different order, but actually disappear, and the
old data reappear; and (b) that this is tied to cache line granularity.

Tracking bitmasks in this way means that you will never lose writes.

You may not know what order they get done in. There may be no global
order.

But you will never lose writes.
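
A sketch of the eviction path, byte granularity on 64-byte lines (the
structure and function names are mine):

#include <stdint.h>

#define LINE_BYTES 64

struct cache_line {
    uint64_t written;               /* bit i set => byte i written here */
    uint8_t  data[LINE_BYTES];
};

/* On eviction, merge back only the locally written bytes, so writes
   to *other* bytes of the same line by other processors survive. */
void evict_line(struct cache_line *l, uint8_t *mem)
{
    for (int i = 0; i < LINE_BYTES; i++)
        if (l->written & (1ull << i))
            mem[i] = l->data[i];
    l->written = 0;
}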


While we are at it

1.1) SMP with update cache protocols.


===

Sorting these according to "strength" - although, as I say above, there
are really some divergences, it is a partial order or lattice:

1.1) SMP with update cache protocols.

****
1) SMP: shared memory, cache coherent, a relatively strong memory
ordering model like SC or TSO or PC. Typically writeback cache.

0.9) SMP-WB-weak: shared memory, cache coherent, a relatively weak
memory ordering model like RC or WC. Typically writeback cache.

0.8) ... with WT, writethrough, caches.

0.7) SMP-WB-SWCO: non-cache-coherent, WB (or WT), with software managed
cache coherency via operations such as cache flushes

0.65) .. with WT

****???????
0.6) SMP-WB-bitmask: non-cache-coherent. However, "eventually
coherent". Track which bytes have been written by a bitmask per cache
line. When evicting a cache line, evict with the bitmask, and
write-back only the written bytes. (Or words, if you prefer).

0.55) ... with WT

****
0.5) PGAS: shared memory, non-cache coherent. Typically UC, with DMA as
described in other posts.

****
0) MPI: no shared memory, message passing


I've marked the models that I think are likely to be most important.

I think SMP-WB-bitmask is more likely to be important than the weak
models 0.7 and 0.9,
in part because I am in love with new ideas
but also because I think it scales better.

It provides the performance of conventional PGAS, but supports cache
locality when it is present. And poses none of the semantic challenges
of software managed cache coherency, although it has all of the same
performance issues.


Of course, it needs roughly 64 bits per cache line. Which may be enough
to kill it in its tracks.

Terje Mathisen

unread,
Dec 15, 2009, 1:48:46 AM12/15/09
to
Andy "Krazy" Glew wrote:
[interesting spectrum of distributed memory models snipped]

> I think SMB-WB-bitmask is more likely to be important than the weak
> models 0.7 and 0.9,
> in part because I am in love with new ideas
> but also because I think it scales better.
>
> It provides the performance of conventional PGAS, but supports cache
> locality when it is present. And poses none of the semantic challenges
> of software managed cache coherency, although it has all of the same
> performance issues.
>
>
> Of ourse, it needs roghly 64 bits per cache line. Which may be enough to
> kill it in its tracks.

Isn't this _exactly_ the same as the current setup on some chips that
use 128-byte cache lines, split into two sectors of 64 bytes each?

I.e. an effective cache line size that is smaller than the "real" line
size, taken to its logical end point.

I would suggest that (as you note) register-size words are the smallest
items you might need to care about and track, so 8 bits for a 64-bit
platform with 64-byte cache lines, but most likely you'll have to
support semi-atomic 32-bit operations, so 16 bits, which is a 3% overhead.
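
The overhead arithmetic, for the record:

#include <stdio.h>

int main(void)
{
    int line = 64;                       /* bytes per cache line        */
    int granules[] = { 1, 4, 8 };        /* byte, 32-bit, 64-bit words  */
    for (int i = 0; i < 3; i++) {
        int bits = line / granules[i];   /* one mask bit per granule    */
        printf("%d-byte granules: %2d bits = %.1f%% overhead\n",
               granules[i], bits, 100.0 * bits / 8 / line);
    }
    return 0;                            /* 12.5%, 3.1%, 1.6%           */
}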

nm...@cam.ac.uk

unread,
Dec 15, 2009, 4:18:46 AM12/15/09
to
In article <4B270CA7...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:
>
>I'm using PGAS as my abbreviation for "shared memory, shared address
>space, but not cache coherent, and not memory ordered". I realize,
>though, that some people consider Cray SHMEM different from PGAS. Can
>you suggest a more generic term?

No, but that isn't what PGAS normally means. However, no matter.

>Let's see, if I have it right,
>
>In strict PGAS (Private/Global Address Space) there are only two forms
>of memory access:
> 1. local private memory, inaccessible to other processors
> 2. global shared memory, accessible by all other processors,
>although implicitly accessible everywhere the same. Not local to anyone.

I wasn't aware of that meaning. Its most common meaning at present
is Partitioned Global Address Space, with each processor owning some
memory but others being able to access it, possibly by the use of
special syntax. Very like some forms of SHMEM.

>Whereas SHMEM allows more types of memory accesses, including
> a. local memory, that may be shared with other processors
> b. remote accesses to memory that is local to other processors
>as well as remote access to memory that isn't local to anyone.
>And potentially other memory types.

Yes, and each use of SHMEM is different.


Regards,
Nick Maclaren.

Mayan Moudgill

unread,
Dec 15, 2009, 5:07:47 AM12/15/09
to
Andy "Krazy" Glew wrote:

>
> In strict PGAS (Private/Global Address Space) there are only two forms
> of memory access:
> 1. local private memory, inaccessible to other processors
> 2. global shared memory, accessible by all other processors,
> although implicitly accessible everywhere the same. Not local to anyone.
>
> Whereas SHMEM allows more types of memory accesses, including
> a. local memory, that may be shared with other processors
> b. remote accesses to memory that is local to other processors
> as well as remote access to memory that isn't local to anyone.
> And potentially other memory types.
>

I can't see that there is any benefit to having strictly private
memory (PGAS 1. above), at least on a high-performance MP system.

The CPUs are going to access memory via a cache. I doubt that there will
be 2 separate kinds of caches, one for private and one for the rest of
the memory. So, as far as the CPUs are concerned there is no distinction.

Since the CPUs are still going to have to talk to a shared memory (PGAS
2. above), there will still be an path/controller between the bottom of
the cache hierarchy and the shared memory. This "controller" will have
to implement whatever snooping/cache-coherence/transfer protocol is
needed by the global memory.

The difference between shared local memory (SHMEM a) and strictly
private local memory (PGAS 1) is whether the local memory sits below the
memory controller or bypasses it. It's not obvious (to me at least)
whether there are any benefits to be had by bypassing it. Can anyone
come up with something?

nm...@cam.ac.uk

unread,
Dec 15, 2009, 5:23:47 AM12/15/09
to
In article <JfednUfEbp5qwrrW...@bestweb.net>,

I don't think you realise how much cache coherence costs, once you
get beyond small core-counts. There are two main methods: snooping
is quadratic in the number of packets and directories are quadratic
in the amount of logic (for constant time accesses). As usual,
there are intermediates, e.g. directories that are (say) N*sqrt(N)
in both logic and number of packets.

The main advantage of truly private memory, rather than incoherent
sharing across domains, is reliability. You can guarantee that it
won't change because of a bug in the code being run on another
processor.


Regards,
Nick Maclaren.

Mayan Moudgill

unread,
Dec 15, 2009, 6:07:46 AM12/15/09
to
nm...@cam.ac.uk wrote:


>
> I don't think you realise how much cache coherence costs, once you
> get beyond small core-counts.

That has nothing to do with truly private vs. shared-local memory:
that's in the cache-coherence protocol. One can (in theory) have the
cross product of {local,global} x {coherent,non-coherent}.

And you really need to stop assuming what other people do and don't know
about stuff...

>
> The main advantage of truly private memory, rather than incoherent
> sharing across domains, is reliability. You can guarantee that it
> won't change because of a bug in the code being run on another
> processor.
>

If I wanted to absolutely guarantee that, I would put the access control
in the memory controller. If I wanted to somewhat guarantee that, I
would use the VM access right bits.

nm...@cam.ac.uk

unread,
Dec 15, 2009, 7:08:57 AM12/15/09
to
In article <B7SdnYVDl8Wf87rW...@bestweb.net>,

Mayan Moudgill <ma...@bestweb.net> wrote:
>
>> I don't think you realise how much cache coherence costs, once you
>> get beyond small core-counts.
>
>That has nothing to do with truly private vs. shared-local memory:
>that's in the cache-coherence protocol. One can (in theory) have the
>cross product of {local,global} x {coherent,non-coherent}.

One can in theory do many things that have proved to be infeasible
in practice. It is true that I misunderstood what you were trying
to say, but I assert that your words (which I quote below) match
my understanding better than your intent does.

I can't see that there is any benefit to having strictly private
memory (PGAS 1. above), at least on a high-performance MP system.

The CPUs are going to access memory via a cache. I doubt that there will
be 2 separate kinds of caches, one for private and one for the rest of
the memory. So, as far as the CPUs are concerned there is no distinction.

Since the CPUs are still going to have to talk to a shared memory (PGAS
2. above), there will still be an path/controller between the bottom of
the cache hierarchy and the shared memory. This "controller" will have
to implement whatever snooping/cache-coherence/transfer protocol is
needed by the global memory.

>And you really need to stop assuming what other people do and don't know
>about stuff...

I suggest that you read what I post before responding like that.
I can judge what you know only from your postings, and this is not
the first time that you have posted assertions that fly in the face
of all HPC experience, without posting any explanation of why you
think that is mistaken, even after being queried.

In particular, using a common cache with different coherence
protocols for different parts of it has been done, but has never
been very successful. I have no idea why you think that the previous
experience of its unsatisfactoriness is misleading.

>> The main advantage of truly private memory, rather than incoherent
>> sharing across domains, is reliability. You can guarantee that it
>> won't change because of a bug in the code being run on another
>> processor.
>
>If I wanted to absolutely guarantee that, I would put the access control
>in the memory controller. If I wanted to somewhat guarantee that, I
>would use the VM access right bits.

Doubtless you would. And that is another example of what I said
earlier. That does not "absolutely guarantee" that - indeed, it
doesn't even guarantee it, because it still leaves the possibility
of a privileged process on another processor accessing the pseudo-
local memory. And, yes, I have seen that cause trouble.

You might claim that it is a bug, but you would be wrong if you did.
Consider the case when processor A performs some DMA-capable I/O on
its pseudo-local memory. You now have different consistency
semantics according to where the I/O process runs.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Dec 15, 2009, 7:42:41 AM12/15/09
to
In article <4B271A90...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:
>
>>> Since integration is inevitable as well as obvious, inevitably we
>>> will have more than one cache coherent domains on chip, which are PGAS
>>> or MPI non-cache coherent between the domains.
>>
>> Extremely likely - nay, almost certain. Whether those domains will
>> share an address space or not, it's hard to say. My suspicion is
>> that they will, but there will be a SHMEM-like interface to them
>> from their non-owning cores.
>
>Actually, it's not an either/or choice. There aren't just two points on
>the spectrum. We have already mentioned three, including the MPI space.
> I like thinking about a few more:

Gug. I need to print those out and study them! Yes, I agree that
it's not an either/or choice, but I hadn't thought out that many
possibilities.

>I am particularly intrigued by the possibility of
>
>0.6) SMP-WB-bitmask: non-cache-coherent. However, "eventually
>coherent". Track which bytes have been written by a bitmask per cache
>line. When evicting a cache line, evict with the bitmask, and
>write-back only the written bytes. (Or words, if you prefer).
>
>What I like about this is that it avoids one of the hardest aspects of
>non-cache-coherent systems: (a) the fact that writes can disappear - not
>just be observed in a different order, but actually disappear, and the
>old data reappear (b) tied to cache line granularity.
>
>Tracking bitmasks in this way means that you will never lose writes.
>
>You may not know what order they get done in. There may be no global
>order.
>
>But you will never lose writes.
>

>I think SMP-WB-bitmask is more likely to be important than the weak
>models 0.7 and 0.9,
> in part because I am in love with new ideas
> but also because I think it scales better.

It also matches language specifications much better than most of the
others, which is not a minor advantage. That could well be the
factor that gets it accepted, if it is.
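
To make the quoted scheme concrete, here is a minimal software sketch
of the per-byte mask idea (all names invented; it models the write-back
policy only, not any real hardware):

#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

struct line {
    uint8_t  data[LINE_SIZE];
    uint64_t written;           /* bit i set => byte i written locally */
};

/* Local write (assumes off + n <= LINE_SIZE): update the bytes and
   record them in the mask. */
static void line_write(struct line *l, int off, const void *src, int n)
{
    memcpy(&l->data[off], src, (size_t)n);
    for (int i = off; i < off + n; i++)
        l->written |= (uint64_t)1 << i;
}

/* Eviction: write back only the written bytes.  Two caches that wrote
   disjoint bytes of the same line can evict in either order and
   neither one's writes are lost - though, as the post says, there is
   still no global order. */
static void line_evict(const struct line *l, uint8_t *mem)
{
    for (int i = 0; i < LINE_SIZE; i++)
        if (l->written & ((uint64_t)1 << i))
            mem[i] = l->data[i];
}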


Regards,
Nick Maclaren.

Del Cecchi

unread,
Dec 15, 2009, 1:37:25 PM12/15/09
to

"Robert Myers" <rbmye...@gmail.com> wrote in message
news:eb1c4904-abd3-4646...@b15g2000yqd.googlegroups.com...

Bandwidth. Bandwidth. Bandwidth.

Robert.
----------------------
Yes, it is at the moment. On the other hand, you can do 10 Gb/sec per
differential pair on copper if you don't want to go too far, so you
don't really need optics. But all the fancy-dancy interface stuff adds
latency, if that's OK.

del.


Andy "Krazy" Glew

unread,
Dec 15, 2009, 8:57:19 PM12/15/09
to Mayan Moudgill
Mayan Moudgill wrote:
> I can't see that there is any benefit to having strictly private
> memory (PGAS 1. above), at least on a high-performance MP system.
>
> The CPUs are going to access memory via a cache. I doubt that there will
> be 2 separate kinds of caches, one for private and one for the rest of
> the memory. So, as far as the CPUs are concerned there is no distinction.
>
> Since the CPUs are still going to have to talk to a shared memory (PGAS
> 2. above), there will still be a path/controller between the bottom of
> the cache hierarchy and the shared memory. This "controller" will have
> to implement whatever snooping/cache-coherence/transfer protocol is
> needed by the global memory.
>
> The difference between shared local memory (SHMEM a) and strictly
> private local memory (PGAS 1) is whether the local memory sits below the
> memory controller or bypasses it. Its not obvious (to me at least)
> whether there are any benefits to be had by bypassing it. Can anyone
> come up with something?

Nick is right: the P in PGAS stands for partitioned, not private. For
some reason, I keep making this confusion.

(Pictures such as slide 4 in
http://groups.google.com/group/scaling-to-petascale-workshop-2009/web/introduction-to-pgas-languages?pli=1
are, perhaps, one source of my confusion, since Snir definitely
depicts private/global, not partitioned.)

Mayan is right: the main motivation for having private memory is whether
you want to bypass any cache. Believe it or not, many HPC people do not
want to have any cache whatsoever. I agree with Mayan: we will
definitely cache local accesses, since running uncached is too slow, and
we probably don't want to create special cases for remote memory. That
being said, I will admit that I have been thinking about special
protocols for global memory, such as described in the previous post.

I suppose that one of the reasons I have been thinking of private as
opposed to partitioned has been thinking about languages that have
"private" and "global" keywords. This is a smaller addition to the
language than adding a placement syntax. The question then is whether
you can convert a pointer to private T into a pointer to public T. UPC
seems to disallow this.

Even if, in the hardware implementation, private and global memory
locations are cached in the same way, it may be desirable to distinguish
them at the language level: the compiler may be able to use more
efficient synchronization mechanisms for variables that are guaranteed
to be local private than it can use for global variables that might be
local or might be remote and might be shared with other processors.
Typically, on X86 the local variables may not require fencing, because
of the X86's default strong memory ordering, whereas fences may be
required for global variables because the global interconnect may not
provide the snooping mechanisms that processors such as the P6 family
use to enforce strong memory ordering. Note that these fences may not
be the standard LFENCE, SFENCE, or MFENCE instructions, since those are
typically not externally visible. Instead they might have to be
expensive UC memory accesses, so that they are visible to the outside
world. Of course it would be wonderful to create new versions of the
fence instructions that could be visible to the external memory fabric.
But if you go down that path you might actually end up having to
distinguish private and global memory.
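
As a toy illustration of the compiler's side of this - plain C11, with
nothing here that is actually fabric-visible - a variable that is
provably private needs no machinery at all, while a possibly-shared
global gets the expensive treatment:

#include <stdatomic.h>

static _Thread_local int private_counter;   /* guaranteed local */
static _Atomic int      shared_flag;        /* might be shared/remote */

void bump_private(void)
{
    private_counter++;          /* ordinary code: no fences, no locks */
}

void publish_shared(int v)
{
    /* On a weakly ordered global fabric this release store would have
       to become something far costlier - per the above, perhaps even
       an UC access - just to be visible outside the coherence domain. */
    atomic_store_explicit(&shared_flag, v, memory_order_release);
}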

- - -

(I am writing this in the Seattle to Portland van, bouncing on the rough
roads. It is quite remarkable how much slower the computer is when
there is this much vibration. I fear that my heads are crashing all the
time. I really need to save up the money to get myself a solid state
disk. Also, as I have noted before, speech recognition works better in
this high-vibration environment than keyboarding, with handwriting
recognition in between. This is the first time I've actually used
speech recognition in the van with somebody else present, except for
Monday, when I was riding with a person who was talking loudly on the
cell phone. I hope that I'm not disturbing the other passenger. I hope
that she will tell me honestly if I am, and not just be polite. I'm
curious to find out if speech recognition is socially acceptable in such
relatively high noise environments as the shuttle van or an airplane. I
hope that it is less obnoxious than speaking on a cell phone. Of
course, the impoliteness of talking on a cell phone does not stop many
people doing it. I suspect that dictating text is better than listening
to a cell phone, because I dictate in full sentences; but listening to
me edit text is probably even more annoying than listening to a cell
phone. I am falling into an odd hybrid of using speech to dictate and
editing with the pen.)

Mayan Moudgill

unread,
Dec 16, 2009, 7:04:16 AM12/16/09
to
nm...@cam.ac.uk wrote:

>
> In particular, using a common cache with different coherence
> protocols for different parts of it has been done, but has never
> been very successful.

There is a distinction between choosing between two different coherence
protocols and between a simpler coherent/not-coherent memory.

At the hardware level, this would be a choice between running MOESI
(or whatever MESI variant is being used) when running with coherence,
and immediately promoting a line from S/O to M for writes (for
non-coherence); you'd use instruction control (e.g. cache flush) or
write-through to guarantee its visibility to the outside world.
Following tradition, this would probably be controlled by bits in the
page-table.
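
In outline, with the mode keyed off a hypothetical page-table bit, the
write path might look like this sketch (names invented):

enum state { ST_INVALID, ST_SHARED, ST_OWNED, ST_EXCLUSIVE, ST_MODIFIED };

struct line { enum state st; };

/* Stand-in for the invalidation/snoop traffic of the coherent path. */
static void invalidate_other_copies(struct line *l) { (void)l; }

static void write_hit(struct line *l, int page_is_coherent)
{
    if (page_is_coherent) {
        /* MOESI path: gain exclusive ownership before writing. */
        if (l->st == ST_SHARED || l->st == ST_OWNED)
            invalidate_other_copies(l);
        l->st = ST_MODIFIED;
    } else {
        /* Non-coherent path: promote S/O to M immediately, with no
           snoop traffic; visibility is deferred to an explicit flush
           or to write-through. */
        l->st = ST_MODIFIED;
    }
}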

So, it's demonstrably simple to *implement* coherence/non-coherence. If
the lack of success is because it is difficult to use in an MP context,
that is a different issue.

>
>>>The main advantage of truly private memory, rather than incoherent
>>>sharing across domains, is reliability. You can guarantee that it
>>>won't change because of a bug in the code being run on another
>>>processor.
>>
>>If I wanted to absolutely guarantee that, I would put the access control
>>in the memory controller. If I wanted to somewhat guarantee that, I
>>would use the VM access right bits.
>
>
> Doubtless you would. And that is another example of what I said
> earlier. That does not "absolutely guarantee" that - indeed, it
> doesn't even guarantee it, because it still leaves the possibility
> of a privileged process on another processor accessing the pseudo-
> local memory. And, yes, I have seen that cause trouble.

Absolutely guarantee would imply a control register in the memory
controller with a bit that, if set, ensures that the only write (or
write and read) requests the memory controller allows through are those
from its "owning" processor. That is why the absolute guarantee is part
of the controller.
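
In outline - invented names, purely illustrative - the gate is one
comparison at the controller:

#include <stdbool.h>

struct mem_ctrl {
    bool     exclusive;   /* the control bit described above */
    unsigned owner;       /* id of the owning processor */
};

/* Gate every request at the controller: when the bit is set, requests
   from any other source are refused, whatever the page tables on the
   other processors happen to say. */
static bool ctrl_admit(const struct mem_ctrl *c, unsigned requester)
{
    return !c->exclusive || requester == c->owner;
}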

As you correctly pointed out, a VM based scheme fails in the presence of
bugs. Which is why I called it a "somewhat guarantee" exclusivity model.

Bernd Paysan

unread,
Dec 18, 2009, 4:51:37 PM12/18/09
to
Andy "Krazy" Glew wrote:
> 1) SMP: shared memory, cache coherent, a relatively strong memory
> ordering model like SC or TSO or PC. Typically writeback cache.
>
> 0) MPI: no shared memory, message passing

You can also have shared "write-only" memory. That's close to the MPI
side of the tradeoffs. Each CPU can read and write its own memory, but
can only write remote memories. The pro side is that all you need is a
similar infrastructure to MPI (send data packets around), and thus it
scales well; also, there are no blocking latencies.

The programming model can be closer to data flow than pure MPI, since
when you only pass data, writing the data to the target destination is
completely sufficient. A "this data is now valid" message might be
necessary (or some log of the memory controller where each CPU can
extract what regions were written to).
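
A minimal single-writer/single-reader mailbox along these lines, with
C11 atomics standing in for whatever the interconnect really provides
(names invented):

#include <stdatomic.h>
#include <string.h>

#define MSG_BYTES 256

struct mailbox {                  /* lives in the owning CPU's memory */
    unsigned char payload[MSG_BYTES];
    atomic_int    valid;          /* the "data is now valid" flag */
};

/* Remote CPU: push the data, then publish it. */
void push(struct mailbox *m, const void *src, size_t n)
{
    memcpy(m->payload, src, n);
    atomic_store_explicit(&m->valid, 1, memory_order_release);
}

/* Owning CPU - the only reader: poll, then consume. */
int try_pull(struct mailbox *m, void *dst, size_t n)
{
    if (!atomic_load_explicit(&m->valid, memory_order_acquire))
        return 0;
    memcpy(dst, m->payload, n);
    atomic_store_explicit(&m->valid, 0, memory_order_relaxed);
    return 1;
}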

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Andy "Krazy" Glew

unread,
Dec 16, 2009, 5:13:01 PM12/16/09
to Mayan Moudgill
Andy "Krazy" Glew wrote:
> Mayan Moudgill wrote:
>> I can't see that there is any benefit to having strictly private
>> memory (PGAS 1. above), at least on a high-performance MP system.
>>
>> The CPUs are going to access memory via a cache. I doubt that there
>> will be 2 separate kinds of caches, one for private and one for the
>> rest of the memory. So, as far as the CPUs are concerned there is no
>> distinction.
>>
>> Since the CPUs are still going to have to talk to a shared memory
>> (PGAS 2. above), there will still be a path/controller between the
>> bottom of the cache hierarchy and the shared memory. This "controller"
>> will have to implement whatever snooping/cache-coherence/transfer
>> protocol is needed by the global memory.

> Even if, in the hardware implementation, private and global memory
> locations are cached in the same way, it may be desirable to
> distinguish them at the language level: the compiler may be able to
> use more efficient synchronization mechanisms for variables that are
> guaranteed to be local private than it can use for global variables
> that might be local or might be remote and might be shared with other
> processors.

I mentioned the possibility of fencing being different for local/private
memory and for global memory.

I forgot to mention the possibility of software controlled cache coherence.

If the compiler has to emit cache flush directives around accesses to
global memory that is cached, and if these directives are as slow as on
present X86, then the compiler definitely wants to know what is private
and what is not.

IMHO this is a good reason to use the DMA model.

If flushing cache is slow, then you may want to distinguish private
memory that can be cached, e.g. in your 2M/core L3 cache, from remote
cacheable memory, caching the latter in a smaller, cheaper-to-flush
structure.
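
As a sketch, with invented cache-maintenance primitives (on present X86
they would be loops of CLFLUSH, which is exactly the slowness complained
of above), compiler-emitted software coherence would look roughly like:

#include <string.h>

/* Stand-ins; a real implementation would loop over cache lines. */
static void cache_invalidate(const void *p, size_t n) { (void)p; (void)n; }
static void cache_writeback(const void *p, size_t n)  { (void)p; (void)n; }

void global_read(void *dst, const void *gsrc, size_t n)
{
    cache_invalidate(gsrc, n);    /* don't trust any cached copy */
    memcpy(dst, gsrc, n);
}

void global_write(void *gdst, const void *src, size_t n)
{
    memcpy(gdst, src, n);
    cache_writeback(gdst, n);     /* push it where others can see it */
}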

Andy "Krazy" Glew

unread,
Dec 21, 2009, 4:57:44 PM12/21/09
to Bernd Paysan
Bernd Paysan wrote:
> Andy "Krazy" Glew wrote:
>> 1) SMP: shared memory, cache coherent, a relatively strong memory
>> ordering model like SC or TSO or PC. Typically writeback cache.
>>
>> 0) MPI: no shared memory, message passing
>
> You can also have shared "write-only" memory. That's close to the MPI
> side of the tradeoffs. Each CPU can read and write its own memory, but
> can only write remote memories. The pro side is that all you need is a
> similar infrastructure to MPI (send data packets around), and thus it
> scales well; also, there are no blocking latencies.
>
> The programming model can be closer to data flow than pure MPI, since
> when you only pass data, writing the data to the target destination is
> completely sufficient. A "this data is now valid" message might be
> necessary (or some log of the memory controller where each CPU can
> extract what regions were written to).

At first I liked this, and then I realized what I liked was the idea of
being able to create linked data structures, readable by anyone, but
only manipulated by the local node - except for the minimal operations
necessary to link new nodes into the data structure.

I don't think that ordinary read/write semantics are acceptable. I
think that you need the ability to "atomically" (for some definition of
atomic - all atomicity is relative) read a large block of data. Used by
a node A to read a data node in node B's memory.

Node A might then allocate new nodes in its own memory. And publish
them as follows, probably using atomic rmw type operations to link the
new node into the old data structure. Compare-and-swap, possibly
fancier ops like atomic insert into hash table.
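
The publish step is the familiar CAS loop - here in C11, though a real
system might run the CAS in node B's memory controller:

#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int          payload;
    struct node *next;
};

static _Atomic(struct node *) head;     /* root of the shared structure */

void publish(int payload)
{
    struct node *n = malloc(sizeof *n); /* allocated in A's own memory */
    if (!n)
        return;
    n->payload = payload;
    struct node *old = atomic_load(&head);
    do {
        n->next = old;                  /* link against observed head */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
}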

(By the way, I have great sympathy for sending chunks of code around -
is that "Actors"? - except that I would like some of these operations to
be handled by memory controller hardware (for any of the several
definitions of memory controller), and it is hard to think of arbitrary
code that is sufficiently constrained.)

nm...@cam.ac.uk

unread,
Dec 22, 2009, 3:56:00 PM12/22/09
to
In article <4B2FEF58...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:
>Bernd Paysan wrote:
>>
>> You can also have shared "write-only" memory. That's close to the MPI
>> side of the tradeoffs. Each CPU can read and write its own memory, but
>> can only write remote memories. The pro side is that all you need is a
>> similar infrastructure to MPI (send data packets around), and thus it
>> scales well; also, there are no blocking latencies.
>>
>> The programming model can be closer to data flow than pure MPI, since
>> when you only pass data, writing the data to the target destination is
>> completely sufficient. A "this data is now valid" message might be
>> necessary (or some log of the memory controller where each CPU can
>> extract what regions were written to).
>
>At first I liked this, and then I realized what I liked was the idea of
>being able to create linked data structures, readable by anyone, but
>only manipulated by the local node - except for the minimal operations
>necessary to link new nodes into the data structure.

Yes, that's a model I have liked for some time. I should be very
interested to know why Bernd regards the other way round as better;
I can't see it, myself, but can't convince myself that it isn't.

>I don't think that ordinary read/write semantics are acceptable. I
>think that you need the ability to "atomically" (for some definition of
>atomic - all atomicity is relative) read a large block of data. Used by
>a node A to read a data node in node B's memory.

I agree, but the problem has been solved for file-systems, where
snapshots are implemented in such a way as to appear to give such
atomic read semantics.

Actually, what I like is the database/BSP semantics. Updates are
purely local, until the owner says "commit", when all other nodes
will see the new structure when they next say "accept". Before
that, they see the old structure. Details of whether commit and
accept should be directed or global are topics for research ....

I think that it could be done fairly easily at the page level,
using virtual memory primitives, but not below unless the cache
line ones were extended.
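
At its coarsest, the software version of commit/accept is just an
atomically installed root: the owner builds a new, immutable version
privately, "commit" is one pointer swap, and a reader sees nothing
change until it chooses to "accept" again. A sketch (reclamation of
superseded versions, the hard part, is omitted):

#include <stdatomic.h>

struct tree;                            /* opaque, immutable version */

static _Atomic(struct tree *) root;     /* latest committed version */

void commit(struct tree *new_version)   /* owner side */
{
    atomic_store_explicit(&root, new_version, memory_order_release);
}

struct tree *accept_version(void)       /* reader side */
{
    /* Everything reached through the returned pointer is the old
       snapshot until the reader calls accept_version() again. */
    return atomic_load_explicit(&root, memory_order_acquire);
}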


Regards,
Nick Maclaren.

Bernd Paysan

unread,
Dec 22, 2009, 5:54:23 PM12/22/09
to
Andy "Krazy" Glew wrote:
>> You can also have shared "write-only" memory. That's close to the MPI
>> side of the tradeoffs. Each CPU can read and write its own memory, but
>> can only write remote memories. The pro side is that all you need is a
>> similar infrastructure to MPI (send data packets around), and thus it
>> scales well; also, there are no blocking latencies.
>>
>> The programming model can be closer to data flow than pure MPI, since
>> when you only pass data, writing the data to the target destination is
>> completely sufficient. A "this data is now valid" message might be
>> necessary (or some log of the memory controller where each CPU can
>> extract what regions were written to).
>
> At first I liked this, and then I realized what I liked was the idea
> of being able to create linked data structures, readable by anyone,
> but only manipulated by the local node - except for the minimal
> operations necessary to link new nodes into the data structure.

That's the other way round, i.e. single writer, multiple readers (pull
data in). What I propose is single reader, multiple writer (push data
out).

> I don't think that ordinary read/write semantics are acceptable. I
> think that you need the ability to "atomically" (for some definition
> of atomic - all atomicity is relative) read a large block of data.
> Used by a node A to read a data node in node B's memory.

Works by asking B to send data over to A.

> Node A might then allocate new nodes in its own memory. And publish
> them as follows, probably using atomic rmw type operations to link the
> new node into the old data structure. Compare-and-swap, possibly
> fancier ops like atomic insert into hash table.
>
> (By the way, I have great sympathy for sending chunks of code around -
> is that "Actors"? - except that I would like some of these operations
> to be handled by memory controller hardware (for any of the several
> definitions of memory controller), and it is hard to think of
> arbitrary code that is sufficiently constrained.)

Sending chunks of code around which are automatically executed by the
receiver is called "active messages". I not only like the idea, a
friend of mine has done that successfully for decades (the messages in
question were Forth source - it was a quite high level of active
messages). Doing that in the memory controller looks like a good idea
for me, too, at least for that kind of code a memory controller can
handle. The good thing about this is that you can collect all your
"orders", and send them in one go - this removes a lot of latency,
especially if your commands can include something like compare&swap or
even a complete "insert into list/hash table" (that, unlike
compare&swap, won't fail).

Robert Myers

unread,
Dec 23, 2009, 12:48:18 AM12/23/09
to
I don't know all the buzz words, so forgive me.

If you know the future (or the dataflow graph ahead of time), you can
assemble packets of whatever. Could be any piece of the problem:
code, data, meta-data, meta-code,... whatever, and send it off to some
location where it knows that the other pieces that are needed for that
piece of the problem will also arrive, pushed from who-cares-where.
When enough pieces are in hand to act on, the receiving location acts
on whatever pieces it can. When any piece of anything that can be
used elsewhere is finished, it is sent on to wherever.

The only requirement is that there is some agent like a DNS that can
tell pieces with particular characteristics the arbitrarily chosen
processors (or collections of processors) to which they should migrate
for further use, and that receiving agents are not required to do
anything but wait until they have enough information to act on, and
the packets themselves will inform the receiving agent what else is
needed for further action (but not where it can be found).

Many problems seem to disappear as if by magic: the need for instruction
and data prefetch (two separate prediction processes), latency issues,
need for cache, and the need to invent elaborate constraints on what
kinds of packets can be passed around, as the structure (and, in
effect, the programming language) can be completely ad hoc.
Concurrency doesn't even seem to be an issue. It's a bit like an
asynchronous processor, and it seems implementable in any circumstance
where a data-push model can be implemented.
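
A toy firing rule for the receiving agent need be no more than this
sketch (names invented):

struct task {
    int  needed;              /* carried in the packets themselves */
    int  arrived;
    void (*act)(struct task *);
};

/* Receiving agent: count arrivals and fire when complete; results are
   then pushed onward to wherever the DNS-like agent directs. */
void on_packet(struct task *t)
{
    if (++t->arrived == t->needed)
        t->act(t);
}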

I know (or hope) that I'll be told that it's all been thought of and
tried and the reasons why it is impractical. That's the point of the
post.

Robert.

Bernd Paysan

unread,
Dec 23, 2009, 9:05:03 AM12/23/09
to
Robert Myers wrote:
> I don't know all the buzz words, so forgive me.

Buzz words are only useful for "buzzword bingo" and when feeding search
engines ;-).

> If you know the future (or the dataflow graph ahead of time), you can
> assemble packets of whatever. Could be any piece of the problem:
> code, data, meta-data, meta-code,... whatever, and send it off to some
> location where it knows that the other pieces that are needed for that
> piece of the problem will also arrive, pushed from who-cares-where.
> When enough pieces are in hand to act on, the receiving location acts
> on whatever pieces it can. When any piece of anything that can be
> used elsewhere is finished, it is sent on to wherever. The only
> requirement is that there is some agent like a DNS that can tell
> pieces with particular characteristics the arbitrarily chosen
> processors (or collections of processors) to which they should migrate
> for further use, and that receiving agents are not required to do
> anything but wait until they have enough information to act on, and
> the packets themselves will inform the receiving agent what else is
> needed for further action (but not where it can be found). Many
> problems seem to disappear as if by magic: the need for instruction
> and data prefetch (two separate prediction processes), latency issues,
> need for cache, and the need to invent elaborate constraints on what
> kinds of packets can be passed around, as the structure (and, in
> effect, the programming language) can be completely ad hoc.
> Concurrency doesn't even seem to be an issue. It's a bit like an
> asynchronous processor, and it seems implementable in any circumstance
> where a data-push model can be implemented.

Indeed.

> I know (or hope) that I'll be told that it's all been thought of and
> tried and the reasons why it is impractical. That's the point of the
> post.

It has been tried and it works - you can find a number of papers about
active message passing from various universities. However, it seems
that most people try to implement standard protocols like MPI on
top of it, so the benefits might be smaller than expected. And as Andy
already observed: Most people seem to be more comfortable with
sequential programming. Using such an active message system makes the
parallel programming quite explicit - you model a data flow graph, you
create packets with code and data, and so on.

Terje Mathisen

unread,
Dec 23, 2009, 12:21:59 PM12/23/09
to
Bernd Paysan wrote:
> Sending chunks of code around which are automatically executed by the
> receiver is called "active messages". I not only like the idea, a
> friend of mine has done that successfully for decades (the messages in
> question were Forth source - it was quite a high-level form of active
> messages).  Doing that in the memory controller looks like a good idea
> to me, too, at least for the kind of code a memory controller can
> handle. The good thing about this is that you can collect all your
> "orders", and send them in one go - this removes a lot of latency,
> especially if your commands can include something like compare&swap or
> even a complete "insert into list/hash table" (that, unlike
> compare&swap, won't fail).
>
Why do I feel that this feels a lot like IBM mainframe channel programs?
:-)

(Security is of course implicit here: If you _can_ send the message,
you're obviously safe, right?)

Terje
PS. This is my very first post from my personal leafnode installation: I
have free news access via my home (fiber) ISP, but not here in Rauland
on Christmas/New Year vacation, so today I finally broke down and
installed leafnode on my home FreeBSD gps-based ntp server. :-)

Robert Myers

unread,
Dec 23, 2009, 3:29:41 PM12/23/09
to
On Dec 23, 12:21 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:

> Bernd Paysan wrote:
> > Sending chunks of code around which are automatically executed by the
> > receiver is called "active messages".  I not only like the idea, a
> > friend of mine has done that successfully for decades (the messages in
> > question were Forth source - it was quite a high-level form of active
> > messages).  Doing that in the memory controller looks like a good idea
> > to me, too, at least for the kind of code a memory controller can
> > handle.  The good thing about this is that you can collect all your
> > "orders", and send them in one go - this removes a lot of latency,
> > especially if your commands can include something like compare&swap or
> > even a complete "insert into list/hash table" (that, unlike
> > compare&swap, won't fail).
>
> Why do I feel that this feels a lot like IBM mainframe channel programs?
> :-)

Could I persuade you to take time away from your first love
(programming your own computers, of course) to elaborate/pontificate a
bit? After forty years, I'm still waiting for someone to tell me
something interesting about mainframes. Well, other than that IBM bet
big and won big on them.

And CHANNELS. Well. That's clearly like the number 42.

Robert.

Bernd Paysan

unread,
Dec 23, 2009, 6:14:16 PM12/23/09
to
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> Why do I feel that this feels a lot like IBM mainframe channel
> programs?
> :-)

But there's a fundamental difference: A channel program is executed on
your side. An active message is executed on the other side (when you
use a channel-based memory system, you'll send a message to your
communication channel to send the message over to the other computer).

> (Security is of course implicit here: If you _can_ send the message,
> you're obviously safe, right?)

It all depends. You can implement something similar to the Internet
using active messages, and there, of course, every message would be
potentially hostile. Solution: keep the message instruction set
simple, and have a rigid framework protecting you against malicious
messages.

As said before, the successful active message system my friend has made
is based on Forth source code - this is probably the worst thing for
security, but also the most powerful and robust one. Naming it "Skynet"
is not too far off - it is extremely robust, since each node can
download all the code it needs from a repository or even from other
nodes. I would not use such a scheme to implement a secure network with
public access.

> Terje
> PS. This is my very first post from my personal leafnode installation:
> I have free news access via my home (fiber) ISP, but not here in
> Rauland on Christmas/New Year vacation, so today I finally broke down
> and installed leafnode on my home FreeBSD gps-based ntp server. :-)

I have used leafnode locally for a decade or so now; it does a good job
of message prefetching, and it can also be used to hide details like
where my actual news feed is coming from.

Robert Myers

unread,
Dec 23, 2009, 6:22:20 PM12/23/09
to
On Dec 23, 9:05 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:

>
> It has been tried and it works - you can find a number of papers about
> active message passing from various universities.  However, it seems
> that most people try to implement standard protocols like MPI on
> top of it, so the benefits might be smaller than expected.  And as Andy
> already observed: Most people seem to be more comfortable with
> sequential programming.  Using such an active message system makes the
> parallel programming quite explicit - you model a data flow graph, you
> create packets with code and data, and so on.

Maybe I can add, even if it's already in the literature, that such a
computing model makes the non-uniform address space problem disappear,
as well. One process pushes its packets to another. It can even
happen by DMA, just so long as there is a way to refer uniformly to
the input buffers of receiving locations. Then, both the sending and
the receiving process can address memory (which is entirely private,
except for the input buffer) in whatever idiosyncratic ways they care
to.

Forgive me for this, please. I was prepared with a post like: look,
you can't trust people to do it (non-trivially concurrent programming)
in a flat, uniform SMP address space, how in the name of heaven will
anyone do it correctly with heterogeneous everything? I think I just
answered my own question.

Robert.

Anne & Lynn Wheeler

unread,
Dec 23, 2009, 7:57:19 PM12/23/09
to

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> Why do I feel that this feels a lot like IBM mainframe channel programs?
> :-)

downside was that mainframe channel programs were half-duplex end-to-end
serialization. there were all sorts of heat & churn in fiber-channel
standardization with the efforts to overlay mainframe channel program
(half-duplex, end-to-end serialization) paradigm on underlying
full-duplex asynchronous operation.

from the days of scarce, very expensive electronic storage
... especially disk channel programs ... used "self-modifying" operation
... i.e. read operation would fetch the argument used by the following
channel command (both specifying the same real address). couple round
trips of this end-to-end serialization potentially happening over 400'
channel cable within small part of disk rotation.

trying to get a HYPERChannel "remote device adapter" (simulated
mainframe channel) working at extended distances with disk controller &
drives ... took a lot of sleight of hand. a copy of the
completed mainframe channel program was created and downloaded into the
memory of the remote device adapter .... to minimize the
command-to-command latency. the problem was that some of the disk
command arguments had very tight latencies ... and so those arguments
had to be recognized and also downloaded into the remote device adapter
memory (and the related commands redone to fetch/store to the local
adapter memory rather than the remote mainframe memory). this process
was never extended to be able to handle the "self-modifying" sequences.

on the other hand ... there was a serial-copper disk project that
effectively packetized SCSI commands ... sent them down outgoing
link ... and allowed asynchronous return on the incoming link
... eliminating loads of the scsi latency. we tried to get this morphed
into interoperating with fiber-channel standard ... but it morphed into
SSA instead.

--
40+yrs virtualization experience (since Jan68), online at home since Mar1970
