Larrabee delayed: anyone know what's happening?


Mayan Moudgill

unread,
Dec 5, 2009, 8:46:11 PM12/5/09
to

All I've come across is the announcement that Larrabee has been delayed,
with the initial consumer version cancelled. Anyone know something more
substantive?

nm...@cam.ac.uk

unread,
Dec 6, 2009, 3:57:55 AM12/6/09
to
In article <xv2dnXl4eYh1kYbW...@bestweb.net>,

Eh? As far as I know, Intel have NEVER announced any plans for a
consumer version of Larrabee - it always was an experimental chip.
There was a chance that they would commoditise it, for experimental
purposes, but that didn't seem to pan out. Their current plans are
indicated here:

http://techresearch.intel.com/articles/Tera-Scale/1421.htm

They hope to have systems shortly, and to allow selected people
online access from mid-2010, so I would guess that the first ones
that could be bought would be in early 2011. If all goes well.

I have absolutely NO idea of where they are thinking of placing it,
or what scale of price they are considering.


Regards,
Nick Maclaren.

Michael S

unread,
Dec 6, 2009, 11:16:17 AM12/6/09
to
On Dec 6, 10:57 am, n...@cam.ac.uk wrote:
> In article <xv2dnXl4eYh1kYbWnZ2dnUVZ_q2dn...@bestweb.net>,

Nick, SCC and Larrabee are different species. Both have plenty of
relatively simple x86 cores on a single chip, but that's about the only
thing they have in common.

1. Larrabee cores are cache-coherent, SCC cores are not.
2. Larrabee interconnects have ring topology, SCC is a mesh
3. Larrabee cores are about vector performance (512-bit SIMD) and SMT
(4 hardware threads per core). SCC cores are supposed to be stronger
than Larrabee on scalar code and much much weaker on vector code.
4. Larrabee was originally intended for consumers, both as high-end 3D
graphics engine and as sort-of-GPGPU. Graphics as target for 1st
generation chip is canceled, but it is still possible that it would be
shipped to paying customers as GPGPU. SCC, on the other hand, is
purely experimental.

Michael S

unread,
Dec 6, 2009, 11:19:09 AM12/6/09
to

Michael S

unread,
Dec 6, 2009, 12:05:58 PM12/6/09
to
On Dec 6, 6:16 pm, Michael S <already5cho...@yahoo.com> wrote:
> 4. Larrabee was originally intended for consumers, both as high-end 3D
> graphics engine and as sort-of-GPGPU. Graphics as target for 1st
> generation chip is canceled, but it still possible that it would be
> shipped to paying customers as GPGPU.

Sorry, I missed the latest round of news. In fact GPGPU is canceled
together with GPU. So now 45nm LRB is officially "a prototype".
http://www.anandtech.com/weblog/showpost.aspx?i=659

nm...@cam.ac.uk

unread,
Dec 6, 2009, 12:39:39 PM12/6/09
to
In article <db0caa7f-6e7f-4fe2...@v37g2000vbb.googlegroups.com>,

Michael S <already...@yahoo.com> wrote:
>
>Nick, SCC and Larrabee are different species. Both have plenty of
>relatively simple x86 cores on a single chip, but that's about the only
>thing they have in common.
>
>1. Larrabee cores are cache-coherent, SCC cores are not.
>2. Larrabee interconnects have ring topology, SCC is a mesh
>3. Larrabee cores are about vector performance (512-bit SIMD) and SMT
>(4 hardware threads per core). SCC cores are supposed to be stronger
>than Larrabee on scalar code and much much weaker on vector code.

Thanks for the correction.

I have been fully occupied with other matters, and
so seem to have missed some developments. Do you have a pointer
to any technical information?

>4. Larrabee was originally intended for consumers, both as high-end 3D
>graphics engine and as sort-of-GPGPU. Graphics as target for 1st
>generation chip is canceled, but it still possible that it would be
>shipped to paying customers as GPGPU. SCC, on the other hand, is
>purely experimental.

Now, there I beg to disagree. I have never seen anything reliable
indicating that Larrabee has ever been intended for consumers,
EXCEPT as a 'black-box' GPU programmed by 'Intel partners'. And
some of that information came from semi-authoritative sources in
Intel. Do you have a reference to a conflicting statement from
someone in Intel?


Regards,
Nick Maclaren.

Andy "Krazy" Glew

unread,
Dec 7, 2009, 9:28:19 AM12/7/09
to

I can guess.

Part of my guess is that this is related to Pat Gelsinger's departure.
Gelsinger was (a) ambitious, intent on becoming Intel CEO (said so in
his book), (b) publicly very much behind Larrabee.

I'm guessing that Gelsinger was trying to ride Larrabee as his ticket to
the next level of executive power. And when Larrabee did not pan out
as well as he might have liked, he left. And/or conversely: when
Gelsinger left, Larrabee lost its biggest executive proponent. Although
my guess is that it was technology wagging the executive career tail: no
amount of executive positioning can make a technology shippable when it
isn't ready.

However, I would not count Larrabee out yet. Hiccups happen.

Although I remain an advocate of GPU style coherent threading
microarchitectures - I think they are likely to be more power efficient
than simple MIMD, whether SMT/HT or MCMT - the pull of X86 will be
powerful. Eventually we will have X86 MIMD/SMT/HT in-order vs X86 MCMT.
Hetero almost guaranteed. Only question will be hetero OOO/in-order, or hetero
X86 MCMT/GPU. Could be hetero X86 OOO & X86 w/ GPU-style Coherent
Threading. The latter could even be CT/OOO. But these "could be"s have
no sightings.

Andy "Krazy" Glew

unread,
Dec 7, 2009, 9:51:49 AM12/7/09
to nm...@cam.ac.uk
nm...@cam.ac.uk wrote:

> Now, there I beg to disagree. I have never seen anything reliable
> indicating that Larrabee has ever been intended for consumers,
> EXCEPT as a 'black-box' GPU programmed by 'Intel partners'. And
> some of that information came from semi-authoritative sources in
> Intel. Do you have a reference to an conflicting statement from
> someone in Intel?

http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-and-the-future-of-computing/

Just a blog, not official, although of course anything blogged at Intel
is semi-blest (believe me, I know the flip side.)

Del Cecchi

unread,
Dec 7, 2009, 1:00:44 PM12/7/09
to

"Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message
news:4B1D1685...@patten-glew.net...

Does this mean Larrabee won't be the engine for the PS4?

We were assured that it was not long ago.

del


Robert Myers

unread,
Dec 7, 2009, 1:25:42 PM12/7/09
to
On Dec 7, 9:51 am, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

> n...@cam.ac.uk wrote:
> > Now, there I beg to disagree.  I have never seen anything reliable
> > indicating that Larrabee has ever been intended for consumers,
> > EXCEPT as a 'black-box' GPU programmed by 'Intel partners'.  And
> > some of that information came from semi-authoritative sources in
> > Intel.  Do you have a reference to an conflicting statement from
> > someone in Intel?
>
> http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-an...

>
> Just a blog, not official, although of course anything blogged at Intel
> is semi-blest (believe me, I know the flip side.)

The blog post reminded me. I have assumed, for years, that Intel
planned on putting many (>>4) x86 cores on a single die. I'm sure I
can find Intel presentations from the nineties that seem to make that
clear if I dig hard enough.

From the very beginning, Larrabee seemed to be a technology of destiny
in search of a mission, and the first, most obvious mission for any
kind of massive parallelism is graphics. Thus, Intel explaining why
it would introduce Larrabee at Siggraph always seemed a case of
offering an explanation where none should have been needed, unless the
explanation was something they weren't sure they believed themselves (or
that anyone else would). It just seemed like the least implausible mission
for hardware that had been designed to a concept rather than to a
mission. A more plausible claim that they were aiming at HPC probably
wouldn't have seemed like a very attractive business proposition for a
company the size of Intel.

Also from the beginning, I wondered if Intel seriously expected to be
able to compete at the high end with dedicated graphics engines using
x86 cores. Either there was something about the technology I was
missing completely, it was just another Intel bluff, or the "x86"
cores that ultimately appeared on a graphics chip for the market would be
to an x86 as we know it as, say, a ladybug is to a dalmatian.

Robert.

nm...@cam.ac.uk

unread,
Dec 7, 2009, 5:39:27 PM12/7/09
to
In article <4B1D1685...@patten-glew.net>,

I don't see anything in that that even hints at plans to make
Larrabee available for consumer use. It could just as well be a
probe to test consumer interest - something that even I do!


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Dec 7, 2009, 6:05:02 PM12/7/09
to
In article <b81b8239-b43c-46e9...@k13g2000prh.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>
>The blog post reminded me. I have assumed, for years, that Intel
>planned on putting many (>>4) x86 cores on a single-die. I'm sure I
>can find Intel presentations from the nineties that seem to make that
>clear if I dig hard enough.

Yes. But the word "planned" implies a degree of deliberate action
that I believe was absent. They assuredly blithered on about it,
and very probably had meetings about it ....

>From the very beginning, Larrabee seemed to be a technology of destiny
>in search of a mission, and the first, most obvious mission for any

>kind of massive parallelism is graphics. ...

Yes. But what they didn't seem to understand is that they should
have treated it as an experiment. I tried to persuade them that
they needed to make it widely available and cheap, so that the mad
hackers would start to play with it, and see what developed.
Perhaps nothing, but it wouldn't have been Intel's effort that was
wasted.

The same was true of Sun, but they had less margin for selling CPUs
at marginal cost.


Regards,
Nick Maclaren.

Andy "Krazy" Glew

unread,
Dec 7, 2009, 11:04:17 PM12/7/09
to Del Cecchi
Del Cecchi wrote:
> "Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message news:4B1D1685...@patten-glew.net...
>> nm...@cam.ac.uk wrote:
>>
>>> I have never seen anything reliable
>>> indicating that Larrabee has ever been intended for consumers,
>>
>> http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-and-the-future-of-computing/
>>
>> Just a blog, not official, although of course anything blogged at
>> Intel is semi-blest (believe me, I know the flip side.)
>
> Does this mean Larrabee won't be the engine for the PS4?
>
> We were assured that it was not long ago.

My guess is that Intel was pushing for Larrabee to be the PS4 chip.

And, possibly, Sony agreed. Not unreasonably, if Intel had made a
consumer grade Larrabee. Since Larrabee's big pitch is programmability
- cache coherence, MIMD, vectors, familiar stuff. As opposed to the
Cell's idiosyncrasies and programmer hostility, which are probably in
large part to blame for Sony's lack of success with the PS3.

Given the present Larrabee situation, Sony is probably scrambling. Options:

a) go back to Cell.

b) more likely, eke out a year or so with Cell and a PS4 stretch, and
then look around again - possibly at the next Larrabee

c) AMD/ATI Fusion

d) Nvidia? Possibly with the CPU that Nvidia is widely rumored to be
working on.

AMD/ATI and Nvidia might seem the most reasonable, except that both
companies have had trouble delivering. AMD/ATI look best now, but
Nvidia has more "vision". Whatever good that will do them.

Larrabee's attractions remain valid. It is more programmer friendly.
But waiting until Larrabee is ready may be too painful.

Historically, game consoles have a longer lifetime than PCs. They were
programmed closer to the metal, and hence needed stability in order to
warrant software investment.

But DX10-DX11 and OpenGL are *almost* good enough for games. And allow
migrating more frequently to the latest and greatest.

Blue-sky possibility: the PS3-PS4 transition breaking with the tradition
of console stability. The console might stay stable in form factor, UI
and devices - screen pixels, joysticks, etc. - but may start
changing the underlying compute and graphics engine more quickly than in
the past.

Related: net games.

Torben Ægidius Mogensen

unread,
Dec 8, 2009, 3:45:09 AM12/8/09
to
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:

> Although I remain an advocate of GPU style coherent threading
> microarchitectures - I think they are likely to be more power
> efficient than simple MIMD, whether SMT/HT or MCMT - the pull of X86
> will be powerful.

The main (only?) advantage of the x86 ISA is for running legacy software
(yes, I do consider Windows to be legacy software). And I don't see
this applying for Larrabee -- you can't exploit the parallelism when you
run dusty decks.

When developing new software, you want to use high-level languages and
don't really care too much about the underlying instruction set -- the
programming model you have to use (i.e., shared memory versus message
passing, SIMD vs. MIMD, etc.) is much more important, and that is
largely independent of the ISA.

Torben

nm...@cam.ac.uk

unread,
Dec 8, 2009, 4:27:32 AM12/8/09
to
In article <4B1DD041...@patten-glew.net>,
Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:

>Del Cecchi wrote:
>>
>> Does this mean Larrabee won't be the engine for the PS4?
>>
>> We were assured that it was not long ago.
>
>My guess is that Intel was pushing for Larrabee to be the PS4 chip.
>
>And, possibly, Sony agreed. Not unreasonably, if Intel had made a
>consumer grade Larrabee. Since Larrabee's big pitch is programmability
>- cache coherence, MIMD, vectors, familiar stuff. As opposed to the
>Cell's idiosyncrasies and programmer hostility, which are probably in
>large part to blame for Sony's lack of success with the PS3.

Could be. That would be especially relevant if Sony were planning
to break out of the 'pure' games market and produce a 'home
entertainment centre'. Larrabee's pitch implied that it would have
been simple to add general Internet access, probably including VoIP,
and quite possibly online ordering, Email etc. We know that some of
the marketing organisations are salivating at the prospect of being
able to integrate games playing, television and online ordering.

I am pretty sure that both Sun and Intel decided against the end-user
market because they correctly deduced that it would not return a
profit but, in my opinion incorrectly, did not think that it might
open up new opportunities. But why Intel seem to have decided
against the use described above is a mystery - perhaps because, like
Motorola with the 88000 as a desktop chip, every potential partner
backed off. And perhaps for some other reason - or perhaps the
rumour of its demise is exaggerated - I don't know.

I heard some interesting reports about the 48-thread CPU yesterday,
incidentally. It's unclear whether it's any more focussed than Larrabee.


Regards,
Nick Maclaren.

Ken Hagan

unread,
Dec 8, 2009, 5:45:52 AM12/8/09
to
On Tue, 08 Dec 2009 08:45:09 -0000, Torben Ægidius Mogensen
<tor...@diku.dk> wrote:

> The main (only?) advantage of the x86 ISA is for running legacy software
> (yes, I do consider Windows to be legacy software). And I don't see
> this applying for Larrabee -- you can't exploit the parallelism when you
> run dusty decks.

But you can exploit the parallelism where you really needed it and carry
on using the dusty decks for all the other stuff, without which you don't
have a rounded product.

nm...@cam.ac.uk

unread,
Dec 8, 2009, 6:44:29 AM12/8/09
to
In article <op.u4l76...@khagan.ttx>,
Ken Hagan <K.H...@thermoteknix.com> wrote:
>On Tue, 08 Dec 2009 08:45:09 -0000, Torben Ægidius Mogensen

That was the theory. We don't know how well it would have panned out,
but it is clearly a sane objective.


Regards,
Nick Maclaren.

Andrew Reilly

unread,
Dec 8, 2009, 7:14:08 AM12/8/09
to
On Tue, 08 Dec 2009 09:27:32 +0000, nmm1 wrote:

> Larrabee's pitch implied that it would have been simple to add general
> Internet access, probably including VoIP, and quite possibly online
> ordering, Email etc.

Why do you suggest that internet access, VoIP or online ordering are
impossible or even hard on existing Cell? It's a full-service Unix
engine, aside from all of the rendering business. Linux runs on it,
which means that all of the interesting browsers run on it just fine.

Sure, there's an advertising campaign (circa NetBurst) that says that
Intel makes the internet work better, but we're not buying that, are we?

Cheers,

--
Andrew

nm...@cam.ac.uk

unread,
Dec 8, 2009, 7:33:36 AM12/8/09
to
In article <7o6u8fF...@mid.individual.net>,

Andrew Reilly <areil...@bigpond.net.au> wrote:
>
>> Larrabee's pitch implied that it would have been simple to add general
>> Internet access, probably including VoIP, and quite possibly online
>> ordering, Email etc.
>
>Why do you suggest that internet access, VoIP or online ordering are
>impossible or even hard on existing Cell? It's a full-service Unix
>engine, aside from all of the rendering business.

Quite a lot of (indirect) feedback from people who have tried using
it, as well as the not-wholly-unrelated Blue Gene. The killer is that
it is conceptually different from 'mainstream' systems, and so each
major version of each product is likely to require extensive work,
possibly including reimplementation or the implementation of a new
piece of infrastructure. That's a long-term sink of effort.

As a trivial example of the sort of problem, a colleague of mine has
some systems with NFS-mounted directories, but where file locking is
disabled (for good reasons). Guess what broke at a system upgrade?

> Linux runs on it,
>which means that all of the interesting browsers run on it just fine.

It means nothing of the sort - even if you mean a fully-fledged system
environment by "Linux", and not just a kernel and surrounding features,
there are vast areas of problematic facilities that most browsers use
that are not needed for a reasonable version of Linux.

>Sure, there's an advertising campaign (circa NetBurst) that says that
>intel makes the internet work better, but we're not buying that, are we?

Of course not.


Regards,
Nick Maclaren.

ChrisQ

unread,
Dec 8, 2009, 8:07:18 AM12/8/09
to
nm...@cam.ac.uk wrote:

>
>> Linux runs on it,
>> which means that all of the interesting browsers run on it just fine.
>
> It means nothing of the sort - even if you mean a fully-fledged system
> environment by "Linux", and not just a kernel and surrounding features,
> there are vast areas of problematic facilities that most browsers use
> that are not needed for a reasonable version of Linux.
>

For example?

Once you have an OS kernel and drivers on top of the hardware, the hardware is
essentially isolated and anything that can compile should run with few
problems. OK, it may mean that the code runs on one of the n available
processors under the hood, but it should run...

Regards,

Chris

ChrisQ

unread,
Dec 8, 2009, 8:13:32 AM12/8/09
to

The obvious question then is: would one of many x86 cores be fast enough
on its own to run legacy Windows code like Office, Photoshop etc.?...

Regards,

Chris

Andy "Krazy" Glew

unread,
Dec 8, 2009, 10:12:51 AM12/8/09
to Torben Ægidius Mogensen


I wish that this were so.

I naively thought it were so, e.g. for big supercomputers. After all,
they compile all of their code from scratch, right? What do they care
if the actual parallel compute engines are non-x86? Maybe have an x86 in
the box, to run legacy stuff.

Unfortunately, they do care. It may not be the primary concern - after
all, they often compile their code from scratch. But, if not primary,
it is one of the first of the secondary concerns.

Reason: Tools. Ubiquity. Libraries. Applies just as much to Linux as to
Windows. You are running along fine on your non-x86 box, and then
realize that you want to use some open source library that has been
developed and tested mainly on x86. You compile from source, and there
are issues. All undoubtedly solvable, but NOT solved right away. So as
a result, you either can't use the latest and greatest library, or you
have to fix it.

Like I said, this was supercomputer customers telling me this. Not all
- but maybe 2/3rds. Also, especially, the supercomputer customers'
sysadmins.

Perhaps supercomputers are more legacy x86 sensitive than game consoles...

I almost believed this when I wrote it. And then I thought about flash:

... Than game consoles that want to start running living room
mediacenter applications. That want to start running things like x86
binary plugins, and Flash. Looking at

http://www.adobe.com/products/flashplayer/systemreqs/

The following minimum hardware configurations are recommended for
an optimal playback experience: ... all x86, + PowerPC G5.

I'm sure that you can get a version that runs on your non-x86,
non-PowerPC platform. ... But it's a hassle.

===

Since I would *like* to work on chips in the future as I have in the
past, and since I will never work at Intel or AMD again, I *want* to
believe that non-x86s can be successful. I think they can be
successful. But we should not fool ourselves: there are significant
obstacles, even in the most surprising market segments where x86
compatibility should not be that much of an issue.

We, the non-x86 forces of the world, need to recognize those obstacles,
and overcome them. Not deny their existence.

Bernd Paysan

unread,
Dec 8, 2009, 12:33:10 PM12/8/09
to
Andy "Krazy" Glew wrote:
> I almost believed this when I wrote it. And then I thought about flash:
>
> ... Than game consoles that want to start running living room
> mediacenter applications. That want to start running things like x86
> binary plugins, and Flash. Looking at
>
> http://www.adobe.com/products/flashplayer/systemreqs/
>
> The following minimum hardware configurations are recommended for
> an optimal playback experience: ... all x86, + PowerPC G5.
>
> I'm sure that you can get a version that runs on your non-x86,
> non-PowerPC platform. ... But it's a hassle.

It's mainly a deal between the platform maker and Adobe. Consider another
market, where x86 is non-existent: Smartphones. They are now real
computers, and Flash is an issue. Solution: Adobe ports the Flash plugin
over to ARM, as well. They already have Flash 9.4 ported (runs on the Nokia
N900), and Flash 10 will get an ARM port soon, as well, and spread around to
more smartphones. Or Skype: also necessary, also proprietary, but also
available on ARM. As long as the device maker cares, it's their hassle, not
the user's hassle (and even on a "free software only" Netbook Ubuntu, installing
the Flash plugin is too much of a hassle for it to be considered fine for mere
mortals).

This of course would be much less of a problem if Flash wasn't something
proprietary from Adobe, but an open standard (or at least based on an open
source platform), like HTML.

Note, however, that even for a console maker, backward compatibility with the
previous platform is an issue. Sony put the complete PS2 logic (packed into
a newer, smaller chip) on the first PS3 generation to allow people to play
PS2 games with their PS3. If they completely change architecture with the
PS4, will they do that again? Or are they now fed up with this problem, and
will they decide to go to x86 and be done with that recurring problem?

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Del Cecchi

unread,
Dec 9, 2009, 12:03:19 AM12/9/09
to

"Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message
news:4B1DD041...@patten-glew.net...

> Del Cecchi wrote:
>> "Andy "Krazy" Glew" <ag-...@patten-glew.net> wrote in message
>> news:4B1D1685...@patten-glew.net...
>>> nm...@cam.ac.uk wrote:
>>>
>>>> I have never seen anything reliable
>>>> indicating that Larrabee has ever been intended for consumers,
>>>
>>> http://software.intel.com/en-us/blogs/2008/08/11/siggraph-larrabee-and-the-future-of-computing/
>>>
>>> Just a blog, not official, although of course anything blogged at
>>> Intel is semi-blest (believe me, I know the flip side.)
>>
>> Does this mean Larrabee won't be the engine for the PS4?
>>
>> We were assured that it was not long ago.
>
> My guess is that Intel was pushing for Larrabee to be the PS4 chip.
>
> And, possibly, Sony agreed. Not unreasonably, if Intel had made a
> consumer grade Larrabee. Since Larrabee's big pitch is
> programmability - cache coherence, MIMD, vectors, familiar stuff.
> As opposed to the Cell's idiosyncrasies and programmer hostility,
> which are probably in large part to blame for Sony's lack of success
> with the PS3.

I believe Cell was Sony's idea in the first place. I could be wrong
about that, but it was sure the vibe at the time. And Sony's lateness
and high price were at least as much due to the Blu-ray drive
included, which did lead to them winning the high-definition format war.

Torben Ægidius Mogensen

unread,
Dec 9, 2009, 3:47:40 AM12/9/09
to
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:

> Torben Ægidius Mogensen wrote:

>> When developing new software, you want to use high-level languages and
>> don't really care too much about the underlying instruction set -- the
>> programming model you have to use (i.e., shared memory versus message
>> passing, SIMD vs. MIMD, etc.) is much more important, and that is
>> largely independent of the ISA.

> I naively thought it were so, e.g. for big supercomputers. After all,


> they compile all of their code from scratch, right? What do they care
> if the actual parallel compute engines are non-x86? Maybe have an x86
> in the box, to run legacy stuff.
>
> Unfortunately, they do care. It may not be the primary concern -
> after all, they often compile their code from scratch. But, if not
> primary, it is one of the first of the secondary concerns.
>
> Reason: Tools. Ubiquity. Libraries. Applies just as much to Linux as
> to Windows. You are running along fine on your non-x86 box, and then
> realize that you want to use some open source library that has been
> developed and tested mainly on x86. You compile from source, and
> there are issues. All undoubtedly solvable, but NOT solved right
> away. So as a result, you either can't use the latest and greatest
> library, or you have to fix it.
>
> Like I said, this was supercomputer customers telling me this. Not
> all - but maybe 2/3rds. Also, especially, the supercomputer
> customers' sysadmins.

Libraries are, of course, important to supercomputer users. But if they
are written in a high-level language and the new CPU uses the same
representation of floating-point numbers as the old (e.g., IEEE), they
should compile to the new platform. Sure, some low-level optimisations
may not apply, but if the new platform is a lot faster than the old,
that may not matter. And you can always address the optimisation issue
later.

Besides, until recently supercomputers were not mainly x86-based.

> Perhaps supercomputers are more legacy x86 sensitive than game consoles...
>
> I almost believed this when I wrote it. And then I thought about flash:
>
> ... Than game consoles that want to start running living room
> mediacenter applications. That want to start running things like x86
> binary plugins, and Flash. Looking at
>
> http://www.adobe.com/products/flashplayer/systemreqs/
>
> The following minimum hardware configurations are recommended for
> an optimal playback experience: ... all x86, + PowerPC G5.
>
> I'm sure that you can get a version that runs on your non-x86,
> non-PowerPC platform. ... But it's a hassle.

Flash is available on ARM too. And if another platform becomes popular,
Adobe will port Flash to this too. But that is not the issue: Flash
doesn't run on the graphics processor, it runs on the main CPU, though
it may use the graphics processor through a standard API that hides the
details of the GPU ISA.

Torben

nm...@cam.ac.uk

unread,
Dec 9, 2009, 4:42:14 AM12/9/09
to
In article <7zzl5sr...@pc-003.diku.dk>,

Torben Ægidius Mogensen <tor...@diku.dk> wrote:
>"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:
>
>>
>> Reason: Tools. Ubiquity. Libraries. Applies just as much to Linux as
>> to Windows. You are running along fine on your non-x86 box, and then
>> realize that you want to use some open source library that has been
>> developed and tested mainly on x86. You compile from source, and
>> there are issues. All undoubtedly solvable, but NOT solved right
>> away. So as a result, you either can't use the latest and greatest
>> library, or you have to fix it.
>>
>> Like I said, this was supercomputer customers telling me this. Not
>> all - but maybe 2/3rds. Also, especially, the supercomputer
>> customers' sysadmins.
>
>Libraries are, of course, important to supercomputer users. But if they
>are written in a high-level language and the new CPU uses the same
>representation of floating-point numbers as the old (e.g., IEEE), they
>should compile to the new platform. Sure, some low-level optimisations
>may not apply, but if the new platform is a lot faster than the old,
>that may not matter. And you can always address the optimisation issue
>later.

Grrk. All of the above is partially true, but only partially. The
problem is almost entirely with poor-quality software (which is,
regrettably, most of it). Good quality software is portable to
quite wildly different systems fairly easily. It depends on whether
you are talking about performance-critical, numerical libraries
(i.e. what supercomputer users really want to do) or administrative
and miscellaneous software.

For the former, the representation isn't enough, as subtle differences
like hard/soft underflow and exception handling matter, too. And you
CAN'T disable optimisation for supercomputers, because you can't
accept the factor of 10+ degradation. It doesn't help, anyway,
because you will be comparing with an optimised version on the
other systems.
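
As a toy illustration of the underflow point (x86/SSE specific, and only
a sketch, not taken from any real library):

    // The same expression gives different answers under gradual underflow
    // and under flush-to-zero, even though both sides use IEEE formats.
    #include <cstdio>
    #include <limits>
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (x86/SSE specific)

    int main() {
        volatile float tiny = std::numeric_limits<float>::min();  // smallest normal float

        float soft = tiny / 16.0f;                    // gradual underflow: a subnormal
        std::printf("gradual underflow: %g\n", soft);

        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);   // emulate "hard" underflow
        float hard = tiny / 16.0f;                    // now flushed to exactly 0
        std::printf("flush-to-zero:     %g\n", hard);
        return 0;
    }

An algorithm that quietly relies on gradual underflow changes its answers
on a machine that flushes by default, and no amount of recompiling fixes that.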

With the latter, porting is usually trivial, provided that the
program has not been rendered non-portable by the use of autoconfigure,
and that it doesn't use the more ghastly parts of the infrastructure.
But most applications that do rely on those areas aren't relevant
to supercomputers, anyway, because they are concentrated around
the GUI area (and, yes, flash is a good example).

I spent a decade managing the second-largest supercomputer in UK
academia, incidentally, and some of the systems I managed were
'interesting'.

>Besides, until recently supercomputers were not mainly x86-based.
>
>> Perhaps supercomputers are more legacy x86 sensitive than game consoles...

Much less so.

Ken Hagan

unread,
Dec 9, 2009, 5:54:37 AM12/9/09
to
On Wed, 09 Dec 2009 08:47:40 -0000, Torben Ægidius Mogensen
<tor...@diku.dk> wrote:

> Sure, some low-level optimisations
> may not apply, but if the new platform is a lot faster than the old,
> that may not matter. And you can always address the optimisation issue
> later.

I don't think Andy was talking about poor optimisation. Perhaps these
libraries have assumed the fairly strong memory ordering model of an x86,
and in its absence would be chock full of bugs.

> Flash is available on ARM too. And if another platform becomes popular,
> Adobe will port Flash to this too.

When hell freezes over. It took Adobe *years* to get around to porting
Flash to x64.

They had 32-bit versions for Linux and Windows for quite a while, but no
64-bit version for either. To me, that suggests the problem was the
int-size rather than the platform, and it just took several years to clean
it up sufficiently. So I suppose it is *possible* that the next port might
not take so long. On the other hand, both of these targets have Intel's
memory model, so I'd be surprised if even this "clean" version was truly
portable.

Ken Hagan

unread,
Dec 9, 2009, 6:18:43 AM12/9/09
to
On Tue, 08 Dec 2009 13:13:32 -0000, ChrisQ <me...@devnull.com> wrote:

> The obvious question then is: Would one of many x86 cores be fast enough
> on it's own to run legacy windows code like office, photoshop etc ?...

Almost certainly. From my own experience, Office 2007 is perfectly usable
on a 2GHz Pentium 4 and only slightly sluggish on a 1GHz Pentium 3. These
applications are already "lightly multi-threaded", so some of the
longer-running operations are spun off on background threads, so if you
had 2 or 3 cores that were even slower, that would probably still be OK
because the application *would* divide the workload. For screen drawing,
the OS plays a similar trick.

I would also imagine that Photoshop had enough embarrassing parallelism
that even legacy versions might run faster on a lot of slow cores, but I'm
definitely guessing here.

Noob

unread,
Dec 9, 2009, 7:28:53 AM12/9/09
to
Bernd Paysan wrote:

> This of course would be much less of a problem if Flash wasn't something

> proprietary from Adobe [...]

A relevant article:
Free Flash community reacts to Adobe Open Screen Project
http://www.openmedianow.org/?q=node/21

Stefan Monnier

unread,
Dec 9, 2009, 9:56:08 AM12/9/09
to
> They had 32-bit versions for Linux and Windows for quite a while, but no
> 64-bit version for either. To me, that suggests the problem was the

It's just a question of market share.
Contrary to Free Software, where any idiot can port the code to his
platform if he so wishes, proprietary software first requires collecting
a large number of idiots so as to justify
compiling/testing/marketing/distributing the port.


Stefan

Paul Wallich

unread,
Dec 9, 2009, 1:10:35 PM12/9/09
to

From an outside perspective, this sounds a lot like the Itanic roadmap:
announce something brilliant and so far out there that your competitors
believe you must have solutions to all the showstoppers up your sleeve.
Major difference being that Larrabee's potential/probable competitors
didn't fold.

paul

Robert Myers

unread,
Dec 9, 2009, 3:25:16 PM12/9/09
to
On Dec 9, 3:47 am, torb...@diku.dk (Torben Ægidius Mogensen) wrote:

>
> Libraries are, of course, important to supercomputer users.  But if they
> are written in a high-level language and the new CPU uses the same
> representation of floating-point numbers as the old (e.g., IEEE), they
> should compile to the new platform.  Sure, some low-level optimisations
> may not apply, but if the new platform is a lot faster than the old,
> that may not matter.  And you can always address the optimisation issue
> later.
>

But if some clever C programmer or committee of C programmers has made
a convoluted and idiosyncratic change to a definition in a header
file, you may have to unscramble all kinds of stuff hidden under
macros just to get it to compile and link, and that effort can't be
deferred until later.
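
The sort of thing I mean, as a made-up example (the names are invented, not
from any real package):

    /* fastmath_config.h -- hypothetical header from a library "tested mainly
       on x86"; the whole API downstream is keyed to these macros. */
    #ifndef FASTMATH_CONFIG_H
    #define FASTMATH_CONFIG_H

    #if defined(__x86_64__) || defined(_M_X64)
    #  include <emmintrin.h>
    #  define FM_VECTOR_WIDTH 4
       typedef __m128 fm_vec;               /* every public type builds on this */
    #elif defined(__i386__)
    #  define FM_VECTOR_WIDTH 4
       typedef float fm_vec[4];
    #else
    #  error "fastmath: unsupported architecture"  /* your port stops right here */
    #endif

    #endif /* FASTMATH_CONFIG_H */

Everything that includes this has to be untangled before the first object
file even compiles on the new machine.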

Robert.

Robert Myers

unread,
Dec 9, 2009, 4:49:54 PM12/9/09
to
On Dec 9, 1:10 pm, Paul Wallich <p...@panix.com> wrote:

>  From an outside perspective, this sounds a lot like the Itanic roadmap:
> announce something brilliant and so far out there that your competitors
> believe you must have solutions to all the showstoppers up your sleeve.
> Major difference being that Larrabee's potential/probable competitors
> didn't fold.

In American football, "A good quarterback can freeze the opposition’s
defensive secondary with a play-action move, a pump fake or even his
eyes."

http://www.dentonrc.com/sharedcontent/dws/drc/opinion/editorials/stories/DRC_Editorial_1123.2e4a496a2.html

where the analogy is used in a political context.

If I were *any* of the players in this game, I'd be studying the
tactics of quarterbacks who need time to find an open receiver, since
*no one* appears to have the right product ready for prime time. If I
were Intel, I'd be nervous, but if I were any of the other players,
I'd be nervous, too.

Nvidia stock has drooped a bit after the *big* bounce it took on the
Larrabee announcement, but I'm not sure why everyone is so negative on
Nvidia (especially Andy). They don't appear to be in much more
parlous a position than anyone else. If Fermi is a real product, even
if only at a ruinous price, there will be buyers.

N.B. I follow the financial markets for information only. I am not an
active investor.

Robert.

Andy "Krazy" Glew

unread,
Dec 9, 2009, 11:12:39 PM12/9/09
to Robert Myers
Robert Myers wrote:
> Nvidia stock has drooped a bit after the *big* bounce it took on the
> Larrabee announcement, but I'm not sure why everyone is so negative on
> Nvidia (especially Andy). They don't appear to be in much more
> parlous a position than anyone else. If Fermi is a real product, even
> if only at a ruinous price, there will be buyers.

Let me be clear: I'm not negative on Nvidia. I think their GPUs are the
most elegant of the lot. If anything, I am overcompensating: within
Intel, I was probably the biggest advocate of Nvidia style
microarchitecture, arguing against a lot of guys who came to Intel from
ATI. Also on this newsgroup.

However, I don't think that anyone can deny that Nvidia had some
execution problems recently. For their sake, I hope that they have
overcome them.

Also, AMD/ATI definitely overtook Nvidia. I think that Nvidia
emphasized elegance, and GP GPU futures stuff, whereas ATI went the
slightly inelegant way of combining SIMT Coherent Threading with VLIW.
It sounds more elegant when you phrase it my way, "combining SIMT
Coherent Threading with VLIW", than when you have to describe it without
my terminology. Anyway, ATI definitely had a performance per transistor
advantage. I suspect they will continue to have such an advantage over
Fermi, because, after all, VLIW works to some limited extent.

I think Fermi is more programmable and more general purpose, while ATI's
VLIW approach has efficiencies in some areas.

I think that Nvidia absolutely has to have a CPU to have a chance of
competing. One measly ARM chip or Power PC on an Nvidia die. Or maybe
one CPU chip, one GPU chip, and a stack of memory in a package; or a GPU
plus a memory interface with a lousy CPU. Or, heck, a reasonably
efficient way of decoupling one of Nvidia's processors and running 1
thread, non-SIMT, of scalar code. SIMT is great, but there is important
non-SIMT scalar code.

Ultimately, the CPU vendors will squeeze GPU-only vendors out of the
market. AMD & ATI are already combined. If Intel's Larrabee is
stalled, it gives Nvidia some breathing room, but not much. Even if
Larrabee is completely cancelled, which I doubt, Intel would eventually
squeeze Nvidia out with its evolving integrated graphics. Which,
although widely dissed, really has a lot of potential.

Nvidia's best chance is if Intel thrashes, dithering between Larrabee
and Intel's integrated graphics and ... isn't Intel using PowerVR in
some Atom chips? I.e. Intel currently has at least 3 GPU solutions in
flight. *This* sounds like the sort of thrash Intel had -
x86/i960/i860 ... I personally think that Intel's best path to success
would be to go with a big core + the Intel integrated graphics GPU,
evolved, and then jump to Larrabee. But if they focus on Larrabee, or
an array of Atoms + a big core, their success will just be delayed.

Intel is its own biggest problem, with thrashing.

Meanwhile, AMD/ATI are in the best position. I don't necessarily like
Fusion CPU/GPU, but they have all the pieces. But it's not clear they
know how to use it.

And Nvidia needs to get out of the discrete graphics board market niche
as soon as possible. If they can do so, I bet on Nvidia.

Robert Myers

unread,
Dec 9, 2009, 11:33:18 PM12/9/09
to
On Dec 9, 11:12 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

> And Nvidia needs to get out of the discrete graphics board market niche


> as soon as possible. If they can do so, I bet on Nvidia.

Cringely thinks, well, the link says it all:

http://www.cringely.com/2009/12/intel-will-buy-nvidia/

Robert.

Andy "Krazy" Glew

unread,
Dec 10, 2009, 12:25:32 AM12/10/09
to Ken Hagan
Ken Hagan wrote:
> On Wed, 09 Dec 2009 08:47:40 -0000, Torben Ægidius Mogensen
> <tor...@diku.dk> wrote:
>
>> Sure, some low-level optimisations
>> may not apply, but if the new platform is a lot faster than the old,
>> that may not matter. And you can always address the optimisation issue
>> later.
>
> I don't think Andy was talking about poor optimisation. Perhaps these
> libraries have assumed the fairly strong memory ordering model of an
> x86, and in its absence would be chock full of bugs.

Nick is correct to say that memory ordering is harder to port around
than instruction set or word size.

A surprisingly large number of supercomputer customers use libraries and
tools that have some specific x86 knowledge.

For example, folks who use tools like Pin, the binary instrumentation
tool. Although Intel makes Pin available on some non-x86 machines,
where do you think Pin runs best?

Or the Boehm garbage collector for C++. Although it's fairly portable -

http://www.hpl.hp.com/personal/Hans_Boehm/gc/#where says
The collector is not completely portable, but the distribution includes
ports to most standard PC and UNIX/Linux platforms. The collector should
work on Linux, *BSD, recent Windows versions, MacOS X, HP/UX, Solaris,
Tru64, Irix and a few other operating systems. Some ports are more
polished than others.

again, if your platform is "less polished"...

Plus, there are the libraries and tools like Intel's Thread Building
Blocks.

Personally, I prefer not to use libraries that are tied to one processor
architecture, but many people just want to get their job done.

The list goes on.

Like I said, I was surprised at how many supercomputer customers
expressed this x86 orientation. I expected them to care little about x86.

Andrew Reilly

unread,
Dec 10, 2009, 2:22:55 AM12/10/09
to
On Wed, 09 Dec 2009 21:25:32 -0800, Andy \"Krazy\" Glew wrote:

> Like I said, I was surprised at how many supercomputer customers
> expressed this x86 orientation. I expected them to care little about
> x86.

I still expect those who use Cray or NEC vector supers, or any of the
scale-up SGI boxes, or any of the Blue-foo systems to care very little
indeed. The folk who seem to be getting mileage from the CUDA systems
probably only care peripherally. I suspect that it depends on how your
focus group self-selects.

Yes there are some big-iron x86 systems now, but they haven't even been a
majority on the top500 for very long.

I suppose that it doesn't take too long for bit-rot to set in, if the
popular crowd goes in a different direction.

Cheers,

--
Andrew

Terje Mathisen

unread,
Dec 10, 2009, 3:32:31 AM12/10/09
to
Robert Myers wrote:
> Nvidia stock has drooped a bit after the *big* bounce it took on the
> Larrabee announcement, but I'm not sure why everyone is so negative on
> Nvidia (especially Andy). They don't appear to be in much more
> parlous a position than anyone else. If Fermi is a real product, even
> if only at a ruinous price, there will be buyers.

I have seen a report by a seismic processing software firm, indicating
that their first experiments with GPGPU programming had gone very well:

After 8 rounds of optimization, which basically consisted of mapping
their problem (acoustic wave propagation, according to Kirchhoff) onto
the actual capabilities of a GPU card, they went from being a little
slower than the host CPU up to nearly two orders of magnitude faster.

This meant that Amdahl's law started rearing its ugly head: the setup
overhead took longer than the actual processing, so now they are working
on moving at least some of that surrounding code onto the GPU as well.
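
To put rough numbers on the Amdahl point (illustrative figures, not taken
from their slides):

    // Overall speedup when only a fraction of the runtime gets the fast kernel.
    double amdahl(double parallel_fraction, double kernel_speedup)
    {
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / kernel_speedup);
    }
    // amdahl(0.50, 100.0) ~= 1.98  -- kernel was half the runtime: barely 2x overall
    // amdahl(0.95, 100.0) ~= 16.8  -- hence moving the surrounding code over as well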

Anyway, with something like 40-100x speedups, oil companies will be
willing to spend at least $1000+ per chip.

However, I'm guessing that the global oil processing market has no more
than 100 of the TOP500 clusters, so this is 100K to 1M chips if everyone
would scrap their current setup.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
Dec 10, 2009, 3:37:02 AM12/10/09
to

A rumor which has re-surfaced at least every year for as long as I can
remember, gaining strength since the AMD/ATI deal was announced.

Yes, it could well happen; Intel does have some spare change lying
around in the couch cushions. :-)

Torben Ægidius Mogensen

unread,
Dec 10, 2009, 3:57:02 AM12/10/09
to
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:


> I think that Nvidia absolutely has to have a CPU to have a chance of
> competing. One measly ARM chip or Power PC on an Nvidia die.

They do have Tegra, which is an ARM11 core on a chip with a graphics
processor (alas, not CUDA compatible) plus some other stuff. Adding one
or more ARM cores to a Fermi would not be that far a step. It would
require porting CUDA to ARM, though.

> Or, heck, a reasonably efficient way of decoupling one of Nvidia's
> processors and running 1 thread, non-SIMT, of scalar code.

The Nvidia processors lack interrupts and other stuff necessary for
running an OS, so it is probably better to use a different processor.

> isn't Intel using PowerVR in some Atom chips?

I know ARM uses PowerVR, but I hadn't heard Intel doing so.

Torben

Michael S

unread,
Dec 10, 2009, 4:06:53 AM12/10/09
to


"8 rounds of optimization", that's impressive.
I wonder how much speed-up could they get from the host CPU after just
3 rounds:
1. double->single, to reduce memory footprint
2. SIMD
3. Exploit all available cores/threads
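
Schematically, something like this (a stand-in loop, not the actual seismic
kernel; assumes OpenMP and a compiler that auto-vectorizes):

    #include <omp.h>

    void scale_add(int n, float a, const float *x, float *y)  // round 1: float, not double
    {
        #pragma omp parallel for              // round 3: all cores/threads
        for (int i = 0; i < n; ++i)           // round 2: simple stride-1 loop,
            y[i] = a * x[i] + y[i];           //   so the compiler can emit SSE/AVX code
    }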

nm...@cam.ac.uk

unread,
Dec 10, 2009, 4:13:14 AM12/10/09
to
In article <4B20864...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:
>
>A surprisingly large number of supercomputer customers use libraries and
>tools that have some specific x86 knowledge.

It used to be very few, but is increasing.

>Like I said, I was surprised at how many supercomputer customers
>expressed this x86 orientation. I expected them to care little about x86.

A lot of it is due to the change in 'community' and their applications.
It's not just the libraries, but I have little to add to what you
said on those (well, other examples, but so?)

Traditionally, people were really up against hard limits, and they
were prepared to both spend serious effort in tuning and switch to
whichever system offered them most time. There still are a lot like
that. Fortran and MPI dominate, and few people give a damn about the
architecture.

An increasing number want to use a 'supercomputer' as an alternative
to tuning their code. Some of those codes are good, some are merely
inefficient, some are unnecessarily x86-dependent, and some LOOK
x86-dependent because they are just plain broken. C++ and shared
memory dominate.

And, as usual, nothing is hard and fast, so there are intermediates
and mixtures and ....


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Dec 10, 2009, 4:48:31 AM12/10/09
to
Michael S wrote:
> On Dec 10, 10:32 am, Terje Mathisen<Terje.Mathi...@tmsw.no> wrote:
>> I have seen a report by a seismic processing software firm, indicating
>> that their first experiments with GPGPU programming had gone very well:
>>
>> After 8 rounds of optimization, which basically consisted of mapping
>> their problem (acoustic wave propagation, according to Kirchhoff) onto
>> the actual capabilities of a GPU card, they went from being a little
>> slower than the host CPU up to nearly two orders of magnitude faster.
>>
>> This meant that Amdahl's law started rearing it's ugly head: The setup
>> overhead took longer than the actual processing, so now they are working
>> on moving at least some of that surrounding code on the GPU as well.
>>
>> Anyway, with something like 40-100x speedups, oil companies will be
>> willing to spend at least $1000+ per chip.
>>
>> However, I'm guessing that the global oil processing market has not more
>> than 100 of the TOP500 clusters, so this is 100K to 1M chips if everyone
>> would scrap their current setup.
>>
>> Terje
> "8 rounds of optimization", that's impressive.
> I wonder how much speed-up could they get from the host CPU after just
> 3 rounds:
> 1. double->single, to reduce memory footprint
> 2. SIMD
> 3. Exploit all available cores/threads

I'm pretty sure they are already doing all of those, at least in the lab
where they tested GPGPU.

Thomas Womack

unread,
Dec 10, 2009, 5:24:16 AM12/10/09
to
In article <4B207537...@patten-glew.net>,

Andy \"Krazy\" Glew <ag-...@patten-glew.net> wrote:
>Also, AMD/ATI definitely overtook Nvidia. I think that Nvidia
>emphasized elegance, and GP GPU futures stuff, whereas ATI went the
>slightly inelegant way of combining SIMT Coherent Threading with VLIW.
>It sounds more elegant when you phrase it my way, "combining SIMT
>Coherent Threading with VLIW", than when you have to describe it without
>my terminology. Anyway, ATI definitely had a performance per transistor
>advantage.

ATI win on performance, but nVidia win by miles on GPGPU software
development, simply because they've picked a language and stuck with
it, and at some point some high-up insisted that the GPGPU compilers
be roughly synchronised with the hardware releases; I expect to be
able to pick up a Fermi card, download the latest nvidia SDK, build
something linked with cufft, and get a reasonable performance.

ATI's compiler and driver stack, to the best of my knowledge, doesn't
support double precision yet, well after the second generation of
chips with DP on has appeared.

An AMD employee posted in their OpenCL forum about four weeks ago:

"Double precision floating point support is important for us. We are
planning to begin to introduce double precision arithmetic support in
first half of 2010 as well as the start of some built-ins over time."

Tom

Michael S

unread,
Dec 10, 2009, 9:01:50 AM12/10/09
to

If they are doing all that, I simply can't see how one of the existing GPUs
(i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
significantly above 1 SP TFLOPs? According to Wikipedia there are not.
So, either they compare an array of GPUs with a single host CPU, or their
host code is very far from optimal. I'd bet on the latter.

Bernd Paysan

unread,
Dec 10, 2009, 9:28:06 AM12/10/09
to
Michael S wrote:
> If they are doing all that I simply can't see how one of existing GPUs
> (i.e. not Fermi) could possibly beat 3 GHz Nehalem by factor of >10.
> Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> significant above 1 SP TFLOPs? According to Wikipedia there are not.

AFAIK the ATI 5870 can achieve up to 3.04 SP TFLOPS at 950MHz. That's a
single chip. And the data is on Wikipedia - where did you look? Look here:

http://en.wikipedia.org/wiki/FLOPS

Michael S

unread,
Dec 10, 2009, 12:11:35 PM12/10/09
to


I looked at the same page but didn't pay attention to the fine print at
the bottom :(

Anyway, the Radeon™ HD 5870 has been in the field for 2 months or something like
that? Somehow I don't think Terje's buddies had it for 8 rounds of
optimization.
Also it seems that up until very recently very few organizations tried non-
Nvidia GPGPU.

Torbjorn Lindgren

unread,
Dec 10, 2009, 1:14:59 PM12/10/09
to
Torben Ægidius Mogensen <tor...@diku.dk> wrote:

The chipset for the MID/UMPC Atoms is Poulsbo (aka "Intel System
Controller Hub US15W") and contains the "GMA500".

GMA500 is entirely unrelated to all other GMA models and consists of
PowerVR SGX 535 (graphics) and PowerVR VXD (H.264/MPEG-4 AVC
playback)...

IIRC only the special MID/UMPC Atoms can be coupled with Poulsbo (i.e.
Z5xx/Silverthorne)?

The other Intel chipsets use a lot more power than the Atom
itself (extremely bad when you have a low-power CPU) and are pretty
anemic to boot. At the other end is Nvidia Ion, which also uses a bit
more power than one would hope but at least has a useful GPU/HD
playback accelerator (like Poulsbo but faster).

http://en.wikipedia.org/wiki/Poulsbo_(chipset)
http://en.wikipedia.org/wiki/Intel_GMA
http://en.wikipedia.org/wiki/Intel_Atom#Power_requirements

j...@cix.compulink.co.uk

unread,
Dec 10, 2009, 4:15:18 PM12/10/09
to
In article <1isTm.93647$Pi.2...@newsfe30.ams2>, me...@devnull.com
(ChrisQ) wrote:

> The obvious question then is: Would one of many x86 cores be fast
> enough on it's own to run legacy windows code like office, photoshop
> etc ?...

Maybe. But can marketing men convince themselves that this would be the
case? Almost certainly: a few studies about how many apps the average
corporate Windows user has open at a time could work wonders. The
problem, of course, is that most of those apps aren't consuming much CPU
except when they have input focus. But that's the kind of thing that
marketing departments are good at neglecting.

--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.

j...@cix.compulink.co.uk

unread,
Dec 10, 2009, 4:15:18 PM12/10/09
to
In article <hfk1mu$i7j$1...@smaug.linux.pwf.cam.ac.uk>, nm...@cam.ac.uk ()
wrote:
> Yes. But the word "planned" implies a degree of deliberate action
> that I believe was absent. They assuredly blithered on about it,
> and very probably had meetings about it ....

Indeed. Intel don't seem to have become serious about multi-core until
they discovered that they could not clock the NetBurst above 4GHz, but
that their fab people could readily fit two of them on a single die.

I did some work with the early "Pentium D", which was two NetBursts on
the same die, but with two sets of legs, and no communication between
the cores that didn't go through the legs and the motherboard FSB.
Locking performance was unimpressive, to say the least, and early
Opterons beat it utterly. I'm going to take a lot of convincing that
this was a long-planned product; the design just isn't good enough for
that to be convincing.

Robert Myers

unread,
Dec 10, 2009, 4:47:53 PM12/10/09
to
On Dec 10, 4:15 pm, j...@cix.compulink.co.uk wrote:

> I did some work with the early "Pentium D", which was two NetBursts on
> the same die, but with two sets of legs, and no communication between
> the cores that didn't go through the legs and the motherboard FSB.
> Locking performance was unimpressive, to say the least, and early
> Opterons beat it utterly. I'm going to take a lot of convincing that
> this was a long-planned product; the design just isn't good enough for
> that to be convincing.

The charts I remember, and I'm sure they were from the last
millennium, observed the rate at which power per unit area was
increasing, had a space shuttle thermal tile number on the same slide
for comparison, and concluded that the trend was not sustainable.

There were, in concept, at least two ways you could beat the trend: go
to multiple cores not running so fast (the proposal in that
presentation) or bet on a miracle. Apparently, the NetBurst team was
betting on a miracle.

From the outside, Intel looks arrogant enough to believe that they
could do multiple cores when they were forced to and no sooner. In
actuality, they weren't far wrong. Most people don't remember Pentium
D.

Robert.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 12:23:44 AM12/11/09
to Michael S
Michael S wrote:

> If they are doing all that I simply can't see how one of existing GPUs
> (i.e. not Fermi) could possibly beat 3 GHz Nehalem by factor of >10.
> Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> significant above 1 SP TFLOPs? According to Wikipedia there are not.
> So, either they compare an array of GPUs with single host CPU or their
> host code is very far from optimal. I'd bet on later.

Let's see: http://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg
says that Nhm can do 2 128-bit SSE adds and 1 128-bit SSE mul per cycle.
Now, you might count that as 12 SP FLOPs or 6 DP FLOPS. Multiplied by 8
cores on a chip, you might get 96 SP FLOPS.

However, most supercomputer people count a flop as a
multiply-accumulate. By that standard, Nhm is only 4 SP mul-add FLOPs
per cycle. Add a fudge factor for the extra adder, but certainly not
2X, probably not even 1.5X -- and purists won't even give you that. 32
FLOPS. If you are lucky.

Seldom do you get the 100% utilization of the FMUL unit that you would
need to get 32 SP FLOPS. Especially not when you throw in MP bus
contention, thread contention, etc.

Whereas the GPUs tend to have real FMAs. Intel and AMD have both
indicated that they are going the FMA direction. But I don't think that
has shipped yet.

And, frankly, it is easier to tune your code to get good utilization on
a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
for simple codes, yes, CUDA is easier. In my experience. And I'm a
fairly good x86 programmer, and a novice CUDA GPU programmer. I look
forward to Terje reporting his experience tuning code for CUDA (as long
as he isn't tuning wc).

The painful thing about CUDA is the ugly memory model - blocks, blah,
blah, blah. And it is really bad when you have to transfer stuff from
CPU to GPU memory. I hope and expect that Fermi will ameliorate this pain.
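
Just to make the transfer pain concrete, here is a minimal sketch of the
usual host-side round trip (my own toy example, not code from any real
app; the kernel and the sizes are made up):

#include <cuda_runtime.h>

// hypothetical kernel, standing in for the real work
__global__ void my_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run_on_gpu(float *x, int n)
{
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));                  // device mirror
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice); // CPU -> GPU
    my_kernel<<<(n + 255) / 256, 256>>>(d_x, n);                   // the actual work
    cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost); // GPU -> CPU
    cudaFree(d_x);
}

Every hot data structure has to be mirrored and shipped across the bus
like that, which is exactly the part I hope Fermi starts to hide.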

---

People are reporting 50x and 100x improvements on CUDA all over the
place. Try it yourself. Be sure to google for tuning advice.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 12:26:27 AM12/11/09
to Andrew Reilly
Andrew Reilly wrote:
> On Wed, 09 Dec 2009 21:25:32 -0800, Andy \"Krazy\" Glew wrote:
>
>> Like I said, I was surprised at how many supercomputer customers
>> expressed this x86 orientation. I expected them to care little about
>> x86.
>
> I still expect those who use Cray or NEC vector supers, or any of the
> scale-up SGI boxes, or any of the Blue-foo systems to care very little
> indeed. The folk who seem to be getting mileage from the CUDA systems
> probably only care peripherally.

Actually some of the CUDA people do care.

They'll use CUDA for the performance critical code, and x86 for all the
rest, in the system it is attached to. With the x86 tools.

Or at least that's what they told me at SC09.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 12:36:12 AM12/11/09
to j...@cix.compulink.co.uk
j...@cix.compulink.co.uk wrote:
> In article <1isTm.93647$Pi.2...@newsfe30.ams2>, me...@devnull.com
> (ChrisQ) wrote:
>
>> The obvious question then is: Would one of many x86 cores be fast
>> enough on its own to run legacy Windows code like Office, Photoshop,
>> etc.?...
>
> Maybe. But can marketing men convince themselves that this would be the
> case? Almost certainly: a few studies about how many apps the average
> corporate Windows user has open at a time could work wonders. The
> problem, of course, is that most of those apps aren't consuming much CPU
> except when they have input focus. But that's the kind of thing that
> marketing departments are good at neglecting.

At SC09 the watchword was heterogeneity.

E.g. a big OOO x86 core, with small efficient cores of your favorite
flavour. On the same chip.

While you could put a bunch of small x86 cores on the side, I think that
you would probably be better off putting a bunch of small non-x86 cores
on the side. Like GPU cores. Like Nvidia. Or AMD/ATI Fusion.

Although this makes sense to me, I wonder if the people who want x86
really want x86 everywhere - on both the big cores, and the small.

Nobody likes the hetero programming model. But if you get a 100x perf
benefit from GPGPU...

Terje Mathisen

unread,
Dec 11, 2009, 1:48:52 AM12/11/09
to
Michael S wrote:
> If they are doing all that I simply can't see how one of the existing GPUs
> (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
> Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> significantly above 1 SP TFLOPs? According to Wikipedia there are not.
> So, either they compare an array of GPUs with a single host CPU or their
> host code is very far from optimal. I'd bet on the latter.

It seems to be a bandwidth problem much more than a flops problem, i.e.
using the texture units effectively was the key to the big wins.

Take a look yourself:

http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_Seismic_Hess.pdf

Page 15 has the optimization graph, stating that 'Global Memory
Coalescing', 'Optimized Use of Shared Memory' and getting rid of
branches were the main contributors to the speedups.
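
For anyone who hasn't run into it, "global memory coalescing" just means
making consecutive threads touch consecutive addresses. A toy sketch of
the difference (mine, not from the Hess slides):

// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// consecutive addresses and collapse into a few wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: thread i strides through memory, so the same 32 loads
// scatter across many segments and effective bandwidth drops sharply.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}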

It is of course possible that the CPU baseline was quite naive code, or
that the CPUs used were quite old, but I would hope not.

Noob

unread,
Dec 11, 2009, 7:28:02 AM12/11/09
to
Andy "Krazy" Glew wrote:

> Nobody likes the hetero programming model.

They prefer the homo programming model? :-)

Michael S

unread,
Dec 11, 2009, 8:10:36 AM12/11/09
to
On Dec 11, 7:23 am, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

> Michael S wrote:
> > If they are doing all that I simply can't see how one of the existing GPUs
> > (i.e. not Fermi) could possibly beat a 3 GHz Nehalem by a factor of >10.
> > Nehalem is rated at ~100 SP GFLOPs. Are there GPU chips that are
> > significantly above 1 SP TFLOPs? According to Wikipedia there are not.
> > So, either they compare an array of GPUs with a single host CPU or their
> > host code is very far from optimal. I'd bet on the latter.
>
> Let's see: http://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg
> says that Nhm can do 2 128-bit SSE adds and 1 128-bit SSE mul per cycle.
> Now, you might count that as 12 SP FLOPs or 6 DP FLOPs per cycle.
> Multiplied by 8 cores on a chip, you might get 96 SP FLOPs per cycle.
>

I counted it as 8 SP FLOPs. If Wikipedia claims that Nehalem can do 2
FP128 adds per cycle then they are wrong, but more likely you misread
it. Nehalem has only one 128-bit FP adder, attached to port 1, exactly
like the previous members of the Core 2 family. Port 5 is only "move and
logic", not capable of FP arithmetic.
8 FLOPs/core * 4 cores/chip * 2.93 GHz => ~94 GFLOPs

> However, most supercomputer people count a flop as a
> multiply-accumulate.
> By that standard, Nhm is only 4 SP mul-add FLOPs
> per cycle.

Bullshit. Supercomputer people count exactly like everybody else. Look
at "peak flops" in LINPACK reports.

>Add a fudge factor for the extra adder, but certainly not
> 2X, probably not even 1.5X -- and purists won't even give you that. 32
> FLOPS. If you are lucky.
>
> Seldom do you get the 100% utilization of the FMUL unit that you would
> need to get 32 SP FLOPS. Especially not when you throw in MP bus
> contention, thread contention, etc.
>
> Whereas the GPUs tend to have real FMAs.

That has nothing to do with the calculation at hand. When AMD says that
their new chip does 2.72 TFLOPs they really mean 1.36 TFMAs.

>Intel and AMD have both indicated that they are going the FMA direction. But I don't think that
> has shipped yet.

Hopefully, Intel is not going in the FMA direction. Three source operands
are a major PITA for a P6-derived uarch. It most likely requires
coordinated dispatch via two execution ports, so it would gain nothing in
peak throughput. But you surely know more about it than I do.
FMA makes sense on Silverthorne, but I'd rather see Silverthorne dead.

>
> And, frankly, it is easier to tune your code to get good utilization on
> a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
> for simple codes, yes, CUDA is easier. In my experience. And I'm a
> fairly good x86 programmer, and a novice CUDA GPU programmer. I look
> forward to Terje reporting his experience tuning code for CUDA (as long
> as he isn't tuning wc).

I'd guess you played with microbenchmarks. I can't imagine it being true
on real-world code that, yes, is "ugly", but what can we do: real-world
problems are almost never nice and symmetric.


>
> The painful thing about CUDA is the ugly memory model - blocks, blah,
> blah, blah. And it is really bad when you have to transfer stuff from
> CPU to GPU memory. I hope and expect that Fermi will ameliorate this pain.
>
> ---
>
> People are reporting 50x and 100x improvements on CUDA all over the
> place. Try it yourself. Be sure to google for tuning advice.

First, 95% of the people can't do proper SIMD+multicore on host CPU to
save their lives, and that is already a large proportion of the "people
reporting". Of those who are honest and know what they are doing, the
majority likely didn't have a compute-bound problem to start with, and
they found a way to take advantage of the texture units.
According to Terje (see below), that was the case in the seismic code he
brought as an example.

Still, I have a feeling that a majority (not all) of the PDE-type problems
that are helped by the texture units on a GPU could, on the host CPU, be
reformulated to exploit temporal locality via the on-chip cache. But
that's just a feeling, nothing scientific.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 9:59:07 AM12/11/09
to Michael S
Michael S wrote:
> First, 95% of the people can't do proper SIMD+multicore on host CPU to
> save their lives

Right. If only 5% (probably less) of people can't do SIMD+multicore on
host CPU, but 10% can do it on a coherent threaded microarchitecture,
which is better?

> On Dec 11, 7:23 am, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net> wrote:
>> And, frankly, it is easier to tune your code to get good utilization on
>> a GPU. Yes, easier. Try it yourself. Not for really ugly code, but
>> for simple codes, yes, CUDA is easier. In my experience. And I'm a
>> fairly good x86 programmer, and a novice CUDA GPU programmer. I look
>> forward to Terje reporting his experience tuning code for CUDA (as long
>> as he isn't tuning wc).
>
> I'd guess you played with microbenchmarks.

Yep, you're right.

But even on the simplest microbenchmark, DAXPY, I needed to spend less
time tuning it on CUDA than I did tuning it on x86.
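
For reference, the entire device side of that DAXPY is essentially the
following (a from-memory sketch, double precision assumed, block size
picked arbitrarily):

// y = a*x + y, one thread per element. The "vectorization" is the
// coherent threading itself, so the only tuning knobs left are the
// block size and keeping the accesses contiguous.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// launch: daxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);

On the x86 side, the same loop grows SSE intrinsics, unrolling, and a
threading layer before it gets anywhere near peak.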

Now, there are some big real world apps where coherent threading falls
off a cliff. Where SIMT just doesn't work.

But if GPGPU needs less work for simple stuff, and comparable work for
hard stuff, and if the places where it falls off a cliff are no more
common than where MIMD CPU falls off a cliff...

If this were Willamette versus CUDA, there would not even be a question.
CUDA is easier to tune than Wmt. Nhm is easier to tune than Wmt, but it
still has a fairly complex microarchitecture, with lots of features that
get in the way. Sometimes simpler is better.

Andy "Krazy" Glew

unread,
Dec 11, 2009, 10:24:36 AM12/11/09
to Terje Mathisen
Terje Mathisen wrote:
> It seems to be a bandwidth problem much more than a flops problem, i.e.
> using the texture units effectively was the key to the big wins.

I am puzzled, torn, wondering about the texture units.

There's texture memory, which is just a slightly funky form of
cache/memory, with funky locality, and possible compression.

But then there's also the texture computation capability. Which is just
a funky form of 2 or 3D interpolation.

Most people seem to be getting the benefit from texture memory. But
when people use the texture interpolation compute capabilities, there's
another kicker.

Back in the 1990s on P6, when I was trying to make the case for CPUs to
own the graphics market, and not surrender it to the only-just-nascent
GPUs, the texture units were the oinker: they are just so damned
necessary to graphics, and they are just so damned idiosyncratic. I do
not know of any good way to do texturing in software that doesn't lose
performance, or of any good way to decompose texturing into simpler
instruction set primitives that could reasonably be added to an
instruction set. E.g. I don't know of any good way to express texture
operations in terms of 2, or 3, or even 4, register inputs.

Let's try again: how about an interpolate instruction that takes 3
vector registers, and performs interpolation between tuples in the X
direction along the length of the register, and in the Y direction
between corresponding elements in different registers?

But do you want 2, or 3, or 4, or ... arguments to interpolate along?
And what about Z interpolation?

Let alone compression? And skewed sampling? And ...

Textures just seem to be this big mass of stuff, all of which has to be
done in order to be credible.

Although I usually try to decompose complex things into simpler
operations, sometimes it is necessary to go the other way. Maybe we can
make the texture units more general. Make them into generally useful
function interpolation units. Add that capability to general purpose CPUs.

How much of the benefit is texture computation vs texture memory? Can
we separate these two things?

Texture computation is interpolation. (Which, of course, often
translates to memory savings because it changes the amount of memory you
need for lookup tables - higher order interpolation, or multiscale
interpolation => less memory traffic.) It looks like this can be made
general purpose. But how many people need it?
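
To make "texture computation is interpolation" concrete: one 2D bilinear
sample done in software looks roughly like this (my illustration only;
the clamping, wrapping and format conversion the texture unit also does
for free are left out):

// Four loads plus three lerps per sample; the texture unit does all of
// this, plus addressing and filtering modes, in fixed-function hardware.
__device__ float bilerp(const float *tex, int width, float u, float v)
{
    int   x0 = (int)floorf(u), y0 = (int)floorf(v);
    float fx = u - x0, fy = v - y0;
    float t00 = tex[y0 * width + x0];
    float t10 = tex[y0 * width + x0 + 1];
    float t01 = tex[(y0 + 1) * width + x0];
    float t11 = tex[(y0 + 1) * width + x0 + 1];
    float top = t00 + fx * (t10 - t00);   // lerp in X, row y0
    float bot = t01 + fx * (t11 - t01);   // lerp in X, row y0+1
    return top + fy * (bot - top);        // lerp in Y
}

A general-purpose "interpolate" unit would have to settle exactly the
how-many-arguments and boundary questions above before it could replace
even this simple case.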

Texture memory is ... a funky sort of cache, with compression. Caches we
can make generically useful. Compression - for read-only data
structures, sure. But how can we write INTO the "compressed texture
cache memory", in such a way that we don't blow out the compression when
it gets kicked out of the cache?

Or, can we safely create a hardware data structure that is mainly useful
for caching read-only, heavily preprocessed data?

It seems to me that most of the GPGPU codes are not using the compute or
compression aspects of texture units. Indeed, CUDA doesn't really give
access to that. So it is probably just the extra memory ports and cache
behavior.

--

Terje, you're the master of lookup tables. Can you see a way to make
texture units generally useful?

Andy "Krazy" Glew

unread,
Dec 11, 2009, 4:20:02 PM12/11/09
to Michael S
Andy "Krazy" Glew wrote:
> Michael S wrote:
>> First, 95% of the people can't do proper SIMD+multicore on host CPU to
>> save their lives
>
> Right. If only 5% (probably less) of people can't do SIMD+multicore on
> host CPU, but 10% can do it on a coherent threaded microarchitecture,
> which is better?

Urg. Language. Poor typing skills.

If only 5% can do good SIMD+multicore tuning on a host CPU,
but 10% can do it on a coherent threaded GPU-style microarchitecture,
which is better?

Andy "Krazy" Glew

unread,
Dec 11, 2009, 4:50:15 PM12/11/09
to Robert Myers
Robert Myers wrote:
> On Dec 9, 11:12 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>

> wrote:
>
>> And Nvidia needs to get out of the discrete graphics board market niche
>> as soon as possible. If they can do so, I bet on Nvidia.
>
> Cringely thinks, well, the link says it all:
>
> http://www.cringely.com/2009/12/intel-will-buy-nvidia/

Let's have some fun. Not gossip, but complete speculation. Let's think
about what companies might have a business interest in, or be capable
of, buying Nvidia. Add to this one extra consideration: Jen-Hsun Huang
is rumored to have wanted the CEO position in such merger possibilities
in the past:

http://www.tomsguide.com/us/nvidia-amd-acquisition,news-594.html

The list:

---> Intel + Nvidia:
I almost hope not, but Cringely has