
Bulldozer on Slashdot


nedbrek

Aug 24, 2010, 7:27:45 PM
http://hardware.slashdot.org/story/10/08/24/1521203/AMD-Details-Upcoming-Bulldozer-Architecture

Very few details, except this stuck out:
"AMD expects around 80% of the performance of a traditional dual core part
in about 50% to 60% of the size and power consumption."

The workload is obviously going to matter a lot. But, it seems that the
integer performance must be lower.

It makes me sad.

Ned


Robert Myers

Aug 24, 2010, 9:22:34 PM
On Aug 24, 7:27 pm, "nedbrek" <nedb...@yahoo.com> wrote:
> http://hardware.slashdot.org/story/10/08/24/1521203/AMD-Details-Upcom...

>
> Very few details, except this stuck out:
> "AMD expects around 80% of the performance of a traditional dual core part
> in about 50% to 60% of the size and power consumption."
>
> The workload is obviously going to matter a lot.  But, it seems that the
> integer performance must be lower.
>
> It makes me sad.
>

The discussions of anything involving any flavor of SMT have never
left me with much in the way of insight.

They've apparently decided to double the number of integer execution
units (at some cost to single thread performance, I gather). That
allows the marketeers to make slides like the one shown on
http://www.pcper.com/article.php?type=expert&aid=985&pid=2, wherein it
is implied that an Intel-style hyperthreaded core will be wimpy
compared to an AMD "fat" SMT core.

As you say, the value of the trade will depend on the workload, but
the logic of the trades is, as always, completely obscure, at least to
me. It looks like the part is aimed at virtualized servers and
highly parallel integer media processing.

Robert.

Skybuck Flying

Aug 25, 2010, 12:53:44 AM

"nedbrek" <ned...@yahoo.com> wrote in message
news:i51khm$cat$1...@news.eternal-september.org...

If you quote the whole thing it doesn't sound so bad:

"
By smartly sharing and beefing up components, AMD is able to double the
amount of integer units into a design. AMD expects around 80% of the
performance of a traditional dual core part in about 50% to 60% of the size
and power consumption. This allows AMD to pack 33% more cores and an
estimated 50% higher throughput than a current generation Magny Cours for
its given die size and power envelope.
"

Looks like these will be 16-core processors or more.

Bye,
Skybuck.


Brett Davis

Aug 25, 2010, 2:16:35 AM
In article <i51khm$cat$1...@news.eternal-september.org>,
"nedbrek" <ned...@yahoo.com> wrote:

Methinks you and/or that reporter confused that quote.
Turn off the second CPU and you get 100%, turn on the second CPU and get 180%.
Hyperthreading gives you +-5%, for about 5%(?) die space.

The second set of integer pipes cost 12.5% of die space, as it shares fetch,
decode, and the huge FPU.
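
A quick sanity check, taking those figures at face value:

  marginal throughput per unit area, CMT: 0.80 / 0.125 = 6.4
  marginal throughput per unit area, SMT: 0.05 / 0.05  = 1.0

Rough numbers, and mine rather than AMD's, but they show why the trade
is attractive.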

Intel would claim 100% for the second core, because someone will get that, but
AMD cannot get away with promising the moon and delivering useless crap like
Hyperthreading. (Which Intel claims 100% speedups for.)
AMD just does not have the marketing money to buy off the press like Intel.

I want to know how it compares to K10. Only two AGUs and two ALUs, versus three
of each for K10. AMD has gone for even heavier microOps. (Smart move, saves power.)

http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010
http://www.anandtech.com/show/3865/amd-bobcat-bulldozer-hot-chips-presentations-online

Brett

Andy Glew

Aug 25, 2010, 3:39:50 AM
On 8/24/2010 11:16 PM, Brett Davis wrote:
> In article <i51khm$cat$1...@news.eternal-september.org>,
> "nedbrek" <ned...@yahoo.com> wrote:
>
>> http://hardware.slashdot.org/story/10/08/24/1521203/AMD-Details-Upcoming-Bulldozer-Architecture
>>
>> Very few details, except this stuck out:
>> "AMD expects around 80% of the performance of a traditional dual core part
>> in about 50% to 60% of the size and power consumption."
>>
>> The workload is obviously going to matter a lot. But, it seems that the
>> integer performance must be lower.
>>
>> It makes me sad.
>
> Methinks you and/or that reporter confused that quote.
> Turn off the second CPU and you get 100%, turn on the second CPU and get 180%.
> Hyperthreading gives you +-5%, for about 5%(?) die space.
>
> The second set of integer pipes cost 12.5% of die space, as it shares fetch,
> decode, and the huge FPU.


The area and performance numbers for Bulldozer are scattered around
various slidesets, and do not appear to be consistent. We will have to
wait for more details.

There are several design points worth comparing, real and hypothetical.

Old system (K10?)
* on old process
* if, hypothetically, upgraded to new process

Bulldozer, MCMT-2: 2 integer "cores", shared front end, FP, L2

Bulldozer, strictly single core with L2:
* approximate by subtracting the area of 1 integer core, plus any
duplicated resources in front end.
* i.e. 1 integer "core" + shared front end + FP + L2 - fudge for other
replicated stuff

Bulldozer, strictly double core with private L2:
* 2X the strict single core; i.e. private L2

Bulldozer, strictly double core with shared L2:
* 2 x (integer "core" + shared front end + FP - fudge for other)
+ L2 shared between the strict single cores
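
For concreteness, here is a small C sketch of that area accounting. All
the component areas are hypothetical placeholders (AMD has not published
consistent numbers), so only the relative comparisons mean anything:

  #include <stdio.h>

  int main(void) {
      /* Made-up areas in arbitrary units; placeholders only. */
      double int_core = 1.0;   /* one integer "core"       */
      double front    = 0.6;   /* shared front end         */
      double fp       = 0.8;   /* shared FPU               */
      double l2       = 1.2;   /* L2 cache                 */
      double fudge    = 0.1;   /* other replicated stuff   */

      double mcmt2       = 2*int_core + front + fp + l2;
      double strict_one  = int_core + front + fp + l2 - fudge;
      double dual_priv   = 2 * strict_one;             /* private L2s */
      double dual_shared = 2*(int_core + front + fp - fudge) + l2;

      printf("MCMT-2:                   %.2f\n", mcmt2);
      printf("strict single core:       %.2f\n", strict_one);
      printf("strict dual, private L2s: %.2f\n", dual_priv);
      printf("strict dual, shared L2:   %.2f\n", dual_shared);
      return 0;
  }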

nedbrek

Aug 25, 2010, 7:32:30 AM
Hello all,

"Brett Davis" <gg...@yahoo.com> wrote in message
news:ggtgp-2F5622.01163525082010@news.isp.giganews.com...

>> "AMD expects around 80% of the performance of a traditional dual core
>> part
>> in about 50% to 60% of the size and power consumption."
>

> Methinks you and/or that reporter confused that quote.
> Turn off the second CPU and you get 100%, turn on the second CPU and get
> 180%.
> Hyperthreading gives you +-5%, for about 5%(?) die space.
>

> I want to know how it compares to K10. Only two AGUs and two ALUs, versus
> three of each for K10. AMD has gone for even heavier microOps. (Smart move,
> saves power.)

100% of something less than K10. With only 4 integer pipes rather than 6,
it will be less. I'd be surprised if it was 20% less, though...

They can't use both sides on a single thread, can they?

Ned


Andy Glew

Aug 25, 2010, 10:22:34 AM

No.

comp.arch has seen my discussion of attempts to do this sort of thing,
such as batched instruction execution (execute one batch on a first
cluster (AMD terminology: core), and the next batch on the next).
Similarly SpMT, where MCMT is motivated primarily as a way of making
thread migration easier, and/or isolating the non-speculative thread
from the speculative thread(s). Sharing the renamer between the
clusters/cores facilitates these single-thread optimizations.

Brett Davis

Aug 26, 2010, 12:38:20 AM
In article <ggtgp-2F5622....@news.isp.giganews.com>,
Brett Davis <gg...@yahoo.com> wrote:
> Methinks you and/or that reporter confused that quote.
> Turn off the second CPU and you get 100%, turn on the second CPU and get 180%.
> Hyperthreading gives you +-5%, for about 5%(?) die space.
>
> The second set of integer pipes cost 12.5% of die space, as it shares fetch,
> decode, and the huge FPU.
>
> Intel would claim 100% for the second core, because someone will get that, but
> AMD cannot get away with promising the moon and delivering useless crap like
> Hyperthreading. (Which Intel claims 100% speedups for.)
> AMD just does not have the marketing money to buy off the press like Intel.
>
> I want to know how it compares to K10. Only two AGUs and two ALUs, versus three
> of each for K10. AMD has gone for even heavier microOps. (Smart move, saves power.)
>
> http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010
> http://www.anandtech.com/show/3865/amd-bobcat-bulldozer-hot-chips-presentations-online

K10 has one major bottleneck outside the issue pipeline that limits
instructions per cycle: the 16 byte decode unit will give you 3.5
instructions per cycle on average, less for SSE code, as few as 2.5.
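
(That 3.5 falls out of the fetch width if you assume an average x86
instruction length of around 4.5 bytes: 16 / 4.5 ≈ 3.5 instructions per
fetch. SSE instructions carry prefixes and run longer, so the number
drops. The 4.5-byte figure is my assumption, not from the slides.)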

A 32 byte decode unit will be idle more than 50% of the time on average.
Huge die area and a huge win to share.

The K10 retirement unit can only retire 3 instructions a cycle, Bulldozer
will do 4.

The third AGU was never used, waste of die area and heat.

The third ALU is of more concern, Intel will standardize benchmarks to
make this look bad, even though I know it was used 1% on average.

AMD now has separate load and store pipelines, this can be a huge advantage.
For every 90 instructions on average you will have 60 integer ops, 20 loads,
and 10 stores.
Load ops are executed as soon as the AGU has the data to compute the address,
and do not need to stick around waiting dozens of cycles for the data to
actually load. A queue can take care of managing the data from cache, and
loading that data into a renamed register, say AX[25]. You do not care that
the ALU is still writing to, say, AX[13].

So while the ALU executes its first 40 instructions, the load pipe may
have already emptied all 20 of its loads. The ALU then has all the data
it needs to run full out on the next 20 instructions.
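
Here is a toy C model of that decoupling; it is purely illustrative, and
the queue depth and load latency are numbers I made up, not anything AMD
has published:

  #include <stdio.h>

  #define LOAD_LATENCY 12   /* hypothetical cycles from AGU to data */
  #define NUM_LOADS    20

  int main(void) {
      int ready[NUM_LOADS];

      /* AGU issues one load per cycle; data lands LOAD_LATENCY cycles
         later in a renamed register, without blocking the pipe. */
      for (int i = 0; i < NUM_LOADS; i++)
          ready[i] = i + LOAD_LATENCY;

      /* The ALU starts consuming loaded values only after chewing
         through its 40 integer ops, one value per cycle thereafter. */
      int stall = 0, cycle = 40;
      for (int i = 0; i < NUM_LOADS; i++, cycle++)
          if (ready[i] > cycle) { stall += ready[i] - cycle; cycle = ready[i]; }

      printf("total stall cycles: %d\n", stall);   /* prints 0 here */
      return 0;
  }

With these made-up numbers the last load is ready at cycle 31, well before
the ALU wants it at cycle 59, which is Brett's point about the load pipe
running ahead.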

Compared to K10 which for a straw man comparison is one pipe three ALUs wide.
Even with OoO you are never going to get 10 loads ahead, and all that work
to push loads forward past integer ops costs power.

The branch unit is not on the Bulldozer slides, I assume it has its own
pipe that is not shown. So the average 4 way case would be 2 int, 1 load,
half a store and half a branch. This looks sustainable.

Bulldozer will be faster than K10, the question is how much, and how it
does against Core2 on single thread code. On Hyperthreaded code Intel will
get Bulldozered, except that Intel will demand reviewers use twice the
die size for Intel to match the AMD core count, and then Intel will win,
at twice the cost.

Brett

Terje Mathisen

Aug 26, 2010, 4:30:18 AM
Brett Davis wrote:
> The third ALU is of more concern, Intel will standardize benchmarks to
> make this look bad, even though I know it was used 1% on average.

Depends on the code and the programmers.


>
> AMD now has separate load and store pipelines, this can be a huge advantage.
> For every 90 instructions on average you will have 60 integer ops, 20 loads,
> and 10 stores.

Have you got a reference for those numbers? Particularly the 2:1 load vs
store ratio?

What kind of code? Which cpu?

In my experience high-performance x86 code tended to use a lot of lookup
tables, tilting the balance far more toward the load side.

Currently, with SIMD code used for many/most of those high-perf tasks,
lookup tables have gone out of fashion again, but it still seems like
two loads, one store would only fit simple fp kernels?

If you have two combined load/store pipes you _will_ get better
performance than from two single-task dedicated pipes, but those
combined pipes will also be a bit more complicated.

> Bulldozer will be faster than K10, the question is how much, and how it
> does against Core2 on single thread code. On Hyperthreaded code Intel will
> get Bulldozered, except that Intel will demand reviewers use twice the
> die size for Intel to match the AMD core count, and then Intel will win,
> at twice the cost.

Somewhat cynical view?
:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

nedbrek

Aug 26, 2010, 7:08:04 AM

"Brett Davis" <gg...@yahoo.com> wrote in message
news:ggtgp-930DEE....@news.isp.giganews.com...

>
> Bulldozer will be faster than K10, the question is how much

That's good to hear. Now that they have gone MCMT, it is an evolutionary
step to using both clusters on a single thread. Although I am not
optimistic it will ever happen...

Ned


Tim McCaffrey

Aug 26, 2010, 3:17:56 PM
In article <i55ji6$9kk$1...@news.eternal-september.org>, ned...@yahoo.com
says...

I'm guessing Bulldozer will be faster as well, because the load/store units
have been optimized (loads & stores are allowed to pass each other in some
cases), and because of a line on one of the slides saying they got rid of the
beginning-of-instruction markers in the L1 cache, which implies they can
always decode 4 instructions at a time even when fetching from memory.

Assuming you have integer-only code, why only 80%? Is it because the
fetch/decoder is shared, or because the caches are shared and there is access
contention? (Actually, that would make sense; that is about the scaling factor
for an old dual-processor mainframe. I think that is Amdahl's Law showing up
again.)
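
(Plugging that into Amdahl's Law as a sanity check: 2-way speedup
S = 1 / ((1 - p) + p/2), so a 1.8x scaling factor corresponds to
p ≈ 0.89, i.e. roughly 11% of the work serialized on shared resources.
My arithmetic, not anything from AMD's slides.)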

Post-Bulldozer: split the caches, don't share the instruction decoder,
increase the cache sizes (16K seems a bit small for the L1 I-cache).

I find Bobcat more interesting, really; it looks like the K6 got resurrected.
I really liked my K6-III system...

- Tim

MitchAlsup

Aug 26, 2010, 3:37:49 PM
On Aug 25, 11:38 pm, Brett Davis <gg...@yahoo.com> wrote:
> In article <ggtgp-2F5622.01163525082...@news.isp.giganews.com>,
>  Brett Davis <gg...@yahoo.com> wrote:

> K10 has one major bottleneck outside the issue pipeline that limits
> instructions per cycle: the 16 byte decode unit will give you 3.5
> instructions per cycle on average, less for SSE code, as few as 2.5.
>
> A 32 byte decode unit will be idle more than 50% of the time on average.
> Huge die area and a huge win to share.
>
> The K10 retirement unit can only retire 3 instructions a cycle, Bulldozer
> will do 4.

(Ahem) K10 is BullDozer, K8 is Opteron and follow-ons.

> The third AGU was never used, waste of die area and heat.

The issue was that the 3rd unit was used a lot, only to run into the
DataCache having only two ports. This caused sequencing issues.

> The third ALU is of more concern, Intel will standardize benchmarks to
> make this look bad, even though I know it was used 1% on average.

So what else is new.

> AMD now has separate load and store pipelines, this can be a huge advantage.
> For every 90 instructions on average you will have 60 integer ops, 20 loads,
> and 10 stores.

We measured very close to 50% of x86 instructions having memory
reference attachments. So, for every 90 x86 instructions, one would
expect 45 memory references with a general ratio of just over 2 reads
to 1 write. Thus, I would expect 30-33 reads and 12-15 writes.

> The branch unit is not on the Bulldozer slides,

We always put these in the ALUs, with means to redirect the front end
on discovery of a mispredict.

> Bulldozer will be faster than K10, the question is how much,

When I left, BD was supposed to be 20-25% faster frequency wise, and
lose a little architectural figure (5%-ish) of merit due to the
microarchitecture. The surprising thing was the lack of mention of
frequency in the market-droid-ing.

Mitch

Anton Ertl

Aug 26, 2010, 3:57:58 PM
MitchAlsup <Mitch...@aol.com> writes:
>(Ahem) K10 is BullDozer, K8 is Opteron and follow-ons.

In many web sites, and also in the press, K10 is Barcelona and its
offspring. See http://en.wikipedia.org/wiki/AMD_K10#Nomenclatures.

For Bulldozer AMD marketing finally got rid of K numbers.

I always found the K and P numbers pretty useful. Code names like
Willamette and Barcelona refer more to a particular chip than to a
microarchitecture.

>When I left, BD was supposed to be 20-25% faster frequency wise, and
>lose a little architectural figure (5%-ish) of merit due to the
>microarchitecture. The surprising thing was the lack of mention of
>frequency in the market-droid-ing.

The usual thing: Nobody knows final numbers, and giving numbers is
unnecessary for marketing at this stage. If they give numbers, and
they are too high, this would generate bad press at release. If they
give low numbers, that will doom their marketing effort right away.
So they don't give numbers.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

nedbrek

Aug 26, 2010, 7:44:53 PM
Hello all,

"MitchAlsup" <Mitch...@aol.com> wrote in message
news:7e7794d2-ef3d-4059...@v8g2000yqe.googlegroups.com...


> On Aug 25, 11:38 pm, Brett Davis <gg...@yahoo.com> wrote:
>> In article <ggtgp-2F5622.01163525082...@news.isp.giganews.com>,
>> Brett Davis <gg...@yahoo.com> wrote:
>
>> The third AGU was never used, waste of die area and heat.
>
> The issue was that the 3rd unit was used a lot, only to run into the
> DataCache having only two ports. This caused sequencing issues.

Too bad. I would hope the 3 AGUs could enable 2 load + 1 store per cycle
(let the stores drain on long memory waits and branch mispredicts)... of
course, weird beat patterns seem inescapable.

>> Bulldozer will be faster than K10, the question is how much,
>
> When I left, BD was supposed to be 20-25% faster frequency wise, and
> lose a little architectural figure (5%-ish) of merit due to the
> microarchitecture. The surprising thing was the lack of mention of
> frequency in the market-droid-ing.

That's a lot of frequency! Is that gain from process shrink (even just
optimization for the process)?

Do we expect the frequency to be kept low due to 2x "cores"? (clusters, this
is going to be just as painful as L0 cache!)

Thanks for that!
Ned


MitchAlsup

Aug 26, 2010, 8:17:08 PM
On Aug 26, 6:44 pm, "nedbrek" <nedb...@yahoo.com> wrote:

> > When I left, BD was supposed to be 20-25% faster frequency wise, and
> > lose a little architectural figure (5%-ish) of merit due to the
> > microarchitecture. The surprising thing was the lack of mention of
> > frequency in the market-droid-ing.
>
> That's a lot of frequency!  Is that gain from process shrink (even just
> optimization for the process)?

BD was to use a 12-gate pipeline, while Athlon used a 16-gate pipe and
Opteron used a 17-gate pipe. {Add 5 more gates for flop, jitter and
skew to arrive at actual cycle time.} Process shrink is on top of this.
{K9 was to use an 8-gate pipeline.}

Most of what got cut was cut to enable the 12-gate pipe (if indeed
they did achieve that). In Athlon/Opteron, one can forward a byte,
word, double, or quad from any of the 5 results to any operand of any
of the 6 integer computation units {ALU, AGU}. BD can't (or couldn't when
I left) forward anything to anywhere, and eats a little AFoM because
of this. This probably saved 2 real gate delays. Lopping off the extra
ALU, and a few other things, saves another gate and we are then within
spitting distance (1 gate) of the desired 12-gate pipe in the integer
pipe. More lopping occurred in the L1 cache pipe to reach the cycle time
goal.
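
(Checking the arithmetic: with the 5 extra gates for flop, jitter and
skew, Athlon's cycle is 16 + 5 = 21 gate delays and Opteron's is 22,
versus 12 + 5 = 17 for BD. That gives 21/17 ≈ 1.24 and 22/17 ≈ 1.29,
roughly consistent with the 20-25% frequency claim above, before any
process gains.)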

Mitch

Andy Glew

Aug 26, 2010, 11:05:36 PM
On 8/26/2010 4:44 PM, nedbrek wrote:
> Do we expect the frequency to be kept low due to 2x "cores"? (clusters, this
> is going to be just as painful as L0 cache!)


Even some of the Hotchips slides talk about voltage gating the entire
core, aka the entire module (which they now call the cluster).

I.e. it is pretty obvious that the change in terminology was marketing
driven.

Brett Davis

Aug 26, 2010, 11:30:49 PM
In article
<7e7794d2-ef3d-4059...@v8g2000yqe.googlegroups.com>,
MitchAlsup <Mitch...@aol.com> wrote:
> On Aug 25, 11:38 pm, Brett Davis <gg...@yahoo.com> wrote:
> > In article <ggtgp-2F5622.01163525082...@news.isp.giganews.com>,
> >  Brett Davis <gg...@yahoo.com> wrote:
> > AMD now has separate load and store pipelines, this can be a huge advantage.
> > For every 90 instructions on average you will have 60 integer ops, 20 loads,
> > and 10 stores.
>
> We measured very close to 50% of x86 instructions having memory
> reference attachments. So, for every 90 x86 instructions, one would
> expect 45 memory references with a general ratio of just over 2 reads
> to 1 write. Thus, I would expect 30-33 reads and 12-15 writes.

I used RISC numbers for comp/load/store, your number sounds right for
8086/80386, but AMD64 mode has 16 registers. I assume the compilers
use those registers, and now x64 should have close to RISC ratios.

After all the function call interface for x64 has a bunch of values
passed in registers instead of on the stack like the 80386 did it.

The average RISC function will never use more than 16 registers,
32 is actually wasteful and arguably unproductive.

Brett

Nicholas King

Aug 27, 2010, 10:50:21 AM
On 08/27/2010 01:30 PM, Brett Davis wrote:
> After all the function call interface for x64 has a bunch of values
> passed in registers instead of on the stack like the 80386 did it.
>
> The average RISC function will never use more than 16 registers,
> 32 is actually wasteful and arguably unproductive.
What about interprocedural optimisations? If we force code to spill to
memory on a function call, then we have the cost of accessing cache
and the load on the load/store pipeline. I'm led to wonder how much of
modern processor design is influenced by past decisions and legacy systems.

Is going from 16 to 32 registers really that much of a cost? And does
choosing 16 registers require more work in the various schedulers and
register renaming? It's interesting to see that IBM has chosen to go with
128 registers for its VMX vector instructions in the POWER7 design.

I'm personally a big fan of JIT re-optimisation of code based upon
performance and likelihood statistics. I'd love to see the CPU, OS,
language, and compiler ecosystem better support it.

Cheers,
Ze (Nicholas King)

MitchAlsup

Aug 27, 2010, 12:49:17 PM
On Aug 26, 10:30 pm, Brett Davis <gg...@yahoo.com> wrote:
> I used RISC numbers for comp/load/store, your number sounds right for
> 8086/80386, but AMD64 mode has 16 registers. I assume the compilers
> use those registers, and now x64 should have close to RISC ratios.

Also note that Load-Op instructions lessen the need for 32 registers
compared to the OP-only RISC architectures. This Load-Op (and 2-
register) instruction set gives the compiler the illusion that inbound
memrefs are cheap (especially when the flags are set for code
density). So, while the ratio may have headed in the direction of the
RISC architectures, x86 still ends up with more memrefs.
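
A two-line illustration of what Load-Op buys; the instruction sequences
are illustrative, not from any particular compiler:

  ; x86: the load folds into the op, no named temporary needed
  add  rax, [rbx+8]

  ; OP-only RISC: the load needs its own architectural register
  ld   r5, 8(r4)
  add  r3, r3, r5

Every folded memref is one architectural register the compiler does not
have to find, which is part of why 16 registers hurt x86 less than the
RISC experience would suggest.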

> After all the function call interface for x64 has a bunch of values
> passed in registers instead of on the stack like the 80386 did it.
>
> The average RISC function will never use more than 16 registers,
> 32 is actually wasteful and arguably unproductive.

We had many heated discussions about providing 2-REX prefixes to enable
access to 32 registers (and different ways to encode REX-like stuff).
In the end, the statistics indicated that while an 8-register
architecture suffered a 25%-odd degradation compared to a 32-register
architecture, a 16-register architecture only suffered a 3%
degradation. It seemed good enough.

Mitch

Jeremy Linton

unread,
Aug 27, 2010, 1:52:37 PM8/27/10
to
On 8/27/2010 11:49 AM, MitchAlsup wrote:
> Also note that Load-Op instructions lessen the need for 32 registers
> compared to the OP-only RISC architectures. This Load-Op (and 2-
> [...]
> architecture suffered a 25%-odd degradation compared to a 32-register
> architecture, a 16-register architecture only suffered a 3%
> degradation. It seemed good enough.

I've heard similar statistics before, but to be clear: is the "3%" for a
machine with Load-Op instructions, or for one that can only do reg-reg
operations?

Brett Davis

Aug 29, 2010, 7:22:18 PM
In article <4c77d0b4$0$11128$c3e...@news.astraweb.com>,
Nicholas King <z...@zerandconsulting.com> wrote:

> On 08/27/2010 01:30 PM, Brett Davis wrote:
> > After all the function call interface for x64 has a bunch of values
> > passed in registers instead of on the stack like the 80386 did it.
> >
> > The average RISC function will never use more than 16 registers,
> > 32 is actually wasteful and arguably unproductive.
> What about interprocedural optimisations? If we force code to spill to
> memory on a function call, then we have the cost of accessing cache
> and the load on the load/store pipeline. I'm led to wonder how much of
> modern processor design is influenced by past decisions and legacy systems.
>
> Is going from 16 to 32 registers really that much of a cost? And does
> choosing 16 registers require more work in the various schedulers and
> register renaming? It's interesting to see that IBM has chosen to go with
> 128 registers for its VMX vector instructions in the POWER7 design.

32 registers is cheap, and it fitted well into 32-bit fixed-width
three-operand opcodes back in the day. VMX and friends change things.

Everyone but ARM went with 32 registers; ARM used 4 bits to make all
opcodes conditional, and thus went with 16 registers to save 3 bits.

ARM won the RISC battle for second place behind x86, just as the
RISC idea died. The last new RISC chip was Thumb1; there will never
be another fixed-width RISC design of note.


Vector register count is a whole different boat. You are always going
to be running long complex calculations with few loads and stores.
Now Intel would make up for its small register count by streaming
temp values back and forth through the L1. This costs heat, lots of
heat, as the L1 is so much farther away than the registers are, and half
your power use is reading registers, not the ALU.

If you want to make a case for Larrabee being doomed, this is it.

Larrabee2 could radically up the register count, but then you still
face the doom of super long vector registers just not being the way
forward. NVidia and ATI have both moved away from long vectors.

We have moved past the "DOOM" game era, where polys had hundreds of pixels
that could share long vector calculations. Games are moving
toward subpixel RenderMan-type calculations, which are not long-vector friendly.

With $100 million in software cost maybe you could make long
vectors work, but by the time the software is written the industry
will have moved on to yet another rendering-engine style, and then
you will have to spend another $100 million on software; spin, repeat.

> I'm personally a big fan of JIT re-optimisation of code based upon
> performance and likelihood statistics. I'd love to see the CPU, OS,
> Language and Compiler eco-system better support it.

We are going to get this regardless of CPU and language.

Brett
