
Multithreading: to share or not to share?


Paul A. Clayton

Oct 7, 2004, 9:20:03 AM
Much of the debate seems to be based on the expected form and
quantity of available parallelism and the costs and benefits of
various forms and quantities of sharing. Obviously, the benefit
of a particular form and quantity of sharing depends on the form
and quantity of application parallelism. It also seems obvious
that much of the low-hanging fruit has been plucked for ILP and
increasingly aggressive designs (greater design and production
costs) will bring less relative performance benefit.
Furthermore, both higher volume (e.g., x86 desktop) and higher
margin (e.g., POWER-based servers) products can support larger
design costs, and flexibility in application targeting can
increase volume while narrowed targeting can increase
attractiveness to a specific market, increasing margins in that
market.

If it is known that a large number of threads will be active (or
at least will be under performance-demanding conditions), many
optimizations can be made to exploit TLP and avoid some
complexity of targeting ILP. E.g., interleaving threads in a
single core can allow forwarding paths to be removed. Such also
allows a multi-cycle cache access to appear as a single-cycle or
zero-cycle access, allowing even greater flexibility in cache
design for lower power consumption or higher hit rate. (Network
processors and cache-coherence processors might fall into this
category.) Simple replication of cores is another option, which
has design cost advantages but can be less efficient in energy
consumption and die area.
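
(A toy model of the interleaving claim above; not from the original
post, and both constants are made up. With N threads issuing
round-robin, a load taking up to N cycles completes before the same
thread's next issue slot, so forwarding paths and stall logic can go.)

/* Toy barrel-processor model: NTHREADS and CACHE_CYCLES are assumed. */
#include <stdio.h>

#define NTHREADS     4
#define CACHE_CYCLES 3

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++) {
        int tid   = cycle % NTHREADS;      /* round-robin issue slot   */
        int ready = cycle + CACHE_CYCLES;  /* load result available    */
        int next  = cycle + NTHREADS;      /* same thread's next slot  */
        printf("cycle %d: thread %d loads; ready @%d, next issue @%d: %s\n",
               cycle, tid, ready, next,
               ready <= next ? "no stall" : "stall");
    }
    return 0;
}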

If it is known that several threads will be cooperative, it may
be possible to share certain register values, particularly
addresses. (With virtual address translation, it even becomes
somewhat reasonable for unrelated processes to share read-only
base pointers for global data areas. Of course, a zero-register
can be freely shared.) It also becomes attractive to provide
fast communication between such threads, which argues for physical
proximity.

If it is known that longish stalls are somewhat common and
runnable threads are plentiful, then SoEMT becomes attractive.
By knowing that a stall will be relatively long, it becomes
efficient to transfer state to storage which is less expensive in
terms of area and power consumption (highly banked, single-ported
memory with a content-retaining drowsy mode). (In some sense,
shadow register subsets for interrupts are a form of SoEMT.)
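
(A minimal sketch of SoEMT as described above; mine, with illustrative
names and sizes. On an event known to stall for a long time, the
architectural registers are spilled to the cheaper banked storage and
another ready thread is resumed.)

/* SoEMT sketch: all names and sizes here are illustrative. */
#include <stdio.h>
#include <string.h>

#define NREGS    32
#define NTHREADS 4

static int backing[NTHREADS][NREGS]; /* banked, single-ported, drowsy */
static int regfile[NREGS];           /* the one active register file  */
static int current = 0;

static void switch_on_event(int next)
{
    memcpy(backing[current], regfile, sizeof regfile); /* spill state */
    memcpy(regfile, backing[next], sizeof regfile);    /* fill state  */
    current = next;
    printf("long stall: now running thread %d\n", current);
}

int main(void)
{
    switch_on_event(1); /* e.g., an L2 miss signalled for thread 0 */
    switch_on_event(2);
    return 0;
}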

If the presence of TLP is uncertain, SMT becomes attractive.
Although SMT requires a greater design investment than other
forms of exploiting TLP, because it converts TLP into ILP it is
suited to an aggressive ILP-oriented design.

In addition to variations in sharing of instruction caches,
decoding logic, instruction buffering, functional units,
registers, register values, data caches, I/O interfaces, and the
deeper parts of the memory hierarchy, there is also the potential
for variations in sharing of portions of these areas. E.g., a
pair of cores might have separate user-level Icaches but share a
common supervisor/interrupt Icache. By distributing the cost
over more threads of execution, it can become more reasonable to
implement certain relatively rarely used functions in hardware.
Of course, any form of sharing presents the possibility of
resource contention.

Paul A. Clayton
(a 'Dysthymicdolt' reachable at aol.com)


Mitch Alsup

unread,
Oct 7, 2004, 10:43:35 PM10/7/04
to
Let us postulate a fair comparison, and since I happen to be composing
this, let's use data I am familiar with. Disclaimer: all data herein is
illustrative.

The core size of an Athlon or Opteron is about 12 times the size of
the data cache (or instruction cache) of Athlon or Opteron. I happen
to know that one can build a 486-like processor* in less area than
the data cache of Athlon, and that this 486-like core could run
between 75% and 85% of the frequency of Opteron.

[*] 7 stage pipeline, 1-wide, in-order, x86 to SSE3 instruction set.

Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor
is a 0.5 IPC machine. (At this point you see that we have spent the last
15 years in microprocessor development getting that last factor of 2 and
it has cost us around 12X in silicon real estate...)

          CPUs   IPC/CPU   Frequency   IPC*Freq   IPC*Freq*CPU
Opteron      1      1.0     2.4 GHz        2.4          2.4
486-like    12      0.5     2.0 GHz        1.0         12.0

If you really want to get into the game of large thread count MPs:
smaller, slower, less complicated in-order blocking cores deliver
more performance per area and more performance per Watt than any
of the current SMT/CMP hype.

Let's look at why:

Uniprocessor            Best Case   Typical Case   Worst Case
DRAM access time*:          42 ns          58 ns      120+ ns
CPU cycles @ 2.0 GHz           84            116          240

Multiprocessor
DRAM access time*:        103 ns**        103 ns       500 ns
CPU cycles @ 2.0 GHz          206            206         1000

[*] as seen in the execution pipeline
[**] best case is coherence bound, not memory access time bound.

One needs a very large L2 cache to usefully ameliorate these kinds
of main memory latencies: something with a miss rate on the order of
a fraction of 1%. L2 cache miss rates on commercial workloads (64
GBytes of main memory, 1 TByte commercial database, thousands of
disks in multiple RAID channels, current database software):
           Miss Rate   L2 miss CPI cost
1 MB          5%+            10.3
2 MB          4%-ish          8.2
4 MB          3%-ish          6.2
8 MB          2%-ish          4.1

So the fancy OoO core goes limping along at 0.2 BIPS while the itty
bitty 486-like core goes limping along at 0.17 BIPS. And you get 12 of
them! So the measly 5X advantage above becomes a 10X advantage in the
face of bad cache behavior.
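
(To spell out the arithmetic behind these figures, using the same
illustrative numbers: effective CPI = base CPI + miss rate * miss
penalty, and throughput = cores * frequency / CPI.)

/* Quick check of the figures above; all inputs are the illustrative
 * numbers from this post, not measurements. */
#include <stdio.h>

static double bips(int cores, double base_cpi, double ghz,
                   double miss_rate, double penalty_cycles)
{
    double cpi = base_cpi + miss_rate * penalty_cycles;
    return cores * ghz / cpi;    /* billions of instructions/second */
}

int main(void)
{
    double penalty = 206.0;  /* MP DRAM: 103ns at 2.0 GHz     */
    double miss    = 0.05;   /* 1 MB L2 on the workload above */
    double ooo     = bips(1, 1.0, 2.4, miss, penalty);
    double small   = bips(12, 2.0, 2.0, miss, penalty);
    printf("big OoO core:  %.2f BIPS\n", ooo);      /* ~0.21 */
    printf("12x 486-like:  %.2f BIPS\n", small);    /* ~1.95 */
    printf("advantage:     %.1fX\n", small / ooo);  /* ~9.2X */
    return 0;
}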

Now if I were to postulate sharing the FP/MMX/SSE units between two
486-like cores, I can get 18 of them in the same footprint as the
Opteron core.

I can also postulate what the modern instruction set additions have done to
processor area: Leave out MMX/SSE and the 486-like size drops to 1/18 of
an Opteron core.

The problem at this instant in time is that very few benchmarks have
enough thread level parallelism to enable a company such as Intel or
AMD to embark on such a (radical) path.

Mitch
#include <std.disclaimer>

Terje Mathisen

Oct 8, 2004, 4:23:12 AM
Mitch Alsup wrote:

> Let us postulate a fair comparison, and since I happen to be composing
> this, let's use data I am familiar with. Disclaimer: all data herein is
> illustrative.

[huge snip]

> I can also postulate what the modern instruction set additions have done to
> processor area: Leave out MMX/SSE and the 486-like size drops to 1/18 of
> an Opteron core.
>
> The problem at this instant in time is that very few benchmarks have
> enough thread level parallelism to enable a company such as Intel or
> AMD to embark on such a (radical) path.

Thanks Mitch!

That was a very good post. (c.arch post of the month?)

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Ketil Malde

Oct 8, 2004, 5:55:32 AM
Mitch...@aol.com (Mitch Alsup) writes:

> I happen to know that one can build a 486-like processor* in less
> area than the data cache of Athlon, and that this 486-like core
> could run between 75% and 85% of the frequency of Opteron.

Side note: isn't this more or less what Via is building? AFAIK,
they're only at a little over 1GHz - inferior process technology?

> The problem at this instant in time is that very few benchmarks have
> enough thread level parallelism to enable a company such as Intel or
> AMD to embark on such a (radical) path.

So, I guess the problem is, at least in part, that too much code and
in particular, too many benchmarks have too little parallelism to risk
a design with 18 "486"s. And a single threaded benchmark will run at
half speed or so.

The obvious question then becomes: instead of dual-core Opterons, what
about an asymmetric design with one Opteron core and 18 "486-like"
cores? Single thread gets same performance as usual, while
multi-thread can get a larger benefit, as long as multi > 3.

I guess I'm ignoring a bunch of difficult issues, but would it be
impossible?

(ISTR somebody suggesting a similar design, with one core doing nasty
kernel stuff, and one or more "application" cores running simpler
application stuff - which perhaps would be a similar thing?)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Anton Ertl

Oct 8, 2004, 6:10:39 AM
Ketil Malde <ke...@ii.uib.no> writes:
>Mitch...@aol.com (Mitch Alsup) writes:
>
>> I happen to know that one can build a 486-like processor* in less
>> area than the data cache of Athlon, and that this 486-like core
>> could run between 75% and 85% of the frequency of Opteron.
>
>Side note: isn't this more or less what Via is building? AFAIK,
>they're only at a little over 1GHz

Right, and they need twelve stages to get there, not seven.

> - inferior process technology?

Or inferior circuit design (synthesized?). Probably both.

>The obvious question then becomes: instead of dual-core Opterons, what
>about an asymmetric design with one Opteron core and 18 "486-like"
>cores? Single thread gets same performance as usual, while
>multi-thread can get a larger benefit, as long as multi > 3.

There was a talk at the last ISCA that explored this with Alphas.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

Oct 8, 2004, 6:14:44 AM
Mitch...@aol.com (Mitch Alsup) writes:
>The core size of an Athlon or Opteron is about 12 times the size of
>the data cache (or instruction cache) of Athlon or Opteron. I happen
>to know that one can build a 486-like processor* in less area than
>the data cache of Athlon, and that this 486-like core could run
>between 75% and 85% of the frequency of Opteron.
>
>[*] 7 stage pipeline, 1-wide, in-order, x86 to SSE3 instruction set.
>
>Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor
>is a 0.5 IPC machine. (At this point you see that we have spent the last
>15 years in microprocessor development getting that last factor of 2 and
>it has cost us around 12X in silicon real estate...)
>
> CPUs IPC/CPU Frequency IPC*Freq IPC*Freq*CPU
>Opteron 1 1.0 2.4 GHz 2.4 2.4
>486-like 12 0.5 2.0 GHz 1.0 12.0

Without data cache the 486-likes will not get 0.5 IPC, so you can
probably only use 6 or 8 486-likes in that area. Also, the L2 cache
or the main memory interface would have to be shared between the
486-likes, which requires some arbitration circuitry. Also, L2 cache
contention (or, for small, nonshared L2s, capacity misses and main
memory contention) between the 486-likes would reduce the IPC of the
486-likes. As a rough guess, I would expect something like this:

          CPUs   IPC/CPU   Frequency   IPC*Freq   IPC*Freq*CPU
Opteron      1      1.0     2.4 GHz        2.4          2.4
486-like     6      0.3     2.0 GHz        0.6          3.6

>If you really want to get into the game of large thread count MPs:
>smaller, slower, less complicated in-order blocking cores deliver
>more performance per area and more performance per Watt than any
>of the current SMT/CMP hype.

My guess is that for this goal you could add MT to the 486-like
relatively cheaply, and get a vast improvement in potential thread
counts, resulting in something like this (4 threads per core):

          CPUs     IPC/CPU    Frequency   IPC*Freq   IPC*Freq*CPU
Opteron      1        1.0      2.4 GHz        2.4          2.4
486-like  5(*4)     0.2(*4)    2.0 GHz     0.4(*4)         8.0

I guess Sun is doing something like this in Niagara.
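
(A rough saturation model of my own, not Anton's, for where such
numbers come from: each thread alone keeps the 1-wide pipe busy a
fraction u of the time, so per-core IPC with t threads is roughly
min(t*u, peak). The u = 0.2, 5-core, 2 GHz figures are the guesses in
the table above.)

/* Thread-saturation sketch; u and peak are the guessed values above. */
#include <stdio.h>

int main(void)
{
    double u = 0.2, peak = 1.0;
    for (int t = 1; t <= 5; t++) {
        double ipc = t * u < peak ? t * u : peak; /* threads stack up */
        printf("%d thread(s)/core: IPC ~%.1f; 5 cores @ 2 GHz: %.1f BIPS\n",
               t, ipc, 5 * ipc * 2.0);           /* 4 threads -> 8.0 */
    }
    return 0;
}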

In any case, thanks for your posting.

Joe Seigh

Oct 8, 2004, 7:38:25 AM

Ketil Malde wrote:
>
> The obvious question then becomes: instead of dual-core Opterons, what
> about an asymmetric design with one Opteron core and 18 "486-like"
> cores? Single thread gets same performance as usual, while
> multi-thread can get a larger benefit, as long as multi > 3.
>
> I guess I'm ignoring a bunch of difficult issues, but would it be
> impossible?
>

The scheduler gets a lot more complicated/interesting. They've done
this on mainframes for quite a while, supporting asymmetric processor
features. And that included running processes on processors without
the feature as long as the process wasn't currently using the feature.

Joe Seigh

Paul A. Clayton

Oct 8, 2004, 9:34:54 AM
In article <e90782f7.04100...@posting.google.com>,
Mitch...@aol.com (Mitch Alsup) wrote:

[snip]


>Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor
>is a 0.5 IPC machine. (At this point you see that we have spent the last
>15 years in microprocessor development getting that last factor of 2 and
>it has cost us around 12X in silicon real estate...)
>
> CPUs IPC/CPU Frequency IPC*Freq IPC*Freq*CPU
>Opteron 1 1.0 2.4 GHz 2.4 2.4
>486-like 12 0.5 2.0 GHz 1.0 12.0

Are you excluding the die area cost of L1-caches? Although a
scalar processor would not need as much cache for the same
burden on L2 cache, with so many more cores, one might want
the caches larger than necessary for a per-cycle miss rate target
(and one would probably want multiple L2 caches). This might
cut the size benefit down significantly.

BTW, doesn't one have a strong incentive to expand the die size
anyway since off-chip bandwidth is limited by die size? Smarter
caching might be helpful (e.g., an on-die L3 cache that only holds
smallish [32B?] blocks that are non-stride prefetchable misses),
but just adding cache does not seem to bring much gain.

[snip]


>The problem at this instant in time is that very few benchmarks have
>enough thread level parallelism to enable a company such as Intel or
>AMD to embark on such a (radical) path.

And the benchmarks, in this case, seem to be representing the
typical workloads. (I suspect that high thread-count scaling will
require more hardware support for OS-level functions.)

(Thank you Mitch for making my post a stimulant to such an
interesting posting.)

Andy Freeman

Oct 8, 2004, 11:43:26 AM
Ketil Malde <ke...@ii.uib.no> wrote in message news:<egsm8p1...@dverghimalayaeiner.ii.uib.no>...

> The obvious question then becomes: instead of dual-core Opterons, what
> about an asymmetric design with one Opteron core and 18 "486-like"
> cores? Single thread gets same performance as usual, while
> multi-thread can get a larger benefit, as long as multi > 3.

That has the advantage of not being completely oblivious to Amdahl's law.

There are applications that scale nicely, where fast processors have no
advantage on small instances and large instances are nicely parallel.

However, it isn't clear that they're worth enough to support significant
design effort. (The opportunity is smaller than I described as many
of them can be run effectively on multi-box systems.)

There are probably more applications that would like as much single
processor performance as they can get, but can make some use of some
slower aux processors.

The problem is that there are some apps that need single processor performance
and can't make much use of slower aux processors. Since designing for them
also satisfies the other two groups....

Mitch Alsup

Oct 8, 2004, 12:02:04 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote in message news:<2004Oct...@mips.complang.tuwien.ac.at>...

> Mitch...@aol.com (Mitch Alsup) writes:
> >The core size of an Athlon or Opteron is about 12 times the size of
> >the data cache (or instruction cache) of Athlon or Opteron. I happen
> >to know that one can build a 486-like processor* in less area than
> >the data cache of Athlon, and that this 486-like core could run
> >between 75% and 85% of the frequency of Opteron.
> >
> >[*] 7 stage pipeline, 1-wide, in-order, x86 to SSE3 instruction set.
> >
> >Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor
> >is a 0.5 IPC machine. (At this point you see that we have spent the last
> >15 years in microprocessor development getting that last factor of 2 and
> >it has cost us around 12X in silicon real estate...)
> >
> > CPUs IPC/CPU Frequency IPC*Freq IPC*Freq*CPU
> >Opteron 1 1.0 2.4 GHz 2.4 2.4
> >486-like 12 0.5 2.0 GHz 1.0 12.0
>
> Without data cache the 486-likes will not get 0.5 IPC, so you can
> probably only use 6 or 8 486-likes in that area. Also, the L2 cache
> or the main memory interface would have to be shared between the
> 486-likes, which requires some arbitration circuitry. Also, L2 cache
> contention (or, for small, nonshared L2s, capacity misses and main
> memory contention) between the 486-likes would reduce the IPC of the
> 486-likes.

My posited 486-like cores do contain caches, just not the monsters in
Opteron. In addition, the L2 would be partitioned in such a way that
several 486-core misses could be in progress simultaneously to
different interleaves of that L2.

> As a rough guess, I would expect something like this:
>
> CPUs IPC/CPU Frequency IPC*Freq IPC*Freq*CPU
> Opteron 1 1.0 2.4 GHz 2.4 2.4
> 486-like 6 0.3 2.0 GHz 0.6 3.6
>
> >If you really want to get into the game of large thread count MPs:
> >smaller, slower, less complicated in-order blocking cores deliver
> >more performance per area and more performance per Watt than any
> >of the current SMT/CMP hype.
>
> My guess is that for this goal you could add MT to the 486-like
> relatively cheaply, and get a vast improvement in potential thread
> counts, resulting in something like this (4 threads per core):

I agree with the basic notion: adding MT/HT to a 486-core is vastly
easier than adding MT/HT to a great big monstrous core.

>
> CPUs IPC/CPU Frequency IPC*Freq IPC*Freq*CPU
> Opteron 1 1.0 2.4 GHz 2.4 2.4
> 486-like 5(*4) 0.2(*4) 2.0 GHz 0.4(*4) 8.0
>
> I guess Sun is doing something like this in Niagara.
>
> In any case, thanks for your posting.
>
> - anton

Mitch

John Dallman

Oct 8, 2004, 9:56:00 PM
In article <e90782f7.04100...@posting.google.com>,
Mitch...@aol.com (Mitch Alsup) wrote:

> The core size of an Athlon or Opteron is about 12 times the size of
> the data cache (or instruction cache) of Athlon or Opteron. I happen
> to know that one can build a 486-like processor* in less area than
> the data cache of Athlon, and that this 486-like core could run
> between 75% and 85% of the frequency of Opteron.

Which leads one to the interesting question of what that 486-like
processor would be like for single-threaded applications if there were
just one or two of them on the die, each with, say, 16MB of on-chip cache?
That lets a new class of problems fit into cache, and makes protecting the
cache against loss of code and data from competing OS processes an issue.

I've been wondering for a few days how many of the current problems would
go away if memory was suddenly only 1-2 clocks away, rather than up to
200. Mitch's post provides some rough figures for what could be
implemented today, and it's interesting. Of course, it won't run at 75% of
Opteron throughput, because of the lack of OOO, but it is a different way
to look at the problem.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Niels Jørgen Kruse

Oct 9, 2004, 7:38:21 AM
In article <memo.2004100...@jgd.compulink.co.uk>, j...@cix.co.uk
(John Dallman) wrote:

> In article <e90782f7.04100...@posting.google.com>,
> Mitch...@aol.com (Mitch Alsup) wrote:
>
>> The core size of an Athlon or Opteron is about 12 times the size of
>> the data cache (or instruction cache) of Athlon or Opteron. I happen
>> to know that one can build a 486-like processor* in less area than
>> the data cache of Athlon, and that this 486-like core could run
>> between 75% and 85% of the frequency of Opteron.

While power consumption is 1/12 of the Opteron?

> Which leads one to the interesting question of what that 486-like
> processor would be like for single-threaded applications if there were
> just one or two of it on the die, and it had, say, 16MB of on-chip cache?
> That lets a new class of problems fit into cache, and makes protecting the
> cache against loss of code and data from competing OS processes an issue.
>
> I've been wondering for a few days how many of the current problems would
> go away if memory was suddenly only 1-2 clocks away, rather than up to

The Itanium2 is closest to the 16MB you dreamed up. Surely you know how many
clocks the L3 is away on the Itanium2. If you have 16MB of cache, even a big
honking core is just a bump on the side.

> 200. Mitch's post provides some rough figures for what could be
> implemented today, and it's interesting. Of course, it won't run at 75% of
> Opteron throughput, because of the lack of OOO, but it is a different way
> to look at the problem.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

john jakson

Oct 9, 2004, 11:28:19 AM
j...@cix.co.uk (John Dallman) wrote in message news:<memo.2004100...@jgd.compulink.co.uk>...


Thanks Mitch for the fuel

Well, there are ways to speed up and max out the size of L2/L3 by
using something like RLDRAM (32MByte & more) which runs 8 banks at
20ns cycle time each, without a lot of the fuss that regular DDR has
on concurrency limits. If addresses can be randomized over the 8
banks, new requests can be issued almost every 2.5ns if not hitting
any same bank in the past 20ns.
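
(A toy simulation of that bank rule; my own sketch, assuming a random
bank mapping and dropping requests that hit a busy bank, where a real
controller would queue them.)

/* RLDRAM bank-rule toy: one issue slot per 2.5ns, each access busies
 * its bank for the 20ns (8-slot) bank cycle time. */
#include <stdio.h>
#include <stdlib.h>

#define NBANKS     8
#define BANK_SLOTS 8      /* 20ns bank cycle / 2.5ns issue slot */
#define SLOTS      1000

int main(void)
{
    long busy_until[NBANKS] = {0}; /* first slot each bank is free */
    long issued = 0;
    srand(1);
    for (long slot = 0; slot < SLOTS; slot++) {
        int bank = rand() % NBANKS;    /* randomized address mapping */
        if (busy_until[bank] <= slot) {
            busy_until[bank] = slot + BANK_SLOTS;
            issued++;
        }                              /* else: bank conflict, slot lost */
    }
    printf("issued on %ld of %d slots (%.0f%% of one-per-2.5ns peak)\n",
           issued, SLOTS, 100.0 * issued / SLOTS);
    return 0;
}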

If you have a PE that is 4-way threaded to hide local latencies, a few
of these PEs could be matched to the shared memory channel. Depending
on the intended workload type, the ratio might be from 1-10 PEs per
memory channel.

I think though I'd rather junk the 486 entirely and replace it with a
decent register-rich RISC and get maybe 2-4x the throughput per PE for
the same cost. If you go back to simpler 486 cores, the old RISC v.
CISC argument comes right back, doesn't it! A RISC designed around
4-way threading can get to 1 IPC fairly easily, hiding simple
latencies far more easily than hacking x86 yet again. If some threads
are idle for a few cycles, the performance loss is smaller than upping
the complexity 12x for only a 2x gain.

Also, if apps are going to be rewritten with N >> 12 cores in mind,
why insist on x86? We need new everything here, even new Par C (with
occam inside).


On the asymmetric side, it might just be easier to cop out and attach
either a ClearSpeed FLOP engine or an FPGA to Opteron on the HPC
front. Both are already available; either way would likely mean a huge
change in mindset to actually use the darn thing.

In the Octiga case I guess they add $1K worth of V2Pro FPGA to each
Opteron node.

I haven't heard of any ClearSpeed-type coprocessor except from
ClearSpeed themselves on PCI-X(?) cards, and that's a $1K chip too.

At the low end though, even a few $ of Spartan3 FPGA could do wonders
for some computing loads but I see no momentum to do this either.


I do have one app in mind that could soak up some of the PEs' time,
especially if not x86 based: JITing incoming x86 apps up to the mother
CPU. Kind of like having a Transmeta opened up so you can use it for
direct code translation and native RISC apps.


While on the subject, Transputers (T800 etc.) used about as much
silicon for the basic CPU core as they did for all 4 HW links; I think
you'd need to do the same with any large-N CPU, whether message
passing or shared memory.

just some thoughts

johnjakson_usa_com

Mitch Alsup

Oct 11, 2004, 11:49:55 AM
johnj...@yahoo.com (john jakson) wrote in message news:<adb3971c.04100...@posting.google.com>...

> j...@cix.co.uk (John Dallman) wrote in message news:<memo.2004100...@jgd.compulink.co.uk>...

> Thanks Mitch for the fuel

It's what pyros do.


>
> Well, there are ways to speed up and max out the size of L2/L3 by
> using something like RLDRAM (32MByte & more) which runs 8 banks at
> 20ns cycle time each, without a lot of the fuss that regular DDR has
> on concurrency limits. If addresses can be randomized over the 8
> banks, new requests can be issued almost every 2.5ns if not hitting
> any same bank in the past 20ns.

Let's take a modern x86 CPU and look at the L2 access time. Opteron's
L2 cache has an access time of 9 cycles. Of these 9 cycles, 7 are
simply wire delay into and out of the array, one is tag access, and
the other is output multiplexor selection for the 16 ways. The only
thing that will lessen this time is packing the array into a smaller
area! It is NOT a transistor delay problem!

Right now, with wire-limited L2s/L3s, as you quadruple the area you double
the wire delay. Since DRAM memory is so very far away, this makes sense
in the 1MB -> 4MB ranges, but may not make sense in the 4 MB -> 16MB
realm because on-die access time is increasing faster than on-die miss
rates are improving.
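
(A back-of-envelope model that matches those numbers: hold the 2
cycles of tag access and way mux fixed, and scale the 7 cycles of wire
with the linear dimension, i.e. with the square root of capacity.
Quadrupling the area then doubles the wire delay, as above.)

/* L2 latency model from the breakdown above: 2 fixed cycles plus
 * 7 cycles of wire at 1MB, wire scaling as sqrt(capacity). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (double mb = 1.0; mb <= 16.0; mb *= 2.0)
        printf("%5.0f MB L2: ~%4.1f cycles\n",
               mb, 2.0 + 7.0 * sqrt(mb));  /* 1MB -> 9, 4MB -> 16 */
    return 0;
}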

>
> If you have a PE that is 4way threaded to hide local latencies, a few
> of these PEs could be matched to the shared memory channel. Depending
> on intended workload type the PE ratio might be from 1-10 per mem
> channel.
>
> I think though I'd rather junk the 486 entirely and replace it with a
> decent register-rich RISC and get maybe 2-4x the throughput per PE for
> the same cost. If you go back to simpler 486 cores, the old RISC v.
> CISC argument comes right back, doesn't it!

Then you missed the whole gist of my argument. One can build a 486-like
core for less area than one can add a thread to a great big OoO core!

> A RISC designed around
> 4-way threading can get to 1 IPC fairly easily, hiding simple
> latencies far more easily than hacking x86 yet again.

Unless, of course, you have a library filled with older x86 cores.....

> If some threads
> are idle for a few cycles, the performance loss is smaller than upping
> the complexity 12x for only a 2x gain.
>
> Also, if apps are going to be rewritten with N >> 12 cores in mind,
> why insist on x86? We need new everything here, even new Par C (with
> occam inside).

I insist on x86s because they represent 94% of all CPUs sold last year,
have 7 Trillion dollars of infrastructure on which to leverage, and
the last decade has shown no persistent RISC design being able to keep
up in the performance race. This goes to the heart of the matter. x86
companies can simply invest more in CPU design than RISC companies
can afford to invest in RISC CPUs.

Mitch

Nick Maclaren

Oct 11, 2004, 1:18:49 PM
In article <e90782f7.04101...@posting.google.com>,

Mitch Alsup <Mitch...@aol.com> wrote:
>
>Let's take a modern x86 CPU and look at the L2 access time. Opteron's
>L2 cache has an access time of 9 cycles. Of these 9 cycles, 7 are
>simply wire delay into and out of the array, one is tag access, and
>the other is output multiplexor selection for the 16 ways. The only
>thing that will lessen this time is packing the array into a smaller
>area! It is NOT a transistor delay problem!

Thanks for those figures.

>Right now, with wire-limited L2s/L3s, as you quadruple the area you double
>the wire delay. Since DRAM memory is so very far away, this makes sense
>in the 1MB -> 4MB ranges, but may not make sense in the 4 MB -> 16MB
>realm because on-die access time is increasing faster than on-die miss
>rates are improving.

A prediction. In the near future, someone will have the bright idea
of splitting a cache level internally, and effectively having an L1.5
cache inside the L2 cache. It may well already have been done :-)


Regards,
Nick Maclaren.

john jakson

Oct 12, 2004, 9:37:50 AM
nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<ckef9p$c2d$1...@gemini.csx.cam.ac.uk>...

In the cache design book by Handy, I saw a reference to a Hitachi
cache, IIRC, with a combined memory architecture where the lower cache
tried to use the bottom end of a memory array; if it missed, it joined
up the remaining 90% of the array for another cycle.

regards

johnjakson_usa_com

Russell Wallace

Oct 12, 2004, 11:17:49 AM
On 11 Oct 2004 08:49:55 -0700, Mitch...@aol.com (Mitch Alsup) wrote:

>I insist on x86s because they represent 94% of all CPUs sold last year,
>have 7 Trillion dollars of infrastructure on which to leverage, and
>the last decade has shown no persistent RISC design being able to keep
>up in the performance race. This goes to the heart of the matter. x86
>companies can simply invest more in CPU design than RISC companies
>can afford to invest in RISC CPUs.

If you want small, cheap, low-heat-dissipation cores and you also want
the economics of x86 compatibility, Transmeta's approach might be
worth considering. Particularly if you've a bunch of cores all running
the same inner loop (common for some kinds of workloads) - then the
cost of decoding x86 byte code only needs to be paid once if done in
software, rather than N times if every core does it in hardware.

--
"Always look on the bright side of life."
To reply by email, remove the small snack from address.

john jakson

Oct 13, 2004, 11:41:38 AM
wallacet...@eircom.net (Russell Wallace) wrote in message news:<416bf532...@news.eircom.net>...

> On 11 Oct 2004 08:49:55 -0700, Mitch...@aol.com (Mitch Alsup) wrote:
>
> >I insist on x86s because they represent 94% of all CPUs sold last year,
> >have 7 Trillion dollars of infrastructure on which to leverage, and
> >the last decade has shown no persistent RISC design being able to keep
> >up in the performance race. This goes to the heart of the matter. x86
> >companies can simply invest more in CPU design than RISC companies
> >can afford to invest in RISC CPUs.
>
> If you want small, cheap, low-heat-dissipation cores and you also want
> the economics of x86 compatibility, Transmeta's approach might be
> worth considering. Particularly if you've a bunch of cores all running
> the same inner loop (common for some kinds of workloads) - then the
> cost of decoding x86 byte code only needs to be paid once if done in
> software, rather than N times if every core does it in hardware.

Exactly. What does it matter what the ISA is anymore when code
translation seems to be in pretty good shape: DEC/HP/Transmeta/Sun
(any others?). If you want lots of cores, it makes more sense to
design for the future of parallel computing, not the DOS past. Why on
earth be plagued with tiny register files, and so on. At the least,
the x86 could be deprecated to the common RISC-like opcodes now mostly
used, with the rest trapping into emulation or SW recompiling. The
RISC core architecture could borrow some from the x64 extensions but
fixed to a simpler opcode decoding. Whatever; if it doesn't have large
register sets, it's junk IMHO.


Now if the open source guys got into the compiler and could target
similar-looking RISCs with x86 as a base case: gcc could then emit
both x86 and raw RISC where needed, or just plain RISC-subset x86 and
leave it to a JIT.

The possibility of asymmetric x86 still sounds plausible too, with
something like a P3 or Pentium M in the corner to run the protected OS
and MMU model, leaving the other cores for non-OS stuff. I can't see
why the other cores would need to duplicate all the OS HW support. If
I am wrong about that, I don't think I want that kind of OS.

Of more importance is the parallel programming model. Are we still
running away from message passing? What's being offered as an
alternative for multi-core?

regards

johnjakson_usa_com

Jouni Osmala

Oct 13, 2004, 6:17:27 PM

>>>The core size of an Athlon or Opteron is about 12 times the size of
>>>the data cache (or instruction cache) of Athlon or Opteron. I happen
>>>to know that one can build a 486-like processor* in less area than
>>>the data cache of Athlon, and that this 486-like core could run
>>>between 75% and 85% of the frequency of Opteron.
>>>
>>>[*] 7 stage pipeline, 1-wide, in-order, x86 to SSE3 instruction set.
>>>
>>>Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor
>>>is a 0.5 IPC machine. (At this point you see that we have spent the last
>>>15 years in microprocessor development getting that last factor of 2 and
>>>it has cost us around 12X in silicon real estate...)
>>>
>>> CPUs IPC/CPU Frequency IPC*Freq IPC*Freq*CPU
>>>Opteron 1 1.0 2.4 GHz 2.4 2.4
>>>486-like 12 0.5 2.0 GHz 1.0 12.0
>>
>>Without data cache the 486-likes will not get 0.5 IPC, so you can
>>probably only use 6 or 8 486-likes in that area. Also, the L2 cache
>>or the main memory interface would have to be shared between the
>>486-likes, which requires some arbitration circuitry. Also, L2 cache
>>contention (or, for small, nonshared L2s, capacity misses and main
>>memory contention) between the 486-likes would reduce the IPC of the
>>486-likes.
>
>
> My posited 486-like cores do contain caches, just not the monsters in
> Opteron. In addition, the L2 would be partitioned in such a way that
> several 486-core misses could be in progress simultaneously to
> different interleaves of that L2.


One way to estimate the die area requirements: the 80486 took 81mm²
on a 0.8u process. Currently we are at a 0.09u process, so the die
area of an optical shrink of the 80486 would be around
(0.09/0.8)² * 81 = 1mm². Now, it is obvious that a straight port over
so many process generations doesn't work, and we want separate L1
caches of at least 8KB each, a longer pipeline, a better FPU, the
addition of new instructions, and support for x86-64. So let's double
the die area from the direct shrink to get a better estimate: that's
2mm² per core. Now 16 of those hanging behind a split-but-shared L2
cache could get a die area of 32mm² plus L2 caches, memory
controllers, and the interface to the chipset. In the real world
nothing is so simple: the L2 caches may thrash because of many threads
competing for them, or there may simply not be enough thread-level
parallelism. So speeding up a single thread at least SOME amount would
be beneficial; let's take a Pentium as the basis for estimation.
That's 296mm² at 0.8u, so it comes to 3.74mm² at 0.09u. Now if we
estimate adding the extensions, lengthening the pipeline, etc., it
would become 5mm², so having 8 of them would be feasible. Anyway, it's
a totally different game if we were allowed to ditch x86 compatibility
for the design, since with such small designs x86 compatibility
contributes heavily to core area.
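
(The shrink arithmetic spelled out, with the same illustrative
numbers: an optical shrink scales die area by the square of the
feature-size ratio.)

/* Die-shrink arithmetic from the paragraph above. */
#include <stdio.h>

static double shrunk(double mm2, double from_um, double to_um)
{
    double s = to_um / from_um;   /* linear scale factor */
    return mm2 * s * s;           /* area scales as its square */
}

int main(void)
{
    double i486 = shrunk(81.0, 0.8, 0.09);  /* ~1.0 mm^2 at 90nm */
    double p5   = shrunk(296.0, 0.8, 0.09); /* ~3.7 mm^2 at 90nm */
    printf("80486:   %.2f mm^2 shrunk, ~%.0f mm^2 modernized, x16 = %.0f mm^2\n",
           i486, 2.0 * i486, 16 * 2.0 * i486);
    printf("Pentium: %.2f mm^2 shrunk, ~5 mm^2 modernized, x8 = %.0f mm^2\n",
           p5, 8 * 5.0);
    return 0;
}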

But that's not a real question of bringing 486- or Pentium-level big
CMPs to market; it's just not really encouraging at the 0.065u process
and smaller, since there are MORE transistors available by then, and
L2 cache thrashing is worsened by having more threads running at the
same time, so there is reason to limit the number of threads running
in parallel. Current cores bring 4 times as many cores as today in the
same area after TWO die shrinks, and there COULD be a design that
would HALVE the core die area, so that we would get 8 cores by then,
but I doubt its usefulness. Now a 0.13u Opteron vs. a 0.065u Opteron
could yield a quadrupled number of cores. The reason it's this way is
that adding more execution resources to the K8 core is increasingly
complex and brings less performance improvement, so overall, going for
more cores, better clock speed, and lower instruction latencies helps.

Jouni Osmala
