
Dual Core CPUs are slower than Dual Single core CPUs ??


आशू

Jun 7, 2006, 5:03:15 PM

Hi,

Today I was wondering whether multicore physical CPUs are faster than
multiple physical single-core CPUs, and I came to the conclusion that a
multicore physical CPU is slower because the FSB is shared between the
cores. It is only fast when everything you need is already in the
caches, so that no extra memory fetches are required and everything
stays inside the CPU. The worst case is when one execution engine is
copying portions of memory from one place to another while the other
execution engine is running some task that keeps polluting the caches.
And how is the "fetch from memory" operation scheduled between the
cores?
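
Something like the following is the kind of experiment I have in mind
(an untested sketch; it assumes Linux, gcc and pthreads, and the buffer
sizes and iteration counts are arbitrary). Run it once as-is and once
with START_POLLUTER set to 0 to get a baseline for the copy loop alone:

/* Untested sketch: time a big memcpy loop on one core while a second
 * thread streams through its own large buffer ("polluting" the caches
 * and using the shared bus).
 * Build: gcc -O2 -std=gnu99 -pthread contention.c -o contention
 *        (add -lrt if clock_gettime needs it on your libc)
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64UL << 20)   /* 64MB: far larger than any cache */
#define ITERS 20
#define START_POLLUTER 1         /* set to 0 for the baseline run */

static volatile int polluter_on = 1;

static void pin_to_cpu(int cpu)  /* keep each thread on its own core/CPU */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *polluter(void *arg)
{
    char *junk = malloc(BUF_BYTES);
    (void)arg;
    pin_to_cpu(1);
    while (polluter_on)          /* keep evicting cache lines, keep the bus busy */
        memset(junk, 1, BUF_BYTES);
    free(junk);
    return NULL;
}

int main(void)
{
    char *src = malloc(BUF_BYTES), *dst = malloc(BUF_BYTES);
    struct timespec t0, t1;
    pthread_t tid;

    pin_to_cpu(0);
    memset(src, 0, BUF_BYTES);

    if (START_POLLUTER)
        pthread_create(&tid, NULL, polluter, NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)      /* the memory copy under test */
        memcpy(dst, src, BUF_BYTES);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    polluter_on = 0;
    if (START_POLLUTER)
        pthread_join(tid, NULL);

    printf("%d copies of %luMB took %.3f s\n", ITERS, BUF_BYTES >> 20,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

The difference between the two runs should give a rough idea of how
much the second core's traffic hurts the copy on a particular machine.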

Thanks in advance,
Ashish Shukla "Wah Java !!"
--
http://wahjava.wordpress.com/

Philipp Klaus Krause

Jun 7, 2006, 5:56:35 PM

Both singlecore multiprocessor systems and multicore processors can be
implemented in lots of different ways.
With the Pentium III and the Athlon MP (both singlecore) all the
processors share the FSB.
On the other hand some multicore CPUs like the Opteron do not have a
shared FSB. They do share the memory controller though, which reduces
maximum bandwidth. On the other hand it reduces latency, too: Both cores
have an equally fast connection to the memory. Compare this with the
multiprocessor system, where the memory is distributed: if one
processor wants to access a memory location on another processor's
controller, the transfer has to go over (sometimes multiple)
HyperTransport links.
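
On a Linux NUMA box you can see this directly with something like the
following rough, untested sketch (it assumes libnuma is installed and
the machine has at least two nodes; link with -lnuma). It always runs
on node 0 and pointer-chases through memory placed on the node given on
the command line, so comparing node 0 against another node shows local
vs. remote latency:

/* Rough sketch: local vs. remote memory latency on a NUMA system.
 * Build: gcc -O2 -std=gnu99 numalat.c -o numalat -lnuma
 *        (add -lrt if clock_gettime needs it on your libc)
 * Usage: ./numalat 0   (local)     ./numalat 1   (remote)
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16UL * 1024 * 1024)   /* ~128MB of pointers: defeats the caches */

int main(int argc, char **argv)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support here\n");
        return 1;
    }
    int node = (argc > 1) ? atoi(argv[1]) : 0;

    numa_run_on_node(0);                        /* always execute on node 0 */
    size_t **chain = numa_alloc_onnode(N * sizeof(*chain), node);
    if (!chain) {
        fprintf(stderr, "allocation on node %d failed\n", node);
        return 1;
    }

    /* build a pointer-chasing cycle with a large stride so that each
       access misses the caches and goes to (local or remote) memory */
    for (size_t i = 0; i < N; i++)
        chain[i] = (size_t *)&chain[(i + 100003) % N];

    struct timespec t0, t1;
    volatile size_t *p = chain[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)              /* dependent loads: pure latency */
        p = (size_t *)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
                 + (t1.tv_nsec - t0.tv_nsec)) / (double)N;
    printf("memory on node %d, code on node 0: %.1f ns per load\n", node, ns);
    numa_free(chain, N * sizeof(*chain));
    return 0;
}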

Philipp

आशू

Jun 8, 2006, 12:55:46 AM

Hi,

Thanks for answering.

Philipp Klaus Krause wrote:
> Both singlecore multiprocessor systems and multicore processors can be
> implemented in lots of different ways.
> With the Pentium III and the Athlon MP (both singlecore) all the
> processors share the FSB.
> On the other hand some multicore CPUs like the Opteron do not have a
> shared FSB. They do share the memory controller though, which reduces
> maximum bandwidth. On the other hand it reduces latency, too: Both cores
> have an equally fast connection to the memory. Compare this with the
> multiprocessor system, where the memory is distributed: if one
> processor wants to access a memory location on another processor's
> controller, the transfer has to go over (sometimes multiple)
> HyperTransport links.

Okay, I've just visited the HyperTransport article on Wikipedia
(http://en.wikipedia.org/wiki/Hypertransport), and found that
HyperTransport-based MP (single core) systems look like NUMA
(Non-Uniform Memory Access) systems, where each processor has faster
access to some part of memory. So you mean that HyperTransport-based
MP (dual core) systems are faster than their "single core"
HyperTransport-based equivalents because of this NUMA-like
architecture.

And what about FSB-based, dual core Intel Pentium 4 processors?
They'll be slower than a pair of single core Pentium 4 processors.

>
> Philipp

Thanks in advance,
Ashish Shukla

--
http://wahjava.wordpress.com/

Mayank

Jun 8, 2006, 10:55:41 AM

आशू wrote:
> And what about FSB-based, dual core Intel Pentium 4 processors?
> They'll be slower than a pair of single core Pentium 4 processors.

I'm not an expert on this issue myself; I'll be happy to be corrected.

I believe that if independent single-threaded applications need to be
run, then SMP (dual single core processors) shall provide better
throughput.

CMP (dual core processors) have more than one execution core on the
same die, possibly sharing the L2 cache and FSB. CMP (like SMT before
it) probably came into existence because Instruction Level Parallelism
was not providing further speedup (with power playing an important
role). If multi-threaded applications need to be run, then CMP shall in
most cases provide better throughput. This is due to inter-thread data
sharing, which shall hide some of the memory latency.

If the two threads scheduled on a CMP result in more memory accesses
(fewer L2 cache hits), then throughput shall be worse. There are a
number of research papers suggesting different mechanisms for
scheduling threads/processes so that less cache contention takes place,
thereby maximizing throughput. [Side note: the Linux scheduler at
present supports load balancing but does not take the nature of the
thread/process/task (other than being I/O bound or CPU intensive) into
account while scheduling.]
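
For what it's worth, here is a crude, untested sketch of the kind of
experiment behind both of these points (Linux and pthreads assumed; the
1MB figure is just a stand-in for a shared L2, and it assumes CPUs 0
and 1 are the two cores of one package). Two threads are pinned onto
those cores and repeatedly scan either the same 1MB array (constructive
sharing) or two private 1MB arrays (contention), so timing the two runs
shows what the working sets cost:

/* Untested sketch.
 * Build: gcc -O2 -std=gnu99 -pthread share.c -o share
 * Run:   time ./share shared      vs.      time ./share private
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WSET   (1 << 20)     /* 1MB per thread: stand-in for the shared L2 */
#define PASSES 2000

struct job { unsigned char *buf; int cpu; unsigned long sum; };

static void *scan(void *arg)
{
    struct job *j = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(j->cpu, &set);            /* assumes cpu0 + cpu1 = one package */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (int p = 0; p < PASSES; p++)
        for (int i = 0; i < WSET; i += 64)    /* touch one byte per cache line */
            j->sum += j->buf[i];
    return NULL;
}

int main(int argc, char **argv)
{
    int shared = !(argc > 1 && strcmp(argv[1], "private") == 0);
    unsigned char *a = calloc(WSET, 1);
    unsigned char *b = shared ? a : calloc(WSET, 1);

    struct job j0 = { a, 0, 0 }, j1 = { b, 1, 0 };
    pthread_t t0, t1;
    pthread_create(&t0, NULL, scan, &j0);
    pthread_create(&t1, NULL, scan, &j1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    printf("%s working sets, sums %lu %lu\n",
           shared ? "shared" : "private", j0.sum, j1.sum);
    return 0;
}

On a dual core part with a shared L2 the "shared" run should come out
noticeably faster than the "private" one; on two separate packages the
gap should mostly disappear. The explicit affinity calls are also the
kind of manual placement that the scheduling papers try to automate.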

The nature and schedule of the tasks shall determine the performance on
CMP.

As suggested by one of the eminent computer architects, we have reached
a crossroads where programmers can no longer automatically enjoy
speedups from ILP. They need to multi-thread their applications in
order to take advantage of new architectures. With more multi-threaded
applications, SMP (dual single core processors) might not scale well
due to expensive synchronization/coherency traffic.
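
The coherency cost is also easy to provoke (again an untested sketch,
Linux and pthreads, and it assumes 64-byte cache lines): two threads
increment counters that sit in the same cache line, so the line
ping-pongs between the two caches. That ping-pong is the kind of
coherency traffic meant here, and it is generally cheaper when the two
caches sit close together (e.g. behind a shared L2) than when they sit
in different packages:

/* Untested sketch.
 * Build: gcc -O2 -std=gnu99 -pthread pingpong.c -o pingpong
 * Run:   time ./pingpong 0   (counters share one cache line: ping-pong)
 *        time ./pingpong 1   (counters padded apart: no ping-pong)
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 100000000UL

struct pair {
    volatile unsigned long a;
    char pad[56];                        /* assumes 64-byte cache lines */
    volatile unsigned long b;
};

static struct pair apart;                      /* a and b on separate lines   */
static volatile unsigned long together[2];     /* both (very likely) one line */

static void *bump(void *arg)
{
    volatile unsigned long *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)  /* every ++ dirties the line */
        (*c)++;
    return NULL;
}

int main(int argc, char **argv)
{
    int padded = (argc > 1) && atoi(argv[1]);
    volatile unsigned long *c0 = padded ? &apart.a : &together[0];
    volatile unsigned long *c1 = padded ? &apart.b : &together[1];

    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, (void *)c0);
    pthread_create(&t1, NULL, bump, (void *)c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%s: counters %lu %lu\n",
           padded ? "padded" : "same cache line", *c0, *c1);
    return 0;
}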

- Mayank

Anne & Lynn Wheeler

Jun 8, 2006, 1:12:44 PM

"Mayank" <mayank...@gmail.com> writes:
> I believe that if independent single-threaded applications need to be
> run, then SMP (dual single core processors) shall provide better
> throughput.
>
> CMP (dual core processors) have more than one execution core on the
> same die, possibly sharing the L2 cache and FSB. CMP (like SMT before
> it) probably came into existence because Instruction Level Parallelism
> was not providing further speedup (with power playing an important
> role). If multi-threaded applications need to be run, then CMP shall
> in most cases provide better throughput. This is due to inter-thread
> data sharing, which shall hide some of the memory latency.

and the multiple threads are in many cases conserving cache lines by
making use of the exact same data (so you may be getting a higher
per-instruction cache hit ratio for the same number of cache lines).

there is an analogy to this from long ago and far away involving real
storage for tss/360 paging (from the 60s). tss/360 was originally
announced to run on a 512kbyte 360/67 ... but the tss/360 (fixed)
kernel was rapidly growing. eventually the minimum was 768kbytes and
to really get anything done with tss/360 you needed 1024kbytes
(largest memory configuration).

they then benchmarked two-processor tss/360 on a two-processor 360/67
with two megabytes of real storage (each processor came with 1mbyte
max. and multiprocessor support allowed the addressing to be linear)
... and tss/360 thruput was coming out around 3.5 times that of tss/360
uniprocessor operation.

somebody made the claim that tss/360 scale-up, multiprocessor support
and algorithms were obviously the best in the industry ... being able
to get 3.5 times the thruput with only two times the resources.

it turns out that this was a relative measurement; both tss/360
uniprocessor and multiprocessor thruput were quite bad by an absolute
measure (as opposed to a purely relative one).

the issue was that the tss/360 kernel requirements had grown so much
that attempting almost any operation ... with the amount of real
storage left over for paging in a 1mbyte configuration ... would page
thrash. with double the real storage (2mbytes) ... the amount of real
storage left over for application paging increased by a factor of 5-10
(compared to the single processor, 1mbyte configuration) ... resulting
in tss/360 seeing 3.5 times the aggregate thruput (in the two processor
configuration) relative to the single processor configuration (however,
neither number was actually that remarkable).

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/

Derek Simmons

Jun 8, 2006, 1:36:18 PM

This has been debated to death in a couple of different forums. The
topic has taken a couple of different forms, comparing single CPU
systems to dual or multiple CPU systems, or to dual or multiple core
CPUs. The determining factor is the amount of bandwidth and cache
available between the CPU or core and memory, and how it is
implemented.

If you are trying to decide which system to purchase, some of this
falls on the shoulders of the motherboard manufacturer as well. Some
motherboard manufacturers (e.g. Supermicro, but I don't recommend them
for personal reasons) have increased the amount of available bandwidth
by requiring you to install memory modules in larger groups.

The best way to determine the best system for yourself is to see if
you can find somebody who will allow you to set up systems side by
side and compare them (good luck).

Derek

आशू

Jun 8, 2006, 3:46:56 PM

Hi,

Thanks to all.

So, I think the result of this discussion is that for
HyperTransport-based systems (NUMA-style systems) it is better to go
with dual core CPUs, whereas for shared-FSB-based systems it is better
to go with dual single core CPUs.

Well, I got introduced to some new terms like CMP
(http://en.wikipedia.org/wiki/Chip-level_multiprocessing) and SMT
(http://en.wikipedia.org/wiki/Simultaneous_multithreading), thanks to
Mayank for that :-).

Thanks again.

russell kym horsell

Jun 8, 2006, 11:30:22 PM

Derek Simmons <dere...@gmail.com> wrote:
...

> The best way to determine the best system for yourself is to see if
> you can find somebody who will allow you to set up systems side by
> side and compare them (good luck).
...

It never fails to amaze me that debates of this kind dominate engineering-type
groups for years on end. Opinion vs "my years of (usually unrelated)
experience", red herring vs strawman. Vague generalisation vs negative
claim "proved" by 2 or 3 hand-selected examples.

But enough of venting my spleen. ;)

I'm with you -- there is really no substitute for actually presenting
real-world measurements and some considered analysis of same.

An alternative -- in this case -- is to fall back on queuing theory 101
and analyse the relevant (admittedly simple) model. For the OP a quick
Google will probably bring up relevant material.
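
For instance, the crudest possible model treats a memory interface as
an M/M/1 queue (Poisson arrivals, exponential service -- which real
memory traffic certainly is not) and compares one bus shared by two
cores against a private path per core. A throwaway sketch with made-up
rates:

/* Throwaway M/M/1 sketch; rates are made up, only the trend matters.
 * Mean time a request spends queued + in service: W = 1 / (mu - lambda).
 * Build: gcc -O2 -std=gnu99 mm1.c -o mm1
 */
#include <stdio.h>

static double mm1_w(double lambda, double mu)   /* mean response time */
{
    return (lambda < mu) ? 1.0 / (mu - lambda) : -1.0;  /* -1 = saturated */
}

int main(void)
{
    double mu = 100e6;   /* a bus that serves 100M memory requests/s (made up) */

    for (double lam = 10e6; lam <= 45e6; lam += 5e6) {
        double shared = mm1_w(2.0 * lam, mu);   /* two cores, one shared bus */
        double priv   = mm1_w(lam, mu);         /* one core per private path */
        printf("per-core demand %4.0fM req/s: shared %6.1f ns, private %6.1f ns\n",
               lam / 1e6, shared * 1e9, priv * 1e9);
    }
    return 0;
}

It shows the expected shape: at low per-core demand the shared bus
costs almost nothing, and as demand grows the shared-bus response time
blows up long before the private paths feel it. Whether a real dual
core workload ever gets near that knee is exactly what measurements
would have to show.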

James Boswell

Jun 9, 2006, 11:55:53 PM

Philipp Klaus Krause wrote:

> Both singlecore multiprocessor systems and multicore processors can be
> implemented in lots of different ways.
> With the Pentium III and the Athlon MP (both singlecore) all the
> processors share the FSB.

Athlon MPs each have a dedicated point-to-point link to the
northbridge, don't they? They're not a shared bus architecture.

-JB


Doug MacKay

Jun 10, 2006, 2:28:34 AM

आशू wrote:
> And what about FSB-based, dual core Intel Pentium 4 processors?
> They'll be slower than a pair of single core Pentium 4 processors.

Blanket statements like this, especially with so little supporting
information, are dangerous.

You could imagine building a CMP system by packing two P4s together and
connecting their FSB inside the package. (Smithfield anyone?) The
resulting system could be set up to have the same bus running at the
same speed carrying the same traffic to the same northbridge. In such
a case you could expect the same performance.

Philipp Klaus Krause

Jun 10, 2006, 5:46:44 AM

Sorry, it seems I was a bit confused. The page at
http://www.amd.com/de-de/Processors/TechnicalResources/0,,30_182_739_4296,00.html
clearly shows that each processor has a dedicated link to the northbridge.

Philipp


Dale Morris

Jun 12, 2006, 4:30:42 PM

I'd just like to point out that there's no single definition of "better" in
the domain of design that you're addressing. You appear to be implying that
total performance is the measure of "better". However, price/performance is
also quite important, and is more of a motivation behind multi-core
processors than performance alone.

- Dale Morris

"???" <wah...@gmail.com> wrote in message
news:1149714195.2...@i40g2000cwc.googlegroups.com...


> Hi,
>
> Today, I was just thinking on "Whether multicore physical CPUs are
> faster than multi physical singlecore CPUs ?", and I came to the
> conclusion that multicore physical CPU is slower because of sharing of
> the FSB bus with other cores, it is just fast, when you've everything
> in caches so that you don't need any extra memory fetches, everything
> is just restricted present in the CPU.

"???" <wah...@gmail.com> wrote in message
news:1149796015.9...@i40g2000cwc.googlegroups.com...
