new benchmark: core-2-core transfer-speed

Elcaro Nosille

Mar 1, 2008, 10:07:36 PM

It's often said that the single-die technology of AMD's Phenom
is superior to Intel's dual-die solution. Some time ago I wrote
a little benchmark that measures the speed of transfers from one
CPU core to another. This benchmark measures the throughput and
latency of linear and random memory accesses to cachelines that
another core has written just before. I ran it on my 3GHz Core 2
Extreme quadcore, and someone gave me some numbers after running
it on a Phenom overclocked to 2.53GHz.
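
To illustrate what kind of transfer is meant here, the following is a
minimal, portable C++ sketch of a one-directional hand-off. It is not
the benchmark's source. One thread rewrites a block, a second thread
then reads it, and only the read phase is timed. The names
measureTransfer and TransferResult are made up for this illustration,
and the sketch does not pin the threads to particular cores (on Win32
that could be done with SetThreadAffinityMask), so the scheduler
decides where they run.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for this sketch.
struct TransferResult {
    std::uint64_t checksum;       // sum of every word the consumer read
    double        seconds;        // time spent in the consumer's read loops
    double        bytesPerSecond; // bytes read divided by that time
};

// Unidirectional hand-off: the producer thread rewrites the block, then the
// calling thread (the consumer) reads it back. Only the reads are timed.
TransferResult measureTransfer(std::size_t blockBytes, unsigned rounds)
{
    assert(blockBytes >= sizeof(std::uint64_t) && rounds > 0);

    std::vector<std::uint64_t> block(blockBytes / sizeof(std::uint64_t));
    std::atomic<unsigned> produced{0}; // last round written by the producer
    std::atomic<unsigned> consumed{0}; // last round read by the consumer

    std::thread producer([&] {
        for (unsigned r = 1; r <= rounds; ++r) {
            // Wait until the consumer has finished the previous round.
            while (consumed.load(std::memory_order_acquire) != r - 1)
                std::this_thread::yield();
            for (auto &word : block)
                word = r; // dirty every cacheline of the block
            produced.store(r, std::memory_order_release);
        }
    });

    std::uint64_t checksum = 0;
    std::chrono::steady_clock::duration readTime{};
    for (unsigned r = 1; r <= rounds; ++r) {
        // Wait until the producer has published this round.
        while (produced.load(std::memory_order_acquire) != r)
            std::this_thread::yield();
        auto start = std::chrono::steady_clock::now();
        for (auto word : block)
            checksum += word; // linear read of the freshly written block
        readTime += std::chrono::steady_clock::now() - start;
        consumed.store(r, std::memory_order_release);
    }
    producer.join();

    double seconds = std::chrono::duration<double>(readTime).count();
    double bytes = static_cast<double>(block.size() * sizeof(std::uint64_t)) * rounds;
    return {checksum, seconds, seconds > 0.0 ? bytes / seconds : 0.0};
}
```

A call such as measureTransfer(16 * 1024, 100000) would correspond to
the 16kB case. For blocks that small the cost of reading the clock is
not negligible, so the resulting figure is only a rough indication.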

The numbers were surprising to me:

- First, it seems AMD didn't manage to get a real advantage from
  its single-die technology. Random memory accesses to a 16kB
  block written to the L1 cache of another core just before have
  a throughput of about 1GB/s (!) on the Phenom, whereas my
  Core 2 Extreme reaches 3.9GB/s between cores on the same die
  and 2.7GB/s between cores on different dice.
  The linear throughput for the same block size is about 3GB/s
  on the Phenom system, compared with 6GB/s between cores on the
  same die and 4.8GB/s between cores on different dice on my
  Core 2 Extreme.
- Second, on Core-2-based CPUs, transferring from the L1 cache of
  one core to the L1 cache of another core on the same die is
  slower than when the data has been written back to the common
  L2 cache and is transferred from there to the destination core.

Some people on a German board mentioned that these tests aren't
meaningful for real-world purposes because I only probe transfers
from one core to another in one direction while the other cores
do nothing.
So I wrote a new benchmark for Win32 that has configurable
behaviour for:
- the pattern:
  Linear measures the throughput of linear memory accesses, and
  random measures the throughput and latency of random memory
  accesses (measuring the latency of linear accesses doesn't
  make sense in my case because I don't do pointer-chasing on
  linear accesses, so the memory accesses become pipelined).
- the direction (unidirectional vs. bidirectional):
  In a unidirectional transfer, one core produces some data
  and another consumes it; in a bidirectional transfer, both
  cores are producers and consumers.
- the block size:
  The block is the unit produced by the thread on one core
  and consumed by the thread on the other core. Block sizes
  range from "4k" to "64m".
- producers and consumers:
  You can give the benchmark a number of core pairs that
  will be tested. When benchmarking unidirectional transfers,
  the first core of a pair is the producer and the second is
  the consumer; when benchmarking bidirectional transfers,
  both are producers and consumers.
  The core numbers run from 1 to N, where N is the number of
  cores in the system. With Intel's quadcores, the cores on
  the same die are 1 and 2 or 3 and 4 (this relies on the
  APIC IDs, and I've never seen a BIOS that does this
  differently).
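
The difference between the two patterns can be sketched as follows.
This is a minimal illustration under stated assumptions, not the
benchmark's source: it assumes 64-byte cachelines, and the names
Node, buildRandomChain, chase and sumLinear are hypothetical. The
random pattern links all cachelines of a block into one random cycle
(using Sattolo's algorithm), so every load depends on the result of
the previous one and the time per step reflects latency. The linear
pattern has no such dependency, so the loads can overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

constexpr std::size_t kCacheLine = 64; // assumed cacheline size in bytes

// One node per (assumed) cacheline; only the first word carries data.
struct Node {
    std::size_t next;                           // index of the next node
    char pad[kCacheLine - sizeof(std::size_t)]; // fill up the rest of the line
};

// Links all nodes of a block into a single random cycle.
std::vector<Node> buildRandomChain(std::size_t blockBytes, unsigned seed)
{
    std::size_t n = blockBytes / sizeof(Node);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});

    // Sattolo's algorithm: swapping each element only with a strictly
    // earlier one yields a permutation that is one cycle of length n.
    std::mt19937 rng(seed);
    for (std::size_t i = n; i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 2);
        std::swap(order[i - 1], order[pick(rng)]);
    }

    std::vector<Node> chain(n);
    for (std::size_t i = 0; i < n; ++i)
        chain[i].next = order[i];
    return chain;
}

// Random pattern: each load needs the result of the one before it.
std::size_t chase(const std::vector<Node> &chain, std::size_t steps)
{
    assert(!chain.empty());
    std::size_t pos = 0;
    for (std::size_t s = 0; s < steps; ++s)
        pos = chain[pos].next;
    return pos;
}

// Linear pattern: independent loads, one per cacheline, in address order.
std::size_t sumLinear(const std::vector<Node> &chain)
{
    std::size_t sum = 0;
    for (const Node &node : chain)
        sum += node.next;
    return sum;
}
```

Timing chase() over a block and dividing by the number of steps gives
a per-access latency; timing sumLinear() gives only a throughput,
which is why a latency figure for the linear pattern would not mean
much.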

You can download the benchmark including the sources at [1].
There are two batch files in the .zip archive, quadcore.cmd and
dualcore.cmd. Both run a large number of benchmarks across
different patterns, directions, block sizes and core
configurations; one is for dualcore systems, the other for
quadcore systems (I could also build a batch file for 8-core
systems with two CPUs, or even larger ones).

It would be nice to see some results in any of the newsgroups
I posted to. You can copy the output of the batch file by
choosing the copy function of the console's system menu.

[1] http://depositfiles.com/files/3877959
