The numbers were surprising to me:
- First, it seems AMD didn't manage to get a real advantage from
its single-die technology. Random memory-accesses to a 16kB
block that was written to the L1-cache of another core just
before reach a throughput of only about 1GB/s (!), whereas my
Core 2 Extreme reaches 3.9GB/s between cores on the same die
and 2.7GB/s between cores on different dice.
The linear throughput between cores for the same block-size
is about 3GB/s on the Phenom-system, versus 6GB/s between
cores on the same die and 4.8GB/s between cores on different
dice when probed on my Core 2 Extreme.
- Second, transferring from the L1-cache of one core to the
L1-cache of another core on the same die of Core-2-based CPUs
is slower than when the data has been written back to the
common L2-cache and is transferred from there to the
destination-core.
Some people on a German board mentioned that these tests
aren't meaningful for real-world purposes because I probe only
transfers from one core to another in one direction while the
other cores do nothing.
So I wrote a new benchmark for Win32 that has configurable
behaviour on:
- the pattern:
Linear measures the throughput of linear memory-accesses and
random measures the throughput and latency of random
memory-accesses (measuring the latency of linear accesses
doesn't make sense in my case because I don't do
pointer-chasing on linear accesses, so the memory-accesses
become pipelined).
- the direction - unidirectional vs. bidirectional:
When transferring unidirectionally, one core produces some
data and another consumes it; when transferring
bidirectionally, both cores are producers and consumers.
- the block-size:
The block is the entity produced by the thread on one core
and consumed by the thread on the other core. Block-sizes
range from "4k" to "64m".
- producers and consumers:
You can pass the benchmark a number of core-pairs to be
tested. When benchmarking unidirectional transfers the first
core of a pair is the producer and the second is the
consumer; when benchmarking bidirectional transfers both are
producers and consumers.
The core-numbers run from 1 to N, where N is the number of
cores in the system. With Intel's quadcores the cores on
the same die are 1 and 2 or 3 and 4 (this relies on the
APIC-IDs, and I've never seen a BIOS that does this
differently).
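
To make the difference between the two patterns concrete, here
is a minimal C++ sketch. It is not taken from the benchmark's
sources: the names make_chain, chase, linear_sum and
chase_ns_per_load are made up for illustration, and it runs in
a single thread without pinning anything to a core. It assumes
the random pattern is built as one random cycle through the
block (Sattolo's algorithm), so every load depends on the
result of the previous one. The linear pattern just sums the
block, so its loads are independent and can overlap.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a block in which next[i] is the index of the element to visit
// after i. Sattolo's algorithm yields a single cycle over all n
// elements, so a chase visits the whole block before it repeats.
std::vector<std::size_t> make_chain(std::size_t n, std::uint32_t seed) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    if (n < 2) return next;
    std::mt19937 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Random pattern: each load needs the result of the previous load
// (pointer-chasing), so the accesses cannot be pipelined.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    assert(!next.empty());
    std::size_t i = 0;
    for (std::size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}

// Linear pattern: the loads are independent of each other, so only
// the throughput is a meaningful number here.
std::size_t linear_sum(const std::vector<std::size_t>& block) {
    return std::accumulate(block.begin(), block.end(), std::size_t{0});
}

// Time a chase and return nanoseconds per load. The final index goes
// to `sink` so the compiler cannot discard the loop.
double chase_ns_per_load(const std::vector<std::size_t>& next,
                         std::size_t steps, std::size_t& sink) {
    assert(steps > 0);
    auto t0 = std::chrono::steady_clock::now();
    sink = chase(next, steps);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(steps);
}
```

With 8-byte indices a 16kB block would be make_chain(2048, seed).
In a real measurement the chain would be written by the producer
thread and chased by the consumer thread on the other core; that
handoff is left out of this sketch.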
You can download the benchmark including the sources at [1].
There are two batch-files in the .zip-archive, quadcore.cmd
and dualcore.cmd. Both run a large number of benchmarks
across different patterns, directions, block-sizes and
core-configurations; one is for dualcore-systems, the other
for quadcore-systems (I could also build a batch for
8-core-systems with two CPUs, or even larger ones).
It would be nice to see some results in any of the newsgroups
I posted to. You can copy the output of the batch by choosing
the copy-function of the console's system-menu.