Memory ping pong


Landmann

Mar 6, 2009, 3:21:57 AM3/6/09
to Scalable Synchronization Algorithms
Hello,

recently I decided to measure how the "distance" between cores in larger systems affects the throughput of a simple ping-pong communication between two cores using a shared memory location. I would have expected nearby core pairs to have higher throughput than core pairs located in different sockets.
But I measured exactly the opposite: ping-ponging between nearby cores was 2-4 times slower than between cores on different sockets (seen on IA-64 and Xeon).
Data transfer using bigger blocks of data behaved as one would expect; cores sharing the same caches showed higher transfer rates.
I used code like this (this is the 'ping' thread).
while(--lc) {
    while(*location == 0) {
#ifdef USE_PAUSE
        PauseThread();
#endif
    }
    *location = 0;
}
PauseThread() reduces to the platform-specific pause instruction recommended for spin loops, but having it in or not does not change the observed ratio.
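Such a wrapper might look roughly like the sketch below (the preprocessor guards and the no-op fallback are my own illustration, not the actual abstraction layer used here):

#if defined(_MSC_VER)
#include <intrin.h>
#endif

static inline void PauseThread()
{
#if defined(_MSC_VER)
    _mm_pause();                           /* x86/x64 PAUSE instruction */
#elif defined(__ia64__)
    __asm__ __volatile__ ("hint @pause");  /* Itanium 'hint.pause' */
#elif defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__ ("pause");
#else
    /* fall back to a no-op on other architectures */
#endif
}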
Quite surprising, or not?

Dmitriy Vyukov

Mar 6, 2009, 7:19:04 AM3/6/09
to Scalable Synchronization Algorithms
Yes, quite surprising.
Here are my results on an Intel Core2 Quad Q6600:

affinity: 0-1, time: 145
affinity: 2-3, time: 144
affinity: 0-2, time: 586
affinity: 1-3, time: 580
affinity: 0-1, time: 144
affinity: 2-3, time: 144
affinity: 0-2, time: 582
affinity: 1-3, time: 572
affinity: 0-1, time: 144
affinity: 2-3, time: 144
affinity: 0-2, time: 582
affinity: 1-3, time: 580

Times are in cycles. Note that the Q6600 has 2 L2 caches: cores 0/1 share the first L2, and cores 2/3 share the second L2.

Here is the test source code:

#include <iostream>
#include <windows.h>
#include <process.h>
#include <intrin.h>

unsigned const count = 100000000;
char pad1 [128];            // padding keeps the flag on its own cache line
volatile unsigned flag;
char pad2 [128];

unsigned __stdcall thread_proc(void* ctx)
{
    int id = (int)ctx;
    for (unsigned i = count; i; --i)
    {
        while (flag == id)  // spin until the other thread flips the flag
            _mm_pause();
        flag = id;
    }
    return 0;
}

int main()
{
    unsigned affinities [4][2] = {
        {0, 1},
        {2, 3},
        {0, 2},
        {1, 3},
    };
    for (size_t x = 0; x != 3; ++x)
    {
        for (size_t i = 0; i != _countof(affinities); ++i)
        {
            HANDLE threads [2];
            threads[0] = (HANDLE)_beginthreadex(0, 0, thread_proc,
                (void*)0, CREATE_SUSPENDED, 0);
            threads[1] = (HANDLE)_beginthreadex(0, 0, thread_proc,
                (void*)1, CREATE_SUSPENDED, 0);
            SetThreadAffinityMask(threads[0], 1 << affinities[i][0]);
            SetThreadAffinityMask(threads[1], 1 << affinities[i][1]);
            unsigned __int64 t1 = __rdtsc();
            ResumeThread(threads[0]);
            ResumeThread(threads[1]);
            WaitForMultipleObjects(2, threads, TRUE, INFINITE);
            unsigned __int64 t2 = __rdtsc();
            std::cout << "affinity: " << affinities[i][0] << "-"
                << affinities[i][1] << ", time: " << (t2 - t1) / count << std::endl;
            CloseHandle(threads[0]);
            CloseHandle(threads[1]);
        }
    }
}

Please run my test on your machines and post the results (modify the affinities accordingly; the code is for MSVC).
Also, please post your full source code.


--
Dmitriy V'jukov

Landmann

Mar 6, 2009, 8:03:51 AM3/6/09
to Scalable Synchronization Algorithms
My systems don't run Windows at the moment, so I would have to re-write your implementation. My code is not of much use here (but I posted it anyway) because it uses some abstractions over pthreads/Win32 threads to run on both systems, so copy & paste won't work.

My code is merely:

#include <stdio.h>
#include <stdlib.h>

const unsigned long LoopCount = 0xFFFFFF;

struct pingpongInfo
{
    unsigned long long loopCount;
    volatile unsigned int* volatile ping;   // flag location, published by the ping thread
    float duration;
    unsigned int cpuPing;
    unsigned int cpuPong;
};

void FixThread(unsigned int cpu)
{
    int old;
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &old);
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &old);

    SetIdealCPU(cpu);

    cpumask_t cpuMask;
    CPU_ZERO(&cpuMask);
    CPU_SET(cpu, &cpuMask);
    SetCPUAffinityMask(&cpuMask);
    SleepThread(0);
}

void* THREADCALL PingThread(void *p)
{
    pingpongInfo *info = (pingpongInfo *)p;

    FixThread(info->cpuPing);

    unsigned long long lc = info->loopCount;
    volatile unsigned int* ping =
        (volatile unsigned int*) malloc(sizeof(unsigned int));
    *ping = 0;
    info->ping = ping;                  // publish the flag to the pong thread

    while (*ping == 0);                 // wait for the first pong

    CTimer::current_time_t started;     // start timestamp (CTimer is part of the abstraction layer)

    while (--lc)
    {
        while (*ping == 0)
        {
#ifdef USE_PAUSE
            PauseThread();
#endif
        }
        *ping = 0;
    }

    CTimer::current_time_t ended;       // end timestamp

    info->duration = CTimer::GetDiffTime(ended, started);

    free((void*)ping);
    return 0;
}

void* THREADCALL PongThread(void *p)
{
    pingpongInfo *info = (pingpongInfo *)p;
    unsigned long long lc = info->loopCount;
    volatile unsigned int *ping;

    FixThread(info->cpuPong);

    do
    {
        ping = info->ping;              // wait until the ping thread has published the flag
    } while (!ping);

    while (--lc)
    {
        while (*ping != 0)
        {
#ifdef USE_PAUSE
            PauseThread();
#endif
        }

        *ping = 1;
    }

    return 0;
}


int main(int argc, char *argv[])
{
    unsigned int cpusetsize = 2;
    unsigned int cpus = GetNumCPUs();

    if (argc > 1)
    {
        cpusetsize = atoi(argv[1]);
        if (cpusetsize < 1) cpusetsize = 1;
    }

    printf("Testing for %u CPUs, assuming %u cores per node\n", cpus, cpusetsize);

    pingpongInfo infoA;

    thread_t A, B;

    for (unsigned int cpuA = 0; cpuA < cpus; ++cpuA)
    {
        for (unsigned int cpuB = (cpuA + 1); cpuB < cpus; ++cpuB)
        {
            float worstDuration = 0.0f;

            // measure pair (A,B)
            printf("Measuring \t%u\t%u\tagainst\t\t...", cpuA, cpuB);

            infoA.loopCount = LoopCount;
            infoA.cpuPing = cpuA;
            infoA.cpuPong = cpuB;
            infoA.ping = 0;

            CreateThread(&A, PingThread, &infoA);
            CreateThread(&B, PongThread, &infoA);
            AwaitThreadEnding(B);
            AwaitThreadEnding(A);
            printf("\t%f\n", infoA.duration);
            fflush(stdout);
        }
    }

    return 1;
}

Dmitriy Vyukov

Mar 6, 2009, 9:13:17 AM3/6/09
to Scalable Synchronization Algorithms
On Mar 6, 3:19 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> Here are my results on an Intel Core2 Quad Q6600:
>
> affinity: 0-1, time: 145
> affinity: 2-3, time: 144
> affinity: 0-2, time: 586

On an Intel Core2 Duo P9500 I get:
affinity: 0-1, time: 155
affinity: 0-1, time: 502 (in low power consumption mode)

Also, I've placed SwitchToThread() in the wait loop (roughly analogous to pthread_yield()).
On Windows XP SP3 32-bit (Q6600) I get:
affinity: 0-0, time: 1750 (actual switch between threads, time per 2 switches)
affinity: 0-1, time: 405 (no switch between threads, time per 1 "switch")
On Windows Vista SP1 32-bit (P9500) I get:
affinity: 0-0, time: 6750 (actual switch between threads, time per 2 switches)
affinity: 0-1, time: 356 (no switch between threads, time per 1 "switch")


--
Dmitriy V'jukov

Dmitriy Vyukov

Mar 6, 2009, 10:10:26 AM3/6/09
to Scalable Synchronization Algorithms
On Mar 6, 4:03 pm, Landmann <landmann_...@gmx.de> wrote:
> My systems don't run Windows at the moment, so I would have to
> re-write your implementation. My code is not of much use here (but I
> posted it anyway) because it uses some abstractions over
> pthreads/Win32 threads to run on both systems, so copy & paste won't
> work.
>
> My code is merely:


I mechanically replaced your sync primitives with Win32 primitives, and here are the results:

Testing for 4 CPUs, assuming 2 cores per node
Measuring 0 1 against ... 141
Measuring 0 2 against ... 591
Measuring 0 3 against ... 580
Measuring 1 2 against ... 577
Measuring 1 3 against ... 579
Measuring 2 3 against ... 141

So on the Intel Core2 Quad, communication between neighboring cores is indeed faster. On what systems did you test?


--
Dmitriy V'jukov

Landmann

Mar 6, 2009, 11:08:05 AM3/6/09
to Scalable Synchronization Algorithms
I tested on a 32x Itanium2 system and on an 8x Xeon system.
Both use quad-core CPUs (8 and 2 of them, respectively).

Joerg

Dmitriy Vyukov

Mar 6, 2009, 1:29:48 PM3/6/09
to Scalable Synchronization Algorithms
On Mar 6, 7:08 pm, Landmann <landmann_...@gmx.de> wrote:
> I tested on a 32x Itanium2 system and on an 8x Xeon system.
> Both use quad-core CPUs (8 and 2 of them, respectively).


Hmmm... I can only say that all synchronization algorithms for big machines that I am aware of are built around the "local accesses are cheap, remote accesses are expensive" principle: for example, hierarchical mutexes, hierarchical barriers, hierarchical RCU, hierarchical memory managers...
On the other hand, I don't see why you got the opposite results (the test looks OK... maybe the system calls that set affinity fail)...
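One quick way to rule that out on Linux is to check the return values and read the affinity mask back; a minimal sketch (not part of the original test code):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one cpu and verify that the pin took effect. */
static int pin_and_verify(int cpu)
{
    cpu_set_t want, got;
    CPU_ZERO(&want);
    CPU_SET(cpu, &want);
    if (sched_setaffinity(0, sizeof(want), &want) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    if (sched_getaffinity(0, sizeof(got), &got) != 0) {
        perror("sched_getaffinity");
        return -1;
    }
    if (!CPU_ISSET(cpu, &got)) {
        fprintf(stderr, "pin to cpu %d did not stick\n", cpu);
        return -1;
    }
    return 0;
}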


--
Dmitriy V'jukov

Dmitriy Vyukov

Mar 6, 2009, 1:35:05 PM3/6/09
to Scalable Synchronization Algorithms
On Mar 6, 7:08 pm, Landmann <landmann_...@gmx.de> wrote:
> I tested on a 32x Itanium2 system and on an 8x Xeon system.
> Both use quad-core CPUs (8 and 2 of them, respectively).


Btw, what numbers do you get for the Xeon? I think they must be consistent with my numbers for the Core2 Quad Q6600 (they are largely identical in this respect). Are the numbers for inter-processor ping-pong on the Xeon lower than 140 cycles? Hmmm...

The numbers for Itanium2 will also be interesting to see.


--
Dmitriy V'jukov

Landmann

Mar 6, 2009, 3:36:52 PM3/6/09
to Scalable Synchronization Algorithms
Dmitriy, thanks for being sceptical: the Xeon machine's topology did not match the CPU numbering used when pinning the threads (e.g. core index 1 was on the second die). I had not checked this before, but now I realize that, at least under Linux, one has to carefully inspect /proc/cpuinfo or use the topology information to figure out where each cpu (index) is located.
I have no idea whether Windows guarantees the mapping to be a simple counting through all the cores.
I will re-do the tests using rdtsc to count cycles and post the figures here; currently my code just prints times in seconds to get the ratio.
However, the Itanium2 system's topology matched the cpu ordering exactly, so I have no clue what is going on there.

Dmitriy Vyukov

Mar 6, 2009, 3:42:40 PM3/6/09
to Scalable Synchronization Algorithms
It's just an assumption. There are 2 different approaches to cpu numbering: the first is so-called 'compact' and the second is 'scatter' (OpenMP terminology).
Compact is:
OS logical processor ID #0: physical processor 0, core 0
OS logical processor ID #1: physical processor 0, core 1
OS logical processor ID #2: physical processor 0, core 2
OS logical processor ID #3: physical processor 0, core 3
OS logical processor ID #4: physical processor 1, core 0
...

Scatter is:
OS logical processor ID #0: physical processor 0, core 0
OS logical processor ID #1: physical processor 1, core 0
OS logical processor ID #2: physical processor 2, core 0
OS logical processor ID #3: physical processor 3, core 0
OS logical processor ID #4: physical processor 0, core 1
...

Both are useful in different situations. Windows always uses 'compact', AFAIK (not documented). I can't find any documentation for Linux. Maybe your Linux uses 'scatter' numbering... well, it's unlikely, but at least it would explain your results (however, then there would be a problem with your throughput test).
Try to investigate the information in:
/sys/devices/system/cpu/cpu*/topology
(the sysfs file system must be mounted)
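A minimal sketch of reading that information (assuming the usual core_id and physical_package_id attribute files are present):

#include <unistd.h>
#include <fstream>
#include <iostream>
#include <sstream>

// Read one integer topology attribute of a logical cpu from sysfs;
// returns -1 if the file cannot be read.
static int topology_value(long cpu, const char* leaf)
{
    std::ostringstream path;
    path << "/sys/devices/system/cpu/cpu" << cpu << "/topology/" << leaf;
    std::ifstream f(path.str().c_str());
    int value = -1;
    f >> value;
    return value;
}

int main()
{
    long ncpu = sysconf(_SC_NPROCESSORS_CONF);
    for (long cpu = 0; cpu < ncpu; ++cpu) {
        std::cout << "cpu" << cpu
                  << ": package " << topology_value(cpu, "physical_package_id")
                  << ", core " << topology_value(cpu, "core_id")
                  << std::endl;
    }
    return 0;
}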


--
Dmitriy V'jukov

Landmann

Mar 10, 2009, 11:01:10 AM3/10/09
to Scalable Synchronization Algorithms
Hello,

now that I have checked that thread pinning does what it is expected to do, here are my results for a variety of platforms running Linux.
What you see is how fast the communication between two cores is (logical cpu numbers shown as A<->B, APIC physical IDs given in brackets).
On some systems I also measured how the communication speed changes with "background" talk between two other cpus happening in parallel.

Outcomes:
Some systems have 'anomalies' I cannot explain yet.
Under Linux the topology information should be taken from sysfs and not derived simply from the logical cpu numbering.

Sorry for the long post...

Ping-Pong Loop Count was always 33554432.

AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Measuring 0<->1 [0,1] 10.815338s (598 cycl.)
- 'pause' : no difference

2x Opteron 252
Measuring 0<->1 [0,1] 11.237741s (874 cycl.)
- 'pause' : no difference

2x Opteron 270 X2
Measuring 0<->1 [0,1] 10.152258s (607 cycl.)
Measuring 0<->1 [0,1] against 2<->3 [2,3] 16.796986s (1005 cycl.)
Measuring 0<->2 [0,2] 13.127724s (786 cycl.)
Measuring 0<->2 [0,2] against 1<->3 [1,3] 14.255553s (853 cycl.)
Measuring 0<->3 [0,3] 13.151267s (787 cycl.)
Measuring 0<->3 [0,3] against 1<->2 [1,2] 14.273995s (854 cycl.)
Measuring 1<->2 [1,2] 13.127604s (786 cycl.)
Measuring 1<->2 [1,2] against 0<->3 [0,3] 14.249063s (853 cycl.)
Measuring 1<->3 [1,3] 13.154505s (787 cycl.)
Measuring 1<->3 [1,3] against 0<->2 [0,2] 14.276996s (854 cycl.)
Measuring 2<->3 [2,3] 13.132011s (786 cycl.)
Measuring 2<->3 [2,3] against 0<->1 [0,1] 16.865227s (1009 cycl.)
* 2<->3 communication is slower than 0<->1, seen on 2 machines of this type
- 'pause' : no difference

2x Intel Xeon Quad with a 'mixed' topology
- with 'pause'
Measuring 0<->1 [0,4] 14.985144s (890 cycl.)
Measuring 0<->1 [0,4] against 2<->3 [1,2] 15.508083s
(921 cycl.)
Measuring 0<->1 [0,4] against 2<->3 [1,2] 15.514224s
(922 cycl.)
Measuring 0<->1 [0,4] against 2<->4 [1,3] 15.508224s
(921 cycl.)
Measuring 0<->1 [0,4] against 4<->5 [3,5] 14.780524s
(878 cycl.)
Measuring 0<->2 [0,1] 4.482252s (266 cycl.)
Measuring 0<->2 [0,1] against 1<->3 [4,2] 4.905601s (291
cycl.)
Measuring 0<->2 [0,1] against 1<->3 [4,2] 4.897936s (291
cycl.)
Measuring 0<->2 [0,1] against 1<->4 [4,3] 4.904848s (291
cycl.)
Measuring 0<->2 [0,1] against 4<->5 [3,5] 4.908122s (291
cycl.)
Measuring 0<->3 [0,2] 10.081909s (599 cycl.)
Measuring 0<->3 [0,2] against 1<->2 [4,1] 10.380939s
(617 cycl.)
Measuring 0<->3 [0,2] against 1<->2 [4,1] 10.380444s
(617 cycl.)
Measuring 0<->3 [0,2] against 1<->4 [4,3] 10.351627s
(615 cycl.)
Measuring 0<->3 [0,2] against 4<->5 [3,5] 10.323756s
(613 cycl.)
Measuring 0<->4 [0,3] 10.068174s (598 cycl.)
Measuring 0<->4 [0,3] against 1<->2 [4,1] 10.366900s
(616 cycl.)
Measuring 0<->4 [0,3] against 1<->5 [4,5] 9.967949s (592
cycl.)
Measuring 0<->5 [0,5] 14.994910s (891 cycl.)
Measuring 0<->5 [0,5] against 1<->2 [4,1] 15.033292s
(893 cycl.)
Measuring 0<->5 [0,5] against 1<->4 [4,3] 14.818316s
(880 cycl.)
Measuring 0<->6 [0,6] 14.975989s (890 cycl.)
Measuring 0<->6 [0,6] against 1<->2 [4,1] 14.682069s
(872 cycl.)
Measuring 0<->6 [0,6] against 1<->4 [4,3] 14.602189s
(868 cycl.)
Measuring 0<->7 [0,7] 14.979933s (890 cycl.)
Measuring 0<->7 [0,7] against 1<->2 [4,1] 14.677052s
(872 cycl.)
Measuring 0<->7 [0,7] against 1<->4 [4,3] 14.592800s
(867 cycl.)
Measuring 1<->2 [4,1] 14.968066s (889 cycl.)
Measuring 1<->2 [4,1] against 0<->3 [0,2] 15.500863s
(921 cycl.)
Measuring 1<->2 [4,1] against 0<->3 [0,2] 15.500372s
(921 cycl.)
Measuring 1<->2 [4,1] against 0<->4 [0,3] 15.501472s
(921 cycl.)
Measuring 1<->2 [4,1] against 4<->5 [3,5] 14.763144s
(877 cycl.)
Measuring 1<->3 [4,2] 14.994310s (891 cycl.)
Measuring 1<->3 [4,2] against 0<->2 [0,1] 14.808540s
(880 cycl.)
Measuring 1<->3 [4,2] against 0<->2 [0,1] 14.807692s
(880 cycl.)
Measuring 1<->3 [4,2] against 0<->4 [0,3] 15.305635s
(909 cycl.)
Measuring 1<->3 [4,2] against 4<->5 [3,5] 15.033173s
(893 cycl.)
Measuring 1<->4 [4,3] 14.964893s (889 cycl.)
Measuring 1<->4 [4,3] against 0<->2 [0,1] 14.786850s
(879 cycl.)
Measuring 1<->4 [4,3] against 0<->5 [0,5] 14.763496s
(877 cycl.)
Measuring 1<->5 [4,5] 4.486349s (266 cycl.)
Measuring 1<->5 [4,5] against 0<->2 [0,1] 4.582051s (272
cycl.)
Measuring 1<->5 [4,5] against 0<->4 [0,3] 4.603492s (273
cycl.)
Measuring 1<->6 [4,6] 10.060743s (598 cycl.)
Measuring 1<->6 [4,6] against 0<->2 [0,1] 10.010672s
(595 cycl.)
Measuring 1<->6 [4,6] against 0<->4 [0,3] 10.162225s
(604 cycl.)
Measuring 1<->7 [4,7] 10.068514s (598 cycl.)
Measuring 1<->7 [4,7] against 0<->2 [0,1] 10.008342s
(595 cycl.)
Measuring 1<->7 [4,7] against 0<->4 [0,3] 10.157720s
(603 cycl.)
Measuring 2<->3 [1,2] 10.076517s (599 cycl.)
Measuring 2<->3 [1,2] against 0<->1 [0,4] 10.223968s
(607 cycl.)
Measuring 2<->3 [1,2] against 0<->1 [0,4] 10.223618s
(607 cycl.)
Measuring 2<->3 [1,2] against 0<->4 [0,3] 10.209493s
(606 cycl.)
Measuring 2<->3 [1,2] against 4<->5 [3,5] 10.314022s
(613 cycl.)
Measuring 2<->4 [1,3] 10.060942s (598 cycl.)
Measuring 2<->4 [1,3] against 0<->1 [0,4] 10.208069s
(606 cycl.)
Measuring 2<->4 [1,3] against 0<->5 [0,5] 10.208735s
(606 cycl.)
Measuring 2<->5 [1,5] 14.983217s (890 cycl.)
Measuring 2<->5 [1,5] against 0<->1 [0,4] 15.039912s
(894 cycl.)
Measuring 2<->5 [1,5] against 0<->4 [0,3] 15.515972s
(922 cycl.)
Measuring 2<->6 [1,6] 14.971475s (890 cycl.)
Measuring 2<->6 [1,6] against 0<->1 [0,4] 14.771271s
(878 cycl.)
Measuring 2<->6 [1,6] against 0<->4 [0,3] 15.501116s
(921 cycl.)
Measuring 2<->7 [1,7] 14.959463s (889 cycl.)
Measuring 2<->7 [1,7] against 0<->1 [0,4] 14.768376s
(878 cycl.)
Measuring 2<->7 [1,7] against 0<->4 [0,3] 15.493128s
(921 cycl.)
Measuring 3<->4 [2,3] 4.487170s (266 cycl.)
Measuring 3<->4 [2,3] against 0<->1 [0,4] 4.914035s (292
cycl.)
Measuring 3<->4 [2,3] against 0<->5 [0,5] 4.915071s (292
cycl.)
Measuring 3<->5 [2,5] 15.007838s (892 cycl.)
Measuring 3<->5 [2,5] against 0<->1 [0,4] 14.802776s
(880 cycl.)
Measuring 3<->5 [2,5] against 0<->4 [0,3] 15.325644s
(911 cycl.)
Measuring 3<->6 [2,6] 15.019292s (892 cycl.)
Measuring 3<->6 [2,6] against 0<->1 [0,4] 14.603891s
(868 cycl.)
Measuring 3<->6 [2,6] against 0<->4 [0,3] 15.322671s
(910 cycl.)
Measuring 3<->7 [2,7] 14.989404s (891 cycl.)
Measuring 3<->7 [2,7] against 0<->1 [0,4] 14.575498s
(866 cycl.)
Measuring 3<->7 [2,7] against 0<->4 [0,3] 15.286992s
(908 cycl.)
Measuring 4<->5 [3,5] 14.999293s (891 cycl.)
Measuring 4<->5 [3,5] against 6<->7 [6,7] 14.809961s
(880 cycl.)
Measuring 4<->5 [3,5] against 6<->7 [6,7] 14.810522s
(880 cycl.)
Measuring 4<->5 [3,5] against 6<->0 [6,0] 14.608409s
(868 cycl.)
Measuring 4<->5 [3,5] against 0<->1 [0,4] 14.798182s
(879 cycl.)
Measuring 4<->6 [3,6] 14.972375s (890 cycl.)
Measuring 4<->6 [3,6] against 5<->7 [5,7] 15.394721s
(915 cycl.)
Measuring 4<->6 [3,6] against 5<->7 [5,7] 15.383972s
(914 cycl.)
Measuring 4<->6 [3,6] against 5<->0 [5,0] 14.594728s
(867 cycl.)
Measuring 4<->6 [3,6] against 0<->1 [0,4] 14.602990s
(868 cycl.)
Measuring 4<->7 [3,7] 14.969327s (889 cycl.)
Measuring 4<->7 [3,7] against 5<->6 [5,6] 15.266964s
(907 cycl.)
Measuring 4<->7 [3,7] against 5<->6 [5,6] 15.267613s
(907 cycl.)
Measuring 4<->7 [3,7] against 5<->0 [5,0] 14.619381s
(869 cycl.)
Measuring 4<->7 [3,7] against 0<->1 [0,4] 14.584147s
(867 cycl.)
Measuring 5<->6 [5,6] 9.979540s (593 cycl.)
Measuring 5<->6 [5,6] against 4<->7 [3,7] 10.622953s
(631 cycl.)
Measuring 5<->6 [5,6] against 4<->7 [3,7] 10.626292s
(631 cycl.)
Measuring 5<->6 [5,6] against 4<->0 [3,0] 10.527476s
(625 cycl.)
Measuring 5<->6 [5,6] against 0<->1 [0,4] 10.615711s
(631 cycl.)
Measuring 5<->7 [5,7] 9.968657s (592 cycl.)
Measuring 5<->7 [5,7] against 4<->6 [3,6] 10.613273s
(630 cycl.)
Measuring 5<->7 [5,7] against 4<->6 [3,6] 10.610933s
(630 cycl.)
Measuring 5<->7 [5,7] against 4<->0 [3,0] 10.517223s
(625 cycl.)
Measuring 5<->7 [5,7] against 0<->1 [0,4] 10.605052s
(630 cycl.)
Measuring 6<->7 [6,7] 4.488673s (266 cycl.)
Measuring 6<->7 [6,7] against 4<->5 [3,5] 4.917496s (292
cycl.)
Measuring 6<->7 [6,7] against 4<->5 [3,5] 4.922604s (292
cycl.)
Measuring 6<->7 [6,7] against 4<->0 [3,0] 4.595943s (273
cycl.)
Measuring 6<->7 [6,7] against 0<->1 [0,4] 4.921965s (292
cycl.)

- without 'pause' >10x slow-down

16x Itanium2 X2 (sorry, no physical IDs printed)
- excerpt only, other pairs are symmetrical
- AFAIK 'cells' of 4 cores (2x X2) are connected through a backplane (unknown architecture)
- /proc/cpuinfo revealed a compact topology
Measuring 0<->1 [0,0] 107.251381s (1278 cycl.)
Measuring 0<->1 [0,0] against 2<->3 [0,0] 98.469162s
(1173 cycl.)
Measuring 0<->1 [0,0] against 2<->3 [0,0] 98.180916s
(1170 cycl.)
Measuring 0<->1 [0,0] against 2<->4 [0,0] 105.212402s
(1254 cycl.)
Measuring 0<->1 [0,0] against 4<->5 [0,0] 106.807167s
(1273 cycl.)
Measuring 0<->1 [0,0] against 4<->8 [0,0] 106.787651s
(1273 cycl.)
Measuring 0<->2 [0,0] 110.932060s (1322 cycl.)
Measuring 0<->2 [0,0] against 1<->3 [0,0] 110.184135s
(1313 cycl.)
Measuring 0<->2 [0,0] against 1<->3 [0,0] 109.783813s
(1308 cycl.)
Measuring 0<->2 [0,0] against 1<->4 [0,0] 110.720924s
(1320 cycl.)
Measuring 0<->2 [0,0] against 4<->5 [0,0] 110.330956s
(1315 cycl.)
Measuring 0<->2 [0,0] against 4<->8 [0,0] 110.319183s
(1315 cycl.)
Measuring 0<->3 [0,0] 111.223663s (1326 cycl.)
Measuring 0<->3 [0,0] against 1<->2 [0,0] 109.854431s
(1309 cycl.)
Measuring 0<->3 [0,0] against 1<->2 [0,0] 109.832581s
(1309 cycl.)
Measuring 0<->3 [0,0] against 1<->4 [0,0] 110.600349s
(1318 cycl.)
Measuring 0<->3 [0,0] against 4<->5 [0,0] 110.272530s
(1314 cycl.)
Measuring 0<->3 [0,0] against 4<->8 [0,0] 110.626572s
(1318 cycl.)
Measuring 0<->4 [0,0] 64.181007s (765 cycl.)
Measuring 0<->4 [0,0] against 1<->2 [0,0] 64.174301s
(765 cycl.)
Measuring 0<->4 [0,0] against 1<->5 [0,0] 64.139503s
(764 cycl.)
Measuring 0<->4 [0,0] against 1<->8 [0,0] 64.136887s
(764 cycl.)
Measuring 0<->4 [0,0] against 8<->9 [0,0] 64.135277s
(764 cycl.)
Measuring 0<->4 [0,0] against 8<->12 [0,0] 64.150650s
(764 cycl.)
Measuring 0<->5 [0,0] 64.148857s (764 cycl.)
Measuring 0<->5 [0,0] against 1<->2 [0,0] 64.179520s
(765 cycl.)
Measuring 0<->5 [0,0] against 1<->4 [0,0] 64.166222s
(765 cycl.)
Measuring 0<->5 [0,0] against 1<->8 [0,0] 64.512199s
(769 cycl.)
Measuring 0<->5 [0,0] against 8<->9 [0,0] 64.155739s
(764 cycl.)
Measuring 0<->5 [0,0] against 8<->12 [0,0] 64.136169s
(764 cycl.)
Measuring 0<->6 [0,0] 64.151947s (764 cycl.)
Measuring 0<->6 [0,0] against 1<->2 [0,0] 64.148712s
(764 cycl.)
Measuring 0<->6 [0,0] against 1<->4 [0,0] 64.155472s
(764 cycl.)
Measuring 0<->6 [0,0] against 1<->8 [0,0] 64.141830s
(764 cycl.)
Measuring 0<->6 [0,0] against 8<->9 [0,0] 64.132019s
(764 cycl.)
Measuring 0<->6 [0,0] against 8<->12 [0,0] 64.133232s
(764 cycl.)
Measuring 0<->7 [0,0] 64.514107s (769 cycl.)
Measuring 0<->7 [0,0] against 1<->2 [0,0] 64.172386s
(765 cycl.)
Measuring 0<->7 [0,0] against 1<->4 [0,0] 64.144875s
(764 cycl.)
Measuring 0<->7 [0,0] against 1<->8 [0,0] 64.149727s
(764 cycl.)
Measuring 0<->7 [0,0] against 8<->9 [0,0] 64.159027s
(764 cycl.)
Measuring 0<->7 [0,0] against 8<->12 [0,0] 64.166466s
(765 cycl.)
Measuring 0<->8 [0,0] 64.177841s (765 cycl.)
Measuring 0<->8 [0,0] against 1<->2 [0,0] 64.172874s
(765 cycl.)
Measuring 0<->8 [0,0] against 1<->9 [0,0] 64.188240s
(765 cycl.)
Measuring 0<->8 [0,0] against 1<->4 [0,0] 64.521500s
(769 cycl.)
Measuring 0<->8 [0,0] against 4<->5 [0,0] 64.197769s
(765 cycl.)
Measuring 0<->8 [0,0] against 4<->12 [0,0] 64.174019s
(765 cycl.)
Measuring 0<->9 [0,0] 64.153633s (764 cycl.)
Measuring 0<->9 [0,0] against 1<->2 [0,0] 64.150558s
(764 cycl.)
Measuring 0<->9 [0,0] against 1<->8 [0,0] 64.177643s
(765 cycl.)
Measuring 0<->9 [0,0] against 1<->4 [0,0] 64.156647s
(764 cycl.)
Measuring 0<->9 [0,0] against 4<->5 [0,0] 64.149498s
(764 cycl.)
Measuring 0<->9 [0,0] against 4<->12 [0,0] 64.164383s
(764 cycl.)
Measuring 0<->10 [0,0] 64.151146s (764 cycl.)
Measuring 0<->10 [0,0] against 1<->2 [0,0] 64.488953s
(768 cycl.)
Measuring 0<->10 [0,0] against 1<->8 [0,0] 64.152962s
(764 cycl.)
Measuring 0<->10 [0,0] against 1<->4 [0,0] 64.147171s
(764 cycl.)
Measuring 0<->10 [0,0] against 4<->5 [0,0] 64.155983s
(764 cycl.)
Measuring 0<->10 [0,0] against 4<->12 [0,0] 64.156029s
(764 cycl.)
Measuring 0<->11 [0,0] 64.179237s (765 cycl.)
Measuring 0<->11 [0,0] against 1<->2 [0,0] 64.172897s
(765 cycl.)
Measuring 0<->11 [0,0] against 1<->8 [0,0] 64.146286s
(764 cycl.)
Measuring 0<->11 [0,0] against 1<->4 [0,0] 64.142151s
(764 cycl.)
Measuring 0<->11 [0,0] against 4<->5 [0,0] 64.469162s
(768 cycl.)
Measuring 0<->11 [0,0] against 4<->12 [0,0] 64.149742s
(764 cycl.)
Measuring 0<->12 [0,0] 64.129936s (764 cycl.)
Measuring 0<->12 [0,0] against 1<->2 [0,0] 64.167274s
(765 cycl.)
Measuring 0<->12 [0,0] against 1<->13 [0,0] 64.178665s
(765 cycl.)
Measuring 0<->12 [0,0] against 1<->4 [0,0] 64.153198s
(764 cycl.)
Measuring 0<->12 [0,0] against 4<->5 [0,0] 64.147072s
(764 cycl.)
Measuring 0<->12 [0,0] against 4<->8 [0,0] 64.170601s
(765 cycl.)
Measuring 0<->13 [0,0] 64.139839s (764 cycl.)
Measuring 0<->13 [0,0] against 1<->2 [0,0] 64.473190s
(768 cycl.)
Measuring 0<->13 [0,0] against 1<->12 [0,0] 64.152153s
(764 cycl.)
Measuring 0<->13 [0,0] against 1<->4 [0,0] 64.148048s
(764 cycl.)
Measuring 0<->13 [0,0] against 4<->5 [0,0] 64.174812s
(765 cycl.)
Measuring 0<->13 [0,0] against 4<->8 [0,0] 64.174927s
(765 cycl.)
Measuring 0<->14 [0,0] 64.142151s (764 cycl.)
Measuring 0<->14 [0,0] against 1<->2 [0,0] 64.155281s
(764 cycl.)
Measuring 0<->14 [0,0] against 1<->12 [0,0] 64.144997s
(764 cycl.)
Measuring 0<->14 [0,0] against 1<->4 [0,0] 64.148842s
(764 cycl.)
Measuring 0<->14 [0,0] against 4<->5 [0,0] 64.173897s
(765 cycl.)
Measuring 0<->14 [0,0] against 4<->8 [0,0] 64.516327s
(769 cycl.)
Measuring 0<->15 [0,0] 64.158516s (764 cycl.)
Measuring 0<->15 [0,0] against 1<->2 [0,0] 64.173409s
(765 cycl.)
Measuring 0<->15 [0,0] against 1<->12 [0,0] 64.144753s
(764 cycl.)
Measuring 0<->15 [0,0] against 1<->4 [0,0] 64.147507s
(764 cycl.)
Measuring 0<->15 [0,0] against 4<->5 [0,0] 64.142700s
(764 cycl.)
Measuring 0<->15 [0,0] against 4<->8 [0,0] 64.150131s
(764 cycl.)
Measuring 0<->16 [0,0] 55.807411s (665 cycl.)
Measuring 0<->16 [0,0] against 1<->2 [0,0] 55.848919s
(665 cycl.)
Measuring 0<->16 [0,0] against 1<->17 [0,0] 55.827259s
(665 cycl.)
Measuring 0<->16 [0,0] against 1<->4 [0,0] 56.167343s
(669 cycl.)
Measuring 0<->16 [0,0] against 4<->5 [0,0] 55.816273s
(665 cycl.)
Measuring 0<->16 [0,0] against 4<->8 [0,0] 55.826588s
(665 cycl.)
Measuring 0<->17 [0,0] 55.813526s (665 cycl.)
Measuring 0<->17 [0,0] against 1<->2 [0,0] 55.837841s
(665 cycl.)
Measuring 0<->17 [0,0] against 1<->16 [0,0] 55.831417s
(665 cycl.)
Measuring 0<->17 [0,0] against 1<->4 [0,0] 55.813683s
(665 cycl.)
Measuring 0<->17 [0,0] against 4<->5 [0,0] 55.825752s
(665 cycl.)
Measuring 0<->17 [0,0] against 4<->8 [0,0] 55.820732s
(665 cycl.)
Measuring 0<->18 [0,0] 55.785587s (665 cycl.)
Measuring 0<->18 [0,0] against 1<->2 [0,0] 56.203472s
(670 cycl.)
Measuring 0<->18 [0,0] against 1<->16 [0,0] 55.824181s
(665 cycl.)
Measuring 0<->18 [0,0] against 1<->4 [0,0] 55.825645s
(665 cycl.)
Measuring 0<->18 [0,0] against 4<->5 [0,0] 55.835438s
(665 cycl.)
Measuring 0<->18 [0,0] against 4<->8 [0,0] 55.822479s
(665 cycl.)
Measuring 0<->19 [0,0] 58.729237s (700 cycl.)
Measuring 0<->19 [0,0] against 1<->2 [0,0] 58.010929s
(691 cycl.)
Measuring 0<->19 [0,0] against 1<->16 [0,0] 58.727673s
(700 cycl.)
Measuring 0<->19 [0,0] against 1<->4 [0,0] 58.721760s
(700 cycl.)
Measuring 0<->19 [0,0] against 4<->5 [0,0] 58.703808s
(699 cycl.)
Measuring 0<->19 [0,0] against 4<->8 [0,0] 58.731762s
(700 cycl.)
Measuring 0<->20 [0,0] 61.349407s (731 cycl.)
Measuring 0<->20 [0,0] against 1<->2 [0,0] 64.529930s
(769 cycl.)
Measuring 0<->20 [0,0] against 1<->21 [0,0] 64.470764s
(768 cycl.)
Measuring 0<->20 [0,0] against 1<->4 [0,0] 64.482765s
(768 cycl.)
Measuring 0<->20 [0,0] against 4<->5 [0,0] 64.461006s
(768 cycl.)
Measuring 0<->20 [0,0] against 4<->8 [0,0] 64.459862s
(768 cycl.)
Measuring 0<->21 [0,0] 60.989990s (727 cycl.)
Measuring 0<->21 [0,0] against 1<->2 [0,0] 64.520287s
(769 cycl.)
Measuring 0<->21 [0,0] against 1<->20 [0,0] 64.491570s
(768 cycl.)
Measuring 0<->21 [0,0] against 1<->4 [0,0] 64.783409s
(772 cycl.)
Measuring 0<->21 [0,0] against 4<->5 [0,0] 64.455513s
(768 cycl.)
Measuring 0<->21 [0,0] against 4<->8 [0,0] 64.457016s
(768 cycl.)
Measuring 0<->22 [0,0] 60.974831s (726 cycl.)
Measuring 0<->22 [0,0] against 1<->2 [0,0] 64.549599s
(769 cycl.)
Measuring 0<->22 [0,0] against 1<->20 [0,0] 64.472733s
(768 cycl.)
Measuring 0<->22 [0,0] against 1<->4 [0,0] 64.485153s
(768 cycl.)
Measuring 0<->22 [0,0] against 4<->5 [0,0] 64.455055s
(768 cycl.)
Measuring 0<->22 [0,0] against 4<->8 [0,0] 64.481216s
(768 cycl.)
Measuring 0<->23 [0,0] 61.004913s (727 cycl.)
Measuring 0<->23 [0,0] against 1<->2 [0,0] 64.871841s
(773 cycl.)
Measuring 0<->23 [0,0] against 1<->20 [0,0] 64.493553s
(768 cycl.)
Measuring 0<->23 [0,0] against 1<->4 [0,0] 64.489861s
(768 cycl.)
Measuring 0<->23 [0,0] against 4<->5 [0,0] 64.498985s
(768 cycl.)
Measuring 0<->23 [0,0] against 4<->8 [0,0] 64.498650s
(768 cycl.)
Measuring 0<->24 [0,0] 60.983841s (727 cycl.)
Measuring 0<->24 [0,0] against 1<->2 [0,0] 64.475929s
(768 cycl.)
Measuring 0<->24 [0,0] against 1<->25 [0,0] 64.444618s
(768 cycl.)
Measuring 0<->24 [0,0] against 1<->4 [0,0] 64.444046s
(768 cycl.)
Measuring 0<->24 [0,0] against 4<->5 [0,0] 64.820580s
(772 cycl.)
Measuring 0<->24 [0,0] against 4<->8 [0,0] 64.447762s
(768 cycl.)
Measuring 0<->25 [0,0] 60.992012s (727 cycl.)
Measuring 0<->25 [0,0] against 1<->2 [0,0] 64.473419s
(768 cycl.)
Measuring 0<->25 [0,0] against 1<->24 [0,0] 64.445488s
(768 cycl.)
Measuring 0<->25 [0,0] against 1<->4 [0,0] 64.423332s
(768 cycl.)
Measuring 0<->25 [0,0] against 4<->5 [0,0] 64.420876s
(768 cycl.)
Measuring 0<->25 [0,0] against 4<->8 [0,0] 64.421913s
(768 cycl.)
Measuring 0<->26 [0,0] 61.017406s (727 cycl.)
Measuring 0<->26 [0,0] against 1<->2 [0,0] 64.839035s
(773 cycl.)
Measuring 0<->26 [0,0] against 1<->24 [0,0] 64.468010s
(768 cycl.)
Measuring 0<->26 [0,0] against 1<->4 [0,0] 64.465378s
(768 cycl.)
Measuring 0<->26 [0,0] against 4<->5 [0,0] 64.428703s
(768 cycl.)
Measuring 0<->26 [0,0] against 4<->8 [0,0] 64.448578s
(768 cycl.)
Measuring 0<->27 [0,0] 61.014168s (727 cycl.)
Measuring 0<->27 [0,0] against 1<->2 [0,0] 64.483521s
(768 cycl.)
Measuring 0<->27 [0,0] against 1<->24 [0,0] 64.471924s
(768 cycl.)
Measuring 0<->27 [0,0] against 1<->4 [0,0] 64.435555s
(768 cycl.)
Measuring 0<->27 [0,0] against 4<->5 [0,0] 64.451805s
(768 cycl.)
Measuring 0<->27 [0,0] against 4<->8 [0,0] 64.716904s
(771 cycl.)
Measuring 0<->28 [0,0] 61.010834s (727 cycl.)
Measuring 0<->28 [0,0] against 1<->2 [0,0] 64.546364s
(769 cycl.)
Measuring 0<->28 [0,0] against 1<->29 [0,0] 64.624466s
(770 cycl.)
Measuring 0<->28 [0,0] against 1<->4 [0,0] 64.608391s
(770 cycl.)
Measuring 0<->28 [0,0] against 4<->5 [0,0] 64.640808s
(770 cycl.)
Measuring 0<->28 [0,0] against 4<->8 [0,0] 64.634064s
(770 cycl.)
Measuring 0<->29 [0,0] 60.990562s (727 cycl.)
Measuring 0<->29 [0,0] against 1<->2 [0,0] 64.519989s
(769 cycl.)
Measuring 0<->29 [0,0] against 1<->28 [0,0] 64.908897s
(773 cycl.)
Measuring 0<->29 [0,0] against 1<->4 [0,0] 64.551071s
(769 cycl.)
Measuring 0<->29 [0,0] against 4<->5 [0,0] 64.606598s
(770 cycl.)
Measuring 0<->29 [0,0] against 4<->8 [0,0] 64.612640s
(770 cycl.)
Measuring 0<->30 [0,0] 60.993958s (727 cycl.)
Measuring 0<->30 [0,0] against 1<->2 [0,0] 64.551079s
(769 cycl.)
Measuring 0<->30 [0,0] against 1<->28 [0,0] 64.623276s
(770 cycl.)
Measuring 0<->30 [0,0] against 1<->4 [0,0] 64.591354s
(770 cycl.)
Measuring 0<->30 [0,0] against 4<->5 [0,0] 64.635635s
(770 cycl.)
Measuring 0<->30 [0,0] against 4<->8 [0,0] 64.958443s
(774 cycl.)
Measuring 0<->31 [0,0] 61.043610s (727 cycl.)
Measuring 0<->31 [0,0] against 1<->2 [0,0] 64.561775s
(769 cycl.)
Measuring 0<->31 [0,0] against 1<->28 [0,0] 64.620903s
(770 cycl.)
Measuring 0<->31 [0,0] against 1<->4 [0,0] 64.594101s
(770 cycl.)
Measuring 0<->31 [0,0] against 4<->5 [0,0] 64.616486s
(770 cycl.)
Measuring 0<->31 [0,0] against 4<->8 [0,0] 64.616524s
(770 cycl.)
.....
- no difference without 'hint.pause'
Note that the latency over the backplane is lower than for 'local' communication.

ARM11 MPCore (sorry, no user-space cycle counter)
Measuring 0<->1 [0,0] 22.485418s (0 cycl.) <= 267 cycles derived from the 399 MHz clock speed

Thanks for the valuable feedback provided here,
best regards!

Dmitriy Vyukov

Mar 11, 2009, 9:27:57 AM3/11/09
to Scalable Synchronization Algorithms
On Mar 10, 6:01 pm, Landmann <landmann_...@gmx.de> wrote:

> Outcomes:
> Some systems have 'anomalies' I cannot explain yet.
> Under Linux the topology information should be taken from sysfs
> and not derived simply from the logical cpu numbering.

I see one possible reason for the anomalies. Some of your machines are NUMA machines (2x Opteron 252, 2x Opteron 270 X2, possibly the 2x Intel Xeon Quad, 16x Itanium2 X2). Memory placement may affect cache-coherency latencies (it depends on the implementation). Assume the memory is allocated on node 8 while 2 processors on node 1 ping-pong with each other: each cache-coherence transaction may have to go to node 8 as well (this will cause a slowdown). Now assume the memory is allocated on node 1 while a processor on node 1 ping-pongs with a processor on node 2: half of the transactions may be local in this case (thus faster than the previous case). Try to allocate the memory on the node of one of the ping-ponging processors; you may use libnuma's numa_alloc_onnode().
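A minimal sketch of such an allocation, assuming libnuma (v2, for numa_node_of_cpu()) is available and the program is linked with -lnuma; the function name is illustrative:

#include <numa.h>
#include <stdio.h>

/* Allocate the ping-pong flag on the NUMA node that owns the given cpu. */
volatile unsigned* alloc_flag_near_cpu(int cpu)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 0;
    }
    int node = numa_node_of_cpu(cpu);
    void* p = numa_alloc_onnode(sizeof(unsigned), node);
    if (p)
        *(volatile unsigned*)p = 0;   /* touch it so the page is placed now */
    /* free later with numa_free((void*)p, sizeof(unsigned)) */
    return (volatile unsigned*)p;
}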

--
Dmitriy V'jukov

Landmann

Mar 12, 2009, 3:28:43 AM3/12/09
to Scalable Synchronization Algorithms
> Try to allocate the memory on the node of one of the ping-ponging
> processors; you may use libnuma's numa_alloc_onnode().
Good point!
But I actually did exactly that. The 'ping' thread allocates the memory which is used for the flag, so it always lives on the node where one of the measured threads is executed. You can still see that in the source code above.

Dmitriy Vyukov

Mar 12, 2009, 3:56:55 AM3/12/09
to Scalable Synchronization Algorithms
I would not be so sure... Memory placement is managed at the page level, and malloc() may cache pages. If every allocation went directly to the OS allocator and allocated separate pages, then the memory would probably end up on the right nodes. But if small blocks of memory are allocated with malloc()... Assume the first allocation allocates a memory page and is thus on the right node, but a number of subsequent allocations just use unused space from that first page.


--
Dmitriy V'jukov

Landmann

Mar 12, 2009, 4:12:38 AM3/12/09
to Scalable Synchronization Algorithms
I assumed malloc would take care of which pages belong to which node. So it might cache at the page level, but *hopefully* per node.
I will try to figure out whether allocating a complete page (with proper alignment) and filling it with zeros first makes any difference here.
Another round ;)

Thanks, Joerg

Dmitriy Vyukov

Mar 12, 2009, 4:23:19 AM3/12/09
to Scalable Synchronization Algorithms
To be 100% sure I would use the NUMA API. Who knows what glibc malloc does?
It may maintain per-thread caches, but before allocating new memory from the OS it may try to steal memory from other threads (I don't think standard glibc malloc is NUMA-aware).
It may not maintain per-thread caches and use only a global pool.
Or it may allocate several pages from the OS at once (eagerly).
Or...
If any of the above is true, then your test may be incorrect.

At the very least, I suggest allocating, say, 128k blocks, zeroing them, using the middle page from each allocation, and never freeing it at all.
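A minimal sketch of that suggestion; here I go straight to mmap so glibc malloc is out of the picture entirely (the function name is illustrative):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Get a 128k block directly from the OS, zero it (the first touch places
   the pages), and hand out a page from the middle; it is never freed. */
volatile unsigned* alloc_flag_page(void)
{
    size_t block = 128 * 1024;
    void* p = mmap(0, block, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 0;
    memset(p, 0, block);
    long page = sysconf(_SC_PAGESIZE);
    char* middle = (char*)p + (block / 2 / page) * page;
    return (volatile unsigned*)middle;
}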

--
Dmitriy V'jukov

Landmann

Mar 12, 2009, 5:06:18 AM3/12/09
to Scalable Synchronization Algorithms
Here we have it:
2x Opteron X2, with posix_memalign requesting a complete page at a page boundary, never freed again:
Measuring 0<->1 [0,1] 10.152527s (607 cycl.)
Measuring 0<->2 [0,2] 13.081881s (783 cycl.)
Measuring 0<->3 [0,3] 13.106046s (784 cycl.)
Measuring 1<->2 [1,2] 13.084205s (783 cycl.)
Measuring 1<->3 [1,3] 13.109171s (784 cycl.)
Measuring 2<->3 [2,3] 10.161060s (608 cycl.) :)

As soon as I free() the pages I get the old figures again, so they are recycled regardless of where they lived.
Allocating memory smaller than the page size also results in the old figures, regardless of whether free() is called or not.
So that explains a lot here, thanks!
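For reference, the allocation described above amounts to roughly this (the function name is illustrative):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* One whole page, page-aligned, zeroed by the ping thread, never freed. */
volatile unsigned* alloc_page_flag(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void* p = 0;
    if (posix_memalign(&p, page, page) != 0)
        return 0;
    memset(p, 0, page);
    return (volatile unsigned*)p;
}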

Joerg

Josh Dybnis

Mar 12, 2009, 5:22:16 AM3/12/09
to Scalable Synchronization Algorithms
Dmitriy is correct that glibc's malloc has complex behavior and it isn't NUMA-aware. And although it passes blocks >= 128k directly through to mmap/munmap, I believe that won't help you, because Linux's default NUMA allocation policy is round-robin. You can use libnuma to change the policy within the program, or use numactl from the command line. The physical placement of a page is determined when the page is first written to, causing a page fault.

-Josh