STL map and pthread performance problem on Linux/GCC

nan....@gmail.com

unread,

Aug 15, 2005, 11:28:33 PM8/15/05

to

Hello, all,
I have an interesting problem about stl map and pthread on Linux
and g++. The source code is as follows.

//mt_map_test.cpp
#include <string>
#include <map>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>

using namespace std;

static void* thrd_work( void* data )
{
long count = reinterpret_cast<long>( data );

const int SIZE =32;
char buf[SIZE];
memset( buf, 'h', SIZE );
map<string, string> strMap;

for ( long i=0; i< count ; i++ ) {
char key[8];
sprintf( key, "%d", i );
strMap[ key ] = buf;
}
}

int main()
{
int JOB_NUM = 320000;
pthread_t tid[THRD_NUM];

for ( int i=0; i< THRD_NUM; i++ ){
pthread_create( tid+i, NULL, &thrd_work,
reinterpret_cast<void*> (JOB_NUM) );
}

for ( int i=0; i< THRD_NUM; i++ ){
pthread_join( tid[i], NULL );
}

}

And here is what I got from my workstation ( Dual AMD Opteron machine,
RHEL 3)

[nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t
mt_map_test.cpp -lpthread ; time ./a.out; done

real 0m1.390s
user 0m1.280s
sys 0m0.120s

real 0m3.450s
user 0m5.320s
sys 0m1.170s

I expected that the 2 times should be roughly equal. But clearly I
experienced significant slowdown with 2 threads.The same also happened
to a dual Intel Xeon machine. I suspect the internal stl map
implementation is improper.

I've spent hours googling without any answer. I really need advice from
a C++ expert. Thanks a lot.

Nan

Maxim Yegorushkin

unread,

Aug 16, 2005, 5:53:57 AM8/16/05

to

nan....@gmail.com wrote:

[]

> I expected that the 2 times should be roughly equal. But clearly I
> experienced significant slowdown with 2 threads.The same also happened
> to a dual Intel Xeon machine. I suspect the internal stl map
> implementation is improper.
>
> I've spent hours googling without any answer. I really need advice from
> a C++ expert. Thanks a lot.

The problem may well be in malloc(), since it uses a mutex to protect
its data structures, thus serializing your map<> inserts. Try
http://www.hoard.org/

Larry I Smith

unread,

Aug 16, 2005, 7:45:16 PM8/16/05

to

nan....@gmail.com wrote:
> Hello, all,
> I have an interesting problem about stl map and pthread on Linux
> and g++. The source code is as follows.
>
>
> //mt_map_test.cpp
> #include <string>
> #include <map>
> #include <unistd.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <string.h>
> #include <pthread.h>
>
>
> using namespace std;
>
> static void* thrd_work( void* data )
> {
> long count = reinterpret_cast<long>( data );
>
> const int SIZE =32;

> //char buf[SIZE];

// allow room for a nul-terminator so the temp std::string
// constructed from buf[] by 'strMap[ key ] = buf' will
// always be SIZE bytes in length.
char buf[SIZE + 1];

> memset( buf, 'h', SIZE );

// nul-terminate buf[]. if we don't, then when a
// std::string is made from buf[] the nbr of bytes
// put into the std::string will be all bytes up to the
// next zero byte (an unknown length) - undefined behavior.
buf[SIZE] = '\0';

Just because the you have 2 cpu's doesn't mean that each thread
will run on its own cpu. I've seen discussions about this in
the gcc and g++ newsgroups. I don't remember the details, but
I seem to recall that special compile/link options were involved
to force usage of the multiple cpu's.

FYI, here's what I get on my old single-cpu PII-450 with
384MB of RAM:

real 0m9.407s
user 0m9.006s
sys 0m0.215s

real 0m18.671s
user 0m17.959s
sys 0m0.489s

Regards,
Larry

nan....@gmail.com

unread,

Aug 16, 2005, 9:35:58 PM8/16/05

to

Thanks a lot for the replies. I can use sched_setaffinity for
processor binding on Linux. But that does not look like what the
problem is. Also, I made a multi-process program using fork(), and
the result there is satisfactory.

Below is the new code with processor binding, multi-process and
corrections on the std::string.

-----------------------------------------------------------

#include <string>
#include <map>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>

#include <sys/wait.h>
#include <sched.h>

using namespace std;

struct WorkData {

int cpu;
long count;

};

static void* work( void* data )
{
WorkData* workData = reinterpret_cast<WorkData*>( data );

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET( workData->cpu, &mask);
sched_setaffinity(0, &mask);

const int SIZE =32;
char buf[SIZE];

memset( buf, 'h', SIZE );

buf[SIZE-1] = '\0';
map<string, string> strMap;

for ( long i=0; i< workData->count ; i++ ) {

char key[8];
sprintf( key, "%d", i );
strMap[ key ] = buf;
}
}

int main()
{
const int JOB_NUM = 320000;
WorkData workData[ CONCURRENCY ];
#ifdef MT
pthread_t tid[ CONCURRENCY ];
for ( int i=0; i< CONCURRENCY; i++ ){
workData[i].count = JOB_NUM;
workData[i].cpu = i;
pthread_create( tid+i, NULL, &work, reinterpret_cast<void*>
(&(workData[i])) );
}

for ( int i=0; i< CONCURRENCY; i++ ){

pthread_join( tid[i], NULL );
}

#endif

#ifdef MP

pid_t pid[ CONCURRENCY ];
for ( int i=0; i< CONCURRENCY; i++ ){
workData[i].count = JOB_NUM;
workData[i].cpu = i;

if ( (pid[i] = fork()) == 0 ) {

work( reinterpret_cast<void*> ( &(workData[i])));
exit( 0 );
}
}

for ( int i=0; i< CONCURRENCY; i++ ){
waitpid( pid[i], NULL, 0 );
}

#endif
}

----------------------------------------------------------
Performace:

[nan@eudyptula test]$ for m in MT MP; do for t in 1 2; do printf "\n%s
%d" $m $t; g++ -D$m -DCONCURRENCY=$t mt_map_test.cpp -lpthread ;
time ./a.out; done; done

MT 1
real 0m1.410s
user 0m1.380s
sys 0m0.030s

MT 2
real 0m3.650s
user 0m5.230s
sys 0m1.640s

MP 1
real 0m1.380s
user 0m1.340s
sys 0m0.040s

MP 2
real 0m1.400s
user 0m2.650s
sys 0m0.130s

I also suspect there is some locking problem. I have not got my
program to work with hoard yet. But I did not find any slowdown when I
did a simple test on malloc in a multi-threaded program. For now, I
am going to dig into the map implementation on my machine.

nan....@gmail.com

unread,

Aug 17, 2005, 12:28:37 AM8/17/05

to

I tried the same program again tonight on a Redhat FC3 dual processor
machine. The problem just went away. Notice the gcc on FC3 is
3.4.2 . Before I used 3.2.3. on RHEL3.

gcc -v
Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.2/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-java-awt=gtk
--host=i386-redhat-linux
Thread model: posix
gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)

$ for m in MT ; do for t in 1 2; do printf "\n%s %d" $m $t; g++

-D$m -DCONCURRENCY=$t mt_map_test.cpp -lpthread; time ./a.out; done;
done

MT 1
real 0m1.707s
user 0m1.632s
sys 0m0.075s

MT 2
real 0m1.805s
user 0m3.303s
sys 0m0.240s

BTW, the signature of sched_setaffinity changed on FC3,
need to use 'sched_setaffinity(0, sizeof(mask), &mask);'

I also got hoard working on FC3. But no further improvement. On RHEL3,
I got the so file with the following warning:
Compiling for Linux
In file included from libhoard.cpp:164:
usesimtls.cpp: In function `int pthread_create(pthread_t*, const
pthread_attr_t*, void*(*)(void*), void*)':
usesimtls.cpp:323: warning: `pthread_attr_setstackaddr' is deprecated
(declared
at /usr/include/nptl/pthread.h:299)
In file included from libhoard.cpp:164:
usesimtls.cpp: In function `void pthread_exit(void*)':
usesimtls.cpp:354: warning: `noreturn' function does ret

But I got segfault in running my program with MT2. MT1, MP1 and MP2
are all OK.

Clearly, something has changed between gcc 3.2 and gcc 3.4.

Maxim Yegorushkin

unread,

Aug 17, 2005, 5:36:45 AM8/17/05

to

nan....@gmail.com wrote:

[]

> And here is what I got from my workstation ( Dual AMD Opteron machine,
> RHEL 3)
>
> [nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t
> mt_map_test.cpp -lpthread ; time ./a.out; done
>
> real 0m1.390s
> user 0m1.280s
> sys 0m0.120s
>
> real 0m3.450s
> user 0m5.320s
> sys 0m1.170s
>
>
> I expected that the 2 times should be roughly equal. But clearly I
> experienced significant slowdown with 2 threads.The same also happened
> to a dual Intel Xeon machine. I suspect the internal stl map
> implementation is improper.
>
> I've spent hours googling without any answer. I really need advice from
> a C++ expert. Thanks a lot.

The problem may be in libstdc++ caching allocator. See
http://gcc.gnu.org/onlinedocs/libstdc++/20_util/allocator.html

I ran your code on a Dual Xeon 2.8 box with caching disabled and
enabled. Here are my results:

my@devel:~/src/exp> cat /etc/issue
Welcome to SuSE Linux 9.2 (i586) - Kernel \r (\l).
my@devel:~/src/exp> uname -a
Linux devel 2.6.11.4-20a-smp #1 SMP Wed Mar 23 21:52:37 UTC 2005 i686
i686 i386 GNU/Linux
my@devel:~/src/exp> g++ --version
g++ (GCC) 3.3.4 (pre 3.3.5 20040809)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is
NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.

my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp
-pthread ; GLIBCPP_FORCE_NEW=1 GLIBCXX_FORCE_NEW=1 time -p ./a.out;
done
real 1.16
user 1.10
sys 0.05
real 1.37
user 2.55
sys 0.14

my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp
-pthread ; time -p ./a.out; done
real 1.12
user 1.06
sys 0.05
real 3.48
user 5.28
sys 1.40

In the former case with caching disabled it's clear from the real time
numbers that the task is scaled well on the two processors.

A little boost is gained using hoard allocator:

my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp
-pthread ; LD_PRELOAD=/usr/local/lib/libhoard.so:/usr/lib/libdl.so
GLIBCPP_FORCE_NEW=1 GL\IBCXX_FORCE_NEW=1 time -p ./a.out; done
real 1.15
user 1.09
sys 0.05
real 1.31
user 2.46
sys 0.13

vic

unread,

Aug 17, 2005, 10:07:43 AM8/17/05

to

Just to fix up the report output, I ran the following command,

cat /proc/cpuinfo | egrep "(model name|cpu MHz)" ; \
uname -a | awk '{print "kernel name : " $1 " " $3 " " $11 " " $14
}' ; \
gcc -v 2>&1 | grep gcc | awk '{print "gcc version : " $3 $4 }'; \
for method in MT MP; do \
for concurrency in 1 2 3 4 5 ; do \
printf "%s / %s = " $method $concurrency ; \
g++ -D$method=1 -DCONCURRENCY=$concurrency mt_map_test.cpp
-lpthread ; \
for samples in 1 2 3 4 5 6 7 8 9 ; do \
/usr/bin/time ./a.out 2>&1 | \
grep -v swaps | \
sed -e "s/^$.*$user.*$/\1/g" ; \
done | sort | head -5 | tail -1 ; \
done ; \
done

The basic idea is to run multiple times and pull the median . . .

For convenience I also added some conditional definitions,
#ifndef CONCURRENCY
#define CONCURRENCY 1
#endif

#if !defined( MP ) && !defined( MT )
#define MT 1
#endif

My machine has only a single processor, so the results do not seem
surprising,

model name : AMD Athlon(tm) 64 Processor 2800+
cpu MHz : 1808.843
kernel name : Linux 2.6.12-1.1398_FC4 x86_64 GNU/Linux
gcc version : 4.0.120050727
MT / 1 = 1.30
MT / 2 = 2.65
MT / 3 = 4.01
MT / 4 = 5.35
MT / 5 = 6.47
MP / 1 = 1.34
MP / 2 = 2.69
MP / 3 = 4.01
MP / 4 = 5.35
MP / 5 = 6.69

nan....@gmail.com

unread,

Aug 17, 2005, 1:20:35 PM8/17/05

to

Thanks a lot. Maxim. GLIBCPP_FORCE_NEW is exactly where the problem
is. I dropped the processor binding code and ran the following script
( adapted from Vic's report script ) on 2 machines.

$ cat /home/nan/measure.sh
#!/bin/bash

cat /proc/cpuinfo | egrep "(model name|cpu MHz)" ;
uname -a | awk '{print "kernel name : " $1 " " $3 " " $11 " " $14
}';

gcc -v 2>& 1 | grep '^gcc' | awk '{print "gcc version : " $3 $4 }';
printf "\n CD CE";

for m in MT MP; do
for t in $(seq 1 4); do
printf "\n%s / %d = " $m $t;

g++ -D$m -DCONCURRENCY=$t mt_map_test.cpp
-lpthread;

for cache in 'export' 'export -n' ; do
for samples in $(seq 1 5); do
$cache GLIBCPP_FORCE_NEW=1; $cache
GLIBCXX_FORCE_NEW=1; /usr/bin/time -f %e ./a.out 2>&1;
done | sort | head -3 | tail -1 | tr '\n' ' '
done
done
done

Machine 1
$ sh measure.sh
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
kernel name : Linux 2.6.9-1.667smp 2004 i386
gcc version : 3.4.220041017

CD CE
MT / 1 = 1.69 1.68
MT / 2 = 1.79 1.86
MT / 3 = 2.85 2.93
MT / 4 = 3.12 3.11
MP / 1 = 1.54 1.54
MP / 2 = 1.56 1.57
MP / 3 = 3.49 2.45
MP / 4 = 3.51 3.14

Machine 2
$ sh measure.sh
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
kernel name : Linux 2.4.21-4.ELsmp 2003 i386
gcc version : 3.2.320030502

CD CE
MT / 1 = 2.00 1.97
MT / 2 = 2.19 4.93
MT / 3 = 3.16 7.22
MT / 4 = 3.75 9.94
MP / 1 = 1.82 1.99
MP / 2 = 1.84 2.00
MP / 3 = 3.11 3.24
MP / 4 = 3.73 4.25

Note the above machines have 2 Hyperthreaded CPUs, so it only scales
up to 2.
For this particular program, it seems that MP is consistently faster
than MT.
And surprisingly, GLIBCPP_FORCE_NEW or GLIBCXX_FORCE_NEW has no effect
on gcc 3.4.

Maxim Yegorushkin

unread,

Aug 18, 2005, 8:01:22 AM8/18/05

to

nan....@gmail.com wrote:
> Thanks a lot. Maxim. GLIBCPP_FORCE_NEW is exactly where the problem
> is. I dropped the processor binding code and ran the following script
> ( adapted from Vic's report script ) on 2 machines.

[]

> Note the above machines have 2 Hyperthreaded CPUs, so it only scales
> up to 2.
> For this particular program, it seems that MP is consistently faster
> than MT.
> And surprisingly, GLIBCPP_FORCE_NEW or GLIBCXX_FORCE_NEW has no effect
> on gcc 3.4.

I've just checked out gcc 3.4.3 sources and GLIBCXX_FORCE_NEW check is
still there.

I ran a slightly modified script version to include results with Hoard
allocator. Here CE/D stand for c++ caching allocator enabled/disabled,
H stands for Hoard.

my@devel:~/src/exp> ./measure

model name : Intel(R) Xeon(TM) CPU 2.80GHz

cpu MHz : 2792.047

model name : Intel(R) Xeon(TM) CPU 2.80GHz

cpu MHz : 2792.047

model name : Intel(R) Xeon(TM) CPU 2.80GHz

cpu MHz : 2792.047

model name : Intel(R) Xeon(TM) CPU 2.80GHz

cpu MHz : 2792.047
kernel name : Linux 2.6.11.4-20a-smp 2005 i386
gcc version : 3.3.4

CE CD CEH CDH
MT / 1 = 1.63 1.66 1.63 1.65
MT / 2 = 4.29 1.79 4.24 1.77
MT / 3 = 6.16 2.53 6.23 2.44
MT / 4 = 9.33 3.05 9.03 2.24
MP / 1 = 1.62 1.49 1.63 1.51
MP / 2 = 1.63 1.51 1.64 1.53
MP / 3 = 2.29 2.14 2.30 2.17
MP / 4 = 2.70 2.59 2.74 2.66

nan....@gmail.com

unread,

Aug 18, 2005, 11:26:56 AM8/18/05

to

It would be nice if you can try this again with gcc 3.4.3.