I am working on a server application which basically serves as a big
cache in front of an SQL database, performing query requests. As long as
the requested queries are in the cache, there is no SQL communication.
I believe that in such a case, running N queries in a single thread
should take approximately 2x as much time as running N/2 queries in 2
threads. Anyway, so far I have been able to achieve only 1.6x.
I am aware of these possible problems in MT performance:
* mutex contention (avoided most of it)
* false sharing (cacheline contention) (the allocator used should not
have this problem)
* core affinity (I believe in the scheduler :)
* cache issues - too much data (but I ask the same queries, so most
likely everything is in cache)
Is there anything else to investigate?
Mirek
> I am working on server application which basically serves as big cache
> to sql database, performing query requests. As long as query requests
> that are in cache are performed, there is no sql communication.
> I believe that in such case, running N queries in single thread should
> take approximately 2x as much time as running N/2 queries in 2
Uh, some math didn't come out right there. N queries in a single thread
should take something around 2x the time of running N queries in 2 threads,
not N/2 queries.
>Is there anything else to investigate?
The results of profiling in both cases?
Profilers are amazing tools for finding out why performance is doing what
it's doing.
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.5" / 37N 20' 15.3"
Internet: steve @ Watt.COM Whois: SW32-ARIN
Free time? There's no such thing. It just comes in varying prices...
N queries in 2 threads means doing 2xN queries for 2 threads (I mean,
each core does N queries, that is 2xN total). In that case, ideally,
the time should be the same as a single core doing N queries, right?
Sorry if my wording was confusing...
Mirek
First of all, if it's a server then it must involve some socket
communication or something. Maybe the server is I/O bound, not
processor bound.
One of the most expensive things is (not false, just) sharing. Sharing
of mutable data between threads. Threads must not share any mutable
data on the fast path.
Dmitriy V'jukov
So far, I am testing only the query engine alone; the socket
communications layer is not involved.
> One of the most expensive things is (not false, just) sharing. Sharing
> of mutable data between threads. Threads must not share any mutable
> data on fast-path.
Yes, I am aware of this; the fast path I am testing should have no
mutable data, only two threads running and reading the same data.
Mirek
> > One of the most expensive things is (not false, just) sharing. Sharing
> > of mutable data between threads. Threads must not share any mutable
> > data on fast-path.
>
> Yes, I am aware of this; the fast path I am testing should have no
> mutable data, only two threads running and reading the same data.
Just in case: mutation can be implicit. For example, the read_lock()/
read_unlock() functions of a reader-writer mutex usually contain
mutation of shared data and can be very expensive. But logically one
can consider this as "reading".
Dmitriy V'jukov
> Is there anything else to investigate?
The cores share things like memory bandwidth. Also, the threads will
likely at least occasionally get in each other's way, for example,
during any attempt to change the VM (for example, by calling
'malloc').
DS
> > Is there anything else to investigate?
>
> The cores share things like memory bandwidth.
Good point.
Also, if the cores have a shared L2 cache (very common), then the size
of the L2 cache is effectively halved when both cores are busy with
work. There can also be various conflicts in the cache in this case.
What hardware are you working on?
Dmitriy V'jukov
I am not sure I understand this... How can a mutex mutate data that is
not referenced in the call to it? Perhaps another sign of my under-
education; can you provide some links?
Mirek
A mutex mutates its own data:
class mutex
{
    int state; // <------------ shared mutable data
public:
    void lock()
    {
        // some operations on the 'state' variable
    }
    void unlock()
    {
        // some operations on the 'state' variable
    }
};
So if you have something like this:
class query_cache
{
    std::vector<query> data;
    mutable rw_mutex guard;
public:
    void find_query(/*...*/) const
    {
        guard.read_lock();
        // some read-only operations on the 'data' variable
        guard.read_unlock();
    }
};
It's mutable shared data!
Dmitriy V'jukov
Ah, I see, of course. In fact, a good point: it means that if more
than one thread does locking, there is a slowdown, despite the fact
that the mutex never blocks, right?
Some mutexes get locked a really large number of times. I guess this
can explain a couple of percent of the "slowdown". Thanks, I will think
about it (and will try to reduce locking using TLS).
Mirek
> Ah, I see, of course. In fact, a good point, it means that if more
> than one thread does locking, there is a slowdown, despite the fact
> mutex never blocks, right?
Right.
> Some of mutexes get locked really a lot of times. I guess this can
> explain a couple of percents of "slowdown". Thanks, I will think about
> it (and will try to reduce locking using TLS).
Try to investigate what speedup you can get if you completely remove
all mutexes, for example by commenting out all calls to read_lock()/
read_unlock(). If the application does not fail, then you will see
what speedup you can get.
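Just to sketch what I mean (a hypothetical stand-in, not your actual
code): a do-nothing lock with the same interface as the real
reader-writer mutex lets you run the no-locks experiment as a one-line
switch instead of commenting out every call site:

```cpp
// A do-nothing lock with the same interface as the real reader-writer
// mutex (the name `rw_mutex` comes from this thread; this stand-in is
// hypothetical). Substituting it compiles all locking away.
struct null_rw_mutex {
    void read_lock() const {}
    void read_unlock() const {}
};

// The cache is parameterized on the guard type, so switching between
// the real lock and the null one is a single typedef change.
template <class Guard>
struct query_cache {
    mutable Guard guard;
    int find_query(int key) const {
        guard.read_lock();
        int result = key * 2;   // stand-in for the real read-only lookup
        guard.read_unlock();
        return result;
    }
};
```

If the null-lock build runs your read-only benchmark correctly, the
throughput difference between the two builds is the price of the locks.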
Dmitriy V'jukov
Hehe, good hint, although I am afraid it will crash too soon ;)
Mirek
Are you saying that the allocator you're using eliminates false
sharing? AFAICT, you need to properly pad and explicitly align your
data structures on L2-cacheline boundaries regardless of what allocator
you're using.
http://groups.google.com/group/comp.programming.threads/msg/8036dc8a380718ad
Also, you should ensure that your mutexes are padded and aligned. For
instance, on Windows I usually do something like:
/* Simple alignment code
____________________________________________________________*/
#include <stddef.h> /* offsetof, size_t */

typedef union align_detail_types_u align_detail_types;
typedef struct align_detail_offset_s align_detail_offset;

union align_detail_types_u {
    char char_; short s_; int i_; long l_;
    double d_; long double ld_; float f_;
    align_detail_types* uap_;
    void* p_; char (*fp0_) (char);
    long double (*fp1_) (char, long double);
    /* long long ll_; */
    /* [...] */
};

struct align_detail_offset_s {
    char offset;
    align_detail_types types;
};

typedef long int align_detail_intptr;

#define ALIGN_MAX offsetof(align_detail_offset, types)

#define ALIGN_POW2(mp_this, mp_type) ((mp_type)( \
    (((align_detail_intptr const)(mp_this)) + 1) & (-2) \
))

#define ALIGN(mp_this, mp_type, mp_align) ((mp_type)( \
    (((align_detail_intptr const)(mp_this)) + \
    ALIGN_POW2(mp_align, align_detail_intptr const) - 1) \
    & (-ALIGN_POW2(mp_align, align_detail_intptr const)) \
))

#define ALIGN_CHECK(mp_this, mp_type, mp_align) ( \
    (mp_this) == ALIGN(mp_this, mp_type, mp_align) \
)

typedef char ALIGN_DETAIL_SASSERT[
    (ALIGN_MAX)
    && (ALIGN_CHECK(ALIGN_MAX, size_t, 2))
    && (ALIGN_CHECK(ALIGN_MAX, size_t, sizeof(void*)))
    && (ALIGN_CHECK(ALIGN_MAX, size_t, sizeof(void* (*) (void*))))
    && (sizeof(align_detail_intptr) >= sizeof(void*))
    && (sizeof(align_detail_intptr) >= sizeof(void* (*) (void*)))
    ? 1 : -1
];
/* Windows Fixup...
____________________________________________________________*/
#define L2CACHESIZE 128

union CRITICAL_SECTION_PAD {
    CRITICAL_SECTION _this;
    char pad[L2CACHESIZE];
};

typedef char CRITICAL_SECTION_BUF[
    sizeof(CRITICAL_SECTION_PAD) + L2CACHESIZE - 1
];

#define CRITICAL_SECTION_BUF_ALIGN(mp_this) \
    ALIGN((mp_this), LPCRITICAL_SECTION, L2CACHESIZE)
When I need to define a CRITICAL_SECTION, I do something like:
{
    CRITICAL_SECTION_BUF my_cs_buf;
    LPCRITICAL_SECTION my_cs = CRITICAL_SECTION_BUF_ALIGN(my_cs_buf);
    InitializeCriticalSection(my_cs);
    /* [...] */
    DeleteCriticalSection(my_cs);
}
This way I can be 100% sure that the lock will not be falsely shared with
any other lock.
> union CRITICAL_SECTION_PAD {
> CRITICAL_SECTION _this;
> char pad[L2CACHESIZE];
> This way I can be 100% sure that the lock will not be falsely shared with
> any other lock.
IMO, if your allocator and lock structures don't do this already,
they're badly broken. (Not that it would surprise me to learn that
there's lots of broken allocators and locks out there.)
DS
> > union CRITICAL_SECTION_PAD {
> > CRITICAL_SECTION _this;
> > char pad[L2CACHESIZE];
> > This way I can be 100% sure that the lock will not be falsely shared
> > with
> > any other lock.
> IMO, if your allocator and lock structures don't do this already,
> they're badly broken.
If you're talking about custom allocators then I agree; AppCore handles
all of this for you. However, when you're using a stock malloc impl and
stock lock structures, well:
> (Not that it would surprise me to learn that
> there's lots of broken allocators and locks out there.)
You're right! For instance, run this program on a Windows platform and
report what output you get:
______________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#define L2CACHESIZE 128

struct my_data {
    CRITICAL_SECTION lock;
    int state;
};

int main() {
    struct my_data* data1 = malloc(sizeof(*data1));
    struct my_data* data2 = malloc(sizeof(*data2));
    printf("Level-2 Cache Line == %lu\n",
           (unsigned long)L2CACHESIZE);
    printf("sizeof(CRITICAL_SECTION) == %lu\n",
           (unsigned long)sizeof(CRITICAL_SECTION));
    printf("sizeof(struct my_data) == %lu\n",
           (unsigned long)sizeof(struct my_data));
    if (((unsigned long)data1) % L2CACHESIZE) {
        printf("(%p) data1 var is NOT L2 aligned! ;^(...\n", (void*)data1);
    } else {
        printf("(%p) data1 var IS L2 aligned. :^D\n", (void*)data1);
    }
    if (((unsigned long)data2) % L2CACHESIZE) {
        printf("(%p) data2 var is NOT L2 aligned! ;^(...\n", (void*)data2);
    } else {
        printf("(%p) data2 var IS L2 aligned. :^D\n", (void*)data2);
    }
    free(data2);
    free(data1);
    return 0;
}
______________________________________________________________________
Unfortunately, I get:
Level-2 Cache Line == 128
sizeof(CRITICAL_SECTION) == 24
sizeof(struct my_data) == 28
(003D4B80) data1 var IS L2 aligned. :^D
(003D4BA8) data2 var is NOT L2 aligned! ;^(...
The L2 cacheline on the 32-bit x86 system I am using now is 128 bytes,
split into two 64-byte regions. Therefore, as-is, there is false
sharing between `my_data::lock' and `my_data::state'. The malloc does
not reliably align on the L2 cache line, and CRITICAL_SECTION is not
padded. Oh well...
Shi% happens!
;^(
Not only will there be false sharing between `my_data::lock' and
`my_data::state', it also exists between `data1' and `data2', because
the difference between the two pointers is only 0x28 == 40 bytes! Ouch.
> Shi% happens!
>
> ;^(
Yep. I did some work since the last time we discussed allocators :) It
is now lock-free and false-sharing free, unless you do remote
deallocations, of course.
> AFAICT, you need to properly pad and explicitly align your data-structures
> on l2-cacheline boundaries regardless of what allocator your using.
>
> http://groups.google.com/group/comp.programming.threads/msg/8036dc8a3...
>
> Also, you should also ensure that your mutexs are padded and aligned. For
> instance, on Windows I usually do something like:
Well, this is a good point, I guess. All of my mutexes are global
variables now; however, to reduce contention, I have a lot of them,
often in big static arrays.
Perhaps I can try to pad them to e.g. 128-byte unions; that would
still be acceptable size-wise (I have about 2000 mutexes and 8GB to
play with :).
Speaking of it, do you think it is a good idea to use "trylocks" to
reduce contention? I mean, sometimes I need array data, each element
with its own mutex. I do not care about the order in which I get data
from the array, therefore I was trying to do 2 passes through the
array, doing "tries" on the first pass and a blocking lock for the
remaining elements on the second. In practice, this so far seems to
reduce blocking, but does not seem to improve real performance :)
Mirek
Ehm, I am not quite sure I understand (or agree) here.
Certainly you do not want to say that ALL allocator requests should
be 128-byte aligned... or that sizeof(Mutex) should be 128... That
would be a little bit wasteful... (and would stress the L2 cache too).
Mirek
Yeah. I assumed that when David said "allocator" he was referring to an
allocator instance that was used specifically to allocate
synchronization objects. Why I made that assumption, I don't exactly
know. Well, if you take that into account and compare it against stock
malloc, then it's suboptimal, to say the least. Over-allocation to
satisfy alignment wrt sync objects is perfectly fine with me. Unless I
am operating within the confinement of an embedded platform...
Hmm. To clarify, the last sentence was trying to say that the stock
malloc impl would be suboptimal when compared to an allocator instance
that deals with synchronization object allocation and strict
L2 alignment.
Sorry for any confusion!
;^(...
Great! :^D
>> AFAICT, you need to properly pad and explicitly align your
>> data-structures
>> on l2-cacheline boundaries regardless of what allocator your using.
>>
>> http://groups.google.com/group/comp.programming.threads/msg/8036dc8a3...
>>
>> Also, you should also ensure that your mutexs are padded and aligned. For
>> instance, on Windows I usually do something like:
>
> Well, this is a good point I guess. All of my mutexes are global
> variables now, however to reduce contention, I have a lot of them,
> often in big static arrays.
>
> Perhaps I can try to pad them to e.g. 128 byte unions, that would
> still be acceptable size-wise (I have about 2000 mutexes and 8GB to
> play with :).
Think of mutexes A and B which protect completely unrelated data (dA
and dB). If stores to A or B invalidate dA or dB, well, that's a
performance issue. If stores to dA or dB invalidate A or B, that's not
good. If mutations to A invalidate B, or vice versa, well, that's
another performance issue. IMVHO, it helps to think of that type of
cross-interference. Sometimes, keeping track of all of this micro
detail can have a marked improvement on performance in general...
You generally don't want a store into a datum A to mess around with
another piece of unrelated data. You don't want a store into a lock
which protects datum A to mess around with the cache lines in which A
exists...
For instance, if you frequently lock/unlock mutex A, why should that
cast a negative impact upon mutex B -or- the contents of the critical
section which B protects access to?
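For your big static arrays, something like this would do it (a sketch
assuming C++11 `alignas` and `std::mutex` as a stand-in for your real
mutex type; the 128-byte figure is the L2 line span from earlier in the
thread):

```cpp
#include <mutex>
#include <cstddef>

// Each slot is forced onto its own (assumed 128-byte) cache-line span,
// so locking slot i can never invalidate the line holding slot j.
struct alignas(128) padded_mutex {
    std::mutex m;   // the lock itself occupies only a fraction of the slot
};

// ~2000 locks * 128 bytes is about 256 KB - cheap next to 8 GB.
padded_mutex g_locks[2000];
```

Each lock then owns its slot outright, so hammering one mutex cannot
invalidate its neighbors in the array.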
> Speaking about it, do you think it is a good idea to use "trylocks" to
> reduce contention? I mean sometimes I need array data, each elemented
> with mutex. I do not care about the order in which I get data from the
> array, therefore I was trying to do 2 through the array, doing "tries"
> on first pass and blocking lock for remaining elements on second. In
> praxis, this so far seems to reduce blocking, but does not seem to
> improve real performance :)
IMVHO, only if you're generally guaranteed to have other "important"
work to do if the failure case of a try_lock is hit. No need to call
try_lock if the result of a failure is a spin. Only do try_lock if you
can immediately operate on other work, or if you're doing an STM-like
implementation and need to lock multiple mutexes and have no access to
a total ordering scheme...
Does that make any sense?
> Ehm, I am not quite sure I understand (or agree) here.
> Certainly you do not want to say that ALL allocator requestes should
> be 128 bytes aligned... or that sizeof(Mutex) should be 128... That
> would be a little bit wasteful... (and would stress L2 cache too).
The allocator should be properly tuned for the hardware on which it
runs. What that means for alignment will vary from implementation to
implementation. The locks may or may not need to be padded and aligned
themselves, depending on what the allocator does.
I certainly agree that on most typical hardware platforms, aligning
small allocations on a 128-byte boundary makes little sense.
DS
Is that a static mutex table? If not, then certainly the max value
cannot get much higher than 2000, right? Also, you need to ensure that
the important data within a critical section is padded and properly
aligned on L2 cache line boundaries as well. This ensures that stores
to CS A will not invalidate data within any other CS, and vice versa...
[...]
I wasn't entirely clear as to where these threads were. Were they on this
cache server or on an application talking to the cache server?
Other possible problems/bottlenecks include:
* Virtual memory thrashing -- working set is larger than physical
memory, causing pages to be swapped in and out.
* communication channel bandwidth - sql -> cache server -> application
* memory fragmentation in the allocator
joe
I guess my description was confusing. What I do is this:
I have data in an array. Each element has a mutex. For a certain query,
I need a subset of the elements of this array. After getting each
element, I process it. The order in which I read elements from the
array does not matter.
Now what I was trying was to read data from the array in two passes -
in the first, read only elements where trylock succeeds; in the second,
do a blocking lock. Of course, this can be extended to more passes as
well.
As I said, this greatly reduced contention on element mutexes, but
surprisingly does not seem to increase the performance ;)
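In code, the scheme is roughly this (a sketch with `std::mutex`; the
element type and the processing step are made up for illustration, only
the pass structure matters):

```cpp
#include <mutex>
#include <vector>
#include <cstddef>

// Hypothetical element: a per-element lock plus some payload.
struct element {
    std::mutex lock;
    int value;
    element() : value(0) {}
};

template <class Process>
void visit_all(std::vector<element>& arr, Process process)
{
    std::vector<bool> done(arr.size(), false);

    // Pass 1: opportunistic - take only elements whose lock is free,
    // never entering a mutex wait-set.
    for (std::size_t i = 0; i < arr.size(); ++i) {
        if (arr[i].lock.try_lock()) {
            process(arr[i]);
            arr[i].lock.unlock();
            done[i] = true;
        }
    }
    // Pass 2: block on whatever was skipped the first time around.
    for (std::size_t i = 0; i < arr.size(); ++i) {
        if (!done[i]) {
            std::lock_guard<std::mutex> g(arr[i].lock);
            process(arr[i]);
        }
    }
}
```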
Mirek
Thanks, I guess I can exclude all of them for the current benchmark,
but it is something to consider anyway.
Mirek
What is the average ratio of try_lock successes to failures? AFAICT, my
initial response is that I don't see anything wrong with your scheme.
If you don't care about order, then there is really no reason for a
thread to get inserted into the mutex wait-set when it's currently
being held during the "first" pass. You're breaking data acquisition
into an opportunistic fast path (e.g., phase 1), and phase 2 would be
the slow path. This makes sense to me.
However, I would be interested in looking at a graph which plots the
average ratio of try_lock successes to failures across a diverse set of
workloads. Humm...
> I guess my description was confusing. What I do is this:
>
> I have data in array. Each element has mutex. For certain query, I
> need a subset of elements of this array. After getting each element, I
> process it. The order in which I am reading elements from the array
> does not matter.
>
> Now what I was trying was to read data from the array in two passes -
> in first, read only elements where trylock succeeds, in second do
> blocking lock. Of course, this can be extended to more passes as well.
Then I think the main cause of slowdown in your server is cache-
coherence traffic due to operations on locks and modifications of data
elements.
Dmitriy V'jukov
There really is no modification of data elements, so the remaining
possibility is cache-coherence traffic from the mutexes themselves :)
Will try to avoid that too, soon.
Mirek
Ha, mystery solved, at least partially.
In the application, I have a lot of "performance counters" counting
specific events to provide profiling info. These work as simple
unlocked increments of global variables (because we do not need 100%
accuracy there). Because they are "unlocked", I had falsely expected
that they would not have any impact on MT performance. Obviously false.
By removing these counters, I can gain most of the "lost" performance
back.
I think that the rest (about 5%) can be attributed to data sharing in
the remaining mutex. Unfortunately, that one is unavoidable.
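For the record, the usual fix (a sketch, not my actual code; the
thread-id indexing and the 128-byte padding are assumptions) is one
padded counter slot per thread, summed on demand, so the increments
never touch a cache line another thread writes:

```cpp
const int kMaxThreads = 64;   // assumed upper bound on worker threads

// One padded slot per thread; alignas keeps each slot on its own
// (assumed 128-byte) line span, so increments never ping-pong between
// cores. A thread only ever writes its own slot.
struct alignas(128) counter_slot {
    unsigned long n;
};

counter_slot g_query_count[kMaxThreads] = {};

// Fast path: plain increment of a thread-private slot, no locks.
void count_query(int thread_id) { ++g_query_count[thread_id].n; }

// Slow path: whoever reads the profiling info sums the slots.
unsigned long total_queries() {
    unsigned long sum = 0;
    for (int i = 0; i < kMaxThreads; ++i) sum += g_query_count[i].n;
    return sum;
}
```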
I guess my thanks should especially go to Dmitriy, as he pointed me in
the right direction.
Thanks everybody. MT is fun:)
Mirek
> Ha, mystery solved, at least partially.
> In the application, I have a lot of "performance counters" counting
> specific events to provide profiling info. These work as simple
> unlocked increments of global variables (because we do not need 100%
> accuracy there). By having them "unlocked" I have falsely expected
> that they will not have any impact on MT performance. Obviously,
> false.
> By removing these counters, I can gain most of "lost" of performance
> back.
> I think that the rest (about 5%) can be attributed to data sharing in
> remaining mutex. Unfortunately, that one is unavoidable.
> I guess my thanks should especially go to Dmitriy as he has sent me to
> the right direction.
Well, AFAICT, Dmitriy is a quick and very sharp individual.
> Thanks everybody.
You're welcome.
> MT is fun:)
Indeed! :^D