I am working on a server application which basically serves as a big
cache in front of an SQL database, performing query requests. As long as
the requested queries are in the cache, there is no SQL communication.
I believe that in such a case, running N queries in a single thread
should take approximately 2x as much time as running N/2 queries in 2
threads. Anyway, so far I have been able to achieve only 1.6x.
I am aware of these possible problems in MT performance:
* mutex contention (avoided most of it)
* false sharing (cacheline contention) (the allocator used should not
have this problem)
* core affinity (I believe in the scheduler :)
* cache issues - too much data (but I ask the same queries, so most
likely everything is in cache)
Is there anything else to investigate?
Mirek
> I am working on server application which basically serves as big cache
> to sql database, performing query requests. As long as query requests
> that are in cache are performed, there is no sql communication.
> I believe that in such case, running N queries in single thread should
> take approximately 2x as much time as running N/2 queries in 2
Uh, some math didn't come out right there. N queries in a single thread
should take something around 2x the time of running N queries in 2 threads,
not N/2 queries.
>Is there anything else to investigate?
The results of profiling in both cases?
Profilers are amazing tools for finding out why performance is doing what
it's doing.
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.5" / 37N 20' 15.3"
Internet: steve @ Watt.COM Whois: SW32-ARIN
Free time? There's no such thing. It just comes in varying prices...
N queries in 2 threads means doing 2xN queries for 2 threads (I mean,
each core does N queries, that is 2xN total). In that case, ideally,
the time should be the same as a single core doing N queries, right?
Sorry if my wording was confusing...
Mirek
First of all, if it's a server then it must involve some socket
communication or something. Maybe the server is I/O bound, not
processor bound.
One of the most expensive things is (not false, just) sharing. Sharing
of mutable data between threads. Threads must not share any mutable
data on the fast path.
Dmitriy V'jukov
So far, I am testing only the query engine alone; the socket
communications layer is not involved.
> One of the most expensive things is (not false, just) sharing. Sharing
> of mutable data between threads. Threads must not share any mutable
> data on fast-path.
Yes, I am aware of this; the fast path I am testing should have no
mutable data, only two threads running and reading the same data.
Mirek
> > One of the most expensive things is (not false, just) sharing. Sharing
> > of mutable data between threads. Threads must not share any mutable
> > data on fast-path.
>
> Yes, I am aware of this; the fast path I am testing should have no
> mutable data, only two threads running and reading the same data.
Just in case: mutation can be implicit. For example, the read_lock()/
read_unlock() functions of a reader-writer mutex usually contain
mutation of shared data and can be very expensive. But logically one
can consider this as "reading".
Dmitriy V'jukov
> Is there anything else to investigate?
The cores share things like memory bandwidth. Also, the threads will
likely at least occasionally get in each other's way, for example,
during any attempt to change the VM (for example, by calling
'malloc').
DS
> > Is there anything else to investigate?
>
> The cores share things like memory bandwidth.
Good point.
Also, if the cores have a shared L2 cache (very common), then the size
of the L2 cache is effectively halved when both cores are busy with
work. There can also be various conflicts in the cache in this case.
What hardware are you working on?
Dmitriy V'jukov
I am not sure I understand this... How can a mutex mutate data that is
not referenced in the call to it? Perhaps another sign of my under-
education; can you provide some links?
Mirek
A mutex mutates its own data:
class mutex
{
    int state; // <------------ shared mutable data
public:
    void lock()
    {
        // some operations on the 'state' variable
    }
    void unlock()
    {
        // some operations on the 'state' variable
    }
};
So if you have something like this:
class query_cache
{
    std::vector<query> data;
    mutable rw_mutex guard;
public:
    void find_query(/*...*/) const
    {
        guard.read_lock();
        // some read-only operations on the 'data' variable
        guard.read_unlock();
    }
};
It's mutable shared data!
Dmitriy V'jukov
Ah, I see, of course. In fact, a good point: it means that if more
than one thread does locking, there is a slowdown, despite the fact
that the mutex never blocks, right?
Some mutexes get locked a really large number of times. I guess this
can explain a couple of percent of the "slowdown". Thanks, I will think
about it (and will try to reduce locking using TLS).
Mirek
> Ah, I see, of course. In fact, a good point, it means that if more
> than one thread does locking, there is a slowdown, despite the fact
> mutex never blocks, right?
Right.
> Some of mutexes get locked really a lot of times. I guess this can
> explain a couple of percents of "slowdown". Thanks, I will think about
> it (and will try to reduce locking using TLS).
Try to investigate what speedup you can get if you completely remove
all mutexes, for example by commenting out all calls to read_lock()/
read_unlock(). If the application does not fail, then you will see
what speedup you can get.
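Just to sketch what I mean (a hypothetical stand-in, not your actual
code): a do-nothing lock with the same interface as the real
reader-writer mutex lets you run the no-locks experiment as a one-line
switch instead of commenting out every call site:

```cpp
// A do-nothing lock with the same interface as the real reader-writer
// mutex (the name `rw_mutex` comes from this thread; this stand-in is
// hypothetical). Substituting it compiles all locking away.
struct null_rw_mutex {
    void read_lock() const {}
    void read_unlock() const {}
};

// The cache is parameterized on the guard type, so switching between
// the real lock and the null one is a single typedef change.
template <class Guard>
struct query_cache {
    mutable Guard guard;
    int find_query(int key) const {
        guard.read_lock();
        int result = key * 2;   // stand-in for the real read-only lookup
        guard.read_unlock();
        return result;
    }
};
```

If the null-lock build runs your read-only benchmark correctly, the
throughput difference between the two builds is the price of the locks.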
Dmitriy V'jukov
Hehe, good hint, although I am afraid it will crash too soon ;)
Mirek
Are you saying that the allocator you're using eliminates false
sharing? AFAICT, you need to properly pad and explicitly align your
data structures on L2-cacheline boundaries regardless of what allocator
you're using.
http://groups.google.com/group/comp.programming.threads/msg/8036dc8a380718ad
Also, you should ensure that your mutexes are padded and aligned. For
instance, on Windows I usually do something like:
/* Simple alignment code
____________________________________________________________*/
#include <stddef.h> /* offsetof, size_t */

typedef union align_detail_types_u align_detail_types;
typedef struct align_detail_offset_s align_detail_offset;

union align_detail_types_u {
    char char_; short s_; int i_; long l_;
    double d_; long double ld_; float f_;
    align_detail_types* uap_;
    void* p_; char (*fp0_) (char);
    long double (*fp1_) (char, long double);
    /* long long ll_; */
    /* [...] */
};

struct align_detail_offset_s {
    char offset;
    align_detail_types types;
};

typedef long int align_detail_intptr;

#define ALIGN_MAX offsetof(align_detail_offset, types)

#define ALIGN_POW2(mp_this, mp_type) ((mp_type)( \
    (((align_detail_intptr const)(mp_this)) + 1) & (-2) \
))

#define ALIGN(mp_this, mp_type, mp_align) ((mp_type)( \
    (((align_detail_intptr const)(mp_this)) + \
    ALIGN_POW2(mp_align, align_detail_intptr const) - 1) \
    & (-ALIGN_POW2(mp_align, align_detail_intptr const)) \
))

#define ALIGN_CHECK(mp_this, mp_type, mp_align) ( \
    (mp_this) == ALIGN(mp_this, mp_type, mp_align) \
)

typedef char ALIGN_DETAIL_SASSERT[
    (ALIGN_MAX)
    && (ALIGN_CHECK(ALIGN_MAX, size_t, 2))
    && (ALIGN_CHECK(ALIGN_MAX, size_t, sizeof(void*)))
    && (ALIGN_CHECK(ALIGN_MAX, size_t, sizeof(void* (*) (void*))))
    && (sizeof(align_detail_intptr) >= sizeof(void*))
    && (sizeof(align_detail_intptr) >= sizeof(void* (*) (void*)))
    ? 1 : -1
];
/* Windows Fixup...
____________________________________________________________*/
#define L2CACHESIZE 128

union CRITICAL_SECTION_PAD {
    CRITICAL_SECTION _this;
    char pad[L2CACHESIZE];
};

typedef char CRITICAL_SECTION_BUF[
    sizeof(CRITICAL_SECTION_PAD) + L2CACHESIZE - 1
];

#define CRITICAL_SECTION_BUF_ALIGN(mp_this) \
    ALIGN((mp_this), LPCRITICAL_SECTION, L2CACHESIZE)
When I need to define a CRITICAL_SECTION, I do something like:
{
    CRITICAL_SECTION_BUF my_cs_buf;
    LPCRITICAL_SECTION my_cs = CRITICAL_SECTION_BUF_ALIGN(my_cs_buf);
    InitializeCriticalSection(my_cs);
    /* [...] */
    DeleteCriticalSection(my_cs);
}
This way I can be 100% sure that the lock will not be falsely shared with
any other lock.
> union CRITICAL_SECTION_PAD {
> CRITICAL_SECTION _this;
> char pad[L2CACHESIZE];
> This way I can be 100% sure that the lock will not be falsely shared with
> any other lock.
IMO, if your allocator and lock structures don't do this already,
they're badly broken. (Not that it would surprise me to learn that
there's lots of broken allocators and locks out there.)
DS
> > union CRITICAL_SECTION_PAD {
> > CRITICAL_SECTION _this;
> > char pad[L2CACHESIZE];
> > This way I can be 100% sure that the lock will not be falsely shared
> > with
> > any other lock.
> IMO, if your allocator and lock structures don't do this already,
> they're badly broken.
If you're talking about custom allocators then I agree; AppCore handles
all of this for you. However, when you're using a stock malloc impl and
stock lock structures, well:
> (Not that it would surprise me to learn that
> there's lots of broken allocators and locks out there.)
You're right! For instance, run this program on a Windows platform and
report what output you get:
______________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#define L2CACHESIZE 128

struct my_data {
    CRITICAL_SECTION lock;
    int state;
};

int main() {
    struct my_data* data1 = malloc(sizeof(*data1));
    struct my_data* data2 = malloc(sizeof(*data2));
    printf("Level-2 Cache Line == %lu\n",
           (unsigned long)L2CACHESIZE);
    printf("sizeof(CRITICAL_SECTION) == %lu\n",
           (unsigned long)sizeof(CRITICAL_SECTION));
    printf("sizeof(struct my_data) == %lu\n",
           (unsigned long)sizeof(struct my_data));
    if (((unsigned long)data1) % L2CACHESIZE) {
        printf("(%p) data1 var is NOT L2 aligned! ;^(...\n", (void*)data1);
    } else {
        printf("(%p) data1 var IS L2 aligned. :^D\n", (void*)data1);
    }
    if (((unsigned long)data2) % L2CACHESIZE) {
        printf("(%p) data2 var is NOT L2 aligned! ;^(...\n", (void*)data2);
    } else {
        printf("(%p) data2 var IS L2 aligned. :^D\n", (void*)data2);
    }
    free(data2);
    free(data1);
    return 0;
}
______________________________________________________________________
Unfortunately, I get:
Level-2 Cache Line == 128
sizeof(CRITICAL_SECTION) == 24
sizeof(struct my_data) == 28
(003D4B80) data1 var IS L2 aligned. :^D
(003D4BA8) data2 var is NOT L2 aligned! ;^(...
The L2 cacheline on the 32-bit x86 system I am using now is 128 bytes,
split into two 64-byte regions. Therefore, as-is, there is false
sharing between `my_data::lock' and `my_data::state'. The malloc does
not reliably align on the L2 cache line, and CRITICAL_SECTION is not
padded. Oh well...
Shi% happens!
;^(
Not only will there be false sharing between `my_data::lock' and
`my_data::state', it also exists between `data1' and `data2', because
the difference between the two pointers is only 0x28 == 40 bytes! Ouch.
> Shi% happens!
>
> ;^(
Yep. I did some work since the last time we discussed allocators :) It
is now lock-free and false-sharing free, unless you do remote
deallocations, of course.
> AFAICT, you need to properly pad and explicitly align your data-structures
> on l2-cacheline boundaries regardless of what allocator your using.
>
> http://groups.google.com/group/comp.programming.threads/msg/8036dc8a3...
>
> Also, you should also ensure that your mutexs are padded and aligned. For
> instance, on Windows I usually do something like:
Well, this is a good point, I guess. All of my mutexes are global
variables now; however, to reduce contention, I have a lot of them,
often in big static arrays.
Perhaps I can try to pad them to e.g. 128-byte unions; that would
still be acceptable size-wise (I have about 2000 mutexes and 8GB to
play with :).
Speaking of it, do you think it is a good idea to use "trylocks" to
reduce contention? I mean, sometimes I need array data, each element
with its own mutex. I do not care about the order in which I get data
from the array, therefore I was trying to do 2 passes through the
array, doing "tries" on the first pass and a blocking lock for the
remaining elements on the second. In practice, this so far seems to
reduce blocking, but does not seem to improve real performance :)
Mirek
Ehm, I am not quite sure I understand (or agree) here.
Certainly you do not want to say that ALL allocator requests should
be 128-byte aligned... or that sizeof(Mutex) should be 128... That
would be a little bit wasteful... (and would stress the L2 cache too).
Mirek
Yeah. I assumed that when David said "allocator" he was referring to an
allocator instance that was used specifically to allocate
synchronization objects. Why I made that assumption, I don't exactly
know. Well, if you take that into account and compare it against stock
malloc, then it's suboptimal, to say the least. Over-allocation to
satisfy alignment wrt sync objects is perfectly fine with me. Unless I
am operating within the confinement of an embedded platform...
Hmm. To clarify, the last sentence was trying to say that the stock
malloc impl would be suboptimal when compared to an allocator instance
that deals with synchronization object allocation and strict
L2 alignment.
Sorry for any confusion!
;^(...
Great! :^D
>> AFAICT, you need to properly pad and explicitly align your
>> data-structures
>> on l2-cacheline boundaries regardless of what allocator your using.
>>
>> http://groups.google.com/group/comp.programming.threads/msg/8036dc8a3...
>>
>> Also, you should also ensure that your mutexs are padded and aligned. For
>> instance, on Windows I usually do something like:
>
> Well, this is a good point I guess. All of my mutexes are global
> variables now, however to reduce contention, I have a lot of them,
> often in big static arrays.
>
> Perhaps I can try to pad them to e.g. 128 byte unions, that would
> still be acceptable size-wise (I have about 2000 mutexes and 8GB to
> play with :).
Think of mutexes A and B which protect completely unrelated data (dA
and dB). If stores to A or B invalidate dA or dB, well, that's a
performance issue. If stores to dA or dB invalidate A or B, that's not
good. If mutations to A invalidate B, or vice versa, well, that's
another performance issue. IMVHO, it helps to think of that type of
cross-interference. Sometimes, keeping track of all of this micro
detail can have a marked improvement on performance in general...
You generally don't want a store into a datum A to mess around with
another piece of unrelated data. You don't want a store into a lock
which protects datum A to mess around with the cache lines in which A
exists...
For instance, if you frequently lock/unlock mutex A, why should that
cast a negative impact upon mutex B -or- the contents of the critical
section which B protects access to?
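For your big static arrays, something like this would do it (a sketch
assuming C++11 `alignas` and `std::mutex` as a stand-in for your real
mutex type; the 128-byte figure is the L2 line span from earlier in the
thread):

```cpp
#include <mutex>
#include <cstddef>

// Each slot is forced onto its own (assumed 128-byte) cache-line span,
// so locking slot i can never invalidate the line holding slot j.
struct alignas(128) padded_mutex {
    std::mutex m;   // the lock itself occupies only a fraction of the slot
};

// ~2000 locks * 128 bytes is about 256 KB - cheap next to 8 GB.
padded_mutex g_locks[2000];
```

Each lock then owns its slot outright, so hammering one mutex cannot
invalidate its neighbors in the array.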
> Speaking about it, do you think it is a good idea to use "trylocks" to
> reduce contention? I mean sometimes I need array data, each elemented
> with mutex. I do not care about the order in which I get data from the
> array, therefore I was trying to do 2 through the array, doing "tries"
> on first pass and blocking lock for remaining elements on second. In
> praxis, this so far seems to reduce blocking, but does not seem to
> improve real performance :)
IMVHO, only if you're generally guaranteed to have other "important"
work to do if the failure case of a try_lock is hit. No need to call
try_lock if the result of a failure is a spin. Only do try_lock if you
can immediately operate on other work, or if you're doing an STM-like
implementation and need to lock multiple mutexes and have no access to
a total ordering scheme...
Does that make any sense?
> Ehm, I am not quite sure I understand (or agree) here.
> Certainly you do not want to say that ALL allocator requestes should
> be 128 bytes aligned... or that sizeof(Mutex) should be 128... That
> would be a little bit wasteful... (and would stress L2 cache too).
The allocator should be properly tuned for the hardware on which it
runs. What that means for alignment will vary from implementation to
implementation. The locks may or may not need to be padded and aligned
themselves, depending on what the allocator does.
I certainly agree that on most typical hardware platforms, aligning
small allocations on a 128-byte boundary makes little sense.
DS
Is that a static mutex table? If not, then certainly the max value
cannot get much higher than 2000, right? Also, you need to ensure that
the important data within a critical section is padded and properly
aligned on L2 cache line boundaries as well. This ensures that stores
to CS A will not invalidate data within any other CS, and vice versa...
[...]
I wasn't entirely clear as to where these threads were. Were they on this
cache server or on an application talking to the cache server?
Other possible problems/bottlenecks include:
* Virtual memory thrashing -- working set is larger than physical
memory, causing pages to be swapped in and out.
* communication channel bandwidth - sql -> cache server -> application
* memory fragmentation in the allocator
joe
I guess my description was confusing. What I do is this:
I have data in an array. Each element has a mutex. For a certain query,
I need a subset of the elements of this array. After getting each
element, I process it. The order in which I read elements from the
array does not matter.
Now what I was trying was to read data from the array in two passes -
in the first, read only elements where trylock succeeds; in the second,
do a blocking lock. Of course, this can be extended to more passes as
well.
As I said, this greatly reduced contention on element mutexes, but
surprisingly does not seem to increase the performance ;)
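In code, the scheme is roughly this (a sketch with `std::mutex`; the
element type and the processing step are made up for illustration, only
the pass structure matters):

```cpp
#include <mutex>
#include <vector>
#include <cstddef>

// Hypothetical element: a per-element lock plus some payload.
struct element {
    std::mutex lock;
    int value;
    element() : value(0) {}
};

template <class Process>
void visit_all(std::vector<element>& arr, Process process)
{
    std::vector<bool> done(arr.size(), false);

    // Pass 1: opportunistic - take only elements whose lock is free,
    // never entering a mutex wait-set.
    for (std::size_t i = 0; i < arr.size(); ++i) {
        if (arr[i].lock.try_lock()) {
            process(arr[i]);
            arr[i].lock.unlock();
            done[i] = true;
        }
    }
    // Pass 2: block on whatever was skipped the first time around.
    for (std::size_t i = 0; i < arr.size(); ++i) {
        if (!done[i]) {
            std::lock_guard<std::mutex> g(arr[i].lock);
            process(arr[i]);
        }
    }
}
```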
Mirek
Thanks, I guess I can exclude all of them for the current benchmark,
but it is something to consider anyway.
Mirek
What is the average ratio of try_lock successes to failures? AFAICT, my
initial response is that I don't see anything wrong with your scheme.
If you don't care about order, then there is really no reason for a
thread to get inserted into the mutex wait-set when it's currently
being held during the "first" pass. You're breaking data acquisition
into an opportunistic fast path (e.g., phase 1), and phase 2 would be
the slow path. This makes sense to me.
However, I would be interested in looking at a graph which plots the
average ratio of try_lock successes to failures across a diverse set of
workloads. Humm...
> I guess my description was confusing. What I do is this:
>
> I have data in array. Each element has mutex. For certain query, I
> need a subset of elements of this array. After getting each element, I
> process it. The order in which I am reading elements from the array
> does not matter.
>
> Now what I was trying was to read data from the array in two passes -
> in first, read only elements where trylock succeeds, in second do
> blocking lock. Of course, this can be extended to more passes as well.
Then I think the main cause of slowdown in your server is cache-
coherence traffic due to operations on locks and modifications of data
elements.
Dmitriy V'jukov
There really is no modification of data elements, so the remaining
possibility is cache-coherence traffic from the mutexes themselves :)
Will try to avoid that too, soon.
Mirek
Ha, mystery solved, at least partially.
In the application, I have a lot of "performance counters" counting
specific events to provide profiling info. These work as simple
unlocked increments of global variables (because we do not need 100%
accuracy there). Because they are "unlocked", I had falsely expected
that they would not have any impact on MT performance. Obviously false.
By removing these counters, I can gain most of the "lost" performance
back.
I think that the rest (about 5%) can be attributed to data sharing in
the remaining mutex. Unfortunately, that one is unavoidable.
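For the record, the usual fix (a sketch, not my actual code; the
thread-id indexing and the 128-byte padding are assumptions) is one
padded counter slot per thread, summed on demand, so the increments
never touch a cache line another thread writes:

```cpp
const int kMaxThreads = 64;   // assumed upper bound on worker threads

// One padded slot per thread; alignas keeps each slot on its own
// (assumed 128-byte) line span, so increments never ping-pong between
// cores. A thread only ever writes its own slot.
struct alignas(128) counter_slot {
    unsigned long n;
};

counter_slot g_query_count[kMaxThreads] = {};

// Fast path: plain increment of a thread-private slot, no locks.
void count_query(int thread_id) { ++g_query_count[thread_id].n; }

// Slow path: whoever reads the profiling info sums the slots.
unsigned long total_queries() {
    unsigned long sum = 0;
    for (int i = 0; i < kMaxThreads; ++i) sum += g_query_count[i].n;
    return sum;
}
```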
I guess my thanks should especially go to Dmitriy, as he pointed me in
the right direction.
Thanks everybody. MT is fun:)
Mirek
> Ha, mystery solved, at least partially.
> In the application, I have a lot of "performance counters" counting
> specific events to provide profiling info. These work as simple
> unlocked increments of global variables (because we do not need 100%
> accuracy there). By having them "unlocked" I have falsely expected
> that they will not have any impact on MT performance. Obviously,
> false.
> By removing these counters, I can gain most of "lost" of performance
> back.
> I think that the rest (about 5%) can be attributed to data sharing in
> remaining mutex. Unfortunately, that one is unavoidable.
> I guess my thanks should especially go to Dmitriy as he has sent me to
> the right direction.
Well, AFAICT, Dmitriy is a quick and very sharp individual.
> Thanks everybody.
You're welcome.
> MT is fun:)
Indeed! :^D