
What benefits do thread-local variables (threadvar) have ?


Skybuck Flying

Oct 18, 2008, 6:32:21 AM
Hello,

What benefits do thread-local variables (threadvar) have ?

Is there any speed benefit ???

Bye,
Skybuck.


Chris M. Thomasson

Oct 18, 2008, 3:25:34 PM
"Skybuck Flying" <Blood...@hotmail.com> wrote in message
news:cd55d$48f9bb2b$541981e1$25...@cache1.tilbu1.nb.home.nl...

> Hello,
>
> What benefits do thread-local variables (threadvar) have ?

They allow one to conveniently create thread-wide visibility of thread-local
variables. Here is what I mean by "thread-wide" visibility... Imagine a
thread function with some local variables:


void* thread_entry(void* state) {
    /* ordinary locals: visible only within this function */
    int thread_local_var1;
    short thread_local_var2;
    long thread_local_var3;
    int thread_local_var4;
    short thread_local_var5;
    return 0;
}


Fine. The variables `thread_local_var1' through `thread_local_var5' are visible
to this function. However, what if you wanted the function to call an existing
procedure with an established API declaration:


void foo() {

}


that you cannot change because other code depends on it, and you wanted
that procedure to have access to all of those variables? How would you do it
without altering the actual API? The answer is that you can't, unless you take
advantage of TLS; here is a simple example:


struct tls {
    int var1;
    short var2;
    long var3;
    int var4;
    short var5;
};


__declspec(thread) struct tls g_tls;


void foo();


void* thread_entry(void* state) {
    foo(); /* foo() can see this thread's g_tls */
    return 0;
}


void foo() {
    /* this procedure can access the tls for the calling thread */
    g_tls.var1 = 42; /* ...and likewise var2 through var5 */
}
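
For what it's worth, roughly the same thing can be done with the GCC-style
`__thread' extension instead of `__declspec(thread)' (just a sketch, assuming
a toolchain that supports it):


__thread struct tls g_tls;

void foo() {
    /* touches the calling thread's copy only */
    g_tls.var1 = 42;
}

void* thread_entry(void* state) {
    g_tls.var1 = 1; /* this thread's copy */
    foo();          /* foo() sees the very same per-thread copy */
    return 0;
}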


> Is there any speed benefit ???

Any time you can operate on local variables and not share anything, well,
that's always a plus.

;^)


Unfortunately, I don't have time to go into any more details; I've got to go.
Sorry. Anyway, you can take a look at some of the following links for
further context:

http://groups.google.com/group/comp.programming.threads/search?hl=en&group=comp.programming.threads&q=TLS+OR+TSD

ScratchMonkey

Oct 18, 2008, 11:54:22 PM
"Skybuck Flying" <Blood...@hotmail.com> wrote in news:cd55d$48f9bb2b
$541981e1$25...@cache1.tilbu1.nb.home.nl:

> What benefits do thread-local variables (threadvar) have ?

Good starting point:

http://en.wikipedia.org/wiki/Thread-Specific_Storage



> Is there any speed benefit ???

Compared to what?

H. Peter Anvin

Oct 19, 2008, 12:16:53 AM
to Scratc...@sewingwitch.com
ScratchMonkey wrote:
>> Is there any speed benefit ???
>
> Compared to what?

If what you're comparing to is making a system call to figure out which
thread you're running in, then yes, there is a huge benefit :)

-hpa

Chris M. Thomasson

Oct 19, 2008, 12:50:08 AM
"H. Peter Anvin" <h...@zytor.com> wrote in message
news:48FAB4B5...@zytor.com...

On Windows, 'TlsGetValue()' resolves down to a few instructions, basically
getting an offset from the FS register; no system call is made, and the kernel
is not involved. Platforms with a native PThread implementation also have
similar performance characteristics indeed. However, IMVHO, they incur a
little more overhead when compared with compiler-specific TLS extensions...
Oh well.

Skybuck Flying

Oct 19, 2008, 4:20:45 AM

"ScratchMonkey" <ScratchMonk...@sewingwitch.com> wrote in message
news:Xns9B3BD4AB6F8...@85.214.105.209...

> "Skybuck Flying" <Blood...@hotmail.com> wrote in news:cd55d$48f9bb2b
> $541981e1$25...@cache1.tilbu1.nb.home.nl:
>
>> What benefits do thread-local variables (threadvar) have ?
>
> Good starting point:
>
> http://en.wikipedia.org/wiki/Thread-Specific_Storage

Already seen that...

It's a description at best; it doesn't make me any wiser about any possible
benefits.

>
>> Is there any speed benefit ???
>
> Compared to what?

The usual variables...

Bye,
Skybuck.


Skybuck Flying

Oct 19, 2008, 4:35:25 AM
This seems to be an interesting link:

http://www.cs.tau.ac.il/~msagiv/MSThesis/YairSadeThesis.pdf

It starts by mentioning bottlenecks in memory allocation systems for
multi-threaded programs... then it probably goes on about these new kinds of
variables.

Seems interesting enough for a read :)

Bye,
Skybuck ;) :)


Skybuck Flying

Oct 19, 2008, 4:48:13 AM
"Skybuck Flying" <Blood...@hotmail.com> wrote in message
news:71a35$48faf140$541981e1$22...@cache2.tilbu1.nb.home.nl...

Well I quickly glanced over it.

It's mostly a document for compiler writers it seems... it goes on about some
kind of algorithm or so... maybe even for dynamic memory allocators or so...

So the lines I wrote above more or less mention what the problem might be
with multi-threaded systems and how thread-local storage... might solve some
of the bottlenecks, if present at all...

threads might even have their own heaps. ?!

I also saw something about context switches somewhere on the web probably
wikipedia or so... it mentions that sometimes the context switches might
have to copy all this thread local storage as well ?!? so maybe that would
make context switches more expensive if this feature is used ! ;)

^ Irony is... it could be counter productive ! ;) that's my conclusion for
now ;)

Bye,
Skybuck.


ScratchMonkey

Oct 19, 2008, 11:53:36 AM
"Skybuck Flying" <Blood...@hotmail.com> wrote in
news:dc3da$48faf440$541981e1$26...@cache2.tilbu1.nb.home.nl:

> threads might even have their own heaps. ?!

That's useful for faster allocators, eliminating the need to lock the heap
each time you allocate/free objects. If you know an object won't need to be
allocated on one thread and freed on another, and can communicate this to
the runtime, this could be a big savings.
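
As a rough illustration (my sketch, using the GCC `__thread' extension and a
hypothetical fixed-size object cache; a real allocator also needs a path for
objects freed by a different thread), the fast path stays lock-free simply
because each thread only ever touches its own list:


#include <stdlib.h>

struct free_node { struct free_node* next; };

static __thread struct free_node* tls_free_list = NULL;

void* obj_alloc(size_t size) {
    struct free_node* n = tls_free_list;
    if (n) {                      /* fast path: pop from this thread's list, no lock */
        tls_free_list = n->next;
        return n;
    }
    if (size < sizeof(*n)) size = sizeof(*n);
    return malloc(size);          /* slow path: fall back to the shared heap */
}

void obj_free(void* p) {
    struct free_node* n = (struct free_node*)p;
    n->next = tls_free_list;      /* push onto this thread's list, no lock */
    tls_free_list = n;
}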

> I also saw something about context switches somewhere on the web
> probably wikipedia or so... it mentions that sometimes the context
> switches might have to copy all this thread local storage as well ?!?
> so maybe that would make context switches more expensive if this
> feature is used ! ;)

That depends on the underlying architecture, and is a good reason why you
should look under the hood to know how the compiler implements things.

ScratchMonkey

Oct 19, 2008, 11:54:13 AM
"Skybuck Flying" <Blood...@hotmail.com> wrote in news:7ffeb$48faedd0
$541981e1$15...@cache2.tilbu1.nb.home.nl:

>>> Is there any speed benefit ???
>>
>> Compared to what?
>
> The usual variables...

An object's members are typically referenced as offsets from a base pointer
stored in a register. Referencing thread-local data is typically done the same
way, so the cost is about the same as for a member variable. Automatic data on
the stack is referenced the same way, with the stack or frame pointer register
used as the base pointer.

But the best way to tell is to learn your CPU architecture and then look at
some sample disassembly.
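
For instance, a tiny snippet like the following (using the `__thread' compiler
extension; the exact instructions vary by compiler and platform) can be
compiled with -S and the three accesses compared side by side:


struct obj { int member; };

__thread int tls_var;

int read_member(struct obj* o) { return o->member; } /* [base register + offset] */
int read_tls(void)             { return tls_var; }   /* typically [%fs/%gs segment + offset] on x86 */
int read_local(void)           { int local = 42; return local; } /* [stack/frame pointer + offset] */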

Check out this article on performance:

http://www.acm.org/ubiquity/views/v7i24_fallacy.html

H. Peter Anvin

Oct 19, 2008, 2:19:35 PM
to Chris M. Thomasson

The comment was more that "is there any speed benefit" pretty much
requires something to compare it to. Under Linux, the __thread
variables are also a segment register offset away and are extremely fast.

-hpa

MitchAlsup

Oct 19, 2008, 3:18:56 PM
On Oct 19, 3:48 am, "Skybuck Flying" <BloodySh...@hotmail.com> wrote:
> threads might even have their own heaps. ?!

When you are running on a machine with the following properties:
A) more than 4 chips that contain processors
B) accessing another cache is slower than accessing DRAM

All sorts of thread-local optimizations make sense.

On the other hand, when you are operating on a machine where
all the processors are in one chip, or maybe in two chips; there
is not a lot of difference between doing or not doing these kinds
of optimizations.

Mitch

Chris M. Thomasson

Oct 19, 2008, 5:26:00 PM
"ScratchMonkey" <ScratchMonk...@sewingwitch.com> wrote in message
news:Xns9B3C5A76B51...@85.214.105.209...

> "Skybuck Flying" <Blood...@hotmail.com> wrote in
> news:dc3da$48faf440$541981e1$26...@cache2.tilbu1.nb.home.nl:
>
>> threads might even have their own heaps. ?!
>
> That's useful for faster allocators, eliminating the need to lock the heap
> each time you allocate/free objects. If you know an object won't need to
> be
> allocated on one thread and freed on another, and can communicate this to
> the runtime, this could be a big savings.

You can have per-thread heaps and still allow objects to be freed in any
thread you want. This is not a limitation in any way, shape, or form. Here is
some simple pseudo-code for a distributed slab allocator I invented:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69


Here is a sample of a distributed region allocator I did:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/4038a7a8a4f5a7cb

the link to the code is here:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html


Here is another example of how to transform a single-threaded allocator into
a multi-threaded one:

http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/61145
(read whole thread please...)

the link to the code is here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html


Hoard uses per-thread heaps, but it does not use a non-blocking algorithm
when handling remote deallocations. Another algorithm is StreamFlow, which
is close to my algorithms, but it uses 64-bit CAS, while mine uses
single-word CAS which has its advantages...


Any thoughts?


[...]

Chris M. Thomasson

Oct 19, 2008, 5:40:26 PM

"MitchAlsup" <Mitch...@aol.com> wrote in message
news:04a9ae33-d053-493c...@t41g2000hsc.googlegroups.com...

Humm... Well, I know that my internal benchmarks show that one can reap
fairly significant improvements in scalability and throughput if mutexes can
be eliminated for local allocations. These enhancements show up on
multi-core systems as well. Also, there is reference counting. I can get rid
of atomic RMW instructions and expensive membars when mutating the reference
count by taking advantage of keeping counter arrays per-thread. This has
major improvements over a naive non-distributed counting algorithm which uses
an atomic RMW and a membar for every single counter mutation.
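
Just to illustrate the general shape of the idea (a crude sketch with
hypothetical names, not my actual counting algorithm; summing the slots is
only meaningful at a quiescent point), the update path touches nothing but
the calling thread's own padded slot:


#define MAX_THREADS 64
#define CACHE_LINE  64

struct padded_ctr {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_ctr g_refs[MAX_THREADS];
static __thread int g_my_slot; /* assume each thread is assigned a unique slot */

void ref_acquire(void) { g_refs[g_my_slot].value += 1; } /* plain add, no LOCK, no membar */
void ref_release(void) { g_refs[g_my_slot].value -= 1; }

long ref_total(void) {          /* the true count is the sum over all slots */
    long sum = 0;
    for (int i = 0; i < MAX_THREADS; ++i) {
        sum += g_refs[i].value;
    }
    return sum;
}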

AFAICT, when one can remove the need for atomics and membars in
synchronization algorithms, the performance gains show up on basically any
system... Take mutexes for example... Even if your locking scheme is
"perfect" such that contention on a mutex is reduced to near nothing and is
few and far between, there is still significant overhead involved. Usually,
a mutex lock procedure is comprised of an atomic RMW with a subsequent and
very expensive (#StoreLoad | #StoreStore) membar. A mutex unlock is made up
of a (#LoadStore | #StoreStore) membar followed by an atomic RMW. That's 4
expensive instructions for every lock/unlock pair; IMVHO, that's a heck of a
lot of overhead indeed...

Chris M. Thomasson

Oct 19, 2008, 6:00:22 PM

"Skybuck Flying" <Blood...@hotmail.com> wrote in message
news:dc3da$48faf440$541981e1$26...@cache2.tilbu1.nb.home.nl...

> "Skybuck Flying" <Blood...@hotmail.com> wrote in message
> news:71a35$48faf140$541981e1$22...@cache2.tilbu1.nb.home.nl...
>> This seems to be an interesting link:
>>
>> http://www.cs.tau.ac.il/~msagiv/MSThesis/YairSadeThesis.pdf
>>
[...]

> I also saw something about context switches somewhere on the web probably
> wikipedia or so... it mentions that sometimes the context switches might
> have to copy all this thread local storage as well ?!? so maybe that would
> make context switches more expensive if this feature is used ! ;)
>
> ^ Irony is... it could be counter productive ! ;) that's my conclusion for
> now ;)

Most of the time, especially when using the TLS APIs, a simple pointer is all
that will/can be stored into TLS; a simple example:


class per_thread {
    // [whatever];
};

/* assumes a pthread_key_t created elsewhere with pthread_key_create(),
   registering on_per_thread_tls_dtor() as its destructor; the key argument
   is elided below */
struct per_thread*
per_thread_self() {
    per_thread* tls = (per_thread*)pthread_getspecific(...);
    if (! tls) {
        /* lazily create this thread's instance on first use */
        tls = new per_thread();
        pthread_setspecific(..., tls);
    }
    return tls;
}

void on_per_thread_tls_dtor(void* state) {
    per_thread* tls = (per_thread*)state;
    delete tls;
}


In fact, the TLS APIs of both POSIX and Windows threads (e.g.,
pthread_setspecific()/TlsSetValue()) only allow one to store pointers into
TLS. However, compiler extensions do indeed allow one to store entire
data-structures in TLS. Here is some further context on how Windows
executables manage this:

http://www.nynaeve.net/?tag=crt
(scroll to bottom of page for relevant information...)

Chris M. Thomasson

Oct 19, 2008, 6:01:35 PM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:2INKk.32605$SH5....@newsfe08.iad...

>
> "MitchAlsup" <Mitch...@aol.com> wrote in message
> news:04a9ae33-d053-493c...@t41g2000hsc.googlegroups.com...
> On Oct 19, 3:48 am, "Skybuck Flying" <BloodySh...@hotmail.com> wrote:
>> > threads might even have their own heaps. ?!
>
>> When you are running on a machine with the following properties:
>> A) more than 4 chips that contain processors
>> B) accessing another cache is slower than accessing DRAM
>
>> All sorts of thread-local optimizations make sense.
>
>> On the other hand, when you are operating on a machine where
>> all the processors are in one chip, or maybe in two chips; there
>> is not a lot of difference between doing or not doing these kinds
>> of optimizations.
>
> Humm... Well, I know that my internal benchmarks show that one can reap
> fairly significant improvements in scalability and throughput if mutexs
> can be eliminated for local allocations.

I know that my internal benchmarks do not prove anything. It's just what I
have personally observed.

[...]

Skybuck Flying

Oct 19, 2008, 8:40:38 PM

"ScratchMonkey" <ScratchMonk...@sewingwitch.com> wrote in message
news:Xns9B3C5A91B63...@85.214.105.209...

Read it; it's not enough though to just learn assembly, compilers and
machines.

Software systems themselves can be very complex... all kinds of modules and
algorithms.

What is needed on top of the mentioned things is more insight into the
system...

Profiling can help with that, to spot where the percentages of CPU time go...

Then the programmer can decide where to try to optimize first and reduce those
percentages.

In other words: programmers need more tools, preferably integrated into the
IDE.

IDE's are great for writing quick code, and even debugged code... but what
do they offer for performance ?

Not that much... it's not enough to say: the compiler will do all the
optimizations.

The IDE has to do more on the performance front than what they are currently
doing ! ;) which is pretty much nothing at the moment ! ;)

Makes me wonder if hardware people fear such optimization tools :) Maybe they
fear that they won't sell any new hardware because the software is so damn
fast ! Well, rest assured those fears are unwarranted... the demand for CPU
processing power is unlimited !

Bye,
Skybuck.


Chris M. Thomasson

Oct 19, 2008, 9:25:59 PM
"H. Peter Anvin" <h...@zytor.com> wrote in message
news:48FB7A37...@zytor.com...

> Chris M. Thomasson wrote:
>> "H. Peter Anvin" <h...@zytor.com> wrote in message
>> news:48FAB4B5...@zytor.com...
>>> ScratchMonkey wrote:
>>>>> Is there any speed benefit ???
>>>>
>>>> Compared to what?
>>>
>>> If what you're comparing to is making a system call to figure out which
>>> thread you're running in, then yes, there is a huge benefit :)
>>
>> On Windows, 'TlsGetValue()' resolves down to a few instructions,
>> basically getting an offset from the FS register; no system call is made,
>> the Kernel is not involved. Platforms with a native PThread
>> implementation also have similar performance characteristics indeed.
>> However, IMVHO, they incur a little more overhead when compared with
>> compiler specific TLS extensions... Oh well.
>
> The comment was more to the "is there any speed benefit" pretty much
> requires something to compare it to.

Okay; I see.


> Under Linux, the __thread variables are also a segment register offset and
> is extremely fast.

Indeed.

Piotr Wyderski

Oct 20, 2008, 4:41:39 AM
Chris M. Thomasson wrote:

> On Windows, 'TlsGetValue()' resolves down to a few instructions, basically
> getting an offset from the FS register; no system call is made, the Kernel
> is not involved.

In my code its overhead is measurable, so I use direct references through
FS.

> Platforms with a native PThread implementation also have similar
> performance characteristics indeed.

Some do, some do not. For this reason I have a stack pointer-based
TLS cache. :-) Fast TLSes are absolutely crucial.

Best regards
Piotr Wyderski

Piotr Wyderski

Oct 20, 2008, 4:48:33 AM
Chris M. Thomasson wrote:

> Humm... Well, I know that my internal benchmarks show that one can reap
> fairly significant improvements in scalability and throughput if mutexs
> can be eliminated for local allocations.

I confirm, the speedup is tremendous.

> These enhancements show up on multi-core systems as well. Also, there is
> reference counting. I can get rid of atomic RMW instructions and expensive
> membars when mutating the reference count by taking advantage of keeping
> counter arrays per-thread. This has major improvements over naive
> non-distributed counting algorithm which uses atomic RMW and membar for
> every single counter mutation.

Confirmed again. :-)

> AFAICT, when one can remove the need for atomics and membars in
> synchronization algorithms, the performance gaines show up on basically
> any system... Take mutexs for example...

In fact, you can make a queue your only synchronization primitive
and forget about mutexes, semaphores, events, spinlocks and other pessimistic
synchronization entities. And we both know how to write a good queue. :-)
But it requires a complete re-engineering of the application, so it must be
an architectural decision from day one. It's not possible to introduce it
later without much effort.
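
Just to make the flavour concrete (a minimal single-producer/single-consumer
ring written with C11-style atomics as convenient shorthand; an illustrative
sketch, not a production design), the point is that the two sides only ever
wait on data, never on a lock:


#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024 /* power of two */

struct spsc_ring {
    void* slots[RING_SIZE];
    _Atomic size_t head; /* advanced by the consumer */
    _Atomic size_t tail; /* advanced by the producer */
};

int spsc_push(struct spsc_ring* q, void* item) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == RING_SIZE) return 0;  /* full */
    q->slots[tail % RING_SIZE] = item;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

int spsc_pop(struct spsc_ring* q, void** item) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return 0;              /* empty */
    *item = q->slots[head % RING_SIZE];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 1;
}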

Best regards
Piotr Wyderski

Piotr Wyderski

Oct 20, 2008, 4:53:04 AM
Chris M. Thomasson wrote:

> Here is a sample of a distributed region allocator I did:

Why do you need a distributed region allocator? It's being called so rarely
that it has no impact on the system performance... The program should be
as lock-free as possible, but not more... :-D

> Hoard uses per-thread heaps, but it does not use a non-blocking algorithm
> when handling remote deallocations.

Hoard is overrated :-)

> Another algorithm is StreamFlow, which is close to my algorithms, but it
> uses 64-bit CAS, while mine uses single-word CAS which has its
> advantages...

So does mine.

Best regards
Piotr Wyderski

Jim Carlock

Oct 20, 2008, 9:14:23 AM
"Skybuck Flying" wrote...
: Makes me wonder if hardware people fear such optimization tools :) Maybe
: fear that they won't sell any new hardware because the software is so
: damn fast! Well rest assured those fears are unwarrented... the demand

: for cpu processing power is unlimited !

Sounds like a course in economics... unlimited wants, limited resources.
And so the price goes up when the demand exceeds the supply.

Hardware people would adore and love such tools and employ the tools in
the daily creation of their hardware. The prices on solid state drives
are coming down. 128 GB SSDs now sell all over. We've run into some of
what you mentioned about such speed tools. Not many companies print out
the write speeds of their SSDs. They avoid printing such figures and we then
rely upon external sources to provide such information. A lot of these tools
exist as command-line tools that rarely get talked about. Does anyone know of
such tools?

--
Jim Carlock
Great Prices On Solid State Drives
http://www.microcosmotalk.com/really/cool/products/ssd/


ScratchMonkey

Oct 20, 2008, 10:29:49 AM
"Skybuck Flying" <Blood...@hotmail.com> wrote in
news:7b106$48fbd379$541981e1$17...@cache2.tilbu1.nb.home.nl:

> IDE's are great for writing quick code, and even debugged code... but
> what do they offer for performance ?
>
> Not that much... it's not enough to say: the compiler will do all the
> optimizations.
>
> The IDE has to do more on the performance front than what they are
> currently doing ! ;) which is pretty much nothing at the moment ! ;)
>
> Makes me wonder if hardware people fear such optimization tools :)
> Maybe fear that they won't sell any new hardware because the software
> is so damn fast ! Well rest assured those fears are unwarrented... the
> demand for cpu processing power is unlimited !

The danger here is vendor lock-in. The naive programmer will write code to
the hardware much in the same way many now write to the OS, leaving their
application locked to a specific OS and architecture. The savvy programmer
will isolate performance-critical code in platform-specific modules to
allow easy swap-out when the platform changes.

MitchAlsup

Oct 20, 2008, 11:40:27 AM
On Oct 19, 4:40 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "MitchAlsup" <MitchAl...@aol.com> wrote in message

Criss:

I was pointing out some hardware features that could modulate whether
to expend effort or not at the software level.
You are pointing out that avoiding things with poor hardware
performance also increases software performance.
There is no actual disagreement, here.

However, there are software implementation features that cause
software to 'have' to utilize low-performing hardware 'hooks' to make
certain software/system guarantees when software might already know
that these guarantees are manifest.

Mitch

EricP

Oct 20, 2008, 1:15:16 PM
Chris M. Thomasson wrote:
>
> Humm... Well, I know that my internal benchmarks show that one can reap
> fairly significant improvements in scalability and throughput if mutexs
> can be eliminated for local allocations. These enhancements show up on
> multi-core systems as well. Also, there is reference counting. I can get
> rid of atomic RMW instructions and expensive membars when mutating the
> reference count by taking advantage of keeping counter arrays
> per-thread. This has major improvements over naive non-distributed
> counting algorithm which uses atomic RMW and membar for every single
> counter mutation.
>
> AFAICT, when one can remove the need for atomics and membars in
> synchronization algorithms, the performance gaines show up on basically
> any system... Take mutexs for example... Even if your locking scheme is
> "perfect" such that contention on a mutex is reduced to near nothing and
> is few and far between, there still is significant overheads involved.
> Usually, a mutex lock procedure is comprised of an atomic RMW with a
> subsequent and very expensive (#StoreLoad | #StoreStore) membar. A mutex
> unlock is made up of a (#LoadStore | #StoreStore) membar followed by an
> atomic RMW. That's 4 expensive instructions for every lock/unlock pair;
> IMVHO, that's a heck of a lot of overhead indeed...

I have seen this kind of statement about atomic instruction performance
made a number of times in c.p.t but how true is it really?
It seems to have an air of urban legend to it.

I have seen atomic instr time measurements, but my impression
was that they measured the instr time and not the incremental
cost of atomic vs normal. I also have some concerns about
measurements made under artificially high contention scenarios
by tight looping and bashing a single cache line which may inflate
costs if the underlying hardware bus protocols are O(N**2).
However this would seem to apply equally to both atomic and normal.

For example, comparing an INC to a LOCK INC, the primary cost
appears to be in loading the cache line, if necessary.
(I avoid using XCHG as an example as it is atomic by default
on x86/x64). The cache miss cost is present whether atomic or not.

There can be a fight if multiple CPUs request exclusive ownership
of a cache line at once and the bus protocol has to resolve this.
The only public documentation I could find on how this is handled was on
the very old DASH protocol, and it handled this with a timeout-retry
loop (thus my concern about N**2 bus protocols).

There is a stall of the atomic instr until prior loads and stores
have completed. But an own-exclusive prefetch of the target atomic
cache line might be initiated while waiting.

The rules to inhibit load bypassing can be/are enforced with
replay traps so they only cost on actual collisions.

The rules for subsequent store bypasses cannot be optimized;
however, they only seem to affect committed stores after
the atomic instr, and store-to-load forwarding should still
be allowed so some/most execution can continue.

So surely atomic can't have _that_ much effect beyond the
cost of a cache line transfer?

Eric

David Schwartz

Oct 20, 2008, 2:31:38 PM
On Oct 19, 1:20 am, "Skybuck Flying" <BloodySh...@hotmail.com> wrote:

> >> Is there any speed benefit ???
>
> > Compared to what?
>
> The usual variables...

Normal variables:
1) You access them.

Thread-specific variables:
1) You figure out which thread this is.
2) You turn that into an offset.
3) You merge that offset with the base address to find the right
variable.
4) You access it.

Clever implementations can make those four steps not be as painful as
they might seem, but it will probably never be as fast as a regular
variable access.
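
One common mitigation (just a sketch, with hypothetical names) is to pay that
lookup once per batch of work and then go through an ordinary local pointer:


struct per_thread_state { long counter; };

/* assumed lookup helper, e.g. built on pthread_getspecific() */
struct per_thread_state* per_thread_self(void);

void hot_loop(int n) {
    struct per_thread_state* t = per_thread_self(); /* steps 1-3 happen once here */
    for (int i = 0; i < n; ++i) {
        t->counter += 1;                            /* plain base+offset access from now on */
    }
}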

If you're thinking it will somehow be faster because it will be
"closer to the thread", keep in mind that the thread moves.

DS

Chris M. Thomasson

Oct 20, 2008, 8:08:15 PM
"EricP" <ThatWould...@thevillage.com> wrote in message
news:823c0$48fcbcec$45c49ea8$16...@TEKSAVVY.COM...

> Chris M. Thomasson wrote:
>>
>> Humm... Well, I know that my internal benchmarks show that one can reap
>> fairly significant improvements in scalability and throughput if mutexs
>> can be eliminated for local allocations. These enhancements show up on
>> multi-core systems as well. Also, there is reference counting. I can get
>> rid of atomic RMW instructions and expensive membars when mutating the
>> reference count by taking advantage of keeping counter arrays per-thread.
>> This has major improvements over naive non-distributed counting algorithm
>> which uses atomic RMW and membar for every single counter mutation.
>>
>> AFAICT, when one can remove the need for atomics and membars in
>> synchronization algorithms, the performance gaines show up on basically
>> any system... Take mutexs for example... Even if your locking scheme is
>> "perfect" such that contention on a mutex is reduced to near nothing and
>> is few and far between, there still is significant overheads involved.
>> Usually, a mutex lock procedure is comprised of an atomic RMW with a
>> subsequent and very expensive (#StoreLoad | #StoreStore) membar. A mutex
>> unlock is made up of a (#LoadStore | #StoreStore) membar followed by an
>> atomic RMW. That's 4 expensive instructions for every lock/unlock pair;
>> IMVHO, that's a heck of a lot of overhead indeed...
>
> I have seen this kind of statement about atomic instruction performance
> made a number of times in c.p.t but how true is it really?
> It seems to have an air of urban legend to it.

Well, the LOCK prefix can indeed decide to lock the BUS. It can also lock
the cache line. Also, it executes something analogous to a full #StoreLoad
membar. How can you compare those actions to that of a "naked" RMW
operation? Which one has less of an effect on the system?

[...]

Chris M. Thomasson

Oct 20, 2008, 8:09:33 PM

"MitchAlsup" <Mitch...@aol.com> wrote in message
news:2f6179f0-a5b3-4d0a...@i18g2000prf.googlegroups.com...

On Oct 19, 4:40 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "MitchAlsup" <MitchAl...@aol.com> wrote in message

> Criss:

"Chris"


> I was pointing out some hardware features that could modulate whether
> to expend effort or not at the software level.
> You are pointing out that avoiding things with poor hardware
> performance also increases software performance.
> There is no actual disagreement, here.

:^D


> However, there are software implementation features that cause
> software to 'have' to utilize low performing hardware "hooks' to make
> certain software/system guarentees when software might already known
> that these guarentees are manifest.

Humm... Need to think about this...

Chris M. Thomasson

Oct 20, 2008, 8:14:39 PM
"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote in
message news:gdhbk7$ljn$1...@node1.news.atman.pl...

> Chris M. Thomasson wrote:
>
>> Here is a sample of a distributed region allocator I did:
>
> Why do you need a distributed region allocator? It's being called so
> rarely
> that it has no impact on the system performance... The program should be
> as lock-free as possible, but not more... :-D

Well, I did it for fun. I wanted to see how a distributed region allocator
acts in a general purpose environment. I wanted to compute offsets into the
region without using atomic RMW instructions. Notice I don't have the
"flush" procedure which resets the region. I actually have an explicit
deallocate function! This is definitely _not_ normal for a region allocator
in any way, shape, or form.


Normally region allocation has:

void* region_alloc(size_t);

void region_flush(void);


I was just experimenting; oh well...

;^D

Piotr Wyderski

Oct 21, 2008, 4:08:26 AM
Chris M. Thomasson wrote:

> Well, I did it for fun.

And it's the best explanation one can provide. :-)

> Notice I don't have the "flush" procedure which resets the region. I
> actually have
> an explicit deallocate function! This is definitely _not_ normal for a
> region allocator in any way shape or form.
>
>
> Normally region allocation has:
>
> void* region_alloc(size_t);
>
> void region_flush(void);

Frankly, I have never seen an allocator equipped with alloc/flush, only
alloc/dealloc. My own region allocator is a hybrid solution (it provides
both regions and large blocks of arbitrary size), so in fact I have two
such alloc/dealloc pairs in it. But it is not lock-free, as its performance
evaluation strongly suggested not doing so.

Best regards
Piotr Wyderski

Skybuck Flying

Oct 21, 2008, 10:58:46 AM
A demo in Delphi which shows "any speed gain" would be nice ! ;)

Anybody ? :)

Bye,
Skybuck ;)


EricP

Oct 21, 2008, 11:40:59 AM

I'm not sure what you are asking.
I am hypothesizing that the performance differences that have been
observed were in fact not due to membars, as you suggest, but were in
fact measurements of the transfer time for contended cache lines.

If my hypothesis is correct then it is not atomic operations
that should be avoided but rather high cache line contention.
Or to phrase this another way, the improvements you may have observed
when algorithms were redesigned to avoid using atomic ops actually
gained their benefits from using spatial separation to avoid
having a thundering herd bashing a single location.

But I have not measured this so I can't prove it.

The difference is subtle but has implications in algorithm design.
If true it means not that atomic ops should be avoided or even
minimized, but rather arranged so that, on average, there are at
most just one or two accessors for each line at any time but then
use as many atomic ops as necessary. It is a change of emphasis.

WRT bus locking, it doesn't actually assert a bus lock for memory
locations as the coherency protocol accomplishes the same effect:

In sysprog guide, Vol3A,
2.6.5 Controlling the Processor
"In the Pentium 4, Intel Xeon, and P6... If a memory access is
cacheable and affects only a single cache line, a cache lock is
invoked and the system bus and the actual memory location in system
memory are not locked during the operation."

Eric

MitchAlsup

Oct 21, 2008, 1:24:11 PM
On Oct 20, 12:15 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> So surely atomic can't have _that_ much affect beyond the
> cost of a cache line transfer?

You are making the assumption that atomic stuff operates at CPU
frequencies. They do not. The only part of atomics that operate at CPU
frequencies is the couple of cycles after the line arrives and the
data gets conditionally modified. Everything else is simply waiting
(as seen by the CPU).

With the x86s I am familiar with, an atomic has to wait to become non-
speculative, then flush pending write buffers and allow any
outstanding reads to complete, before the atomic stuff can begin. The
non-speculative part is at least as harmful to performance as the
flushing of buffers in typical cases. But memory latency is always a
problem.

Chris M. Thomasson

Oct 22, 2008, 2:17:24 AM
"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote in
message news:gdjtc2$6ns$1...@node1.news.atman.pl...

Here is a paper that explains some of the semantics of region allocation,
and some info about a hybrid of heaps and regions called "reaps":

http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf

A region allocation API does not "generally" have a free procedure; you're not
supposed to free individual blocks within a region. Instead, you simply
flush the contents by resetting the offset in a single operation. Let's see
if I can quickly create some crude pseudo-code:
___________________________________________________________
#define REGION 8192

struct region {
    unsigned char buf[REGION];
    size_t offset; /* = 0 */
};

/* bump-allocate from the region's buffer */
static void*
region_alloc(
    struct region* const self,
    size_t size
) {
    size_t const offset = self->offset;
    if (offset + size <= REGION) {
        self->offset = offset + size;
        return self->buf + offset;
    }
    return NULL;
}

/* reclaim everything at once by resetting the offset */
static void
region_flush(
    struct region* const self
) {
    self->offset = 0;
}
___________________________________________________________


In order to add an individual free operation, well, one can do what my
non-blocking region allocator did; something like:
___________________________________________________________
#define REGION 8192

struct region {
    unsigned char buf[REGION];
    size_t offset; /* = 0 */
    size_t count;  /* = 0; number of live allocations */
};

static void*
region_alloc(
    struct region* const self,
    size_t size
) {
    size_t const offset = self->offset;
    if (offset + size <= REGION) {
        self->offset = offset + size;
        ++self->count; /* track outstanding allocations */
        return self->buf + offset;
    }
    return NULL;
}

static void
region_free(
    struct region* const self
) {
    --self->count;
    if (! self->count) {
        self->offset = 0; /* last free resets the whole region */
    }
}
___________________________________________________________
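
A tiny usage sketch for that second variant (single-threaded here just to show
the bookkeeping; the distributed versions linked earlier do the same sort of
bookkeeping per-thread):
___________________________________________________________
int region_example(void) {
    struct region r = { .offset = 0, .count = 0 };
    void* a = region_alloc(&r, 100);
    void* b = region_alloc(&r, 200);
    if (! a || ! b) return -1;
    region_free(&r); /* count 2 -> 1, offset stays where it is */
    region_free(&r); /* count 1 -> 0, the whole region resets (offset = 0) */
    return 0;
}
___________________________________________________________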


The paper I cited uses something called "reaps" which appends a header to
each allocation such that they can be freed into a heap data-structure.
IMVHO, this is crap as I personally strive to only support headerless
allocations; read here for further context:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb0cc5c0218b8570

http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/61145


Unfortunately, you need to read every message in those threads to gain
proper context... Sorry about that!

;^(...

Chris M. Thomasson

Oct 22, 2008, 2:29:01 AM

"EricP" <ThatWould...@thevillage.com> wrote in message
news:9d70c$48fdf83e$45c49ea8$18...@TEKSAVVY.COM...

Humm... Well, the MFENCE instruction destroys performance because it
basically flushes everything. It ruins accumulated data in the pipelines,
etc...


> If my hypothesis is correct then it is not atomic operations
> that should be avoided but rather high cache line contention.
> Or to phrase this another way, the improvements you may have observed
> when algorithms were redesigned to avoid using atomic ops actually
> gained their benefits from using spacial separation to avoid
> having a thundering herd bashing a single location.
>
> But I have not measured this so I can't prove it.

RCU gets around having to execute a membar, or locked instruction on x86,
for reader threads; this alone has __major__ performance benefits indeed:


http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce


Also, look at some of the measurements of RCU:

http://www.rdrop.com/users/paulmck/RCU


Avoiding the membar can result in orders of magnitude improvement in
throughput, scalability, and overall performance...


> The difference is subtle but has implications in algorithm design.
> If true it means not that atomic ops should be avoided or even
> minimized, but rather arranged so that, on average, there are at
> most just one or two accessors for each line at any time but then
> use as many atomic ops as necessary. It is a change of emphasis.
>
> WRT bus locking, it doesn't actually assert a bus lock for memory
> locations as the coherency protocol accomplishes the same effect:
>
> In sysprog guide, Vol3A,
> 2.6.5 Controlling the Processor
> "In the Pentium 4, Intel Xeon, and P6... If a memory access is
> cacheable and affects only a single cache line, a cache lock is
> invoked and the system bus and the actual memory location in system
> memory are not locked during the operation."

The fast path locks the cache line. The slow path locks the bus. Anyway, a
LOCK'd instruction basically behaves like an MFENCE instruction. Humm...
Perhaps I should code up a VERY simple example in C that might be able to
provide some context. Let me think here...

Chris M. Thomasson

Oct 22, 2008, 2:38:18 AM
"MitchAlsup" <Mitch...@aol.com> wrote in message
news:57549fad-2347-4a36...@v56g2000hsf.googlegroups.com...

Yeah. I personally would love to see "naked" atomics on x86; think of how
SPARC does things. An atomic operation is not a memory barrier in and of
itself; instead, they are nicely separated. On x86, a spinlock
implementation can use the XCHG instruction to acquire the lock; period, end
of story. On SPARC, the SWAP instruction is equivalent to XCHG _except_ for
one very important respect... One cannot use SWAP __alone__ to implement the
lock portion of a spinlock. Instead, the programmer needs to explicitly add
a (MEMBAR #StoreLoad | #StoreStore) instruction after the result of the SWAP
instruction has indicated that the lock was indeed acquired. IMVHO,
combining membars and atomics like x86 has done was not a very good idea wrt
performance; I like to keep my membars and atomics nicely separated. This is
in my very humble opinion of course!
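
Using notation along the lines of C11's <stdatomic.h> as convenient shorthand,
a sketch of that separation looks something like this (the exchange itself is
relaxed and the ordering is added as a distinct fence, SPARC-style, instead of
being fused into the RMW):


#include <stdatomic.h>

typedef struct { atomic_int word; } spinlock_t;

void spin_lock(spinlock_t* l) {
    while (atomic_exchange_explicit(&l->word, 1, memory_order_relaxed)) {
        /* spin; a real lock would back off or yield here */
    }
    atomic_thread_fence(memory_order_acquire); /* the explicit "membar" after the swap */
}

void spin_unlock(spinlock_t* l) {
    atomic_thread_fence(memory_order_release); /* membar before the releasing store */
    atomic_store_explicit(&l->word, 0, memory_order_relaxed);
}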

;^/

Piotr Wyderski

Oct 22, 2008, 4:13:48 AM
Chris M. Thomasson wrote:

> Here is a paper that explains some of the semantics of region allocation,
> and some info about a hybrid of heaps and regions called "reaps":
>
> http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf

I will be happy to read it later. :-)

> Region allocation API does not "generally" have a free procedure; your not
> supposed to free individual blocks within a region.

Well, I haven't seen all the existing allocators, but mine frees
individual blocks, down to the OS level if necessary. So it is a kind of
surprise to hear about that.

> The paper I cited uses something called "reaps" which appends a header to
> each allocation such that they can be freed into a heap data-structure.
> IMVHO, this is crap as I personally strive to only support headerless
> allocations; read here for further context:

I agree, headers are crap and thus I don't use them, at least on a per-chunk
basis. By the way, would you like to discuss work-stealing vs work-sharing
queues?

Best regards
Piotr Wyderski

Piotr Wyderski

Oct 22, 2008, 4:18:21 AM
Chris M. Thomasson wrote:

> RCU gets around having to execute a membar, or locked instruction on x86,
> for reader threads; this alone has __major__ performance benefits indeed:

But RCU is generally a kernel-mode mechanism and it is quite hard to emulate
it efficiently at user level. That's sad, because the idea is brilliant.

> Avoiding the membar can result in order's of magnitude improvement in
> throughput, scalability, and overall performance...

Indeed.

Best regards
Piotr Wyderski

Piotr Wyderski

Oct 22, 2008, 4:23:09 AM
Chris M. Thomasson wrote:

> One cannot use SWAP __alone__ to implement the lock portion of a spinlock.
> Instead, the programmer needs to explicitly add a (MEMBAR #StoreLoad |
> #StoreStore) instruction after the result of the SWAP instruction has
> indicated that the lock was indeed acquired. IMVHO, combining membars and
> atomics like x86 has done was not a very good idea wrt performance; I like
> to keep my membars and atomic nicely separated. This is in my very humble
> opinion of course!

So there are two of us. For instance, think about reference
counters -- why the heck does one need a membar there?
On PowerPC and SPARC they may be implemented membar-lessly...

Best regards
Piotr Wyderski

Skybuck Flying

Oct 22, 2008, 4:40:18 AM
I was wondering about cache line effect too.

Suppose I wrote this program... which calls all kinds of routines and has
all these branches.

All this code is of course in memory.

However only some fragments of it are executed.

What if the fragments happen to be on the same instruction cache line...

That would be bad.

So I wonder: are there tools available which show in which cache line each
instruction byte ends up?

Bye,
Skybuck.


EricP

Oct 22, 2008, 12:05:07 PM

Well yes, that is the common wisdom I was questioning. :-)
See below.

>> If my hypothesis is correct then it is not atomic operations
>> that should be avoided but rather high cache line contention.
>> Or to phrase this another way, the improvements you may have observed
>> when algorithms were redesigned to avoid using atomic ops actually
>> gained their benefits from using spacial separation to avoid
>> having a thundering herd bashing a single location.
>>
>> But I have not measured this so I can't prove it.
>
> RCU gets around having to execute a membar, or locked instruction on
> x86, for reader threads; this alone has __major__ performance benefits
> indeed:
>
>
> http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce

This link says nothing about performance measurements.

> Also, look at some of the measurements of RCU:
>
> http://www.rdrop.com/users/paulmck/RCU
>
> Avoiding the membar can result in order's of magnitude improvement in
> throughput, scalability, and overall performance...

I'm not sure which documents you think support this.
The only paper I see on performance is

Making Lockless Synchronization Fast:
Performance Implications of Memory Reclamation
Hart, McKenney, Brown 2006

they compare

quiescent-state-based reclamation (QSBR),
epoch-based reclamation (EBR),
hazard-pointer-based reclamation (HPBR)
Lock-Free Reference Counting (LFRC)

In section 3.1 Memory Consistency they state QSBR does not
use fences, HPBR and EBR require 1 fence per operation,
and LFRC requires 2 fences (one for count INC and one for DEC).

However, the differences between the algorithms would appear to
make a conclusion as to the cost of fences impossible.
According to figure 9, the cost difference between
QSBR (no fence) (cost = ~300) and HPBR (1 fence) (cost = ~350)
is about the same as the difference between
HPBR and EBR (cost = ~390) (both one fence).

Also, LFRC (cost = ~900) has all readers and writers bashing
the same counter variable in each object, so it is impossible
to separate out how much of the cost is due to 2 fences vs 1 fence,
and how much is line contention.
However, if we take the QSBR to HPBR difference of 50 as a guide
to the cost of 1 fence, then the *VAST* majority of the huge 900 cost
of LFRC is possibly due to line contention!!!

Eric


Chris M. Thomasson

Oct 23, 2008, 3:41:12 AM

"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote in
message news:gdmi1i$vt5$1...@node1.news.atman.pl...

I have no time right now. I PROMISE to engage in a detailed conversation
with you on this subject indeed. However, off the top of my head, well, SUN
Microsystems has a fairly elegant work-stealing queue in the works:

http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf


Here is my little retarded anti-work-stealing anomaly:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/6019e377e4aa73ec


It's based on "so-called" work-`requesting'... I look forward to the flames!
READ Dmitriy's excellent response PLEASE!!!

;^D Need to go now. I have to work on work... CRAP! Fun times!


;^(...

EricP

Oct 26, 2008, 1:41:20 PM

I played with this a bit to test the cost of fences and atomics.
I have a (sorry Mitch) Intel core duo E8400 3 GHz

The test program randomly accesses an array of cache line sized records
doing some simple increment operations on either 8 or 16 32-bit
variables, with 0, 1 or 2 MFENCE, or 1 or 2 atomic increments mixed in
and measures the RDTSC count. The tests include filters to screen
out timer interrupts that occur inside the measurment windows.
I can provide more details if you would like.

The results, measured in RDTSC cycle counts, are that an MFENCE
costs between 16 and 35 cycles, and a LOCK XADD between 20 and 50 cycles.
This is against a base loop cost of about 400 cycles.

My real interest is in an atomic contention test.
However I only have a duo processor so that would be
of limited usefulness, but I'll give it a try later anyway.

Eric


Chris M. Thomasson

Oct 26, 2008, 9:32:10 PM

"EricP" <ThatWould...@thevillage.com> wrote in message
news:2b8ed$4904ac42$45c49ea8$15...@TEKSAVVY.COM...

Here is a VERY quick and dirty test I whipped up using GCC and pthreads that
attempts to measure the impact that LOCK'd RMW and MFENCE instructions have
on overall scalability and throughput: (I hope the newsreader does not
mangle this post)
_______________________________________________________________________
#define _XOPEN_SOURCE 600
#include <pthread.h>
#include <iostream>
#include <climits>
#include <ctime>


#define L2CACHE 128
#define QUEUE_DEPTH 128
#define ITER_PER_THREAD 100000000
#define THREADS 2
#define RUNS 2
#define YIELD 512


typedef unsigned long atomic_word32;

typedef char sassert[
sizeof(atomic_word32) == 32 / CHAR_BIT
];


__attribute__((always_inline))
static __inline__ atomic_word32
XADD32(
atomic_word32 volatile *self,
atomic_word32 const addend
) {
atomic_word32 ret;
__asm__ __volatile__(
"LOCK XADD %2, %1 ;\n"
:
"=&r" (ret),
"=m" (*self)
:
"0" (addend)
:
"memory", "cc"
);
return ret;
}


#define MFENCE() __asm__ __volatile__ ("MFENCE" : : : "memory")


struct queue {
struct node {
struct node* volatile m_next;
char pad[13];
};

node* volatile m_head;
node* volatile m_tail;

char pad1[L2CACHE - (sizeof(struct node*) * 2)];
atomic_word32 volatile m_membar;
char pad2[L2CACHE - sizeof(atomic_word32)];

queue() : m_head(NULL), m_tail(NULL), m_membar(0) {
node* n = new node;
n->m_next = NULL;
m_head = m_tail = n;
for (unsigned i = 1; i < QUEUE_DEPTH; ++i) {
n = new node;
if (! (i % 2)) {
n->m_next = NULL;
m_tail->m_next = n;
m_tail = n;
} else {
n->m_next = m_head;
m_head = n;
}
}
}

~queue() {
node* n = m_head;
while (n) {
node* const nx = n->m_next;
delete n;
n = nx;
}
}

void iterate() {
node* n = m_head;
while (n) {
n = n->m_next;
}
}

void LOCK_RMW() volatile {
XADD32(&m_membar, 1);
}

void SMR_MFENCE_ENTER(atomic_word32 volatile* x) volatile {
retry:
*x = m_membar;
MFENCE();
atomic_word32 cmp2 = m_membar;
if (*x != cmp2) {
goto retry;
}
}

void SMR_MFENCE_LEAVE(atomic_word32 volatile* x) volatile {
// MFENCE();
*x = 0;
}
};


static __attribute__((aligned(L2CACHE))) queue g_queue;
static pthread_barrier_t g_barrier;


extern "C" void*
thread_iterate_naked(
void *state
) {
pthread_barrier_wait(&g_barrier);
for (unsigned i = 0; i < ITER_PER_THREAD; ++i) {
g_queue.iterate();
if (! (i % YIELD)) {
sched_yield();
}
}
return 0;
}


extern "C" void*
thread_iterate_LOCK_RMW(
void *state
) {
pthread_barrier_wait(&g_barrier);
for (unsigned i = 0; i < ITER_PER_THREAD; ++i) {
g_queue.LOCK_RMW();
g_queue.iterate();
if (! (i % YIELD)) {
sched_yield();
}
g_queue.LOCK_RMW();
}
return 0;
}


extern "C" void*
thread_iterate_SMR_MFENCE(
void *state
) {
atomic_word32 volatile x;
pthread_barrier_wait(&g_barrier);
for (unsigned i = 0; i < ITER_PER_THREAD; ++i) {
g_queue.SMR_MFENCE_ENTER(&x);
g_queue.iterate();
if (! (i % YIELD)) {
sched_yield();
}
g_queue.SMR_MFENCE_LEAVE(&x);
}
return 0;
}


static void
run_test(
void* (*fp_entry) (void*)
) {
unsigned i;
pthread_t tid[THREADS];

for (i = 0; i < THREADS; ++i) {
pthread_create(&tid[i], NULL, fp_entry, NULL);
}

for (i = 0; i < THREADS; ++i) {
pthread_join(tid[i], NULL);
}
}


int main(void) {
std::clock_t stime;
long double etime;

pthread_barrier_init(&g_barrier, NULL, THREADS);

for (unsigned r = 1; r < RUNS + 1; ++r) {
std::cout << "EXECUTING RUN " << r << "-of-"
<< RUNS << "..." << std::endl;

stime = std::clock();
run_test(thread_iterate_naked);
etime = ((long double)(std::clock() - stime)) / CLOCKS_PER_SEC;
if (! etime) { etime = 1; }
etime *= 1000;
std::cout << "thread_iterate_naked - "
<< ITER_PER_THREAD / etime << " per-thread-per-ms in "
<< etime << "ms" << std::endl;

stime = std::clock();
run_test(thread_iterate_LOCK_RMW);
etime = ((long double)(std::clock() - stime)) / CLOCKS_PER_SEC;
if (! etime) { etime = 1; }
etime *= 1000;
std::cout << "thread_iterate_LOCK_RMW - "
<< ITER_PER_THREAD / etime << " per-thread-per-ms in "
<< etime << "ms" << std::endl;

stime = std::clock();
run_test(thread_iterate_SMR_MFENCE);
etime = ((long double)(std::clock() - stime)) / CLOCKS_PER_SEC;
if (! etime) { etime = 1; }
etime *= 1000;
std::cout << "thread_iterate_SMR_MFENCE - "
<< ITER_PER_THREAD / etime << " per-thread-per-ms in "
<< etime << "ms" << std::endl;

std::cout << "RUN " << r << "-of-" << RUNS
<< " COMPLETE!\n\n" << std::endl;
}

pthread_barrier_destroy(&g_barrier);

return 0;
}
_______________________________________________________________________


This little program as-is creates a shared linked list of 128 elements with
really crappy cache performance. It has three thread entry points: one that
iterates the list using no LOCK'd and/or MFENCE instructions; this is called
(thread_iterate_naked). Another one (thread_iterate_LOCK_RMW) executes a
LOCK'd RMW before and after an iteration, simulating a contentionless
rw-mutex; I say contentionless because there are no writer threads. Finally,
a thread entry called (thread_iterate_SMR_MFENCE) executes a mock SMR load
with MFENCE before iteration, and does a naked SMR store after. Each of
these threads executes 100,000,000 iterations. I measure how many
iterations are accomplished on a per-thread-per-millisecond basis. Here is
the output I get on my old P4 HyperThreaded 3.06GHz system:


EXECUTING RUN 1-of-2...
thread_iterate_naked - 6504.07 per-thread-per-ms in 15375ms
thread_iterate_LOCK_RMW - 5091.65 per-thread-per-ms in 19640ms
thread_iterate_SMR_MFENCE - 4874.24 per-thread-per-ms in 20516ms
RUN 1-of-2 COMPLETE!


EXECUTING RUN 2-of-2...
thread_iterate_naked - 6464.54 per-thread-per-ms in 15469ms
thread_iterate_LOCK_RMW - 5091.65 per-thread-per-ms in 19640ms
thread_iterate_SMR_MFENCE - 4866.89 per-thread-per-ms in 20547ms
RUN 2-of-2 COMPLETE!


Well, this shows that on my system the naked version performs over 1,400
more iterations per-thread-per-millisecond than the LOCK'd version. It gets
over 1,600 when compared to the MFENCE version. IMVHO, that shows that on my
system, there are significant advantages when you eliminate the need for
expensive instructions such as memory barriers and/or atomic RMW's. If you
want to examine some other tests which show how bad memory barriers really
are, I suggest you check out the following site:


http://atomic-ptr-plus.sourceforge.net


Any thoughts on this?

Chris M. Thomasson

unread,
Oct 26, 2008, 9:32:41 PM10/26/08
to
"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote in
message news:gdmia2$cp$1...@node1.news.atman.pl...

> Chris M. Thomasson wrote:
>
>> RCU gets around having to execute a membar, or locked instruction on x86,
>> for reader threads; this alone has __major__ performance benefits indeed:
>
> But RCU is generally a kernel-mode mechanism and it is quite hard to
> emulate
> it efficiently at user level. That's sad, because the idea is brilliant.

Well, actually you can implement it fairly easily on Windows using the
non-documented Nt* API. Also, you can do it using hacks on other platforms
as well. There is a method that works on all platforms which support binding
thread affinity to CPU's, but I don't really like it that much.


>> Avoiding the membar can result in order's of magnitude improvement in
>> throughput, scalability, and overall performance...
>

> Indeed.

Yup.

Anthony Williams

Oct 27, 2008, 4:26:35 AM
"Chris M. Thomasson" <n...@spam.invalid> writes:

> "Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote in
> message news:gdmia2$cp$1...@node1.news.atman.pl...
>> Chris M. Thomasson wrote:
>>
>>> RCU gets around having to execute a membar, or locked instruction
>>> on x86, for reader threads; this alone has __major__ performance
>>> benefits indeed:
>>
>> But RCU is generally a kernel-mode mechanism and it is quite hard to
>> emulate
>> it efficiently at user level. That's sad, because the idea is brilliant.
>
> Well, actually you can implement it fairly easily on Windows using the
> non-documented Nt* API.

Which particular APIs?

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

EricP

Oct 27, 2008, 1:54:13 PM
Chris M. Thomasson wrote:
>
>> <snip code>

Ok, some comments...

1) If you are on a uniprocessor then there is no reason
to be creating threads as it adds nothing to the test
and increases the noise due to more task switching.

2) Your code does not attempt to filter out things like timer
interrupts and task switches. As these all involve multiple
pipeline flushes and atomics, I don't see how you can make
any association of cause-and-effect to the fence or atomic
under consideration.

3) You shouldn't need 100,000,000 iterations.

4) If I am not mistaken, it looks like you are measuring the
run time before you create the 3 threads, and that those threads
then wait at a barrier. You are essentially charging the
fence or atomic instruction for those tens of thousands
of instructions.

Might I suggest the following, which is what I did to filter
out OS noise activity. As you are only looping chasing 128 pointers
in a linked list, the time for one iteration is small enough
that this would work.

Use a pair of RDTSC to measure the time for one iteration.
Do a bunch (say 100 or so) of "training run" iterations
and remember the lowest cycle count.
Scale that lowest cycle count by some amount between 1.5 and 4
and use that scaled value as a filter on you real test runs.
I had to play with this a while to get the value right
so it doesn't filter out too much.
When the test actually runs, measure the cycle count again
and if it is larger than the scaled limit then toss the measurement
because an interrupt or something happened inside the measurement
window. Otherwise accumulate the cost and count the sample.
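
In rough code, the filter looks something like this (a sketch; __rdtsc() is
the GCC/ICC intrinsic, and one_iteration() stands in for the list walk):


#include <stdint.h>
#include <x86intrin.h> /* __rdtsc() */

extern void one_iteration(void); /* the body under test, e.g. chasing the 128-node list */

double measure_filtered(int training, int runs, double scale) {
    uint64_t best = (uint64_t)-1;
    for (int i = 0; i < training; ++i) {        /* training runs: find the lowest count */
        uint64_t t0 = __rdtsc();
        one_iteration();
        uint64_t dt = __rdtsc() - t0;
        if (dt < best) best = dt;
    }
    uint64_t limit = (uint64_t)(best * scale);  /* scale by ~1.5 to 4 */

    uint64_t total = 0;
    int kept = 0;
    for (int i = 0; i < runs; ++i) {            /* real runs: toss anything over the limit */
        uint64_t t0 = __rdtsc();
        one_iteration();
        uint64_t dt = __rdtsc() - t0;
        if (dt <= limit) { total += dt; ++kept; }
    }
    return kept ? (double)total / kept : 0.0;
}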

And lose the threads.

Eric

Chris M. Thomasson

Oct 27, 2008, 5:51:59 PM

"Anthony Williams" <antho...@gmail.com> wrote in message
news:utzayl...@gmail.com...

> "Chris M. Thomasson" <n...@spam.invalid> writes:
>
>> "Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote in
>> message news:gdmia2$cp$1...@node1.news.atman.pl...
>>> Chris M. Thomasson wrote:
>>>
>>>> RCU gets around having to execute a membar, or locked instruction
>>>> on x86, for reader threads; this alone has __major__ performance
>>>> benefits indeed:
>>>
>>> But RCU is generally a kernel-mode mechanism and it is quite hard to
>>> emulate
>>> it efficiently at user level. That's sad, because the idea is brilliant.
>>
>> Well, actually you can implement it fairly easily on Windows using the
>> non-documented Nt* API.
>
> Which particular APIs?

`NtQuerySystemInformation()' is the only one you need; after that, you need
to use another little trick to mark and detect per-thread non-quiescent
sections. Beware, some of these techniques are patented...

Chris M. Thomasson

unread,
Oct 27, 2008, 6:02:30 PM10/27/08
to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:ge5cmu$ape$1...@aioe.org...

>
> "Anthony Williams" <antho...@gmail.com> wrote in message
> news:utzayl...@gmail.com...
>> "Chris M. Thomasson" <n...@spam.invalid> writes:
>>
>>> "Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote
>>> in
>>> message news:gdmia2$cp$1...@node1.news.atman.pl...
>>>> Chris M. Thomasson wrote:
>>>>
>>>>> RCU gets around having to execute a membar, or locked instruction
>>>>> on x86, for reader threads; this alone has __major__ performance
>>>>> benefits indeed:
>>>>
>>>> But RCU is generally a kernel-mode mechanism and it is quite hard to
>>>> emulate
>>>> it efficiently at user level. That's sad, because the idea is
>>>> brilliant.
>>>
>>> Well, actually you can implement it fairly easily on Windows using the
>>> non-documented Nt* API.
>>
>> Which particular APIs?
>
> `NtQuerySystemInformation()' is the only one you need,

On Vista, one can simply use the `FlushProcessWriteBuffers()' procedure.

Chris M. Thomasson

unread,
Oct 27, 2008, 11:16:11 PM10/27/08
to

"EricP" <ThatWould...@thevillage.com> wrote in message
news:83417$48ff4f8e$45c49ea8$18...@TEKSAVVY.COM...
[...]

Also, check out:

http://atomic-ptr-plus.sourceforge.net

forget about papers, and test things for yourself...

Chris M. Thomasson

unread,
Oct 27, 2008, 11:27:06 PM10/27/08
to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:oMqNk.19049$UD6....@newsfe07.iad...

>
> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> news:ge5cmu$ape$1...@aioe.org...
>>
>> "Anthony Williams" <antho...@gmail.com> wrote in message
>> news:utzayl...@gmail.com...
>>> "Chris M. Thomasson" <n...@spam.invalid> writes:
>>>
>>>> "Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote
>>>> in
>>>> message news:gdmia2$cp$1...@node1.news.atman.pl...
>>>>> Chris M. Thomasson wrote:
>>>>>
>>>>>> RCU gets around having to execute a membar, or locked instruction
>>>>>> on x86, for reader threads; this alone has __major__ performance
>>>>>> benefits indeed:
>>>>>
>>>>> But RCU is generally a kernel-mode mechanism and it is quite hard to
>>>>> emulate
>>>>> it efficiently at user level. That's sad, because the idea is
>>>>> brilliant.
>>>>
>>>> Well, actually you can implement it fairly easily on Windows using the
>>>> non-documented Nt* API.
>>>
>>> Which particular APIs?
>>
>> `NtQuerySystemInformation()' is the only one you need,
>
> On Vista, one can simply use the `FlushProcessWriteBuffers()' procedure.

[...]

Sorry for all the posts, but one more point...


There is a difference between the undocumented and the Vista API approach. The
former is passive synchronization epoch detection while the latter is
active. The latter issues an IPI which covers all processors in the process
affinity; the former simply queries CPU activity. IMVHO,
`FlushProcessWriteBuffers()' has more overhead involved...
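
To make the active flavour concrete, here is roughly what the Vista-style
usage looks like (a simplified sketch only: the node type, the per-thread
flags and the fixed reader count are made up for illustration, and the
reclamation corner cases are glossed over):

/* readers use plain loads/stores; the writer pays for the IPI instead */
#define _WIN32_WINNT 0x0600   /* Vista */
#include <windows.h>
#include <stdlib.h>

typedef struct node { int data; } node;

#define MAX_READERS 64

static node* volatile g_ptr;                  /* shared pointer readers follow */
static volatile LONG g_in_read[MAX_READERS];  /* hypothetical per-thread flags */

/* Reader fast path: no fences, no LOCK'd instructions. */
int reader(int tid) {
    node* n;
    int v;
    g_in_read[tid] = 1;
    n = g_ptr;
    v = n ? n->data : 0;
    g_in_read[tid] = 0;
    return v;
}

/* Writer: publish the new node, force a membar on every CPU in the process
   affinity via the IPI, then wait for in-flight readers to drain. */
void writer_replace(node* new_node, node* old_node, int nreaders) {
    int i;
    g_ptr = new_node;
    FlushProcessWriteBuffers();          /* Vista and later; the "active" step */
    for (i = 0; i < nreaders; ++i)
        while (g_in_read[i])
            SwitchToThread();
    free(old_node);
}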

MitchAlsup

unread,
Oct 28, 2008, 12:53:22 AM10/28/08
to
On Oct 26, 11:41 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> I played with this a bit to test the cost of fences and atomics.
> I have a (sorry Mitch) Intel core duo E8400 3 GHz

No need to be sorry--if you recall, I have not been with AMD for over
1 year.

Mitch

Chris M. Thomasson

unread,
Oct 28, 2008, 5:56:27 AM10/28/08
to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:7NVLk.15783$bK.1...@newsfe04.iad...
[...]

Any thoughts on the SUN backed algorithm? I attempted to dissect it here
after some sherry:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8ad297f61b369a41

then I came up with the naive work-"requesting" algorithm here:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/6019e377e4aa73ec


too much ethanol is NOT GOOD!!!!!! OUCH!


Piotr Wyderski

unread,
Oct 30, 2008, 11:36:02 AM10/30/08
to
Chris M. Thomasson wrote:

> On Vista, one can simply use the `FlushProcessWriteBuffers()' procedure.

Earlier Windows versions have the undocumented ZwFlushWriteBuffer function in
ntdll, which -- I believe -- has simply been made a documented function on
Vista. On the other hand, I didn't know about their existence.

Best regards
Piotr Wyderski

Piotr Wyderski

unread,
Oct 30, 2008, 11:47:45 AM10/30/08
to
Chris M. Thomasson wrote:

> Any thoughts on the SUN backed algorithm?

Yes... I don't like it. Their claims related to its performance are basically
invalid, since they require expensive memory barriers in the fast path (not
shown in their pseudocode, so one must figure it out after careful reading).
So, the algorithm is not as optimistic as they advertise, since it must always
preserve memory consistency across all the participating nodes. In
work-requesting you at least know exactly where those synchronization points
are, so you can issue membars only there, when explicitly asked to do so. On
the other hand, it's quite easy to show how to induce deep load imbalance in
the naive algorithm, as Dimitry did, so it will require a much more elaborate
solution. Anyway, I would bet that a form of work requesting is the correct
answer.

Best regards
Piotr Wyderski

David Schwartz

unread,
Oct 30, 2008, 10:57:27 PM10/30/08
to
On Oct 27, 3:02 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> On Vista, one can simply use the `FlushProcessWriteBuffers()' procedure.

It would be interesting to see what happens to a quad-core Vista box
if you run the following code as a normal user with normal privileges
and normal affinity:
while(1) FlushProcessWriteBuffers();

DS

Chris M. Thomasson

unread,
Oct 31, 2008, 12:11:03 AM10/31/08
to
"David Schwartz" <dav...@webmaster.com> wrote in message
news:b92f7668-7d71-4860...@q30g2000prq.googlegroups.com...

I don't have access to Vista. I used to have the beta, but refused to spend
$$$ on the real thing... Anyway, I don't know if you need to be a root user in
order to make use of it. Well, if it actually executed, and you ran it in a
loop like that... Be sure to have a fire extinguisher handy because the box
will probably ignite and burn your house down.

Chris M. Thomasson

unread,
Oct 31, 2008, 12:12:37 AM10/31/08
to

"Piotr Wyderski" <piotr.w...@mothers.against.spam.gmail.com> wrote in
message news:geceq4$qba$1...@node1.news.atman.pl...

AFAICT, `ZwFlushWriteBuffer()' resolves to a NOP on non-Vista boxes...

Chris M. Thomasson

unread,
Oct 31, 2008, 12:14:14 AM10/31/08
to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:LrvOk.73$TK2...@newsfe01.iad...

This is why I would actually prefer to continue using a counting trick along
with `NtQuerySystemInformation()', even on Vista... It's passive, and does
not issue IPIs.

EricP

unread,
Nov 4, 2008, 3:03:22 PM11/4/08
to
EricP wrote:
> I played with this a bit to test the cost of fences and atomics.
> ...

> The results, measured in RDTSC cycle counts, is that a MFENCE
> costs between 16 to 35 cycles, a LOCK XADD between 20 to 50 cycles.
> This is against a base loop cost of about 400 cycles.

I played with this some more to test the cost of atomic contention.

The cost of an atomic op is 35 to 50 TSC counts.
However if two cpus contend for the same line in an atomic op,
that cost rises to 107 cycles.
I also tested the time to just move a cache line from
cpu1 to cpu2 and that was 50 cycles.

So it would seem that on a core2 duo, most of any
contention cost is moving the cache line.
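
For reference, the shape of the contended measurement is something like this
(a simplified sketch, not the exact harness: the fixed affinity masks, the
warm-up Sleep and the lack of outlier filtering are all arbitrary):

/* two threads on different cores hammer one cache line with LOCK INC */
#include <windows.h>
#include <intrin.h>
#include <stdio.h>

#define OPS 1000000

static __declspec(align(64)) volatile LONG g_shared; /* its own cache line */
static __declspec(align(64)) volatile LONG g_stop;   /* kept off that line */

static DWORD WINAPI contender(LPVOID arg) {
    (void)arg;
    while (!g_stop)
        InterlockedIncrement(&g_shared);   /* keeps the line bouncing */
    return 0;
}

int main(void) {
    HANDLE h;
    unsigned __int64 t0, total = 0;
    long i;

    SetThreadAffinityMask(GetCurrentThread(), 1);       /* pin to cpu 0 */
    h = CreateThread(NULL, 0, contender, NULL, 0, NULL);
    SetThreadAffinityMask(h, 2);                        /* pin to cpu 1 */
    Sleep(100);                       /* let the other core get going */

    for (i = 0; i < OPS; ++i) {
        t0 = __rdtsc();
        InterlockedIncrement(&g_shared);
        total += __rdtsc() - t0;
    }

    g_stop = 1;
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);

    printf("avg cycles per contended LOCK INC: %I64u\n", total / OPS);
    return 0;
}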

However, I don't think these results can be applied to more
distributed architectures like a quad or octo CPU system on HyperTransport.
Because this Core 2 Duo is all on one chip and knows there is only
one peer, it could have fast cache-coherency communication pathways
that would not be available in a more general SMP system.

Eric

David Schwartz

unread,
Nov 4, 2008, 4:58:57 PM11/4/08
to
On Nov 4, 12:03 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:

> The cost of an atomic op is 35 to 50 TSC counts.
> However if two cpus contend for the same line in an atomic op,
> that cost rises to 107 cycles.
> I also tested the time to just move a cache line from
> cpu1 to cpu2 and that was 50 cycles.

Every other source I know of has claimed that the cost of migrating an
L2 cache line on a Core 2 is on the order of 11 to 15 cycles. As I
understand what's involved in an atomic operation on an x86 CPU, it
would have to cost more than an L2 cache miss or migration.

Plus, a significant fraction of atomic operations occur with the cache
line already local to that CPU. Consider, for example, a thread that
allocates ten objects rapidly while the rest of the system is doing
completely different things (threads that belong to other processes
entirely); each of these ten object creations results in locking a
process-local 'malloc' lock, atomically incrementing an object count, and
so on. All these atomic operations are likely to be hot in the L2
cache except the first set.

DS

EricP

unread,
Nov 5, 2008, 3:47:16 PM11/5/08
to
David Schwartz wrote:
> On Nov 4, 12:03 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>
>> The cost of an atomic op is 35 to 50 TSC counts.
>> However if two cpus contend for the same line in an atomic op,
>> that cost rises to 107 cycles.
>> I also tested the time to just move a cache line from
>> cpu1 to cpu2 and that was 50 cycles.
>
> Every other source I know of has claimed that the cost of migrating an
> L2 cache line on a Core 2 is on the order of 11 to 15 cycles. As I
> understand what's involved in an atomic operation on an x86 CPU, it
> would have to cost more than an L2 cache miss or migration.

Hmmm... yes. I was picking up some overhead costs.
I redesigned those tests and the cache line move is 23,
an atomic mixed with other memory ops is 30,
and a collided atomic LOCK INC is 89.

So there appears to be a cost to the collision beyond the
cache line migration cost even on a duo cpu.

Eric

