Intel(R) C++ STM Compiler...

Chris Thomasson

nieprzeczytany,

7 mar 2008, 01:27:567.03.2008

do

Has anybody been playing around with this thing yet? I was wondering if it
uses a lock-free STM implementation...

--
Chris M. Thomasson
http://appcore.home.comcast.net

Joe Seigh

nieprzeczytany,

7 mar 2008, 06:54:367.03.2008

do

Chris Thomasson wrote:
> Has anybody been playing around with this thing yet? I was wondering if
> it uses a lock-free STM implementation...
>

Just obstruction-free AFAICT. I don't see any point in playing around
with it. I also don't see why anyone would expend a significant amount
of resources porting a non-trivial application to it just to see how it
would perform.

What they need is some standardized multi-threading benchmarks to compare
new stuff to. You could then use the results to compare different synchronization
techniques to each other using the standard lock based implementations to
normalize your comparisons.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Chris Thomasson

nieprzeczytany,

7 mar 2008, 15:12:047.03.2008

do

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:02aAj.10127$Ie2.9886@trndny09...

> Chris Thomasson wrote:
>> Has anybody been playing around with this thing yet? I was wondering if
>> it uses a lock-free STM implementation...
>>
>
> Just obstruction-free AFAICT.

I am interested in what you read that leads you to this possible conclusion.
I have read some questions that deal with the "conflict resolution" aspect
of their STM; they don't seem to want to answer them in any detail. Also, I
remember reading a paper on STM from an Intel guy which states that STM
should not be obstruction-free...

> I don't see any point in playing around
> with it. I also don't see why anyone would expend a significant amount
> of resources porting a non-trivial application to it just to see how it
> would perform.

Yeah. Good points especially since the compiler is part of their so-called
"what-if" series.

> What they need is some standardized multi-threading benchmarks to compare
> new stuff to. You could then use the results to compare different
> synchronization
> techniques to each other using the standard lock based implementations to
> normalize your comparisons.

Indeed.

Joe Seigh

nieprzeczytany,

9 mar 2008, 18:52:389.03.2008

do

Chris Thomasson wrote:
>
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:02aAj.10127$Ie2.9886@trndny09...
>> Chris Thomasson wrote:
>>> Has anybody been playing around with this thing yet? I was wondering
>>> if it uses a lock-free STM implementation...
>>>
>>
>> Just obstruction-free AFAICT.
>
> I am interested in what you read that leads you to this possible
> conclusion. I have read some questions that deal with the "conflict
> resolution" aspect of their STM; they don't seem to want to answer them
> in any detail. Also, I remember reading a paper on STM from an Intel guy
> which states that STM should not be obstruction-free...

http://blogs.intel.com/research/2008/02/parallel_programming_with_tran.php
Read their patent applications also. Lots of them.

It will be interesting what they say the challenges of STM are. No one
has explained how they think STM is going to scale. I should do a write
up to explain why STM either will not scale or else it will not be usable
by ordinary programmers. One or the other.

Dmitriy V'jukov

nieprzeczytany,

9 mar 2008, 22:03:059.03.2008

do

On 7 мар, 09:27, "Chris Thomasson" <cris...@comcast.net> wrote:
> Has anybody been playing around with this thing yet? I was wondering if it
> uses a lock-free STM implementation...

It's a bit offtop, but maybe it will be interesting:
http://blogs.sun.com/dave/resource/transact08-dice.pdf

Authors describe HTM as it will be in SUN Rock.
API is only 2 instructions:
// transaction start
chkpt [addr_for_jump_on_failure]
// commit transaction
commit

It will be obstruction-free.
It will be best-effort (transaction can fail w/o any interference).
It doesn't allow function calls inside transaction.
It has limit on number of stores inside transaction.

Authors also provide some benchmark results (simulated). Yes, HTM
doesn't scale well. But STM based on Transactional Locking 2 scales
linearly in their benchmark.

SUN Rock with HTM will be available in 2008!

Dmitriy V'jukov

nieprzeczytany,

9 mar 2008, 22:30:219.03.2008

do

On 7 мар, 14:54, Joe Seigh <jseigh...@xemaps.com> wrote:

> What they need is some standardized multi-threading benchmarks to compare
> new stuff to. You could then use the results to compare different synchronization
> techniques to each other using the standard lock based implementations to
> normalize your comparisons.

Is anybody aware of some multi-threading benchmarks which appropriate
for comparing lock-based/lock-free/transactional memory
synchronization?

It would be interesting to sketch 'mid-level' multi-threading
benchmark. By 'mid-level' I mean some usage patterns for one/several
synchronization primitives (not big full-fledged application, and not
just stress test for single primitive).

For example:
1. Work scheduling.
Benchmark can specify number of threads (one per processor/several per
processor), number of produced work items per consumed work item,
amount of local processing, best-effort fifo or arbitrary order.
2. Global application settings/cache/global object.
One global object (for example, settings object or some cache or
routing table in network router). Threads need read-only access very
frequently. Write access very infrequent. Object is relatively big.
Benchmark can specify number of threads, write access frequncy, amount
of local processing, object size.
3. Session map.
Object maps session id to session object. Threads search for session
object by session id, or insert new session, or delete old session.
Benchmark can specify number of threads, insert/delete frequncy,
amount of local processing, session object size, number of sessions.
4. Memory allocation.
Benchmark can specify number of threads, frequncy of foreign free
operations, amount of local processing.
5. Message-passing/queueing.
Benchmark can specify number of threads, number of queues, number of
producers/consumers per queue, item size, amount of local processing,
required order of processing (fifo/lifo/arbitrary).

Can anybody comment on those items? Propose parameter sets for those
items? Propose new items?

Dmitriy V'jukov

Chris Thomasson

nieprzeczytany,

9 mar 2008, 23:21:469.03.2008

do

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:0b635f6f-ede8-4f6c...@v3g2000hsc.googlegroups.com...

On 7 мар, 09:27, "Chris Thomasson" <cris...@comcast.net> wrote:
> Has anybody been playing around with this thing yet? I was wondering if it
> uses a lock-free STM implementation...

> It's a bit offtop, but maybe it will be interesting:
> http://blogs.sun.com/dave/resource/transact08-dice.pdf

> Authors describe HTM as it will be in SUN Rock.
> API is only 2 instructions:
> // transaction start
> chkpt [addr_for_jump_on_failure]
> // commit transaction
> commit

> It will be obstruction-free.
> It will be best-effort (transaction can fail w/o any interference).
> It doesn't allow function calls inside transaction.
> It has limit on number of stores inside transaction.

> Authors also provide some benchmark results (simulated). Yes, HTM
> doesn't scale well.

indeed.

> But STM based on Transactional Locking 2 scales
> linearly in their benchmark.

Are you referring to the last paper which requires 'schedctl' to cast a
little bit of influence upon the scheduler's behavior? How do they perform
without that functionality... Anyway, I have tried to use STM for basic
things, like a queue/stack/hashmap and compare the lookup (e.g., reader)
performance on a *per-thread level... Sometimes you can get better numbers
from very careful use of locking techniques... Also, H/TM requires two
instructions _per-operation_. As you know, CAS only requires that a thread
make a normal load from a location, while HIM requires a special instruction
to "load a transaction into being" if you willl... For instance, imagine
LL/SC is single-word STM:
_____________________________________________________________
typedef int atomic_word;

static atomic_word g_counter = 0;

atomic_word
counter_xadd(
atomic_word const addend
) {
atomic_word local;
do {
local = LL(&g_counter);
} while (! SC(&g_counter, local + addend));
return local;
}

-vs-

atomic_word
counter_xadd(
atomic_word const addend
) {
atomic_word local;
do {
local = g_counter;
} while (! CAS(&g_counter, local, local + addend));
return local;
}

-or- (using return value from cas)

atomic_word
counter_xadd(
atomic_word const addend
) {
atomic_word cmp, local = g_counter;
do {
cmp = local;
} while ((local = CAS(&g_counter, local, local + addend)) != cmp);
return local;
}
_____________________________________________________________

The CAS can be a winner over LL/SC... IMHO, its more flexible. For instance,
you could predict values in the fast-path of an operation to avoid the load.
Say you used CAS to protect a single-word spinlock, well you could use a
predetermined constant value representing a "locked state" for the compare
operand. In a LL/SC setup, you would have to make a load-linked operation
before you can make use of the sore-conditional (e.g., commit). function.

> SUN Rock with HTM will be available in 2008!

Seems like they choose S/HIM over KCSS. Humm... I say this due to
correspondence from Andy Glew:

http://groups.google.com/group/comp.arch/msg/5d607bc7a2433d2c
(last sentence in post...)

I guess I would only use HIM for writer algorithms; no readers allowed!

:^)

===============
* - measuring on a per-thread level could give you more informative results
like:

Shared-Memory Collection A
- Read-Only Searches Per-Second
__________________________________
Thread 0 = 300,123 ops
Thread 1 = 250,211 ops
Thread 2 = 198,346 ops
Thread 3 = 210,875 ops
Thread 4 = 204,322 ops
Thread 5 = 250,125 ops
Thread 6 = 286,037 ops
Thread 7 = 375,524 ops
__________________________________

The searches would be through shared memory collections. The numbers can
vary with the search algorithm, but IMHO that has to be better reader
performance than any S/HTM can provide... ;^)

Chris Thomasson

nieprzeczytany,

9 mar 2008, 23:29:009.03.2008

do

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:lMCdnTkv-N4mNEna...@comcast.com...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:0b635f6f-ede8-4f6c...@v3g2000hsc.googlegroups.com...
> On 7 мар, 09:27, "Chris Thomasson" <cris...@comcast.net> wrote:
>> Has anybody been playing around with this thing yet? I was wondering if
>> it
>> uses a lock-free STM implementation...

[...]

> ===============
> * - measuring on a per-thread level could give you more informative
> results like:
>
>
> Shared-Memory Collection A
> - Read-Only Searches Per-Second
> __________________________________
> Thread 0 = 300,123 ops
> Thread 1 = 250,211 ops
> Thread 2 = 198,346 ops
> Thread 3 = 210,875 ops
> Thread 4 = 204,322 ops
> Thread 5 = 250,125 ops
> Thread 6 = 286,037 ops
> Thread 7 = 375,524 ops
> __________________________________
>
>
> The searches would be through shared memory collections. The numbers can
> vary with the search algorithm, but IMHO that has to be better reader
> performance than any S/HTM can provide... ;^)

Joe Seigh has posted quite a few testing results with numbers that reflect
information such as the total read/write operations per-thread-per-second...

Dmitriy V'jukov

nieprzeczytany,

9 mar 2008, 23:49:049.03.2008

do

On 10 мар, 06:21, "Chris Thomasson" <cris...@comcast.net> wrote:
> Are you referring to the last paper which requires 'schedctl' to cast a
> little bit of influence upon the scheduler's behavior?

Yes.

> How do they perform without that functionality...

I'm not sure, but maybe it will be faster w/o 'schedctl'. Because
'schedctl' is hit on fast-path. And on slow-path one get only 1
additional trx abort per 1 context switch - not very expensive. I
don't think that 'schedctl' is critical for their STM implementation.

Dmitriy V'jukov

Chris Thomasson

nieprzeczytany,

9 mar 2008, 23:56:139.03.2008

do

"Joe Seigh" <jsei...@xemaps.com> wrote in message

news:WSZAj.4529$Y33.1935@trndny07...

> Chris Thomasson wrote:
>>
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:02aAj.10127$Ie2.9886@trndny09...
>>> Chris Thomasson wrote:
>>>> Has anybody been playing around with this thing yet? I was wondering if
>>>> it uses a lock-free STM implementation...
>>>>
>>>
>>> Just obstruction-free AFAICT.
>>
>> I am interested in what you read that leads you to this possible
>> conclusion. I have read some questions that deal with the "conflict
>> resolution" aspect of their STM; they don't seem to want to answer them
>> in any detail. Also, I remember reading a paper on STM from an Intel guy
>> which states that STM should not be obstruction-free...
>
> http://blogs.intel.com/research/2008/02/parallel_programming_with_tran.php
> Read their patent applications also. Lots of them.
>
> It will be interesting what they say the challenges of STM are. No one
> has explained how they think STM is going to scale.

I want to read more about the caveats of "free composablility":

http://groups.google.com/group/comp.arch/msg/40f378b71e87f8d0

> I should do a write up
> to explain why STM either will not scale

Indeed... ;^)

> or else it will not be usable
> by ordinary programmers. One or the other.

Right. You have to know what your doing inside those atomic-blocks/scopes...
Its not bullet-proof.

Chris Thomasson

nieprzeczytany,

10 mar 2008, 00:00:3510.03.2008

do

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:8cbdd421-2f35-4368...@47g2000hsb.googlegroups.com...

On 10 мар, 06:21, "Chris Thomasson" <cris...@comcast.net> wrote:
> > Are you referring to the last paper which requires 'schedctl' to cast a
> > little bit of influence upon the scheduler's behavior?

> Yes.

> > How do they perform without that functionality...

I am also sure that the atomic blocks are coded as efficiently as
possible...

> I'm not sure, but maybe it will be faster w/o 'schedctl'. Because
> 'schedctl' is hit on fast-path. And on slow-path one get only 1
> additional trx abort per 1 context switch - not very expensive. I
> don't think that 'schedctl' is critical for their STM implementation.

Its not required to run it, but I think they used it for their graphs. I
can't remember right now, but did they show the difference between using,
and not using 'schedctl'?

Dmitriy V'jukov

nieprzeczytany,

10 mar 2008, 08:32:2410.03.2008

do

On 10 мар, 06:56, "Chris Thomasson" <cris...@comcast.net> wrote:

> > or else it will not be usable
> > by ordinary programmers. One or the other.
>
> Right. You have to know what your doing inside those atomic-blocks/scopes...
> Its not bullet-proof.

Take into consideration limitations of first HTM in Sun Rock. No
function calls. No heavy instructions. Limited amount of stores.
Limited amount of loads. Limitations wrt cache geometry. Etc.
And definitely no nested transactions.
As I understand it will be useless in managed languages (Java/C#),
because programmer have no control over what really will be inside
atomic block. Only asm, or handcrafted C/C++.

Sun will use HTM in JVM for sure. It will be interesting to see other
real-life applications of the HTM... if any...

Dmitriy V'jukov

nieprzeczytany,

10 mar 2008, 08:39:1110.03.2008

do

On 10 мар, 07:00, "Chris Thomasson" <cris...@comcast.net> wrote:

> > > Are you referring to the last paper which requires 'schedctl' to cast a
> > > little bit of influence upon the scheduler's behavior?
> > Yes.
> > > How do they perform without that functionality...
>
> I am also sure that the atomic blocks are coded as efficiently as
> possible...

For sure.
They investigate and tune resulting asm code inside transactions.

> > I'm not sure, but maybe it will be faster w/o 'schedctl'. Because
> > 'schedctl' is hit on fast-path. And on slow-path one get only 1
> > additional trx abort per 1 context switch - not very expensive. I
> > don't think that 'schedctl' is critical for their STM implementation.
>
> Its not required to run it, but I think they used it for their graphs. I
> can't remember right now, but did they show the difference between using,
> and not using 'schedctl'?

No, they didn't. I believe that even with 'schedctl' results are a bit
"far-fetched" :)

Dmitriy V'jukov

nieprzeczytany,

10 mar 2008, 08:49:3110.03.2008

do

On 10 мар, 05:03, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> It's a bit offtop, but maybe it will be interesting:http://blogs.sun.com/dave/resource/transact08-dice.pdf
>
> Authors describe HTM as it will be in SUN Rock.
> API is only 2 instructions:
> // transaction start
> chkpt [addr_for_jump_on_failure]
> // commit transaction
> commit

I didn't understand how they obtain performance/scaling improvement
with 'Lock elision' technique.
It turns out that it's cheaper to execute critical section in
successful transaction than execute mutex lock/unlock.
Can someone comment on this?

Dmitriy V'jukov

Chris Thomasson

nieprzeczytany,

10 mar 2008, 08:52:3810.03.2008

do

"Joe Seigh" <jsei...@xemaps.com> wrote in message

news:WSZAj.4529$Y33.1935@trndny07...

> Chris Thomasson wrote:
>>
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:02aAj.10127$Ie2.9886@trndny09...
>>> Chris Thomasson wrote:
>>>> Has anybody been playing around with this thing yet? I was wondering if
>>>> it uses a lock-free STM implementation...
>>>>
>>>
>>> Just obstruction-free AFAICT.

[...]

> It will be interesting what they say the challenges of STM are. No one
> has explained how they think STM is going to scale. I should do a write
> up to explain why STM either will not scale or else it will not be usable
> by ordinary programmers. One or the other.

check this out:

http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30248420.aspx

Apparently, you can live-lock their system with:
_______________________________________
static int g_counter = 0;

threadA() {
atomic {
for (;;) {}
}
}

threadB() {
for (;;) {
atomic {
++g_counter;
}
}
}
_______________________________________

The answer from Intel is sort of funny... Humm, let me paraphrase for a
moment:

If you think STM is analogous to a giant global mutex, well, your correct...

Ouch!

Chris Thomasson

nieprzeczytany,

10 mar 2008, 08:55:3610.03.2008

do

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:92f4f26a-a6a5-400a...@m3g2000hsc.googlegroups.com...

Have you read this:

http://blogs.sun.com/dave/resource/transact08-dice.pdf

?

Chris Thomasson

nieprzeczytany,

10 mar 2008, 09:05:3010.03.2008

do

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:92f4f26a-a6a5-400a...@m3g2000hsc.googlegroups.com...

AFAICT, the critical-sections are subjected to the caveats associated with
the HTM instructions. Therefore, a mutex cannot use the 'lock elision'
technique if any of its critical-sections contain function calls, or has too
many stores, ect... This is because the method relies on speculative
transactional execution of the guarded code.

Dmitriy V'jukov

nieprzeczytany,

10 mar 2008, 09:53:4010.03.2008

do

On 10 мар, 15:55, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
>
> news:92f4f26a-a6a5-400a...@m3g2000hsc.googlegroups.com...

>
> > On 10 ÍÁÒ, 05:03, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > > It's a bit offtop, but maybe it will be
> > > interesting:http://blogs.sun.com/dave/resource/transact08-dice.pdf
>
> > > Authors describe HTM as it will be in SUN Rock.
> > > API is only 2 instructions:
> > > // transaction start
> > > chkpt [addr_for_jump_on_failure]
> > > // commit transaction
> > > commit
> > I didn't understand how they obtain performance/scaling improvement
> > with 'Lock elision' technique.
> > It turns out that it's cheaper to execute critical section in
> > successful transaction than execute mutex lock/unlock.
> > Can someone comment on this?
>
> Have you read this:
>
> http://blogs.sun.com/dave/resource/transact08-dice.pdf
>
> ?

Yes. But I am not sure from what exactly they get performance gain.
Critical section must contain some form of global synchronization.
Anyway. Even with lock elision.

Dmitriy V'jukov

Joe Seigh

nieprzeczytany,

10 mar 2008, 20:00:4210.03.2008

do

That would seem to confirm that their STM implementations is only
obstruction-free if indeed it's live locking.

Dmitriy V'jukov

nieprzeczytany,

11 mar 2008, 07:32:1311.03.2008

do

On Mar 11, 3:00 am, Joe Seigh <jseigh...@xemaps.com> wrote:

> > Apparently, you can live-lock their system with:
> > _______________________________________
> > static int g_counter = 0;
>
> > threadA() {
> > atomic {
> > for (;;) {}
> > }
> > }
>
> > threadB() {
> > for (;;) {
> > atomic {
> > ++g_counter;
> > }
> > }
> > }
> > _______________________________________
>
> > The answer from Intel is sort of funny... Humm, let me paraphrase for a
> > moment:
>
> > If you think STM is analogous to a giant global mutex, well, your
> > correct...
>
> > Ouch!
>
> That would seem to confirm that their STM implementations is only
> obstruction-free if indeed it's live locking.

What will be if I suspend thread A in random point?
If forward progress for thread B will be guaranteed, then STM is
obstruction-free.
If no, then STM is even weaker than obstruction-free. Then I think
it's effectively lock-based. Hmmm... Then it's actually *not* STM at
all...

Dmitriy V'jukov