
Atomically acquiring 32/64 spin-locks in single operation...


Chris Thomasson

Dec 19, 2007, 5:01:00 PM
This is a dual-hashed bitlock algorithm that can be used for things like
STM. It works by hashing the pointer address to an index into an array of
lock-words, _and_ a corresponding bit. Then, it uses that bit as the
lock-state for the selected lock-word. For instance:


ObjectA Hash Results
-----------------------
Lock-Word Index: 5
Lock-Bit Number: 24 - (0x01000000)


ObjectB Hash Results
-----------------------
Lock-Word Index: 7
Lock-Bit Number: 4 - (0x00000010)


ObjectC Hash Results
-----------------------
Lock-Word Index: 5
Lock-Bit Number: 13 - (0x00002000)


etc...


You can take all the locks by using CAS to compare the lock-word with 0, and
swap in a maximum value.

Here is the pseudo-code draft:
__________________________________________________________________
#include <limits.h>
#include <stddef.h>
#include <assert.h>

typedef unsigned int atomicuword;
#define ATOMICUWORD_MAX() UINT_MAX

typedef struct bitmtx_s bitmtx;

#define BITMTXSYS_DEPTH() 32
#define BITMTXSYS_TBLDEPTH() UCHAR_MAX

#define BITMTXSYS_TBLMAXDEPTH() ( \
BITMTXSYS_TBLDEPTH() + 1 \
)

/* hash a pointer to an index into the lock-word table */
#define BITMTXSYS_HASHPTR(mp_this) (((unsigned char) \
((((atomicuword)(mp_this)) * 307) >> 8) \
))

/* hash a pointer to a single lock-bit mask within the selected word */
#define BITMTXSYS_HASHBIT(mp_this) (((atomicuword) \
(((atomicuword)1) << ((BITMTXSYS_HASHPTR(mp_this) % BITMTXSYS_DEPTH()) + 1)) \
))


struct bitmtx_s {
atomicuword locks[BITMTXSYS_TBLMAXDEPTH()];
};


#define BITMTX_STATICINIT() { {0} }


/* acquire the ptr bit lock */
static void bitmtx_lock(
bitmtx* const _this,
void* const ptr
) {
atomicuword const bit = BITMTXSYS_HASHBIT(ptr);
atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
assert(bit);
while (ATOMIC_BTS(mtx, bit)) {
/* exponential backoff */
}
}

/* acquire the ptr lock */
static void bitmtx_lockblock(
bitmtx* const _this,
void* const ptr
) {
atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
while (! ATOMIC_CAS(mtx, 0, ATOMICUWORD_MAX())) {
/* exponential backoff */
}
}


/* release the ptr bit lock */
static void bitmtx_unlock(
bitmtx* const _this,
void* const ptr
) {
atomicuword const bit = BITMTXSYS_HASHBIT(ptr);
atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
assert(bit);
ATOMIC_BC(mtx, bit);
}

/* release the ptr lock */
static void bitmtx_unlockblock(
bitmtx* const _this,
void* const ptr
) {
atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
ATOMIC_STORE(mtx, 0);
}
__________________________________________________________________

Any thoughts on this approach?

--
Chris M. Thomasson
http://appcore.home.comcast.net

Chris Thomasson

Dec 20, 2007, 1:32:20 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:zNOdnUeVP51XD_Ta...@comcast.com...

> This is a dual-hashed bitlock algorithm that can be used for things like
> STM. It works by hashing the pointer address to an index into an array of
> lock-words, _and_ a corresponding bit. Then, it uses that bit as the
> lock-state for the selected lock-word. For instance:
[...]

> typedef struct bitmtx_s bitmtx;
>
> #define BITMTXSYS_DEPTH() 32

[...]

I think that the BITMTXSYS_DEPTH macro should be defined as:

#define BITMTXSYS_DEPTH() 31

in order to avoid tripping the assertions that read as:

assert(bit);

in the original posted pseudo-code sketch.

Dmitriy Vyukov

Dec 20, 2007, 3:49:13 AM
On Dec 20, 1:01 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> This is a dual-hashed bitlock algorithm that can be used for things like
> STM. It works by hashing the pointer address to an index into an array of
> lock-words, _and_ a corresponding bit. Then, it uses that bit as the
> lock-state for the selected lock-word. For instance:
> Any thoughts on this approach?


This is definitely an interesting approach. But can you think of any situations where such an approach would be better than a traditional hash-based lock table with one lock per cache line? You mentioned STM - can you describe in more detail what you mean? I just can't think of any such situations.

This approach makes it cheaper to acquire a batch of mutexes. But at the same time it makes it noticeably more expensive to acquire a single mutex, because of false contention.

This approach can be used on one processor. For example, if I have several threads bound to a single processor/core, then CAS can be implemented as cmpxchg w/o the lock prefix, and contention is free in this case. But I doubt that this is a widespread case.

BTW, did you see this:
http://www.accu.org/index.php/journals/1348
The author uses a similar approach, to some extent. He uses a single word for the mutex, and each bit in the word represents a separate thread. But his solution can only be used when all threads are bound to a single processor/core.


Dmitriy V'jukov

Chris Thomasson

Dec 20, 2007, 4:40:24 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:6315aa76-4288-4365...@d4g2000prg.googlegroups.com...

> On Dec 20, 1:01 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> This is a dual-hashed bitlock algorithm that can be used for things like
>> STM. It works by hashing the pointer address to an index into an array of
>> lock-words, _and_ a corresponding bit. Then, it uses that bit as the
>> lock-state for the selected lock-word. For instance:
>> Any thoughts on this approach?
>
>
> Definitely this is interesting approach. But can you think out any
> situations where such approach will be better than traditional hash
> based lock table with lock per cache line? You have mentioned STM -
> can you describe in more detail what you mean.
> I just can't think out any such situations.

Lock-based STM can benefit from taking multiple locks in a single interlocked operation because it reduces the overhead of the overall try-lock-and-backoff algorithm. A transaction descriptor can contain a hash-list of read- and write-access descriptors, each of which has a pointer to the transactional object it wishes to access. Any collisions will be queued up in the bucket. When the transaction descriptor gets processed by the commit function, it can start the locking procedure. If it hits a bucket in the hash-list that has more than one item within it, it only needs a single lock operation for that whole bucket before moving on to the next one. It does not need to lock each item in there, because they all hash to the same lock-word, and one CAS can be used to acquire all the lock-bits within it.

IMO, this would start to kick in during periods of load on the STM. It could potentially help reduce some of the live-lock.


> This approach makes it cheaper to acquire batch of mutexes. But at the
> same time it makes it noticeably more expensive to acquire single
> mutex because of false contention.

Well, just because objects can map to the exact same lock-word doesn't necessarily mean that they will expose themselves to false contention. ObjectA and ObjectB can both map to lock-wordA, _however_, ObjectA could map to lock-bit2, and ObjectB to lock-bit18. So, the ATOMIC_BTS function can hit a fast-path on both objects concurrently. In fact, ATOMIC_BTS can hit a fast-path on 32/64 objects concurrently.

Think of a 64-bit system where Threads1-64 attempt to lock Objects1-64 respectively, all of which happen to map to lock-wordA... However, Objects1-64 map to bits 1-64 respectively. Well, that means that Threads1-64 will all execute ATOMIC_BTS and hit the fast-path.

Does that make sense?


> This approach can be used on one processor. For example if I have
> several threads bound to single processor/core. So CAS must be
> implemented as cmpxchg w/o lock prefix. And contention is for free in
> this case. But I doubt that this is widespread case.
>
> BTW do you see this:
> http://www.accu.org/index.php/journals/1348
> Author use similar to some extent approach. He use single word for
> mutex and each bit in the word represents separate thread. But his
> solution can be used only when all threads are bound to single
> processor/core.

Humm, that scheme seems to limit the number of threads that can use the word
at any one time. I need to read this, thanks for pointing it out.

Chris Thomasson

Dec 20, 2007, 4:46:27 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:6315aa76-4288-4365...@d4g2000prg.googlegroups.com...
> On Dec 20, 1:01 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> This is a dual-hashed bitlock algorithm that can be used for things like
>> STM. It works by hashing the pointer address to an index into an array of
>> lock-words, _and_ a corresponding bit. Then, it uses that bit as the
>> lock-state for the selected lock-word. For instance:
>> Any thoughts on this approach?
>
>
> Definitely this is interesting approach. But can you think out any
> situations where such approach will be better than traditional hash
> based lock table with lock per cache line? You have mentioned STM -
> can you describe in more detail what you mean.
> I just can't think out any such situations.
>
> This approach makes it cheaper to acquire batch of mutexes. But at the
> same time it makes it noticeably more expensive to acquire single
> mutex because of false contention.

Ahh, are you talking about false-sharing on the cache-line?

Chris Thomasson

Dec 20, 2007, 9:22:20 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:zNOdnUeVP51XD_Ta...@comcast.com...
> This is a dual-hashed bitlock algorithm that can be used for things like
> STM. It works by hashing the pointer address to an index into an array of
> lock-words, _and_ a corresponding bit. Then, it uses that bit as the
> lock-state for the selected lock-word. For instance:
[...]

> You can take all the locks by using CAS to compare the lock-word with 0,
> and swap in a maximum value.
[...]

Heck, with a 64-bit system and DWCAS, you can take 128 locks in an instant.

Dmitriy Vyukov

Dec 21, 2007, 8:31:08 AM
On Dec 21, 12:40 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> Lock-Based STM can benefit from taking multiple locks in single interlocked
> operation because it reduces the overhead of the overall
> try-lock-and-backoff algorithm. A transaction descriptor can contain a
> hash-list of read and write accesses descriptors which each have a pointer
> to a transactional object that they wish to access. Any collisions will be
> queued up in the bucket. When the transaction descriptor gets processed by
> the commit function, it can start the locking procedure. If it hits a bucket
> in the hash-list that has more than one item within it, that means that it
> only has to lock this one item, and move on to another bucket. It does not
> need to lock each item in there because they all hash to the same lock-word,
> and CAS can be used to acquire all the lock-bits within it.


A pointer to what, exactly, do you want to pass to BITMTXSYS_HASHPTR() and BITMTXSYS_HASHBIT()?
To the transaction descriptor? To the read/write access descriptor? To the target memory location?


Dmitriy V'jukov

Dmitriy Vyukov

Dec 21, 2007, 8:31:34 AM
On Dec 21, 12:46 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message


Exactly


Dmitriy V'jukov

Dmitriy Vyukov

Dec 21, 2007, 8:32:38 AM
On Dec 21, 5:22 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Chris Thomasson" <cris...@comcast.net> wrote in message


Anyway, you have to cope with the boundary conditions...
This is just an incremental improvement...


Dmitriy V'jukov

Dmitriy Vyukov

Dec 21, 2007, 8:35:38 AM
On Dec 21, 12:40 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> > BTW do you see this:
> >http://www.accu.org/index.php/journals/1348
> > Author use similar to some extent approach. He use single word for
> > mutex and each bit in the word represents separate thread. But his
> > solution can be used only when all threads are bound to single
> > processor/core.
>
> Humm, that scheme seems to limit the number of threads that can use the word
> at any one time. I need to read this, thanks for pointing it out.


Yes.
I have had some private communication with George Shagov - this is a very
particular solution.


Dmitriy V'jukov

Chris Thomasson

Dec 27, 2007, 7:59:35 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:a5f9d802-6982-4661...@e10g2000prf.googlegroups.com...

I will answer this in the next couple of days, either with a post or with STM
source code... I am not sure yet. I generally want to point
BITMTXSYS_HASHPTR _and_ BITMTXSYS_HASHBIT at the various target memory
locations within the transaction descriptor.

Chris Thomasson

Dec 31, 2007, 5:41:24 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:sOCdnTNxgsAQ1ena...@comcast.com...

You can point the entire transaction descriptor to a lock-word, then hash
the individual reads and writes into lock-bits within the word. To lock an
entire transaction descriptor you take the lock-word; to lock individual
reads and writes within the transaction, you take the appropriate lock-bits.

That's one way. The other is to apply the lock-word _and_ lock-bit hash to
the transaction's individual reads and writes. That way, you get better
granularity. However, it is more difficult to lock the entire transaction. You
can't use a single CAS because a single descriptor could map to more than one
lock-word.

I like the first method better for some reason...

What do you think?

Dmitriy Vyukov

Jan 1, 2008, 7:19:42 AM
On 1 Jan, 01:41, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Chris Thomasson" <cris...@comcast.net> wrote in message
>
> news:sOCdnTNxgsAQ1ena...@comcast.com...
>
>
>
> > "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message


Doesn't your STM algorithm require that a single memory location map
to a single lock?
With the first method, a single memory location involved in several
transactions will map to several different locks simultaneously...


Dmitriy V'jukov

Chris Thomasson

Jan 1, 2008, 7:57:29 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:fdd56fb0-e403-4610...@i7g2000prf.googlegroups.com...

On 1 Jan, 01:41, "Chris Thomasson" <cris...@comcast.net> wrote:
[...]

> > You can point the entire transaction descriptor to a lock-word, then
> > hash
> > the individual reads and writes in to lock-bits within the word. To lock
> > an
> > entire transaction descriptor you take the lock-word, to lock individual
> > reads and writes within the transaction, you take the appropriate
> > lock-bits.
> >
> > That's one way. The other is to apply the lock-word _and_ lock-bit hash
> > to
> > the transactions individual reads and writes. That way, you get better
> > granularity. However, it more difficult to lock the entire transaction.
> > You
> > can't use single CAS because a single descriptor could map to more than
> > one
> > lock-word.
> >
> > I like the first method better for some reason...
> >
> > What do you think?


> Don't your STM algorithm require that single memory location must map
> to single lock?
> With first method single memory location involved in several
> transactions will map to several different locks simultaneously...

[...]

Forget the first method; it's totally useless crap! WTF was I thinking? Oh
yeah, it was New Year's Eve; I must have been impaired.

Damn it!

Dmitriy Vyukov

Jan 2, 2008, 2:51:48 AM
On 2 Jan, 03:57, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
>
> news:fdd56fb0-e403-4610...@i7g2000prf.googlegroups.com...


With the second method, how will you achieve that at least some reasonable
number of the reads and writes in one transaction map to a single lock
word? You will have 100-1000 lock words and 1-10 reads and writes, so
initially every read or write will map to a different lock word. And all
you'll get is a shitload of false sharing...

Think of the following example: a payment system with 1000000 accounts,
where every transaction has to lock 2 "random" accounts.


Dmitriy V'jukov

Chris Thomasson

Jan 2, 2008, 11:29:58 PM

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:bbc8ffcb-cc42-44cf...@s19g2000prg.googlegroups.com...

You have to measure the false-sharing overhead vs. the ability to aid the STM
locking scheme in general wrt taking multiple locks in a single atomic op.
This is the major problem.

Chris Thomasson

Jan 2, 2008, 11:34:09 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:6f4af039-499e-468d...@y5g2000hsf.googlegroups.com...

On 2 Jan, 03:57, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
>
> news:fdd56fb0-e403-4610...@i7g2000prf.googlegroups.com...
> On 1 Jan, 01:41, "Chris Thomasson" <cris...@comcast.net> wrote:
> [...]
[...]

> With second method how you will achieve that al least some reasonable
> amount of reads and writes in one transaction will map to single lock
> word?

You can use a more advanced hashing/sorting algorithm. In fact, a big part of
my experimental in-house transaction system is based on coalescing clever
hashing with versioning schemes.

> You will have 100-1000 of lock words and 1-10 of reads and
> writes. So initially every read or write will map to different lock
> word. And all you'll get is only shitload of false sharing...

> Think of the following example. Payment system. 1000000 accounts. In
> every transaction we have to lock 2 "random" accounts.

Shitload of false-sharing on lock words indeed.

Chris Thomasson

Jan 3, 2008, 3:52:39 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:57d1085c-f21c-49c8...@s19g2000prg.googlegroups.com...

Yup.

Dmitriy Vyukov

Jan 3, 2008, 4:02:25 AM
On 3 Jan, 07:34, "Chris Thomasson" <cris...@comcast.net> wrote:

> > With second method how you will achieve that al least some reasonable
> > amount of reads and writes in one transaction will map to single lock
> > word?
>
> You can use more advanced hashing/sorting algorithm. In fact, a big part of
> my experimental in-house transaction system is based on coalescing clever
> hashing with versioning schemes.


I can't imagine appropriate hashing for this situation:

> > Think of the following example. Payment system. 1000000 accounts. In
> > every transaction we have to lock 2 "random" accounts.

The hashing would have to be very clever...

Maybe some kind of temporal binding of memory locations to lock-words
could be used here. I.e., if we need this memory location to be bound to
this lock word, we can just temporarily bind this memory location to this
lock word - if it is not bound to another lock word yet...


Dmitriy V'jukov

Dmitriy Vyukov

Jan 3, 2008, 4:04:11 AM
On 3 Jan, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:

> >> > This approach makes it cheaper to acquire batch of mutexes. But at the
> >> > same time it makes it noticeably more expensive to acquire single
> >> > mutex because of false contention.
>
> >> Ahh, are you talking false-sharing on the cache-line?
>
> > Exactly
>
> You have to measure false-sharing overhead vs. ability to aid STM locking
> scheme in general wrt taking multiple locks in single atomic op. This is
> major problem.


But I don't yet see how word locking can be useful to STM...


Dmitriy V'jukov

Chris Thomasson

Jan 3, 2008, 4:16:55 AM
> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:d27b4555-7b6f-4aa0...@n20g2000hsh.googlegroups.com...

Hint... there are multiple versions of objects... Imagine working versioning
and hashing together into something useful... I cannot give out source code on
this.

;^(...

Chris Thomasson

Jan 3, 2008, 4:29:26 AM
>"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
>news:40f9cb9f-5aa2-4f83...@s19g2000prg.googlegroups.com...

It can only be useful within the following context:

http://groups.google.com/group/comp.programming.threads/msg/93cafcdf17851b3f

You can detect lock-word collisions and just take the whole word instead of
each individual lock-bit within it.

Does that make any sense at all?

Dmitriy Vyukov

Jan 3, 2008, 4:25:23 AM
On 3 Jan, 12:16, "Chris Thomasson" <cris...@comcast.net> wrote:
> > "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
> >news:d27b4555-7b6f-4aa0...@n20g2000hsh.googlegroups.com...

> > On 3 Jan, 07:34, "Chris Thomasson" <cris...@comcast.net> wrote:
> > > > With second method how you will achieve that al least some reasonable
> > > > amount of reads and writes in one transaction will map to single lock
> > > > word?
>
> > > You can use more advanced hashing/sorting algorithm. In fact, a big part
> > > of
> > > my experimental in-house transaction system is based on coalescing
> > > clever
> > > hashing with versioning schemes.
> > I can't imagine appropriate hashing for this situation:
> > > > Think of the following example. Payment system. 1000000 accounts. In
> > > > every transaction we have to lock 2 "random" accounts.
> > Hashing must be very clever...
> > Maybe here can be used some kind of temporal binding of memory
> > locations to lock-words. I.e. if we need this memory location to be
> > bound to this lock word, we can just temporary bind this memory
> > location to this lock word. If it is not bound to another lock word
> > yet...
>
> Hint... there is multiple versions of objects... Imagine working versioning
> and hashing together into something useful... I cannot give source code on
> this.


Ok. I will proceed from the assumption that there *is* something
useful and think more...
;)

Dmitriy V'jukov

Chris Thomasson

Jan 3, 2008, 4:39:24 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:288e945b-a59a-4f25...@v4g2000hsf.googlegroups.com...
[...]

> Ok. I will proceed from the assumption that there *is* something
> useful and think more...
> ;)

BTW, an official fix for AppCore:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/cf4a952ff5af7d

is going to be announced here within the month!

:^)

Chris Thomasson

Jan 3, 2008, 4:55:17 AM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:_ZOdnaN0zq_QNuHa...@comcast.com...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:288e945b-a59a-4f25...@v4g2000hsf.googlegroups.com...
> [...]

>
>> Ok. I will proceed from the assumption that there *is* something
>> useful and think more...
>> ;)

Humm... Perhaps you should not use too much of your valuable brain power on
this... After all, it is going to be a shi%%y STM in the long run.

In the sense that _any_ STM == crap STM...

That's a problem which still exists indeed.

;^(...

Dmitriy Vyukov

Jan 9, 2008, 8:43:38 AM
On Dec 20 2007, 1:01 am, "Chris Thomasson" <cris...@comcast.net>
wrote:


> /* acquire the ptr lock */
> static void bitmtx_lockblock(
> bitmtx* const _this,
> void* const ptr
> ) {
> atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
> while (! ATOMIC_CAS(mtx, 0, ATOMICUWORD_MAX())) {
> /* exponential backoff */
> }


What if some bits are always locked? There may never be a time when
all bits are 0...
You can do something like this:

/* acquire the ptr lock */
static void bitmtx_lockblock(
bitmtx* const _this,
void* const ptr
) {
atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];

for (int bit = 0; bit != 32; ++bit) {
while (ATOMIC_BTS(mtx, bit)) {
/* exponential backoff */
}
}
}


... this solves the problem... but it destroys all the advantages of
"bulk-acquire"...


Dmitriy V'jukov

Dmitriy Vyukov

Jan 9, 2008, 8:46:18 AM
On Jan 3, 12:29 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> >"Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
> >news:40f9cb9f-5aa2-4f83...@s19g2000prg.googlegroups.com...

> >On 3 Jan, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> >> >> > This approach makes it cheaper to acquire batch of mutexes. But at
> >> >> > the
> >> >> > same time it makes it noticeably more expensive to acquire single
> >> >> > mutex because of false contention.
>
> >> >> Ahh, are you talking false-sharing on the cache-line?
>
> >> > Exactly
>
> >> You have to measure false-sharing overhead vs. ability to aid STM locking
> >> scheme in general wrt taking multiple locks in single atomic op. This is
> >> major problem.
> >But I don't see yet how word locking can be useful to STM...
>
> It can only be useful within the following context:
>
> http://groups.google.com/group/comp.programming.threads/msg/93cafcdf1...

>
> You can detect lock-word collisions and just take the whole word instead of
> each individual lock-bit within it.
>
> Does that make any sense at all?


No. I don't understand how you can gather several separate locks into
one word... I don't understand yet ;)


Dmitriy V'jukov

Dmitriy Vyukov

Jan 9, 2008, 8:51:27 AM
On Jan 3, 12:55 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> >> Ok. I will proceed from the assumption that there *is* something
> >> useful and think more...
> >> ;)
>
> Humm... Perhaps you should not use too much of your valuable brain power on
> this... After all, is going to be a shi%%y STM in the long run.
>
> In the sense that _any_ STM == crap STM...
>
> That's a problem which still exists indeed.
>
> ;^(...


Yeah. Ok. Now I'm thinking about the fastest-ever memory allocator, which
will completely beat your distributed lock-free slab allocator.
Completely. Forever.
;)


Dmitriy V'jukov

Dmitriy Vyukov

Jan 9, 2008, 11:55:12 AM


No atomics and no store-load sh#t at all! Just a few asm instructions per
alloc/free.
Anticipating questions - yes, it's multithreaded/multicore-aware, not
single-threaded :)


Dmitriy V'jukov

Chris Thomasson

Jan 9, 2008, 3:58:42 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1441b8df-ecc5-4434...@e23g2000prf.googlegroups.com...

My allocator can get single-thread performance on _local_ allocations and
deallocations. How can you beat that? Are you getting single-thread
performance on _remote_ deallocations?

Chris Thomasson

Jan 9, 2008, 4:06:12 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:3a25498d-fb2b-4646...@n20g2000hsh.googlegroups.com...

> On Dec 20 2007, 1:01 am, "Chris Thomasson" <cris...@comcast.net>
> wrote:
>
>
>> /* acquire the ptr lock */
>> static void bitmtx_lockblock(
>> bitmtx* const _this,
>> void* const ptr
>> ) {
>> atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
>> while (! ATOMIC_CAS(mtx, 0, ATOMICUWORD_MAX())) {
>> /* exponential backoff */
>> }
>
>
> What if there is always some bits locked?

Are you talking about a scenario in which a thread forgets to unlock a
previously acquired bit?

> There can not be such time
> when all bits are 0...

That state represents an unlocked lock-word. Why can't there be intermittent
periods where there are no locks being held?

[...]

Chris Thomasson

Jan 9, 2008, 4:23:00 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:cdfef723-6c60-4d66...@s8g2000prg.googlegroups.com...

On Jan 3, 12:29 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> >"Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
> >news:40f9cb9f-5aa2-4f83...@s19g2000prg.googlegroups.com...
> >On 3 Jan, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> >> >> > This approach makes it cheaper to acquire batch of mutexes. But at
> >> >> > the
> >> >> > same time it makes it noticeably more expensive to acquire single
> >> >> > mutex because of false contention.
>
> >> >> Ahh, are you talking false-sharing on the cache-line?
>
> >> > Exactly
>
> >> You have to measure false-sharing overhead vs. ability to aid STM
> >> locking
> >> scheme in general wrt taking multiple locks in single atomic op. This
> >> is
> >> major problem.
> >But I don't see yet how word locking can be useful to STM...
>
> It can only be useful within the following context:
>
> http://groups.google.com/group/comp.programming.threads/msg/93cafcdf1...
>
> You can detect lock-word collisions and just take the whole word instead
> of
> each individual lock-bit within it.
>
> Does that make any sense at all?


> No. I don't understand how can you gather several separate locks into
> one word... don't understand yet ;)

Lets say a transaction descriptor was something like this:

static bitmtx g_mtxtbl = BITMTX_STATICINIT();

typedef struct tdobj_s {
int ver;
} tdobj;

typedef struct tdact_s {
struct tdact_s* nx;
tdobj* obj;
int ver;
} tdact;

typedef struct tdbucket_s {
tdact* acts;
} tdbucket;

typedef struct td_s {
tdbucket buckets[BITMTXSYS_TBLMAXDEPTH()]; /* = 0 */
} td;

A thread associates 'tdact' structs with a 'tdobj' structs, and sticks them
into its 'td'. Something like:


void td_push(td* const _this, tdact* act) {
tdbucket* const bucket
= &_this->buckets[BITMTXSYS_HASHPTR(act->obj)];
act->nx = bucket->acts;
bucket->acts = act;
}


A "portion" of the locking process can look like:

void td_lock(td* const _this) {
size_t i;
for (i = 0; i < BITMTXSYS_TBLMAXDEPTH(); ++i) {
tdact* const act = _this->buckets[i].acts;
if (act) {
if (act->nx) {
/* take all locks in bucket */
bitmtx_lockblock(&g_mtxtbl, act->obj);
} else {
bitmtx_lock(&g_mtxtbl, act->obj);
}
}
}
}

Of course there is some back-off and retry logic in the actual
implementation.

Dmitriy Vyukov

Jan 10, 2008, 4:59:27 AM
On Jan 9, 11:58 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message


Close to. At least there are no atomic RMW operations or store-load
memory barriers. But, of course, there are some writes to remote
memory, i.e. cache line transfers.

I already have a working prototype. I need to investigate some low-level
behaviors of my hardware a bit more.


Dmitriy V'jukov

Dmitriy Vyukov

Jan 10, 2008, 5:15:16 AM
On Jan 10, 12:06 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

>
> news:3a25498d-fb2b-4646...@n20g2000hsh.googlegroups.com...
>
>
>
> > On Dec 20 2007, 1:01 am, "Chris Thomasson" <cris...@comcast.net>
> > wrote:
>
> >> /* acquire the ptr lock */
> >> static void bitmtx_lockblock(
> >> bitmtx* const _this,
> >> void* const ptr
> >> ) {
> >> atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
> >> while (! ATOMIC_CAS(mtx, 0, ATOMICUWORD_MAX())) {
> >> /* exponential backoff */
> >> }
>
> > What if there is always some bits locked?
>
> Are you talking about a senerio in which a thread forgets to unlock a
> previously acquired bit?


:) No.
I'm talking about the situation when thread 0 holds bit 0 some of the
time, and thread 1 holds bit 1 some of the time, ..., and thread N holds
bit N some of the time, so that in total there is always some thread
holding at least 1 bit out of the 32/64/128 bits.
I understand that 'always' is not 'entirely always', but 'long
enough'.

Oh! Good analogy: it's like a rw-mutex without writer preference. You
know, with such a mutex you can have many readers, but not too many, so
as not to cause writer starvation.
And in practice a rw-mutex is usually made with writer preference. So
you can mark out bit 0 as a 'writer outstanding' flag (i.e. 'whole-word
lock outstanding'). If the bit-lock operation sees this flag set, it
waits. This gives preference to the 'whole-word lock' operation and
prevents starvation.
I'm not sure that it will be better. It's just a thought. What do you
think?


Dmitriy V'jukov

Dmitriy Vyukov

Jan 10, 2008, 5:44:37 AM
On Jan 10, 12:23 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> > Does that make any sense at all?
> > No. I don't understand how can you gather several separate locks into
> > one word... don't understand yet ;)
> Lets say a transaction descriptor was something like this:


So if more than one object maps to one bitmtx then you lock the whole
bitmtx. And if only one object maps to a bitmtx then you lock one bit in
the bitmtx. Right?

Imvho this is a very questionable approach. If there are many bitmtx'es
then the probability that more than one object maps to a bitmtx is
small. If there is a small number of bitmtx'es then contention will be
very high (then maybe it's better to just use one big global mutex).


Dmitriy V'jukov

Chris Thomasson

unread,
Jan 10, 2008, 6:46:55 AM1/10/08
to
> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:e75f71e4-60aa-40d1...@s19g2000prg.googlegroups.com...

> On Jan 9, 11:58 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> > "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
[...]

> > My allocator can get single-thread performance on _local_ allocations
> > and
> > deallocations. How can you beat that? Are you getting single-thread
> > performance on _remote_ deallocations?


> Close to it. At least there are no atomic rmw operations or store-load
> memory barriers. But, of course, there are some writes to remote
> memory, i.e. cache line transfers.

I can get single-thread performance on remote deallocations by using a
simple caching algorithm. Then where was the really distributed allocator I
_very_briefly_ mentioned here:

http://groups.google.com/group/comp.programming.threads/msg/8232d2f4db84e268


IIRC, it worked well with producer-consumer; however, I don't have it in
the current vZOOM distro. But I might stick it in there sometime.


> I already have working prototype. I need to investigate some low level
> behaviors of my hardware a bit more.

You're only going to see a performance benefit from low-overhead remote
deallocations in certain producer-consumer-like models, right?

Chris Thomasson

unread,
Jan 10, 2008, 11:49:46 PM1/10/08
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:IZudnZvc6e86nhva...@comcast.com...

>> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
>> news:e75f71e4-60aa-40d1...@s19g2000prg.googlegroups.com...
>> On Jan 9, 11:58 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>> > "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
> [...]
>> > My allocator can get single-thread performance on _local_ allocations
>> > and
>> > deallocations. How can you beat that? Are you getting single-thread
>> > performance on _remote_ deallocations?
>
>
>> Close to it. At least there are no atomic rmw operations or store-load
>> memory barriers. But, of course, there are some writes to remote
>> memory, i.e. cache line transfers.
>
> I can get single-thread performance on remote deallocations by using a
> simple caching algorithm. Then where was the really distributed allocator
> I
^^^^^^^^^^^^^^^^^^^^^^^

Then there was

Chris Thomasson

unread,
Jan 10, 2008, 11:52:58 PM1/10/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:92302bad-85fb-42b9...@m77g2000hsc.googlegroups.com...

> On Jan 10, 12:06 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
>>
>> news:3a25498d-fb2b-4646...@n20g2000hsh.googlegroups.com...
>>
>>
>>
>> > On Dec 20 2007, 1:01 am, "Chris Thomasson" <cris...@comcast.net>
>> > wrote:
>>
>> >> /* acquire the ptr lock */
>> >> static void bitmtx_lockblock(
>> >> bitmtx* const _this,
>> >> void* const ptr
>> >> ) {
>> >> atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];
>> >> while (! ATOMIC_CAS(mtx, 0, ATOMICUWORD_MAX())) {
>> >> /* exponential backoff */
>> >> }
>>
>> > What if there is always some bits locked?
>>
>> Are you talking about a scenario in which a thread forgets to unlock a
>> previously acquired bit?
>
>
> :) No
> I'm talking about situation when thread 0 holds bit 0 some time, and
> thread 1 holds bit 1 some time, ..., and thread N holds bit N some
> time.
[...]

In the actual code there is backoff-and-retry logic that attempts to
take care of this scenario, as it's basically analogous to the way
trylock-and-retry code can sort-of livelock under certain scenarios.
Think of a trylock that fails, and the owner releases and another thread
consistently acquires it before the first can retry...

Chris Thomasson

unread,
Jan 10, 2008, 11:55:03 PM1/10/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:4fa2e399-ee62-46ac...@i12g2000prf.googlegroups.com...

> On Jan 10, 12:23 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> > Does that make any sense at all?
>> > No. I don't understand how can you gather several separate locks into
>> > one word... don't understand yet ;)
>> Lets say a transaction descriptor was something like this:
>
>
> So if there is more than one object maps to one bitmtx then you lock
> whole bitmtx. And if there is only one object maps to bitmtx then you
> lock one bit in bitmtx. Right?

It does not lock a whole bitmtx; it acquires an entire lock-word in the
bitmtx::locks array.

>
> Imvho this is very questionable approach. If there is many bitmtx'es
> than probability that more than one object will map to bitmtx is
> small. If there is small number of bitmtx'es then contention will be
> very high (than maybe it's better to just use one big global mutex).

You can use an indirection hash algorithm to make a further mapping in the
presence of multiple object versions.

Dmitriy Vyukov

unread,
Jan 11, 2008, 4:44:25 AM1/11/08
to
On Jan 10, 2:46 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> I can get single-thread performance on remote deallocations by using a
> simple caching algorithm. Then where was the really distributed allocator I
> _very_briefly_ mentioned here:


In my algorithm a thread doesn't have to poll N (number of threads)
queues. So it is scalable in that sense. And it easily handles steady
dynamic creation/destruction of threads.

But it uses some very dirty tricks... something like hitting a page
fault as a normal event.
And I haven't measured its behavior in the wild, or compared it to
other allocators.


> > I already have working prototype. I need to investigate some low level
> > behaviors of my hardware a bit more.
>
> Your only going to see performance benefit from low-overhead remote
> deallocations in certain producer-consumer like models right?

I think that every piece of server software has the producer-consumer
pattern in some form...


Dmitriy V'jukov

Dmitriy Vyukov

unread,
Jan 11, 2008, 4:50:39 AM1/11/08
to
On Jan 11, 7:55 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

>
> news:4fa2e399-ee62-46ac...@i12g2000prf.googlegroups.com...
>
> > On Jan 10, 12:23 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> >> > Does that make any sense at all?
> >> > No. I don't understand how can you gather several separate locks into
> >> > one word... don't understand yet ;)
> >> Lets say a transaction descriptor was something like this:
>
> > So if there is more than one object maps to one bitmtx then you lock
> > whole bitmtx. And if there is only one object maps to bitmtx then you
> > lock one bit in bitmtx. Right?
>
> It does not lock a whole bitmtx, it acquires an entire lock-word in the
> bitmtx::locks array.


Yes, yes. It's just a slip of the pen.


> > Imvho this is very questionable approach. If there is many bitmtx'es
> > than probability that more than one object will map to bitmtx is
> > small. If there is small number of bitmtx'es then contention will be
> > very high (than maybe it's better to just use one big global mutex).
>
> You can use an indirection hash algorithm to make a further mapping in the
> presence of multiple object versions.


As I understand it, the whole point here is to enforce locality somehow,
i.e. to force many objects to map to a small number of lock-words. I
don't see a way to do this with acceptable overhead.
This could be done with dynamic binding of objects to lock-words. But
every binding and unbinding must be a CAS. So that is 2*N additional
CAS'es (N = number of objects in the transaction). It's just not worth it.


Dmitriy V'jukov

Dmitriy Vyukov

unread,
Jan 11, 2008, 4:59:22 AM1/11/08
to


But in this scenario the thread at least *can* acquire the lock from the
application's POV. Which thread actually acquires the lock is determined
by some policies in the OS and hardware. But at least the application
makes it possible for every thread to acquire it.
In the lock-word case a thread *cannot* acquire the lock even from the
application's POV.
Think of a rw-mutex with and without writer preference.

But I understand that this is like obstruction-free: forward progress is
not guaranteed theoretically, but in a real situation one can be sure
that forward progress will happen.

It's difficult to say whether this is a real problem with the lock-word
or just theoretical reasoning, until someone measures lock-word
behavior under very high load.


Dmitriy V'jukov

Chris Thomasson

unread,
Jan 11, 2008, 5:09:35 AM1/11/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:e0f9e5d7-9966-4f60...@f47g2000hsd.googlegroups.com...

> On Jan 11, 7:55 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
>>
>> news:4fa2e399-ee62-46ac...@i12g2000prf.googlegroups.com...
>>
>> > On Jan 10, 12:23 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>>
>> >> > Does that make any sense at all?
>> >> > No. I don't understand how can you gather several separate locks
>> >> > into
>> >> > one word... don't understand yet ;)
>> >> Lets say a transaction descriptor was something like this:
>>
>> > So if there is more than one object maps to one bitmtx then you lock
>> > whole bitmtx. And if there is only one object maps to bitmtx then you
>> > lock one bit in bitmtx. Right?
>>
>> It does not lock a whole bitmtx, it acquires an entire lock-word in the
>> bitmtx::locks array.
>
>
> Yes, yes. It's just slip of the pen.

Ahh.. Yes of course. Sorry about that!

:^0


>> > Imvho this is very questionable approach. If there is many bitmtx'es
>> > than probability that more than one object will map to bitmtx is
>> > small. If there is small number of bitmtx'es then contention will be
>> > very high (than maybe it's better to just use one big global mutex).
>>
>> You can use an indirection hash algorithm to make a further mapping in
>> the
>> presence of multiple object versions.
>
>
> As I understand the whole point here is to enforce locality somehow.
> I.e. to enforce many objects to map to small number of lock-words. I
> don't see ways to make this with acceptable overhead.
> This can be done with dynamic binding of objects to lock-words. But
> every binding and unbinding must be a CAS. So this is 2*N additional
> CAS'es (N - number of objects in transaction). It's just not worth it.

Well, I have not seen the technique I use in any paper or proceedings...
However, I think it's not new... Anyway, CAS is not involved per-se in the
scheme wrt the hashing algorithm itself. BTW, this is all a vain effort to
get some ability to aid locking protocols in STM. In essence, it's not
worth it indeed. I need to face the "big" picture; STM is as it is in the
end... So, in the end, I believe you're correct and I am most likely
barking up the wrong tree indeed.

;^(...

Chris Thomasson

unread,
Jan 11, 2008, 5:19:52 AM1/11/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:2025cead-ac7b-44fe...@u10g2000prn.googlegroups.com...

> On Jan 10, 2:46 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> I can get single-thread performance on remote deallocations by using a
>> simple caching algorithm. Then where was the really distributed allocator
>> I
>> _very_briefly_ mentioned here:
>
>
> In my algorithm thread don't have to poll N (number of threads)
> queues. So it is scalable in that sense.

Well, I only mentioned some of the outlining properties very briefly. It
only "polls" in the sense that a pool of memory was exhausted and it
should try another bucket. On the remote side, well, things get treated
specially. Are queues involved with the transfer back to home-base?
Well, yes and, sometimes, no. Home-base being the memory-pool of origin
wrt a given block of memory.


> And it easily handles steady
> dynamic creation/destruction of threads.

As does my original slab allocator's reference counting logic. Although
frequent creation/destruction of threads could be considered some sort of
systemic "design" flaw on the _application_ side of things. Creating a
thread can be, and probably should be, considered an expensive operation.
I am thinking that thread-pools dynamically expanding and contracting
would be a somewhat "fairly" rare occasion...


> But it uses some very dirty tricks...

Sometimes, the dirtier the trick the better...

;^)


> something like hitting paging
> fault as normal event.

Is that really needed? I can do a simple cache with reference count
offsets on the remote side. The queue allocator thing was/is
experimental. I have not seen any real "urgent" need to put it into the
vz distro as of yet. Perhaps you can change my mind...


> And I don't measure it's behavior in the wild, and don't compare to
> other allocators.

I would suggest going up asking that StreamFlow thing. I can beat it, and I
know you can as well.


>> > I already have working prototype. I need to investigate some low level
>> > behaviors of my hardware a bit more.
>>
>> Your only going to see performance benefit from low-overhead remote
>> deallocations in certain producer-consumer like models right?
>
> I think that every server software have a piece of producer-consumer
> pattern in some form...

Fair enough.

Chris Thomasson

unread,
Jan 11, 2008, 5:24:10 AM1/11/08
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:iZadnZ2nrMst3Rra...@comcast.com...
[...]

>> And I don't measure it's behavior in the wild, and don't compare to
>> other allocators.
>
> I would suggest going up asking that StreamFlow thing.
^^^^^^^^^^^^^^^^^^^^

going up against that


> I can beat it, and I know you can as well.

[...]

Dmitriy Vyukov

unread,
Jan 11, 2008, 5:18:54 AM1/11/08
to
On Jan 11, 1:09 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> > As I understand the whole point here is to enforce locality somehow.
> > I.e. to enforce many objects to map to small number of lock-words. I
> > don't see ways to make this with acceptable overhead.
> > This can be done with dynamic binding of objects to lock-words. But
> > every binding and unbinding must be a CAS. So this is 2*N additional
> > CAS'es (N - number of objects in transaction). It's just not worth it.
>
> Well, I have not seen the technique I use in any paper or proceedings...
> However, I think its not new... Anyway, CAS is not involved per-se in the
> scheme wrt hashing algorithm itself. BTW, this is all in vain effort to get
> some ability to aid locking protocols in STM. In essence, its not worth it
> indeed. I need to face the "big" picture, STM is as it is in the end... So,
> in the end, I believe your correct and I am most likely barking up the wrong
> tree indeed.


Hmmm. Maybe one can use lazy one-time binding of objects to locks.
Something like:

/* returns the lock-word index bound to obj, binding it on first use */
size_t get_hash(object_t* obj, size_t preferred)
{
    if (obj->hash)
        return obj->hash;       /* already bound */
    /* CAS returns the previous value; 0 means our binding won */
    size_t prev = CAS(obj->hash, 0, preferred);
    if (prev)
        return prev;            /* raced: adopt the winner's binding */
    else
        return preferred;
}

... in the hope that some acceptable distribution forms dynamically.
Here a thread can somehow enforce a "good" distribution of objects to
locks.

... and some mechanism is needed to prevent the situation where all
objects end up mapped to lock-word 0 :)))

... and this is a definitely intrusive approach; you are committed to an
unintrusive approach afaik


Dmitriy V'jukov

Dmitriy Vyukov

unread,
Jan 11, 2008, 5:40:23 AM1/11/08
to
On Jan 11, 1:19 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> > In my algorithm thread don't have to poll N (number of threads)
> > queues. So it is scalable in that sense.

> Well, I only mentioned some of the outlining properties very briefly. It
> only "polls" in the sense that a pool of memory was exhausted and it should
> try another bucket. On the remote side, well things get treated specially.
> Are queues involved with the transfer back to home-base, well, yes and,
> sometimes no. Home-base being the memory-pool of origin wrt a given block of
> memory.


I'm not sure I understand you... My English is not so good as C++...
I'm sorry...


> > And it easily handles steady
> > dynamic creation/destruction of threads.
>
> As does my original slab allocator reference counting logic. Although,
> frequent creation/destruction of threads could be considered as some sort-of
> systemic "design" flaw on the _application_ side of things. Creating a
> thread can be, and should probably be considered as an expensive operation.
> I am thinking that thread-pools dynamically expanding and contracting to be
> a somewhat "fairly" rare occasion...


I just mean that if you have a design based on spsc-queues, then when a
new thread is created you have to add a new spsc-queue to every
existing thread.


> > But it uses some very dirty tricks...
>
> Sometimes, the dirtier the trick the better...
>
> ;^)

:)

>
> > something like hitting paging
> > fault as normal event.
>
> Is that really needed?


What? Hitting paging fault? Or allocator?


> I can do simple cache with reference count offsets in
> remote side.


Reference count offsets?


> > And I don't measure it's behavior in the wild, and don't compare to
> > other allocators.
>
> I would suggest going up asking that StreamFlow thing. I can beat it, and I
> know you can as well.


But they have a design equivalent to your slab-allocator. We need some
explosive mix of the dirtiest tricks to beat them :)))


Dmitriy V'jukov

Dmitriy Vyukov

unread,
Jan 11, 2008, 6:40:21 AM1/11/08
to
On Dec 20 2007, 1:01 am, "Chris Thomasson" <cris...@comcast.net>
wrote:

> You can take all the locks by using CAS to compare the lock-word with 0, and
> swap in a maximum value.
> Any thoughts on this approach?


Ok. I've used a bit of my "valuable brain" for this :)
A bit mutex with selective batch acquire:


static void bitmtx_lock_selective(
    bitmtx* const _this,
    void* const ptr,
    atomicuword mask
) {
    atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];

    for (;;) {
        atomicuword const prev = *mtx;
        if (0 == (prev & mask)) {
            if (ATOMIC_CAS(mtx, prev, prev | mask)) {
                break;
            }
        }
        /* exponential backoff */
    }
}

static void bitmtx_unlock_selective(
    bitmtx* const _this,
    void* const ptr,
    atomicuword mask
) {
    atomicuword* const mtx = &_this->locks[BITMTXSYS_HASHPTR(ptr)];

    ATOMIC_AND(mtx, ~mask);
}


You can form mask on-the-fly:

void td_push(td* const _this, tdact* act) {
    tdbucket* const bucket
        = &_this->buckets[BITMTXSYS_HASHPTR(act->obj)];
    act->nx = bucket->acts;
    bucket->acts = act;

    /******************************************/
    bucket->mask |= BITMTXSYS_HASHBIT(act->obj);
    /******************************************/
}

void td_lock(td* const _this) {
    size_t i;
    for (i = 0; i < BITMTXSYS_TBLMAXDEPTH(); ++i) {
        tdact* const act = _this->buckets[i].acts;
        if (act) {
            if (act->nx) {
                /**********************************/
                /* take selective locks in bucket */
                bitmtx_lock_selective(&g_mtxtbl, act->obj,
                    _this->buckets[i].mask);
                /**********************************/
            } else {
                bitmtx_lock(&g_mtxtbl, act->obj);
            }
        }
    }
}


In this way you can reduce contention and make the locking finer-grained.
Or you can make the locking even smarter:


void td_lock(td* const _this) {
    size_t i;
    for (i = 0; i < BITMTXSYS_TBLMAXDEPTH(); ++i) {
        tdact* const act = _this->buckets[i].acts;
        if (act) {
            if (_this->buckets[i].count >= THRESHOLD) {
                bitmtx_lockblock(&g_mtxtbl, act->obj);
            } else if (_this->buckets[i].count != 1) {
                bitmtx_lock_selective(&g_mtxtbl, act->obj,
                    _this->buckets[i].mask);
            } else {
                bitmtx_lock(&g_mtxtbl, act->obj);
            }
        }
    }
}


Dmitriy V'jukov

Chris Thomasson

unread,
Jan 11, 2008, 9:17:30 AM1/11/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:742a4a05-35f1-4adc...@j20g2000hsi.googlegroups.com...

> On Dec 20 2007, 1:01 am, "Chris Thomasson" <cris...@comcast.net>
> wrote:
>
>> You can take all the locks by using CAS to compare the lock-word with 0,
>> and
>> swap in a maximum value.
>> Any thoughts on this approach?
>
>
> Ok. I've used a bit of my "valuable brain" for this :)
> Bit mutex with selective batch acquire:
[...]

Your work is duly noted, as you've made a significant improvement on the
bitmtx impl. I just hope that we are not the only people reading this!

;^)

Chris Thomasson

unread,
Jun 3, 2008, 8:07:30 AM6/3/08
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:zNOdnUeVP51XD_Ta...@comcast.com...
> This is a dual-hashed bitlock algorithm that can be used for things like
> STM. It works by hashing the pointer address to an index into an array of
> lock-words, _and_ a corresponding bit. Then, it uses that bit as the
> lock-state for the selected lock-word. For instance:
[...]

>
> You can take all the locks by using CAS to compare the lock-word with 0,
> and swap in a maximum value.
>
>
>
> Here is the pseudo-code draft:
> __________________________________________________________________
> #include <limits.h>
> #include <stddef.h>


> #include <math.h>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
drop that nonsense!


> typedef unsigned int atomicuword;
> #define ATOMICUWORD_MAX() UINT_MAX
>
> typedef struct bitmtx_s bitmtx;
>
> #define BITMTXSYS_DEPTH() 32
> #define BITMTXSYS_TBLDEPTH() UCHAR_MAX
>
> #define BITMTXSYS_TBLMAXDEPTH() ( \
> BITMTXSYS_TBLDEPTH() + 1 \
> )
>
> #define BITMTXSYS_HASHPTR(mp_this) (((unsigned char) \
> ((((atomicuword)(mp_this)) * 307) >> 8) \
> ))
>
> #define BITMTXSYS_HASHBIT(mp_this) (((atomicuword) \
> pow(2, ((BITMTXSYS_HASHPTR(mp_this) % BITMTXSYS_DEPTH()) + 1)) \
> ))

Let me fix some bugs here... First of all the `BITMTXSYS_DEPTH' macro needs
to read as:

#define BITMTXSYS_DEPTH() 31


And the most retarded bug of all... Well, it's not really a bug, it's just
completely retarded! The `BITMTXSYS_HASHBIT' macro should read as:


#define BITMTXSYS_HASHBIT(mp_this) (((atomicuword) \
  1 << (BITMTXSYS_HASHPTR(mp_this) % BITMTXSYS_DEPTH()) \
))

STUPID! ARGH!


[...]
