Global lock contention problem

Joe Seigh

unread,

Mar 5, 2006, 12:00:36 PM3/5/06

to

I'm starting to see this too many times. The situation is
one we've seen here before. You have a large global data
structure with dynamically allocated nodes. You need to access
the nodes safely. Using a single global lock it is out for
obvious reasons. You can use per object locking but you
still need a mechanism to safely access the node long enough
to get the lock. Even for that brief amount of time you can
still get contention on the lock. Contention on the object
locks can aggrevate this since you will block while holding
the global lock. You can use reference counting in place of
object locks but you need a portable mechanism for doing that,
e.g. something like boost shared_ptr.

You could use a lock-free solution but there's the issue of
getting lock-free solutions ported to all the necessary platforms
and of educating users, no one's written a "Programming without
Locks" book yet. And I'm not going to do the former since I've
no resources for that kind of thing.

So is there a solution here that doesn't involve lock-free?
This will get me off the hook on trying to support something
I don't have the resources for. I'd be more than happy to
promote the conventional supported solution here.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Michel André

unread,

Mar 5, 2006, 5:43:29 PM3/5/06

to

Joe Seigh wrote:
> So is there a solution here that doesn't involve lock-free?
> This will get me off the hook on trying to support something
> I don't have the resources for. I'd be more than happy to
> promote the conventional supported solution here.

Please elaborate a bit more ;), I'm in the trensches you describe all
day long..

/Michel

Joe Seigh

unread,

Mar 5, 2006, 8:31:10 PM3/5/06

to

I'm the one asking the question which hasn't been answered yet.

But there is one solution based on something discussed earlier
on this. You can use lock hashing. You create an array of
locks and hash the object key, not the object address since
you don't know it yet, to a lock in the array, access the
global data structure to find the object, increment a reference/use
count in the object (or alternatively a lock in the object), and
release the lock. When you're finished with the object, get the hashed
object key lock, decrement the reference/use count (run dtor if zero),
and release the lock.

To update the global data structure, you acquire all of the hashed
access locks, do the modification, and release the locks. This part
is the reason you don't want to use the object key locks as object locks.
It would take too long to acquire "write" access otherwise. You
could do the object key lock acquires in to passes, using trylock on the
first pass and unconditional locks on second pass on the locks
not acquired on the first pass. If you do that use a global "write"
lock to keep simultaneous writers from deadlocking.

Not the greatest solution but at least I can tell people to use
that and it won't be my problem.

Dervish

unread,

Mar 6, 2006, 7:19:09 AM3/6/06

to

Global lock is a most general solution to protect global data. However
for special cases sometimes it is possible to create lock-free
solution.
E.g. if:
1. Global data is changed by only by one writer thread with moderate
speed (e.g. once per second).
2. There are 0 or more readers.
3. It takes no more than limited time (e.g. 0.1 sec) for reader to get
data.

Than:
0. Access to global data is performed through global pointer.

Writer:
1. Copy data to new place.
2. Modify data.
3. Swap pointers. Store old one.
4. Delete data pointed by old pointer after 0.5 sec (or let garabage
collector clean up).

Reader:
5. Just read by pointer, but don't spent more than 0.1 sec in reading
(or let garbage collector work).

There are two problems with this solution:
1. Possible huge memory consumption (write speed is too big).
2. Access to deallocated memory (if reader failed to get access in
timely fashion, or cached some pointer).

Joe Seigh

unread,

Mar 6, 2006, 8:24:56 AM3/6/06

to

Making timing assumptions is known not to work reliably. For GC, it
has to be atomically thread-safe and you have "stop the world" GC problems. There
are forms of GC that will work, e.g. RCU, hazard pointers, atomic reference
counting which have been discussed before but that's all lock-free. I'm
looking for solutions using Posix synchronization only.

Peter Dimov

unread,

Mar 6, 2006, 1:47:33 PM3/6/06

to

Joe Seigh wrote:

> But there is one solution based on something discussed earlier
> on this. You can use lock hashing. You create an array of
> locks and hash the object key, not the object address since
> you don't know it yet, to a lock in the array, access the
> global data structure to find the object, increment a reference/use
> count in the object (or alternatively a lock in the object), and
> release the lock. When you're finished with the object, get the hashed
> object key lock, decrement the reference/use count (run dtor if zero),
> and release the lock.
>
> To update the global data structure, you acquire all of the hashed
> access locks, do the modification, and release the locks.

Can't you use a single rwlock? To find the object, you acquire a read
lock, get a shared_ptr to the object, release the lock. To update, you
acquire a write lock.

Michel André

unread,

Mar 6, 2006, 4:41:42 PM3/6/06

to

I have seen most of the solutions sketched out in this thread:
1. Global lock (structure/data)
2. Lock table with objects hashed (using all lock solution to update
structure or a special structure lock)
3. Rwlocks
4. Asynchrounous updates, if someone accesses the list during an update
add the update to a queue and execute it when the list isn't used anymore.
5. Out of place update and pointer swap and "hoping" that readers will
be done when pointer is reclaimed/reused.

http://www.cs.wustl.edu/~schmidt/PDF/dispatching.pdf
contains some patterns for concurrent dispatching to multiple clients
and it contains some patterns for how to access global dispatch tables.

When doing having done tests almost always 1 wins if you try to keep the
locks as short as humanly possible, which is good practice anyways. But
my gut feel is that there should be a smarter solution. And usually I
have this feel that there will be contention on the lock and everything
will grind to halt ;). 4 performs very well in microbenchmarks due to
lack of lock but is bad style since it relies on timing, scheduling and
is hard if you are getting events that need to be reflected in the
strucure with low latency.

2 also does fairly well if you seldom add or remove nodes, but the
overhead of changing the structure is large.

/Michel

Joe Seigh

unread,

Mar 6, 2006, 5:57:41 PM3/6/06

to

Peter Dimov wrote:
> Joe Seigh wrote:

[...]

>
>
> Can't you use a single rwlock? To find the object, you acquire a read
> lock, get a shared_ptr to the object, release the lock. To update, you
> acquire a write lock.
>

In theory an rwlock implmentation with a lock-free fast path
would be better or as good. Unfortunately most programmers are
imprinted with that generic rwlock they teach in school with counts
of readers, writer, waiting readers, waiting writers, reader preference,
writer preference, etc... Which means an internal mutex and condvar
at least to coordinate things. And if you are having contention
problems using a mutex explicitly, using one implicitly isn't going
to work out any better. The Linux version works this way. So while
using lots of hashed locks sucks, it sucks less than the typical
rwlock implementation.

BTW, you don't have to hash on the object key, you can hash on anything,
e.g. thread or processors ids, etc..., and do a rwlock that way. But
if you don't use object key then you can't do any object level synchronization,
i.e. it acts like a regular rwlock and if you want to do something like
increment a reference count you need an additional lock or an interlocked
function to do it with.

And I think that is the last technique I will be allowed to put into
the public domain.

Chris Thomasson

unread,

Mar 17, 2006, 5:06:49 PM3/17/06

to

Indeed.

> There
> are forms of GC that will work, e.g. RCU, hazard pointers, atomic reference
> counting which have been discussed before but that's all lock-free. I'm
> looking for solutions using Posix synchronization only.

I have a solution for this that uses %99.99 POSIX mutexs; only relies on
dependent load ordering. I will now briefly describe how the garbage
collection part of my solution works:

- Each thread has two mutexs ( owner_lock, memory_lock ) an array of
persistent reference counters ( prefs[] ), a synchronization state
variable ( sync_state ), and a list of deferred objects ( my_defer ).

- A polling thread has a condition variable ( poll_cond ), a mutex (
poll_lock ), a list of registered threads ( treg ) and three lists of
deferred objects ( new_defer, old_defer, final_defer )

- Each thread that registers with the polling thread shall lock the
poll_lock; "lock memory_lock"; push itself onto treg; unlock poll_lock
and signal poll_cond.

- Each thread that unregisters with the polling thread shall lock the
poll_lock; "unlock memory_lock"; pop itself from treg; move my_defer to
the polling threads new_defer; unlock poll_lock and signal poll_cond.

- A registered thread is supposed to periodically or episodically
execute a synchronization operation. This can be achieved by locking the
owner_lock; unlocking the memory_lock; set a flag in sync_state; lock
the memory_lock and finally unlock the owner_lock. When the thread
"unlocks and subsequently locks" the memory_mutex "inside of the
owner_locks critical region" it has the effect of making the critical
section that was protected by memory_mutex "visible to the polling
thread". All of the modifications to the prefs[] array are carried out
under the critical section of the memory_mutex because it is "always
locked", except for "explicit synchronization".

- A registered thread acquires persistent references to an object by
loading a pointer to object; dependent load barrier; hash pointer to
object into index; the indexed counter in prefs[] is incremented.

- A registered thread releases persistent references to an object by
hashing a previously acquired object into an index; the indexed counter
in prefs[] is decremented.

- A registered thread defers an object by making it unreachable and
queuing it into my_defer.

- The polling thread periodically or episodically checks to see if each
registered thread has executed a synchronization operation. This is
achieved by waiting on poll_cond for the desired polling interval. When
its time to poll it iterates through treg and for each thread: lock
owner_lock; examining the sync_state variable for a flag; if the flag is
set continue, if not unlock previously acquired owner_locks and wait for
another polling interval.

- After the polling thread has determined that all of the threads have
executed a synchronization then it resets the sync_state flag of each
thread and unlocks all previously acquired owner_locks; It is now
guaranteed that all "previous effects" on the prefs[] arrays "since the
last synchronization" are "fully visible". It then calls the deferred
requests that exist in final_defer and empties it; swaps new_defer with
old_defer; collects a new generation of deferred requests from my_defer
in all registered threads into new_defer; checks the old_defer list
against each threads prefs[] array. If a old deferred request is not
found to have any persistent references, it is then moved into a the
final_defer list. Then another polling interval is waited for.

That briefly describes a certain portable embodiment of my vzoom patent.
As you can see, the only place that dependent load ordering is
required is when a thread loads a pointer to a shared object. That has
to be about as portable as you can get and still allow for virtually
zero-overhead access. If you have any questions please feel free to fire
away...

:)

Now to address the global locking issue... I have posted a solution that
I call multi_mutex here:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi+mutex&rnum=1#3ca11e0c3dcf762c

This can be augmented with a simple "back-off-retry" algorithm to allow
for true two-phase locking. I bet this can be used for lock-based STM
which should address per-node locking. What do you think of this one Joe?

P.S.

I found some more time to hang out here on c.p.t. so I will be around
for a while. I know I promised demo applications of vzoom. The truth is
that I am still tinkering around with them, and I was advised not to
give away any actual source code but only brief descriptions and perhaps
some of the drawings. Plus, I received a couple of licensing offers as
well. So I need to address these issues before I can provide source code.

For now, I can only give brief descriptions, like the one above. If you
want to see some of the drawings I can post some of them as well. Any
thoughts on this?

Chris Thomasson

unread,

Mar 17, 2006, 5:11:49 PM3/17/06

to

> In theory an rwlock implmentation with a lock-free fast path
> would be better or as good. Unfortunately most programmers are
> imprinted with that generic rwlock they teach in school with counts
> of readers, writer, waiting readers, waiting writers, reader preference,
> writer preference, etc... Which means an internal mutex and condvar
> at least to coordinate things. And if you are having contention
> problems using a mutex explicitly, using one implicitly isn't going
> to work out any better. The Linux version works this way. So while
> using lots of hashed locks sucks, it sucks less than the typical
> rwlock implementation.

Here is a quick and easy rwmutex I posted with a lock-free fast-path:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5ffb0ed46a4bcb4b/b73084939aac60f0?q=rw+mutex&rnum=1#b73084939aac60f0

That should be fast enough...

;)

> And I think that is the last technique I will be allowed to put into
> the public domain.

:)

Chris Thomasson

unread,

Mar 17, 2006, 8:29:05 PM3/17/06

to

> Joe Seigh wrote:
>> There
>> are forms of GC that will work, e.g. RCU, hazard pointers, atomic
>> reference
>> counting which have been discussed before but that's all lock-free. I'm
>> looking for solutions using Posix synchronization only.
>
>
> I have a solution for this that uses %99.99 POSIX mutexs; only relies on
> dependent load ordering. I will now briefly describe how the garbage
> collection part of my solution works:
>
>
> - Each thread has two mutexs ( owner_lock, memory_lock ) an array of
> persistent reference counters ( prefs[] ), a synchronization state
> variable ( sync_state ), and a list of deferred objects ( my_defer ).

Here is a simple example of how you could use vzoom to implement a
portable write operation that is "lock-free read safe"... To push into a
shared collection you would do this:

1. vzoom = pthread_getspecific(...)
2. alloc new node
3. lock your_collection->lock
4. init/config new node
5. unlock vzoom->memory_lock
6. lock vzoom->memory_lock
7. make new node visible to readers
8. unlock your_collection->lock

Here are some quick facts:

1. Step 4 can't sink below step 5
2. Step 7 can't rise above step 6

This means that the initialization and/or configuration of the new node
will be visible "before" it is made visible to any potential readers. As
stated before, readers just rely on dependent ordering and
periodically/episodically executing a synchronization operation:

Chris Thomasson

unread,

Mar 17, 2006, 9:12:55 PM3/17/06

to

Dervish wrote:
> Global lock is a most general solution to protect global data. However
> for special cases sometimes it is possible to create lock-free
> solution.
> E.g. if:
> 1. Global data is changed by only by one writer thread with moderate
> speed (e.g. once per second).

There are some techniques that can handle unexpected, and perhaps
prolonged periods of much higher loads; hundreds, even thousands of
writes per-second, perhaps even more! IMHO, a collector should be robust
enough to be able to handle this kind of load.

> 2. There are 0 or more readers.
> 3. It takes no more than limited time (e.g. 0.1 sec) for reader to get
> data.
>
> Than:
> 0. Access to global data is performed through global pointer.
>
> Writer:
> 1. Copy data to new place.
> 2. Modify data.
> 3. Swap pointers. Store old one.
> 4. Delete data pointed by old pointer after 0.5 sec (or let garabage
> collector clean up).
>
> Reader:
> 5. Just read by pointer, but don't spent more than 0.1 sec in reading
> (or let garbage collector work).
>
> There are two problems with this solution:
> 1. Possible huge memory consumption (write speed is too big).

This can be easily addressed by keeping track of the number of deferred
requests. I left the overflow logic out of my collector description I
gave in one of my replies to Joe to help make things easier. The
algorithm can be as simple as this:

If the number of requests goes above a "high-watermark" you can choose
to force writers to wait until the polling thread drops it back below a
"low-watermark".

http://groups.google.com/group/comp.programming.threads/msg/de3eff43d6905998

Joe Seigh

unread,

Mar 18, 2006, 8:50:42 AM3/18/06

to

Chris Thomasson wrote:
[...]

>
>
>
> Now to address the global locking issue... I have posted a solution that
> I call multi_mutex here:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi+mutex&rnum=1#3ca11e0c3dcf762c
>
>
> This can be augmented with a simple "back-off-retry" algorithm to allow
> for true two-phase locking. I bet this can be used for lock-based STM
> which should address per-node locking. What do you think of this one Joe?
>
>
>
>
> P.S.
>
> I found some more time to hang out here on c.p.t. so I will be around
> for a while. I know I promised demo applications of vzoom. The truth is
> that I am still tinkering around with them, and I was advised not to
> give away any actual source code but only brief descriptions and perhaps
> some of the drawings. Plus, I received a couple of licensing offers as
> well. So I need to address these issues before I can provide source code.
>
> For now, I can only give brief descriptions, like the one above. If you
> want to see some of the drawings I can post some of them as well. Any
> thoughts on this?

I don't know. I'm going to be too busy to be spending too much time
on this stuff. It looks like all of the lock-free stuff is or will
be patented so if anyone wants to use something they will have to
license something from somebody. So lock-free will probably never
show up as part of a standard. It will remain propietary. Which
pretty much means shared memory multi-threading is dead in the long
term especially given how clueless Sun and Intel are on how to deal with
scalability issues in this area. I would guess some form of hardware
based message passing will become the predominate architecture in the future.
E.g. something like AMD adding the hyperlink IPC to their ISA to allow bypassing
shared memory as a thread communication mechanism. It will make the Occam/CSP
fanboys happy.

Joe Seigh

unread,

Mar 18, 2006, 8:58:03 AM3/18/06

to

s/hyperlink/hypertransport/