
pthread rwlock problem


v...@3priedez.net

Dec 12, 2005, 10:37:25 AM
i have software which uses a lot (~1 000 000 now) of pthread_rwlock_t
variables.
The program has 15 threads which lock those variables.
After some time (can be 1h, can be two days or two weeks) locking
hangs.

pthread_rwlock_wrlock(&Lock);
(gdb) print Lock
$5 = {__rw_lock = {__status = 4294967296, __spinlock = 0},
__rw_readers = 0, __rw_writer = 0x0, __rw_read_waiting = 0x0,
__rw_write_waiting = 0x0, __rw_kind = 0, __rw_pshared = 0}

when i manually change __status to 0, everything continues.

locks are initialized like this
pthread_rwlock_init(&Lock, NULL);

system:

SUSE_LINUX_9.2
Linux version 2.6.8-24.14-smp (geeko@buildhost) (gcc version 3.3.4
(pre 3.3.5 20040809)) #1 SMP Tue Mar 29 09:27:43 UTC 2005
two Intel(R) Xeon(TM) CPU 3.00GHz stepping 01
4GB RAM
software compiled with gcc version 3.3.4 (pre 3.3.5 20040809)

questions:

what does this __status=-1 status mean?
what is the suggested/max limit of pthread_rwlock_t in one process?
what is a possible resolution for this problem?

Joe Seigh

Dec 12, 2005, 12:32:27 PM
v...@3priedez.net wrote:
> i have software which uses a lot (~1 000 000 now) of pthread_rwlock_t
> variables.
> The program has 15 threads which lock those variables.
> After some time (can be 1h, can be two days or two weeks) locking
> hangs.
>
> pthread_rwlock_wrlock(&Lock);
> (gdb) print Lock
> $5 = {__rw_lock = {__status = 4294967296, __spinlock = 0},
> __rw_readers = 0, __rw_writer = 0x0, __rw_read_waiting = 0x0,
> __rw_write_waiting = 0x0, __rw_kind = 0, __rw_pshared = 0}
>
> when i manually change __status to 0 all goes further.
>
> locks are initialized like this
> pthread_rwlock_init(&Lock, NULL);
>
[...]

>
> questions:
>
> what does this __status=-1 status mean?
> what is the suggested/max limit of pthread_rwlock_t in one process?
> what is a possible resolution for this problem?
>

You may have actually cleared a futex. I don't know where the futexes
are in rwlocks but they redefine the rwlock (especially if it was
originally defined in LinuxThreads) so the fields may not be what they
say they are. You may want to display what the rwlock looks like for
known good states to guess what the fields may be. Or look at the
source if you have a high pain threshold.

You should ask on the Linux kernel mailing list where you might find
the people who wrote the code.

Out of curiosity, why are you using 1000000 rwlocks?


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

vdl

Dec 12, 2005, 12:57:27 PM
tnx. i'll try the kernel list.

i have objects which have some arrays of data and i need to lock those
arrays against changes (so they don't change while another thread is
reading an array).
i could organize objects in pages and lock pages, but that would cost
some processing time, and there is also a high possibility of needing
to lock two items for write in one page at the same time (which would
result in a deadlock). so i chose to use more rwlocks to gain more speed.
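A side note on that deadlock concern: needing to write-lock two items that
fall under the same page lock is usually handled by always acquiring the two
locks in a fixed global order. A minimal sketch, assuming plain pthread
mutexes; the helper names are made up for illustration:

#include <pthread.h>
#include <stdint.h>

/* Sketch: when two per-page mutexes must be held at once, acquire them
   in a fixed global order (here: by address) so that no two threads can
   each hold one lock while waiting on the other. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) {                 /* both items share one page lock */
        pthread_mutex_lock(a);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a; /* order the two locks by address */
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b)
        pthread_mutex_unlock(b);
}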

David Schwartz

Dec 12, 2005, 1:11:31 PM

"vdl" <v...@3priedez.net> wrote in message
news:1134410247.1...@g49g2000cwa.googlegroups.com...

Sounds like premature optimization that was actually a pessimization.
Acquiring and releasing r/w locks is quite expensive.

DS


vdl

Dec 12, 2005, 2:07:57 PM
> Sounds like premature optimization that was actually a pessimization.
> Acquiring and releasing r/w locks is quite expensive.

what do you suggest? organizing objects in pages and using a page mutex?

David Schwartz

Dec 12, 2005, 2:50:39 PM

"vdl" <v...@3priedez.net> wrote in message
news:1134414477.2...@z14g2000cwz.googlegroups.com...

>> Sounds like premature optimization that was actually a pessimization.
>> Acquiring and releasing r/w locks is quite expensive.

> what do you suggest? organizing objects in pages and using a page mutex?

It's hard to say without knowing more about your particular application.
One possibility would be to use spinlocks or mutexes instead of a read/write
lock. Another possibility would be to use 256 read/write locks instead of a
million. You will occasionally get "false blocking" because more than one
object will be mapped to each lock, but odds are that this will be much less
than the cost of managing a million locks.
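A minimal sketch of that idea (a small fixed pool of rwlocks, with objects
hashed onto them by id); the pool size and helper names here are
illustrative only:

#include <pthread.h>

#define NLOCKS 256                      /* small fixed pool, not one lock per object */

static pthread_rwlock_t lock_pool[NLOCKS];

static void lock_pool_init(void)        /* call once at startup */
{
    int i;
    for (i = 0; i < NLOCKS; i++)
        pthread_rwlock_init(&lock_pool[i], NULL);
}

/* map an object id onto one of the pooled locks; unrelated objects that
   hash to the same slot will occasionally block each other ("false
   blocking"), but only 15 threads ever compete for the 256 slots */
static pthread_rwlock_t *lock_for_id(unsigned int id)
{
    return &lock_pool[id % NLOCKS];
}

/* usage:
 *   pthread_rwlock_rdlock(lock_for_id(user->id));
 *   ... read the user's data ...
 *   pthread_rwlock_unlock(lock_for_id(user->id));
 */

The hash can be taken from the object's id or its address; all that matters
is that the same object always maps to the same lock.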

It may be that a more major change is in order. Tell us more about your
application. How many threads do you have? What is the ratio of read
accesses to write accesses? How do threads find objects? How long is a lock
typically held, in both the read and write case?

DS


Joe Seigh

Dec 12, 2005, 3:02:43 PM

We'd need more information. Like are the arrays static or dynamically
allocated somehow? Is the data in the arrays or linked to from the
arrays? How is the data accessed? Sequentially, random, or what?

There are a number of solutions but it all depends. You can use
lock hashing, moving barriers (not same as Posix barriers), or
lock-free (though with static arrays, maybe not).

vdl

Dec 12, 2005, 3:09:54 PM
> It may be that a more major change is in order. Tell us more about your
> application. How many threads do you have? What is the ratio of read
> accesses to write accesses? How do threads find objects? How long is a lock
> typically held, in both the read and write case?
>
> DS

uff. i'll try to make something ;)

userobject{
int id;
array of data;
datalock;
}

blockobject{
array of userobjects;
}

serverobject{
array of blockobjects; //users are organized in blocks to easier
trace/save/load changes in data
userlock;
}

getting user from id requires to rlock userlock of serverobject
adding/removing user from server requires to wlock userlock of
serverobject

reading data from user requires to get user from serverobject and then
rlock user's datalock
changing data of user requires to get user from serverobject and then
wlock user's datalock

i have 15 worker threads (the count is not proven to be the best one ;)
read operations are more frequent than write operations
locks are held for a very small amount of time. typically they should be less
than 5 msec.
in a real situation more than 95-99% of operations finish in less than 50 msec
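For reference, a rough sketch of the read path as described above; the
struct layouts and getuserfromid() are hypothetical stand-ins for the real
code, not taken from it:

#include <pthread.h>
#include <string.h>

typedef struct userobject {
    int id;
    int data[16];                  /* stand-in for "array of data" */
    pthread_rwlock_t datalock;
} userobject;

typedef struct serverobject {
    userobject **users;            /* stand-in for the blockobject arrays */
    int nusers;
    pthread_rwlock_t userlock;
} serverobject;

/* hypothetical lookup; the real code walks the blockobjects */
static userobject *getuserfromid(serverobject *srv, int id)
{
    int i;
    for (i = 0; i < srv->nusers; i++)
        if (srv->users[i] && srv->users[i]->id == id)
            return srv->users[i];
    return NULL;
}

/* read path: rlock the server's userlock, find the user, rlock the
   user's datalock, copy the data out */
static int read_user_data(serverobject *srv, int id, int out[16])
{
    userobject *u;

    pthread_rwlock_rdlock(&srv->userlock);
    u = getuserfromid(srv, id);
    if (u == NULL) {
        pthread_rwlock_unlock(&srv->userlock);
        return -1;
    }
    pthread_rwlock_rdlock(&u->datalock);
    pthread_rwlock_unlock(&srv->userlock);   /* userlock dropped before the data
                                                is read; see the race discussed
                                                later in the thread */
    memcpy(out, u->data, sizeof u->data);
    pthread_rwlock_unlock(&u->datalock);
    return 0;
}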

vdl

Dec 12, 2005, 3:14:47 PM
> We'd need more information. Like are the arrays static or dynamically allocated
> somehow? Is the data in the arrays or linked to from the arrays? How is the data
> accessed? Sequentially, random, or what?

data in the userdata arrays is sorted.
userdata arrays are dynamically allocated and, using a fixed step, are
reallocated to store the needed amount of data.

userdata is accessed randomly when searching for a specific value
(binary search) and sequentially (when reading userdata and returning it
to the client)

Joe Seigh

Dec 12, 2005, 4:17:03 PM

Okay. Sounds like hash (?) map of user to sorted arrays of users to
resolve hash collisions. Sorted arrays are reallocated if user is
added or deleted.

Need more information about user data. Is this data that can be
read atomically, i.e. a lot of ints?

Note that there are two data structures that need to be synchronized.
The structure that gets you to the user data, and the user data
itself.

Although lock-free is still a developing technology, the nice thing
about it is that you end up just using locks for the writes which
are few enough that you can get away with just one lock sometimes.

But before we go into any advanced stuff, let's make sure the conventional
solutions have been tried first and that there actually is a problem.
Have you tried one global rwlock and verified there is a lock contention
problem? Also, a global rwlock for the hash map and a per blockobject
rwlock?

vdl

Dec 13, 2005, 2:23:38 AM
tnx for the advice. i'll try using fewer rwlocks, but i suspect that will
slow down the application.

Chris Thomasson

Dec 13, 2005, 2:40:38 AM
> i have 15 worker threads (the count is not proven to be the best one ;)
> read operations are more frequent than write operations
> locks are held for a very small amount of time. typically they should be less
> than 5 msec.
> in a real situation more than 95-99% of operations finish in less than 50 msec

Well, a static array of mutexes:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi-mutex&rnum=1#3ca11e0c3dcf762c


along with a reader-writer solution:

http://home.comcast.net/~vzoom/
http://home.comcast.net/~vzoom/demos/pc_sample.c
http://atomic-ptr-plus.sourceforge.net/


should be 100% compatible with your synchronization needs...


However, it sounds like you might have a race-condition:


>>> i have software which uses a lot (~1 000 000 now) of pthread_rwlock_t
>>> variables.
>>> The program has 15 threads which lock those variables.
>>> After some time (can be 1h, can be two days or two weeks) locking
>>> hangs.


It seems that your locking scheme may be too complex. 1,000,000 rw-mutexes
does not sound too efficient, and could be difficult to properly manage. You
are most likely suffering from a subtle locking hierarchy violation, but I
need to see more of your code... How do you read the data stored in the
userobjects?

I think you should be able to lookup user objects without using any locks. I
also think you could read the data in the user objects without locks by
using two version numbers. A writer would hash the pointer to a user object
into an index of the static array of mutexes and then lock the indexed mutex.
The writer would increment the first version counter; membar; perform its
writes; membar; then increment the last version counter. A reader would
lookup the user; read the userobject's first counter; read the user data;
read the last counter; and compare. If the compare succeeds it means it got
a consistent snapshot of the data and can process it; if it fails the read
needs to be retried. I am not sure if this scheme would be compatible with
your user data; I need to know what it consists of and how you access it
for reading. I need some more detailed info from you in order to really
sketch out an optimal solution.
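A minimal sketch of the version-counter scheme described above (this is not
Chris's actual code; the struct layout is hypothetical and GCC's
__sync_synchronize() stands in for the membars he mentions):

#include <pthread.h>

#define NLOCKS 256

static pthread_mutex_t write_locks[NLOCKS];    /* static array of mutexes;
                                                  each must be initialized with
                                                  pthread_mutex_init() at startup */

typedef struct userobject {
    int id;
    volatile unsigned ver1, ver2;              /* the two version counters */
    int data[16];
} userobject;

/* hash the object pointer into the static mutex array */
static pthread_mutex_t *lock_for(userobject *u)
{
    return &write_locks[((unsigned long)u >> 4) % NLOCKS];
}

/* writer: lock the hashed mutex, bump ver1, write, bump ver2 */
static void write_user(userobject *u, const int src[16])
{
    pthread_mutex_t *m = lock_for(u);
    int i;

    pthread_mutex_lock(m);
    u->ver1++;
    __sync_synchronize();                      /* membar */
    for (i = 0; i < 16; i++)
        u->data[i] = src[i];
    __sync_synchronize();                      /* membar */
    u->ver2++;
    pthread_mutex_unlock(m);
}

/* reader: no lock; snapshot the data and retry if a writer overlapped */
static void read_user(userobject *u, int dst[16])
{
    unsigned v1, v2;
    int i;

    do {
        v1 = u->ver1;
        __sync_synchronize();
        for (i = 0; i < 16; i++)
            dst[i] = u->data[i];
        __sync_synchronize();
        v2 = u->ver2;
    } while (v1 != v2);                        /* mismatch: a write overlapped, retry */
}

The key point is that readers never block: if a writer was active during the
read, the two counters differ and the reader simply tries again.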


--
Chris Thomasson

http://home.comcast.net/~vzoom/
Virtually Zero-Overhead Object Management (VZOOM)


David Schwartz

Dec 13, 2005, 5:04:31 AM

"vdl" <v...@3priedez.net> wrote in message
news:1134458618....@o13g2000cwo.googlegroups.com...

> tnx for the advice. i'll try using fewer rwlocks, but i suspect that will
> slow down the application.

Then something is *very* wrong with the design of the application.

DS


Joe Seigh

Dec 13, 2005, 7:56:36 AM
vdl wrote:
> uff. i'll try to make something ;)
>
> userobject{
> int id;
> array of data;
> datalock;
> }
>
> blockobject{
> array of userobjects;
> }
>
> serverobject{
> array of blockobjects; //users are organized in blocks to easier
> trace/save/load changes in data
> userlock;
> }
>
> getting user from id requires to rlock userlock of serverobject
> adding/removing user from server requires to wlock userlock of
> serverobject
>
> reading data from user requires to get user from serverobject and then
> rlock user's datalock
> changing data of user requires to get user from serverobject and then
> wlock user's datalock
>

I think I figured it out. You're not synchronizing the data structure that
contains/maps to the userobjects. When you change that structure (adding
or deleting userobjects) the storage gets reused and you get a race condition,
a late operation by a thread on the old version of storage, which corrupts
the rwlocks. Though they may not look corrupt since you are essentially getting
misplaced lock/unlock operations, e.g. a lock that is locked when it should
be in the unlocked state resulting in deadlock. Plus your userdata is
corrupt since the mappings are probably off on some of the operations.
But you can't tell there since they're probably just statistical counters.

Put in a global lock for the mapping. You may have scalability issues but
you won't know until you try.

Joe Seigh

Dec 13, 2005, 11:01:58 AM

He is doing the right thing (keeping lock contention low), just the wrong
way (not properly synchronizing access to the global data structure).

What would be the right way without resorting to a global lock or using
lock-free?

Chris Thomasson

Dec 13, 2005, 12:54:33 PM
I made a mistake in the proxy code I posted when I was stripping out some of
my library-specific stuff. The CAS loop in the pc_dec( ... ) function did not
reload the cmp var when a CAS failed.

Here is the fixed code:

http://home.comcast.net/~vzoom/demos/pc_sample.c

Sorry!


David Schwartz

Dec 13, 2005, 2:21:57 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:w5edne0C45Z...@comcast.com...

> David Schwartz wrote:

>> "vdl" <v...@3priedez.net> wrote in message
>> news:1134458618....@o13g2000cwo.googlegroups.com...

>>>tnx for advices.i'll try using less rwlocks, but i suspect that will
>>>slow down application.

>> Then something is *very* wrong with the design of the application.

> He is doing the right thing (keeping lock contention low), just the wrong
> way (not properly synchronizing access to the global data structure).

No, I don't think he's doing the right thing. If his task is not CPU
limited, then lock contention won't noticeably affect performance. If it is,
why does he have so many more threads than CPUs?

> What would be the right way without resorting to a global lock or using
> lock-free?

It's hard to say for sure, but a million locks to control contention
with 15 threads can't be right. Since he has already said the locks are only
held briefly, this is an attempt to reduce contention from .01% to .001%, a
net benefit of .009%. ;)

DS


Joe Seigh

Dec 13, 2005, 2:54:26 PM
David Schwartz wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:w5edne0C45Z...@comcast.com...
>>What would be the right way without resorting to a global lock or using
>>lock-free?
>
>
> It's hard to say for sure, but a million locks to control contention
> with 15 threads can't be right. Since he has already said the locks are only
> held briefly, this is an attempt to reduce contention from .01% to .001%, a
> net benefit of .009%. ;)
>
Well, forget his solution. I suspect the conventional solution is to make
the problem fit the solution. I.e., limit yourself to situations where
a global rwlock works. So you have to be careful to use access patterns that
work well with the rwlock implementation and limit the number of threads
to what the rwlock can handle, as opposed to what the system could handle.

David Schwartz

Dec 13, 2005, 3:11:50 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:I72dnaFXd7P...@comcast.com...

> David Schwartz wrote:

>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:w5edne0C45Z...@comcast.com...

>>>What would be the right way without resorting to a global lock or using
>>>lock-free?

>> It's hard to say for sure, but a million locks to control contention
>> with 15 threads can't be right. Since he has already said the locks are
>> only held briefly, this is an attempt to reduce contention from .01% to
>> .001%, a net benefit of .009%. ;)

> Well, forget his solution. I suspect the conventional solution is to make
> the problem fit the solution. I.e., limit yourself to situations where
> a global rwlock works. So you have to be careful to use access patterns
> that
> work well with the rwlock implementation and limit the number of threads
> to what the rwlock can handle, as opposed to what the system could handle.

If you choose the solution of a global rwlock, which is certainly one of
the easiest solutions, you do have to be careful to use access patterns that
work well with that solution. However, there are a lot of other solutions
out there. There is, however, no one great solution that's always better
than a global rwlock and lets you do whatever you want and still have it
perform.

One possible solution is a global rwlock that's very fast for the
uncontended read case (no writer or waiting writers). Then you convert as
many cases where you would normally grab a writelock to cases where you grab
a read lock instead, followed by a second mutex that protects the structure
you plan to write. This can allow you to do pretty much anything but remove
a global object without holding a writelock.
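A rough sketch of that pattern; the structures are hypothetical, and the
"second mutex" is shown here as a per-object mutex:

#include <pthread.h>

static pthread_rwlock_t global_lock = PTHREAD_RWLOCK_INITIALIZER;

typedef struct userobject {
    int id;
    int data[16];
    pthread_mutex_t datamutex;       /* protects this object's contents */
} userobject;

/* modify one object: the global lock is only read-locked, which keeps the
   object from being removed out from under us; the object's own mutex
   serializes concurrent writers of that object */
static void modify_user(userobject *u, const int src[16])
{
    int i;

    pthread_rwlock_rdlock(&global_lock);
    pthread_mutex_lock(&u->datamutex);
    for (i = 0; i < 16; i++)
        u->data[i] = src[i];
    pthread_mutex_unlock(&u->datamutex);
    pthread_rwlock_unlock(&global_lock);
}

/* only removal from the global structure still needs the write side */
static void remove_user(userobject *u)
{
    pthread_rwlock_wrlock(&global_lock);
    /* ... unlink u from the global structure, then free it ... */
    pthread_rwlock_unlock(&global_lock);
}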

There are many, many other possible patterns and mechanisms that you can
use. Which one is right is a very tricky question. The only really bad thing
to do is to pick a complicated pattern with intricate subtleties when you're
not sure it's the right one.

DS


Michel

Dec 13, 2005, 3:40:38 PM
> I think you should be able to lookup user objects without using any locks. I
> also think you could read the data in the user objects without locks by
> using two version numbers. A writer would hash the pointer to a user object
> into an index of the static array of mutexes and then lock the indexed mutex.
> The writer would increment the first version counter; membar; perform its
> writes; membar; then increment the last version counter. A reader would
> lookup the user; read the userobject's first counter; read the user data;
> read the last counter; and compare. If the compare succeeds it means it got
> a consistent snapshot of the data and can process it; if it fails the read
> needs to be retried. I am not sure if this scheme would be compatible with
> your user data; I need to know what it consists of and how you access it
> for reading. I need some more detailed info from you in order to really
> sketch out an optimal solution.

Sounds like a smart solution that's rather easy to implement and
understand for fairly novice developers. I have a couple of questions.
What are the methods to avoid starvation on the reader side? Wait for
exclusive (write) access after a couple of retried reads (given that the
mutex used is fair)? Is there a smart solution if the user objects are
kept in a non-threadsafe container, and can user objects be added
while other ones are being read?

Regards
/Michel

Peter Dimov

Dec 14, 2005, 4:47:34 PM
vdl wrote:

> userobject{
> int id;
> array of data;
> datalock;
> }
>
> blockobject{
> array of userobjects;
> }
>
> serverobject{
> array of blockobjects; //users are organized in blocks to easier
> trace/save/load changes in data
> userlock;
> }
>
> getting user from id requires to rlock userlock of serverobject
> adding/removing user from server requires to wlock userlock of
> serverobject
>
> reading data from user requires to get user from serverobject and then
> rlock user's datalock
> changing data of user requires to get user from serverobject and then
> wlock user's datalock

Joe Seigh is right, you have a race condition:

Thread A gets a reference to user data for id 5;
Thread B deletes user id 5;
Thread C allocates memory and fills it with -1;
Thread A proceeds to access the user data that's just been overwritten.

This is under the assumption that the "get reference" operation doesn't
leave the userlock read-locked until after A is done with the data
reference.
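In other words, the race goes away if the userlock stays read-locked for the
whole access. A rough sketch, reusing the hypothetical structures from the
earlier read-path sketch:

/* keep the server's userlock read-held until the data access is finished;
   a deleter needs the write side of userlock, so the userobject cannot be
   removed and its storage reused while we are inside */
static int read_user_data_safe(serverobject *srv, int id, int out[16])
{
    userobject *u;
    int rc = -1;

    pthread_rwlock_rdlock(&srv->userlock);
    u = getuserfromid(srv, id);
    if (u != NULL) {
        pthread_rwlock_rdlock(&u->datalock);
        memcpy(out, u->data, sizeof u->data);
        pthread_rwlock_unlock(&u->datalock);
        rc = 0;
    }
    pthread_rwlock_unlock(&srv->userlock);    /* released only after we are done */
    return rc;
}

The cost is that the userlock is held longer, which is exactly the contention
the original design tried to avoid; the shared_ptr (or garbage collection)
approach below trades that hold time for reference counting.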

You can switch to

serverobject
{
array of shared_ptr<userobject>;
userlock;
}

or use some garbage collection scheme.

Chris Thomasson

Dec 17, 2005, 7:43:47 PM
[...]

>
> Sounds like a smart solution that's rather easy to implement and understand
> for fairly novice developers.

Well, luckily, the actual APIs involved can be very straightforward, and
"easy to use". Basically, if you know how to use a rw-mutex, you are likely
to quickly catch on to proxy garbage collection and all of the programming
patterns it's compatible with. As mentioned above, there are patterns for
this stuff that are essentially the same as a rw-mutex. Reader threads would
initially use the following basic logic:


<quickly sketching this out...>


/* 1. enter collected-region */
proxygc_increment( ... );
/* in collected-region; it's now safe to
execute the desired reader algorithm(s) */
proxygc_decrement( ... );


or, if you're using the following collector:

http://home.comcast.net/~vzoom/demos/pc_sample.c

collector = proxygc_increment( ... );
/* in collected-region wrt collector;
it's now safe to execute the desired
reader algorithm(s) */
proxygc_decrement( collector );


or, if you're using VZOOM:

/* in collected-region wrt current sync epoch;
it's now safe to execute the desired
reader algorithm(s) */


Then a typical "reader algorithm" that could tolerate stale data per-node
could do this:

my_node *next, *cur = collection.front;
read_barrier_depends(); /* dd-load */

while ( cur )
{
next = cur->next;
read_barrier_depends(); /* dd-load */
process( cur );
cur = next;
}


or, if you're using VZOOM w/ non-persistent references you could do this:

my_node *next, *cur = vzoom_npref_acq( &collection.front );

while ( cur )
{
next = vzoom_npref_acq( &cur->next );
process( cur );
cur = next;
}


or, if you're using VZOOM w/ persistent references you could do this:

my_node *next, *cur = vzoom_pref_inc( &collection.front );

while ( cur )
{
next = vzoom_pref_inc( &cur->next );
process( cur );
vzoom_pref_dec( cur );
cur = next;
}


A reader algorithm that can not tolerate stale data per-node could look like
this:

int ver1, ver2;
my_node *next, *cur = collection.front;
read_barrier_depends(); /* dd-load */

while ( cur )
{
do
{
next = cur->next;
read_barrier_depends(); /* dd-load */
ver1 = cur->ver1;
read_data_locally( cur );
ver2 = cur->ver2;

} while ( ver1 != ver2 );

process_local_data( ... );

cur = next;
}


or, if you're using VZOOM w/ non-persistent references:

int ver1, ver2;
my_node *next, *cur = vzoom_npref_acq( &collection.front );

while ( cur )
{
do
{
next = vzoom_npref_acq( &cur->next );
ver1 = cur->ver1;
read_data_locally( cur );
ver2 = cur->ver2;
}

while ( ver1 != ver2 );

process_local_data( ... );

cur = next;
}


or, if you're using VZOOM w/ persistent references:

int ver1, ver2;
my_node *next = 0, *cur = vzoom_pref_inc( &collection.front );

while ( cur )
{
do
{
if ( next ) { vzoom_pref_dec( next ); }
next = vzoom_pref_inc( &cur->next );
ver1 = cur->ver1;
read_data_locally( cur );
ver2 = cur->ver2;
}

while ( ver1 != ver2 );

process_local_data( ... );

vzoom_pref_dec( cur );
cur = next;
}

Another reader pattern that is compatible with VZOOM:


"volunantary synchronization (portable)"
-------

int sync, sync_yield = 1000;

/* inc persistent reference(s) */
my_node *cur = vzoom_pref_inc( &shared.data );
[...]

for ( sync = 1 ;; ++sync )
{
/* process cur; the processing logic can acq/rel
all of the persistent references it needs. This
greatly differs from SMR... */
process( cur );

/* when cur no longer needed */
if ( whatever_reason() )
{
/* drop old persistent reference(s)
and acquire a new one(s) */
vzoom_pref_dec( &cur );
[...]
cur = vzoom_pref_inc( &shared.data );
[...]
}

/* dynamic ratio of reads:syncs */
if ( ! ( sync % sync_yield ) )
{
/* synchronize the epoch. Any previously
acquired persistent reference(s) can
safely exist before, during, and after
successive synchronized epochs */
vzoom_sync( ... );
}
}

/* drop old persistent references */
vzoom_pref_dec( &cur );
[...]


or,

"involunantary synchronization (non-portable)"
---------

/* inc persistent reference(s) */
my_node *cur = vzoom_pref_inc( &shared.data );
[...]

for ( ;; )
{
/* process cur; the processing logic can acq/rel
all of the persistent reference(s) it needs. This
greatly differs from SMR and/or RCU... */
process( cur );

/* when cur no longer needed */
if ( whatever_reason() )
{
/* drop old persistent reference(s)
and acquire a new one */
vzoom_pref_dec( &cur );
[...]
cur = vzoom_pref_inc( &shared.data );
[...]
}
}

/* drop old persistent reference(s)*/
vzoom_pref_dec( &cur );
[...]


A writer algorithm that does not care about stale data could look like this:

/* producer */
my_node *new = get_new_node();
pthread_mutex_lock( ... );
/* release membar */
/* make new node accessable */
pthread_mutex_unlock( ... );


/* consumer */
my_node *node;

pthread_mutex_lock( ... );
/* load ptr to node */
/* dd-load membar */
/* make node unaccessable */
pthread_mutex_unlock( ... );

/* send node through collector */
defer( node );


A writer algorithm that does care about stale data could look like this:

/* modify */
pthread_mutex_lock( ... );
/* load ptr to node */
/* inc node->ver1 */
/* store/store */
/* modify node */
/* inc node->ver2 */
pthread_mutex_unlock( ... );


/* consumer */
my_node *node;

pthread_mutex_lock( ... );
/* load ptr to node */
/* dd-load membar */
/* make node unaccessable */
pthread_mutex_unlock( ... );

/* send node through collector */
defer( node );


etc... etc... etc...

;)


> I have a couple of questions. What are the methods to avoid starvation on
> the reader side? Wait for exclusive (write) access after a couple of
> retried reads (given that the mutex used is fair)?

A rw-mutex usually makes readers block/wait for all "prior writes" and the
"last write broadcasts". A write usually blocks/waits for all "prior
reads/writes", and the "last read signals/last write broadcasts"... Please
note that well-built proxy garbage collectors do not suffer from any
starvation because reads and writes are not mutually exclusive...


> Is there a smart solution if the user objects are kept in a non-threadsafe
> container

^^^^^^^^^^^^^^

Humm, the container would have to use mutexes or something else in order to
enforce data coherency. Some other distributed designs can incorporate
"non-threadsafe" containers into their designs... Please clarify what you
mean...


> and can user objects be added while other ones are being read?

Yes. You can even use mutexes with a lock-free solution!

:)

vdl

Dec 18, 2005, 3:51:02 AM
when deleting users i just mark them for deletion - they stay
physically for another 24h until they are actually removed - so it should not
be this case. also after getuserfromid i always check whether i'm getting
NULL.

vdl

Dec 18, 2005, 3:56:15 AM
also i found one buffer overflow which was trashing the stack. so i suspect
there is still one (at least one ;) left which overwrites the lock status.
