
Hazard Pointers w/o memory barrier.


Joe Seigh

Apr 19, 2005, 8:32:13 AM
You can get rid of the expensive memory barrier when you set hazard pointers
by combining it with a version of RCU that uses context switches (voluntary
and involuntary), signals, syscalls, or anything else that causes processor
serialization as quiescent states. You delay checking the hazard pointers
until you reach an RCU checkpoint (i.e. all threads have quiesced at least
once). If you're using polling for SMR/hazard pointers, this won't cause
any appreciable extra delay.
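As a rough sketch of the combined check (the function name and data layout
here are my own illustration, not from any particular implementation): a
retired node becomes freeable only once every thread's quiesce count has
advanced since the node was retired, and no hazard pointer still references
it at scan time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reclamation check for the combined SMR+RCU scheme.
   quiesce_at_retire[i] is thread i's quiesce count when the node was
   retired; quiesce_now[i] is its current count; hazards[i] is thread
   i's published hazard pointer. */
int can_free(const void *node,
             const long *quiesce_at_retire,
             const long *quiesce_now,
             const void *const *hazards,
             size_t nthreads)
{
    size_t i;

    /* Every thread must have quiesced at least once since the retire;
       the processor serialization at those points is what makes the
       hazard pointer stores visible without a memory barrier. */
    for (i = 0; i < nthreads; i++)
        if (quiesce_now[i] == quiesce_at_retire[i])
            return 0;

    /* Only now is it safe to check the hazard pointers. */
    for (i = 0; i < nthreads; i++)
        if (hazards[i] == node)
            return 0;

    return 1;
}
```

The poller would run this over its retired-node list each time it observes
a full quiescent period.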

The processor serialization from the RCU quiesce points ensures that the
hazard pointer operations are seen in the correct order without requiring
a memory barrier. However, you still need to ensure the compiler orders
the instructions correctly. This will probably require inline assembler.
On pipelined architectures, setting a hazard pointer is pretty close to a
raw pointer load, about 2x on PowerPC.
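A minimal sketch of that publish sequence, assuming GCC-style inline
assembler for the compiler-only barrier (names are illustrative; the
cross-CPU ordering is assumed to come from the quiescent states, not
from hardware fences):

```c
#include <assert.h>

/* Compiler-only barrier: stops the compiler from reordering the hazard
   store past the recheck load; emits no hardware fence. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static void * volatile g_shared;   /* the global pointer being read  */
static void * volatile g_hazard;   /* this thread's hazard slot      */

void *acquire_hazard(void)
{
    void *p;
    do {
        p = g_shared;              /* raw load of the global pointer */
        g_hazard = p;              /* publish as a hazard pointer    */
        compiler_barrier();        /* store stays before the recheck */
    } while (p != g_shared);       /* retry if the pointer moved     */
    return p;
}

void release_hazard(void)
{
    g_hazard = 0;                  /* clear the slot after use       */
}
```

This is the usual load/store/recheck loop minus the hardware membar
between the store and the recheck.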

On Solaris and possibly AIX, you can use /proc to get context switch counts
for threads to use as quiesce points. For other Unixes and Linux, you will
need to periodically signal each thread to execute a signal handler that
increments a signal count to use as a quiesce point. For Windows, there's no
asynchronous signal facility or /proc, but there's a hack that will work.
I won't disclose it since I don't support Windows and I want to see how long
it takes someone else to figure it out.

If you can determine thread state, i.e. running or not running, you can use
the not-running state as a quiesce state. This takes care of some forward
progress problems RCU has with threads that never quiesce because they've
been indefinitely suspended.

The advantage of this over pure RCU is that it allows you to hold local
references for indefinite periods of time. With RCU you have to keep
the references short since RCU is a proxy GC and a thread not quiescing
holds up GC for everything.

Comparison with deferred reference counting: deferred reference counting
has the problem that in C/C++ it has no control over how local references
are expressed and used, unlike in Java VMs. Although not likely, a pointer
doesn't have to look like a pointer to something; it could be an offset or
a hash of something. More problematic is a pointer put into a locally
owned heap object. A thread-stack-only scan would never catch this
since the reference isn't on the stack. And although a stack GC scan is
cheaper than scanning the heap plus stacks, it's still expensive. Hazard
pointers get around this problem by explicitly and formally declaring local
references/usage.

There's an alternative method of pointer usage declaration where you'd
simply store the pointer into the hazard pointer without the reload and
recheck, and at scan time check the stack frame register save
area as well as the hazard pointers. This is cheaper than scanning
the whole stack, but since the hazard pointer reload and recheck are very
cheap on a pipelined CPU to begin with, I don't think it's worth it.

On Z architecture you could use an MVC instruction to store the global
pointer into the hazard pointer and then load from the hazard pointer
rather than using the usual hazard pointer logic, since the MVC instruction
isn't interruptible. I don't know if it's any faster, but you could do it.

I think that's it.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Chris Thomasson

Apr 19, 2005, 1:31:58 PM
> On Solaris and possibly Aix, you can use /proc to get context switching
> counts
> for threads to use as quiesce points. For other unices and Linux, you
> will need
> to periodically signal each thread to execute a signal handler that
> increments
> a signal count to use as quiesce points. For windows, there's no
> asynchronous
> signal facility or /proc, but there's a hack that will work. I won't
> disclose it since
> I don't support windows and I want to see how long it takes someone else
> to figure it out.

You can "probably" use GetThreadTimes and check the kernel time for
per-thread changes. Here is the scheme I have in mind:


typedef struct per_thread_
{
    struct per_thread_ *next;

    HANDLE thread;

    LONG old_q; // = 0
    LONG new_q; // = 0

    FILETIME old_ktime;
    FILETIME new_ktime;

} per_thread_t;


static per_thread_t *g_rcu_threads = 0;
static LONG g_rcu_period = 0;
static HANDLE g_rcu_poll; /* auto-reset */
static CRITICAL_SECTION g_rcu_mutex;


VOID ScanForQuiescentState()
{
    per_thread_t *t = g_rcu_threads;
    FILETIME c, x, u;

    while ( t )
    {
        if ( t->old_q == t->new_q )
        {
            GetThreadTimes
                ( t->thread,
                  &c,
                  &x,
                  &t->new_ktime,
                  &u );

            if ( t->new_ktime.dwLowDateTime !=
                     t->old_ktime.dwLowDateTime ||
                 t->new_ktime.dwHighDateTime !=
                     t->old_ktime.dwHighDateTime )
            {
                ++t->new_q;
            }
        }

        t = t->next;
    }
}


BOOL ScanForQuiescentPeriod()
{
    BOOL period = TRUE;
    per_thread_t *t = g_rcu_threads;

    while ( t )
    {
        if ( t->old_q == t->new_q )
        {
            period = FALSE;
        }

        t = t->next;
    }

    return period;
}


VOID HandleQuiescentPeriod()
{
    per_thread_t *t = g_rcu_threads;

    while ( t )
    {
        assert( t->old_q != t->new_q );

        // xchg (t)'s callback lists, and fire them

        // reset
        t->old_ktime = t->new_ktime;
        t->old_q = t->new_q;

        t = t->next;
    }

    ++g_rcu_period;
}


The polling thread would do something like this:


for ( ;; )
{
    WaitForSingleObject( g_rcu_poll, 500 );

    EnterCriticalSection( &g_rcu_mutex );

    ScanForQuiescentState();

    if ( ScanForQuiescentPeriod() )
    {
        // All threads have been through quiescent state
        HandleQuiescentPeriod();
    }

    LeaveCriticalSection( &g_rcu_mutex );
}


What do ya think Joe?

;)


Chris Thomasson

Apr 19, 2005, 1:34:36 PM
> BOOL ScanForQuiescentPeriod()
> {
> BOOL period = TRUE;
> per_thread_t *t = g_rcu_threads;
>
> while ( t )
> {
> if ( t->old_q == t->new_q )
> {
> period = FALSE;
> }
>
> t = t->next;
> }


return period;


> }


:)


Joe Seigh

Apr 19, 2005, 3:24:28 PM
On Tue, 19 Apr 2005 10:31:58 -0700, Chris Thomasson <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote:

>> On Solaris and possibly Aix, you can use /proc to get context switching
>> counts
>> for threads to use as quiesce points. For other unices and Linux, you
>> will need
>> to periodically signal each thread to execute a signal handler that
>> increments
>> a signal count to use as quiesce points. For windows, there's no
>> asynchronous
>> signal facility or /proc, but there's a hack that will work. I won't
>> disclose it since
>> I don't support windows and I want to see how long it takes someone else
>> to figure it out.
>
> You can "probably" use GetThreadTimes and check the kernel time for
> per-thread changes. Here is the scheme I have in mind:
>
>

[...]


>
>
>
> What do ya think Joe?
>
> ;)
>
>

Pretty quick. That's what I spent part of yesterday doing: verifying that
Windows had a virtual thread timer that was updated on the fly. Plus
some time verifying that some Linux and Unix functions were totally useless.

Chris Thomasson

Apr 19, 2005, 5:31:10 PM
>> You can "probably" use GetThreadTimes and check the kernel time for
>> per-thread changes. Here is the scheme I have in mind:
>>
>>
> [...]
>>
>>
>>
>> What do ya think Joe?
>>
>> ;)
>>
>>
>
> Pretty quick. That's what I spent part of yesterday doing. Verifying
> that
> Windows had a virtual thread timer that was updated on the fly.

http://blog.kalmbachnet.de/?postid=28

:O


Chris Thomasson

Apr 19, 2005, 5:33:28 PM

Joe Seigh

Apr 19, 2005, 6:35:29 PM
On Tue, 19 Apr 2005 14:31:10 -0700, Chris Thomasson <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote:
>>>
>>
>> Pretty quick. That's what I spent part of yesterday doing. Verifying
>> that
>> Windows had a virtual thread timer that was updated on the fly.
>
> http://blog.kalmbachnet.de/?postid=28
>

I was wondering about that. It only happens with Sleep(). If you do
SwitchToThread() the times are updated.

Chris Thomasson

Apr 20, 2005, 1:45:30 PM
>>> Pretty quick. That's what I spent part of yesterday doing. Verifying
>>> that
>>> Windows had a virtual thread timer that was updated on the fly.
>>
>> http://blog.kalmbachnet.de/?postid=28
>>
>
> I was wondering about that. It only happens with Sleep(). If you do
> SwitchToThread() the times are updated.

Humm, need to try that.

You could probably also detect quiescent states by counting the per-thread
context switches using the performance APIs...


Chris Thomasson

Apr 21, 2005, 2:24:35 PM
> You could probably also detect quiescent states by counting the
> context-switches per-thread using the performance api's...

Arggfhgh!

You need to write a device driver in order to tap into the "context-swap"
hook like Russinovich uses in PMon.

http://www.sysinternals.com/ntw2k/freeware/pmon.shtml


Joe Seigh

Apr 21, 2005, 8:01:04 PM
On Tue, 19 Apr 2005 08:32:13 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

> On Z architecture you could use a MVC instruction to store the global
> pointer into the hazard pointer and then load from the hazard pointer
> rather than the usual hazard pointer logic, since the MVC instruction
> isn't interruptable. I don't know if it's any faster but you could do it.

Actually, it's not faster if Z arch has the same memory model as 390 arch
with respect to a store followed by a load from the same location: the
store acts as a memory barrier, since loads had to return values from memory.

Joe Seigh

Apr 22, 2005, 6:48:08 AM

You can scan processors for quiescent states instead of threads. It will
work just as well.

Use NtQuerySystemInformation() to get SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION
or SYSTEM_INTERRUPT_INFORMATION for system times or interrupts on a
per-processor basis. It may be obsolete, though. The replacement for the
system times is GetSystemTimes(), which is aggregated and useless. I have
to find where the Server 2003 stuff moved to and see what functions are in
there.

Joe Seigh

Apr 22, 2005, 9:25:08 AM
On Fri, 22 Apr 2005 06:48:08 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

>>
> You can scan processors for quiescent states instead of threads. It will
> work just as good.
>
> Using NtQuerySystemInformation() to get SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION
> or SYSTEM_INTERRUPT_INFORMATION for system times or interrupts on a per
> processor basis. It may be obsoleted though. The replacement for the
> system times is GetSystemTimes() which is aggregated and useless. I have
> to find where the server 2003 stuff moved to and see what functions are in
> there.
>
>

It looks like Linux has timing broken out per CPU in /proc/stat, at least
on my system. The man pages aren't updated for this.
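A small sketch of reading that per-cpu breakout, assuming a Linux
/proc/stat in the usual format (the function name is mine):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Each "cpuN" line in /proc/stat carries that cpu's time counters; a
   change between two samples can serve as a quiesce indication for
   that cpu.  Here we just count how many per-cpu lines are present. */
int count_proc_stat_cpus(void)
{
    FILE *f = fopen("/proc/stat", "r");
    char line[512];
    int cpus = 0;

    if (!f)
        return -1;                 /* not Linux, or /proc not mounted */

    while (fgets(line, sizeof line, f)) {
        /* per-cpu lines look like "cpu0 123 0 456 ...";
           the aggregate line is just "cpu ..." and is skipped */
        if (strncmp(line, "cpu", 3) == 0 && line[3] >= '0' && line[3] <= '9')
            cpus++;
    }

    fclose(f);
    return cpus;
}
```

A poller would keep the previous sample per cpu and compare the fields,
the same way the Windows code above compares FILETIMEs per thread.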

OS X I can't tell yet. OS X is pretty messy with a Mach microkernel, a
BSD kernel, and legacy crap from OS 9. Looks like the garbage in, garbage
out school of design.

Joe Seigh

Apr 22, 2005, 4:45:28 PM

OS X thread times (using the Mach interface) seem to be busted in
the same manner.

Joe Seigh

May 2, 2005, 8:35:29 PM
Finally getting around to modifying my current RCU
implementation to see how fast this will run. Just
taking the memory barriers out gave a 3x improvement
in read access throughput. At that rate that can be
considered as effectively zero overhead. About a 3x
slowdown in write access but that's because I don't
have the SMR logic in yet plus I have to stay with a
slightly crippled version of RCU to avoid imperial
entanglement. Adding the SMR logic should give me a
dramatic improvement in write throughput.



Chris Thomasson

May 2, 2005, 10:07:37 PM
> Finally getting around to modifying my current RCU
> implementation to see how fast this will run. Just
> taking the memory barriers out gave a 3x improvement
> in read access throughput. At that rate that can be
> considered as effectively zero overhead.

Yes it can. I have already been experimenting with it for a couple of days
and am extremely pleased with the initial results. The thing that makes me
a bit uneasy with my Windows (and some other OSes') implementation is the
dependence on undocumented functions. All of my code that uses this form of
garbage collection could suddenly stop working when the next service pack
gets installed...

;)


> About a 3x
> slowdown in write access but that's because I don't
> have the SMR logic in yet plus I have to stay with a
> slightly crippled version of RCU to avoid imperial
> entanglement.

We aren't the programmers they're looking for... ;)

Seriously though, this is why I don't provide the public with my user-space
rcu ( URCU ) implementation. Did you inform McKenney that you were posting
your algorithm? I am not sure if I need to ask permission...


> Adding the SMR logic should give me a
> dramatic improvement in write throughput.

Yeah. Original SMR is well suited for high write throughput because of the
guarantees it puts on the way nodes are deleted. This new URCU_SMR thing
can't seem to handle frequent and persistent writes as well as the original.
It caused some problems in a couple of my algorithms. It's probably a good
idea to export two versions of SMR...

:)


Joe Seigh

May 2, 2005, 10:40:06 PM
On Mon, 2 May 2005 19:07:37 -0700, Chris Thomasson <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote:

>> Finally getting around to modifying my current RCU
>> implementation to see how fast this will run. Just
>> taking the memory barriers out gave a 3x improvement
>> in read access throughput. At that rate that can be
>> considered as effectively zero overhead.
>
> Yes it can. I have already been experimenting with it for a couple of days
> and am extremely pleased with the initial results. The thing that makes me a
> bit uneasy with my windows ( and some other os's ) implementation is the
> dependence on undocumented functions. All of my code that uses this form of
> garbage collection could suddenly stop working when the next service pack
> gets installed...
>

You could always have a fallback plan of putting the membar back in that
case. It would run slower, but that's Microsoft's problem, especially
since it would be their platform running worse than their competitors'
platforms.

>
>
>
>> About a 3x
>> slowdown in write access but that's because I don't
>> have the SMR logic in yet plus I have to stay with a
>> slightly crippled version of RCU to avoid imperial
>> entanglement.
>
> We aren't the programmers there looking for... ;)
>
> Seriously though, this is why I don't provide the public with my user-space
> rcu ( URCU ) implementation. Did you inform McKenney that you were posting
> your algorithm? I am not sure if I need to ask permission...

McKenney and Michael both know about it. The problem is you have to stick
with an implementation based on the first patent, which is the one in the
public domain. The claims were more narrow than they might have been because
we believed they had to be tied to the actual methods in the patent.
Not really a problem if you're monitoring per-processor/thread counters
or clocks, since that's probably how you'd have to do it, but with a common
counter you could speed up the polling somewhat, and that wouldn't strictly
be covered by the first patent.


>
>
>
>
>> Adding the SMR logic should give me a
>> dramatic improvement in write throughput.
>
> Yeah. Original SMR is well suited for high write throughput because of the
> guarantees it puts on the way nodes are deleted. This new URCU_SMR thing
> can't seem to handle frequent and persistent writes as good as the original.
> It caused some problems in a couple of my algorithms. Its probably a good
> idea to export two versions of SMR...
>

If you have a really high write rate then you would probably be better off
with a mutex or one of the write lock-free algorithms. I'm not claiming
the read lock-free stuff is good for all situations.


--

Joe Seigh

May 3, 2005, 7:34:44 AM
On Mon, 2 May 2005 19:07:37 -0700, Chris Thomasson <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote:

>> About a 3x
>> slowdown in write access but that's because I don't
>> have the SMR logic in yet plus I have to stay with a
>> slightly crippled version of RCU to avoid imperial
>> entanglement.
>
> We aren't the programmers there looking for... ;)
>
> Seriously though, this is why I don't provide the public with my user-space
> rcu ( URCU ) implementation. Did you inform McKenney that you were posting
> your algorithm? I am not sure if I need to ask permission...
>

Actually, thinking about it, I'd stay away from RCU until IBM makes its
intentions a little more clear. IBM does have this Linux strategy. It's
not a problem for me since I'm not going to publish any code; I'm just
experimenting with it. If you've been following any of the C++ smart
pointer / shared pointer discussions, you'd realize that nothing short
of a scheme that can GC everything with zero overhead in a multi-threaded
environment, with no scalability problems, and that can deal with untracked
raw pointers will satisfy anyone. Until everyone becomes more realistic, and
I don't know how long that will take since most of them don't appear to
have even started on their multi-threaded learning curves, there's not
much use trying to create a general usage smart pointer.

Back to IBM again. They don't seem to publicize any projects that are in
the works, so it's hard to tell what they might be up to here. I would
guess they might try to put this in the Linux kernel, since RCU is already
in place and the hazard pointer stuff would solve some problems running
with preemption enabled and some problems with resource granularity that
a pure proxy GC method would have. As far as user space is concerned,
they might try to restrict its usage to Linux as far as it fits in with
their Linux strategy. That could be a problem, however, since a lot of
open source projects are platform independent and aren't going to adopt
non-portable APIs.

You might run something by IBM's lawyers just to see what they might say.
If you phrase the questions right, you might learn something about IBM's
intentions, but being lawyers they'd probably catch on pretty quick. You'd
have more of a chance if you restricted your stuff to Linux, though I don't
know if the GPL will allow patent restrictions of that kind. I don't know
how they got RCU in the Linux kernel in the first place like they did.


--

Joe Seigh

May 3, 2005, 7:13:33 PM
On Mon, 02 May 2005 20:35:29 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

> Finally getting around to modifying my current RCU
> implementation to see how fast this will run. Just
> taking the memory barriers out gave a 3x improvement
> in read access throughput. At that rate that can be
> considered as effectively zero overhead. About a 3x
> slowdown in write access but that's because I don't
> have the SMR logic in yet plus I have to stay with a
> slightly crippled version of RCU to avoid imperial
> entanglement. Adding the SMR logic should give me a
> dramatic improvement in write throughput.

Hmm... signals thrash things up quite a bit. Things
slow way down unless I put the polling interval up to
300 msec to spread the signaling out a bit. Then I
get about 1,286,000 reads/sec/thread for RCU+SMR vs.
176,000 reads/sec/thread for RCU with the same polling
interval. It's a little over 700 reads/sec/thread for
mutexes or rwlocks.

Messing with signals doesn't look like it's worth the trouble, especially
given that the scheduler artifacts from signaling differ from platform to
platform. Looks like it will have to be platform-specific code and no
common reference implementation. If it got done, hypothetically speaking.


--

Joe Seigh

May 3, 2005, 10:31:49 PM
On Tue, 03 May 2005 19:13:33 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

> Hmm... signals thrash things up quite a bit. Things
> slow way down unless I put the polling interval up to
> 300 msec to spread the signaling out a bit. Then I
> get about 1,286,000 reads/sec/thread for RCU+SMR vs.
> 176,000 reads/sec/thread for RCU with the same polling
> interval. It's a little over 700 reads/sec/thread for
> mutexes or rwlocks.

Ah wait, I must have been looking at the write rate. It
gets well over 500,000 reads/sec/thread at the default
poll interval, 50 msec. That's more reasonable. Now
to try it on Linux to see how it works there. Linux
signaling is preemptive I believe.

Chris Thomasson

May 4, 2005, 1:25:18 AM
>> Yes it can. I have already been experimenting with it for a couple of
>> days
>> and am extremely pleased with the initial results. The thing that makes
>> me a
>> bit uneasy with my windows ( and some other os's ) implementation is the
>> dependence on undocumented functions. All of my code that uses this form
>> of
>> garbage collection could suddenly stop working when the next service pack
>> gets installed...
>>
>
> You could always have a fall back plan of putting the membar back in
> that case. It would run slower but that's Microsoft's problem
> especially since it would be their platform running worse than their
> competitor's platforms.

Yeah. So far, Windows makes it relatively easy to detect quiescent periods.


>>> slightly crippled version of RCU to avoid imperial
>>> entanglement.
>>
>> We aren't the programmers there looking for... ;)
>>
>> Seriously though, this is why I don't provide the public with my
>> user-space
>> rcu ( URCU ) implementation. Did you inform McKenney that you were
>> posting
>> your algorithm? I am not sure if I need to ask permission...
>
> McKenney and Michael both know about it. The problem is you have to stick
> with a implementation based on the first patent that is in the public
> domain. The claims were more narrow than they might have been because
> we believed they had to be tied to the actual methods in the patent.

What were the claims for the patented version of RCU? Was it the algorithm
that used two counters and three rcu-batch lists nxtlist->curlist->donelist
per-cpu?


> Not really a problem if you're monitoring per processor/thread counters
> or clocks since that's probably how you'd have to do it but with a commom
> counter you could speed up the polling somewhat but it wouldn't strictly
> be covered by the first patent.

Seems that tracking per-cpu counters "and" keeping the batches on per-cpu
lists would infringe on the patent. Humm... My implementation differs from
this, but not much.


>> Yeah. Original SMR is well suited for high write throughput because of
>> the
>> guarantees it puts on the way nodes are deleted. This new URCU_SMR thing
>> can't seem to handle frequent and persistent writes as good as the
>> original.
>> It caused some problems in a couple of my algorithms. Its probably a good
>> idea to export two versions of SMR...
>>
> If you have a really high write rate then you would probably be better off
> with a mutex or one of the write lock-free algorithms.

Unfortunately, using a mutex slows down every test and the kernel usage goes
off the charts. The specific algorithm that this caused problems with is the
lock-free stack that AppCore exports. It's heavily optimized for a high
number of writes. It allocates its nodes internally from a three-part cache:

per_thread->lock-free_lifo->malloc/free


It frees the nodes into a two-part cache:

lock-free_lifo->smr_collector


So a write is an mfence->cmpxchg8b and the cache is a cmpxchg8b. No kernel
involved. When I put the stack under high load ( 5,000,000-10,000,000
push->pop cycles through 32-64 threads ) the smr_collector gets used quite
frequently. The original SMR algo has an upper bound on the number of
( batch_size ) pending nodes, which makes everything work fine. URCU-SMR
batch sizes "can" be volatile under load, and the memory usage started to
grow and grow under stress testing. I put in some code to alter the polling
rate dynamically and auto-adjust to the current number of pending objects,
which helped out, but I could still overload the URCU-SMR system under
artificial high load.


Here is the actual code that does not seem to like URCU-SMR under high-load:

http://appcore.home.comcast.net/appcore/include/cpu/i686/ac_i686_stack_mpmc_h.html

http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_stack_mpmc_c.html

http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html
( np_ac_i686_lfgc_smr_stack_mpmc_pop_dwcas located near the top of the
page )


> I'm not claiming
> the read lock-free stuff is good for all situations.

Yeah, I know. I just felt the need to ( nit ) point out that little caveat,
since it can directly affect my library in an adverse way...

;)


Joe Seigh

May 4, 2005, 6:53:52 AM
On Tue, 3 May 2005 22:25:18 -0700, Chris Thomasson <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote:

>
> What were the claims for the patented version of RCU? Was it the algorithm
> that used two counters and three rcu-batch lists nxtlist->curlist->donelist
> per-cpu?
>

The general technique is to defer freeing objects until every thread has
passed through a quiescent state at least once. The first patent, which
is the only one in the public domain, was tied to a specific implementation
in the VM operating system, and the claims were limited to that
implementation. We didn't even think we invented the general technique,
so it didn't occur to us to mention it except as the prior art mentioned
in the patent, i.e. the technique MVS used to purge the TLBs in a
multiprocessor configuration. We didn't mention other techniques that
occurred to us at the time (e.g. using timestamps) as part of the patent
analysis (what techniques could be used to get around the patent) since
we didn't believe at the time they were as good, and they would have added
code to a performance sensitive part of VM. Plus doing a patent is a lot
of work and we didn't feel like doing a second one. The first patent used
processor-local events which had to be observed to occur twice in order
to infer that the processor had passed through a quiescent state at least
once since you started a particular deferred free. Actually, when we
initially came up with the idea we only thought of the at-least-once bit;
we didn't realize the indirect events had to be observed twice until we
did the actual implementation which the patent was based upon.

If you use events that are dependent on the deferred free, e.g.
timestamps, a common counter, or synchronized observation of processor
state, then you only need to observe the event once to correctly infer
that a processor has passed through a quiescent state. This can be used
to cut the average delay time in half, or in combining-tree logic. The
technique is not mentioned explicitly in the first patent and cannot be
assumed to be in the public domain. The exception is the processor state,
which, while not mentioned in the first patent, was in the first
implementation, to handle processors in wait state.

There's multiple patents on this. Only the first one is in the
public domain. You'd have to make sure your implementation is
covered by that if you don't want to get permission from IBM.

>
>
>
>> Not really a problem if you're monitoring per processor/thread counters
>> or clocks since that's probably how you'd have to do it but with a commom
>> counter you could speed up the polling somewhat but it wouldn't strictly
>> be covered by the first patent.
>
> Seems that tracking per-cpu counters "and" keeping the batches on per-cpu
> lists would infringe on the patent. Humm... My implementation differs from
> this, but not much.

The per-cpu counters are events and are logically equivalent to the event
flags used in the first patent (events w/ a single writer vs. eventcounts),
but I suppose one could be picky here.

>
>
>
>
>>> Yeah. Original SMR is well suited for high write throughput because of
>>> the
>>> guarantees it puts on the way nodes are deleted. This new URCU_SMR thing
>>> can't seem to handle frequent and persistent writes as good as the
>>> original.
>>> It caused some problems in a couple of my algorithms. Its probably a good
>>> idea to export two versions of SMR...
>>>
>> If you have a really high write rate then you would probably be better off
>> with a mutex or one of the write lock-free algorithms.
>
> Unfortunally, using a mutex slows down every test and the kernel usage goes
> off the charts. The specific algorithm that this caused problems with is the
> lock-free stack that AppCore exports. It's heavily optimized for high number
> of writes. It allocates its nodes internally from a three-part cache:
>

[...]

I use a semaphore to limit how much resources the writers can tie up at
any one time. Seems preferable to running out of free memory.
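A minimal sketch of that semaphore bound, assuming POSIX semaphores (the
bound and the function names are illustrative, not from any particular
library):

```c
#include <assert.h>
#include <semaphore.h>

/* Writers acquire a slot before deferring a node for reclamation; the
   collector releases the slot once the node is actually freed.  This
   caps how much memory pending frees can tie up. */
#define MAX_PENDING 2

static sem_t g_slots;

void defer_init(void)
{
    sem_init(&g_slots, 0, MAX_PENDING);
}

/* Returns 1 if the node may be deferred now, 0 if the writer must wait
   (or help the collector) because too many frees are already pending. */
int try_defer_node(void)
{
    return sem_trywait(&g_slots) == 0;
}

/* Called by the collector after actually freeing a deferred node. */
void defer_done(void)
{
    sem_post(&g_slots);
}
```

A blocking writer would use sem_wait() instead of sem_trywait() to throttle
itself until the collector catches up.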

Joe Seigh

May 4, 2005, 7:00:24 AM
On Tue, 03 May 2005 22:31:49 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

> On Tue, 03 May 2005 19:13:33 -0400, Joe Seigh <jsei...@xemaps.com> wrote:
>
>> Hmm... signals thrash things up quite a bit. Things
>> slow way down unless I put the polling interval up to
>> 300 msec to spread the signaling out a bit. Then I
>> get about 1,286,000 reads/sec/thread for RCU+SMR vs.
>> 176,000 reads/sec/thread for RCU with the same polling
>> interval. It's a little over 700 reads/sec/thread for
>> mutexes or rwlocks.
>
> Ah wait, I must have been looking at the write rate. It
> gets well over 500,000 reads/sec/thread at the default
> poll interval, 50 msec. That's more reasonable. Now
> to try it on Linux to see how it works there. Linux
> signaling is preemptive I believe.
>

Mystery cleared up. I was getting read rates of
9 reads/sec/thread on occasion. It seems that
even though I'm using a barrier to start the readers
and writers all at once, sometimes the writers can
get way ahead of the readers. Increasing the duration
averaged things out better. The read rate works out
to be around 1,500,000+ reads/sec/thread in that case.

Joe Seigh

May 4, 2005, 3:17:07 PM
On Wed, 04 May 2005 07:00:24 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

>>
>> Ah wait, I must have been looking at the write rate. It
>> gets well over 500,000 reads/sec/thread at the default
>> poll interval, 50 msec. That's more reasonable. Now
>> to try it on Linux to see how it works there. Linux
>> signaling is preemptive I believe.
>>
>
> Mystery cleared up. I was getting read rates of
> 9 reads/sec/thread on occasion. It seems that
> even though I'm using a barrier to start the readers
> and writers all at once, sometimes the writers can
> get way ahead of the readers. Increasing the duration
> averaged things out better. The read rates works out
> to be around 1,500,000+ reads/sec/thread in that case.
>
>

It's a shock going back to Linux with its poor throughput
rate compared to OS X. And signals on Linux definitely
slow things down. So we can forget about a common
reference implementation. Platform specific implementations
are the way to go.

Nothing to see here. Move along. :)

Chris Thomasson

May 5, 2005, 3:50:27 AM
>> What were the claims for the patented version of RCU? Was it the
>> algorithm
>> that used two counters and three rcu-batch lists
>> nxtlist->curlist->donelist
>> per-cpu?
>>
> The general technique is to defer freeing objects until every thread has
> passed through a quiescent state at least once.

That seems fairly generic. Does a patent for this kind of stuff need to
drill down on "specific" quiescent states, or can they actually get away
with highly abstract claims?


> The first patent, which
> is the only one in public domain, was tied to a specific implementation
> in the VM operating system and the claims were limited to that
> implementation.

> We didn't mention other techniques that occurred to us at the
> time (e.g. using timestamps) as part of the patent analysis (what
> techniques could be used to get around the patent), since we didn't
> believe at the time they were as good, and they would have added
> code to a performance-sensitive part of VM. Plus doing a patent is
> a lot of work and we didn't feel like doing a second one. The first patent
> used processor local events which had to be observed to occur twice
> in order to infer that the processor had passed through a quiescent
> state at least once since you started a particular deferred free.

Ahhh.


> Actually when we initially came up with the idea we only thought
> of the at least once bit, we didn't realize the indirect events
> had to be observed twice until we did the actual implementation
> which the patent was based upon.
>
> If you use events that are dependent on the deferred free, e.g.
> timestamps, a common counter, or synchronized observation of processor
> state, then you only need to observe the event once to correctly infer
> that a processor has passed thru a quiescent state.

I see. My implementation does something like this. Its polling algorithm
only has to observe a "certain" state of a per-thread and/or per-cpu
data structure once in order to detect a quiescent state. The algorithm
detects quiescent states/grace periods by sampling information from
threads and/or cpus.


> There's multiple patents on this. Only the first one is in the
> public domain. You'd have to make sure your implementation is
> covered by that if you don't want to get permission from IBM.


Are these some of them:

6,219,690
5,727,209
5,608,893
5,442,758

I really need to examine each of these. I hope they don't tie up every damn
technique!


> The per-cpu counters are events and are logically equivalent to the event
> flags
> used in the first patent (events w/ single writer vs. eventcounts) but I
> suppose one could be picky here.

My unpublished stuff has logic that monitors cpus and/or threads, bundled up
in a "single algorithm" that uses pthreads and can run on Linux and Windows;
I don't know how IBM's lawyers would take that. Hmm... I want to be able to
use my implementation without getting sued!

:O


>> Unfortunately, using a mutex slows down every test and the kernel usage
>> goes
>> off the charts. The specific algorithm that this caused problems with is
>> the
>> lock-free stack that AppCore exports. It's heavily optimized for high
>> number
>> of writes. It allocates its nodes internally from a three-part cache:
>>
> [...]
>
> I use a semaphore to limit how much resources the writers can tie up at
> any one time. Seems preferable to running out of free memory.

:)


Chris Thomasson

unread,
May 5, 2005, 4:01:00 AM5/5/05
to
> Actually, thinking about it, I'd stay away from RCU until IBM makes its
> intentions a little more clear. IBM does have this Linux strategy. It's
> not a problem for me since I'm not going to publish any code, I'm just
> experimenting with it. If you've been following any of the C++ smart
> pointer / shared pointer discussions, you'd realize that nothing short
> of a scheme that can GC everything with zero overhead in a multi-threaded
> environment with no scalability problems and can deal with untracked raw
> pointers will satisfy anyone. Until everyone becomes more realistic, and
> I don't know how long that will take since most of them don't appear to
> have even started on their multi-threaded learning curves, there's not
> much use trying to create a general usage smart pointer.

:)


> Back to IBM again. They don't seem to publicize any projects that are in
> the works, so it's hard to tell what they might be up to here. I would
> guess they might try to put this in the Linux kernel since RCU is already
> in place and the hazard pointer stuff would solve some problems running
> with preemption enabled and some problems with resource granularity that
> a pure proxy GC method would have. As far as user space is concerned,
> they might try to restrict its usage to Linux as far as it fits in with
> their Linux strategy.

This could be a point of contention for my library. It runs on Windows and
Linux... Yikes!


> That could be a problem however since a lot of
> open source projects are platform independent and aren't going to adopt
> non-portable api's.

Good point.


> You might run something by IBM's lawyers just to see what they might say.
> If you phrase the questions right, you might learn something about IBM's
> intentions but being lawyers they'd probably catch on pretty quick. You'd
> have more of a chance if you restricted your stuff to Linux though I don't
> know if the GPL will allow patent restrictions of that kind. I don't know
> how they got RCU in the Linux kernel in the first place like they did.

Good question. Does Linux really infringe on any active RCU patents? Linux
uses multiple methods to detect quiescent states; I wonder if there is a
patent that describes the Linux quiescent state detection algorithm.


Joe Seigh

unread,
May 5, 2005, 9:59:45 AM5/5/05
to
On Wed, 04 May 2005 15:17:07 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

>>
>>
>
> It's a shock going back to Linux with its poor throughput
> rate compared to OS X. And signals on Linux definitely
> slow things down. So we can forget about a common
> reference implementation. Platform specific implementations
> are the way to go.
>

I dummied things up to simulate how it would run since I don't
want to bother with the platform specific stuff for now.

The testcase uses the same parameters that I used on OS X.
appc is atomic_ptr proxy collector.

r/s/t w/s/t
rcu+smr 498 498
old rcu 106 196
appc 5 247
rwlock 500 0
mutex 25 0

With no preemption at all in the read loop. The write loop
has a slight delay in both cases.

r/s/t w/s/t
rcu+smr 21268 111
old rcu 23051 480
appc 15234 497
mutex 15000 8
rwlock 14250 0

Max flat out rate for a single write thread using a mutex
with no contention is 558000 w/s/t (writes/sec/thread).

A hazard ptr load appears to be only 2x or so of a bare
load, which is effectively the same.

Linux seems to have more problems due to its screwed-up
scheduling. What you're seeing is more scheduling artifacts
than anything else.

Joe Seigh

unread,
May 5, 2005, 10:31:24 AM5/5/05
to
On Thu, 5 May 2005 00:50:27 -0700, Chris Thomasson <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote:

>>>
>> The general technique is to defer freeing objects until every thread has
>> passed through a quiescent state at least once.
>
> That seems fairly generic. Does a patent for this kind of stuff need to
> drill down on "specific" quiescent states, or can they actually get away
> with highly abstract claims?

You try to make the claims as broad as possible without claiming prior
art. Usually you have a series of claims starting with the broadest
possible one, with subsequent claims narrowed down with more
qualifications, and let the patent office throw out the claims they
think are too broad rather than guess how much the patent office
will let you get away with. This is why you see claims refer to
previous claims, e.g. "Claim 2: Claim 1 further comprised with ..."
I think we claimed in the first patent that the quiescent states
could be arbitrary. But a particular way of doing a quiescent state
could be patented if it's new.
>
>
[...]

>
>> Actually when we initially came up with the idea we only thought
>> of the at least once bit, we didn't realize the indirect events
>> had to be observed twice until we did the actual implementation
>> which the patent was based upon.
>>
>> If you use events that are dependent on the deferred free, e.g.
>> timestamps, a common counter, or synchronized observation of processor
>> state, then you only need to observe the event once to correctly infer
>> that a processor has passed thru a quiescent state.
>
> I see. My implementation does something like this. Its polling algorithm
> only has to observe a "certain" state of a per-thread and/or per-cpu
> data structure once in order to detect a quiescent state. The algorithm
> detects quiescent states/grace periods by sampling information from
> threads and/or cpus.

As long as you have the correct synchronization and logic that lets
you infer correctly that the freed object is no longer referenced.


>> There's multiple patents on this. Only the first one is in the
>> public domain. You'd have to make sure your implementation is
>> covered by that if you don't want to get permission from IBM.
>
>
> Are these some of them:
>
> 6,219,690
> 5,727,209
> 5,608,893
> 5,442,758
>
> I really need to examine each of these. I hope they don't tie up every damn
> technique!

They're mostly continuation patents AFAICT. So I think you only need
to look at the more recent ones. Plus patent 5,809,168 which has lapsed.
McKenney put the list of patents in the Linux Kernel documentation somewhere.

Joe Seigh

unread,
May 8, 2005, 10:21:55 AM5/8/05
to
I think I screwed up by publishing this instead of patenting it.
People may not like patents but at least you can make money off
of them as opposed to FOSS. Definitely need to work on my
habit of posting ideas on usenet without doing a patent feasibility
study first.

Chris Thomasson

unread,
May 10, 2005, 4:58:54 PM5/10/05
to
>I think I screwed up by publishing this instead of patenting it.
> People may not like patents but at least you can make money off
> of them as opposed to FOSS. Definitely need to work on my
> habit of posting ideas on usenet without doing a patent feasibility
> study first.

I may be in a similar dilemma. I have come up with a generic algorithm that
doesn't rely on OS support, and want to publish it but am very wary of doing
so.

;(...


Joe Seigh

unread,
May 27, 2005, 2:55:04 PM5/27/05
to
On Tue, 03 May 2005 22:31:49 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

> On Tue, 03 May 2005 19:13:33 -0400, Joe Seigh <jsei...@xemaps.com> wrote:
>
>> Hmm... signals thrash things up quite a bit. Things
>> slow way down unless I put the polling interval up to
>> 300 msec to spread the signaling out a bit. Then I
>> get about 1,286,000 reads/sec/thread for RCU+SMR vs.
>> 176,000 reads/sec/thread for RCU with the same polling
>> interval. It's a little over 700 reads/sec/thread for
>> mutexes or rwlocks.
>
> Ah wait, I must have been looking at the write rate. It
> gets well over 500,000 reads/sec/thread at the default
> poll interval, 50 msec. That's more reasonable. Now
> to try it on Linux to see how it works there. Linux
> signaling is preemptive I believe.
>

Using the Mach interface on OS X to query thread status
slightly more than doubles the rate, so even on OS X doing
a lot of signaling slows things down. Using the thread
run state basically kills the read rate, but that's because
on a uniprocessor the other threads are always in a quiescent
state, i.e. not running, when you query them, so there's
nothing to stop the writer threads from finishing almost
instantly before the reader threads do anything.
