Memory barriers in RCU+SMR

Dmitriy V'jukov

unread,

Apr 11, 2008, 11:57:55 AM4/11/08

to

I don't see acquire and release memory barriers in functions smrload()
and smrnull() in Joe Seigh's implementation of SMR+RCU (fastsmr,
http://sourceforge.net/project/showfiles.php?group_id=127837).

I was thinking that SMR+RCU only eliminates the need for #StoreLoad in
smrload(). Why acquire memory barrier is not needed in smrload() in SMR
+RCU? Because of control-dependency? Why release memory barrier is not
needed in smrnull() in SMR+RCU?

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 12, 2008, 9:47:01 AM4/12/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:a2be3803-e239-4aca...@u12g2000prd.googlegroups.com...

For the release barrier, I think that Joe is using an extra RCU grace-period
before he fires deferment callbacks. As for #StoreLoad, well, its an
integral part of acquire barrier. Are you talking about the barrier not
being able to legally extend into the semantics of the "end-user algorithm"?
In other words, the removal of the #StoreLoad in 'smrload()' may not be
enough to remove it from end-user algorithm?

Dmitriy V'jukov

unread,

Apr 12, 2008, 12:12:16 PM4/12/08

to

On 12 апр, 17:47, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

Yes, I am talking about semantics of the "end-user algorithm".
smrload()/smrnull() it's like critical section for end-user algorithm.
So smrload() must include acquire barrier (#LoadLoad + #LoadStore),
and smrnull() must include release barrier (#LoadStore + #StoreStore).
In order to prevent user code from escaping from "critical section".

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 12, 2008, 1:45:32 PM4/12/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:5c0b9f71-79e4-4c3a...@m44g2000hsc.googlegroups.com...

> On 12 апр, 17:47, "Chris Thomasson" <cris...@comcast.net> wrote:
> > "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
> >
> > news:a2be3803-e239-4aca...@u12g2000prd.googlegroups.com...
> >
> >I don't see acquire and release memory barriers in functions smrload()
> > > and smrnull() in Joe Seigh's implementation of SMR+RCU (fastsmr,
> > >http://sourceforge.net/project/showfiles.php?group_id=127837).
> >
> > > I was thinking that SMR+RCU only eliminates the need for #StoreLoad in
> > > smrload(). Why acquire memory barrier is not needed in smrload() in
> > > SMR
> > > +RCU? Because of control-dependency? Why release memory barrier is not
> > > needed in smrnull() in SMR+RCU?
> >
> > For the release barrier, I think that Joe is using an extra RCU
> > grace-period
> > before he fires deferment callbacks. As for #StoreLoad, well, its an
> > integral part of acquire barrier. Are you talking about the barrier not
> > being able to legally extend into the semantics of the "end-user
> > algorithm"?
> > In other words, the removal of the #StoreLoad in 'smrload()' may not be
> > enough to remove it from end-user algorithm?

> Yes, I am talking about semantics of the "end-user algorithm".
> smrload()/smrnull() it's like critical section for end-user algorithm.
> So smrload() must include acquire barrier (#LoadLoad + #LoadStore),

I think that end-user algorithm would require a:

#StoreLoad | #StoreStore
<user>
#LoadStore | #StoreStore

> and smrnull() must include release barrier (#LoadStore + #StoreStore).
> In order to prevent user code from escaping from "critical section".

Humm. Well, if I understand Joes algorithm correctly... The epoch which
eludes the acquire, and the extra epoch which eludes the release does indeed
extend into the "stuff" in between. In other words, when the deferment can
be called, the user algorithms use of that specific hazard pointer is
visibly overwith.

Joe Seigh

unread,

Apr 13, 2008, 7:09:27 PM4/13/08

to

http://atomic-ptr-plus.sourceforge.net/design.html

"Once an objects address is no longer seen in any hazard pointers,
an additional RCU checkpoint is taken to ensure any pending accesses
to the object have completed, i.e. memory release semantics."

It's a little out of date since I changed the implementation to support
two different ways of collecting nodes. And there was a third way which
I didn't implement and have forgotten anyway.

The load uses dependent load for acquire semantics.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Chris Thomasson

unread,

Apr 13, 2008, 8:19:08 PM4/13/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:5c0b9f71-79e4-4c3a...@m44g2000hsc.googlegroups.com...

I believe the way Joe is handling things makes it safe. Basically, its kind
of analogous to a RCU implementation that treats calls to
'rcu_read_lock/unlock' as nops. Once something sets deferred, he waits for a
sync-epoch before a hazard-pointer scan is completed. This ensures that
mutations prior to deferment are rendered visible; this gives the acquire
semantics. After that, I believe he waits for another sync-epoch before
actually executing the callback function. After a deferment, the process
goes like 'RCU->SMR->RCU'. The extra sync-epoch is used to get release
semantics.

Chris Thomasson

unread,

Apr 13, 2008, 8:23:41 PM4/13/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:aLydnXfLQK4KCZ_V...@comcast.com...

To be more precise, the first sync-epoch eludes the "store-acquire" barrier
in the SMR algorithm itself. The "load-acquire" barrier is handled through
data-dependant load semantics.

Dmitriy V'jukov

unread,

Apr 14, 2008, 6:52:07 AM4/14/08

to

On 11 апр, 19:57, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> I don't see acquire and release memory barriers in functions smrload()

> and smrnull() in Joe Seigh's implementation of SMR+RCU (fastsmr,http://sourceforge.net/project/showfiles.php?group_id=127837).

>
> I was thinking that SMR+RCU only eliminates the need for #StoreLoad in
> smrload(). Why acquire memory barrier is not needed in smrload() in SMR
> +RCU? Because of control-dependency? Why release memory barrier is not
> needed in smrnull() in SMR+RCU?

Ok, now I understand. Thank you, Chris Thomasson and Joe Seigh.

I work mostly with x86, where acquire and release barriers are usually
not needed. AFAIK on SPARC TSO they are not needed too. I'm
interesting, how much they are expensive on other architectures (PPC,
Itanium)? Is it really worth doing to eliminate them all-out?

I'm thinking about asymmetric rw mutex algorithm:
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/8c6c19698964d1f6

Here is sketch:

void reader_lock()
{
reader_inside[current_thread_index] = true;
if (writer_pending)
{
reader_inside[current_thread_index] = false;
EnterCriticalSection(&cs);
reader_inside[current_thread_index] = true;
LeaveCriticalSection(&cs);
}
acquire_barrier(); // <-------------------------------
}

void reader_unlock()
{
release_barrier(); // <-------------------------------
reader_inside[current_thread_index] = false;
}

void writer_lock()
{
EnterCriticalSection(&cs);
writer_pending = true;
sys_synch(); // rcu_synchronize() or
FlushProcessorWriteBuffers() etc
for (int i = 0; i != max_reader_count; ++i)
{
while (reader_inside[i])
SwitchToThread();
}
}

void writer_unlock()
{
writer_pending = false;
LeaveCriticalSection(&cs);
}

Whether I need acquire_barrier() in reader_lock() and
release_barrier() in reader_unlock()?

It seems that release_barrier() is needed. But I'm not sure about
acquire_barrier(). I'm feeling that it's not needed. Why? I'm not
sure. There is some kind of control-dependency or something... Can
somebody clarify?

I think I can eliminate release_barrier() in reader_unlock(). I need
to employ 'wait for one more epoch' trick, like in SMR+RCU:

void writer_lock()
{
EnterCriticalSection(&cs);
writer_pending = true;
sys_synch(); // rcu_synchronize() or
FlushProcessorWriteBuffers() etc
for (int i = 0; i != max_reader_count; ++i)
{
while (reader_inside[i])
SwitchToThread();
}
// <-----------------------------------------------------
// We've seen that all readers outside critical section, so
// render all their memory accesses visible
sys_synch(); // rcu_synchronize() or
FlushProcessorWriteBuffers() etc
}

What do you think?

It seems that this modification eliminates the need for
acquire_barrier() in reader_lock() as well...

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 17, 2008, 6:19:20 PM4/17/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:52139a42-7817-4eb7...@d45g2000hsc.googlegroups.com...

On 11 О©╫О©╫О©╫, 19:57, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > I don't see acquire and release memory barriers in functions smrload()
> > and smrnull() in Joe Seigh's implementation of SMR+RCU
> > (fastsmr,http://sourceforge.net/project/showfiles.php?group_id=127837).
> >
> > I was thinking that SMR+RCU only eliminates the need for #StoreLoad in
> > smrload(). Why acquire memory barrier is not needed in smrload() in SMR
> > +RCU? Because of control-dependency? Why release memory barrier is not
> > needed in smrnull() in SMR+RCU?

> Ok, now I understand. Thank you, Chris Thomasson and Joe Seigh.

> I work mostly with x86, where acquire and release barriers are usually
> not needed. AFAIK on SPARC TSO they are not needed too. I'm
> interesting, how much they are expensive on other architectures (PPC,
> Itanium)? Is it really worth doing to eliminate them all-out?

> I'm thinking about asymmetric rw mutex algorithm:
> http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/8c6c19698964d1f6

> Here is sketch:

[...]

> Whether I need acquire_barrier() in reader_lock() and
> release_barrier() in reader_unlock()?

> It seems that release_barrier() is needed. But I'm not sure about
> acquire_barrier(). I'm feeling that it's not needed. Why? I'm not
> sure. There is some kind of control-dependency or something... Can
> somebody clarify?

You need to ensure that nothing from the read-side critical-region can rise
up above any of the acquisition logic. I think that the synchronization
epoch in the writer should go ahead and satisfy that condition. Think about
the following scenario:

static int i = 0;

1: read_lock();
2: int x = i;
3: read_unlock();

Lets assume that there is a condition in which the execution sequence was
reordered and operation 2 became visible before any effects of 1 become
visible... Now a writer comes in and executes a sync-epoch... That will
ensure that the acquisition is also rendered visible. Things are consistent
at this point wrt the acquisition.

> I think I can eliminate release_barrier() in reader_unlock(). I need
> to employ 'wait for one more epoch' trick, like in SMR+RCU:

[...]

> What do you think?

Once the writer observes that there are no more readers, it also needs to
ensure that the operations into the read-side critical-section are committed
before the any writes are made.Therefore, executing an additional sync-epoch
after the writer observes that there are no readers will relieve the need
for the release-barrier. The only potential problem I can see with this is
that writers are going to be fairly expensive, and could possibly cast a
negative effect on the system. The writer is not passively observing a
sync-epoch, instead it actively starts one. You could rig it to work in a
passive mode, but then your going to need to use highly non-portable and
extremely system specific methods. vZOOM can only use passive epoch
detection on certain systems. On windows it makes use of some undocumented
API calls.

> It seems that this modification eliminates the need for
> acquire_barrier() in reader_lock() as well...

I believe that the original algorithm already took care of the acquire
barrier.

Dmitriy V'jukov

unread,

Apr 18, 2008, 4:01:11 AM4/18/08

to

On 18 апр, 02:19, "Chris Thomasson" <cris...@comcast.net> wrote:

> > Whether I need acquire_barrier() in reader_lock() and
> > release_barrier() in reader_unlock()?
> > It seems that release_barrier() is needed. But I'm not sure about
> > acquire_barrier(). I'm feeling that it's not needed. Why? I'm not
> > sure. There is some kind of control-dependency or something... Can
> > somebody clarify?
>
> You need to ensure that nothing from the read-side critical-region can rise
> up above any of the acquisition logic. I think that the synchronization
> epoch in the writer should go ahead and satisfy that condition. Think about
> the following scenario:
>
> static int i = 0;
>
> 1: read_lock();
> 2: int x = i;
> 3: read_unlock();
>
> Lets assume that there is a condition in which the execution sequence was
> reordered and operation 2 became visible before any effects of 1 become
> visible... Now a writer comes in and executes a sync-epoch... That will
> ensure that the acquisition is also rendered visible. Things are consistent
> at this point wrt the acquisition.

Ok. Thank you for explanation.

So on x86/Sparc TSO all needed memory barriers are implied. On other
architectures one can use additional sys_synch() in writer_lock() in
order to eliminate release barrier in reader_unlock().

> Once the writer observes that there are no more readers, it also needs to
> ensure that the operations into the read-side critical-section are committed
> before the any writes are made.Therefore, executing an additional sync-epoch
> after the writer observes that there are no readers will relieve the need
> for the release-barrier. The only potential problem I can see with this is
> that writers are going to be fairly expensive, and could possibly cast a
> negative effect on the system. The writer is not passively observing a
> sync-epoch, instead it actively starts one. You could rig it to work in a
> passive mode, but then your going to need to use highly non-portable and
> extremely system specific methods. vZOOM can only use passive epoch
> detection on certain systems. On windows it makes use of some undocumented
> API calls.

Yeah, sys_synch() is expensive. But here is little problem. Before
executing sys_synch() writer executes:
writer_pending = true;
This forces all readers to block in reader_lock(). So the longer
executes writer_lock() the longer system activity is totally stopped.
Imho expensive but short sys_synch() operation here is a plus. I
really don't want to sluggishly and passively poll something for a
long time here.

Btw in my rw_mutex shootout this mutex totally destroy all other
mutexes provided interval between writes is 1 ms. Provided high write
rate it goes surprisingly well too.

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 18, 2008, 2:35:22 PM4/18/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:9bc54d21-198c-47b3...@24g2000hsh.googlegroups.com...

Yup.

Good point.

> Btw in my rw_mutex shootout this mutex totally destroy all other
> mutexes provided interval between writes is 1 ms. Provided high write
> rate it goes surprisingly well too.

I have tested your design against other "traditional" rw_mutexs, and it
kills them dead in the water. I have not artificially cranked up the writes
yet. It could be interesting.