
Using hierarchical memory as an acquire memory barrier for dependent loads


Joe Seigh
Sep 11, 2005, 12:29:12 PM
While perusing patent applications for lock-free techniques, I noticed the one
by Paul McKenney titled "Software implementation of synchronous memory barriers"
(20020194436), which basically describes an RCU-like technique for eliminating
the need for acquire memory barriers by using signals to force them on an
as-needed basis. That explains that part of the discussion on RCU+SMR when it
occurred on LKML.

It seems to me you could use the memory hierarchy to accomplish the same
effect. I'm assuming that out-of-order dependent memory accesses won't cross
memory hierarchy boundaries.

There are two levels of the memory hierarchy we could exploit, possibly without
any special kernel support. One is the cache and the other is virtual memory.
Basically, when memory is freed with proper release semantics, either explicit
or supplied by RCU, you use another form of RCU to determine when all
processors have flushed or purged references to that memory from their
cache or from their translation lookaside buffer. The purge/flush instructions
are the quiescent states for the second form of RCU. Once all processors
have quiesced (dropped references), it is safe for writer threads to use that
memory for updates using RCU.
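
As a rough sketch (all names here are made up; none of this is a real
kernel API), the wait might look like:

/* Hypothetical per-CPU counter that each processor bumps whenever it
 * executes a cache flush or TLB purge covering the region. */
#define NR_CPUS 64
extern volatile unsigned long flush_gen[NR_CPUS];

void wait_for_flush_grace_period(void)
{
    unsigned long snap[NR_CPUS];
    int c;

    for (c = 0; c < NR_CPUS; c++)
        snap[c] = flush_gen[c];     /* snapshot the current generations */

    /* A processor has quiesced (dropped its cached/TLB references to
     * the region) once its flush generation moves past the snapshot. */
    for (c = 0; c < NR_CPUS; c++)
        while (flush_gen[c] == snap[c])
            ;   /* spin; a real version would block or force a flush */
}

Once this returns, every processor has executed a purge/flush since the
release, so the writer can safely reuse the memory.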

The problem is that as soon as the writer thread starts accessing that memory,
you can start polluting the cache and/or the translation lookaside buffer with
no assurance that they will be flushed if a reader thread is dispatched on that
processor before the writer thread completes its updates to memory. The
way around that problem is to use two different memory mappings for the
same memory, one for use by writer threads and one for use by reader threads.
This trick depends on the cache working on virtual memory, not real memory, if
you're using cache as the memory hierarchy level. Also, addresses stored
in memory should be valid in the reader memory mapping if you don't want
readers doing offset-to-address conversion.
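
A rough sketch of the double mapping using POSIX shared memory (the
region name and helper are made up, purely for illustration):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Map the same pages twice: a read/write view for writers and a
 * read-only view for readers. Pointers stored inside the region must
 * hold reader-view addresses, per the above. */
int map_two_views(size_t len, void **writer_view, void **reader_view)
{
    int fd = shm_open("/rcu_region", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, len) < 0)
        return -1;
    *writer_view = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    *reader_view = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    return (*writer_view == MAP_FAILED ||
            *reader_view == MAP_FAILED) ? -1 : 0;
}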

Granularity of memory management would be the cache line size or page size
depending on which mechanism you are using.

On one level, this is similar to what virtual memory managers do when managing
real memory. In fact, one of the earliest uses of an RCU-like technique was by
MVS to determine when it was safe to reallocate real memory to another
mapping, with SIGP signaling to speed things up if it needed to steal pages.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

David Hopwood
Sep 11, 2005, 7:11:24 PM

Madness. Reading all that patent obfuscated English has addled your brain.

Cache is supposed to be a transparent abstraction. So is the TLB (software
TLBs notwithstanding). Breaking that will break anyone's ability to understand
what the system is doing, just in order to try (without necessarily succeeding)
to optimize something that isn't a performance bottleneck.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh
Sep 11, 2005, 7:39:25 PM
David Hopwood wrote:
>
> Madness. Reading all that patent obfuscated English has addled your brain.

Some of the technical publications aren't any better. :) Most of the patents
are on techniques published by the inventors.

>
> Cache is supposed to be a transparent abstraction. So is the TLB (software
> TLBs notwithstanding). Breaking that will break anyone's ability to understand
> what the system is doing, just in order to try (without necessarily succeeding)
> to optimize something that isn't a performance bottleneck.
>

You don't need it on most systems since they have dependent load ordering.
They're perfectly capable of running slow without any extra assistance.
But this would allow a more relaxed cache protocol / memory model to be
used, which could improve hardware performance. Anyway, after
the debacle with getting the x86 memory model semantics confused with its
implementation, and the attempts at hacks to improve global load ordering,
I would think everyone would be more attuned to the issue.

Nick Maclaren
Sep 12, 2005, 4:46:30 AM

In article <wQ2Ve.21920$k22....@fe2.news.blueyonder.co.uk>,

David Hopwood <david.nosp...@blueyonder.co.uk> writes:
|>
|> Cache is supposed to be a transparent abstraction. So is the TLB (software
|> TLBs notwithstanding). Breaking that will break anyone's ability to understand
|> what the system is doing, just in order to try (without necessarily succeeding)
|> to optimize something that isn't a performance bottleneck.

Don't be so certain of the last. I see both cache and TLB handling
being a performance bottleneck on a daily basis, and one of the
solutions to this would involve making them more visible.


Regards,
Nick Maclaren.


Joe Seigh
Sep 12, 2005, 8:41:36 AM
Joe Seigh wrote:

[...]


>
> It seems to me you could use the memory hierarchy to accomplish the same
> effect. I'm assuming that out-of-order dependent memory accesses won't
> cross memory hierarchy boundaries.
>
> There are two levels of the memory hierarchy we could exploit, possibly
> without any special kernel support. One is the cache and the other is
> virtual memory. Basically, when memory is freed with proper release
> semantics, either explicit or supplied by RCU, you use another form of
> RCU to determine when all processors have flushed or purged references
> to that memory from their cache or from their translation lookaside
> buffer. The purge/flush instructions are the quiescent states for the
> second form of RCU. Once all processors have quiesced (dropped
> references), it is safe for writer threads to use that memory for
> updates using RCU.
>

[...]

Some forms of speculative execution might break this, though, by fetching
unrelated data into the cache. The most likely is array traversal,
since it can generate addresses not actually valid for the program in
question.

Alexander Terekhov
Sep 12, 2005, 11:05:49 AM

Joe Seigh wrote:
[...]

> You don't need it on most systems since they have dependent load ordering.

Compilers can turn dd into cc and reorder, "incorrect" software value
prediction aside for a moment. AFAICS, 20020194436 is about hacking

http://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
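
By dd into cc I mean this kind of transformation (sketch; types and
names invented):

typedef struct node { int val; } node;
extern node *shared_ptr;     /* published pointer */
extern node *likely_node;    /* compiler/profile-predicted value */

int dd_load(void)            /* data-dependent load */
{
    node *p = shared_ptr;
    return p->val;           /* ordered after the load of p by the
                                address dependency, on most hardware */
}

int cc_load(void)            /* what value prediction can turn it into */
{
    node *p = shared_ptr;
    if (p == likely_node)         /* guessed value... */
        return likely_node->val;  /* ...this load no longer carries the
                                     address dependency, and control
                                     dependencies don't order loads */
    return p->val;
}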

It won't help against "incorrect" hardware value prediction, I'm afraid.

http://www.cs.wisc.edu/~cain/pubs/micro01_correct_vp.pdf

regards,
alexander.

Joe Seigh
Sep 12, 2005, 11:42:20 AM
Alexander Terekhov wrote:
> Joe Seigh wrote:
> [...]
>
>>You don't need it on most systems since they have dependent load ordering.
>
>
> Compilers can turn dd into cc and reorder, "incorrect" software value
> prediction aside for a moment. AFAICS, 20020194436 is about hacking

The OP is addressing hardware issues, not compiler issues.
>
> http://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html

Which is why I used the word "most". Dependent load ordering isn't part
of any of the official memory models as far as I know, so the behavior
was allowable. It's also allowable on x86, so it's conceivable that
future Intel or AMD processors could require memory barriers to guarantee
dependent load ordering. Since that hasn't happened yet, it's impossible
to tell how expensive the memory barrier would be. Since multi-threaded
synchronization mechanisms have certain assumptions of efficiency, it's
always good to have alternative implementations as part of your backup
plans. It's a robustness-of-design issue.
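
The pattern at issue is just the usual pointer publication; as a sketch:

typedef struct obj { int data; } obj;
extern obj * volatile global_ptr;   /* stored with release semantics */

int reader(void)
{
    obj *p = global_ptr;
    /* On Alpha - or on a hypothetical future x86 without dependent
     * load ordering - a read memory barrier would be needed here.
     * Today the address dependency alone suffices on x86. */
    return p->data;
}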

> It won't help against "incorrect" hardware value prediction, I'm afraid.

Not at the cache level, but you could do it at the virtual memory level
by mapping the new data into the reader memory address range as part of
the release operation. It's probably more expensive than you want it to
be, but it has to work because that's what the virtual memory manager does
when it pages in from backing store.
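
As a sketch of that release operation, assuming the updated pages were
prepared through the writer mapping (names illustrative; mmap with
MAP_FIXED atomically replaces the old translation on Linux):

#include <sys/mman.h>

int publish_to_reader_view(void *reader_addr, size_t len,
                           int fd, off_t off)
{
    /* Readers that re-traverse after this see the new pages, exactly
     * as when a VM manager pages in from backing store. */
    void *p = mmap(reader_addr, len, PROT_READ,
                   MAP_SHARED | MAP_FIXED, fd, off);
    return (p == MAP_FAILED) ? -1 : 0;
}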

David Hopwood
Sep 12, 2005, 11:54:46 AM
Nick Maclaren wrote:
> David Hopwood <david.nosp...@blueyonder.co.uk> writes:
> |>
> |> Cache is supposed to be a transparent abstraction. So is the TLB (software
> |> TLBs notwithstanding). Breaking that will break anyone's ability to understand
> |> what the system is doing, just in order to try (without necessarily succeeding)
> |> to optimize something that isn't a performance bottleneck.
>
> Don't be so certain of the last. I see both cache and TLB handling
> being a performance bottleneck on a daily basis, and one of the
> solutions to this would involve making them more visible.

The "something that isn't a performance bottleneck" was referring to
acquire memory barriers. Sorry if that wasn't clear.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Alexander Terekhov
Sep 12, 2005, 12:07:55 PM

Joe Seigh wrote:
[...]

> Which is why I used the word "most". Dependent load ordering isn't part
> of any of the official memory models as far as I know so the behavior
> was allowable.

Your knowledge is inaccurate.



> It's also allowable on x86

In Alpha speak, x86 loads from WB memory (I mean processor consistency)
all have trailing "MB".

regards,
alexander.

Sander Vesik
Sep 12, 2005, 12:17:04 PM

However, that's not what the original poster wanted to optimise - he
wanted to get some benefits for RCU by effectively breaking cache
transparency.


--
Sander

+++ Out of cheese error +++

Joe Seigh
Sep 12, 2005, 12:39:43 PM
Alexander Terekhov wrote:
> Joe Seigh wrote:
> [...]
>
>>Which is why I used the word "most". Dependent load ordering isn't part
>>of any of the official memory models as far as I know so the behavior
>>was allowable.
>
>
> Your knowledge is inaccurate.
>

I meant the specification. I try not to conflate specification and implementation.

Joe Seigh
Sep 12, 2005, 12:55:38 PM

Nick is more correct. I was proposing it as a mechanism for preserving
some of RCU's performance (or that of lock-free in general) if
hardware were changed to improve its performance by relaxing some of the
strict conditions it has to meet now. E.g. relaxing the cache protocols
could improve performance by reducing cache invalidates if the de facto
memory model did not require them.

RCU was developed by Paul McKenney to allow, in part I think, better
performance in a NUMA environment. If you exposed and possibly changed
cache behavior in a so-called UMA environment you could get some of the
same benefits.

Alexander Terekhov
Sep 12, 2005, 1:06:38 PM

Joe Seigh wrote:
[...]
> I meant the specification.

x86 loads from WB memory are loads ".acq" per specification of processor
consistency given in

http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf

I mean

http://groups.google.com/group/comp.programming.threads/msg/88755b3cd32f9f8c

regards,
alexander.

Alexander Terekhov
Sep 12, 2005, 1:24:40 PM

Joe Seigh wrote:
[...]

> I try not to conflate specification and implementation.

Conflation or non-conflation aside for a moment, I'm just curious what
your lock-free sem_getvalue() for x86 would look like. Care to share?

regards,
alexander.

Joe Seigh
Sep 12, 2005, 2:00:39 PM
(followup set to comp.programming.threads)

Depends. The documentation for that is a little strange, and it's not
clear if sem_getvalue has any practical use.

To be on the safe side you might have to use the same synchronization
that the rest of the semaphore implementation uses to set the semaphore
value. Though I might be inclined, if it looks like all you could use
sem_getvalue for is as a simple event synchronization object, just to
do a simple load followed by an mfence, i.e. acquire semantics. It's basically
a question of what the observable behavior is no matter what the
documentation says.
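
Something like this sketch, with the semaphore type simplified down to
just its count:

typedef struct { volatile int value; /* ... */ } lf_sem_t;

int my_sem_getvalue(lf_sem_t *s, int *sval)
{
    *sval = s->value;                            /* plain x86 load */
    __asm__ __volatile__("mfence" ::: "memory"); /* fence after load */
    return 0;
}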

I see you've been looking for places to use your compare and
swap "synchronized" load trick. :)

Nick Maclaren
Sep 12, 2005, 2:11:35 PM
In article <axhVe.1600$Kk3....@fe1.news.blueyonder.co.uk>,
David Hopwood <david.nosp...@blueyonder.co.uk> wrote:

>Nick Maclaren wrote:
>> |>
>> |> Cache is supposed to be a transparent abstraction. So is the TLB (software
>> |> TLBs notwithstanding). Breaking that will break anyone's ability to understand
>> |> what the system is doing, just in order to try (without necessarily succeeding)
>> |> to optimize something that isn't a performance bottleneck.
>>
>> Don't be so certain of the last. I see both cache and TLB handling
>> being a performance bottleneck on a daily basis, and one of the
>> solutions to this would involve making them more visible.
>
>The "something that isn't a performance bottleneck" was referring to
>acquire memory barriers. Sorry if that wasn't clear.

Ah. No, I didn't get that.

However, my statement stands, even for those, though I don't see
it that often. It is a major issue for OpenMP and POSIX threads
codes that attempt to deliver small-grain parallelism.


Regards,
Nick Maclaren.

Alexander Terekhov
Sep 12, 2005, 2:24:30 PM

Joe Seigh wrote:
[...]

> > Conflation or non-conflation aside for a moment, I'm just curious what
> > your lock-free sem_getvalue() for x86 would look like. Care to share?
> >
>
> Depends. The documentation for that is a little strange, and it's not
> clear if sem_getvalue has any practical use.

http://www.opengroup.org/austin/mailarchives/ag/msg08586.html

>
> To be on the safe side you might have to use the same synchronization
> that the rest of the semaphore implementation uses to set the semaphore
> value. Though I might be inclined, if it looks like all you could use
> sem_getvalue for is as a simple event synchronization object, just to
> do a simple load followed by an mfence, i.e. acquire semantics. It's basically

For a simple x86 load from WB memory, you'll get acquire semantics
"for free" whether you want it or not. But acquire doesn't
mean remote write atomicity... the illusion of which can be achieved via
a dummy RMW inside lock-free sem_getvalue() (instead of a simple
load... irrespective of implied or extra added fences) on archs like
x86/WB (processor consistency) or Power.
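
Sketch, using the lf_sem_t above and a GCC builtin for brevity:

int my_sem_getvalue(lf_sem_t *s, int *sval)
{
    /* A dummy lock xadd of 0 reads the value while participating in
     * the single global order of locked operations, giving the
     * illusion of remote write atomicity that a plain load lacks
     * under PC. */
    *sval = __sync_fetch_and_add((int *)&s->value, 0);
    return 0;
}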

regards,
alexander.

Joe Seigh
Sep 12, 2005, 3:02:09 PM
Alexander Terekhov wrote:
> Joe Seigh wrote:
> [...]
>
>>>Conflation or non-conflation aside for a moment, I'm just curious how
>>>your lock-free sem_getvalue() for x86 would look like. Care to share?
>>>
>>
>>Depends. The documentation for that is a little strange and it's not
>>clear if sem_getvalue has any practical use.
>
>
> http://www.opengroup.org/austin/mailarchives/ag/msg08586.html
>
>

Depends on what you think "synchronize memory" means for posix semaphores,
what is actually observable, and what you mean by "remote write atomicity".

Alexander Terekhov
Sep 12, 2005, 4:37:17 PM

Joe Seigh wrote:
>
> Alexander Terekhov wrote:
> > Joe Seigh wrote:
> > [...]
> >
> >>>Conflation or non-conflation aside for a moment, I'm just curious what
> >>>your lock-free sem_getvalue() for x86 would look like. Care to share?
> >>>
> >>
> >>Depends. The documentation for that is a little strange, and it's not
> >>clear if sem_getvalue has any practical use.
> >
> >
> > http://www.opengroup.org/austin/mailarchives/ag/msg08586.html
> >
> >
>
> Depends on what you think "synchronize memory" means for posix semaphores,

I think it means a much stricter synchronization protocol than it
does for mutexes... due to semaphore value exposure, not mere locking as
for mutexes.

> what is actually observable,

Semaphore operations (lock/unlock/getvalue) are meant to be fully fenced
and to provide the illusion of remote write atomicity.

> and what you mean by "remote write atomicity".

It's what makes PC (x86/WB) != TSO.
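
The standard IRIW litmus test shows the difference. Initially X = Y = 0:

P1: X = 1;
P2: Y = 1;
P3: r1 = X;  r2 = Y;    /* observes r1 == 1, r2 == 0 */
P4: r3 = Y;  r4 = X;    /* observes r3 == 1, r4 == 0 */

TSO forbids that combined outcome: all processors see the two
independent writes in the same order. PC allows it, because a write may
become visible to some processors before others. That's remote write
atomicity.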

regards,
alexander.

David Hopwood
Sep 12, 2005, 9:18:43 PM

I don't think that the cost of acquire barriers, specifically, is the
cause of any performance problems with OpenMP and pthreads in providing
small-grain parallelism. Note that these barriers are almost certainly
not needed anyway on x86[-64], PPC or SPARC.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Nick Maclaren
Sep 13, 2005, 3:35:00 AM
In article <TNpVe.24147$k22....@fe2.news.blueyonder.co.uk>,

David Hopwood <david.nosp...@blueyonder.co.uk> wrote:
>>>
>>>The "something that isn't a performance bottleneck" was referring to
>>>acquire memory barriers. Sorry if that wasn't clear.
>>
>> Ah. No, I didn't get that.
>>
>> However, my statement stands, even for those, though I don't see
>> it that often. It is a major issue for OpenMP and POSIX threads
>> codes that attempt to deliver small-grain parallelism.
>
>I don't think that the cost of acquire barriers, specifically, is the
>cause of any performance problems with OpenMP and pthreads in providing
>small-grain parallelism. Note that these barriers are almost certainly
>not needed anyway on x86[-64], PPC or SPARC.

Not acquire barriers in the very limited sense, no. But the logical
extension of them from the ISA to the HLL interface. I agree that
fiddling with the semantics to cure a hardware non-problem is not
worth the bother, but doing so to address the very real problem I
am referring to is.

In particular, if you make cache and TLB control explicit, an OpenMP
program (not POSIX threads, which is beyond redemption) could insert
the relevant calls at the relevant places. With competent hardware
design, you could even get - heresy - checking!

But I agree that what I was referring to was rather more ambitious
than what most other people may have been thinking of, despite the
fact that it IS the same issue, viewed in the large.


Regards,
Nick Maclaren.

Seongbae Park
Sep 13, 2005, 4:21:11 AM
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
> In particular, if you make cache and TLB control explicit, an OpenMP

Do you really believe that programmers can handle full explicit control
of the cache, which could very well cause a hang or extreme starvation?
I'm not even sure runtime and compiler writers would be able to handle it
very well - though that may be because I haven't given it a lot of thought.

Shouldn't they (both general programmers and compiler/runtime writers)
be much better off with an "advice" mechanism? Prefetch is one such
advice, and you can easily imagine the reverse of prefetch, and prefetch
(and its reverse) for the TLB as well.
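
E.g. in GCC terms (node and process are placeholders):

typedef struct node { struct node *next; /* plus payload */ } node;
extern void process(node *);

void walk(node *head)
{
    node *p;
    for (p = head; p != NULL; p = p->next) {
        __builtin_prefetch(p->next);  /* pure advice: hardware may drop
                                         it, and correctness never
                                         depends on it */
        process(p);
    }
}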

> program (not POSIX threads, which is beyond redemption) could insert
> the relevant calls at the relevant places. With competent hardware
> design, you could even get - heresy - checking!
>
> But I agree that what I was referring to was rather more ambitious
> than what most other people may have been thinking of, despite the
> fact that it IS the same issue, viewed in the large.

BTW, explicit control of the TLB is not impossible on the current
generation of processors - most modern hardware supports enough features
to make it possible (even if in a limited fashion)
for system software to provide such a feature,
although it is generally neither implemented
nor provided to user-level software in a generally useful form.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"

Alexander Terekhov
Sep 13, 2005, 1:10:46 PM

David Hopwood wrote:
[...]
> I don't think that the cost of acquire barriers, ... Note that these
> barriers are almost certainly not needed anyway on x86[-64], PPC

You probably mean ddacq in conjunction with loads (RMW stuff aside
for a moment). Because extra sync is certainly needed for both ccacq_*
and plain acq loads on PPC.

regards,
alexander.
