It seems to me you could use the memory hierarchy to accomplish the same
effect. I'm assuming that out-of-order dependent memory accesses won't cross
memory hierarchy boundaries.
There are two levels of the memory hierarchy we could exploit, possibly without
any special kernel support. One is cache and the other is virtual memory.
Basically, when memory is freed with proper release semantics, either explicit
or supplied by RCU, you use another form of RCU to determine when all
processors have flushed or purged references to that memory from their
cache or from their translation lookaside buffer. The purge/flush instructions
are the quiescent states for the second form of RCU. Once all processors
have quiesced (dropped references) it is safe for writer threads to use that
memory for updates using RCU.
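Roughly, in C - note_flush() is hypothetical, a stand-in for a hook at the
(privileged) purge/flush instruction, and none of these names are a real API:

#include <stdatomic.h>

#define NCPUS 4

/* Bumped by CPU c whenever it flushes/purges its cache or TLB; that
 * flush is the CPU's quiescent state for this second form of RCU. */
static _Atomic unsigned long flush_count[NCPUS];

void note_flush(int cpu)         /* hook at the purge/flush instruction */
{
    atomic_fetch_add(&flush_count[cpu], 1);
}

/* Writer side: wait until every CPU has flushed at least once since
 * the memory was freed; after that no stale cached/TLB references
 * remain and the memory can be reused for updates. */
void wait_for_flushes(void)
{
    unsigned long snap[NCPUS];
    for (int c = 0; c < NCPUS; c++)
        snap[c] = atomic_load(&flush_count[c]);
    for (int c = 0; c < NCPUS; c++)
        while (atomic_load(&flush_count[c]) == snap[c])
            ;   /* spin; a real implementation would block or defer */
}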
The problem is that as soon as the writer thread starts accessing that memory,
you can start polluting cache and/or the translation lookaside buffer with no
assurance that they will be flushed if a reader thread is dispatched on that
processor before the writer thread completes its updates to memory. The
way around that problem is to use two different memory mappings for the
same memory, one for use by writer threads and one for use by reader threads.
This trick depends on the cache being indexed by virtual addresses, not real
addresses, if you're using cache as the memory hierarchy level. Also,
addresses stored in memory should be valid in the reader memory mapping if
you don't want readers doing offset/address-to-address conversion.
Granularity of memory management would be the cache line size or page size
depending on which mechanism you are using.
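The mapping half of the trick needs nothing exotic; a minimal sketch with
POSIX shared memory (made-up object name, no RCU plumbing):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;    /* one page: the management granule */
    int fd = shm_open("/rcu_two_views", O_CREAT | O_RDWR, 0600);
    shm_unlink("/rcu_two_views");          /* keep the object anonymous */
    ftruncate(fd, len);

    /* Two virtual mappings of the same physical page: writers update
     * through one, readers only ever touch the other. */
    char *wview = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *rview = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    strcpy(wview, "updated through the writer mapping");
    printf("reader view sees: %s\n", rview); /* same storage, distinct VAs */

    munmap(wview, len);
    munmap(rview, len);
    return close(fd);
}

Of course, on a physically indexed/tagged cache both views hit the same
lines, which is exactly the caveat above; the sketch only shows the
address-space plumbing.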
On one level, this is similar to what virtual memory managers do when managing
real memory. In fact, one of the earliest uses of an RCU-like technique was
by MVS to determine when it was safe to reallocate real memory to another
mapping, with SIGP signaling to speed things up if it needed to steal pages.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
Madness. Reading all that patent-obfuscated English has addled your brain.
Cache is supposed to be a transparent abstraction. So is the TLB (software
TLBs notwithstanding). Breaking that will break anyone's ability to understand
what the system is doing, just in order to try (without necessarily succeeding)
to optimize something that isn't a performance bottleneck.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
Some of the technical publications aren't any better. :) Most of the patents
are on techniques published by the inventors.
>
> Cache is supposed to be a transparent abstraction. So is the TLB (software
> TLBs notwithstanding). Breaking that will break anyone's ability to
> understand what the system is doing, just in order to try (without
> necessarily succeeding) to optimize something that isn't a performance
> bottleneck.
>
You don't need it on most systems, since they have dependent load ordering.
They're perfectly capable of running slow without any extra assistance.
But this would allow a more relaxed cache protocol / memory model to be
used that would allow improved hardware performance. Anyway, after
the debacle of the x86 memory model's semantics being confused with its
implementation, and the attempted hacks to improve global load ordering,
I would think everyone would be more attuned to the issue.
Don't be so certain of the last. I see both cache and TLB handling
being a performance bottleneck on a daily basis, and one of the
solutions to this would involve making them more visible.
Regards,
Nick Maclaren.
[...]
>
> It seems to me you could use the memory hierarchy to accomplish the same
> effect. I'm assuming that out-of-order dependent memory accesses won't
> cross memory hierarchy boundaries.
>
> There are two levels of the memory hierarchy we could exploit, possibly
> without any special kernel support. One is cache and the other is virtual
> memory. Basically, when memory is freed with proper release semantics,
> either explicit or supplied by RCU, you use another form of RCU to
> determine when all processors have flushed or purged references to that
> memory from their cache or from their translation lookaside buffer. The
> purge/flush instructions are the quiescent states for the second form of
> RCU. Once all processors have quiesced (dropped references) it is safe
> for writer threads to use that memory for updates using RCU.
>
[...]
Some forms of speculative execution might break this though by fetching
unrelated data into cache. The most likely would be array traversal
since it can generate addresses not actually valid for the program in
question.
Compilers can turn data dependencies ("dd") into control dependencies
("cc") and reorder, "incorrect" software value prediction aside for a
moment. AFAICS, 20020194436 is about hacking
http://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
It won't help against "incorrect" hardware value prediction, I'm afraid.
http://www.cs.wisc.edu/~cain/pubs/micro01_correct_vp.pdf
regards,
alexander.
The OP is addressing hardware issues, not compiler issues.
>
> http://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
Which is why I used the word "most". Dependent load ordering isn't part
of any of the official memory models as far as I know, so the behavior
was allowable. It's also allowable on x86, so it's conceivable that
future Intel or AMD processors could require memory barriers to guarantee
dependent load ordering. Since that hasn't happened yet, it's impossible
to tell how expensive the memory barrier would be. Since multi-threaded
synchronization mechanisms come with certain assumptions about efficiency,
it's always good to have alternative implementations as part of your
backup plans. It's a robustness-of-design issue.
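As an illustration of the backup plan (mine, no one's production RCU), the
reader's dependent load can be spelled so the ordering requirement is
explicit, letting an implementation for a dependence-breaking processor
supply the barrier:

#include <stdatomic.h>

struct node { int data; };
static _Atomic(struct node *) head;

void publish(struct node *n)     /* writer: release store */
{
    atomic_store_explicit(&head, n, memory_order_release);
}

int reader(void)
{
    /* memory_order_consume documents the data dependency; compilers
     * today typically promote it to acquire, which is the fallback a
     * dependence-breaking processor would need anyway. */
    struct node *n = atomic_load_explicit(&head, memory_order_consume);
    return n->data;   /* dependent load: must see the node's init */
}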
> It won't help against "incorrect" hardware value prediction, I'm afraid.
Not at the cache level, but you could do it at the virtual memory level
by mapping the new data into the reader memory address range as part of
the release operation. It's probably more expensive than you want it to
be, but it has to work, because that's what the virtual memory manager
does when it pages in from backing store.
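A sketch of that release-by-remap, assuming Linux semantics where MAP_FIXED
replaces an existing mapping (publish_page is my name, not a real call):

#include <sys/mman.h>
#include <sys/types.h>

/* Publish: splice the page holding the new data into the readers'
 * address range; the old mapping at reader_va is replaced atomically,
 * so readers see either the old page or the new one.  reader_va must
 * be page-aligned and previously reserved. */
void publish_page(void *reader_va, int fd, off_t offset, size_t pagesz)
{
    (void)mmap(reader_va, pagesz, PROT_READ, MAP_SHARED | MAP_FIXED,
               fd, offset);
}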
The "something that isn't a performance bottleneck" was referring to
acquire memory barriers. Sorry if that wasn't clear.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
Your knowledge is inaccurate.
> It's also allowable on x86
In Alpha speak, x86 loads from WB memory (I mean processor consistency)
all have trailing "MB".
regards,
alexander.
However, that's not what the original poster wanted to optimise - he
wanted to get some benefits for RCU by effectively breaking cache
transparency.
>
> Regards,
> Nick Maclaren.
--
Sander
+++ Out of cheese error +++
I meant the specification. I try not to conflate specification and implementation.
Nick is more correct. I was proposing it as a mechanism for preserving
some of RCU's (or lock-free's in general) performance if
hardware were changed to improve its performance by relaxing some of the
strict conditions it has to meet now. E.g. relaxing the cache protocols
could improve performance by reducing cache invalidates if the de facto
memory model did not require them.
RCU was developed by Paul McKenney to allow, in part I think, better
performance in a NUMA environment. If you exposed and possibly changed
cache behavior in a so-called UMA environment you could get some of the
same benefits.
x86 loads from WB memory are loads ".acq" per specification of processor
consistency given in
http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf
I mean
http://groups.google.com/group/comp.programming.threads/msg/88755b3cd32f9f8c
regards,
alexander.
Conflation or non-conflation aside for a moment, I'm just curious what
your lock-free sem_getvalue() for x86 would look like. Care to share?
regards,
alexander.
Depends. The documentation for that is a little strange and it's not
clear if sem_getvalue has any practical use.
To be on the safe side you might have to use the same synchronization
that the rest of the semaphore implementation uses to set the semaphore
value. Though I might be inclined, if it looks like all you could use
sem_getvalue for is as a simple event synchronization object, to just
do a simple load followed by an mfence, acquire semantics. It's basically
a question of what the observable behavior is, no matter what the
documentation says.
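Something like this, say - a sketch against a made-up semaphore layout, not
any real libc's:

#include <emmintrin.h>    /* _mm_mfence (SSE2) */

typedef struct { volatile int value; /* ... waiters, etc. ... */ } xsem_t;

int xsem_getvalue(xsem_t *sem, int *sval)
{
    *sval = sem->value;   /* plain x86 load: acquire comes "for free" */
    _mm_mfence();         /* belt and braces for the event-object case */
    return 0;
}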
I see you've been looking for places to use your compare and
swap "synchronized" load trick. :)
Ah. No, I didn't get that.
However, my statement stands, even for those, though I don't see
it that often. It is a major issue for OpenMP and POSIX threads
codes that attempt to deliver small-grain parallelism.
Regards,
Nick Maclaren.
http://www.opengroup.org/austin/mailarchives/ag/msg08586.html
>
> To be on the safe side you might have to use the same synchronization
> that the rest of the semaphore implementation uses to set the semaphore
> value. Though I might be inclined, if it looks like all you could use
> sem_getvalue for is as a simple event synchronization object, to just
> do a simple load followed by an mfence, acquire semantics. It's basically
For a simple x86 load from WB memory, you'll get acquire semantics
"for free" no matter whether you want it or not. But acquire doesn't
mean remote write atomicity... illusion of which can be achieved via
a dummy RMW inside lock-free sem_getvalue() (instead of a simple
load... irrespective of implied or extra added fences) on archs like
x86/WB (processor consistency) or Power.
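In code, the dummy RMW version would look something like this (made-up
layout again); a fetch-and-add of zero compiles to a lock-prefixed xadd
on x86:

#include <stdatomic.h>

typedef struct { _Atomic int value; } xsem_t;

int xsem_getvalue(xsem_t *sem, int *sval)
{
    /* "lock xadd" with 0: reads the value via a fully-fenced RMW,
     * giving the illusion of remote write atomicity that a plain
     * load (fenced or not) cannot provide on PC-like machines. */
    *sval = atomic_fetch_add(&sem->value, 0);
    return 0;
}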
regards,
alexander.
Depends on what you think "synchronize memory" means for POSIX semaphores,
what is actually observable, and what you mean by "remote write atomicity".
I think it means much more stricter synchronization protocol than what it
means for mutexes... due to semaphore value exposure, not mere locking as
for mutexes.
> what is actually observable,
Semaphore operations (lock/unlock/getvalue) are meant to be fully fenced
and to provide the illusion of remote write atomicity.
> and what you mean by "remote write atomicity".
It's what makes PC (x86/WB) != TSO.
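For the record, the standard IRIW litmus test (independent reads of
independent writes) is where the difference shows:

#include <stdatomic.h>

static _Atomic int x, y;
static int r1, r2, r3, r4;

void writer1(void) { atomic_store(&x, 1); }
void writer2(void) { atomic_store(&y, 1); }
void reader1(void) { r1 = atomic_load(&x); r2 = atomic_load(&y); }
void reader2(void) { r3 = atomic_load(&y); r4 = atomic_load(&x); }

/* Run the four on four processors.  r1==1, r2==0, r3==1, r4==0 means
 * the readers disagreed on the order of the two writes: forbidden
 * under TSO (writes are remotely atomic), allowed under plain PC. */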
regards,
alexander.
I don't think that the cost of acquire barriers, specifically, is the
cause of any performance problems with OpenMP and pthreads in providing
small-grain parallelism. Note that these barriers are almost certainly
not needed anyway on x86[-64], PPC or SPARC.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
Not acquire barriers in the very limited sense, no. But the logical
extension of them from the ISA to the HLL interface. I agree that
fiddling with the semantics to cure a hardware non-problem is not
worth the bother, but doing so to address the very real problem I
am referring to is.
In particular, if you make cache and TLB control explicit, an OpenMP
program (not POSIX threads, which is beyond redemption) could insert
the relevant calls at the relevant places. With competent hardware
design, you could even get - heresy - checking!
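For instance (pure invention on my part - cache_wb_range() is a stand-in
for whatever the hardware would expose, stubbed as a no-op so this
compiles):

#include <stddef.h>

static void cache_wb_range(const void *p, size_t n)
{
    (void)p; (void)n;   /* stand-in: write back/invalidate the range */
}

void scale(double *a, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] *= 2.0;

    /* The "relevant call at the relevant place": push the updates out
     * before another processor's threads come looking for them. */
    cache_wb_range(a, n * sizeof *a);
}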
But I agree that what I was referring to was rather more ambitious
than what most other people may have been thinking of, despite the
fact that it IS the same issue, viewed in the large.
Regards,
Nick Maclaren.
Do you really believe that programmers can handle full explicit control
of the cache, which could very well cause a hang or extreme starvation?
I'm not even sure runtime and compiler writers would be able to handle it
very well - though maybe that's because I haven't given it a lot of
thought. Shouldn't they (both general programmers and compiler/runtime
writers) be much better off with an "advice" mechanism? Prefetch is one
such advice, and you can easily imagine the reverse of prefetch, and
prefetch (and its reverse) for the TLB as well.
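On x86 the cache half of that advice pair already exists; a small sketch:

#include <xmmintrin.h>    /* _mm_prefetch */
#include <emmintrin.h>    /* _mm_clflush  */

void use_once(char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);  /* advice: pull the line in early */
    /* ... read or write *p ... */
    _mm_clflush(p);                /* the "reverse": evict the line */
}

The TLB analogue is the part with no user-level equivalent today.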
> program (not POSIX threads, which is beyond redemption) could insert
> the relevant calls at the relevant places. With competent hardware
> design, you could even get - heresy - checking!
>
> But I agree that what I was referring to was rather more ambitious
> than what most other people may have been thinking of, despite the
> fact that it IS the same issue, viewed in the large.
BTW, explicit control of the TLB is not impossible on the current
generation of processors - most modern hardware supports enough features
(even if in a limited fashion) for system software to provide such a
feature, although it is generally not implemented nor exposed to
user-level software in a generally useful form.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
You probably mean ddacq in conjunction with loads (RMW stuff aside
for a moment), because extra sync is certainly needed for both ccacq_*
and plain acq loads on PPC.
regards,
alexander.