I have an idea about a 'mainly' lock free implementation that I'd
appreciate comments on.
I'm attempting to solve the same sort of problem as Joe Seigh's
atomic_ptr (http://atomic-ptr-plus.sourceforge.net/), but in a way
that will work on win64 and win32 (so I want to avoid assembly and
reliance on instructions that are not present on some processors).
What I'm trying to have is a global pointer, but I need to be able to
replace that pointer with a new value every so often (it's
configuration data that needs to be updateable).
Obviously replacing it is easy, but then I have to track the previous
value (which may still be in use) to ensure that it is destructed
correctly.
This is a problem that I have hit many times and I'm sure I'm not
alone in that.
The problem with the atomic_ptr approach (for me) is that it stores
reference data in the same double word (64 bits) as the pointer
itself.
This is fine if you have a CAS instruction that works with a double
word, but AMD64 (and hence win64) does not reliably provide one (there
it would need to be a 128-bit CAS).
Fortunately for me atomic_ptr is more flexible than I need as I know
my updates will only occur infrequently and I know at compile time how
many of these global pointers I will want.
So my idea is to separate the pointer from the reference count, and to
just track the reference count.
In order to do this I need to have somewhere to keep the pointers, so
my global pointer becomes a singleton containing an array of pointers
and a matching array of reference counts.
When working with the reference counts I need to know some things
about the pointers, so I reserve two bits in each count, one for
IsAvailable and one for IsLocked.
When a pointer is being changed IsLocked is set and no other function
will read or write that pointer (they will work with another one).
When a pointer has been set, IsAvailable is set for that count and
cleared for all the other counts (noting that having two available
pointers is fine, having none is not).
When setting a pointer it doesn't matter where I work from - I can
either start at the beginning each time and take the first one that I
can, or I can rotate around the buffer.
Similarly when asked for the 'current' pointer I can either track a
rotating index or simply return the first that is available - the
latter may have a race condition that prevents pointers from being
freed, but I'll look into that (I don't think it does 'cos IsAvailable
won't be set).
When a 'pointer' is requested from the object it will return an object
of a local class that contains a reference to the singleton and the
array index of a pointer.
Each copy of the local class will increment the reference count for
the index (which may have IsAvailable set and will not have IsLocked
set).
Each destructor of the local class will decrement the reference count;
when it gets to zero, if IsAvailable is not set, IsLocked will be set,
the pointed-to object will be destructed, the pointer set to null, and
the ref count reset to zero (both flags cleared).
IsAvailable signifies that the pointer is available for new local
pointers, existing local pointers will be quite happy regardless of
whether IsAvailable is set or not.
All changes to the reference count (or the flags) will be done using
CAS - and note that if the CAS fails it may be necessary to move to a
different array entry.
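To make the scheme concrete, here is a rough sketch of how it might be
coded. All the names here are invented for illustration, std::atomic
stands in for the win32 Interlocked* calls, and the writer-side mutex
question is glossed over - treat it as a sketch of the idea, not a
finished implementation:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Each slot pairs a pointer with a counter word whose top two bits hold
// IsAvailable and IsLocked; the rest of the word is the reference count.
// Every counter update goes through CAS, as described in the post.
struct Config { int value; };

constexpr std::uint32_t kAvailable = 1u << 31;  // IsAvailable flag
constexpr std::uint32_t kLocked    = 1u << 30;  // IsLocked flag
constexpr int           kSlots     = 4;         // compile-time tuning knob

struct Slot {
    std::atomic<Config*>       ptr{nullptr};
    std::atomic<std::uint32_t> count{0};        // flags | refcount
};

Slot g_slots[kSlots];

// Reader side: find an available, unlocked slot and bump its refcount.
int acquire_slot() {
    for (;;) {
        for (int i = 0; i < kSlots; ++i) {
            std::uint32_t c = g_slots[i].count.load();
            while ((c & kAvailable) && !(c & kLocked)) {
                if (g_slots[i].count.compare_exchange_weak(c, c + 1))
                    return i;                   // pointer pinned
            }
        }
        // no slot available yet - retry (spin)
    }
}

// Reader done: drop the refcount; if it hits zero and IsAvailable is
// clear, lock the slot, destruct the pointer, and reset the count.
void release_slot(int i) {
    std::uint32_t c = g_slots[i].count.fetch_sub(1) - 1;
    if (c == 0) {                               // zero count, no flags set
        std::uint32_t expected = 0;
        if (g_slots[i].count.compare_exchange_strong(expected, kLocked)) {
            delete g_slots[i].ptr.exchange(nullptr);
            g_slots[i].count.store(0);          // slot free for reuse
        }
    }
}

// Writer side: take the first free slot, install the new pointer, set
// IsAvailable on it, and clear IsAvailable on every other slot.
void publish(Config* p) {
    for (;;) {
        for (int i = 0; i < kSlots; ++i) {
            if (g_slots[i].ptr.load() != nullptr)
                continue;                       // old pointer still draining
            std::uint32_t expected = 0;
            if (!g_slots[i].count.compare_exchange_strong(expected, kLocked))
                continue;
            g_slots[i].ptr.store(p);
            g_slots[i].count.store(kAvailable); // refcount 0, available
            for (int j = 0; j < kSlots; ++j)
                if (j != i)
                    g_slots[j].count.fetch_and(~kAvailable);
            return;
        }
        // all slots busy - the writer has to wait/spin
    }
}
```

The local smart-pointer class from the post would then just call
acquire_slot() on construction/copy and release_slot() in its
destructor, exposing g_slots[i].ptr to the caller in between.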
There are obviously limitations on this:
* It's a singleton, with the inherent issues that causes, but as this
is intended for the sort of config data that is explicitly read at
startup I can live with those constraints - and if you want to copy it
you can have a shared_ptr to it (the copying of that will lock, but
accesses through it won't).
* If anyone takes a copy of the bare pointer returned from the local
object then all bets are off, but the same applies with any reference
counting implementation.
* The time between updates must not be less than the time it takes for
all readers to free their local pointers, multiplied by the size of the
arrays (if it is, the writer will wait/spin/block) (this is a compile
time tuning issue; the size of the arrays will probably be a template
parameter).
* It wastes some memory on startup, but at least it's constant.
Two additional features that I have thought about but don't currently
think I need:
1. An IsNull flag - will be needed if there is any way for a thread to
read a pointer after IsAvailable is cleared and RefCount reaches zero
but before the pointer is set to null; IsLocked is intended to sort
that out.
2. A version number to prevent ABA - I don't think I should get an ABA
problem (only one reference can exist when refcount gets to zero, and
it will set IsLocked for the duration of the destruction).
I may want to have a lightweight mutex of some sort on setting a
pointer, though in my uses of the class setting will always be done
from one thread.
This is why I class it as 'mainly' lock free: I'm after lock-free high
speed reading, writing can take its time.
The use of IsLocked does not actually cause locking on reading - there
will always be at least one unlocked available pointer (even if it's
null).
A few questions that I'd be grateful for answers to:
1. Does this make sense (i.e. have I explained enough)?
2. Does it look like it should work (presuming, of course, that I code
it correctly)?
3. Are you aware of any existing implementation that either works like
this or solves the same problem?
4. Am I way out of my depth/wasting my time?
5. I have seen derogatory comments about the win Interlocked*
functions, could you direct me to something that details the
deficiencies in these functions?
6. Any other comments at all?
Thank you for even bothering to read this far, and thank you again if
you have any comments.
--
J.T.
Please reply via the newsgroup.
You can create a single-word atomic reference counted pointer with SMR
that can outperform atomic_ptr on x86 (totally IA-32 specific); 'lock
cmpxchg8b' can be expensive. SMR-based reference counting doesn't have
to rely on the hefty lock prefix on x86; it can use the "more"
lightweight mfence instruction...
--
http://appcore.home.comcast.net/
(portable lock-free data-structures)
But in other discussions people had reported that mfence does not
work on AMD or some other Intel CPUs. Were these issues solved
in the meantime?
You simply can't avoid assembly language wrt lock-free programming. A C
compiler is not trustworthy enough; it can and will screw you.
;(
I believe it's part of SSE2. You should try to avoid using the lock
prefix whenever you can.
Instructions. You need an additional membar for the release semantics if you
are storing into a non null hazard pointer.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
Even with the _InterlockedCompareExchange() intrinsic function on
Windows?
Why? The risk of it optimising things into a different order?
I know the first time I do this I'll spend a lot of time working with
the compiled assembly code.
So, for the x86 SMR implementation:
.align 16
.globl ac_i686_lfgc_smr_activate
ac_i686_lfgc_smr_activate:
movl 4(%esp), %edx
movl 8(%esp), %ecx
; you would need to add an extra mfence right here
; mfence
; before the compare loop?
ac_i686_lfgc_smr_activate_reload:
movl (%ecx), %eax
movl %eax, (%edx)
mfence
cmpl (%ecx), %eax
jne ac_i686_lfgc_smr_activate_reload
ret
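In rough portable terms, the loop above publishes the shared pointer
into the hazard slot, fences, and re-checks that the shared pointer has
not changed. A sketch using std::atomic, purely for illustration (the
names mirror the assembly; this is not the original appcore source):

```cpp
#include <atomic>
#include <cassert>

// Portable rendering of the ac_i686_lfgc_smr_activate loop above.
void* smr_activate(std::atomic<void*>& hazard, std::atomic<void*>& shared) {
    void* p;
    do {
        p = shared.load();                      // movl (%ecx), %eax
        hazard.store(p);                        // movl %eax, (%edx)
        std::atomic_thread_fence(
            std::memory_order_seq_cst);         // mfence
    } while (p != shared.load());               // cmpl / jne -> reload
    return p;   // hazard pointer now published and validated
}
```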
Humm, this changes things...
Well, I am "not totally sure" of Microsoft's specific guarantees wrt
implementing sensitive multi-threaded algorithms. It might be safe...
Yet again, it might not be so safe after all...
:O
>
> Why? The risk of it optimising things into a different order?
Exactly.
> I know the first time I do this I'll spend a lot of time working with the
> compiled assembly code.
;)
Gee, Chris. I have this odd feeling of deja vu. Haven't we, uh, been
here before? ;-)
The problem with all this is that Microsoft code and documentation was
written for a really primitive in-order X86 architecture. The code has
been adapted somewhat randomly along the way, documentation has not been
updated to reflect DESIGN INTENT, and in fact it's by no means clear
that there really IS any design intent.
So you can experiment and bet your code that they INTEND it to do what
you currently observe, and risk that they'll change it, and that future
adaptations will remain true to your assumption... or you can play a
little safer and go by the most loose interpretation of what they
actually SAY. Or you can be paranoid and assume they'll probably screw
it up in the future for their own reasons (or for no reason at all), and
not trust it as far as it could throw you...
--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062
I am basically saying that you can't really trust Microsoft compilers wrt
lock-free programming.
> So you can experiment and bet your code that they INTEND it to do what you
> currently observe, and risk that they'll change it, and that future
> adaptations will remain true to your assumption... or you can play a
> little safer and go by the most loose interpretation of what they actually
> SAY. Or you can be paranoid and assume they'll probably screw it up in the
> future for their own reasons (or for no reason at all), and not trust it
> as far as it could throw you...
Exactly.
> Chris Thomasson wrote:
>>> Even with the _InterlockedCompareExchange() intrinsic function on Windows?
>>
>> Well, I am "not totally sure" of Microsoft's specific guarantees wrt
>> implementing sensitive multi-threaded algorithms. It might be safe...
>>
>> Yet again, it might not be so safe after all...
>
> Gee, Chris. I have this odd feeling of deja vu. Haven't we, uh, been
> here before? ;-)
>
> The problem with all this is that Microsoft code and documentation was
> written for a really primitive in-order X86 architecture. The code has
> been adapted somewhat randomly along the way, documentation has not been
> updated to reflect DESIGN INTENT, and in fact it's by no means clear
> that there really IS any design intent.
>
> So you can experiment and bet your code that they INTEND it to do what
> you currently observe, and risk that they'll change it, and that future
> adaptations will remain true to your assumption... or you can play a
> little safer and go by the most loose interpretation of what they
> actually SAY. Or you can be paranoid and assume they'll probably screw
> it up in the future for their own reasons (or for no reason at all), and
> not trust it as far as it could throw you...
>
Yes, we have been here before. This is the "Microsoft has no formal
explicit semantics but neither does Posix" situation. The only thing we
can say about Posix is that there are a few well-known thread design
patterns that are widely considered to be supposed to work in Posix,
DCL (DCSI) being an example of a pattern that does not work in Posix.
I'd name some patterns that do work in Posix but I don't think we have
names for those patterns.
There's some work being done to define a formal thread model for C++ but
looking at the mailing list discussion so far, they look on track to become
the poster child of how not to do it, i.e. using a meta implementation to
define semantics. Plus they're forgetting that part of what started people
to seriously consider formally defining threads for C++ was lock-free programming
which depends heavily on pointers and pointer abstraction. And C++ pointer
abstraction is somewhat lame to make an understatement. If they would look
at Java, they would see that there is great reliance on being able to
distinguish between thread local references and heap references, the
JVM exploiting the level of pointer indirection that occurs in the JVM.
Considering there are two people that have a background in Java
threads, that's a rather strange oversight.
I suspect after all the work is done on the C++ thread model you will end up
with stuff that will be considered unsafe to run on any system but Microsoft
windows which has a high tolerance for errors (by windows users that is, not the OS).
Which is where we are today.
http://www.microsoft.com/whdc/driver/kernel/MP_issues.mspx
The paper unfortunately is in doc format. Much of what's stated in the
paper is applicable in user mode as well.
/Michel
I don't understand why Microsoft can't just write some simple memory
visibility guarantees. Something like this would be a start:
Let's take the MS CAS Acquire/Release API...
InterlockedCompareExchangeAcquire
------------------------------------
LONG InterlockedCompareExchangeAcquire(
    LPLONG dest,
    LONG xchg,
    LONG cmp
);
Function Operation
-----------
This function atomically compares (*dest) with (cmp), and exchanges (*dest)
with (xchg) if they are equal. No update operation is performed on the
(*dest) if they were not equal. The function always returns the initial
value of (*dest).
Operation Memory Visibility
-----------
This function guarantees that the return value is loaded, or any
updates to (*dest) are fully applied and made visible to all CPUs,
before any subsequent loads/stores are applied or made visible to any
CPU. All Microsoft compilers are fully compliant with the guaranteed
behavior of this function. They will also not optimize anything around
calls to this function that might break the stated guarantees.
InterlockedCompareExchangeRelease
------------------------------------
LONG InterlockedCompareExchangeRelease(
    LPLONG dest,
    LONG xchg,
    LONG cmp
);
Function Operation
-----------
This function atomically compares (*dest) with (cmp), and exchanges (*dest)
with (xchg) if they are equal. No update operation is performed on the
(*dest) if they were not equal. The function always returns the initial
value of (*dest).
Operation Memory Visibility
-----------
This function guarantees that any preceding loads/stores are fully
applied and made visible to all CPUs before the return value is loaded
or any updates to (*dest) are applied or made visible to any CPU. All
Microsoft compilers are fully compliant with the guaranteed behavior of
this function. They will also not optimize anything around calls to
this function that might break the stated guarantees.
Humm, I wonder if that would be good enough explanation of guaranteed
behavior wrt memory visibility/compiler-safety for acquire/release
semantics...
;)
I wonder if this would also be sufficient:
Generic Mutex Implementation
- Memory Visibility Guarantees
--------------------
A thread of execution in a multi/uni-processor environment will always
observe the following states during the entire (lock=>unlock) state
transformation inherent in any mutex, in the exact order shown:
lock c-section unlock
---------------------------
1 => 2 => 3
State 2 can never be observed before state 1, and state 3 can never be
observed before states 2 or 1. The "=>" symbol represents a state transition
point. Effects of any prior states are required to be fully visible to any
new states.
Humm, these rules seem to imply explicit acquire/release semantics
naturally. The transition from (1=>2) requires acquire in order to make 1
become visible before 2, and the transition from (2=>3) requires release in
order to make 2 become visible before 3. Any thoughts on using state
transition flowcharts for memory visibility documentation?
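The state-transition rules above map naturally onto an acquire
operation at lock and a release operation at unlock. A minimal spinlock
sketch (names invented here, std::atomic used for illustration) showing
where the two barriers land:

```cpp
#include <atomic>
#include <cassert>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // 1 => 2: acquire semantics make the lock acquisition visible
        // before any load/store inside the critical section.
        while (flag_.test_and_set(std::memory_order_acquire))
            ;   // spin until the flag is clear
    }
    void unlock() {
        // 2 => 3: release semantics make all critical-section stores
        // visible before the lock is seen as free.
        flag_.clear(std::memory_order_release);
    }
};
```

Any thread observing the flag clear (state 3) is guaranteed, by the
release barrier, to also observe the effects of state 2.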
> The problem with all this is that Microsoft code and documentation was
> written for a really primitive in-order X86 architecture. The code has
> been adapted somewhat randomly along the way, documentation has not been
> updated to reflect DESIGN INTENT, and in fact it's by no means clear that
> there really IS any design intent.
The evidence we have from various unofficial sources and from the
evolving documentation is that Microsoft intends the Interlocked* functions
to continue to have on all platforms the exact same memory visibility
semantics they happen to have on current x86 platforms. This seems to be as
close to a guarantee as we are ever going to get.
DS
"exact same memory visbility semantics they happen to have on current x86
platforms"
Can you tell us what these are?
Ta!
Doug
> Just for those of us in the cheap seats:
>
> "exact same memory visbility semantics they happen to have on current x86
> platforms"
>
> Can you tell us what these are?
They have full memory barriers both before and after the operation.
DS
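In later C++ memory-model terms, "full memory barriers both before and
after the operation" corresponds to a sequentially consistent CAS. A
sketch of the analogy using std::atomic (this mimics the documented
return-value behavior of InterlockedCompareExchange; it is an
illustration, not Microsoft's implementation):

```cpp
#include <atomic>
#include <cassert>

// Full-barrier CAS: seq_cst orders the operation against all
// surrounding loads and stores, matching the claimed guarantee.
long full_barrier_cas(std::atomic<long>& dest, long xchg, long cmp) {
    long expected = cmp;
    dest.compare_exchange_strong(expected, xchg,
                                 std::memory_order_seq_cst);
    return expected;  // the initial value of *dest, as documented
}
```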
acquire & release semantics ( st.rel / ld.acq ) are held for loads and
stores to the same locations; you sometimes have to use full membars (
mfence, LOCK prefix ) to enforce the ordering of a "critical-sequence" of
loads/stores to different locations on IA-32.
Subsequent operations to different locations can migrate above a st.rel
without a membar on IA-32...