Regards
Galia
You probably won't like this answer, but if EnterCriticalSection and
LeaveCriticalSection are the performance bottlenecks in your
application, perhaps you should rethink your design. As another poster
has said, an uncontested critical section is acquired pretty quickly.
Why do you need to know this?
David
--
Best regards,
Vladimir Zinin
mailto:vzi...@gmail.com
[...]
That is a false assertion. Describing the operation as "extremely fast" is
misleading, to say the least. You probably don't know that acquiring
and releasing an uncontended lock involves the use of 2 interlocked RMW
instructions and 2 memory barriers, one of which is very expensive. Example
using SPARC membar instruction:
lock() {
    Interlocked RMW
    membar #StoreLoad | #StoreStore // expensive!
}
unlock() {
    membar #LoadStore | #StoreStore
    Interlocked RMW
}
On Intel architectures, the #StoreLoad | #StoreStore memory barrier is
handled by the LOCK prefix which is not a lightweight operation to say the
least:
http://groups.google.com/group/comp.programming.threads/msg/aeb7e857ef440520
http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce
FWIW, here is a paper that details the Intel Memory Model:
http://developer.intel.com/products/processor/manuals/318147.pdf
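For concreteness, here is a minimal C11 sketch (my own illustration, not from any poster's library) of where those costs live in a lock: acquiring takes an interlocked RMW with an implied full barrier (a LOCK-prefixed instruction on x86), while releasing only needs a release store (a plain MOV on x86):

```c
#include <stdatomic.h>

/* Hypothetical minimal spinlock, sketched to show where the costs sit.
   Names are mine; assumes C11 atomics. */
typedef struct { atomic_flag locked; } spinlock_t;

static inline void spin_init(spinlock_t *s) {
    atomic_flag_clear(&s->locked);
}

static inline void spin_lock(spinlock_t *s) {
    /* Interlocked RMW + full (#StoreLoad) barrier: on x86 this is a
       LOCK-prefixed instruction, the expensive part. */
    while (atomic_flag_test_and_set_explicit(&s->locked,
                                             memory_order_acquire))
        ; /* spin */
}

static inline void spin_unlock(spinlock_t *s) {
    /* Release store only: a plain MOV on x86, no LOCK prefix needed. */
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

Note the asymmetry: the expensive interlocked operation sits on the acquire side only, which is exactly why the "uncontended locks are free" claim is being disputed here.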
Well, minimizing synchronization is worthwhile, because you can't get faster
than nothing at all. But a critical section is (one of) the fastest
synchronization primitives. Lockless queues can often get down to only one
interlocked read/modify/write, but other synchronization methods are far
more costly. Right?
Not really true.
> Lockless queues can often get down to only one interlocked
> read/modify/write, but other synchronization methods are far more costly.
> Right?
Yes and no:
- Yes, in the sense that some of the algorithms can be implemented with
interlocked operations.
- No, in the sense that some of the algorithms can be implemented without any
interlocking whatsoever:
http://groups.google.com/group/comp.programming.threads/msg/cab0505f87084c2b
(I describe how many instructions my vZOOM queuing uses...)
___________________________________
> It's certainly a scalable approach, and if your virtually-zero overhead is
> really virtually zero,
Well, it actually is... For instance, on x86 the push function is made up of
five instructions (5 mov). The pop function is (1 push; 5 mov; 1 cmp; 1 xor;
1 pop; 1 ret) for 10 instructions... Push and pop combined is only 15
instructions. IMHO, that is virtually zero overhead when compared to an
implementation that uses the LOCK prefix or an mfence instruction.
___________________________________
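The instruction counts above are for code that isn't shown in the thread. As a hedged illustration of the same general idea (not the vZOOM code; all names and the bounded layout are mine), here is a single-producer/single-consumer queue in C11 that needs no interlocked RMW at all: one release store per push, one acquire load per pop, both plain MOVs on x86:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdbool.h>

#define QCAP 64 /* capacity, power of two */

/* head is written by the consumer only, tail by the producer only,
   so neither side ever needs an interlocked RMW. */
typedef struct {
    void *slot[QCAP];
    atomic_size_t head;
    atomic_size_t tail;
} spsc_t;

static bool spsc_push(spsc_t *q, void *p) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP)
        return false; /* full */
    q->slot[t % QCAP] = p;
    /* release store: make the slot visible before the new tail */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

static bool spsc_pop(spsc_t *q, void **out) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)
        return false; /* empty */
    *out = q->slot[h % QCAP];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

A zero-initialized `spsc_t` is a valid empty queue. On x86 the acquire loads and release stores compile to ordinary MOVs; the explicit orderings only constrain the compiler and weaker architectures.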
Any thoughts?
[...]
It is not true when compared to wait-free queue implementations that do not
rely on interlocking or #StoreLoad/#LoadStore membar instructions (e.g., the
LOCK prefix or mfence instructions on x86)...
How do you handle cache incoherency without using LOCK or some other
interlocking mechanism? You could fill the queue on one core and the other
would never see a change.
By using the memory barriers that are already implied by the x86... I am
talking about an unbounded single producer/consumer wait-free queue that
relies on the assertions made in the following paper:
http://developer.intel.com/products/processor/manuals/318147.pdf
The queue has no #StoreLoad or even #LoadStore dependencies. The MOV
instruction actually provides more than enough memory ordering, according to
the paper linked above, which basically states that loads/stores on current
x86 behave as follows:
void* atomic_x86_loadptr(void* volatile* psrc) {
    /* atomic load */
    void* const ptr = *psrc;
    /* acquire-load membar */
    membar #LoadStore | #LoadLoad;
    return ptr;
}

void* atomic_x86_storeptr(void* volatile* pdest, void* const src) {
    /* release-store membar */
    membar #LoadStore | #StoreStore;
    /* atomic store */
    *pdest = src;
    return src;
}
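Those pseudo-ops map directly onto C11 atomics. This is a hedged restatement (the function names are mine): on current x86 both operations compile to a plain MOV, and the explicit memory orders only constrain the compiler and weaker architectures.

```c
#include <stdatomic.h>

/* Acquire load: the #LoadStore | #LoadLoad ordering of the pseudocode,
   free on x86 (plain MOV). */
static void *atomic_loadptr_acquire(_Atomic(void *) *psrc) {
    return atomic_load_explicit(psrc, memory_order_acquire);
}

/* Release store: the #LoadStore | #StoreStore ordering of the pseudocode,
   also free on x86 (plain MOV). */
static void *atomic_storeptr_release(_Atomic(void *) *pdest, void *src) {
    atomic_store_explicit(pdest, src, memory_order_release);
    return src;
}
```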
I can show you some pseudo-code if you want... Also, read all of this:
http://en.wikipedia.org/wiki/RCU
> You could fill the queue on one core and the other would never see a
> change.
The single consumer will always eventually get a coherent view of every new
node linked into the queue by the single producer. The consumer uses a
data-dependent memory barrier after the load of the shared pointer that
represents the tail of the queue. Here is an example of dependent barriers:
https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068
(read all)
Notice that reader threads can walk the linked-list while a writer thread
concurrently mutates it.
Looks much, much more expensive than an interlocked update to me...
The synchronize_rcu() function looks like a last-resort synchronization
mechanism I would use only if nothing else worked... It's extremely intrusive
and really only feasible for non-preemptive kernel code.
"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:OI226bZE...@TK2MSFTNGP02.phx.gbl...
For the reader side? Seriously?!
WRONG!
Read all about RCU! Do not get ignorant now!
RCU design objective was to reduce number of expensive waits, and use
non-blocking locks instead. In case of writer contention, RCU costs another
deep copy of the guarded object. If it's less expensive than kernel wait,
then it has an advantage over waiting locks.
But to implement RCU, ONE HAS TO USE ATOMIC OPERATIONS ANYWAY.
"So the typical RCU update sequence goes something like the following:
1. Remove pointers to a data structure, so that subsequent readers cannot
gain a reference to it.
2. Wait for all previous readers to complete their RCU read-side critical
sections.
3. At this point, there cannot be any readers who hold references to the
data structure, so it now may safely be reclaimed (e.g., freed, perhaps
using kfree in the Linux kernel)."
Step 2 still can involve heavy wait.
Sorry to disappoint you, but RCU is not a miracle, not a panacea, not a
latter-day testament. It's just a trick, useful sometimes, and nothing to get
so excited about.
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:gvCdnc3P7_YL24Xa...@comcast.com...
An interlocked update doesn't need LOCK prefix for the reader either, it
just needs the read to be tagged to avoid caching by an optimizing compiler
(i.e. behavior of volatile keyword before MSVC2005). So on read it is the
same, and on write the interlocked update is far faster.
[...]
The point of RCU is that it allows reader threads to access data-structures
without using any synchronization instructions whatsoever.
You don't seem to get it. Are you seriously saying that reader threads
iterating a linked-list using interlocked instructions are just as fast as
using no interlocking? If so, you are oh so wrong.
/* Writer Threads */
node_t *n = malloc(...);
mutex_lock(...);
/* init new node */
n->next = stack->front;
/* release barrier */
/* make new node visible to readers */
stack->front = n;
mutex_unlock(...);

/* Read-Only Threads */
/* read stack anchor */
node_t *nx, *n = stack->front;
/* dependent load-barrier -
   read_barrier_depends() for Linux. */
/* Iterate */
while (n) {
    /* read next node */
    nx = n->next;
    /* dependent load-barrier -
       read_barrier_depends() for Linux. Alpha only */
    /* do whatever */
    do_whatever(n);
    n = nx;
}
How can you possibly say that!
A real question is how useful that list is. You may want to remove items,
too.
If you have to add a REMOVE operation to it, life becomes more complicated.
You cannot have a reader without synchronization then.
By the way, there's no need to torture yourself with K&R-style /* */
comments. All commercially available and free compilers have supported //
for a long time now.
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:pfWdnUgLGvdS_YTa...@comcast.com...
[...]
You are 100% dead wrong about that! You can most certainly remove and
reclaim items from the list, while there are readers accessing it without
synchronization. Please study up on RCU's defer function:
// Writer Threads / Pop
node_t *n;
mutex_lock(...);
n = stack->front;
if (n) {
    stack->front = n->next;
}
mutex_unlock(...);
if (n) {
    rcu_defer(n);
}

// Writer Threads / Push
node_t *n = malloc(...);
mutex_lock(...);
// init new node
n->next = stack->front;
// release barrier
// make new node visible to readers
stack->front = n;
mutex_unlock(...);

// Read-Only Threads
// read stack anchor
node_t *nx, *n = stack->front;
/* dependent load-barrier -
   read_barrier_depends() for Linux. */
// Iterate
while (n) {
    // read next node
    nx = n->next;
    /* dependent load-barrier -
       read_barrier_depends() for Linux. Alpha only */
    // do whatever
    do_whatever(n);
    n = nx;
}
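To make the deferral concrete, here is a toy single-writer sketch of the rcu_defer idea (all names are mine; quiescence is modelled by an explicit drain call rather than real per-reader epoch tracking, so this only shows the shape of the deferral, not a working RCU):

```c
#include <stdlib.h>

/* Nodes queued for deferred reclamation instead of being freed
   immediately while readers may still hold references. */
typedef struct dnode {
    struct dnode *next;
    void *obj;
} dnode_t;

static dnode_t *defer_head = NULL; /* touched by the single writer only */

static void rcu_defer_sketch(void *obj) {
    /* Queue the object; no synchronization needed on a writer-only list. */
    dnode_t *d = malloc(sizeof *d);
    d->obj = obj;
    d->next = defer_head;
    defer_head = d;
}

/* Called once every reader is known to have passed a quiescent point;
   returns how many deferred objects were reclaimed. */
static int rcu_drain_sketch(void) {
    int n = 0;
    while (defer_head) {
        dnode_t *d = defer_head;
        defer_head = d->next;
        free(d->obj);
        free(d);
        ++n;
    }
    return n;
}
```

Real RCU implementations detect quiescence automatically (per-CPU counters, epoch grace periods); the point here is only that the free is decoupled from the unlink, which is what lets readers run without synchronization.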
The rcu_defer function allows you to reclaim the node after it has entered a
quiescent state. Also, look here:
http://groups.google.com/group/comp.programming.threads/msg/c8206c2535fe816e
/**/ is NOT supported by C compilers!
Ahh shit!
Stupid, stupid me!
I meant to say: // is NOT supported by Standard C compilers!
Sorry for any confusion!
I made a dumb mistake!
Study here as well:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/82376b7295e6ce1a
Also, PLEASE do not confuse PDR with GC!
[...]
An RCU list is FINE with insertions at the head, middle, or tail...
Please STUDY BEFORE you make massively NAIVE STATEMENTS!
Also, post over on comp.programming.threads. We know how to handle all of
your worries...
Check ISO/IEC 9899:1999
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:_cKdnY-ke6V_PITa...@comcast.com...
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:5dudnZXIQdxlwoTa...@comcast.com...
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:0KydnTFwA-IVHoTa...@comcast.com...
Yup. I was lacking sleep, which tends to induce a sort of stupor.
Sorry. However, the information I posted wrt RCU is correct.
> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:5dudnZXIQdxlwoTa...@comcast.com...
>> "Alexander Grigoriev" <al...@earthlink.net> wrote in message
>> news:OWGvbSlE...@TK2MSFTNGP05.phx.gbl...
>>> OK. RCU is useable in cases where, for writers, it's very cheap to
>>> allocate
>>
>> [...]
>>
>> The point of RCU is that it allows reader threads to access
>> data-structures without using any synchronization instructions
>> whatsoever.
"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:eN1HfA1E...@TK2MSFTNGP04.phx.gbl...
> Not so easy. For an object, owned by multiple readers, you still need to
> track its lifetime (by employing, for example, reference counts), while it
> can get replaced by a new modified object.
Easy? You don't have to track per-object lifetimes; that would be
inefficient to say the least. You track the epoch's lifetime:
http://appcore.home.comcast.net/vzoom/round-2.pdf
(Page 2/Section 1.2...)
_________
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/bded388b24fedf7
(read this before round-1...)
This is a trade secret! If you search comp.programming.threads the right way,
you will indeed see.
> Actually the code to lock the node n from deletion should be before
> fetching n (or, for non-head node, nx) in the reader thread.
You delete the node when it is quiescent. Research Paul E. McKenney!
http://www.rdrop.com/users/paulmck/
This is common knowledge wrt the Linux kernel community; you can do it on
Windows as well.
This is a sync-free operation wrt the object being deleted (the node in this
case), because it has already quiesced by the time the rcu_defer callback is
executed.
[...]
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:MoednUs9FqZF9Yfa...@comcast.com...
OOPS!
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/bded388b24fedf7b
"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:uSC37W3E...@TK2MSFTNGP02.phx.gbl...
> So you have to acquiesce all your reader threads to do explicit garbage
> collection. I'm not sure it scales well to large number of threads.
>
You do not have to stop your reader threads in order to track the
synchronization epochs... You don't have to stop-the-world or any of that
crap.