http://relacy.pastebin.com/f2e9297b6
Things are looking good so far. I can't seem to get Relacy to barf up any
errors in the random scheduler with 99999999 iterations of 7 threads. I am
currently running the test using the `rl::fair_full_search_scheduler_type'
scheduler. So far so good!
;^)
IMHO, the algorithm itself is very simple; it's based on a basic doubly
linked list. I have two counter variables: one keeps track of how many
pushes/pops are performed locally, and the other tracks how many pops are
performed remotely. Subtracting the remote counter from the local counter
is the main aspect of the invention, because the difference determines
whether a remote thread may steal a node from the linked list. I keep a
safe distance of 4 nodes (the distance is represented by the
`WSDEQUE_SPLIT' macro) in this example implementation. I am going to
experiment with a closer distance, but I am worried about touching freed
memory: a Relacy test I did with a distance of 2 bit the dust in about
10-15 seconds with an access-to-freed-memory error. Remote steals mutate
the tail of the list, and the local thread mutates the head. It's
lock-based on conflict, so the algorithm is fairly easy to reason about.
AFAICT, this unbounded deque is the simplest algorithm I have seen to date.
All the others I have seen are array-based, and they need to reallocate
everything and copy all previous items when the array becomes full. My
implementation is also more space-efficient because the user allocates
nodes on demand; there is no pre-allocated array of a fixed size. The
linked list is completely intrusive in nature, which adds even more to the
space efficiency.
Memory barriers aside for a moment, a push never needs locks, and a pop from
a deque whose size is equal to or greater than WSDEQUE_SPLIT needs no
locking either. A steal operation is non-blocking wrt other stealers
because it uses a try_lock operation to attempt to gain access. The times
that a stealer will block the local owner thread are few and far between.
Anyway, please examine the algorithm and tell me what you think; okay?
Thank you all!
BTW, I am licensing this code under GPL.
BTW, this invention is primarily based on an idea I published on this group
a while back:
http://groups.google.com/group/comp.programming.threads/msg/9eeebb522452f210
I cleaned up the bugs and Relacy seems to like it now...
;^)
If things go the way it looks like they are going, I am planning on
creating a general-purpose work-stealing environment under GPL. Relacy has
been blasting the shi% out of the algorithm using the fair search scheduler
for many hours now, and so far, all is well with the world...
lol.
Humm... Speaking of memory barriers, I have relaxed the membars considerably
in the posted example:
http://relacy.pastebin.com/f5f819070
I only use an acquire barrier to load from the counter variables, and a
release barrier to store to them. There is no #StoreLoad barrier on the
fast path. Relacy is not puking out any data-race errors. I think I may
have found a way to get around the nasty #StoreLoad barrier inherent in
the non-blocking deques I have seen. I need to do some more investigating.
If I have indeed found a way to get around the expensive membar, well,
this algorithm should really outperform the one in Cilk. Right now, I
can't see any need for a store-to-load pattern that has to be enforced.
This is going to get interesting...
;^)
Ahhhh... Relacy is detecting an access-to-freed-memory error on this
version in another test I am working on (not posted yet), while the
version with sequential consistency works okay.
Humm... I need to do some investigating. I have to admit, it's kind of a
pain to read through the lengthy output Relacy generates!
I am hoping that this algorithm turns out to work! Keeping fingers crossed.
;^)
Okay, so far this is the best I can do without pissing Relacy off real bad:
http://relacy.pastebin.com/f272dbf1a
I changed the test code itself to better simulate a "real" work-stealing
environment. I changed the members in the node class to use `rl::var'
instead of `rl::atomic' because no two threads can ever load/store from/to
them concurrently. Also, I changed the head pointer to `rl::var' because
only one thread ever touches it. I removed some superfluous code that did
not need to be in there (e.g., the double-checking logic). As for memory
barriers, well, I can get away with using them in only a few key places. I
created macros so that I can easily change them:
#define WSDEQUE_LOCAL_RCOUNT_LOAD_MEMBAR rl::memory_order_consume
#define WSDEQUE_LOCAL_LCOUNT_STORE_MEMBAR rl::memory_order_acq_rel
#define WSDEQUE_REMOTE_LCOUNT_LOAD_MEMBAR rl::memory_order_consume
#define WSDEQUE_REMOTE_RCOUNT_STORE_MEMBAR rl::memory_order_relaxed
The first two macros deal with the local functions (e.g., push() and
pop()), and the other two deal with the remote function (e.g., steal()).
If I weaken `WSDEQUE_LOCAL_LCOUNT_STORE_MEMBAR' to, say, a release
barrier, Relacy freaks out and reports an access-to-freed-memory error. In
other words, Relacy says the algorithm does not work when I do that.
However, if I leave the barriers as they are, everything works fine.
Well, at least the algorithm itself seems to be working.
BTW, can you think of any ways to get rid of the fuc%ing
`memory_order_acq_rel' barrier!!!???
Ouch! ;^o
Well, I tried something fairly odd and used an acquire barrier for the
stores:
#define WSDEQUE_LOCAL_RCOUNT_LOAD_MEMBAR rl::memory_order_consume
#define WSDEQUE_LOCAL_LCOUNT_STORE_MEMBAR rl::memory_order_acquire
#define WSDEQUE_REMOTE_LCOUNT_LOAD_MEMBAR rl::memory_order_consume
#define WSDEQUE_REMOTE_RCOUNT_STORE_MEMBAR rl::memory_order_relaxed
I also removed a store to the tail where I had accidentally forgotten to
specify relaxed memory order. Here is the code:
http://relacy.pastebin.com/f7d79cd0f
Here is the difference between the new code and the last one I posted:
http://relacy.pastebin.com/pastebin.php?diff=f7d79cd0f
Now, here is my question... Is using an acquire barrier for a store even
legal in C++0x? I mean, what would that look like on SPARC? Humm... I am
guessing it would be `#LoadStore | #LoadLoad'. Humm... But on the other
hand, since it's dealing with a store, it still might add a nasty
`#StoreLoad'. Need to think...
Anyway, Relacy seems to like the code with the acquire barrier on the stores
because the test is passing the random scheduler with 99999999 iterations of
4 threads.
Any thoughts!?
;^o
Heck, I can even get it to work with a `memory_order_consume' barrier!
#define WSDEQUE_LOCAL_RCOUNT_LOAD_MEMBAR rl::memory_order_consume
#define WSDEQUE_LOCAL_LCOUNT_STORE_MEMBAR rl::memory_order_consume
#define WSDEQUE_REMOTE_LCOUNT_LOAD_MEMBAR rl::memory_order_consume
#define WSDEQUE_REMOTE_RCOUNT_STORE_MEMBAR rl::memory_order_relaxed
Is this a bug in Relacy or something, or is this legit wrt C++0x rules?
--
Hallvard
I was experimenting with relaxing the memory ordering and "documenting" the
experience. The first version I posted:
http://relacy.pastebin.com/f2e9297b6
works okay.
Fine, but why do we need 9 progress reports in 6 hours? Just
hold off posting a few hours once in a while.
--
Hallvard
;^o