Synchronization Algorithm Verificator for C++0x

Dmitriy V'jukov

unread,

Aug 2, 2008, 12:47:59 PM8/2/08

to

I want to announce tool called Relacy Race Detector, which I've
developed. It's synchronization algorithm verificator for C++0x's
relaxed memory model. The tool is implemented in the form of header-
only library for C++03, which can be used for efficient execution of
unit-tests for synchronization algorithms. The tool executes unit-test
many times under control of special scheduler, on every execution
scheduler models different interleaving between threads, at the same
time every execution is exhaustively analyzed for data races, accesses
to freed memory, failed assertions etc. If no errors found then
verification terminates when particular number of interleavings are
verified (for random scheduler), or when all possible interleavings
are verified (for full search scheduler). If error is found then tool
outputs detailed execution history which leads to
error and terminates.
The tool was designed for verification of algorithms like memory
management, memory reclamation for lock-free algorithms, multithreaded
containers (queues, stacks, maps), mutexes, eventcounts and so on.
My personal subjective feeling to date is that tool is capable of
finding extremely subtle algorithmic errors, memory fence placement
errors and memory fence type errors within a second. And after error
is detected error fixing is relatively trivial, because one has
detailed execution history which leads to error.

Main features:
- Relaxed ISO C++0x Memory Model. Relaxed/acquire/release/acq_rel/
seq_cst memory operations and fences. The only non-supported feature
is memory_order_consume, it's simulated with memory_order_acquire.
- Exhaustive automatic error checking (including ABA detection).
- Full-fledged atomics library (with spurious failures in
compare_exchange()).
- Arbitrary number of threads.
- Detailed execution history for failed tests.
- Before/after/invariant functions for test suites.
- No false positives.

Types of detectable errors:
- Race condition (according to ISO C++0x definition)
- Access to uninitialized variable
- Access to freed memory
- Double free
- Memory leak
- Deadlock
- Livelock
- User assert failed
- User invariant failed

You can get some more information (tutorial, examples etc) here:
http://groups.google.com/group/relacy/web

And here is dedicated news group/discussion forum:
http://groups.google.com/group/relacy/topics

If you want to get a copy of the tool, please, contact me via
dvy...@gmail.com.

Any feedback, comments, suggestions are welcome.

Relacy Race Detector is free for open-source, non-commercial
development; research with non-patented, published results;
educational purposes; home/private usage. Please contact me
(dvy...@gmail.com) about other usages.

--------------------------------------------------------
Here is quick example.
Code of unit-test:

#include <relacy/relacy_std.hpp>

// template parameter '2' is number of threads
struct race_test : rl::test_suite<race_test, 2>
{
std::atomic<int> a;
rl::var<int> x;

// executed in single thread before main thread function
void before()
{
a($) = 0;
x($) = 0;
}

// main thread function
void thread(unsigned thread_index)
{
if (0 == thread_index)
{
// code executed by first thread
x($) = 1;
a($).store(1, rl::memory_order_relaxed);
}
else
{
// code executed by second thread
if (1 == a($).load(rl::memory_order_relaxed))
x($) = 2;
}
}

// executed in single thread after main thread function
void after()
{
}

// executed in single thread after every
// 'visible' action in main threads.
// disallowed to modify any state
void invariant()
{
}
};

int main()
{
// start simulation
rl::simulate<race_test>();
}

And here is output of the tool:

struct race_test
DATA RACE
iteration: 8

execution history:
[0] 0: <00366538> atomic store, value=0, (prev value=0),
order=seq_cst, in race_test::before, test.cpp(14)
[1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
[2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
[3] 0: <00366538> atomic store, value=1, (prev value=0),
order=relaxed, in race_test::thread, test.cpp(24)
[4] 1: <00366538> atomic load, value=1, order=relaxed, in
race_test::thread, test.cpp(28)
[5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
[6] 1: data race detected, in race_test::thread, test.cpp(29)

thread 0:
[0] 0: <00366538> atomic store, value=0, (prev value=0),
order=seq_cst, in race_test::before, test.cpp(14)
[1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
[2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
[3] 0: <00366538> atomic store, value=1, (prev value=0),
order=relaxed, in race_test::thread, test.cpp(24)

thread 1:
[4] 1: <00366538> atomic load, value=1, order=relaxed, in
race_test::thread, test.cpp(28)
[5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
[6] 1: data race detected, in race_test::thread, test.cpp(29)
--------------------------------------------------------

Dmitriy V'jukov

unread,

Aug 3, 2008, 8:22:03 AM8/3/08

to

On 2 авг, 20:47, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> I want to announce tool called Relacy Race Detector, which I've
> developed. It's synchronization algorithm verificator for C++0x's
> relaxed memory model.

Q: Can I use Relacy Race Detector to check my algo againts other that C
++0x memory models (x86, PPC, Java, CLI)?

A Yes, you can. Fortunately, C++0x memory model is very relaxaed, so
for the main part it's a "superset" of basically any other memory
model. You just have to define "binding" between your target memory
model and C++0x memory model.

Let's create such binding, for example, for x86 memory model:
- plain load operation is always "acquire" (i.e.
memory_order_acquire)
- plain store operation is always "release" (i.e.
memory_order_release)
- atomic RMW operation is always sequentially consistent (i.e.
memory_order_seq_cst)
- mfence instruction is
std::atomic_thread_fence(memory_order_seq_cst)
That is all. You can create such bindings for other hardware memory
models you are interested in (PPC, Itatium, SPARC etc).

And you can define such binding to other abstract memory model, like
Java MM. Let's see:
- plain load is "relaxed" (i.e. memory_order_relaxed)
- plain store is "relaxed" (i.e. memory_order_relaxed)
- volatile load is "acquire" (i.e. memory_order_acquire)
- volatile store operation is "release" (i.e. memory_order_release)
- atomic RMW operation is always sequentially consistent (i.e.
memory_order_seq_cst)
But here are some caveats. First, you have to emulate work of GC, i.e.
put all allocated memory to special list, and free all allocated
memory in test_suite::after(). Second, you have to manually emit
sequentially consistent memory fence between every volatile store and
volatile load. Third, you have to manually initialize all variables to
default value (0). Fourth, there is no such thing as data race, so you
have to define all variables as std::atomic, this will effectively
disable data race detection mechanizm. Well, actually you can use
rl::var variables, if you know that there must be no concurrent
accesses to variable, this will enable some automatic error detection
wrt data races.
Sounds not very cool... So I'm going to add built-in support for Java
and CLI. Then user would have to define something like RL_JAVA_MODE/
RL_CLI_MODE, and get all those things out-of-the-box. But yes, it
still will be C++ library. What do you think?

Dmitriy V'jukov

Chris M. Thomasson

unread,

Aug 5, 2008, 4:23:27 PM8/5/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:51a36d85-2fc6-4bd6...@59g2000hsb.googlegroups.com...

>I want to announce tool called Relacy Race Detector, which I've
> developed. It's synchronization algorithm verificator for C++0x's
> relaxed memory model. The tool is implemented in the form of header-
> only library for C++03, which can be used for efficient execution of
> unit-tests for synchronization algorithms. The tool executes unit-test
> many times under control of special scheduler, on every execution
> scheduler models different interleaving between threads, at the same
> time every execution is exhaustively analyzed for data races, accesses
> to freed memory, failed assertions etc. If no errors found then
> verification terminates when particular number of interleavings are
> verified (for random scheduler), or when all possible interleavings
> are verified (for full search scheduler). If error is found then tool
> outputs detailed execution history which leads to
> error and terminates.
> The tool was designed for verification of algorithms like memory
> management, memory reclamation for lock-free algorithms, multithreaded
> containers (queues, stacks, maps), mutexes, eventcounts and so on.
> My personal subjective feeling to date is that tool is capable of
> finding extremely subtle algorithmic errors, memory fence placement
> errors and memory fence type errors within a second. And after error
> is detected error fixing is relatively trivial, because one has
> detailed execution history which leads to error.

[...]

Have you testing the following algorithm yet?

http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf

I thought about coding it up. However, I think the algorithm might have a
patent application.

Chris M. Thomasson

unread,

Aug 5, 2008, 6:46:13 PM8/5/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:51a36d85-2fc6-4bd6...@59g2000hsb.googlegroups.com...

>I want to announce tool called Relacy Race Detector, which I've
> developed. It's synchronization algorithm verificator for C++0x's
> relaxed memory model. The tool is implemented in the form of header-
> only library for C++03, which can be used for efficient execution of
> unit-tests for synchronization algorithms. The tool executes unit-test
> many times under control of special scheduler, on every execution
> scheduler models different interleaving between threads, at the same
> time every execution is exhaustively analyzed for data races, accesses
> to freed memory, failed assertions etc. If no errors found then
> verification terminates when particular number of interleavings are
> verified (for random scheduler), or when all possible interleavings
> are verified (for full search scheduler). If error is found then tool
> outputs detailed execution history which leads to
> error and terminates.
> The tool was designed for verification of algorithms like memory
> management, memory reclamation for lock-free algorithms, multithreaded
> containers (queues, stacks, maps), mutexes, eventcounts and so on.
> My personal subjective feeling to date is that tool is capable of
> finding extremely subtle algorithmic errors, memory fence placement
> errors and memory fence type errors within a second. And after error
> is detected error fixing is relatively trivial, because one has
> detailed execution history which leads to error.

[...]

After you integrate a mutex and condvar into the detection framework, I
would like to model an eventcount.

Also, it seems like you could parallelize the testing process somewhat... I
need to think about this some more.

Anyway, nice work.

BTW, you ask if there is any market for this type of tool in an e-mail, well
I hope you don't mind if I answer that here...

IMVOH, when C++ 0x is finally released, it should have a fairly big impact.
I can see it now... You trademark a logo, and give a company the chance to
stick it on their products to give their customers piece of mind:

Customer: "Well, the software has the Relacy logo on the front of the box;
it must be correct concurrent software indeed!"

Vendor: "Are software lives up to the prerequisites of the Relacy logo. We
are proud to be able to display it on our software packaging."

I hope you make tons of $$$ with this tool Dmitriy. Get rich and, then a
little richer; AFAICT its a winner.

Any thoughts?

=^D

Chris M. Thomasson

unread,

Aug 5, 2008, 6:54:28 PM8/5/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:gH4mk.11785$KZ....@newsfe03.iad...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:51a36d85-2fc6-4bd6...@59g2000hsb.googlegroups.com...
>>I want to announce tool called Relacy Race Detector, which I've
>> developed.

[...]

>
> [...]
>
> After you integrate a mutex and condvar into the detection framework, I
> would like to model an eventcount.

I would be interested to see how long it takes to detect the following bug
you found in an older version of AppCore:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/cf4a952ff5af7d

[...]

Dmitriy V'jukov

unread,

Aug 5, 2008, 10:31:43 PM8/5/08

to

On 6 авг, 02:54, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> I would be interested to see how long it takes to detect the following bug
> you found in an older version of AppCore:
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...

Ok, let's see. I've just finished basic mutex and condvar
implementation.

Here is eventcount:

class eventcount
{
public:
eventcount()
: count(0)
, waiters(0)
{}

void signal_relaxed()
{
unsigned cmp = count($).load(std::memory_order_relaxed);
signal_impl(cmp);
}

void signal()
{
unsigned cmp = count($).fetch_add(0, std::memory_order_seq_cst);
signal_impl(cmp);
}

unsigned get()
{
unsigned cmp = count($).fetch_or(0x80000000,
std::memory_order_seq_cst);
return cmp & 0x7FFFFFFF;
}

void wait(unsigned cmp)
{
unsigned ec = count($).load(std::memory_order_seq_cst);
if (cmp == (ec & 0x7FFFFFFF))
{
guard.lock($);
ec = count($).load(std::memory_order_seq_cst);
if (cmp == (ec & 0x7FFFFFFF))
{
waiters($) += 1;
cv.wait(guard, $);
}
guard.unlock($);
}
}

private:
std::atomic<unsigned> count;
rl::var<unsigned> waiters;
mutex guard;
condition_variable_any cv;

void signal_impl(unsigned cmp)
{
if (cmp & 0x80000000)
{
guard.lock($);
while (false == count($).compare_swap(cmp,
(cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
unsigned w = waiters($);
waiters($) = 0;
guard.unlock($);
if (w)
cv.notify_all($);
}
}
};

Here is the queue:

template<typename T>
class spsc_queue
{
public:
spsc_queue()
{
node* n = RL_NEW node ();
head($) = n;
tail($) = n;
}

~spsc_queue()
{
RL_ASSERT(head($) == tail($));
RL_DELETE((node*)head($));
}

void enqueue(T data)
{
node* n = RL_NEW node (data);
head($)->next($).store(n, std::memory_order_release);
head($) = n;
ec.signal_relaxed();
}

T dequeue()
{
T data = try_dequeue();
while (0 == data)
{
int cmp = ec.get();
data = try_dequeue();
if (data)
break;
ec.wait(cmp);
data = try_dequeue();
if (data)
break;
}
return data;
}

private:
struct node
{
std::atomic<node*> next;
rl::var<T> data;

node(T data = T())
: next(0)
, data(data)
{}
};

rl::var<node*> head;
rl::var<node*> tail;

eventcount ec;

T try_dequeue()
{
node* t = tail($);
node* n = t->next($).load(std::memory_order_acquire);
if (0 == n)
return 0;
T data = n->data($);
RL_DELETE(t);
tail($) = n;
return data;
}
};

Here is the test:

struct spsc_queue_test : rl::test_suite<spsc_queue_test, 2>
{
spsc_queue<int> q;

void thread(unsigned thread_index)
{
if (0 == thread_index)
{

q.enqueue(11);
}
else
{
int d = q.dequeue();
RL_ASSERT(11 == d);
}
}
};

int main()
{
rl::simulate<spsc_queue_test>();
}

It takes around 40 minutes.

Let's run it!
Here is output (note error detected on iteration 3):

struct spsc_queue_test
DEADLOCK
iteration: 3

execution history:
[0] 1: [BEFORE BEGIN], in rl::context_impl<struct
spsc_queue_test,class rl::random_scheduler<2> >::fiber_proc_impl,
context.hpp(341)
[1] 1: <0035398C> atomic store, value=-1, (prev value=0),
order=relaxed, in rl::generic_mutex<0>::generic_mutex, stdlib/
mutex.hpp(77)
[2] 1: memory allocation: addr=00353A80, size=52, in
spsc_queue<int>::spsc_queue, peterson.cpp(397)
[3] 1: <00353938> store, value=00353A80, in
spsc_queue<int>::spsc_queue, peterson.cpp(398)
[4] 1: <00353948> store, value=00353A80, in
spsc_queue<int>::spsc_queue, peterson.cpp(399)
[5] 1: [BEFORE END], in rl::context_impl<struct spsc_queue_test,class
rl::random_scheduler<2> >::fiber_proc_impl, context.hpp(356)
[6] 1: <00353948> load, value=00353A80, in
spsc_queue<int>::try_dequeue, peterson.cpp(452)
[7] 1: <00353A80> atomic load, value=00000000, order=acquire, in
spsc_queue<int>::try_dequeue, peterson.cpp(453)
[8] 1: <00353958> fetch_or , prev=0, op=2147483648, new=2147483648,
order=seq_cst, in eventcount::get, peterson.cpp(349)
[9] 1: <00353948> load, value=00353A80, in
spsc_queue<int>::try_dequeue, peterson.cpp(452)
[10] 0: memory allocation: addr=0035BCA0, size=52, in
spsc_queue<int>::enqueue, peterson.cpp(410)
[11] 0: <00353938> load, value=00353A80, in spsc_queue<int>::enqueue,
peterson.cpp(411)
[12] 1: <00353A80> atomic load, value=00000000, order=acquire, in
spsc_queue<int>::try_dequeue, peterson.cpp(453)
[13] 0: <00353A80> atomic store, value=0035BCA0, (prev
value=00000000), order=release, in spsc_queue<int>::enqueue,
peterson.cpp(411)
[14] 0: <00353938> store, value=0035BCA0, in spsc_queue<int>::enqueue,
peterson.cpp(412)
[15] 0: <00353958> atomic load, value=0 [NOT CURRENT], order=relaxed,
in eventcount::signal_relaxed, peterson.cpp(337)
[16] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
eventcount::wait, peterson.cpp(355)
[17] 1: <0035398C> CAS succ orig=-1, cmp=-1, xchg=1, order=acquire, in
rl::generic_mutex<0>::lock, stdlib/mutex.hpp(117)
[18] 1: <0035398C> mutex: lock, in eventcount::wait, peterson.cpp(358)
[19] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
eventcount::wait, peterson.cpp(359)
[20] 1: <0035397C> load, value=0, in eventcount::wait,
peterson.cpp(362)
[21] 1: <0035397C> store, value=1, in eventcount::wait,
peterson.cpp(362)
[22] 1: <0035398C> atomic store, value=-1, (prev value=1),
order=release, in rl::generic_mutex<0>::unlock, stdlib/mutex.hpp(209)
[23] 1: <0035398C> mutex: unlock, in eventcount::wait,
peterson.cpp(363)
[24] 1: blocking thread, in eventcount::wait, peterson.cpp(363)
[25] 1: deadlock detected, in eventcount::wait, peterson.cpp(363)

thread 0:
[10] 0: memory allocation: addr=0035BCA0, size=52, in
spsc_queue<int>::enqueue, peterson.cpp(410)
[11] 0: <00353938> load, value=00353A80, in spsc_queue<int>::enqueue,
peterson.cpp(411)
[13] 0: <00353A80> atomic store, value=0035BCA0, (prev
value=00000000), order=release, in spsc_queue<int>::enqueue,
peterson.cpp(411)
[14] 0: <00353938> store, value=0035BCA0, in spsc_queue<int>::enqueue,
peterson.cpp(412)
[15] 0: <00353958> atomic load, value=0 [NOT CURRENT], order=relaxed,
in eventcount::signal_relaxed, peterson.cpp(337)

thread 1:
[0] 1: [BEFORE BEGIN], in rl::context_impl<struct
spsc_queue_test,class rl::random_scheduler<2> >::fiber_proc_impl,
context.hpp(341)
[1] 1: <0035398C> atomic store, value=-1, (prev value=0),
order=relaxed, in rl::generic_mutex<0>::generic_mutex, stdlib/
mutex.hpp(77)
[2] 1: memory allocation: addr=00353A80, size=52, in
spsc_queue<int>::spsc_queue, peterson.cpp(397)
[3] 1: <00353938> store, value=00353A80, in
spsc_queue<int>::spsc_queue, peterson.cpp(398)
[4] 1: <00353948> store, value=00353A80, in
spsc_queue<int>::spsc_queue, peterson.cpp(399)
[5] 1: [BEFORE END], in rl::context_impl<struct spsc_queue_test,class
rl::random_scheduler<2> >::fiber_proc_impl, context.hpp(356)
[6] 1: <00353948> load, value=00353A80, in
spsc_queue<int>::try_dequeue, peterson.cpp(452)
[7] 1: <00353A80> atomic load, value=00000000, order=acquire, in
spsc_queue<int>::try_dequeue, peterson.cpp(453)
[8] 1: <00353958> fetch_or , prev=0, op=2147483648, new=2147483648,
order=seq_cst, in eventcount::get, peterson.cpp(349)
[9] 1: <00353948> load, value=00353A80, in
spsc_queue<int>::try_dequeue, peterson.cpp(452)
[12] 1: <00353A80> atomic load, value=00000000, order=acquire, in
spsc_queue<int>::try_dequeue, peterson.cpp(453)
[16] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
eventcount::wait, peterson.cpp(355)
[17] 1: <0035398C> CAS succ orig=-1, cmp=-1, xchg=1, order=acquire, in
rl::generic_mutex<0>::lock, stdlib/mutex.hpp(117)
[18] 1: <0035398C> mutex: lock, in eventcount::wait, peterson.cpp(358)
[19] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
eventcount::wait, peterson.cpp(359)
[20] 1: <0035397C> load, value=0, in eventcount::wait,
peterson.cpp(362)
[21] 1: <0035397C> store, value=1, in eventcount::wait,
peterson.cpp(362)
[22] 1: <0035398C> atomic store, value=-1, (prev value=1),
order=release, in rl::generic_mutex<0>::unlock, stdlib/mutex.hpp(209)
[23] 1: <0035398C> mutex: unlock, in eventcount::wait,
peterson.cpp(363)
[24] 1: blocking thread, in eventcount::wait, peterson.cpp(363)
[25] 1: deadlock detected, in eventcount::wait, peterson.cpp(363)

Operations on condition variables are not yet in execution history.

Let's replace 'ec.signal_relaxed()' with 'ec.signal()'. Here is the
output:

struct spsc_queue_test
6% (65536/1000000)
13% (131072/1000000)
19% (196608/1000000)
26% (262144/1000000)
32% (327680/1000000)
39% (393216/1000000)
45% (458752/1000000)
52% (524288/1000000)
58% (589824/1000000)
65% (655360/1000000)
72% (720896/1000000)
78% (786432/1000000)
85% (851968/1000000)
91% (917504/1000000)
98% (983040/1000000)
iterations: 1000000
total time: 2422
throughput: 412881

Test passed. Throughput is around 400'000 test executions per second.

Also test reveals some errors with memory fences.
In signaling function you are using acquire fence in cas, but relaxed
is enough here:
void signal_impl(unsigned cmp)
{
if (cmp & 0x80000000)
{
guard.lock($);
while (false == count($).compare_swap(cmp,
(cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
unsigned w = waiters($);
waiters($) = 0;
guard.unlock($);
if (w)
cv.notify_all($);
}
}

Also. Following naive approach doesn't work (at least in C++0x):
void signal()
{
std::atomic_thread_fence(std::memory_order_seq_cst);
signal_relaxed();
}
You have to execute sequentially consistent atomic RMW on 'ec'
variable here. Because sequentially consistent fence and sequentially
consistent atomic RMW operation doesn' t synchronize-with each other.
Hmmm... This is strange. But I can't find anything about this in
draft... Anthony Williams said that committee made some changes to
initial proposal on fences, but they are not yet published...

Dmitriy V'jukov

Chris M. Thomasson

unread,

Aug 5, 2008, 11:22:18 PM8/5/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:8488efa1-9383-4232...@a1g2000hsb.googlegroups.com...

On 6 авг, 02:54, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > I would be interested to see how long it takes to detect the following
> > bug
> > you found in an older version of AppCore:
> >
> > http://groups.google.com/group/comp.programming.threads/browse_frm/th...

> Ok, let's see. I've just finished basic mutex and condvar
> implementation.

Can you e-mail the new version to me please?

> Here is eventcount:

class eventcount
{
[...]

void wait(unsigned cmp)
{
unsigned ec = count($).load(std::memory_order_seq_cst);
if (cmp == (ec & 0x7FFFFFFF))
{
guard.lock($);
ec = count($).load(std::memory_order_seq_cst);

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I don't think that you need a memory barrier under the lock because the
only mutation which will make the following compare succeed is also
performed under the same lock. The signal-slow path is under the lock, and
the waiter slow-path is under the lock. However, I don't know if this
violates some rule in C++ 0x.

if (cmp == (ec & 0x7FFFFFFF))
{
waiters($) += 1;
cv.wait(guard, $);
}
guard.unlock($);
}
}

};

> Here is the queue:

[...]

> Here is the test:

[...]

> It takes around 40 minutes.

> Let's run it!
> Here is output (note error detected on iteration 3):

> struct spsc_queue_test
> DEADLOCK
> iteration: 3

Cool; it sure does detect the race-condition in the old AppCore! Very nice
indeed.

[...]

> Operations on condition variables are not yet in execution history.

> Let's replace 'ec.signal_relaxed()' with 'ec.signal()'. Here is the
> output:

> struct spsc_queue_test
[...]

> iterations: 1000000
> total time: 2422
> throughput: 412881

> Test passed. Throughput is around 400'000 test executions per second.

Beautiful. My eventcount algorihtm works!

=^D

> Also test reveals some errors with memory fences.
> In signaling function you are using acquire fence in cas, but relaxed
> is enough here:
> void signal_impl(unsigned cmp)
> {
> if (cmp & 0x80000000)
> {
> guard.lock($);
> while (false == count($).compare_swap(cmp,
> (cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
> unsigned w = waiters($);
> waiters($) = 0;
> guard.unlock($);
> if (w)
> cv.notify_all($);
> }
> }

Right. The current public AppCore atomic API is not fine-grain enough to
allow for relaxed operations. I need to change that.

> Also. Following naive approach doesn't work (at least in C++0x):
> void signal()
> {
> std::atomic_thread_fence(std::memory_order_seq_cst);
> signal_relaxed();
> }

Good thing the MFENCE works in x86!

> You have to execute sequentially consistent atomic RMW on 'ec'
> variable here. Because sequentially consistent fence and sequentially
> consistent atomic RMW operation doesn' t synchronize-with each other.
> Hmmm... This is strange. But I can't find anything about this in
> draft... Anthony Williams said that committee made some changes to
> initial proposal on fences, but they are not yet published...

This is weird. a seq_cst fence does not work with seq_cst RMW!? Do you
happen to know the rational for this? IMVHO, it makes no sense at all... I
must be missing something.

:^/

Dmitriy V'jukov

unread,

Aug 6, 2008, 7:18:15 AM8/6/08

to

On Aug 6, 12:23 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Have you testing the following algorithm yet?
>
> http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf

Yes and no. No, I didn't test this particular algorithm. But I test my
analogous algorithm for work-stealing deque.
It's extremely hardcore lock-free algorithm with 500+ LOC. And it was
hangup sometimes, and sometimes cause SIGSEGV. I was trying to debug
it manually, I was trying to insert some asserts, I was trying to
insert some "yields", I was trying to review it several times, I was
trying to analyze dumps. Uselessly.
When I run it under Relacy Race Detector I localize and fix 2 subtle
algo errors and clarify some moments with memory fences it about 3
hours. And it was almost mechanical work, not some black magic.

Dmitriy V'jukov

unread,

Aug 6, 2008, 7:27:06 AM8/6/08

to

On Aug 6, 2:46 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Also, it seems like you could parallelize the testing process somewhat... I
> need to think about this some more.

Yes, this is in todo list:
http://groups.google.com/group/relacy/web/todo-feature-list

Feel free to comment on todo/feature list, propose new items, and
prioritize items.
It's better to do this here:
http://groups.google.com/group/relacy/topics

Dmitriy V'jukov

unread,

Aug 6, 2008, 7:31:34 AM8/6/08

to

On Aug 6, 2:46 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> IMVOH, when C++ 0x is finally released, it should have a fairly big impact.
> I can see it now... You trademark a logo, and give a company the chance to
> stick it on their products to give their customers piece of mind:
>
> Customer: "Well, the software has the Relacy logo on the front of the box;
> it must be correct concurrent software indeed!"
>
> Vendor: "Are software lives up to the prerequisites of the Relacy logo. We
> are proud to be able to display it on our software packaging."
>
> I hope you make tons of $$$ with this tool Dmitriy. Get rich and, then a
> little richer; AFAICT its a winner.
>
> Any thoughts?
>
> =^D

:)
Sounds great!

I think I can start developing 2 logos:
[RELACY APPROVED]
and the second:
[RELACY BUSTED]
:)

Dmitriy V'jukov

Peter Dimov

unread,

Aug 6, 2008, 9:08:54 AM8/6/08

to

On Aug 6, 5:31 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Also. Following naive approach doesn't work (at least in C++0x):
> void signal()
> {
> std::atomic_thread_fence(std::memory_order_seq_cst);
> signal_relaxed();
> }

I'm not sure that I understand the algorithm here, but the reason this
doesn't introduce a synchronize-with relationship is that the fence
needs to follow the load, not precede it.

load.relaxed count
fence.seq_cst

does synchronize with

store.seq_cst count, v

when the load sees the value v written by the store.

If you do need a leading seq_cst fence, why not just use a seq_cst
load?

Dmitriy V'jukov

unread,

Aug 6, 2008, 9:42:23 AM8/6/08

to

On Aug 6, 5:08 pm, Peter Dimov <pdi...@gmail.com> wrote:
> On Aug 6, 5:31 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > Also. Following naive approach doesn't work (at least in C++0x):
> > void signal()
> > {
> > std::atomic_thread_fence(std::memory_order_seq_cst);
> > signal_relaxed();
> > }
>
> I'm not sure that I understand the algorithm here, but the reason this
> doesn't introduce a synchronize-with relationship is that the fence
> needs to follow the load, not precede it.

I will try to describe the issue in more detail in separate post in
near future.

> load.relaxed count
> fence.seq_cst
>
> does synchronize with
>
> store.seq_cst count, v
>
> when the load sees the value v written by the store.
>
> If you do need a leading seq_cst fence, why not just use a seq_cst
> load?

Because this seq_cst fence must introduce synchronizes-with not only
from second thread to this thread, but also possibly it must introduce
synchronizes-with from this thread to second thread. I.e. OR first
thread must synchronize-with second thread OR second thread must
synchronize-with first thread (depending on which thread's seq_cst
operations is first in global order).
If I use seq_cst load, then it only possible that second thread
synchronizes-with this thread, but not vice versa, because seq_cst
load is only acquire, it's NOT release.

Dmitriy V'jukov

unread,

Aug 6, 2008, 10:01:59 AM8/6/08

to

On Aug 6, 5:42 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Aug 6, 5:08 pm, Peter Dimov <pdi...@gmail.com> wrote:
>
> > On Aug 6, 5:31 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > > Also. Following naive approach doesn't work (at least in C++0x):
> > > void signal()
> > > {
> > > std::atomic_thread_fence(std::memory_order_seq_cst);
> > > signal_relaxed();
> > > }
>
> > I'm not sure that I understand the algorithm here, but the reason this
> > doesn't introduce a synchronize-with relationship is that the fence
> > needs to follow the load, not precede it.
>
> I will try to describe the issue in more detail in separate post in
> near future.

Here:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a9513cf616dc168c

Dmitriy V'jukov

Chris M. Thomasson

unread,

Aug 7, 2008, 7:02:15 AM8/7/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:29a4e368-c068-460a...@u6g2000prc.googlegroups.com...

Did you develop Relacy in part so that you can totally "deconstruct" your
work-stealing algorithm into a form in which its "random" run-time behavior
could be studied at another level and therefore be _reliably_ debugged?

;^D

Chris M. Thomasson

unread,

Aug 7, 2008, 7:09:15 AM8/7/08

to

"Peter Dimov" <pdi...@gmail.com> wrote in message
news:8cf29aaf-916b-40dc...@27g2000hsf.googlegroups.com...

> load.relaxed count
> fence.seq_cst

> does synchronize with

> store.seq_cst count, v

That can be done as long as the full fence is executed before the load of
the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
of my algorithm:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380

Here is where Joe Seigh points out the membars:

http://groups.google.com/group/comp.programming.threads/msg/8e7a0379b55557c0

AFAICT, the release barrier in a user algorithm is in the wrong place. The
load cannot be allowed to be hoisted up above the production of any item in
a user container class. Basically, the production of an item into a user
container usually only requires a release barrier. Its in the wrong place
wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
in order to allow the production to be completely visible BEFORE the load
from the eventcount is performed... Where am I going wrong?

Read here for more info:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/cf4a952ff5af7d

Humm...

Dmitriy V'jukov

unread,

Aug 7, 2008, 7:26:35 AM8/7/08

to

On Aug 7, 3:09 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> That can be done as long as the full fence is executed before the load of
> the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
> of my algorithm:
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...

>
> Here is where Joe Seigh points out the membars:
>

> http://groups.google.com/group/comp.programming.threads/msg/8e7a0379b...

>
> AFAICT, the release barrier in a user algorithm is in the wrong place. The
> load cannot be allowed to be hoisted up above the production of any item in
> a user container class. Basically, the production of an item into a user
> container usually only requires a release barrier. Its in the wrong place
> wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
> in order to allow the production to be completely visible BEFORE the load
> from the eventcount is performed... Where am I going wrong?

The problem with C++0x in this context is the following.
Assume container is lock-free stack. And 'production' is implemented
as seq_cst atomic RMW on 'head' variable. In this situation you can
reasonably presume that you can use 'relaxed signaling', i.e. relaxed
load of eventcount w/o any additional fences, because seq_cst atomic
RMW on 'head' variable already contains strong enough fence in right
place. WRONG! Why? See here:

Peter Dimov

unread,

Aug 7, 2008, 9:42:10 AM8/7/08

to

On Aug 7, 2:09 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> AFAICT, the release barrier in a user algorithm is in the wrong place. The
> load cannot be allowed to be hoisted up above the production of any item in
> a user container class. Basically, the production of an item into a user
> container usually only requires a release barrier. Its in the wrong place
> wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
> in order to allow the production to be completely visible BEFORE the load
> from the eventcount is performed... Where am I going wrong?
>
> Read here for more info:
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...
>
> Humm...

Hmmm.

In terms of the C++MM, the options I see are:

1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
get
3. RMW.seq_cst in signal, cas.seq_cst in get

and maybe

4. RMW.acq_rel in signal, cas.acq_rel in get

Chris M. Thomasson

unread,

Aug 7, 2008, 11:13:29 AM8/7/08

to

"Peter Dimov" <pdi...@gmail.com> wrote in message

news:791c816b-dce9-4325...@34g2000hsh.googlegroups.com...

> Hmmm.

> and maybe

For now, I choose number 2 because in the current eventcount algorithm
as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
function needs a trailing StoreLoad. AFAICT, this is fine... As for the
other choices, well, I don't like number 1 because it moves memory
visibility rules directly into an end-user algorithm. I dislike 3-4 because
of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
eventcount state with an addend of zero. However, IMVHO, that looks like
fairly dirty hack indeed... I mean, IMVHO, a RMW which does not perform a
mutation is basically pointless...

Dmitriy V'jukov

unread,

Aug 7, 2008, 11:31:37 AM8/7/08

to

On Aug 7, 7:13 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > In terms of the C++MM, the options I see are:
> > 1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
> > 2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
> > get
> > 3. RMW.seq_cst in signal, cas.seq_cst in get
> > and maybe
> > 4. RMW.acq_rel in signal, cas.acq_rel in get
>
> For now, I choose number 2 because in the current eventcount algorithm
> as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
> function needs a trailing StoreLoad. AFAICT, this is fine... As for the
> other choices, well, I don't like number 1 because it moves memory
> visibility rules directly into an end-user algorithm. I dislike 3-4 because
> of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
> eventcount state with an addend of zero. However, IMVHO, that looks like
> fairly dirty hack indeed... I mean, IMVHO, a RMW which does not perform a
> mutation is basically pointless...

Consider implementation of number 2 on x86.
Producer:
1. Interlocked RMW to enqueue element
2. mfence in signal (even more expensive than Interlocked RMW, and
absolutely useless)

Consumer:
3. Interlocked RMW in get
4. mfence in get (even more expensive than Interlocked RMW, and
absolutely useless)

Maybe compiler can optimize 4 away, because 3 and 4 will be in one
function. But I am not sure that first versions of C++0x compilers
will be so smart. And it's questionable that compiler will be able to
optimize 2 away, because 1 and 2 will be in different functions and
probably in different source files.

Do you like this? Don't hasten forgetting your asm skills ;)

Dmitriy V'jukov

Peter Dimov

unread,

Aug 7, 2008, 3:31:23 PM8/7/08

to

On Aug 7, 6:31 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Consider implementation of number 2 on x86.
> Producer:
> 1. Interlocked RMW to enqueue element
> 2. mfence in signal (even more expensive than Interlocked RMW, and
> absolutely useless)

It might be interesting to time the actual mfence overhead when the
preceding locked RMW instruction has already drained the store queue.

kwikius

unread,

Aug 7, 2008, 7:51:45 PM8/7/08

to

You could , I guess try to extract the individual elements of soup, or
you could just call it soup.

regards
Andy Little

kwikius

unread,

Aug 8, 2008, 3:10:14 AM8/8/08

to

On Aug 8, 12:51 am, kwikius <a...@servocomm.freeserve.co.uk> wrote:

> You could , I guess try to extract the individual elements of soup, or
> you could just call it soup.

hmm. Irespective of my feelings re threads, the above remark was
seriously out of order.

Sincere apologies to anyone offended.

regards
Andy Little

Chris M. Thomasson

unread,

Aug 8, 2008, 8:09:34 PM8/8/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:490d36a1-28e6-48ae...@k36g2000pri.googlegroups.com...

No.

> Don't hasten forgetting your asm skills ;)

Ouch.

Dmitriy V'jukov

unread,

Aug 11, 2008, 2:54:06 PM8/11/08

to

On Aug 7, 3:02 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> >> Have you testing the following algorithm yet?
>
> >>http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
>
> > Yes and no. No, I didn't test this particular algorithm. But I test my
> > analogous algorithm for work-stealing deque.
> > It's extremely hardcore lock-free algorithm with 500+ LOC. And it was
> > hangup sometimes, and sometimes cause SIGSEGV. I was trying to debug
> > it manually, I was trying to insert some asserts, I was trying to
> > insert some "yields", I was trying to review it several times, I was
> > trying to analyze dumps. Uselessly.
> > When I run it under Relacy Race Detector I localize and fix 2 subtle
> > algo errors and clarify some moments with memory fences it about 3
> > hours. And it was almost mechanical work, not some black magic.
>
> Did you develop Relacy in part so that you can totally "deconstruct" your
> work-stealing algorithm into a form in which its "random" run-time behavior
> could be studied at another level and therefore be _reliably_ debugged?
>
> ;^D

Initially, yes, it was the preposition for relacy creation. But now
it's developed into independent tool.

Dmitriy V'jukov

unread,

Aug 15, 2008, 2:35:04 PM8/15/08

to

I want to announce release 1.1 of Relacy Race Detector.

First of all, now you can freely DOWNLOAD latest version of Relacy
Race Detector DIRECTLY FROM WEB:
http://groups.google.com/group/relacy/files

Main change in release 1.1 is support for standard synchronization
primitives:
1. mutex (std::mutex, pthread_mutex_init, InitializeCriticalSection)
2. rw_mutex (pthread_rwlock_init, InitializeSRWLock)
3. condition variable (std::condition_variable,
std::condition_variable_any, pthread_cond_init,
InitializeConditionVariable)
4. semaphore (sem_init, CreateSemaphore)
5. event (CreateEvent)

Now you can test more traditional "mutex-based" algorithms (i.e. no
lock-free), and Relacy still will detect data races, deadlocks,
livelocks, resource leaks, incorrect usage of API etc.
Also you can test mixed lock-based/lock-free algorithms. For example
fast-path is lock-free, and slow-path is mutex-based.
For examples see 'spsc_queue' example in distributive, it uses
std::mutex and std::condition_variable as well as lock-free fast-path.
And following manual implementation of condition_variable via Win
API's CreateEvent and CreateSemaphore:
http://groups.google.com/group/comp.programming.threads/msg/30c2ec41c4d498a2

Also I add RL_DEBUGBREAK_ON_ASSERT and RL_MSVC_OUTPUT options. And
initial_state/final_state parameters. See details here:
http://groups.google.com/group/relacy/web/advanced-moments

Dmitriy V'jukov

unread,

Aug 26, 2008, 2:38:42 PM8/26/08

to

I have uploaded release 1.2 of Relacy Race Detector:
http://groups.google.com/group/relacy/files

The main feature of the release is support for Java/CLI (aka .NET)
algorithm verification (but don't get confused - it's still C++
library).
The support includes following things:
- Emulation of GC: just allocate-and-forget. GC eliminates important
problems associated with lock-free algorithms - safe memory
reclamation and ABA problem. You can use GC and in C++ mode - just
define RL_GC.
- CLI memory model. Stronger atomic RMW operations and stronger
seq_cst fence.
- Java memory model. Stronger atomic RMW operations and stronger
seq_cst fence and seq_cst fence emitted automatically between volatile
store and volatile load.
- Basic CLI API associated with synchronization. rl::nvolatile<>
class emulates CLI volatiles. rl::nvar<> emulates plain CLI variables.
Interlocked operations available in rl::Interlocked namespace (i.e.
rl::Interlocked::CompareExchange()). And also
rl::Thread::MemoryBarrier(), rl::Thread::VolatileRead(),
rl::Thread::VolatileWrite() and rl::Thread::SpinWait().
- Basic Java API associated with sycnronization. rl::jvolatile<>,
rl::jvar<>, rl::AtomicInteger and rl::AtomicLong. Note that extensive
Java/CLI support libraries are not emulated (various ConcurrentQueue's
and so on). But you can use mutexes, condition_variables, semaphores
and events from C++ API, POSIX API or Win API.

Example of CLI algorithm verification you can see here:
http://groups.google.com/group/relacy/browse_frm/thread/f91c913c298a20ab

Java example:
http://groups.google.com/group/relacy/browse_frm/thread/257150359b8fb57d

Also release includes a bunch of bug fixes.

Looking forward to your comments.

Dmitriy V'jukov

unread,

Aug 26, 2008, 2:40:20 PM8/26/08

to

I have uploaded release 1.2 of Relacy Race Detector:
http://groups.google.com/group/relacy/files

Dmitriy V'jukov

unread,

Aug 26, 2008, 3:16:17 PM8/26/08

to

I forgot to mention following moment. There are no such things as data
races and uninitialized variables in Java/CLI. This basically forces
me to disable some types of automatically detectable errors (data
races and accesses to uninitialized variables) in Java/CLI mode. And
from verification point of view this is bad. But in Java/CLI mode you
still can use rl::var<>'s, and this will effectively re-enable
detection of mentioned types of errors. I.e. if you know that some
variable must be accesses only in 'single-threaded way' (for example,
all accesses are protected by mutex), then you can use rl::var<> for
this variable.

Dmitriy V'jukov

Peter Dimov

unread,

Aug 27, 2008, 10:59:34 AM8/27/08

to

On Aug 26, 9:38 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> - Java memory model. Stronger atomic RMW operations and stronger
> seq_cst fence and seq_cst fence emitted automatically between volatile
> store and volatile load.

What are the goals of this mode?

If it's intended to analyze Java code against the formal Java memory
model, the above description is not correct. Consider again

T1: x = 1;
T2: y = 1;
T3: r1 = x; r2 = y;
T4: r3 = y; r4 = x;

(x, y volatile, initial value 0)

(r1,r2,r3,r4) = (1, 0, 1, 0) is disallowed by the Java memory model,
but allowed by your description.

This mode may be considered correct if it's intended to analyze Java
code running on a typical x86 JVM that follows the cookbook, but these
JVMs do not conform to the Java memory model and will likely be fixed
at some point.

Why not just model volatile loads and stores as volatile loads and
stores, instead of transforming the code and analyzing the transformed
result?

Dmitriy V'jukov

unread,

Aug 27, 2008, 11:50:19 AM8/27/08

to

On Aug 27, 6:59 pm, Peter Dimov <pdi...@gmail.com> wrote:
>
> > - Java memory model. Stronger atomic RMW operations and stronger
> > seq_cst fence and seq_cst fence emitted automatically between volatile
> > store and volatile load.
>
> What are the goals of this mode?
>
> If it's intended to analyze Java code against the formal Java memory
> model, the above description is not correct.

Yes, goal of this mode is to analyze Java code (translated to C++)
against formal Java memory model.

> Consider again
>
> T1: x = 1;
> T2: y = 1;
> T3: r1 = x; r2 = y;
> T4: r3 = y; r4 = x;
>
> (x, y volatile, initial value 0)
>
> (r1,r2,r3,r4) = (1, 0, 1, 0) is disallowed by the Java memory model,
> but allowed by your description.
>
> This mode may be considered correct if it's intended to analyze Java
> code running on a typical x86 JVM that follows the cookbook, but these
> JVMs do not conform to the Java memory model and will likely be fixed
> at some point.

Yes, you are right.

17.4.4 Synchronization Order
Every execution has a synchronization order. A synchronization order
is a
total order over all of the synchronization actions of an execution.
...
Synchronization actions induce the synchronized-with relation on
actions,
defined as follows:
...
- A write to a volatile variable v synchronizes-with all subsequent
reads of v by any thread (where subsequent is defined according to the
synchronization
order).

This basically means that there must be total order over all accesses
to volatile variables.

Then what is the most precise way to model Java volatiles in terms of C
++0x?
On the one hand, binding must not allow false positives (i.e. Relacy
falsely detects error). On the other hand, binding must minimize
(ideally eliminate) false negatives (i.e. Relacy fails to detect real
error).

Does modeling of Java volatile stores/loads as C++0x seq_cst stores/
loads satisfy those requirements?

> Why not just model volatile loads and stores as volatile loads and
> stores, instead of transforming the code and analyzing the transformed
> result?

I don't get you here. There are no such synchronization actions as
volatile loads and stores in C++0x. Please elaborate.

Thank you for pointing to the issue! I will fix it!

Dmitriy V'jukov

Peter Dimov

unread,

Aug 27, 2008, 1:02:03 PM8/27/08

to

On Aug 27, 6:50 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Does modeling of Java volatile stores/loads as C++0x seq_cst stores/
> loads satisfy those requirements?

I believe that it does.

> I don't get you here. There are no such synchronization actions as
> volatile loads and stores in C++0x.

Yes, I used the Java meaning of volatile. I meant that Relacy can
directly implement the JMM meaning of volatile for rl::jvolatile,
instead of translating it to C++ terms first. This would allow you to
run the same algorithm expressed in Java and in C++ seq_cst and verify
whether the results match. So Relacy can give us a direct answer to
your first question above. :-)

Dmitriy V'jukov

unread,

Aug 27, 2008, 1:27:40 PM8/27/08

to

On Aug 27, 9:02 pm, Peter Dimov <pdi...@gmail.com> wrote:
> On Aug 27, 6:50 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > Does modeling of Java volatile stores/loads as C++0x seq_cst stores/
> > loads satisfy those requirements?
>
> I believe that it does.

Good! Than it will be simple.

> > I don't get you here. There are no such synchronization actions as
> > volatile loads and stores in C++0x.
>
> Yes, I used the Java meaning of volatile. I meant that Relacy can
> directly implement the JMM meaning of volatile for rl::jvolatile,
> instead of translating it to C++ terms first. This would allow you to
> run the same algorithm expressed in Java and in C++ seq_cst and verify
> whether the results match. So Relacy can give us a direct answer to
> your first question above. :-)

Hmmm... It looks like vicious circle :)

Direct modeling of Java and CLI synchronization primitives I consider
as last resort. I hope that I will be able to easily model Java/CLI
primitives via C++0x primitives. Currently I add only 2 patches. First
I've described here:
http://groups.google.com/group/comp.programming.threads/msg/07c810b38be80bbb
And second is that every atomic RMW is followed by seq_cst fence:
T java_cli_atomic_rmw(...)
{
T r = cpp0x_atomic_rmw(..., memory_order_seq_cst);
cpp0x_atomic_thread_fence(memory_order_seq_cst);
return r;
}
I think that this is intended behavior of CLI Interlocked operations,
because they based on Win32 Interlocked operations, and they are based
on x86 locked instructions :)
I am not sure about Java here. I can't find answer in language
specification and description of atomic package...

Dmitriy V'jukov