Checking Against Other Memory Models (x86, PPC, Java, CLI)


Dmitriy V'jukov

Aug 5, 2008, 7:17:56 AM
to Relacy Race Detector
Q: Can I use Relacy Race Detector to check my algo against memory
models other than the C++0x one (x86, PPC, Java, CLI)?


A: Yes, you can. Fortunately, the C++0x memory model is very relaxed,
so for the most part it's a "superset" of basically any other memory
model. You just have to define a "binding" between your target memory
model and the C++0x memory model.

Let's create such a binding, for example, for the x86 memory model:
- a plain load operation is always "acquire" (i.e. memory_order_acquire)
- a plain store operation is always "release" (i.e. memory_order_release)
- an atomic RMW operation is always sequentially consistent (i.e. memory_order_seq_cst)
- the mfence instruction is std::atomic_thread_fence(memory_order_seq_cst)
That is all. You can create such bindings for other hardware memory
models you are interested in (PPC, Itanium, SPARC etc.).
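
To make the binding concrete, here is a minimal sketch of that mapping
in plain C++0x atomics ('x' and 'example' are just illustrative names):

#include <atomic>

std::atomic<int> x;

void example()
{
    int v = x.load(std::memory_order_acquire);           // plain x86 load
    x.store(1, std::memory_order_release);               // plain x86 store
    x.fetch_add(1, std::memory_order_seq_cst);           // LOCK'ed RMW
    std::atomic_thread_fence(std::memory_order_seq_cst); // mfence
    (void)v;
}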

And you can define such a binding to another abstract memory model,
like the Java MM. Let's see:
- a plain load is "relaxed" (i.e. memory_order_relaxed)
- a plain store is "relaxed" (i.e. memory_order_relaxed)
- a volatile load is "acquire" (i.e. memory_order_acquire)
- a volatile store is "release" (i.e. memory_order_release)
- an atomic RMW operation is always sequentially consistent (i.e. memory_order_seq_cst)
But here are some caveats. First, you have to emulate the work of the
GC, i.e. put all allocated memory into a special list, and free all of
it in test_suite::after(). Second, you have to manually emit a
sequentially consistent memory fence between every volatile store and
volatile load. Third, you have to manually initialize all variables to
their default value (0). Fourth, there is no such thing as a data race,
so you have to define all variables as std::atomic, which effectively
disables the data race detection mechanism. Well, actually you can use
rl::var variables if you know that there must be no concurrent
accesses to a variable; this will enable some automatic error detection
wrt data races.
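
For example, a minimal sketch of the GC emulation, assuming Relacy's
test_suite and its after() hook (the java_test, gc_heap and gc_alloc
names are just illustrative):

#include <relacy/relacy_std.hpp> // exact header name may differ

#include <cstdlib>
#include <vector>

struct java_test : rl::test_suite<java_test, 2>
{
    std::vector<void*> gc_heap; // everything allocated during one iteration

    void* gc_alloc(size_t size)
    {
        void* p = std::malloc(size);
        gc_heap.push_back(p); // defer reclamation to the emulated GC
        return p;
    }

    void after()
    {
        // emulate GC: reclaim all memory at the end of the iteration
        for (size_t i = 0; i != gc_heap.size(); ++i)
            std::free(gc_heap[i]);
        gc_heap.clear();
    }

    void thread(unsigned /*thread_index*/)
    {
        // ... test body allocates via gc_alloc() instead of new/malloc ...
    }
};
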
Sounds not very cool... So I'm going to add built-in support for Java
and CLI. Then the user would have to define something like RL_JAVA_MODE/
RL_CLI_MODE, and get all those things out of the box. But yes, it
will still be a C++ library. What do you think?

Dmitriy V'jukov

Dmitriy V'jukov

Aug 5, 2008, 7:19:07 AM
to Relacy Race Detector
Raf_Schietekat wrote:

Sounds cool!

I think that it would take a bit more to model Java volatiles, because
they're also mutually sequentially consistent (right?), which on
typical hardware probably means that they have to be sequentially
consistent with everything (are there any exceptions?). So would you
be able to test the memory model as defined and not as mapped to the
hardware running the test?

Dmitriy V'jukov

Aug 5, 2008, 7:19:51 AM
to Relacy Race Detector
On Aug 5, 3:19 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> Raf_Schietekat wrote:
>
> Sounds cool!

Thank you ;)
Completely! The memory model is modelled entirely programmatically.
I'm running the tests on x86 hardware, but I'm able to test a program
against the relaxed C++0x memory model, i.e. no total store order,
relaxed interlocked operations and so on.

Btw, what do you mean by "mutually sequentially consistent"?
As I understand Java's volatiles:
- a volatile load is acquire
- a volatile store is release
- and between every volatile store and volatile load there must be a
#StoreLoad memory fence.
This basically means that "Peterson's mutex works w/o any additional
memory fences". But there is no requirement of total store order to
different locations.

So I can map the following Java program:

volatile int x;
x = 1;
int y = x;

to C++0x this way:

std::atomic<int> x;
x.store(1, std::memory_order_release);
std::atomic_thread_fence(std::memory_order_seq_cst);
int y = x.load(std::memory_order_acquire);

Right?

Dmitriy V'jukov

Dmitriy V'jukov

Aug 5, 2008, 7:20:38 AM
to Relacy Race Detector
Raf_Schietekat wrote:

I don't see how unit testing can go beyond the constraints of the
hardware on which it is running? Unless you replaced the basic
building blocks with a mockup that opens up the cracks, so to say, and
you are testing code that sits between your own building blocks
replacement and the testing framework, at the cost of considerably
reduced performance? What would be the relative merits between this
approach and something that tries to find flaws in a more systematic
way, by reasoning about what can go wrong and/or trying to provide
formal proof that the algorithm is correct?

For Java, I'm thinking of the following from
<http://www.cs.umd.edu/users/jmanson/java/journal.pdf>: "5.2.1
Operations on Volatiles are sequentially consistent." To achieve that,
you would need those heavy
#StoreLoad barriers, although a compiler can do a lot to soften the
impact on performance by eliminating redundant barriers and/or
relocating them, even though they're still scheduled statically. A
less sophisticated compiler, or a library of volatiles/atomics, may
have to apply barriers directly around the volatiles that effectively,
if not intentionally, make volatiles sequentially consistent with
everything, so the interlocks are costlier than intended. Luckily C++
atomics don't have the mutual sequential consistency to consider.

From my still limited understanding of the matter the mapping seems
strict enough (maybe a little too strict?), but in a slightly more
complex example it would probably be difficult to model the exact
constraints defined in the Java Language Specification, which probably
cannot be specified in a linear way even within a control block, and
probably require a graph instead, which can be flattened in more than
one way.

Does that make sense?

Dmitriy V'jukov

Aug 5, 2008, 7:22:15 AM
to Relacy Race Detector
On Aug 5, 3:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> Raf_Schietekat wrote:
>
> I don't see how unit testing can go beyond the constraints of the
> hardware on which it is running? Unless you replaced the basic
> building blocks with a mockup that opens up the cracks, so to say, and
> you are testing code that sits between your own building blocks
> replacement and the testing framework, at the cost of considerably
> reduced performance?


Yes, I replace the 'basic building blocks'. My std::atomic<> is a
special object which tracks 'happens-before' relations. And this is why
one has to use rl::var<> instead of plain variables. And actually
execution of a unit test is single-threaded, so the underlying hardware
memory model plays no role at all. I manually switch between different
"user threads".
As to performance: yes, it's reduced as compared to an "unchecked run".
But I don't see any value in high execution speed of a unit test when
the unit test doesn't actually test anything.
Execution speed is about 100,000 fully checked executions per second
(it depends heavily on the unit test). Most errors are found on
iterations 1-10. Very subtle errors are found on iterations 1-10,000.
So basically after 1 second of execution one has some confidence.
If you want to check *every possible* "interleaving" (it's possible
with the 'full search scheduler'), then, yes, execution speed plays a
significant role, because a test can run 1 minute, 1 hour, 1 day...


> What would be the relative merits between this
> approach and something that tries to find flaws in a more systematic
> way, by reasoning about what can go wrong and/or trying to provide
> formal proof that the algorithm is correct?



I've considered formal model verification too, and have concluded that
it's not suitable for testing real-life C/C++ code; it's more for
"scientists", so to say :)
And I'm an engineer, interested in testing real-life synchronization
algorithms written in C/C++, including bit manipulation, pointer
compression, access violations as part of the algorithm etc.
Also, with my tool one doesn't have to rewrite his algorithm in some
"scientific form", and can use his IDE/debugger of choice.


Dmitriy V'jukov

Dmitriy V'jukov

Aug 5, 2008, 7:23:01 AM
to Relacy Race Detector
On Aug 5, 3:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> Raf_Schietekat wrote:
>
> For Java, I'm thinking of the following from <http://www.cs.umd.edu/
> users/jmanson/java/journal.pdf>: "5.2.1 Operations on Volatiles are
> sequentially consistent." To achieve that, you would need those heavy
> #StoreLoad barriers, although a compiler can do a lot to soften the
> impact on performance by eliminating redundant barriers and/or
> relocating them, even though they're still scheduled statically. A
> less sophisticated compiler, or a library of volatiles/atomics, may
> have to apply barriers directly around the volatiles that effectively,
> if not intentionally, make volatiles sequentially consistent with
> everything, so the interlocks are costlier than intended. Luckily C++
> atomics don't have the mutual sequential consistency to consider.
>
> From my still limited understanding of the matter the mapping seems
> strict enough (maybe a little too strict?), but in a slightly more
> complex example it would probably be difficult to model the exact
> constraints defined in the Java Language Specification, which probably
> cannot be specified in a linear way even within a control block, and
> probably require a graph instead, which can be flattened in more than
> one way.


Yes, I agree that it's difficult to manually model Java's volatiles.
It's possible to place a #StoreLoad after every volatile store, for
example. But this is not a very good approximation. It's possible to
try to make some manual "optimization" wrt #StoreLoad placement...
But note that C#/CLI doesn't have such problems, because C#'s
volatiles are not sequentially consistent.
Now I am working on incorporating support for Java/C#. Wrt Java's
volatiles I'm going to use *dynamic* scheduling of #StoreLoad barriers.
Something like this:

struct thread
{
    bool store_load_pending;

    void volatile_store()
    {
        store_load_pending = true;
        ...
    }

    void store_load_fence()
    {
        // if the user manually executes a full fence between a volatile
        // store and a volatile load, then the run-time will not emit an
        // additional fence
        store_load_pending = false;
        ...
    }

    void volatile_load()
    {
        if (store_load_pending)
            store_load_fence();
        ...
    }
};

So I will try to execute the #StoreLoad as late as possible. I think
that this is "the worst possible conforming implementation". And
"worst" in the context of unit-testing is good, because it can reveal
more errors.


Dmitriy V'jukov

arch.r...@intel.com

Aug 5, 2008, 5:56:53 PM
to Relacy Race Detector
I just wanted to say "neat idea!".

We've used Spin (http://spinroot.com/spin/whatispin.html) for modeling
racy algorithms in the past. Its strong point is that it can search
a very large search space because it can apply algebraic techniques.
But as noted, writing the models in a separate language is a pain, and
the fidelity of the model is always an issue. In particular, Spin
does not model memory consistency.

At the other extreme is CHESS
(http://research.microsoft.com/projects/CHESS/), which runs real code,
but it's not clear to me if it models memory consistency issues.

Basing the Relacy checking on std::atomic interactions is a neat
idea. It sounds like a practical sweet spot for checkers.

- Arch Robison

Dmitriy V'jukov

Aug 5, 2008, 8:14:58 PM
to Relacy Race Detector
On Aug 6, 01:56, arch.robi...@intel.com wrote:
> I just wanted to say "neat idea!".

Thank you.

> We've used Spin (http://spinroot.com/spin/whatispin.html) for modeling
> racy algorithms in the past. It's strong point is that it can search
> a very large search space because it can apply algebraic techniques.
> But as noted, writing the models in a separate language is a pain, and
> the fidelity of the model is always an issue. In particular, Spin
> does not model memory consistency.
>
> At the other extreme is CHESS(http://research.microsoft.com/projects/
> CHESS/) which runs real code, but it's not clear to me if it models
> memory consistency issues.

Yes, I am aware of those tools.
The goal of my tool is testing of "real life" C/C++ algorithms (and I
hope Java/C# in the near future). I.e. one must be able to test *any*
algorithm he wrote. Including low bit stealing from pointers, pointer
compression, bit fields, C++ templates, dynamic memory allocation,
benign data races, relaxed atomic operations, stand-alone memory
fences and so on. And at the same time the tool must model execution as
precisely as possible, and catch as many different error types as
possible. Currently the tool precisely models a very relaxed memory
model, i.e. relaxed, acquire, release, acquire_release and
sequentially_consistent operations and memory fences. Also it models
such things as spurious failures in CAS (yes, that's permitted by
C++0x), and whether malloc() will return a recently freed block or a
brand new block (useful for ABA detection).
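
For example, under C++0x compare_exchange_weak may fail even when the
value matches, so correct code has to retry (a plain C++0x
illustration, not Relacy-specific; 'a' and 'increment' are just
illustrative names):

#include <atomic>

std::atomic<int> a(0);

void increment()
{
    int expected = a.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, hence the loop;
    // on failure 'expected' is refreshed with the observed value
    while (!a.compare_exchange_weak(expected, expected + 1))
    {
        // retry with the refreshed 'expected'
    }
}
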
Also I was trying to make it possible to test real production code.
It's possible with a thin macro layer. If you want I can elaborate on
this.
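
Something along these lines, assuming the same source is compiled
against Relacy under test and plain C++0x atomics in production (the
macro names here are illustrative, not Relacy's actual ones):

#ifdef RL_TEST
#  include <relacy/relacy_std.hpp>
#  define ATOMIC(T)   rl::atomic<T>
#  define LOAD(a)     (a)($).load(rl::memory_order_acquire)
#  define STORE(a, v) (a)($).store((v), rl::memory_order_release)
#else
#  include <atomic>
#  define ATOMIC(T)   std::atomic<T>
#  define LOAD(a)     (a).load(std::memory_order_acquire)
#  define STORE(a, v) (a).store((v), std::memory_order_release)
#endif

ATOMIC(int) g_flag;

void producer() { STORE(g_flag, 1); }
int  consumer() { return LOAD(g_flag); }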

As for Spin: for me it looks very... scholastic. Maybe it's useful for
protocol verification, but not for real-life synchronization
algorithms. Where is dynamic memory allocation? Where is the relaxed
memory model? And, as you said, what is Promela?
As for the large state space: Relacy Race Detector can search state
spaces of arbitrary size, because it's stateless (it takes around 624KB
of memory over the whole execution). But of course execution time is a
concern. I was trying to optimize the implementation as much as
possible; currently it goes at a speed of around 10^5-10^6 test
executions per second. And it includes a random scheduler, which makes
it possible to reveal 99% of errors within 1 second.

As for CHESS: it looks much closer to real life. Actually I took a few
ideas from CHESS. The first is the context-bound scheduler, which gives
a reasonable compromise between state space size and obtained
confidence. The second is the fair scheduler, which gives the ability
to test formally non-terminating algorithms.
What I don't like in CHESS is that it's imprecise. It doesn't capture
all memory references (which can be crucial). It doesn't model a
relaxed memory model. It doesn't model such things as spurious failures
in CAS. So basically I treat CHESS as a *program* verifier (with not
very precise checking), and Relacy as an *algorithm* verifier (with
precise checking). The strength of CHESS is that it allows execution of
unmodified code.

Btw, there are some attempts to incorporate a relaxed memory model into
CHESS, but currently they model only "store-buffer relaxations", and
they treat them in an interesting way: basically, if there exists some
execution which is possible with store buffers and not possible w/o
store buffers, it's an error. Why? Maybe it's reasonable for
"application level" code, but not for synchronization algorithms.
This is exactly the kind of error which would be reported as an error
by CHESS:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30254599.aspx
But you reasonably concluded that this is "intended" behavior.
See this for details:
http://research.microsoft.com/projects/CHESS/sober.pdf


> Basing the Relacy checking on std::atomic interactions is a neat
> idea. It sounds like a practical sweet spot for checkers.

C++0x's atomics seem to be the first "standardized" atomics library.
And C++0x clearly separates atomic variables (which are subject to
concurrent access) and usual variables (which are subject to single-
threaded access and race detection). I can't help comparing C++0x's
approach to Java's approach: in Java mode I just have to turn off data
race detection entirely. This is bad... for Java programmers.


Dmitriy V'jukov