Kicking $hit out of HotSpot


Dmitriy Vyukov

Dec 1, 2009, 7:30:26 AM
to Relacy Race Detector, lock...@googlegroups.com
This morning I read David Dice's blog post "A race in LockSupport park()
arising from weak memory models"
(http://blogs.sun.com/dave/entry/a_race_in_locksupport_park). The fact
that such day-one bugs stay uncaught in things like HotSpot for so long
must set one thinking.

Ok, let's see how Relacy Race Detector copes with this, i.e. what would
have happened if the HotSpot developers had used Relacy from day one.

Here is a simple test that models the parker class and a simple use case:

class parker
{
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    std::atomic<int> count; // 1 = a permit (unpark) is pending, 0 = no permit

public:
    parker()
    {
        pthread_mutex_init(&mtx, 0);
        pthread_cond_init(&cv, 0);
        count.store(0, std::memory_order_relaxed);
    }

    ~parker()
    {
        pthread_mutex_destroy(&mtx);
        pthread_cond_destroy(&cv);
    }

    void park()
    {
        // fast path: consume a pending permit without touching the mutex
        if (count.load(std::memory_order_relaxed))
        {
            count.store(0, std::memory_order_relaxed);
            return;
        }
        // park() is allowed to return spuriously, so just give up if the mutex is busy
        if (pthread_mutex_trylock(&mtx))
            return;

        // recheck the permit under the mutex
        if (count.load(std::memory_order_relaxed))
        {
            count.store(0, std::memory_order_relaxed);
            pthread_mutex_unlock(&mtx);
            return;
        }

        pthread_cond_wait(&cv, &mtx);
        count.store(0, std::memory_order_relaxed);

        pthread_mutex_unlock(&mtx);
    }

    void unpark()
    {
        pthread_mutex_lock(&mtx);
        int prev = count.load(std::memory_order_relaxed);
        count.store(1, std::memory_order_relaxed);
        pthread_mutex_unlock(&mtx);
        if (prev == 0)
            pthread_cond_signal(&cv);
    }
};

struct thread_param
{
    parker p;
    std::atomic<int> data; // the user state guarded by park/unpark
};

void* consumer_thread(void* ctx)
{
    thread_param* p = (thread_param*)ctx;
    while (p->data.load(std::memory_order_relaxed) == 0)
        p->p.park();
    return 0;
}

void* producer_thread(void* ctx)
{
    thread_param* p = (thread_param*)ctx;
    p->data.store(1, std::memory_order_relaxed);
    p->p.unpark();
    return 0;
}

void hotspot_test()
{
    thread_param p;
    p.data.store(0, std::memory_order_relaxed);

    pthread_t cons;
    pthread_create(&cons, 0, consumer_thread, &p);

    pthread_t prod;
    pthread_create(&prod, 0, producer_thread, &p);

    void* res;
    pthread_join(cons, &res);
    pthread_join(prod, &res);
}

int main()
{
    rl::test_params p;
    p.search_type = rl::sched_bound; // context-bound systematic search
    p.context_bound = 4;             // at most 4 preemptions per execution
    rl::execute<hotspot_test, 2>(p);
}
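
(For reference, the test assumes Relacy's header-only emulation of std::atomic and the pthread primitives; the exact include paths below are my assumption, not part of the original code:

// Assumed Relacy includes (header-only library, no linking required):
#include <relacy/relacy_std.hpp> // rl::test_params, rl::execute, std::atomic emulation
#include <relacy/pthread.h>      // pthread mutex/cond/thread emulation used above
)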


First of all, the test detected one more issue in the code; more on
this later. After fixing that issue, the output I got in the blink of
an eye was (it took Relacy less than 1 microsecond to crank out the
issue):

DEADLOCK (deadlock detected)
iteration: 66

thread 1:
[8] 1: <0035FF04> atomic store, value=1, (prev value=0), order=relaxed, in producer_thread, hotspot.cpp(84)
[9] 1: <0033B31C> mutex: exclusive lock, in parker::unpark, hotspot.cpp(58)
[10] 1: <0035FEE4> atomic load, value=0, order=relaxed, in parker::unpark, hotspot.cpp(59)
[11] 1: <0035FEE4> atomic store, value=1, (prev value=0), order=relaxed, in parker::unpark, hotspot.cpp(60)
[16] 1: <0033B31C> mutex: exclusive unlock, in parker::unpark, hotspot.cpp(61)
[17] 1: <0033DBC4> cond_var: notify one total_blocked=0 unblocked=0, in parker::unpark, hotspot.cpp(63)
[18] 1: [THREAD FINISHED], in rl::context_impl<struct rl::simulate_thunk<&void __cdecl hotspot_test(void),2>,class rl::context_bound_scheduler<3> >::fiber_proc_impl, context.hpp(396)

thread 2:
[7] 2: <0035FF04> atomic load, value=0, order=relaxed, in consumer_thread, hotspot.cpp(76)
[12] 2: <0035FEE4> atomic load, value=1, order=relaxed, in parker::park, hotspot.cpp(29)
[13] 2: <0035FEE4> atomic store, value=0, (prev value=1), order=relaxed, in parker::park, hotspot.cpp(31)
[14] 2: <0035FF04> atomic load, value=0 [NOT CURRENT], order=relaxed, in consumer_thread, hotspot.cpp(76)
[15] 2: <0035FEE4> atomic load, value=0, order=relaxed, in parker::park, hotspot.cpp(29)
[19] 2: <0033B31C> mutex: exclusive lock, in parker::park, hotspot.cpp(35)
[20] 2: <0035FEE4> atomic load, value=0, order=relaxed, in parker::park, hotspot.cpp(42)
[21] 2: <0033DBC4> cond_var: wait enter, in parker::park, hotspot.cpp(49)
[22] 2: <0033B31C> mutex: exclusive unlock, in parker::park, hotspot.cpp(49)
[23] 2: blocking current thread, in parker::park, hotspot.cpp(49)
[24] 2: DEADLOCK (deadlock detected), in parker::park, hotspot.cpp(49)

What do we see here? We see the exact scenario David described: the
consumer thread consumes the signal, but then loads a non-current value
of data (marked with "[NOT CURRENT]" in the history), so it goes to
wait again; but it will never be signaled again, so we get a deadlock.

What I did first was define the RL_FORCE_SEQ_CST macro, which basically
forces a sequentially consistent memory model (i.e. all atomic accesses
are done with seq_cst ordering). David said that the issue will not
come up on a sequentially consistent memory model. Let's verify! When I
ran with RL_FORCE_SEQ_CST defined, the test indeed passed successfully,
so now we are sure that the problem is with memory ordering.
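
(Roughly, that run amounts to defining the macro before the Relacy headers, with no other changes to the test; the exact include path is again an assumption:

// Sketch: force every atomic access to memory_order_seq_cst for this run
#define RL_FORCE_SEQ_CST 1
#include <relacy/relacy_std.hpp>
)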

Now let's try to fix it with the help of Relacy.
David suggested inserting memory fences after the stores to the counter
in parker::park():

void park()
{
    if (count.load(std::memory_order_relaxed))
    {
        count.store(0, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return;
    }
    if (pthread_mutex_trylock(&mtx))
        return;

    if (count.load(std::memory_order_relaxed))
    {
        count.store(0, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        pthread_mutex_unlock(&mtx);
        return;
    }

    pthread_cond_wait(&cv, &mtx);
    count.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    pthread_mutex_unlock(&mtx);
}

When I run the code I still get the same deadlock.
David also suggested inserting a memory barrier somewhere around
pthread_mutex_unlock() because it does not provide its own barrier
(actually unlock() always provides a release barrier, however it does
not provide the seq_cst barrier that I guess David had in mind). I am
not sure what exactly David meant, so I insert barriers both before and
after the unlock:

void unpark()
{
    pthread_mutex_lock(&mtx);
    int prev = count.load(std::memory_order_relaxed);
    count.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    pthread_mutex_unlock(&mtx);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (prev == 0)
        pthread_cond_signal(&cv);
}

When I run this code I still get the same deadlock. Oh gosh, now $hit
hit the fan for sure. Either Relacy is wrong or David is wrong. Make
your bets! :)
Let's go back to the roots - Dekker/Peterson synchronization
(http://en.wikipedia.org/wiki/Peterson%27s_algorithm). In Dekker-style
synchronization we need a memory barrier between the critical store and
load. What are the critical accesses in our parker? They are the
accesses to parker::count and thread_param::data (any user state). The
core of our synchronization is:

// consumer thread:
count = 0;
memory_barrier();
if (data == 0)
    wait();

// producer thread:
data = 1;
memory_barrier();
if (count == 0)
    notify();

Memory barriers between the critical stores and loads guarantee that
either the consumer will see the data, or the producer will call
notify(), or both (but NOT the combination where the consumer does not
see the data and the producer does not call notify() - which is what
results in the deadlock).
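
To make the reasoning concrete, here is a minimal standalone sketch of this store-buffering pattern as a Relacy test (this sketch is mine, not part of the original parker code; it uses Relacy's rl::test_suite/rl::simulate interface and illustrative member names):

struct dekker_test : rl::test_suite<dekker_test, 2>
{
    std::atomic<int> count;   // "permit" flag, initially available
    std::atomic<int> data;    // user state, initially not published
    int consumer_saw_data;
    int producer_saw_count;

    void before()
    {
        count.store(1, std::memory_order_relaxed);
        data.store(0, std::memory_order_relaxed);
        consumer_saw_data = 0;
        producer_saw_count = 1;
    }

    void thread(unsigned idx)
    {
        if (idx == 0) // consumer: consume the permit, then check for data
        {
            count.store(0, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            consumer_saw_data = data.load(std::memory_order_relaxed);
        }
        else          // producer: publish the data, then check the permit
        {
            data.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            producer_saw_count = count.load(std::memory_order_relaxed);
        }
    }

    void after()
    {
        // The deadlock corresponds to "consumer misses the data AND producer
        // still sees the permit (so skips notify)". With both fences in place
        // this combination is impossible; remove either fence and Relacy
        // finds it.
        RL_ASSERT(consumer_saw_data != 0 || producer_saw_count == 0);
    }
};

// run with: rl::simulate<dekker_test>();
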
Ok, now that we are clear on where exactly the barrier must go and why,
let's rewrite the code accordingly:

void park()
{
    if (count.load(std::memory_order_relaxed))
    {
        count.store(0, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return;
    }
    if (pthread_mutex_trylock(&mtx))
        return;

    if (count.load(std::memory_order_relaxed))
    {
        count.store(0, std::memory_order_relaxed);
        pthread_mutex_unlock(&mtx);
        return;
    }

    pthread_cond_wait(&cv, &mtx);
    count.store(0, std::memory_order_relaxed);

    pthread_mutex_unlock(&mtx);
}

void unpark()
{
    std::atomic_thread_fence(std::memory_order_seq_cst);
    pthread_mutex_lock(&mtx);
    int prev = count.load(std::memory_order_relaxed);
    count.store(1, std::memory_order_relaxed);
    pthread_mutex_unlock(&mtx);
    if (prev == 0)
        pthread_cond_signal(&cv);
}

Let's run it under Relacy; it says:

struct rl::simulate_thunk<&void __cdecl hotspot_test(void),2>
iterations: 856
total time: 1
throughput: 856000

Ta-da! The deadlock is gone!
Note that if the user data is represented as a Java volatile, then you
do not need the memory barrier in unpark(), because operations on Java
volatiles are in fact sequentially consistent on their own. However, I
do not see any single reason to insert a memory barrier around
pthread_mutex_unlock(), and Relacy agrees with me.

Now about the other issue that I mentioned in the beginning. When I ran
the code as-is, I got the following output:

struct rl::simulate_thunk<&void __cdecl hotspot_test(void),2>
LIVELOCK (livelock)
iteration: 3346

execution history:
[very lengthy and boring output goes here]

It's quite easy to figure out what is happening from the execution
history; Relacy points to the following scenario. Thread 1 calls
parker::unpark(), locks the mutex, and then gets descheduled by the OS
(count is not yet set). Now thread 2 calls parker::park(); it fails to
lock the mutex so it just returns, then loads a stale value of the user
state, so it calls parker::park() again, and so on. Thread 2 basically
does active spin-waiting, senselessly burning CPU time; this is called
a liveness violation. Assume that there are a lot of other threads and
thread 1 has lower priority than thread 2; in such a situation thread 2
can burn several seconds of CPU time.
To fix this I modified parker::park() as follows:

void park()
{
    if (count.load(std::memory_order_relaxed))
    {
        count.store(0, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return;
    }
    if (pthread_mutex_trylock(&mtx))
    {
        // the mutex is held by unpark(); if we keep getting here, let the OS
        // know that we are effectively spin-waiting
        if (get_too_frequently_here())
            pthread_yield();
        return;
    }

    if (count.load(std::memory_order_relaxed))
    {
        count.store(0, std::memory_order_relaxed);
        pthread_mutex_unlock(&mtx);
        return;
    }

    pthread_cond_wait(&cv, &mtx);
    count.store(0, std::memory_order_relaxed);

    pthread_mutex_unlock(&mtx);
}

Now we gracefully communicate to the OS that the current thread is
spin-waiting for something, and the issue is gone.
Note that the issue is highly unlikely to happen in real life, but
strictly speaking it is possible, because the consumer thread does not
execute any synchronization (it skips the memory barrier, and a failed
mutex try-lock does not have to provide any synchronization), so it may
keep using a stale value of the user state cached in a register.
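
(get_too_frequently_here() above is just a placeholder for some spin-throttling heuristic. One possible shape of it, where the counter, threshold and reset policy are illustrative and not from the original code, is:

class parker
{
    // ... members as before ...
    int trylock_failures; // touched only by the parking thread, so a plain int is fine

    bool get_too_frequently_here()
    {
        // yield on every 16th consecutive failed try-lock; reset the counter
        // to 0 whenever the mutex is successfully acquired in park()
        return ++trylock_failures % 16 == 0;
    }
};
)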

P.S. All this (excluding this report) took roughly 30 minutes, and was
done on the way from Medvedkovo station of the V. I. Lenin Moscow
Metropolitan Railway to Oktjabr'skaja on my single-core laptop (indeed,
with Relacy you can verify algorithms that will run on Itanium, PowerPC
and SPARC on a single-core x86 box). Are HotSpot developers so busy
that they do not have 30 minutes for verification of fundamental
components? Ok, relax, Relacy was not yet born when they wrote it;
however, I hope you get the message.

--
Dmitriy V'jukov