Writing a unittest against thread (un)safe ref-counter pointer

Henrik Goldman

unread,

Dec 26, 2007, 4:26:27 PM12/26/07

to

Recently I had a bug report on our software which lead to 14 days of
frustrations before the bug was discovered.
It turned out to be a thread (un)safe ref-counter pointer which would
eventually do a double-delete.

I've attached the BUGGY code in a condense format at the end of the post.

The bug which took so long to figure out is that two threads can be deleting
the object and this results in a double-delete:

m_p->SubRefCount();

if (m_p->IsRemovable())

{

delete m_p;

m_p = NULL;

}

If both threads pass get to IsRemovable and then both ask at the same time
then they will both get told that they are the only reference.
The problem is that SubRefCount() and IsRemovable is not synchronized as it
should be.

So as a programmer I wanted to reproduce this problem as a unit-test to
avoid these kind of situations in future.
However it turned out to be harder than imagined.
In a simplistic scenario I simply cannot reproduce it anymore.

Here is my attempt at a testcase written for Unittest++:

struct TestCounterThreadFixture

{

virtual ~TestCounterThreadFixture() {}

static void Callback(CThreadInformation *pParam)

{

CRefCountPtr<CCount> p = (CCount *) pParam->m_pParam;

while (!pParam->m_ShutdownCond.Wait());

}

};

TEST_FIXTURE(TestCounterThreadFixture, ConcurrencyTestCounter)

{

CThread t[3];

CRefCountPtr<CCount> pReal;

int i, j;

for (j = 0; j < 1000; j++)

{

pReal = new CCount;

for (i = 0; i < lengthof(t); i++)

{

t[i].SetCallback(TestCounterThreadFixture::Callback);

//

t[i].SetParam(pReal);

CHECK(t[i].Start());

}

while (pReal.GetRefCount() != lengthof(t) + 1)

Sleep(5);

// Make a race-condition against the last delete

pReal = NULL;

for (i = 0; i < lengthof(t); i++)

{

t[i].SetKillFlag();

}

As you can see I want the 3 threads to synchronize and then let them go in
such a way so they will trigger the double-delete.

It turned to be rather hard to make it do this.

Also as a proposed solution I replaced it with:

int SubRefCount()

{

CMutexGuard MtxGuard(&m_RefCountMutex);

m_nRefCount--;

return m_nRefCount;

}

if (m_p->SubRefCount() == 0)

{

delete m_p;

m_p = NULL;

}

However again I want to have the security to ensure that this actually does
fix the problem...

Any clues on how to make it fail?

Alternatively is there a resource of common tests that could be adopted?
Surely I cannot be the first to write such?

Thanks.

-- Henrik

class CRefCounter

{

public:

virtual ~CRefCounter() {}

CRefCounter() {m_nRefCount = 0;}

void AddRefCount() {CMutexGuard MtxGuard(&m_Mutex); m_nRefCount++; }

void SubRefCount() {CMutexGuard MtxGuard(&m_Mutex);m_nRefCount--;}

bool IsRemovable(){CMutexGuard MtxGuard(&m_Mutex);return m_nRefCount == 0;}

private:

int m_nRefCount;

CMutex m_Mutex;

};

This is the ref-counting class itself. It uses a mutual exclusion object
implemented by critical sections on Windows and pthread_mutex on unix.

template<class T>

class CRefCountPtr

{

public:

CRefCountPtr(T *p = NULL) {m_p = p; AddRefCount();}

CRefCountPtr(const CRefCountPtr &rhs) {m_p = rhs.m_p; AddRefCount(); }

virtual ~CRefCountPtr() { SubRefCount();}

CRefCountPtr &operator=(const CRefCountPtr &rhs) {return *this = rhs.m_p;}

CRefCountPtr &operator= (T *p)

{

if (m_p != p)

{

SubRefCount();

m_p = p;

AddRefCount();

}

return *this;

}

// Several pointer operators removed as they are irrelevant for the question

private:

void AddRefCount()

{

if (m_p != NULL)

{

m_p->AddRefCount();

}

void SubRefCount()

{

if (m_p != NULL)

{

m_p->SubRefCount();

if (m_p->IsRemovable())

{

delete m_p;

m_p = NULL;

}

T *m_p;

};

Chris Thomasson

unread,

Dec 26, 2007, 10:37:44 PM12/26/07

to

"Henrik Goldman" <henrik_...@mail.tele.dk> wrote in message
news:4772ca3a$0$2110$edfa...@dtext02.news.tele.dk...

> Recently I had a bug report on our software which lead to 14 days of
> frustrations before the bug was discovered.
> It turned out to be a thread (un)safe ref-counter pointer which would
> eventually do a double-delete.
>
> I've attached the BUGGY code in a condense format at the end of the post.

[...]

I need to read your post in greater detail in order to give a good answer.
Generally, developing accurate tests for shared-memory threads can be
complicated, and unpredictable. Its kind of like herding race-conditions
instead of cats. You need to work out the algorithm on paper to make sure
its correct before you implement it. Its good to keep in mind that a
race-condition in the algorithm leads to problems in every single
implementation; this is not true the other way around.

You can reproduce errors in threading by making clever use of the debugger.
You can start the debugger; pause all threads except for one; step the only
active thread into position; artificially set the shared state that can
produce the condition; and carefully step the active thread through the
condition.

For instance, you can reproduce the race-condition that manifests itself
when one uses a reference counting algorithm that only provides normal
thread-safety in a situation that demands strong thread-safety on a 100%
consistent basis by artificially setting up the state, and working a thread
through it.

Chris Thomasson

unread,

Dec 27, 2007, 1:28:46 AM12/27/07

to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:282dnfiJFdC8ge7a...@comcast.com...

> "Henrik Goldman" <henrik_...@mail.tele.dk> wrote in message
> news:4772ca3a$0$2110$edfa...@dtext02.news.tele.dk...
>> Recently I had a bug report on our software which lead to 14 days of
>> frustrations before the bug was discovered.
>> It turned out to be a thread (un)safe ref-counter pointer which would
>> eventually do a double-delete.
>>
>> I've attached the BUGGY code in a condense format at the end of the post.
> [...]

[...]

> You can reproduce errors in threading by making clever use of the
> debugger. You can start the debugger; pause all threads except for one;
> step the only active thread into position; artificially set the shared
> state that can produce the condition; and carefully step the active thread
> through the condition.

You can simulate other threads concurrent actions in the debugger by
artificially plugging in their mutations to the state intermittently during
the code stepping. Since your using a single thread to walk the code, you
can simulate _atomic_ multi-word mutations in between each stepped
instruction if you wish. Its a pain in the neck to do, but it works for
sure.

Dmitriy Vyukov

unread,

Dec 27, 2007, 5:32:35 AM12/27/07

to

On 27 дек, 00:26, "Henrik Goldman" <henrik_gold...@mail.tele.dk>
wrote:

> m_p->SubRefCount();

//
---------------------------------------------------------------------------------
if (0 == (rand() % 100))
pthread_yield();
// or SwitchToThread();
//
---------------------------------------------------------------------------------

> if (m_p->IsRemovable())
>
> {
>
> delete m_p;
>
> m_p = NULL;
>
> }

Dmitriy V'jukov

Marcel Müller

unread,

Dec 27, 2007, 7:28:28 AM12/27/07

to

Hi!

Henrik Goldman schrieb:

> Recently I had a bug report on our software which lead to 14 days of
> frustrations before the bug was discovered.
> It turned out to be a thread (un)safe ref-counter pointer which would
> eventually do a double-delete.

> So as a programmer I wanted to reproduce this problem as a unit-test to

> avoid these kind of situations in future.
> However it turned out to be harder than imagined.
> In a simplistic scenario I simply cannot reproduce it anymore.

There is no way to create test cases that are guaranteed to fail on
undefined behaviour. It's in the nature of undefined behaviour that it
may /not/ fail too.

However, as Dmitriy said, there are mostly ways to increase the
probability of the fail case. But as long as they are intrusive (as in
the example), they won't help you much.

In this case creating some tenths of threads which randomly create, copy
and discard references in a common array of references for a while will
maybe do the job. If your memory is clean afterwards and the program did
not crash it is likely that your reference counter is clean.

In your case two thread switches have to occur close together, because
once m_p is NULL the second delete turns into a no-op. So more helpfull
than a test case is maybe a debug build that checks m_p for to be not
NULL before the delete (if this is a valid assumption in your case).

In fact, I would not create a test case for exactly this problem at all
now, because it is really unlikely that exactly this happens again.
I have not found anything better than experience and much care for now
to avoid synchronization issues in core libraries of mutli-threaded
applications. Test cases like the above can contribute, but they will
only check for some faults and are expensive to build. Another method is
a code review of a second (experienced) programmer that is only
marginally involved in the project. I find almost everytime some
potential synchronization issues in code at such reviews. Even in code
that is productive since years.
A clean code desing that decouples synchronization and resource
management as far as possible form the application logic is helpful too.
If the critical parts are in a view small but heavy used classes, bugs
are more likely to be found soon. Using tested implementations like the
boost libraries are good advises too. Documenting the thread-safety of
each class (and sometimes method) is required too. Debug builds might
use assertions to check for some of these requirements.

Marcel

Scott Gifford

unread,

Dec 27, 2007, 10:47:52 AM12/27/07

to

"Henrik Goldman" <henrik_...@mail.tele.dk> writes:

[...]

> a thread (un)safe ref-counter pointer which would eventually do a
> double-delete.

[...]

> So as a programmer I wanted to reproduce this problem as a unit-test
> to avoid these kind of situations in future.

A few things I have done for situations like this:

* Add a method to assert that everything is in a valid state, then
run the methods you think might race a few million times with
frequent checks for valid state. Often they will eventually
expose a problem. When they do, make sure the test prints out
enough information to figure out what went wrong.

* The problem is sometimes the scheduler will always follow the
exact same sequence. As somebody else suggested, you can add to
this technique some random sleeps/yields at points where you see
the possibility of a race.

* You may also be able to try different schedulers by setting
options within your thread library, or adjusting thread
priorities. And you can often use a different scheduler
altogether by testing on a different OS. Different scheduling
might expose races differently.

* IBM had an interesting concurrency testing library for Java called
ConTest. It was a custom thread scheduler that tried to schedule
things in a pessimal way to increase the chance of exposing races,
and also had some randomness so running the tests for a few hours
would exercise many different possibilities. I'm not sure if
something like that exists for C++, but I found it very useful in
Java.

* I have heard good things about Intel's Thread Checker, but mostly
from Intel and I haven't tried it yet myself. It might be worth a
look.

Hope this helps,

----Scott.

Joe Seigh

unread,

Dec 27, 2007, 5:54:17 PM12/27/07

to

Henrik Goldman wrote:
> Recently I had a bug report on our software which lead to 14 days of
> frustrations before the bug was discovered.
> It turned out to be a thread (un)safe ref-counter pointer which would
> eventually do a double-delete.
>
> I've attached the BUGGY code in a condense format at the end of the post.
>
> The bug which took so long to figure out is that two threads can be deleting
> the object and this results in a double-delete:
>
> m_p->SubRefCount();
>
> if (m_p->IsRemovable())
>
> {
>
> delete m_p;
>
> m_p = NULL;
>
> }
>
> If both threads pass get to IsRemovable and then both ask at the same time
> then they will both get told that they are the only reference.
> The problem is that SubRefCount() and IsRemovable is not synchronized as it
> should be.
>
> So as a programmer I wanted to reproduce this problem as a unit-test to
> avoid these kind of situations in future.
> However it turned out to be harder than imagined.
> In a simplistic scenario I simply cannot reproduce it anymore.
>

There exist implementations of thread-safe reference counting that work.
The one that you posted here isn't anywhere near workable. I'd advise
studying or, better yet, using the ones that already work.

One of the ways I test smart pointers is to just do a lot of concurrent
adding new object references and dropping them. You keep count of object
creation and deletion in the ctor and dtor respectively. When your testcase
finishes, the counts should match. Make sure you know the difference between
thread-safe smart pointers and atomically thread-safe smart pointers.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.